Parts of this article were generated with the help of artificial intelligence
1. Background
Since the public launch of the generative AI tool ChatGPT at the end of November 2022, educational institutions all over the world have been trying to understand how such technology will impact their teaching, learning, and overall assessment strategies. While capable generative AI tools have been publicly available for several years, ChatGPT has attracted massive media attention because of its ability to perform a wide range of tasks through a simple chat interface. The availability of such advanced technology opens up new areas of opportunity but also poses some key challenges, in particular when it comes to assessment integrity. The following is our evaluation of the current situation and how Inspera Assessment can help you ensure assessment integrity, both short and long term.
What is generative AI?
At its core, generative artificial intelligence (AI) is a term for data models that are trained to replicate the patterns and structures present in their training data in order to produce new, high-quality derivative data that resembles the original. Think of a digital writer with unlimited capacity that reviews all texts available online, tries to write new texts based on the existing ones, and then receives feedback on whether it did a good job. Repeat this a billion times until a human-like text creation ability is achieved.
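To make the pattern-learning idea concrete, here is a deliberately tiny sketch in Python of a word-level Markov chain: it "learns" which word tends to follow which in a training text and then generates new text from those statistics. This is not how large models such as ChatGPT work internally (they use neural networks trained on vastly more data), but it illustrates the principle of learning patterns from existing text and producing new text from them.

```python
import random
from collections import defaultdict

# A tiny word-level Markov chain: learn which word follows which,
# then sample new text from those learned transition statistics.
training_text = (
    "generative models learn patterns from text and "
    "generative models produce new text from learned patterns"
)

transitions = defaultdict(list)
words = training_text.split()
for current_word, next_word in zip(words, words[1:]):
    transitions[current_word].append(next_word)

def generate(start_word, length=10):
    """Generate text by repeatedly sampling a plausible next word."""
    output = [start_word]
    for _ in range(length):
        candidates = transitions.get(output[-1])
        if not candidates:  # dead end: no observed continuation
            break
        output.append(random.choice(candidates))
    return " ".join(output)

print(generate("generative"))
```

The "feedback" step described above corresponds, very loosely, to adjusting the model so that outputs people rate highly become more likely; the sketch omits that entirely.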
It is important to note that there are also numerous other generative AI tools beyond ChatGPT, for example:
- Create realistic images and art from a description in natural language with Dall-E 2, Midjourney, Stable Diffusion
- Create content with MarkCopyAI, Jasper
- Create your own music with AmperAI
- If you write code, GitHub Copilot can likely speed up your coding drastically
What has happened with assessment integrity in recent months
In recent months there have been many reports, across universities, of AI-generated exam submissions that passed and even received top marks, with even the most seasoned academics struggling to differentiate AI responses from human ones. Examples include ChatGPT passing the bar exam or scoring a perfect score on some of the questions in a business class exam experiment at Wharton. To make detection even harder, these AI tools do a fairly good job of not plagiarising, making traditional plagiarism detection less valuable.
Some universities have prohibited the use of generative AI tools, likening their use to misconduct such as collusion or plagiarism, while others compare their arrival to the introduction of the calculator. The challenge is to distinguish between assessments of skills that would normally be supported by such tools in professional practice and those that require independent human performance.
2. Options for our universities to ensure assessment integrity
There are many strategies that can be chosen to ensure assessment integrity in response to generative AI, but only some of them, in our view, actually improve integrity rather than merely offering false confidence.
The strategic options described below in sections 2.2 and 2.3 are those Inspera currently supports out of the box. Option 2.4 is a short-term solution, while option 2.5 presents a long-term strategy that we are developing together with our partners.
The right solution for you depends on your teaching, learning, and assessment strategy and context. As with all digital transformation, the answer comes from students, faculty, learning technologists, learning and teaching specialists, and administrators collaborating to design authentic, reliable, and valid assessments at your institution.
2.1 Use the ChatGPT detection tool with your open-book assessments
You could use OpenAI’s own AI-content detection tool or other detection tools, with caution.
OpenAI is very clear and open about the fact that detection is unreliable. Their own classifier correctly identifies only 26% of AI-written text (true positives), while incorrectly flagging human-written text as AI-written in at least 9% of cases (false positives).
https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text/
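To see why those error rates matter in practice, here is a small illustrative calculation. The 26% and 9% figures are from OpenAI; the cohort size and the share of AI-written submissions are assumptions made purely for the sake of the example.

```python
# Illustrative numbers only: cohort size and AI-written share are assumptions.
total_submissions = 1000
ai_written = 100                      # assume 10% of submissions are AI-written
human_written = total_submissions - ai_written

true_positive_rate = 0.26             # OpenAI's published detection rate
false_positive_rate = 0.09            # human text wrongly flagged as AI

flagged_ai = ai_written * true_positive_rate         # 26 correctly flagged
flagged_human = human_written * false_positive_rate  # 81 wrongly flagged
precision = flagged_ai / (flagged_ai + flagged_human)

print(f"Flagged submissions: {flagged_ai + flagged_human:.0f}")
print(f"Share of flags that are actually AI-written: {precision:.0%}")
# Roughly 107 submissions are flagged, but only about 24% of them are
# genuinely AI-written: most flags point at students who wrote their own work.
```

Under these assumed numbers, three out of four flagged students would have written their own work, which is why we caution against relying on detection alone.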
We believe AI content detection tools are not a viable long-term strategy. In the coming months, actors large and small will continue to release features for detecting AI-generated content. This will trigger an arms race in which generative AI is trained specifically to fool AI-detecting AI. It is a losing game: it will always be easier to generate human-like content than to prove that content is AI-generated.
2.2 Design your Open-Book Assessments to assess the desired skills
Open-book assessments can be designed to test higher-order thinking skills, such as analysis, synthesis, and evaluation, rather than just recalling factual information, which will reduce the impact of generative AI on assessment integrity.
An effective open-book assessment has to be written with the knowledge that students have access to a rich variety of resources. Text-based generative AI can reason and write compelling arguments. It handles tasks such as summarising a text, expanding on a text, or writing about a text very well. At present, ChatGPT is not connected to the internet, so it only “knows” information that was available when it was trained; at the time of writing, that means it knows nothing beyond 2021. Generative AI is therefore more likely to give factually wrong answers than to produce incoherent text or poor reasoning.
| Tasks | Performance* | Examples | Comments |
| --- | --- | --- | --- |
| Text summary | Very good | Language test asked to answer questions on or summarise a text | This is the primary use case of generative AI. |
| Text generation on qualitative subjects | Very good | Write an essay on a topic, or, given some text, write an essay arguing for or against it | This is the primary use case of generative AI. The more open-ended an assignment, the better it is likely to do. |
| Art or visual design | Very good | Create drawn images, photography, icons and styles | This is the primary use case of generative AI. It is quick to create new content that can be indistinguishable from the real thing. |
| Code generation | Good | Code assignments; maths markup language (LaTeX) | Code is likely to run but may not do exactly what was asked. Can be used to generate the LaTeX code required in many STEM subjects. |
| Text generation on quantitative subjects | Medium | STEM assignments | While the output is likely to be convincing, it may struggle with reasoning on very subject-specific matters (e.g. chemistry or physics calculations) or get terms wrong. Because the content reads convincingly, errors can be hard to spot for a non-expert. |
| Mathematical operations in text | Poor | Physics and maths assignments | Typically fails to apply physical units correctly (e.g. struggles to determine 1 cm + 1 m) and to perform concrete mathematical operations such as arithmetic or summing numbers given in a text (a worked example follows the table). |
* As of the time of writing. OpenAI is explicitly focusing on improving the handling of quantitative reasoning, and we expect to see significantly improved performance in this area in the next 3-12 months.
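To make the last row of the table concrete, here is a minimal Python sketch of the kind of deterministic calculation that language models often get wrong: adding quantities with different units and summing numbers scattered through a text. The example values are illustrative only.

```python
import re

# Unit-aware addition: convert everything to metres before adding.
UNIT_TO_METRES = {"mm": 0.001, "cm": 0.01, "m": 1.0, "km": 1000.0}

def add_lengths(*quantities):
    """Add length quantities given as (value, unit) pairs; result is in metres."""
    return sum(value * UNIT_TO_METRES[unit] for value, unit in quantities)

print(add_lengths((1, "cm"), (1, "m")))  # 1.01 metres, i.e. 101 cm

# Summing numbers mentioned in running text: a deterministic task that a
# generative model may answer plausibly but incorrectly.
text = "The lab used 3 samples on Monday, 12 on Tuesday and 7 on Wednesday."
numbers = [int(n) for n in re.findall(r"\d+", text)]
print(sum(numbers))  # 22
```

A generative language model, by contrast, predicts plausible-sounding text rather than executing calculations like these, which is why its unit handling and arithmetic can be unreliable even when the surrounding prose is convincing.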
An article recently published by Times Higher Education on the Do’s and Don’ts of Open-Book Exams explains that open-book assessments are more suitable for developing and testing the higher levels of Bloom’s taxonomy, such as analysing and evaluating, and are associated with lower levels of student anxiety.
A further consideration, however, is the authenticity of an assessment. If it can be answered using generative AI, then is it a question worth asking a student? Should the assessment instead be to find the best-generated answer? Or is it an indication that outside of this assessment, a student would be unlikely to have to recall such information without recourse to any resource, AI or otherwise? If factual knowledge needs to be tested, then perhaps it’s an indication of an open-book exam being an unsuitable vehicle.
2.3 Add a locked browser and remote proctoring to your open-book assessments
Covid-19 forced many institutions to use locked browsers and remote proctoring services to protect the integrity of their existing assessments.
The purpose of a locked browser is to lock down the device the student is using to sit the exam so that only permitted resources can be accessed. Remote proctoring services can also be used to ensure that the person taking the assessment remotely is not impersonating the actual student and that there is no external support.
Inspera Assessment, together with Inspera Integrity Browser and Inspera Proctoring (our remote proctoring service), allows universities to conduct open-book assessments with the integrity safeguards of closed-book assessments. By enabling the open security mode, students’ screens, audio, and video can be captured while students still have access to the same resources they would for any open-book assessment.
2.4 Combine open-book assessments with an oral exam
Although the previous approach, an open security locked environment combined with physical invigilation or virtual proctoring, is a solution Inspera can deliver today, some institutions remain concerned that students may view proctoring as an invasion of their privacy.
A potential solution that will likely resonate better with these institutions is an open-book assessment combined with an oral exam. Oral exams have long been a way of allowing students to demonstrate comprehension and critical thinking in real time, allowing an examiner to distinguish between superficial and in-depth learning.
The threat of generative AI can be mitigated by combining an open-book assessment with an oral exam. Students would take their assessment in two parts: first under open-book conditions, with the safeguard of screen recording only. At the conclusion of the open-book portion, the student would be connected to their examiner, a subject matter expert, who conducts an oral exam to further validate the student’s critical thinking and in-depth learning.
This is possible with Inspera because, unlike other vendors, we don’t use third-party proctors. We place you in control of who your proctor is, which means your own subject matter experts can conduct the oral exam.
We are working on Assessment Paths (currently in closed beta), which allow universities to set up a series of assessments to be taken individually but consecutively. Each assessment can have its own duration, and you are free to decide how it is configured at the delivery stage. Subject matter experts are allocated to live oral assessment sessions where they can see the student’s open-book assessment, which helps to drive the conversation and enables them to mark the oral examination immediately.
2.5 Incorporate Authorship Verification into your Open-Book Assessments
An originality product should not be concerned about whether an essay was generated by AI – it should be concerned about whether the student wrote their essay themselves. ChatGPT detectors do not address other forms of ghostwriting. There is no tool to detect a student’s older cousin writing their essay for them or if an essay was procured from a professional essay mill. Inspera believes the solution to this problem is Authorship Verification (AV).
Authorship Verification uses GPTs to verify that a student authored their own content. With as little as 200 authentic words from a student, an authorship verification model can confidently assess whether a new essay was written by that same student. New innovations in these models also allow us to extract information about the critical features behind each assessment, and being able to explain the results is key when starting a conversation with a student about a possible case of academic misconduct.
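Inspera’s AV models are not described here in detail, so the following is only a minimal sketch of the general idea behind stylometric authorship comparison, not the product’s actual method. It compares character n-gram profiles of a student’s known writing against a new essay using scikit-learn; the feature choice and the similarity threshold are illustrative assumptions.

```python
# A toy illustration of stylometric authorship comparison.
# This is NOT Inspera's Authorship Verification model; the features and
# threshold below are illustrative assumptions only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

known_writing = "Text the student is known to have written (e.g. ~200 words) ..."
new_essay = "The newly submitted essay whose authorship we want to check ..."

# Character n-grams capture writing style (spelling habits, punctuation,
# function words) rather than topic, which authorship analysis relies on.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
vectors = vectorizer.fit_transform([known_writing, new_essay])

similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"Stylistic similarity: {similarity:.2f}")

# Illustrative threshold; a real system would calibrate against many writing
# samples and report a confidence score rather than a hard yes/no.
THRESHOLD = 0.5
print("Consistent with the same author" if similarity >= THRESHOLD
      else "Flag for further review")
```

The key point the sketch illustrates is that the comparison is the student versus their own prior writing, not AI versus human, which is what makes the approach robust to any form of ghostwriting.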
Inspera is confident that Authorship Verification is a long-term solution to any form of ghostwriting, and we are actively researching this as a future solution.
3. Recommendations for the way forward
Generative AI is here to stay. From the pocket calculator to internet-enabled mobile phones, technological advances act as enablers and disruptors. Pedagogically sound assessment design has withstood such challenges and will continue to do so. We are, however, on the cusp of one of the biggest challenges it has faced. This comes at a time when institutions are considering their broader assessment strategy, including the crucial aspects of validity and authenticity. In this evolution, generative AI is just one factor, albeit a sizeable one.
Designing open-book assessments to be genuinely open-book, so that they are not susceptible to academic misconduct or AI, is a way to address the problem now. So is the greater use of locked browsers to replicate traditional exam hall conditions, with the considerable benefits of digital. To extend the metaphor, the exam hall invigilator can also be replaced by a remote counterpart using Inspera Proctoring when there is no physical invigilation. Inspera Assessment, with its siblings Inspera Integrity Browser and Inspera Proctoring, can be used to solve the problem right now.
Looking to the near future, building on forthcoming functionality to combine open-book assessments with an oral exam is a direction of travel we believe many institutions will want to explore for the medium term, as it leverages the power of digital assessment to operationalise what is otherwise administratively complex in an analogue world. Instead of placing students in an exam hall or several rooms, quarantining each, and then escorting them to another room for their oral exam with an examiner, students can take both parts of their assessment from any convenient space. This saves a considerable amount on rooms, people, and logistics.
Authorship verification is another promising approach, one that addresses not only generative AI but also concerns about contract cheating.
At the same time, we should acknowledge that generative AI can benefit institutions by generating questions and scenarios, taking the heavy lifting of otherwise repetitive tasks away from educators and giving them more time to craft feedback, teach, and engage their students in critical evaluation in their chosen field: the areas where they provide the most significant value.
In the long term, the way we assess understanding and reasoning may change. Essays, for example, may no longer be the primary method of assessment. Instead, new skills requiring training in the use of generative AI for text, image, video, audio, and perhaps other forms of media may become necessary.
Additionally, we may see a new revolution in interfaces, similar to the way the smartphone’s capacitive touchscreen interface revolutionised the way we interact with technology. The next revolution could be the transition into natural language interfaces (verbal, visual, or written), making it even more effortless to communicate with and interact with technology.
It may be that much of the reasoning and communication we consider universal skills today can be facilitated by, or completely outsourced to, AI within a 10-year timespan. As such, it may become more important to have the skills to communicate with and utilise AI effectively. As AI becomes more advanced, it may become increasingly necessary to leverage it in almost all aspects of our lives. Where AI can do a job as well as, or better than, a human, we have to expect the value of that skill to plummet.
In conclusion, while it is hard to say what the future holds, it is clear that AI will play an increasingly important role in our lives, and learning to effectively use it may become a key skill for success in the future.