GenAI and Assessment: The End of the Essay Exam or Its Renaissance?
Generative AI has rendered traditional assessment obsolete almost overnight: the evidence suggests AI-generated work is already indistinguishable from student work under most rubrics. The real question is not how to detect AI use but how to redesign assessment for a world where AI is ubiquitous.
By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.
In the spring of 2023, universities worldwide scrambled to ban ChatGPT. By fall 2024, many of those same institutions had reversed course, integrating generative AI into their curricula. By 2025, the question has shifted entirely: from "How do we prevent students from using AI?" to "How do we assess learning in a world where AI is an ambient capability?"
This shift is not a capitulation. It reflects a dawning recognition that the assessment practices most threatened by generative AI (take-home essays, literature reviews, code-from-scratch assignments, and unsupervised online exams) were already pedagogically suspect before AI arrived. They assessed product rather than process, rewarded recall rather than reasoning, and measured output rather than understanding. Generative AI did not break assessment. It exposed fractures that were already there.
The Empirical Landscape
Kofinas, Tsay, and Pike (2025) provide a rigorous empirical assessment of AI's impact on authentic assessments in higher education. Working across two UK-based universities, they submitted AI-generated responses to the same assessments that students completed, without informing the marking teams. The results are striking:
- AI-generated work received grades comparable to student submissions across multiple assessment types.
- Markers, in general, were not able to distinguish assessments that had GenAI input from assessments that did not, even on tasks designed as "authentic" assessments resistant to automation.
The implication is worth noting: assessments that cannot distinguish AI output from student output are not assessing what they claim to assess. If an AI can produce a "good" case study analysis without understanding the case, then the assessment measures writing quality and surface-level analytical structure, not genuine business understanding.
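One way to make the "cannot distinguish" claim precise is statistical: markers' rate of correctly flagging AI-input work should be tested against chance. A minimal sketch of that reasoning with hypothetical numbers (stdlib Python only; not the paper's actual data or analysis):

```python
from math import comb

def binomial_p_two_sided(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value: probability of any outcome
    no more likely than observing k successes in n trials."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    threshold = pmf[k]
    # Sum the probabilities of all outcomes at most as likely as the observed one.
    return sum(q for q in pmf if q <= threshold + 1e-12)

# Hypothetical example: markers correctly flag 16 of 30 submissions.
# 16/30 ~ 53% gives p ~ 0.86, i.e. indistinguishable from coin-flipping,
# which is what "markers could not tell AI work apart" looks like in data.
p_val = binomial_p_two_sided(16, 30)
print(round(p_val, 3))
```

The point of the sketch is that "could not distinguish" is a testable null hypothesis, not just an impression: only identification rates well above 50% would count as evidence of genuine discrimination.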
Usher (2025) extends this analysis by comparing three feedback sources (instructor, peer, and AI) in a study of 76 undergraduate students. This study has become a reference point for assessment redesign. The findings indicate that:
- AI chatbots consistently assigned higher grades than human assessors, suggesting a leniency bias.
- AI chatbot feedback generally provided higher-quality feedback compared to peers, offering detailed insights and specific guidance for improvement, though it occasionally included irrelevant or contradictory information.
- However, peer feedback was more personalized and context-sensitive than chatbot feedback.
- The findings highlight the importance of human judgment, suggesting that integrating chatbot-based assessments with traditional methods can leverage their complementary strengths.
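The leniency-bias finding is, at bottom, a paired comparison of grades assigned to the same work. A minimal sketch of that computation (hypothetical scores for illustration, stdlib Python only; not Usher's actual data or method):

```python
from statistics import mean

# Hypothetical grades for the same five submissions from each source.
grades = {
    "instructor": [72, 65, 80, 58, 70],
    "peer":       [74, 66, 78, 60, 72],
    "ai_chatbot": [81, 74, 88, 70, 79],
}

baseline = grades["instructor"]

# Mean signed difference from the instructor baseline: a positive value
# indicates systematically higher grading, i.e. a leniency bias.
for source in ("peer", "ai_chatbot"):
    bias = mean(g - b for g, b in zip(grades[source], baseline))
    print(f"{source}: mean bias {bias:+.1f} points vs instructor")
```

Because the comparison is paired (same submissions, different graders), a consistent positive offset reflects the grader, not the work, which is why a chatbot that scores every script a few points high can still look "consistent" while being miscalibrated.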
Framework for AI-Era Assessment
Ilieva, Yankova, and Ruseva (2025) propose a comprehensive redesign framework. Their three-branch, multi-level model is structured around the responsibilities of three key stakeholders:
Branch 1: Instructors. Teaching staff design adaptive, AI-informed assessment tasks and provide feedback that accounts for AI capabilities. This means crafting assessments where AI tools can be used transparently and where the assessment criteria evaluate higher-order thinking rather than output generation.
Branch 2: Students. Learners engage with AI tools transparently, with clear guidelines on acceptable use and expectations for demonstrating their own understanding alongside AI-assisted work.
Branch 3: Control Authorities. Institutional bodies ensure accountability through compliance standards, policies, and audits, creating the governance infrastructure that makes AI-integrated assessment trustworthy.
The framework's strength is its holistic approach: rather than treating AI in assessment as purely an instructor problem or a policing problem, it distributes responsibility across the entire educational ecosystem. This suggests that the most promising assessment approaches in the AI era may combine elements of resistance (tasks AI cannot complete, such as oral examinations and live problem-solving), integration (transparent AI use with metacognitive reflection), and transformation (assessment that evaluates students' ability to critically evaluate AI output rather than produce original content).
The SOUR Exam Crisis
Newton and Draper (2025) document a concerning development: the extensive use of Summative Online Unsupervised Remote (SOUR) examinations in UK higher education. Using Freedom of Information requests across UK universities, they find that SOUR exams remain widely used as a significant assessment component, despite mounting evidence of high levels of cheating, and that generative AI has made detection increasingly difficult.
The paper identifies a governance failure: university quality assurance committees approved SOUR exams during the COVID emergency as temporary measures, but institutional inertia, cost savings, and student preference have made them permanent. Quality assurance frameworks that were designed to evaluate in-person assessment have not adapted to evaluate the integrity of unsupervised remote assessment, creating what Newton and Draper call an "integrity vacuum."
Claims and Evidence
| Claim | Evidence | Verdict |
|---|---|---|
| AI-generated work is indistinguishable from student work in standard assessments | Kofinas et al. (2025): markers generally unable to distinguish AI-input from non-AI assessments | ✅ Supported |
| AI feedback is comparable to instructor feedback in accuracy | Usher (2025): AI chatbots assign higher grades than human assessors; consistency differs from accuracy | ⚠️ Uncertain |
| AI-proof assessments are scalable | Ilieva et al. (2025): viable but require examiner time proportional to cohort size | ❌ Refuted |
| AI-transparent assessment develops higher-order skills | Theoretical argument supported by pilot studies; no large-scale RCT | ⚠️ Uncertain |
| SOUR exams maintain academic integrity | Newton & Draper (2025): high cheating levels; AI makes detection increasingly difficult | ❌ Refuted |
Open Questions
Is AI detection a dead end? Current AI detection tools achieve 70–85% accuracy with high false positive rates. As language models improve, the accuracy gap will widen. Should universities abandon detection entirely and focus exclusively on assessment redesign?
What assessment skills become more valuable in the AI era? If AI can produce competent first drafts, the premium shifts to evaluation, synthesis, judgment, and the ability to identify what is missing: precisely the skills that Bloom's taxonomy places at its apex.
How do we assess process when only product is visible? AI-transparent assessment requires insight into how students interact with AI, but current LLMs do not provide auditable interaction logs in a standard format. Should assessment platforms mandate interaction logging?
What happens to students who lack AI access? If assessment assumes AI use, students without reliable internet, current devices, or paid API access are disadvantaged. AI-integrated assessment may create a new digital divide.
Can we measure learning that AI cannot replicate? Embodied knowledge, ethical judgment, relational skills, creative vision: these may be the assessment frontier. But they are also the hardest to assess reliably.
Implications
The evidence points to an unavoidable conclusion: the traditional assessment toolkit of higher education (essays, exams, reports) is no longer fit for purpose in a world where AI can produce competent versions of all of them. This is not a temporary disruption; it is a permanent shift in the epistemological foundations of assessment.
The institutions that will thrive are those that treat this moment not as a threat but as an invitation to redesign assessment around what humans do that AI cannot: exercise judgment under genuine uncertainty, integrate knowledge across disciplinary boundaries, create rather than reproduce, and take ethical responsibility for the consequences of their decisions.
A particularly promising assessment approach in the AI era may be among the simplest: sit across from a student, give them a novel problem, and ask them to think out loud. No AI can fake the live, embodied demonstration of understanding that occurs in real-time intellectual dialogue. The irony is that the oldest form of assessmentโthe Socratic oral examinationโmay prove to be the most AI-resistant.
References
[1] Kofinas, A.K., Tsay, C., & Pike, D. (2025). The Impact of Generative AI on Academic Integrity of Authentic Assessments Within a Higher Education Context. British Journal of Educational Technology, 56(5).
[2] Usher, M. (2025). Generative AI vs. Instructor vs. Peer Assessments: A Comparison of Grading and Feedback in Higher Education. Assessment & Evaluation in Higher Education, 50(4).
[3] Ilieva, G., Yankova, T., Ruseva, M., & Kabaivanov, S. (2025). A Framework for Generative AI-Driven Assessment in Higher Education. Information, 16(6), 472.
[4] Newton, P. & Draper, M. (2025). Widespread Use of Summative Online Unsupervised Remote Examinations in UK Higher Education: Ethical and Quality Assurance Implications. Quality in Higher Education, 31(1).
[5] Li, Y. & Xie, M. (2025). Navigating International Challenges of Quality Assurance in Higher Education: A Synergy of Gen-AI and Human-Made Solutions. Chinese Frontiers of Social Psychology and Sociology.