Trend Analysis · Education · Systematic Review

LLM-Powered Tutoring Systems: When AI Teaches, Who Really Learns?

LLM-based intelligent tutoring systems promise to democratize one-on-one instruction at scale. But new evidence reveals a disturbing paradox: the same models that generate adaptive scaffolding also hallucinate mathematical proofs, reinforce cultural biases, and may widen the very achievement gaps they claim to close.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The dream is seductive in its simplicity: a patient, infinitely available tutor for every student on earth, one that adapts in real time to individual misconceptions, scaffolds learning with Socratic precision, and never loses its temper at two in the morning. Large language models have brought this vision tantalizingly close to reality. Intelligent tutoring systems (ITS) powered by GPT-4, Claude, and their successors now generate feedback that is contextually aware, linguistically fluent, and pedagogically structured. Universities from MIT to the National University of Singapore have deployed them at scale. Venture capital has poured approximately $2.4 billion into EdTech startups annually in recent years, with AI-powered learning solutions capturing an increasing share of that investment.

Yet beneath this enthusiasm lies a set of uncomfortable findings that the field has been slow to confront. LLM-based tutors hallucinate: not in benign ways, but in ways that can teach students wrong mathematics with the confident authority of an expert. They encode cultural biases that systematically disadvantage students from non-Western educational traditions. And the adaptive feedback they provide may, paradoxically, reduce the productive struggle that cognitive science identifies as essential to deep learning. The question is no longer whether LLMs can tutor. It is whether, in their current form, they should.

The Research Landscape: From Rule-Based to Foundation Model Tutoring

Intelligent tutoring systems have a four-decade lineage. The earliest systems (LISP Tutor, Cognitive Tutor) relied on explicit cognitive models of student knowledge, hand-coded production rules, and narrow domain ontologies. They were effective within their domains but brittle, expensive to build, and impossible to scale across subjects.

The LLM revolution upends this architecture entirely. Rather than encoding expert knowledge in rules, foundation model tutors generate pedagogical responses from massive pre-training corpora. This enables two capabilities that were previously unattainable: domain generality (a single model can tutor mathematics, history, and programming) and natural language interaction (students can express confusion in their own words rather than selecting from predetermined options).

Cohn, Rayala, and Srivastava (2025) provide a rigorous theoretical framework for this new paradigm, combining Evidence-Centered Design with Social Cognitive Theory for adaptive scaffolding in LLM-based agents focused on STEM+C learning, and demonstrating the approach through "Inquizzitor," an LLM-based formative assessment agent that integrates human-AI hybrid intelligence. The key insight is that effective tutoring requires not merely generating correct answers but calibrating the level of support to the student's evolving competence, grounded in principled assessment design that captures evidence of learning as it occurs.

The theoretical contribution is significant because it highlights that current LLM tutors lack the assessment-centered architecture needed to adaptively scaffold learning. LLMs are next-token predictors optimized for helpfulness, and helpfulness in the training data overwhelmingly means providing information rather than strategically withholding it. Effective scaffolding, as described in Collins et al.'s Cognitive Apprenticeship model (whose methods include modeling, coaching, scaffolding and fading, articulation, and reflection), requires the tutor to sense where the student is and adjust support accordingly. Cohn et al.'s framework addresses this by integrating evidence-centered assessment into the agent's decision loop.
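
To make this concrete, here is a minimal sketch, in Python, of what an evidence-centered scaffolding loop might look like: every graded response updates a running competence estimate, and the level of support is chosen (and faded) from that estimate rather than from a generic notion of helpfulness. This is an illustration of the general idea, not Cohn et al.'s implementation or Inquizzitor's architecture; the update rule, thresholds, and support levels are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class StudentModel:
    """Running competence estimate per knowledge component, in [0, 1]."""
    competence: dict = field(default_factory=dict)

    def update(self, kc: str, correct: bool, weight: float = 0.2) -> None:
        # Simple exponential update: evidence of success raises the estimate,
        # evidence of struggle lowers it. A production system would use a
        # psychometric model (e.g., Bayesian knowledge tracing) instead.
        prior = self.competence.get(kc, 0.5)
        target = 1.0 if correct else 0.0
        self.competence[kc] = (1 - weight) * prior + weight * target

def choose_support(model: StudentModel, kc: str) -> str:
    """Map the current competence estimate to a scaffolding move."""
    c = model.competence.get(kc, 0.5)
    if c >= 0.85:
        return "fade"             # withhold help and let the student work
    if c >= 0.65:
        return "prompt"           # ask a guiding question
    if c >= 0.40:
        return "hint"             # point to the relevant concept
    return "worked_example"       # model the full solution

# Each graded response is evidence that updates the student model *before*
# the tutor decides how much help to give on the next turn.
model = StudentModel()
for correct in [False, False, True, True, True]:
    model.update("fraction_addition", correct)
    print(choose_support(model, "fraction_addition"))
```

The point of the sketch is the ordering: assessment evidence drives the support decision, which is exactly the loop that a helpfulness-optimized LLM, left to its own devices, does not run.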

The Hallucination Problem: When Your Tutor Teaches You Wrong

Steinbach, Bhandari, and Meyer (2025) provide a rigorous empirical study on what happens when LLM tutors make mistakes in mathematics instruction. Their controlled experiment systematically introduced LLM-generated erroneous feedback at varying rates and measured the impact on learning outcomes, self-efficacy, and trust calibration.

Three findings stand out (a schematic sketch of the error-injection manipulation appears at the end of this section):

  • Students who received erroneous feedback showed lower performance on transfer problems compared to control conditions.
  • Students had difficulty detecting when the tutor was wrong, raising questions about trust calibration in AI-assisted learning.
  • Exposure to confident-but-wrong feedback appeared to affect students' willingness to challenge future tutor assertions, suggesting potential epistemic harm beyond the immediate factual errors.

This last finding deserves emphasis. The pedagogical harm of LLM hallucination is not merely that students learn incorrect facts; that can be corrected. The deeper damage is epistemic: students lose confidence in their own ability to evaluate mathematical reasoning, because they have learned that their skepticism is unreliable. When the tutor says something that seems wrong and the student protests, the tutor, drawing on its vast training data, can generate a fluent, authoritative justification that silences the objection. The student learns to defer.
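
The manipulation at the heart of this experimental design, injecting erroneous feedback at a controlled rate, can be pictured with a short sketch. This is only an illustration of the setup described above, not Steinbach et al.'s materials; the feedback generator and corruption function are hypothetical placeholders supplied by the experimenter.

```python
import random

def make_feedback_condition(generate_feedback, corrupt, error_rate, seed=0):
    """Build a feedback function that returns erroneous feedback at a fixed rate.

    generate_feedback(problem, answer) -> tutor feedback assumed to be correct
    corrupt(feedback)                  -> a plausible-but-wrong version of it
    error_rate                         -> fraction of turns to corrupt (e.g., 0.05)
    """
    rng = random.Random(seed)

    def feedback(problem, answer):
        text = generate_feedback(problem, answer)
        if rng.random() < error_rate:
            return corrupt(text)   # confident-but-wrong feedback on this turn
        return text

    return feedback

# Each experimental condition gets its own error rate; the control group gets 0.0.
# Learning outcomes, self-efficacy, and trust measures are then compared across groups.
```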

Adaptive Analytics: What LLMs Know About What Students Know

Fan, Mihaylova, and Akram (2025) approach the problem from a different angle: rather than using LLMs to generate feedback, they use them to model student knowledge. Their LLM-KC (Knowledge Component) framework leverages language models to automatically identify the discrete knowledge components that students must master, replacing the expert-driven, hand-coding process that has been the bottleneck in ITS development for decades.

The innovation is technically elegant: the LLM analyzes problem descriptions, student responses, and error patterns to infer a latent knowledge component structure, which is then validated against learning curve analytics. If a proposed KC decomposition accurately predicts the power-law improvement in student performance over practice, it is retained; otherwise, the LLM iterates.
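
The validation step can be illustrated with a short sketch: for each proposed knowledge component, fit a power-law learning curve to students' error rates across successive practice opportunities and keep the decomposition only if the fit is good. This is a simplified stand-in for the learning curve analytics described above, not Fan et al.'s pipeline; the sample data and acceptance threshold are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(opportunity, a, b):
    """Power law of practice: error rate decays as a * opportunity**(-b)."""
    return a * np.power(opportunity, -b)

def kc_fit_quality(error_rates):
    """Fit a power-law curve to mean error rate per practice opportunity and
    return R^2 as a measure of how skill-like the proposed KC behaves."""
    x = np.arange(1, len(error_rates) + 1, dtype=float)
    y = np.asarray(error_rates, dtype=float)
    (a, b), _ = curve_fit(power_law, x, y, p0=[0.5, 0.5], maxfev=10_000)
    residuals = y - power_law(x, a, b)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - ss_res / ss_tot

# Hypothetical mean error rates for one proposed KC over 8 practice opportunities.
observed = [0.62, 0.45, 0.38, 0.31, 0.28, 0.24, 0.22, 0.20]
r_squared = kc_fit_quality(observed)
print(f"R^2 = {r_squared:.3f}")
if r_squared < 0.8:   # the threshold is illustrative, not taken from the paper
    print("Reject this decomposition and let the LLM propose a refinement.")
```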

Their results on an introductory programming course show that LLM-inferred KCs match or exceed expert-defined KCs in predictive accuracy, according to learning curve analyses. The real value is scalability: what took domain experts substantial time per course can now be accomplished in minutes. This has profound implications for extending ITS to under-resourced educational contexts (community colleges, Global South institutions, non-English-language curricula) where expert time is the binding constraint.

Critical Analysis: Claims and Evidence

| Claim | Evidence | Verdict |
| --- | --- | --- |
| LLM tutors can deliver personalized instruction at scale | Multiple deployments (Khan Academy Khanmigo, Carnegie Learning MATHia+LLM) with millions of users | ✅ Supported |
| LLM tutoring improves learning outcomes | Saleem et al. (2025) report significant positive correlation (r=0.74) in a 268-instructor survey; but no long-term RCT exists | ⚠️ Uncertain |
| LLM hallucinations in tutoring are pedagogically harmful | Steinbach et al. (2025): significantly lower transfer scores at low error rates, persistent trust damage | ✅ Supported |
| LLM-based knowledge component modeling can replace expert coding | Fan et al. (2025): LLM-inferred KCs match or exceed expert-defined KCs, at dramatically faster speed | ✅ Supported |
| AI tutoring will narrow achievement gaps | Chinta et al. (2024): systematic bias favoring English-dominant, Western-educated learner profiles | ❌ Refuted (without intervention) |
| Current LLMs can calibrate scaffolding level effectively | Cohn et al. (2025): LLMs lack evidence-centered assessment architecture for adaptive scaffolding | ❌ Refuted |

The Fairness Paradox

Chinta, Wang, and Yin (2024) provide a widely cited systematic review of fairness challenges in AI education. Their FairAIED framework integrates bias sources, fairness definitions, mitigation strategies, evaluation resources, and ethical considerations into a single education-centered framework. Drawing on this comprehensive mapping, the fairness concerns in educational AI can be understood through several interconnected layers:

  • Data bias: Training corpora may over-represent certain educational norms, potentially disadvantaging students from diverse educational cultures.
  • Algorithmic bias: Performance prediction models may systematically underestimate the abilities of students from underrepresented demographic groups, leading to less challenging content recommendations.
  • Interaction effects: Differences in how AI systems respond to diverse student populations may compound existing inequities in ways that are difficult to detect without systematic fairness auditing (a minimal sketch of such an audit follows below).

The perverse outcome is that AI tutoring, deployed without fairness-aware design, may widen the achievement gaps it promises to close. Students who already benefit from high-quality educational environments receive the most effective AI scaffolding, while students who most need personalized support receive a degraded version of it.
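
None of these effects is visible without measurement. As a minimal illustration of what systematic fairness auditing could involve in practice (FairAIED itself is a conceptual framework, not a tool), the sketch below compares a performance-prediction model's error across learner subgroups; the data, group labels, and tolerance are hypothetical.

```python
from collections import defaultdict

def audit_prediction_gaps(records, tolerance=0.05):
    """Compare a model's mean absolute prediction error across learner subgroups.

    records: iterable of (group, predicted_score, actual_score) tuples.
    Returns per-group mean error and flags groups whose error exceeds the
    best-served group's error by more than `tolerance`.
    """
    errors = defaultdict(list)
    for group, predicted, actual in records:
        errors[group].append(abs(predicted - actual))

    mean_error = {g: sum(v) / len(v) for g, v in errors.items()}
    best = min(mean_error.values())
    flagged = {g: e for g, e in mean_error.items() if e - best > tolerance}
    return mean_error, flagged

# Hypothetical audit data: (subgroup, predicted mastery, observed mastery).
sample = [
    ("group_a", 0.80, 0.78), ("group_a", 0.70, 0.72), ("group_a", 0.90, 0.88),
    ("group_b", 0.60, 0.75), ("group_b", 0.55, 0.70), ("group_b", 0.65, 0.74),
]
per_group, flagged = audit_prediction_gaps(sample)
print(per_group)   # group_b's mastery is systematically underestimated
print(flagged)     # groups whose prediction error is disproportionately large
```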

Open Questions and Future Directions

  • Can we build LLM tutors that strategically withhold help? Cohn et al.'s (2025) Evidence-Centered Design framework provides a theoretical basis, but implementing principled fading in practice remains an open challenge. This may require reward functions that value long-term learning over short-term student satisfaction, a direct tension with commercial incentives.
  • What is the acceptable hallucination rate for educational LLMs? Steinbach et al.'s findings imply that even low error rates, within the range observed in current LLMs, produce significant learning harm in mathematics. Should educational LLMs undergo a domain-specific certification process analogous to medical device approval?
  • How do we measure learning, not just engagement? Most deployed systems optimize for session length and return visits, metrics that correlate with but do not guarantee learning. The field needs standardized outcome measures that capture transfer, retention, and metacognitive development.
  • Can LLM tutors be culturally adaptive, not just linguistically translated? Translation is insufficient. Effective tutoring in Confucian heritage cultures, Indigenous knowledge systems, or Freirean pedagogical traditions requires fundamentally different interaction patterns that current architectures do not support.
  • Who owns the student model? As LLM tutors build increasingly detailed profiles of student knowledge, misconceptions, and learning trajectories, questions of data sovereignty, particularly for minors, become urgent.

What This Means for Educators and Policymakers

The evidence is clear on two points. First, LLM-based tutoring systems represent a genuine technological capability that will reshape education. Second, deploying them without addressing hallucination, bias, and scaffolding calibration will cause measurable harm to the students who can least afford it.

The path forward is not to reject AI tutoring but to demand higher standards for it. We need educational LLMs that are evaluated on learning outcomes, not engagement metrics; that undergo adversarial testing for hallucination in specific domains; that are designed with fairness constraints baked into the architecture, not bolted on as post-hoc audits. The researchers who build these systems and the policymakers who regulate them must resist the seductive narrative that more AI automatically means better education. Sometimes, the most pedagogically powerful thing a tutor can do is stay silent and let the student struggle.

Tools like ORAA ResearchBrain can help educators track the rapidly evolving evidence base in this field, identifying which claims are substantiated and which remain aspirational.

References (5)

[1] Cohn, C., Rayala, S., Srivastava, N., Fonteles, J., Jain, S., Luo, X., Mereddy, D., Mohammed, N., & Biswas, G. (2025). A Theory of Adaptive Scaffolding for LLM-Based Pedagogical Agents. arXiv:2508.01503.
[2] Steinbach, M., Bhandari, S., Meyer, J., & Pardos, Z.A. (2025). When LLMs Hallucinate: Examining the Effects of Erroneous Feedback in Math Tutoring Systems. Educational Data Mining.
[3] Fan, J., Mihaylova, T., Akram, B., Norouzi, N., Brusilovsky, P., Hellas, A., & Leinonen, J. (2025). Adaptive Learning Curve Analytics with LLM-KC Identifiers for Knowledge Component Refinement. UK & Ireland Computing Education Research Conference.
[4] Chinta, S.V., Wang, Z., Yin, Z., Hoang, N., Gonzalez, M., Le Quy, T., & Zhang, W. (2024). FairAIED: Navigating Fairness, Bias, and Ethics in Educational AI Applications. arXiv:2407.18745.
[5] Saleem, S., Aziz, M.U., Iqbal, M.J., & Abbas, S. (2025). AI in Education: Personalized Learning Systems and Their Impact on Student Performance and Engagement. The Critical Review of Social Sciences Studies.

