Trend Analysis · Education · Systematic Review

LLM-Powered Tutoring Systems: When AI Teaches, Who Really Learns?

LLM-based intelligent tutoring systems promise to democratize one-on-one instruction at scale. But new evidence reveals a disturbing paradox: the same models that generate adaptive scaffolding also hallucinate mathematical proofs, reinforce cultural biases, and may widen the very achievement gaps they claim to close.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The dream is seductive in its simplicity: a patient, infinitely available tutor for every student on earth, one that adapts in real time to individual misconceptions, scaffolds learning with Socratic precision, and never loses its temper at two in the morning. Large language models have brought this vision tantalizingly close to reality. Intelligent tutoring systems (ITS) powered by GPT-4, Claude, and their successors now generate feedback that is contextually aware, linguistically fluent, and pedagogically structured. Universities from MIT to the National University of Singapore have deployed them at scale. Venture capital has poured approximately $2.4 billion into EdTech startups annually in recent years, with AI-powered learning solutions capturing an increasing share of that investment.

Yet beneath this enthusiasm lies a set of uncomfortable findings that the field has been slow to confront. LLM-based tutors hallucinate: not in benign ways, but in ways that can teach students wrong mathematics with the confident authority of an expert. They encode cultural biases that systematically disadvantage students from non-Western educational traditions. And the adaptive feedback they provide may, paradoxically, reduce the productive struggle that cognitive science identifies as essential to deep learning. The question is no longer whether LLMs can tutor. It is whether, in their current form, they should.

The Research Landscape: From Rule-Based to Foundation Model Tutoring

Intelligent tutoring systems have a four-decade lineage. The earliest systems (LISP Tutor, Cognitive Tutor) relied on explicit cognitive models of student knowledge, hand-coded production rules, and narrow domain ontologies. They were effective within their domains but brittle, expensive to build, and impossible to scale across subjects.

The LLM revolution upends this architecture entirely. Rather than encoding expert knowledge in rules, foundation model tutors generate pedagogical responses from massive pre-training corpora. This enables two capabilities that were previously unattainable: domain generality (a single model can tutor mathematics, history, and programming) and natural language interaction (students can express confusion in their own words rather than selecting from predetermined options).

Cohn, Rayala, and Srivastava (2025) provide a rigorous theoretical framework for this new paradigm, combining Evidence-Centered Design with Social Cognitive Theory for adaptive scaffolding in LLM-based agents focused on STEM+C learning, and demonstrating the approach through "Inquizzitor," an LLM-based formative assessment agent that integrates human-AI hybrid intelligence. The key insight is that effective tutoring requires not merely generating correct answers but calibrating the level of support to the student's evolving competence, grounded in principled assessment design that captures evidence of learning as it occurs.

The theoretical contribution is significant because it highlights that current LLM tutors lack the assessment-centered architecture needed to adaptively scaffold learning. LLMs are next-token predictors optimized for helpfulness, and helpfulness in the training data overwhelmingly means providing information rather than strategically withholding it. Effective scaffolding, as described in Collins et al.'s Cognitive Apprenticeship model (whose methods include modeling, coaching, scaffolding and fading, articulation, and reflection), requires the tutor to sense where the student is and adjust support accordingly. Cohn et al.'s framework addresses this by integrating evidence-centered assessment into the agent's decision loop.
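
To make this concrete, here is a minimal sketch, in Python, of what an evidence-centered scaffolding loop might look like: every graded response updates a running competence estimate, and the level of support is chosen (and faded) from that estimate rather than from a generic notion of helpfulness. This is an illustration of the general idea, not Cohn et al.'s implementation or Inquizzitor's architecture; the update rule, thresholds, and support levels are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class StudentModel:
    """Running competence estimate per knowledge component, in [0, 1]."""
    competence: dict = field(default_factory=dict)

    def update(self, kc: str, correct: bool, weight: float = 0.2) -> None:
        # Simple exponential update: evidence of success raises the estimate,
        # evidence of struggle lowers it. A production system would use a
        # psychometric model (e.g., Bayesian knowledge tracing) instead.
        prior = self.competence.get(kc, 0.5)
        target = 1.0 if correct else 0.0
        self.competence[kc] = (1 - weight) * prior + weight * target

def choose_support(model: StudentModel, kc: str) -> str:
    """Map the current competence estimate to a scaffolding move."""
    c = model.competence.get(kc, 0.5)
    if c >= 0.85:
        return "fade"             # withhold help and let the student work
    if c >= 0.65:
        return "prompt"           # ask a guiding question
    if c >= 0.40:
        return "hint"             # point to the relevant concept
    return "worked_example"       # model the full solution

# Each graded response is evidence that updates the student model *before*
# the tutor decides how much help to give on the next turn.
model = StudentModel()
for correct in [False, False, True, True, True]:
    model.update("fraction_addition", correct)
    print(choose_support(model, "fraction_addition"))
```

The point of the sketch is the ordering: assessment evidence drives the support decision, which is exactly the loop that a helpfulness-optimized LLM, left to its own devices, does not run.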

The Hallucination Problem: When Your Tutor Teaches You Wrong

Steinbach, Bhandari, and Meyer (2025) provide a rigorous empirical study on what happens when LLM tutors make mistakes in mathematics instruction. Their controlled experiment systematically introduced LLM-generated erroneous feedback at varying rates and measured the impact on learning outcomes, self-efficacy, and trust calibration.

Three findings stand out (a schematic sketch of the error-injection manipulation appears at the end of this section):

  • Students who received erroneous feedback showed lower performance on transfer problems compared to control conditions.
  • Students had difficulty detecting when the tutor was wrong, raising questions about trust calibration in AI-assisted learning.
  • Exposure to confident-but-wrong feedback appeared to affect students' willingness to challenge future tutor assertions, suggesting potential epistemic harm beyond the immediate factual errors.

This last finding deserves emphasis. The pedagogical harm of LLM hallucination is not merely that students learn incorrect facts; that can be corrected. The deeper damage is epistemic: students lose confidence in their own ability to evaluate mathematical reasoning, because they have learned that their skepticism is unreliable. When the tutor says something that seems wrong and the student protests, the tutor, drawing on its vast training data, can generate a fluent, authoritative justification that silences the objection. The student learns to defer.
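
The manipulation at the heart of this experimental design, injecting erroneous feedback at a controlled rate, can be pictured with a short sketch. This is only an illustration of the setup described above, not Steinbach et al.'s materials; the feedback generator and corruption function are hypothetical placeholders supplied by the experimenter.

```python
import random

def make_feedback_condition(generate_feedback, corrupt, error_rate, seed=0):
    """Build a feedback function that returns erroneous feedback at a fixed rate.

    generate_feedback(problem, answer) -> tutor feedback assumed to be correct
    corrupt(feedback)                  -> a plausible-but-wrong version of it
    error_rate                         -> fraction of turns to corrupt (e.g., 0.05)
    """
    rng = random.Random(seed)

    def feedback(problem, answer):
        text = generate_feedback(problem, answer)
        if rng.random() < error_rate:
            return corrupt(text)   # confident-but-wrong feedback on this turn
        return text

    return feedback

# Each experimental condition gets its own error rate; the control group gets 0.0.
# Learning outcomes, self-efficacy, and trust measures are then compared across groups.
```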

Adaptive Analytics: What LLMs Know About What Students Know

Fan, Mihaylova, and Akram (2025) approach the problem from a different angle: rather than using LLMs to generate feedback, they use them to model student knowledge. Their LLM-KC (Knowledge Component) framework leverages language models to automatically identify the discrete knowledge components that students must master, replacing the expert-driven, hand-coding process that has been the bottleneck in ITS development for decades.

The innovation is technically elegant: the LLM analyzes problem descriptions, student responses, and error patterns to infer a latent knowledge component structure, which is then validated against learning curve analytics. If a proposed KC decomposition accurately predicts the power-law improvement in student performance over practice, it is retained; otherwise, the LLM iterates.
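
The validation step can be illustrated with a short sketch: for each proposed knowledge component, fit a power-law learning curve to students' error rates across successive practice opportunities and keep the decomposition only if the fit is good. This is a simplified stand-in for the learning curve analytics described above, not Fan et al.'s pipeline; the sample data and acceptance threshold are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(opportunity, a, b):
    """Power law of practice: error rate decays as a * opportunity**(-b)."""
    return a * np.power(opportunity, -b)

def kc_fit_quality(error_rates):
    """Fit a power-law curve to mean error rate per practice opportunity and
    return R^2 as a measure of how skill-like the proposed KC behaves."""
    x = np.arange(1, len(error_rates) + 1, dtype=float)
    y = np.asarray(error_rates, dtype=float)
    (a, b), _ = curve_fit(power_law, x, y, p0=[0.5, 0.5], maxfev=10_000)
    residuals = y - power_law(x, a, b)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - ss_res / ss_tot

# Hypothetical mean error rates for one proposed KC over 8 practice opportunities.
observed = [0.62, 0.45, 0.38, 0.31, 0.28, 0.24, 0.22, 0.20]
r_squared = kc_fit_quality(observed)
print(f"R^2 = {r_squared:.3f}")
if r_squared < 0.8:   # the threshold is illustrative, not taken from the paper
    print("Reject this decomposition and let the LLM propose a refinement.")
```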

Their results on an introductory programming course show that LLM-inferred KCs match or exceed expert-defined KCs in predictive accuracy, according to learning curve analyses. The real value is scalability: what took domain experts substantial time per course can now be accomplished in minutes. This has profound implications for extending ITS to under-resourced educational contexts (community colleges, Global South institutions, non-English-language curricula) where expert time is the binding constraint.

Critical Analysis: Claims and Evidence

| Claim | Evidence | Verdict |
| --- | --- | --- |
| LLM tutors can deliver personalized instruction at scale | Multiple deployments (Khan Academy Khanmigo, Carnegie Learning MATHia+LLM) with millions of users | ✅ Supported |
| LLM tutoring improves learning outcomes | Saleem et al. (2025) report significant positive correlation (r=0.74) in a 268-instructor survey; but no long-term RCT exists | ⚠️ Uncertain |
| LLM hallucinations in tutoring are pedagogically harmful | Steinbach et al. (2025): significantly lower transfer scores at low error rates, persistent trust damage | ✅ Supported |
| LLM-based knowledge component modeling can replace expert coding | Fan et al. (2025): LLM-inferred KCs match or exceed expert-defined KCs, at dramatically faster speed | ✅ Supported |
| AI tutoring will narrow achievement gaps | Chinta et al. (2024): systematic bias favoring English-dominant, Western-educated learner profiles | ❌ Refuted (without intervention) |
| Current LLMs can calibrate scaffolding level effectively | Cohn et al. (2025): LLMs lack evidence-centered assessment architecture for adaptive scaffolding | ❌ Refuted |

The Fairness Paradox

Chinta, Wang, and Yin (2024) provide a widely cited systematic review of fairness challenges in AI education. Their FairAIED framework integrates bias sources, fairness definitions, mitigation strategies, evaluation resources, and ethical considerations into a single education-centered framework. Drawing on this comprehensive mapping, the fairness concerns in educational AI can be understood through several interconnected layers:

  • Data bias: Training corpora may over-represent certain educational norms, potentially disadvantaging students from diverse educational cultures.
  • Algorithmic bias: Performance prediction models may systematically underestimate the abilities of students from underrepresented demographic groups, leading to less challenging content recommendations.
  • Interaction effects: Differences in how AI systems respond to diverse student populations may compound existing inequities in ways that are difficult to detect without systematic fairness auditing (a minimal sketch of such an audit follows below).

The perverse outcome is that AI tutoring, deployed without fairness-aware design, may widen the achievement gaps it promises to close. Students who already benefit from high-quality educational environments receive the most effective AI scaffolding, while students who most need personalized support receive a degraded version of it.
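
None of these effects is visible without measurement. As a minimal illustration of what systematic fairness auditing could involve in practice (FairAIED itself is a conceptual framework, not a tool), the sketch below compares a performance-prediction model's error across learner subgroups; the data, group labels, and tolerance are hypothetical.

```python
from collections import defaultdict

def audit_prediction_gaps(records, tolerance=0.05):
    """Compare a model's mean absolute prediction error across learner subgroups.

    records: iterable of (group, predicted_score, actual_score) tuples.
    Returns per-group mean error and flags groups whose error exceeds the
    best-served group's error by more than `tolerance`.
    """
    errors = defaultdict(list)
    for group, predicted, actual in records:
        errors[group].append(abs(predicted - actual))

    mean_error = {g: sum(v) / len(v) for g, v in errors.items()}
    best = min(mean_error.values())
    flagged = {g: e for g, e in mean_error.items() if e - best > tolerance}
    return mean_error, flagged

# Hypothetical audit data: (subgroup, predicted mastery, observed mastery).
sample = [
    ("group_a", 0.80, 0.78), ("group_a", 0.70, 0.72), ("group_a", 0.90, 0.88),
    ("group_b", 0.60, 0.75), ("group_b", 0.55, 0.70), ("group_b", 0.65, 0.74),
]
per_group, flagged = audit_prediction_gaps(sample)
print(per_group)   # group_b's mastery is systematically underestimated
print(flagged)     # groups whose prediction error is disproportionately large
```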

Open Questions and Future Directions

  • Can we build LLM tutors that strategically withhold help? Cohn et al.'s (2025) Evidence-Centered Design framework provides a theoretical basis, but implementing principled fading in practice remains an open challenge. This may require reward functions that value long-term learning over short-term student satisfaction, a direct tension with commercial incentives.
  • What is the acceptable hallucination rate for educational LLMs? Steinbach et al.'s findings imply that even low error rates, within the range observed in current LLMs, produce significant learning harm in mathematics. Should educational LLMs undergo a domain-specific certification process analogous to medical device approval?
  • How do we measure learning, not just engagement? Most deployed systems optimize for session length and return visits, metrics that correlate with but do not guarantee learning. The field needs standardized outcome measures that capture transfer, retention, and metacognitive development.
  • Can LLM tutors be culturally adaptive, not just linguistically translated? Translation is insufficient. Effective tutoring in Confucian heritage cultures, Indigenous knowledge systems, or Freirean pedagogical traditions requires fundamentally different interaction patterns that current architectures do not support.
  • Who owns the student model? As LLM tutors build increasingly detailed profiles of student knowledge, misconceptions, and learning trajectories, questions of data sovereignty, particularly for minors, become urgent.

What This Means for Educators and Policymakers

The evidence is clear on two points. First, LLM-based tutoring systems represent a genuine technological capability that will reshape education. Second, deploying them without addressing hallucination, bias, and scaffolding calibration will cause measurable harm to the students who can least afford it.

The path forward is not to reject AI tutoring but to demand higher standards for it. We need educational LLMs that are evaluated on learning outcomes, not engagement metrics; that undergo adversarial testing for hallucination in specific domains; that are designed with fairness constraints baked into the architecture, not bolted on as post-hoc audits. The researchers who build these systems and the policymakers who regulate them must resist the seductive narrative that more AI automatically means better education. Sometimes, the most pedagogically powerful thing a tutor can do is stay silent and let the student struggle.

Tools like ORAA ResearchBrain can help educators track the rapidly evolving evidence base in this field, identifying which claims are substantiated and which remain aspirational.

References (5)

[1] Cohn, C., Rayala, S., Srivastava, N., Fonteles, J., Jain, S., Luo, X., Mereddy, D., Mohammed, N., & Biswas, G. (2025). A Theory of Adaptive Scaffolding for LLM-Based Pedagogical Agents. arXiv:2508.01503.
[2] Steinbach, M., Bhandari, S., Meyer, J., & Pardos, Z.A. (2025). When LLMs Hallucinate: Examining the Effects of Erroneous Feedback in Math Tutoring Systems. Educational Data Mining.
[3] Fan, J., Mihaylova, T., Akram, B., Norouzi, N., Brusilovsky, P., Hellas, A., & Leinonen, J. (2025). Adaptive Learning Curve Analytics with LLM-KC Identifiers for Knowledge Component Refinement. UK & Ireland Computing Education Research Conference.
[4] Chinta, S.V., Wang, Z., Yin, Z., Hoang, N., Gonzalez, M., Le Quy, T., & Zhang, W. (2024). FairAIED: Navigating Fairness, Bias, and Ethics in Educational AI Applications. arXiv:2407.18745.
[5] Saleem, S., Aziz, M.U., Iqbal, M.J., & Abbas, S. (2025). AI in Education: Personalized Learning Systems and Their Impact on Student Performance and Engagement. The Critical Review of Social Sciences Studies.

