AI Models Struggle Where Students Need Help Most: New Research on K-12 Math Education
When AI Gets the Answer Right by Ignoring the Student
Artificial intelligence is advancing at breakneck speed. Models are acing general intelligence benchmarks, solving complex problems, and even passing professional exams. But new research from our team reveals a troubling gap: AI models perform significantly worse when analyzing the very thing that matters most in education—student work that contains errors.
In other words, the students who need the most help are the ones AI is least equipped to support.
Our own Ryan Knight (Team Lead, Data Solutions) and Albert Zhang (Data Solutions Engineer) were part of the research team that uncovered these critical findings.
The Research: DrawEduMath Benchmark Update
Our team recently published a preprint updating DrawEduMath, a benchmark designed to evaluate how well AI models understand handwritten student work in K-12 math. Over the past year, we tested 11 new models and uncovered findings that should give educators and edtech developers pause.
Here’s what we found:
- Models Perform Worse on Student Work Containing Errors
Even when researchers painstakingly redrew images to be clean and legible, models struggled to answer basic factual questions about student work that contained mistakes. This isn’t a handwriting recognition problem—it’s a comprehension problem.
- Models Struggle to Assess Correctness
AI models performed worse on questions about whether student work was correct than on straightforward factual questions. Why? Because when models encounter student errors, they often hallucinate the correct mathematical response in place of what the student actually wrote.
This makes sense from a training perspective. Most AI models are trained on correct math to become expert problem-solvers. The unintended consequence? They may actually be worse in educational contexts where the errors are the important part—the very moments when a teacher needs to intervene.
- Progress is Flat
Perhaps most striking: the trendline on performance in these domains is essentially flat over the past year. While scores on general intelligence benchmarks continue to climb rapidly, progress in understanding student work with errors has stalled.
We can’t just wait for models to get better. They haven’t meaningfully improved in this area over the last 12 months.
Imagine an AI tutoring system that gets 80% accuracy on a benchmark—not by understanding student thinking, but by simply guessing the correct answer and ignoring what the student actually wrote. That’s a real risk if benchmarks aren’t carefully designed.
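To make that risk concrete, here is a toy sketch with made-up numbers (not data from the study): a grader that never reads the student's work and always predicts "correct" still scores around 80% on a benchmark where 80% of responses happen to be correct.

```python
# Illustrative only: hypothetical benchmark where 80% of student responses
# are correct, and a "grader" that ignores the student's work entirely.
import random

random.seed(0)
benchmark = [{"student_is_correct": random.random() < 0.8} for _ in range(1000)]

def lazy_grader(item):
    """Never looks at the student's work; always predicts 'correct'."""
    return True

hits = sum(lazy_grader(item) == item["student_is_correct"] for item in benchmark)
print(f"Accuracy without reading a single answer: {hits / len(benchmark):.0%}")
```

The headline number looks respectable, yet every student who made an error is misjudged. The aggregate score hides exactly the cases that matter.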
The students who make errors—the ones who need support most—are being systematically underserved by current AI systems.
This research highlights two critical needs:
Better Benchmarks in Education
Education-specific benchmarks need to be structured around what actually matters for learning. If 20% of student work contains errors, the benchmark should focus on that 20%—not reward models for ignoring it.
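One way to structure that, sketched below with hypothetical field names (`has_error` and `model_correct` are our illustration, not the benchmark's actual schema), is to score error-containing and error-free work separately, so strong performance on easy items cannot mask weak error comprehension:

```python
# A minimal sketch of error-stratified scoring (hypothetical data and fields).
from statistics import mean

def stratified_report(results):
    """Report accuracy separately on error-containing vs. error-free work."""
    on_errors = [r["model_correct"] for r in results if r["has_error"]]
    on_clean = [r["model_correct"] for r in results if not r["has_error"]]
    return {
        "accuracy_on_errors": mean(on_errors) if on_errors else None,
        "accuracy_on_clean": mean(on_clean) if on_clean else None,
    }

# Hypothetical results: strong on clean work, weak on the 20% with errors.
results = (
    [{"has_error": True, "model_correct": i < 10} for i in range(20)]   # 50% on errors
    + [{"has_error": False, "model_correct": i < 76} for i in range(80)]  # 95% on clean
)
print(stratified_report(results))
# {'accuracy_on_errors': 0.5, 'accuracy_on_clean': 0.95}
```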
DrawEduMath is one of the few pedagogical benchmarks maintained over time, and the flat trendline is a wake-up call. We need more benchmarks like this, and we need them to measure the hard problems, not just the easy wins.
Smarter AI Training for Learning Contexts
AI models trained to be expert mathematicians may not be well-suited for education without intentional design. Teaching requires understanding how students think, not just what the right answer is. The field needs models trained on messy, real-world student work—errors and all.
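As a hypothetical illustration of that kind of intentional design (the function and numbers are ours, not the paper's), a fine-tuning mix could deliberately upsample error-containing work so it is not drowned out by correct examples:

```python
# Hypothetical sketch: rebalance a fine-tuning set toward error-containing work.
import random

random.seed(0)

def rebalance(examples, error_fraction=0.5):
    """Upsample error-containing examples to a target share of the mix."""
    errors = [e for e in examples if e["has_error"]]
    clean = [e for e in examples if not e["has_error"]]
    target = int(len(clean) * error_fraction / (1 - error_fraction))
    mix = clean + (random.choices(errors, k=target) if errors else [])
    random.shuffle(mix)
    return mix

# Hypothetical corpus: 20% errors becomes a 50/50 training mix.
corpus = [{"has_error": i < 20} for i in range(100)]
mix = rebalance(corpus)
print(sum(e["has_error"] for e in mix) / len(mix))  # ~0.5
```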
What’s Next?
This research sits at the intersection of AI and education, and we're proud that Ryan Knight and Albert Zhang from Insource's Data & AI team contributed to it. They worked alongside co-authors Lucy Li, Kyle Lo, and Nathan Anderson, with support from organizations including the Allen Institute for AI, ASSISTments, Teaching Lab, and The Learning Agency. It is an important step toward building AI that actually works for learners.
As AI continues to reshape education, we need to ensure these tools are designed with students—especially struggling students—at the center.
Read the full preprint here.
About the Research
This work updates the DrawEduMath benchmark, which evaluates AI model performance on handwritten K-12 math student work. The study tested 11 models over the past year and analyzed performance on student work containing errors versus correct work.
Questions or want to discuss this research? Reach out to us at insource@insourceservices.com or 781-235-1490.
