🔬 AI Research

From 70% to 86% on MMLU: AI's Reasoning Leap—or Illusion?

OpenAI's GPT-4 scored 86.4% on MMLU, roughly 16 percentage points above GPT-3.5, sparking claims of emergent reasoning. Look closer at the data, though, and Theory of Mind tests reveal the cracks.

[Figure: line graph of MMLU, HellaSwag, and ARC scores climbing from GPT-3.5 to GPT-4 and Gemini Ultra]

⚡ Key Takeaways

  • GPT-4's MMLU score rose about 16 points over GPT-3.5, pointing to gains in prompted reasoning across benchmarks.
  • Chain-of-thought prompting and self-consistency boost accuracy by 10-60%, mimicking System 2 thinking.
  • Theory of Mind progress is real but brittle: novel scenarios expose the limits of pattern matching.
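
Self-consistency, mentioned in the takeaways above, can be sketched in a few lines: sample several chain-of-thought answers to the same question and keep the most common final answer. The sketch below is illustrative only. `self_consistency`, `make_stub_sampler`, and the canned answers are hypothetical names and data introduced here, and the stub stands in for a real model call, which the article does not specify.

```python
from collections import Counter
from typing import Callable, Iterable, List


def self_consistency(sample_answer: Callable[[str], str],
                     question: str,
                     n_samples: int = 5) -> str:
    """Sample several reasoning paths and return the majority final answer.

    `sample_answer` is a hypothetical callable that runs one chain-of-thought
    completion and returns only the final answer string.
    """
    answers: List[str] = [sample_answer(question) for _ in range(n_samples)]
    # Majority vote; on a tie, Counter keeps the answer that appeared first.
    return Counter(answers).most_common(1)[0][0]


def make_stub_sampler(canned: Iterable[str]) -> Callable[[str], str]:
    """Build a stand-in sampler that replays canned answers (assumption:
    a real system would call a language model with temperature > 0)."""
    replay = iter(canned)

    def sample(question: str) -> str:
        return next(replay)

    return sample


if __name__ == "__main__":
    sampler = make_stub_sampler(["42", "41", "42", "42", "40"])
    print(self_consistency(sampler, "What is 6 * 7?", n_samples=5))
```

The vote only helps when wrong answers disagree with each other more than right ones do, which is one way to read the "brittle" caveat above: systematic errors survive a majority vote.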
Published by theAIcatchup

Originally reported by Dev.to