Summary:
Despite rapid innovation, today's AI models, especially large language models (LLMs) and large reasoning models (LRMs), face fundamental limits in reasoning depth, with reliability plateauing at roughly 90% on complex tasks. While AI excels at routine queries, it falters on multi-step, edge-case, and heuristic-rich problems. This paper synthesizes insights from recent research, including Apple’s 'The Illusion of Thinking' (Shojaee et al., 2025), with empirical findings from my own stress-testing of leading AI chatbots, to illuminate these critical gaps. I argue that integrating human-like heuristic reasoning with algorithmic rigor is essential for the next leap in AI capability. Strategies for effective AI evaluation and the educational implications are discussed, alongside a detailed appendix on AI’s coding limitations.
1. Introduction: The Dual Nature of AI Progress
The past decade has witnessed astonishing advances in artificial intelligence, with large language models (LLMs) and large reasoning models (LRMs) now integral to daily life, education, and industry. Yet, as AI’s reach expands, so too do questions about its true reasoning capabilities. Recent research, including Apple’s “The Illusion of Thinking” (Shojaee et al., 2025), has cast new light on the persistent gap between AI’s computational prowess and its ability to reason through complex, multi-step problems, especially those requiring human-like heuristics. My own deep dives into AI chatbot performance, beginning with DeepSeek R1 in January 2025, confirm and extend these findings.
2. The Illusion of Thought: Stress-Testing AI’s Limits
2.1. Empirical Failures in Mathematical Reasoning
In a series of controlled experiments, I challenged top AI chatbots with two demanding mathematical problems: a Math Olympiad question and a non-trivial integration. Despite repeated, targeted hints, often based on the models’ own chain-of-thought outputs, most chatbots failed to arrive at correct solutions. Common failure modes included basic algebraic errors and a stubborn inability to abandon initial, incorrect solution paths, even when confronted with clear evidence or guidance.
2.2. The Tower of Hanoi Ceiling
Apple’s paper illustrates this phenomenon with the Tower of Hanoi puzzle. While its algorithmic solution is well known, leading chatbots consistently faltered once the puzzle exceeded seven disks. This “reasoning depth ceiling” is systemic: beyond that threshold, models become “lost,” unable to self-correct even with external hints. This pattern aligns with my own findings across mathematical and coding domains.
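To make the scale of the problem concrete, here is a minimal Python sketch (my own illustration, not drawn from Apple’s experiments; the function name hanoi and its signature are simply illustrative) of the classic recursive Tower of Hanoi solution. It shows why the puzzle is such a sharp probe of reasoning depth: n disks require 2^n - 1 moves, so every additional disk doubles the length of the move chain a model must carry out without error.

def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the full move sequence for n disks as (from_peg, to_peg) pairs."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # clear the way for the largest disk
    moves.append((source, target))               # move the largest disk
    hanoi(n - 1, spare, target, source, moves)   # restack the smaller disks on top
    return moves

print(len(hanoi(7)))    # 127 moves (2**7 - 1), roughly where the cited failures begin
print(len(hanoi(8)))    # 255 moves: one more disk doubles the chain

Seven disks already demand 127 consecutive correct moves; the algorithm itself is trivial, which is precisely why sustained failure beyond this point signals a depth limit rather than missing knowledge.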
2.3. Coding Edge Cases and AI Hallucination
Similar limitations emerged in coding tasks (see Appendix). When tasked with developing a Python class to handle nuanced grade calculations, chatbots performed well on basic requirements but repeatedly failed on edge cases, such as mapping composite grade averages or calculating dot products. Even with explicit feedback, the models struggled to generalize or extrapolate solutions, revealing a persistent gap in handling boundary conditions.
2.4. LLM Efficacy in Routine Applications
It is crucial to balance these observed limitations with the profound capabilities of current AI models in less complex scenarios. While the three shortcomings above emerge primarily on highly complex, multi-step problems, large language models (LLMs) excel at the vast majority of everyday questions, often delivering accurate results well over 90–95% of the time. This high performance on routine informational and language-understanding tasks is widely observed and reflected in leading models’ scores on general-knowledge benchmarks such as Measuring Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2021). I routinely query chatbots to sharpen my own arguments and writing.
For instance, when I queried chatbots about global AI policies or other general-interest topics, they consistently provided detailed, accurate summaries that would have taken hours to compile manually. This dichotomy, between robust performance on common tasks and significant challenges with complex reasoning, highlights the specific areas where current AI paradigms falter and where future innovation is most needed.
3. The Heuristic Gap: What’s Missing in Modern AI
3.1. Lessons from Early Expert Systems
Early AI expert systems, despite severe computational constraints, achieved surprising success by leveraging heuristic, symbolic reasoning. This “art of reasoning” mirrors the intuitive, creative strategies human experts use to tackle difficult problems, often yielding elegant, efficient solutions. In contrast, today’s large language models (LLMs) and large reasoning models (LRMs), empowered by vast compute, tend to favor brute-force statistical methods, often at the expense of this heuristic dimension.
3.2. AlphaGo and the Myth of AI Creativity
AlphaGo’s 2016 victory over Lee Sedol is often cited as a triumph of AI creativity. Yet a closer look reveals that its success was rooted in computational depth, exploring game states far beyond human reach, rather than in genuine, human-like intuition. AlphaGo’s strategy exploited human tendencies (such as prioritizing early territory over precise stone counting), integrating these into its play. While advances in transfer learning (Kurzweil, 2024) are impressive, they do not erase the fundamental difference between AI’s algorithmic reasoning and the heuristic, adaptive thinking of human experts.
3.3. The Value of Human Heuristics
Ray Kurzweil has argued that AI should appear “less intelligent,” suggesting that instant, perfect solutions would make AI too easily distinguishable from humans. However, this view overlooks the deeper issue: the process of reasoning. Human experts rely on heuristics, or efficient, intuitive shortcuts. Students often value and learn from these elegant steps more than the final answer itself. Evaluating the chain-of-reasoning, not just the output, is thus a more meaningful way to distinguish AI from human thought.
4. Educational Implications: The AI Fluency Imperative
The integration of AI into education is accelerating, as evidenced by Ohio State University’s 2025 mandate for AI fluency. As large language models (LLMs) become capable of completing entire homework assignments, the challenge of distinguishing human from AI output grows. To foster genuine learning, assignments must increasingly require detailed, heuristic-rich reasoning chains, transforming them into non-trivial, thought-provoking tasks. This shift is vital for maintaining academic integrity and for preparing students to work alongside, and beyond, AI.
5. Strategies for Effective AI Deep Dives
Clearly, any powerful technology necessitates a thorough "deep dive" to fully grasp its capabilities and inherent limitations. For AI, some key strategies include:
5.1. Targeting Complex Problems
Begin with challenging problems for which solutions are well-understood. This approach immediately reveals AI limitations and failure points. My own initial chatbot testing with non-trivial math problems effectively highlighted these “hallucinations” from the outset.
5.2. Gradual Difficulty Adjustment
Once weaknesses are identified, systematically lower problem complexity to pinpoint where accuracy returns. The Tower of Hanoi puzzle, with its easily adjustable complexity, is a representative testbed.
5.3. Scrutinizing Edge Cases
My coding experiments highlighted that while chatbots are competent with basic logic, they consistently fail at complex edge cases, such as mapping composite grade averages. This inherent difficulty with edge cases is often captured by the '80/20 rule' (the Pareto Principle) in software engineering: roughly 80% of the effort in solving complex problems goes into correctly handling boundary conditions, while only 20% involves the core logic. Even with explicit prompting, the models struggled to resolve these cases. The ultimate human-derived fix, extending the grade mapping with additional intermediate grades (see the Appendix), exemplifies how heuristic reasoning can overcome AI's current limitations in handling nuanced boundary conditions.
5.4. Cross-Model Validation
As in the Apollo missions, where outputs were cross-checked by multiple onboard computers, we should validate and synthesize AI responses across several models to ensure reliability and accuracy, and to expand our learning.
6. The Role of Expert QA and Human Oversight
High-quality AI Quality Assurance (QA) testing by experienced professionals is indispensable. Like expert mentors, these testers can identify where models break down and where heuristic reasoning must be reintroduced. By systematically analyzing AI’s failures and successes, especially in edge cases, we can guide the evolution of models toward greater adaptability and insight.
7. Mastering AI: The Imperative of Deep Dives
To truly leverage AI’s transformative potential, we must move beyond superficial use and engage in rigorous, expert-led deep dives into its strengths and weaknesses. While large language models (LLMs) excel at routine queries, their performance degrades sharply on complex, multi-step problems and nuanced edge cases. Mastery requires a strategic approach to testing, beginning with known solutions and mapping the boundaries of model capabilities. My own coding challenges, particularly the persistent failures with grade calculation edge cases, further underscore the critical need for ongoing human oversight and heuristic intervention.
This process is not merely about highlighting flaws; it is about applying the best of human analytical thinking to guide AI’s evolution. The notion that AI will soon replace all complex software engineering is overstated. Experienced engineers, with their intuitive grasp of edge cases, will remain essential. For our upcoming young generation, deep understanding of AI’s capabilities and limitations is not optional; it is foundational for shaping, rather than simply using, the technology.
8. Conclusion: Pushing AI’s Frontiers Together
By committing to expert-led, heuristic-driven deep dives, we not only reveal AI’s current frontiers but actively participate in pushing them forward. The future of AI hinges on our ability to seamlessly blend computational power with the art of reasoning, fostering systems that are not merely fast and accurate, but also genuinely insightful, creative, and resilient.
Hashtags: #AIResearch #Heuristics #LLM #BoswellTest #PeterLuh168
Acknowledgment:
I thank Walter Luh for alerting me to Apple’s latest AI research. I am grateful to Gemini 2.5 Flash, Grok 3, and Perplexity AI for helping to refine and sharpen my analysis.
References:
Zhao, W. et al. (2023). A Survey of Large Language Models, arxiv.org; May 2023; see also Large Language Models, en.wikipedia.org; June 2025.
Xu, L. et al. (2025). Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models; arxiv.org; Jan 2025.
Shojaee, P. et al. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity; machinelearning.apple.com; June 2025.
Luh, P. (2025). DeepSeek, Claude and 4 others' AI Review: How DeepSeek, Claude, and other AI's are solving tough Math problems? peterl168.substack.com; Jan 2025.
Luh, P. (2025). Heuristics In AI Chain-of-Reasoning? Global AI’s Math Problem Solving Review on DeepSeek, ChatGPT, le Chat, Qwen, peterl168.substack.com; Feb 2025.
Kubiya Games. (2025). Mastering the Tower of Hanoi: A Puzzle of Logic and Strategy, kubiyagames.com; Feb 2025; see also Tower of Hanoi, en.wikipedia.org; June 2025.
Provo AI. (2025). Expert Systems in AI: Pioneering Applications, Challenges, and Lasting Legacy, provoai.com; March 2025; see also Expert System, en.wikipedia.org; June 2025.
Hendrycks, D. et al. (2021). Measuring Massive Multitask Language Understanding, arxiv.org; Jan 2021.
Olson, P. (2016). Google’s Stunning Win For A.I.: AlphaGo Beats Human Champion In Third, Crucial Match, forbes.com; March 2016; see also AlphaGo versus Lee Sedol, en.wikipedia.org; May 2025.
Kurzweil, R. (2024). The Singularity Is Nearer: When We Merge with AI, penguinrandomhouse.com; Viking, June 2024; see also The Singularity Is Nearer, en.wikipedia.org; May 2025.
Yu, Y. et al. (2025). Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective, arxiv.org; Jan 2025.
Ohio State University. (2025). Ohio State launches bold AI fluency initiative to redefine learning and innovation, news.osu.edu; June 2025.
Haley, C. (2022). Explaining the 80-20 Rule with the Pareto Distribution, dlab.berkeley.edu, March 2022; see also Better Explained, Understanding the Pareto Principle (The 80/20) Rule, betterexplained.com; 2014.
Wilkin, G. (2025). Debugging: From Art to Science: A Case Study on a Debugging Course and Its Impact on Student Performance and Confidence, dl.acm.org; Feb 2025.
DSRQzKqSsFHwocOZzkPG, (2025). The Future of Programming: AI as Coding Partner, Not a Replacement, softstorm.world; April 2025.
Luh, P. (2025). The Boswell Test – the Next Defining AI Milestone: Details of my Boswell Test talk at ARIN 2025, peterl168.substack.com; March 2025; see also Luh and Wilhelm, Boswell Test: Measuring Chatbot Indispensability: An Intelligent Assessment of Global AI Policies, aircconline.com; March 2025.
Appendix: AI Coding a Python Letter-Grade Class Object
For my ARIN 2025 virtual presentation, I demonstrated how ten AI chatbots graded each other’s responses to a public-interest question on global AI policies, assigning letter grades from ‘A+’ to ‘C,’ with occasional composite grades like ‘A-/B+.’ Statistical analysis required calculating medians and interquartile ranges of these grades. I tasked several chatbots with creating a Python LetterGrade class with the following requirements:
Map standard letter grades to GPA digits: (A+, A, A-) = (4.25, 4.0, 3.75), (B+, B, B-) = (3.25, 3.0, 2.75), (C+, C, C-) = (2.25, 2.0, 1.75), (D+, D, D-, F) = (1.25, 1.0, 0.75, 0.0).
Map digits in [0, 4.25] to the nearest single grade (e.g., 4.0 → A) or a composite of two grades (e.g., 3.625 → A/B+, as (4.0 + 3.25)/2 = 3.625), listing the higher grade first in composites. For ties between composites (e.g., 3.5 as A/B = 3.5 or A-/B+ = 3.5), select the composite with the narrowest gap. Limit composites to two grades, separated by a slash.
Support arithmetic operations (+ and -) with numbers or LetterGrade instances, clamping results to the range [0, 4.25].
Implement dot products between a list of weights (floats) and a list of LetterGrade objects, returning a LetterGrade if in [0, 4.25], else a float.
For a list of LetterGrade objects, compute the numeric median and return the nearest single or composite grade, plus provide five L1 tuples for minimum, Q1, median, Q3, and maximum.
All chatbots handled the basic class structure well but faltered on edge cases, such as mapping a GPA of 3.4375 to ‘A-/B+’ or computing a dot product between [0.6, 0.3, 0.1] and [‘A-/B+’, ‘A-/B+’, ‘A+’]. A straightforward fix of extending the grade mapping to include three extrapolated GPAs (3.5, 2.5, 1.5 for ‘A-/B+’, ‘B-/C+’, ‘C-/D+’) was obvious to humans but elusive to models. This may reflect humans’ direct experience with grade assignments, absent in AI. The Python code for this class is available.
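For readers who want a concrete picture, the following is a minimal sketch of the mapping logic only, written for this paper rather than excerpted from the full LetterGrade class; the names GRADE_TO_GPA, gpa_to_grade, and dot_product are illustrative, and composite grades beyond the three extrapolated entries are omitted for brevity.

GRADE_TO_GPA = {
    "A+": 4.25, "A": 4.0, "A-": 3.75,
    "B+": 3.25, "B": 3.0, "B-": 2.75,
    "C+": 2.25, "C": 2.0, "C-": 1.75,
    "D+": 1.25, "D": 1.0, "D-": 0.75, "F": 0.0,
    # The human-derived fix: three extrapolated composite grades that fill
    # the gaps between the grade families.
    "A-/B+": 3.5, "B-/C+": 2.5, "C-/D+": 1.5,
}

def gpa_to_grade(value):
    """Clamp to [0, 4.25] and return the nearest grade in the (extended) map."""
    value = max(0.0, min(4.25, value))
    return min(GRADE_TO_GPA, key=lambda g: abs(GRADE_TO_GPA[g] - value))

def dot_product(weights, grades):
    """Weighted combination of letter grades, returned as a grade when in range."""
    total = sum(w * GRADE_TO_GPA[g] for w, g in zip(weights, grades))
    return gpa_to_grade(total) if 0.0 <= total <= 4.25 else total

# The two edge cases the chatbots missed, resolved by the extended map:
print(gpa_to_grade(3.4375))                                    # 'A-/B+'
print(dot_product([0.6, 0.3, 0.1], ["A-/B+", "A-/B+", "A+"]))  # 'A-/B+' (3.575 under this simplified map)

With the three extrapolated entries in place, a plain nearest-value lookup resolves both failing cases above, which is why the fix was straightforward for a human reviewer once the gap in the mapping was noticed.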