Heuristics in AI Chain-of-Reasoning?

Global AI's Math Problem Solving Review on DeepSeek, ChatGTP, le Chat, Qwen, ...

Feb 10, 2025

Introduction

We’re fortunate to be witnessing the AI revolution unfold before our eyes. AI has become remarkably adept at handling general intelligence in conversational settings (or chats), with some even claiming it could pass the Turing test. With AGI (Artificial General Intelligence) on the horizon, DeepSeek's claim of being capable of solving math and science problems at a Ph.D.-level intelligence with self-improvement capabilities is both staggering and intriguing. Building on my previous post, I delve into whether AI systems’ chain-of-reasoning employ heuristics when tackling math problems.

The emulation of heuristic or rule-of-thumb reasoning was a hallmark of early AI expert systems, as expertise often involves well-horned heuristic reasoning. Since the Alpha-Go’s triumph over human Go experts, there has been a shift towards emphasizing computational prowess in neural networks and reinforcement learning, often at the expense of heuristic reasoning. In this post, I explore whether heuristic reasoning still holds potential for advancing scientific discovery or addressing NP-complete (or computationally intractable) problems, particularly within specialized AI chatbots.

Methodology

I test with four benchmark math problems. My previous 2 problems were intentionally selected for their high level of difficulty to challenge the latest AI tools. Most AI systems struggled with these 2 problems, primarily due to a lack of heuristics in breaking down complex problems into more manageable, smaller parts. In this post, I expand my benchmark math problems to include 2 additional problems in logic and combinatorics to further test AI’s heuristic inference capabilities. I’ve also refined a query from my previous post to make it more precise and less open-ended.

Instead of providing a lengthy summary of each AI's responses as I did previously, I now annotate the heuristic reasoning used in each problem, which serves as a basis for evaluating AI chain-of-reasoning responses. Additionally, I’ve listed all my queries regarding these 4 problems in Appendix, encouraging readers to test these queries themselves and compare their findings with my ratings. Such testing is recommended since AI responses can vary depending on the time and the session of interaction.

The 8 AI free-tiers used are OpenAI ChatGPT 4o mini, free Perplexity AI, DeepSeek V3, Google Gemini 1.5 flash, x.AI Grok 2, Meta Llama 3.1, Le Chat by Mistral, and Alibaba Qwen 2.5. Note the name changes from Anthropic Claude 3 to free Perplexity AI based on version queries over a span of 10 days despite I used the same stored session URL in my queries. Any paid pro-versions might outperform their free counterparts. I invite comments on these paid-tier results. Again, with rapid pace of AI innovation, this review captures just a single moment in time.

Benchmark Math Problems

The 4 math problems for testing the 8 AI tools are:

1. Algebra: Given 2 equal-signed integers, a and b, if

\(\begin{equation} \frac {a^2 + b^2}{ab + 1} = k \end{equation}\)

is an integer, then k is also a perfect square (29th IMO, 1988)

2. Integration:

\(\int\frac{sin^4(x)\: dx}{sin^6(x) + 1}\)

3. Statistics: Given a sequence of length n such that its mean, median, and mode are all equal to b, and its minimum is a. Find its maximum possible c.

4. Combinatorics: Arrange 4 pairs of blocks, where each pair is labeled with a unique letter from ‘A’ to ‘D’, ensuring:

The 'A' blocks are separated by exactly one other block.
The 'B' blocks are separated by exactly two other blocks.
The ‘C' blocks are separated by exactly three other blocks.
The 'D' blocks are separated by exactly four other blocks. (Martin Gardner’s 1967 Mathematical Games).

Heuristics in Chain-of-Reasoning

For contextual clarity in discussing various AI responses, I graded the AI on their responses whether their chain-of-reasoning contained any mathematical heuristics (see Appendix for my breakdown on how I graded each problem). So if you'd like to try solving those problems yourselves first, feel free to skip to the next section.

1. 1988 IMO math problem 6

The key reasoning is to require a resulting discriminant, D = (k b)^2 - 4(b^2 - k), to be a perfect square. As previously discussed, a heuristic inference leap of b^2 - k = 0 ensures a = 0 and a = k b = b^3 with general integer solutions of

\((a, b) = (a, 0) = (0, b)= (b^3, b) = (a, a^3)\: \)

We shall omit any proof that no other integer (a, b) solutions exist for other k's that yield perfect-square D.

To ensure high-quality AI responses, which rely on well-crafted queries, I've added a third query that clarifies to seek integer solutions rather than a general proof in my previous post for the same problem.

All AI chain-of-reasoning correctly responded with the necessary condition for D to be a perfect square except for Llama, but all cannot produce any general solution except for Qwen. Some reasoning steps included special cases of (1, 1) or (2, 8) but then responded with "... too complex a proof to answer." Only Qwen answered correctly with a = b^3 and b = a^3 on its initial query. Grok on a repeat query correctly responded with a = b^3. Both Grok and Qwen when reusing the same chat session URL appear to be using past queries to improve their responses. le Chat responded with a specific case of (1, 7) and a rational k = 5/2.

2. Indefinite integral

The 3 heuristic reasonings that are used in evaluating the responses of this integration are:

(a) applying an effective substitution of u = tan(x);

(b) correctly factoring a 6th-degree into three 2nd-degree polynomials in the denominator of the substitution; and

(c) subsequently calculating correct coefficients of partial fraction decomposition so resulting three integrals can be straightforwardly integrated.

All began with an inferior substitution of u = sin(x) or sin^2(x). Only DeepSeek used the tan(x) substitution during its chain-of-reasoning steps with a second substitution attempt, but then unfortunately kept making algebraic errors and producing wrong solutions. None could integrate correctly. Some just gave up, whereas those that continued made algebraic errors either during substitutions, factorizations, or partial fraction expansions. Even after I provided the correct denominator in a query to help chat sessions to find partial fraction expansion coefficients, calculated coefficients were wrong. DeepSeek perhaps due to its high usage unfortunately balked to continue with "server is busy. Please try again later." Its paid version might respond better.

3. Statistics of sequence

All the tools knew the correct statistical definitions of mean, median and mode. The heuristic reasoning thus should start with a constraint that more than half of the sequence must be b in order to satisfy the condition that both the median and the mode are the same. Since the sum of the remaining sequences must be (n - [(n+1)/2]) b to ensure the mean remains b, the next key reasoning is to deduce that a maximum is possible only if all but one remaining element are minimum. Surprisingly, most AI systems failed this second heuristic reasoning.

Some incorrectly assumed one element each for the minimum and the maximum. ChatGTP, Grok, and Qwen responded correctly. Some failed to differentiate between odd and even n. When queried on special cases of n = 12, 13, 14, 15, 16 and (a, b) = (3, 4), some failed to respond correctly with c = 9, 9, 10, 10, 11, let alone the general formula

\(c = \left[\frac{n+1}{2}\right] b - \left[\frac{n}{2} - 1\right] a\)

where [ . ] is the integer floor function.

4. Combinatorics

This is one of famous Martin Gardner's problems based on Langford pairing for number of blocks equal to 2 n = 8. Not all n’s have solutions. Since the number of all possible arrangements, which is n!/2^(n/2), grows exponentially, heuristic proof for general n is complicated with multiple different arrangements. I only checked to see if any AI tools could either find one satisfactory arrangement or invoke the Langford Pairings for n = 4, the latter of which ChatGPT, DeepSeek, and Qwen knew and helped boosting their respective rating scores.

Discussion and Conclusion

The final scores, tallied in the Appendix, are Qwen 2.5, ChatGPT 4o, DeepSeek V3, Grok 2, Google Gemini 1.5, Perplexity AI, Meta Llama 3.1, and le Chat by Mistral in descending order with the highest scoring 50/100, a failing grade in any class. This review shows, however, that in solving math problems, AI chat technology in China may have surpassed or at least on par with US AI chats. All except DeepSeek claims themselves a proprietary or close-source model.

The AI chain-of-reasoning predominantly leverages computational power over heuristic reasoning when solving problems. For example, none of the AI systems seem capable of factoring simple quadratic or cubic polynomials heuristically by recognizing that adding or subtracting a similar term can readily transform these into well-known factorable forms of the differences of squares or cubes. Instead, those that successfully factored a cubic equation employed root searching, which, while correct, indicates lack of heuristic understanding. This deficiency in heuristic reasoning might be a sign of failure in passing the Turing test within scientific context.

Moreover, AI tools frequently do not verify their answers, similar to how they might produce ‘hallucination’ responses in general queries. They can also commit significant algebraic errors. Despite their knowledge of statistical definitions in problem 3, AI tools appear to lack the heuristic reasoning necessary to simultaneously consider multiple conditions to reach a coherent logical conclusion.

However, these cutting-edge chatbots are also on the verge of solving complex math problems, although currently reliance on software like Mathematica is still common for such tasks. A notable downside is that AI systems may undermine students’ great learning experience in doing traditional homework assignments themselves in honors high-school or university math and science courses, somewhat diminishing their educational value.

Hashtags: #PeterL168 #AIResearch #BoswellTest

References:

AlphaGo versus Lee Sedol, Wikiedia, January 2025.
Aschenbrenner, L., From AGI to Superintelligence: the Intelligence Explosion, Situational Awareness, June 2024.
Edwards, D., The Turing Test: Origins, significance, and controversies, Robotics & Automation News, December 2024.
Gardner, M., Mathematical Games, Scientific American, November and December 1967.
Gibney, E., Scientists flock to DeepSeek: how they’re using the blockbuster AI model, Nature, January 2025.
Gigerenzer G. and Gaissmaier W., Heuristic Decision Making, Annual Review of Psychology. 62: 451–482, 2011.
Heuristics for Problem Solvers, mathstunners.org, 1980.
Krishnan, N., Model Distillation, towardsai.net, January 2025.
Langford Pairing, Wikipedia, November 2024.
Leffer, L., In the Race to Artificial General Intelligence, Where’s the Finish Line?, Scientific America, June 2024.
Luh, P., DeepSeek, Claude, and 4 others’ AI Review: How DeepSeek, Claude and other AI’s are solving Math Problems (January 2025), Substack, January 2025.
Miller, J., Langford's Problem, Remixed, Gathering 4 Gardner, March 2014.
Pal, S., Langford Pairing, susam.net, September 2011.
Yu, Y., et al., Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective, arxiv.org, January 2025.

Appendix: Query Questions

Here are a few caveats I found using AI tools. Although session recall or memoization can be important in AI self improvement, each new chat session is a fresh page to ensure user privacy, so at present there is no memoization. Whenever possible, I recalled the previous chat URL to maintain the same session. Both Grok and Qwen seemed to be able to rehash old queries to improve their responses on past queries, where Grok improved its responses in problem 3 on subsequent queries in the same session URL. DeepSeek clears chat session URL after a time out although it can be recalled after login via left-hand side history tab. Both le Chat and Llama provide no chat session URL. Not all allow print output or download.

0. AI tools’ correct name and version:

How should I uniquely identify you with official name and version number?
Why not identify yourself with specific version like DeepSeek V3 or chatGPT 4o?

It’s important to periodically query the tool’s name because they change even within the same session tabs. DeepSeek R1 (or V1.5) of my last post is now DeepSeek V3. Meta LLama 3 is now LLama 3.1.

Some AI tools are extremely hesitant to reveal their true identities, recanting their earlier versions and names. The previously identified Anthropic Claude 3 and le Chat 7B (after much initial prodding) now prefer to be called free Perplexity AI and le Chat by Mistral without any version associations. I think their version disavowals may be to obscure use of other AI sources, such as older ChatGPT 3.5, distillations, in their responses, or perhaps the companies’ natural reaction to stiff global AI innovations.

1. 1988 IMO math problem 6: (25%)

Prove (a*a + b*b)/(a*b + 1) can be perfect square (this query is an incomplete and too general).
If a, b, and k = (a^2+b^2)/(a*b+1) are all integers, then k is a perfect square (IMO problem statement).
Find integer pairs (a, b) such that if k = (a*a + b*b)/(a*b + 1) is both an integer and a perfect square (specialized restatement of query 2).

20% for any (b^3, b) and symmetric solutions. Additional 5% for (0, b) symmetric solution. 5% for identifying perfect-square Discriminant or correct Vieta jumping relation. 5% for providing computer search code or specific solutions in chain-of-reasoning steps.

ChatGPT 4o mini 5, Perplexity AI 10, DeepSeek V3 10, Gemini 1.5 5, Grok 2 10, le Chat 5, Llama 3.1 5, Qwen 2.5 25

2. Indefinite integral: (25%)

Integrate sin^4(𝑥)/(𝑠𝑖𝑛^6(𝑥)+1).
Try 𝑢 = tan(𝑥) substitution.
Try factoring 2𝑢^6+3𝑢^4+3𝑢^2+1 in the denominator.
Find coefficients of Partial Fraction Decomposition.

8% for correct substitution result; 4% for the tan(x) substitution but with algebraic error. 8% for factoring into 3 quadratic terms; 4% if only factoring in 2 terms. 8% for correctly finding all partial fraction expansion coefficients.

ChatGPT 4o mini 5, Perplexity AI 4, DeepSeek V3 0, Gemini 1.5 0, Grok 2 0, le Chat 0, Llama 3.1 0, Qwen 2.5 12

3. Statistics of sequence: (25%)

For a sequence of length n with its mean, median, and mode all equal to 𝑏 and its minimum 𝑎, what is the sequence's maximum possible c?
Why limit to just one element for a? Doesn't this decrease largest possible c?
For (𝑎, 𝑏) = (3, 4), what are c for 𝑛 = 12, 13, 14, 15, 16?
You should have gotten c = 9, 9, 10, 10, 11. Your formula does not work for odd n.
Write a general formula of c for both even and odd 𝑛.

13% for correct formula for even n, and 12% for odd n.

ChatGPT 4o mini 0, Perplexity AI 0, DeepSeek V3 0, Gemini 1.5 0, Grok 2 25, le Chat 0, Llama 3.1 0, Qwen 2.5 13

4. Combinatorics: (25%)

Arrange 4 pairs of blocks, each pair’s block labeled with a distinct number from A to B, such that:
- The pair of 'A' blocks are separated by exactly one other block.
- The pair of 'B' blocks are separated by exactly two other blocks.
- The pair of 'C' blocks are separated by exactly three other blocks.
- The pair of 'D' blocks are separated by exactly four other blocks.
Try Langford pairing for n = 4.

25% for a correct arrangement. 6% after the query, “try Langford pairing for n = 4.”

ChatGPT 4o mini 25, Perplexity AI 6, DeepSeek V3 25, Gemini 1.5 25, Grok 2 0, le Chat 0, Llama 3.1 0, Qwen 2.5 0

5. Total scores:

ChatGPT 4o mini 35%

Perplexity AI free 20%

DeepSeek V3 35%

Gemini 1.5 flash 30%

Grok 2 35%

le Chat by Mistral 5%

Llama 3.1 5%

Qwen 2.5 50%

Above are the total scores for the 8 AI tools tested at this writing.

Failed to render LaTeX expression — no expression found

Peter’s Substack

Discussion about this post