DeepSeek, Claude, and 4 Others: An AI Review
How are DeepSeek, Claude, and other AIs solving tough math problems? (Jan 2025)
Introduction
With a plethora of AI tools freely available for public consumption and testing, the focus often lies on general queries and searches. Since an ultimate AGI (Artificial General Intelligence) application is to reason and advance new scientific knowledge, I focus here on how 6 well-known AI tools, ChatGPT 4o mini, Anthropic Claude 3, DeepSeek R1, Google Gemini 1.5 Flash, xAI Grok 2, and Meta AI Llama 3.0, are performing in advanced mathematical reasoning. In this post I used only the freely available public tiers with minimal sign-in. With rapid AI innovations in both software and hardware capabilities, this review represents only one point in time and may easily differ in the future.
Two non-trivial math problems
The following 2 problems are my benchmark for testing how well the 6 AI tools can solve non-trivial math problems:
1. Given 2 equal-signed integers a and b, if
\( \frac{a^2 + b^2}{ab + 1} = k \)
is an integer, then k is a perfect square (29th IMO, 1988).
2. Integrate
\( \int \frac{\sin^4 x}{1 + \sin^6 x}\, dx \)
For contextual clarity in discussing the various AI responses, I present the solutions to the 2 problems before my reviews. So if you'd like to try solving these 2 enticing problems yourself first, you may want to suspend further scrolling for now.
An algebraic solution to the 1988 IMO math problem 6
The first problem is the well-known 1988 IMO Problem 6 in number theory. Many have provided solutions (see Notes 2, 8, and 9 in Vieta jumping). Since I prefer clear and concise solutions, I struggled for a couple of days to find a simple proof involving only algebra.
The proof begins with
\( a^2 + b^2 = k(ab + 1). \)
Now substitute z = a/b and find z(b) such that
\( b^2 z^2 - k b^2 z + (b^2 - k) = 0. \)
The discriminant of the resulting quadratic equation in z is
\( D = k^2 b^4 - 4b^2(b^2 - k), \)
where
\( z = \frac{k b^2 \pm \sqrt{D}}{2b^2}. \)
By inspection, k = b^2 yields D = k^4 (so \( \sqrt{D} = k^2 \)), giving z = 0 and b^2, i.e., a = 0 or a = b^3, satisfying the desired condition. Of course, were D not a perfect square, z would be a fraction with an irrational numerator, violating the condition that a and b are integers.
Thus, the general solution, by symmetry, is
\( (a, b) = (n, n^3), \quad k = n^2, \)
and
\( (a, b) = (n^3, n), \quad k = n^2, \)
for any nonzero integer n.
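As a sanity check on the claim and on the (n, n^3) family above, here is a short Python sketch of my own (not any AI's output) that brute-forces small positive pairs and verifies that every integer value of k it encounters is a perfect square:

```python
import math

# Brute-force check of the IMO 1988 Problem 6 claim for small positive pairs:
# whenever (a^2 + b^2) / (ab + 1) is an integer, it should be a perfect square.
for a in range(1, 200):
    for b in range(a, 200):  # a <= b suffices by symmetry
        num, den = a * a + b * b, a * b + 1
        if num % den == 0:
            k = num // den
            r = math.isqrt(k)
            assert r * r == k, f"counterexample: a={a}, b={b}, k={k}"
            print(f"a={a:3d}, b={b:3d}, k={k} = {r}^2")
```

Within this range the assertion never fires, and the printed pairs include (1, 1), (2, 8), and (3, 27) from the (n, n^3) family, as well as pairs such as (8, 30) that arise from the longer Vieta-jumping chains discussed in the Wikipedia reference.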
The proof above rests on a heuristic leap, setting k = b^2 in the discriminant D, to produce the desired result. All 6 AI tools currently fail at this kind of rule-of-thumb inference, a technique often embedded in the expert systems of the first AI generation of the late 1970s and 1980s; see, for example, Building Expert Systems.
How AI is solving the IMO problem 6
a. Prove (a^2 + b^2)/(a*b + 1) is a perfect square:
I began with an incomplete problem statement: "prove (a^2 + b^2)/(a*b + 1) is a perfect square." ChatGPT, Claude, DeepSeek, and Grok answered better than Gemini and Llama.
ChatGPT found the special case k = 1 and concluded only that the discriminant D had to be a perfect square, then appealed to solving Pell-like equations.
Claude found k = b^2 correctly right away but missed the trivial case in which one of a and b is 0.
DeepSeek provided the partially correct answer (b, b^3) but missed the symmetric solutions, even though it correctly identified the symmetry and produced the special cases 0 and 1 during its reasoning.
Grok responded with the ambiguous k = [(a - b)/(a + b)]^2 as a solution. Upon questioning why, Grok gave unsatisfying circular explanations. When asked to use the substitution X = a^2 + b^2 and Y = a b, Grok's response contained z = a/b but again concluded with the perfect square (z^2 + 1)/(z + 1/b^2) = [(z - 1)/(z + 1)]^2, similar to its initial ambiguous response.
Gemini simply punted, stating “… general solution is too complex …[but] k is likely to be perfect square.”
Llama, unfortunately, responded with expressions containing algebraic errors; when prompted for corrections, it continued to produce erroneous proofs without deviating from its predetermined conclusion with an erroneous right-hand side, such as "Therefore, we have proved that (a^2 + b^2)/(a b + 1) is indeed a perfect square: (a^2 + b^2)/(a b + 1) = [((a - b) + sqrt(a b + 1))/(sqrt(2) (a b + 1))]^2."
b. Querying with the exact statement of problem 1
When queried with the complete problem statement above, all 6 failed to provide complete solutions. ChatGPT, DeepSeek, and Grok gave more verbose chain-of-thought responses than Gemini and Llama. Verbose chain-of-thought reasoning shows new or alternate approaches once one approach becomes inconclusive or too complex to pursue. Claude surprisingly failed to reproduce the valid answer it had given for the incomplete problem statement. It's as if the more exacting the problem statement, the more limited Claude's inference engine becomes.
ChatGPT responded with "structure of the discriminant D ... enforces that k must have the form n^2 to satisfy integer solutions for a and b."
Claude responded with the hallucinated conclusion "... k = m^2 - 4. Since m is an integer, we conclude that k is a perfect square."
DeepSeek did find the partial answers 0 and +/- 1 during its reasoning and invoked Vieta jumping to confirm them, but it provided none of the general solutions seen earlier.
Grok, like Claude, surprisingly responded worse than it had to the simple but incomplete problem statement, concluding with the erroneous assumption that a and b are coprime.
Gemini responded almost verbatim as it had to the earlier simple, incomplete problem statement.
Llama again responded with a factorization and somehow found k = 1, concluding that this proved k is a perfect square.
c. Discussion on AI inferences in solving problem 1
Querying with the same questions on my MacBook versus the iPhone apps produced different answers, with the latter giving brief, less comprehensive reasoning. For any non-trivial problem, I urge using a desktop computer, not a phone app, so that you get verbose chain-of-thought responses that can be queried further for clarifications or given added hints.
All tried to conclude as if the problem statement were correct, akin to the hallucinated responses often found in general queries or searches. As for name-dropping, ChatGPT invoked advanced Pell equations; DeepSeek likewise invoked Diophantine equations and Vieta jumping. This seems related to the query statement being more specific and precise, such as the constraint of "2 equal-signed integers a and b," which might have thrown off the AI inference engines.
Another concern was whether responses would be the same when the same question was posed in different sessions. Querying on different days and in different sessions with the same query statement produced variations in the answers, though more testing may be needed.
The responses were generally limited to viewing only. All except Llama displayed a session URL that can be saved for later recall, although it's unclear how long each URL remains valid. Despite lacking a session URL, Llama does allow full output printing, which the others lack; so did Claude, but copying and pasting any complicated equation into a text file produced unintelligible results. Printing from the others generally yielded either a single page or output with corrupted equations. Gemini and Llama, to their credit, allow saving their query responses via copy-and-paste into a document file.
My initial impression of the 6 tools on this IMO problem is that Claude's and DeepSeek's responses are far better than the other four's. ChatGPT and Grok both seem better than Gemini and Llama. The verbose chain-of-thought responses from ChatGPT, DeepSeek, and Grok also allow further detailed probing with hints such as "Try proving it by using a variable substitution of z = a/b" or "What do you think of the Socratic proof?" to help guide them toward meaningful responses.
Solution of the definite integral
I came across this definite integral only in passing while surfing the web for Landau theoretical-minimum math problems a few years ago. Since I can no longer find a link associated with the problem, I can't vouch for where it really came from. But it's an exceptionally good problem for any calculus student to tackle, since integrating it requires good algebraic insight in variable substitution, factorization, and partial fraction expansion. These are not complex but straightforward algebraic steps and can be fun to work through. I've included one "how to integrate" reference in the References.
By transforming the sin(x) into tan(x), we can readily substitute u = tan(x):
\( \int \frac{\sin^4 x}{1 + \sin^6 x}\, dx = \int \frac{u^4}{2u^6 + 3u^4 + 3u^2 + 1}\, du. \)
The next simplification is factoring the denominator and expanding in partial fractions:
\( \frac{u^4}{(2u^2 + 1)(u^2 - u + 1)(u^2 + u + 1)} = \frac{1}{3(2u^2 + 1)} + \frac{1}{6} \cdot \frac{2u - 1}{u^2 - u + 1} - \frac{1}{6} \cdot \frac{2u + 1}{u^2 + u + 1}. \)
Now the 3 integrals can be integrated straightforwardly:
\( \int \frac{u^4\, du}{2u^6 + 3u^4 + 3u^2 + 1} = \frac{1}{3\sqrt{2}} \arctan\left(\sqrt{2}\, u\right) + \frac{1}{6} \ln \frac{u^2 - u + 1}{u^2 + u + 1} + C, \quad u = \tan(x). \)
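For readers who want to double-check the algebra, here is a small SymPy sketch of my own (not part of the original solution) confirming the factorization, the partial fraction expansion, and the antiderivative above:

```python
import sympy as sp

u = sp.symbols('u', real=True)
integrand = u**4 / (2*u**6 + 3*u**4 + 3*u**2 + 1)

# The denominator factors over the rationals as (2u^2+1)(u^2-u+1)(u^2+u+1)
print(sp.factor(2*u**6 + 3*u**4 + 3*u**2 + 1))

# The partial fraction expansion should match the three terms above
print(sp.apart(integrand))

# Differentiating the proposed antiderivative recovers the integrand
F = (sp.atan(sp.sqrt(2)*u) / (3*sp.sqrt(2))
     + sp.log((u**2 - u + 1) / (u**2 + u + 1)) / 6)
print(sp.simplify(sp.diff(F, u) - integrand))  # expect 0
```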
Can AI crack this algebraically intensive integration?
Here's how the 6 AI tools fared on the integration. Only DeepSeek partially succeeded; the others either failed completely or made algebraic errors. The main deficiency lies in their inability to factor complicated polynomials:
ChatGPT used an inferior u = sin(x) substitution and couldn't continue. Upon my suggesting "try u = tan(x) substitution," it couldn't factor the resulting fraction and concluded that it "... cannot be solved in elementary terms. However, it can be evaluated numerically or expressed in terms of special functions such as elliptic integrals or hypergeometric functions."
Claude used a subpar u = sin^2(x) substitution and then made an algebraic error in the resulting expression. After I pointed out multiple algebraic errors across repeated queries for corrections, I found Claude simply can't factor complicated polynomials. It responded that "... correct factorization of the denominator 1 + 3u^2 + 3u^4 + 2u^6 is not straightforward and I incorrectly assumed a simpler form." To its credit, Claude then deferred to "... using advanced mathematical software or consulting with an expert in complex integration techniques for an accurate solution to this problem."
DeepSeek used the correct u = tan(x) substitution, then tried the further substitution t = u^2, but failed to continue. Upon my prompting to factor the denominator of the first substitution, DeepSeek produced
\( \frac{1}{3\sqrt{2}} \arctan(\sqrt{2} \tan(x)) - \frac{1}{6} \arctan\left(\frac{2\tan^2(x) - 1}{\sqrt{3}}\right) + C. \)
The first term is correct. The lengthy second term is incorrect because DeepSeek also made an algebraic error in the numerator coefficients of its partial fraction expansion; instead of
\( \frac{u^2 - 1}{3(u^4 + u^2 + 1)}, \)
it produced the erroneous numerator in
\( \frac{2 - u^2}{3(u^4 + u^2 + 1)}. \)
Grok used the inferior u = sin^2(x) substitution and then concluded with "... not having a simple closed form that can be derived from standard trigonometric identities or substitutions without extensive manipulation or possibly using special functions or elliptic integrals." Upon my suggesting the u = tan(x) substitution, Grok made an algebraic error after substituting. Upon being given the correct t^4/(2 t^6 + 3 t^4 + 3 t^2 + 1) to retry, it couldn't factor the denominator "... nicely over the reals." Like Claude and DeepSeek, Grok is not good at factoring. I suggested that Grok use Mathematica, to which Grok responded with incorrect Mathematica input containing an erroneous fraction from an earlier response. Grok might not be able to query Mathematica directly.
Gemini used the correct u = tan(x) substitution but gave up on further simplification. Upon my suggesting "Why not factoring the denominator in Step 4?", it agreed yet quickly concluded that "While factoring the denominator is a valuable approach for simplifying integrals, it might not always be feasible, especially for higher-order polynomials."
Llama immediately factored the original fraction in terms of sin^2(x) and then performed the partial fraction expansion incorrectly, producing a wrong answer. Upon querying again a day later, a different erroneous result came back, again due to algebraic errors.
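To make DeepSeek's error above concrete, here is one more SymPy sketch of mine (assuming the integrand reconstruction given earlier) that differentiates both DeepSeek's answer and the antiderivative from the solution section and compares each against the integrand at a sample point:

```python
import sympy as sp

x = sp.symbols('x', real=True)
integrand = sp.sin(x)**4 / (1 + sp.sin(x)**6)
u = sp.tan(x)

# DeepSeek's proposed antiderivative (second term erroneous, per the review)
deepseek = (sp.atan(sp.sqrt(2)*u) / (3*sp.sqrt(2))
            - sp.atan((2*u**2 - 1) / sp.sqrt(3)) / 6)

# The antiderivative derived in the solution section
correct = (sp.atan(sp.sqrt(2)*u) / (3*sp.sqrt(2))
           + sp.log((u**2 - u + 1) / (u**2 + u + 1)) / 6)

for name, F in [("DeepSeek", deepseek), ("correct", correct)]:
    residual = (sp.diff(F, x) - integrand).subs(x, sp.Rational(7, 10))
    print(f"{name}: F'(x) - integrand at x = 7/10: {residual.evalf()}")
```

The residual for the derived antiderivative evaluates to zero (up to floating-point noise), while DeepSeek's does not, consistent with the erroneous partial-fraction numerator noted above.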
Conclusion
One of my motivations for examining how AI currently solves non-trivial problems is that many students now use AI to help with their coursework. It's obvious that most current AIs cannot yet compete with Mathematica in solving advanced non-trivial math problems. The most egregious issue, however, is that the AIs often commit straightforward algebraic errors in their responses without a hitch. I don't know whether the errors come from scraping bad websites or from an embedded hallucination tendency to never deviate from some predetermined final positive response they are locked onto. In general, the responses also include qualifying excuses that appeal to more advanced techniques or functions. For any tough math or science problem, I think an AI should transparently respond that its inference engine is currently deficient and remains to be improved.
Another concern is that asking the same query on different days or in different web sessions yields different responses. For math problem queries, the resulting answers should be more or less the same, except possibly in different equivalent forms. So it's good to see that each response is at least based on its own internal reasoning steps. DeepSeek, at least on these 2 problems, shows reasoning steps more advanced than the other 5. I don't know whether this might be attributed to more rigorous math training by DeepSeek's engineers or to a better inference engine. Among the responses, Anthropic's Claude provided the most succinct answers, including references, that are relatively easy to follow, whereas DeepSeek's verbose chain-of-thought answers are useful yet seem unnecessarily lengthy.
As is well known, the AI tools, at least on my free-tier usage, all lack persistent memory across sessions. This is unfortunate, since it suggests they cannot learn from their reasoning mistakes and improve. A session URL can be saved and recalled, though sign-in is usually required for a different user, if the URL is still valid. Continued querying within a recalled session seems possible, though I'm not fully certain. Given this memory deficiency, AGI is still far from here.
Another caution is that AI queries are themselves part of the data mining used in AI testing. The current AI challenge lies in training and inference, both of which need to advance in tandem. Due to the sudden and explosive interest in DeepSeek, I'm adding another comparison study between DeepSeek and ChatGPT to the References.
Hashtags: #PeterL168 #AIResearch #BoswellTest
References
AoPS Forum – One of my favourites problems, yeah!, artofproblemsolving.com, retrieved 2023.
Arthur Engel, Problem-Solving Strategies, Problem Books in Mathematics, Springer, p. 127, 1998.
Brown, K. S., "N = (x^2 + y^2)/(1+xy) is a Square", MathPages.com, retrieved 2016.
Caswell, Amanda, "I tested ChatGPT vs DeepSeek with 7 prompts — here's the surprising winner", Tom's Guide, January 2025.
Expert system, Wikipedia, December 2024.
How to integrate, math.stackexchange.com, December 2019.
Jones, D. D., and J. R. Barrett, "Building Expert Systems", in J. R. Barrett and D. D. Jones (eds.), Knowledge Engineering in Agriculture, ASAE Monograph No. 8, ASAE, 1989.
Wolfram Mathematica, January 2025.
Vieta jumping, Wikipedia, February 2024.
ChatGPT session, January 2025.
Claude session, January 2025.
DeepSeek session 1, January 2025.
DeepSeek session 2, January 2025.
Gemini session, January 2025.
Grok session, January 2025.