Apple researchers find ‘major’ flaws in AI reasoning models ahead of WWDC 2025

A newly published Apple Machine Learning Research study has challenged the prevailing idea that large language models (LLMs) like OpenAI's o1 and Claude's "thinking" variants truly possess reasoning capabilities, and it points to fundamental limitations in these AI systems. Rather than relying on standard math benchmarks, which are susceptible to data contamination, the Apple researchers designed controllable puzzle environments such as the Tower of Hanoi and the River Crossing. According to the researchers, these custom environments allowed a precise analysis of both the final answers produced by the LLMs and their internal reasoning traces across different complexity levels.
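The Tower of Hanoi illustrates why such environments are "controllable": the optimal solution is known in closed form, so an evaluator can replay a model's proposed moves, verify them exactly, and raise difficulty simply by adding disks. Below is a minimal Python sketch of that idea; the move format and function names are illustrative assumptions, not Apple's actual evaluation harness.

```python
# Minimal sketch of a controllable Tower of Hanoi evaluator.
# The move format (disk, from_peg, to_peg) and the function names are
# illustrative assumptions, not Apple's actual evaluation code.

def optimal_moves(n, src=0, aux=1, dst=2):
    """Classic recursive solution: moving n disks takes 2**n - 1 moves."""
    if n == 0:
        return []
    return (optimal_moves(n - 1, src, dst, aux)
            + [(n, src, dst)]
            + optimal_moves(n - 1, aux, src, dst))

def is_valid_solution(n, moves):
    """Replay a model's proposed moves, checking legality and the goal state."""
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 holds disks n..1, largest at bottom
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # disk is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # cannot place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))  # all disks moved to the target peg

# Complexity is dialed up simply by adding disks:
for n in range(3, 9):
    moves = optimal_moves(n)
    assert is_valid_solution(n, moves)
    print(f"{n} disks -> {len(moves)} moves")  # 7, 15, 31, ... (2**n - 1)
```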

What Apple researchers found in this study

According to a report by MacRumors, the reasoning models tested by Apple's research team, including o3-mini, DeepSeek-R1, and Claude 3.7 Sonnet, saw their accuracy collapse entirely once problem complexity crossed certain thresholds: success rates dropped to zero even though the models had sufficient computational resources. Surprisingly, as problems became harder, the models reduced their reasoning effort, which points to fundamental scaling limitations rather than a lack of resources.
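The "collapse threshold" described above is straightforward to operationalize: score repeated attempts at each complexity level and find the first level where accuracy reaches zero. A hedged sketch follows; the per-level pass/fail numbers are synthetic, invented purely to show the computation, and do not come from the study.

```python
# Hedged sketch: locating an accuracy-collapse threshold from per-level results.
# `results` maps a complexity level (e.g., disk count) to pass/fail outcomes.
# The data below is synthetic, purely to illustrate the computation.

results = {
    3: [True] * 10,
    5: [True] * 9 + [False],
    7: [True] * 4 + [False] * 6,
    9: [False] * 10,            # accuracy collapses to zero here
    11: [False] * 10,
}

def collapse_threshold(results):
    """Return the smallest complexity level at which accuracy is exactly zero."""
    for level in sorted(results):
        accuracy = sum(results[level]) / len(results[level])
        print(f"complexity {level}: accuracy {accuracy:.0%}")
        if accuracy == 0.0:
            return level
    return None

print("collapse at complexity:", collapse_threshold(results))
```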

Even more revealing, the models still failed at the same complexity points even when researchers provided complete solution algorithms. This indicates that the limitation lies in executing basic logical steps, not in choosing the right problem-solving strategy. The models also showed puzzling inconsistencies: they were able to solve problems requiring over 100 moves but failed on simpler puzzles that needed only 11.

The study identified three performance patterns. Standard models unexpectedly performed better than reasoning models on low-complexity problems, reasoning models had an advantage at medium complexity, and both types failed at high complexity. The researchers also discovered that models exhibited inefficient "overthinking" patterns, often finding correct solutions early but wasting computational effort exploring incorrect alternatives; one simple way to quantify this is sketched below.

The key takeaway is that current "reasoning" models rely heavily on advanced pattern matching, not true reasoning. These models do not scale their reasoning the way humans do: they tend to overthink easy problems and think less when faced with harder ones.
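The "overthinking" observation suggests a simple trace-level measurement: find where the first correct answer appears in a reasoning trace and see how much of the trace comes after it. A minimal sketch under stated assumptions, with a toy trace invented purely for illustration; this is not the paper's actual trace-analysis code.

```python
# Hedged sketch: quantifying "overthinking" in a reasoning trace.
# Assumes the trace is plain text and the known-correct answer can be
# matched as a substring; real traces would need more careful parsing.

def wasted_fraction(trace, correct_answer):
    """Fraction of the trace generated after the first correct answer appears."""
    pos = trace.find(correct_answer)
    if pos == -1:
        return None  # the model never stated the correct answer
    first_hit_end = pos + len(correct_answer)
    return 1.0 - first_hit_end / len(trace)

# Toy trace, invented purely for illustration:
trace = (
    "Try moving disk 1 to peg C first... the sequence 1->3, 1->2, 3->2 works. "
    "But wait, let me reconsider peg B instead... no, that fails. "
    "Re-checking other orderings just in case..."
)
print(wasted_fraction(trace, "1->3, 1->2, 3->2"))  # share of effort spent after the answer
```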

It is worth noting that this research surfaced just days before WWDC 2025. According to Bloomberg, Apple is expected to focus on new software designs rather than headline-grabbing AI features at this year's event.