Contrary to popular belief, pioneering computer scientist Alan Turing did not claim that his famous ‘Imitation Game’ test proved machine intelligence. What he actually said was that a suitably programmed computer might be able to carry on a remote ‘conversation’ that seemed indistinguishable from conversing with a human being.
A few decades later, philosopher John Searle’s ‘Minds, Brains, and Programs’ paper, better known as the ‘Chinese Room’ argument, elegantly debunked claims that this was the same as machine intelligence. Using the thought experiment of a non-Chinese speaker locked in a room, equipped only with a table telling them which Chinese characters to send out in response to the characters passed in, Searle showed that a programmed machine could appear to communicate flawlessly in Chinese without ever actually understanding a word of it.
This is where we’re at with so-called ‘AI’ models today. Models such as ChatGPT can imitate human conversation – but imitation is all it is. Like the person in the Chinese Room, all they’re doing is scanning for patterns and responding as they’re programmed to.
In early June, Apple researchers released a study suggesting that simulated reasoning (SR) models, such as OpenAI’s o1 and o3, DeepSeek-R1, and Claude 3.7 Sonnet Thinking, produce outputs consistent with pattern-matching from training data when faced with novel problems requiring systematic thinking. The findings echoed an April study that tested the same models on problems from the United States of America Mathematical Olympiad (USAMO), where they achieved low scores on novel mathematical proofs.
The new study, titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” comes from a team at Apple led by Parshin Shojaee and Iman Mirzadeh, and it includes contributions from Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar.
To test so-called ‘large reasoning models’, which attempt to simulate a logical reasoning process, the researchers pitted them against four classic logic puzzles:
Tower of Hanoi (moving disks between pegs), checkers jumping (eliminating pieces), river crossing (transporting items with constraints), and blocks world (stacking blocks) – scaling them from trivially easy (like one-disk Hanoi) to extremely complex (20-disk Hanoi requiring over a million moves).
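To get a concrete sense of how the hardest of these puzzles scales, here is a minimal Python sketch – mine, not taken from the Apple paper – of the textbook recursive Tower of Hanoi solution. Solving n disks takes exactly 2^n - 1 moves, which is why the 20-disk instance needs over a million.

```python
# A minimal illustrative sketch (not from the Apple study): the textbook recursive
# Tower of Hanoi solution. Solving n disks takes exactly 2**n - 1 moves, so the
# 20-disk instance at the hard end of the benchmark needs 1,048,575 moves.

def hanoi(n, source, target, spare, moves):
    """Append to `moves` the sequence that transfers n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # park the n-1 smaller disks on the spare peg
    moves.append((source, target))              # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)  # stack the smaller disks back on top

for n in (1, 3, 10, 20):
    moves = []
    hanoi(n, "A", "C", "B", moves)
    assert len(moves) == 2**n - 1
    print(f"{n:>2} disks: {len(moves):>9,} moves")

# Output:
#  1 disks:         1 moves
#  3 disks:         7 moves
# 10 disks:     1,023 moves
# 20 disks: 1,048,575 moves
```

Even with this short recursion freely available online, the benchmark question is whether a model can generate the full, correct move sequence itself, rather than merely recalling fragments like this from its training data.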
As the researchers point out, the enthusiasts of ‘machine intelligence’ have been focusing solely on ‘final answer accuracy’. If the machine produced the right answer, they assumed the machine reasoned it out. But what if it was simply pattern-matching to maths or coding problems that were already in its training data? As anyone who’s done high school maths knows, ‘showing your reasoning’ is as important as getting to the right answer.
The results suggest that, given problems outside their training data, these ‘reasoning’ machines fumble.
Ultimately, the researchers found results consistent with the aforementioned USAMO research, showing that these same models achieved mostly under five per cent on novel mathematical proofs, with only one model reaching 25 per cent, and not a single perfect proof among nearly 200 attempts. Both research teams documented severe performance degradation on problems requiring extended systematic reasoning.
I’m put in mind of a niece who went to primary school during the height of the ‘Whole Language’ reading fad. This was the theory that children should be ‘immersed’ in books rather than systematically taught to ‘sound out’ new words. During a reading session, it became obvious that she was simply memorising the text – a not-inconsiderable feat of memorisation, but not ‘reading’. If I pointed to a particular word and asked her what it was, she would recite the page from the beginning, moving a finger along the words, and stop when she had ‘counted’ her way to the right spot. So she could ‘find’ the word, but she didn’t actually understand the word on the page.
AI researcher Gary Marcus, who has long argued that neural networks struggle with out-of-distribution generalization, called the Apple results “pretty devastating to LLMs.” While Marcus has been making similar arguments for years and is known for his AI skepticism, the new research provides fresh empirical support for his particular brand of criticism.
“It is truly embarrassing that LLMs cannot reliably solve Hanoi,” Marcus wrote, noting that AI researcher Herb Simon solved the puzzle in 1957 and many algorithmic solutions are available on the web. Marcus pointed out that even when researchers provided explicit algorithms for solving Tower of Hanoi, model performance did not improve – a finding that study co-lead Iman Mirzadeh argued shows “their process is not logical and intelligent.”
Popular AI models like ChatGPT do even worse.
On easy tasks, such as Tower of Hanoi with just a few disks, standard models actually won because reasoning models would “overthink” and generate long chains of thought that led to incorrect answers. On moderately difficult tasks, SR models’ methodical approach gave them an edge. But on truly difficult tasks, including Tower of Hanoi with 10 or more disks, both types failed entirely, unable to complete the puzzles, no matter how much time they were given.
Which tends to blunt the objections of ‘AI’ defenders, who argue that the study is really measuring engineering constraints rather than limits on reasoning. It’s not that the machines need more time – especially not given their blindingly fast computational speed – it’s that they just can’t do it.
The study also revealed puzzling inconsistencies in how models fail. Claude 3.7 Sonnet could perform up to 100 correct moves in Tower of Hanoi but failed after just five moves in a river crossing puzzle – despite the latter requiring fewer total moves. This suggests the failures may be task-specific rather than purely computational […]
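To see why ‘fewer total moves’ is striking, here is a rough sketch of a river crossing solver – assuming the classic wolf-goat-cabbage variant, not the exact formulation used in the Apple benchmark – that brute-forces the puzzle with breadth-first search. Because breadth-first search always finds a shortest solution, it confirms the optimal answer is only seven crossings, a fraction of the hundred-odd Hanoi moves Claude managed.

```python
# A rough illustrative sketch, assuming the classic wolf-goat-cabbage variant rather
# than the exact river-crossing formulation used in the Apple benchmark. A plain
# breadth-first search over states finds the optimal solution: just 7 crossings.
from collections import deque

ITEMS = ("farmer", "wolf", "goat", "cabbage")

def safe(state):
    """A state is a tuple of banks (0 or 1) for each item, farmer first.
    The wolf eats the goat and the goat eats the cabbage if left unsupervised."""
    farmer, wolf, goat, cabbage = state
    if goat == wolf and farmer != goat:
        return False
    if goat == cabbage and farmer != goat:
        return False
    return True

def solve():
    start, goal = (0, 0, 0, 0), (1, 1, 1, 1)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        farmer = state[0]
        # The farmer crosses either alone or with one item from his own bank.
        for i, item in enumerate(ITEMS):
            if i != 0 and state[i] != farmer:
                continue
            nxt = list(state)
            nxt[0] = 1 - farmer          # the farmer always crosses
            if i != 0:
                nxt[i] = 1 - state[i]    # the chosen passenger crosses with him
            nxt = tuple(nxt)
            if nxt not in seen and safe(nxt):
                seen.add(nxt)
                queue.append((nxt, path + ["alone" if i == 0 else item]))
    return None

solution = solve()
print(solution)        # e.g. ['goat', 'alone', 'wolf', 'goat', 'cabbage', 'alone', 'goat']
print(len(solution))   # 7
```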
What these studies may suggest instead is that the kinds of extended-context reasoning hacks used by SR models may not be a pathway to general intelligence, as some have hoped. In that case, more robust reasoning capabilities may require fundamentally different approaches rather than refinements to current methods.
Or, far more likely, that computer programs are simply not sufficient for intelligence.