To hear boosters such as OpenAI CEO Sam Altman tell it, artificial-intelligence technology has already become so capable that smarter-than-human systems are within sight, if not imminent.
But some recent research papers, including one from Apple that resonated widely online, appear to have poked a few holes in such hype. The papers’ conclusion: Far from achieving some kind of super intelligence, even the most advanced generative AI systems — the so-called “reasoning” or “thinking” models — struggle to reason beyond a certain level of complexity.
And there are strong indications that to the extent that such systems show signs of reasoning, it’s a kind of parlor trick; they are simply regurgitating the data upon which they’ve been trained. When asked to apply even simple algorithms to problems they haven’t seen before, they struggle — or fail completely.
Even many in the technical community apparently have started to believe the hype that super-intelligent artificial intelligence is almost here, said Subbarao Kambhampati, a professor in Arizona State’s School of Computing and Augmented Intelligence. But the Apple paper in particular essentially says “no we are not,” he said.
Unlike AI models such as OpenAI’s o3, which have been trained on vast amounts of data, individual people don’t know “even a fraction of the world’s knowledge,” said Kambhampati, who co-authored another one of the recent papers. Even so, “we still are more intelligent than these systems in multiple ways.”
The actual and potential capabilities of such generative AI systems are a matter of huge concern for San Francisco’s tech scene. The City has become ground zero for AI development, attracting an outsized amount of venture capital and serving as the home for the two best-funded and most valuable generative AI startups in the world — OpenAI and Anthropic.
Much of that investment has been predicated on the notion that AI systems will be intelligent and capable enough to take on a wide variety of jobs, either displacing human workers or enhancing their productivity.
Anthropic CEO Dario Amodei has predicted that half of all entry-level white-collar jobs would be wiped out by artificial-intelligence technology within the next five years.
Along those lines, at a small press event in San Francisco last month, Eric Kutcher, chairman of the North American division of consulting firm McKinsey, said the CEO of an unnamed company with a $200 billion market capitalization had set a goal of shrinking its 2,500-person marketing department by 98% within the next two years, relying on AI-powered digital agents to take over much of the transactional work now handled by people.
Helping spur such predictions has been the development of large-reasoning models over the last year. Such models — which include OpenAI’s o1, o3 and o4 systems, Anthropic’s Claude 3.7 and 4 “thinking” models, and DeepSeek’s R1 — build on earlier large-language models such as OpenAI’s GPT-4.
The reasoning or thinking models are designed to take more time to consider their answers before responding to prompts. They also can incorporate additional data and are trained with so-called reinforcement learning, in which systems are encouraged to provide correct or better answers.
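To make that distinction concrete, here is a minimal sketch, in deliberately generic Python, of the difference between a standard language model answering a prompt directly and a “reasoning” model spending extra inference-time compute on intermediate steps first. The `model.generate` interface and the prompt wording are illustrative assumptions, not any vendor’s actual API, and the reinforcement-learning training step described above is not shown.

```python
# Illustrative sketch only: "model" is assumed to expose a generic
# generate(prompt, max_tokens) method; no real vendor SDK is implied.

def answer_directly(model, prompt: str) -> str:
    # A conventional large-language model returns its best continuation right away.
    return model.generate(prompt, max_tokens=256)

def answer_with_reasoning(model, prompt: str, thinking_budget: int = 4096) -> str:
    # A "reasoning" model first produces intermediate steps (a chain of thought),
    # spending extra tokens and compute, then condenses them into a final answer.
    scratchpad = model.generate(
        "Think step by step about the following problem:\n" + prompt,
        max_tokens=thinking_budget,
    )
    return model.generate(
        f"Problem: {prompt}\nWorking notes: {scratchpad}\nFinal answer:",
        max_tokens=256,
    )
```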
The results have been encouraging to AI enthusiasts. The reasoning models have performed much better on math, reasoning and software-coding tests than their large-language model predecessors.
But the new papers suggest that despite being called thinking or reasoning models, such systems aren’t really doing much real thinking or reasoning at all. Indeed, the Apple researchers highlighted their skepticism of the models’ real capabilities in the beginning of the title of their paper: “The Illusion of Thinking.”
In that pre-print report, published on Apple’s website early last month, the research team documented how it tested the thinking capabilities of some of the latest and most prominent reasoning models available, including those from OpenAI, Google, Anthropic and DeepSeek. The team asked each model to contend with four different puzzle games for which the researchers could easily adjust the complexity. Each puzzle of the same type was governed by the same rules and required the same basic logic to solve.
The researchers found that each reasoning model easily solved simple iterations of the games. In some cases, though, they found that a reasoning model’s corresponding language model did just as well with less computing power. The reasoning models continued to perform well on at least some of the puzzles as they became somewhat more complex, although at the cost of rapidly increasing computing power.
But at a certain level of complexity — which varied from puzzle to puzzle — the models weren’t able to solve the problems at all. And the effort the models put into solving the problems actually declined.
The researchers also found that the performance of the DeepSeek and Claude reasoning models didn’t improve when those models were given the algorithm needed to solve one of the puzzles. Even with that guide in hand, the models’ ability to solve the game collapsed once the puzzle passed a certain level of complexity.
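The article does not name the specific puzzles, but a classic adjustable-complexity game such as the Tower of Hanoi, used here purely as an assumed illustration, shows what “providing the algorithm” can mean in practice: the full procedure is a few lines of recursion, and anything that faithfully executes it cannot fail, yet the amount of work doubles with each increment in puzzle size.

```python
# Tower of Hanoi, used here only as an illustrative adjustable-complexity puzzle
# (the article does not name the Apple team's puzzles). The complete solution
# procedure is this short recursion; executing it faithfully always succeeds.

def hanoi(n: int, source: str = "A", target: str = "C", spare: str = "B"):
    """Return the optimal move list for n disks (2**n - 1 moves)."""
    if n == 0:
        return []
    moves = hanoi(n - 1, source, spare, target)    # park n-1 disks on the spare peg
    moves.append((source, target))                 # move the largest disk
    moves += hanoi(n - 1, spare, target, source)   # restack the n-1 disks on top
    return moves

# Complexity is a single dial: each additional disk doubles the required moves.
for disks in (3, 7, 10):
    print(disks, "disks:", len(hanoi(disks)), "moves")   # 7, 127, 1023
```

The researchers’ point is that even with such a procedure spelled out, the models stopped being able to apply it once the instances grew large enough.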
Additionally, the researchers found that those two models — which they were able to examine in more detail because the developers make their models’ so-called chains of thought accessible — tended to “overthink” simple iterations of the puzzles. Although the models tended to find a solution early on, they spent additional time and computing power exploring incorrect solutions. By contrast, when given the much more complex versions of the games, the models explored numerous incorrect solutions without arriving at the correct answer.
Taken together, the results indicate that the models aren’t capable of taking what they know about how to solve one problem and applying it to similar but more complex ones, the researchers said in their paper.
“Our findings reveal fundamental limitations in current models: despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds,” the researchers wrote.
“These insights challenge prevailing assumptions about [large-reasoning model] capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning,” they wrote.
The Apple research team did not respond to a request for comment.
Instead of discovering something novel, the Apple paper highlighted known limitations of language and reasoning models, OpenAI spokesman Laurance Fauconnet said in an email. The amount of data such models can process or spit out at any one time is constrained, and the more complex versions of the problems the researchers tested exceeded those limits, he said.
“The fact that performance drops on extremely complex tasks doesn’t mean the models can’t reason — often, they’re just running out of space to show their work,” Fauconnet said. “Like others in the field, we’re actively working to improve how models reason under practical constraints, including context windows, output budgets, and evaluation formats that might not reflect underlying capabilities.”
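A rough back-of-envelope calculation shows why the “running out of space” argument is at least plausible for puzzles whose written-out solutions grow exponentially with size; the tokens-per-move and output-budget figures below are assumptions chosen only for illustration, not numbers from OpenAI or the Apple paper.

```python
# Back-of-envelope check of the output-budget argument. Both constants are
# assumptions for illustration only, not figures from OpenAI or the Apple paper.

TOKENS_PER_MOVE = 10      # assumed cost of writing out one move
OUTPUT_BUDGET = 64_000    # assumed output-token limit

for n in range(5, 26, 5):
    moves = 2**n - 1                      # exponential growth in solution length
    needed = moves * TOKENS_PER_MOVE
    verdict = "fits" if needed <= OUTPUT_BUDGET else "exceeds budget"
    print(f"size {n:2d}: {moves:>10,} moves ~ {needed:>12,} tokens ({verdict})")
```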
But other AI experts challenged that assertion. While some of the more complex versions of the puzzles exceeded the models’ data limits, the models began failing before they hit those limits, according to a blog post by Gary Marcus, a professor emeritus at New York University and the author of “Taming Silicon Valley: How We Can Ensure That AI Works for Us.”
Regardless, Marcus argued, those limits illustrate the models’ shortcomings. AI systems built on an older and different version of the technology can easily solve these kinds of puzzles, he said.
If these kinds of reasoning models “can’t reliably execute something as basic as [one of the puzzles the Apple researchers test], what makes you think it is going to compute military strategy (especially with the fog of war) or molecular biology (with many unknowns) correctly?” Marcus said in the post. “What the Apple team asked for was way easier than what the real world often demands.”
Representatives of Anthropic and DeepSeek did not respond to requests for comment about the Apple paper. Google representatives did not immediately respond to a request for comment.
Another of the recent papers reached similar conclusions. When its researchers tested DeepSeek-R1, Claude-3.7-Sonnet, OpenAI-o3-mini and OpenAI-o4-mini, they saw the models’ accuracy decline — sometimes sharply, sometimes gradually — as the problems became more complex. The models tended to overthink problems, spending too much time verifying or double-checking answers when they already had the solution and wasting computing power in “loops of errors” when they failed to find answers. And as problems became more complex, the models spent less time and computing power actually trying to calculate them and more time essentially making guesses.
Reinforcement learning has made reasoning models more capable but only to a point, the researchers said.
“These findings underscore a fundamental limitation: while RL can amplify the breadth and depth of problems that LLMs solve, they do not by themselves foster the creative leaps needed for true transformational reasoning,” they said.
Dawn Song, a computer-science professor at UC Berkeley who was a member of the research team, did not respond to an email.
In the other paper, a team of researchers at Arizona State that included Kambhampati examined the ability of two OpenAI reasoning models — o1-preview and o1-mini — to handle planning and scheduling tasks.
The Arizona State researchers found that those models performed much better than earlier LLMs on a battery of assessments that tested their planning and scheduling abilities. On a test called Blocksworld, for example, o1-preview got 98% correct and o1-mini 57%. OpenAI’s GPT-4 scored just 34%.
But the models’ scores dropped markedly when the researchers tested them on versions of the same assessment in which the wording of the problems had been changed. On the version of the test most different from the original, o1-preview got 37% correct and o1-mini only 3.5%.
Like the Apple researchers, the Arizona State team found that the models’ ability to solve even regular Blocksworld problems dropped sharply to near zero beyond a certain level of complexity. And on another planning test, called Sokoban, the models performed poorly, solving only the simplest versions.
By contrast, they noted, an AI system built on a different, older approach solved all of the tests with 100% accuracy, generally in a small fraction of the time.
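For a sense of what that kind of evaluation involves, here is a minimal sketch of the pattern the Arizona State team describes: score a model on standard planning problems, then on reworded versions of the same problems, and compare accuracy. Every function and data item below is a stub invented for illustration; it is not the team’s actual test harness.

```python
# Minimal sketch of the evaluation pattern described above. All helpers and
# data are invented stubs for illustration, not the Arizona State harness.

def query_model(model_name: str, prompt: str) -> str:
    # Stub: a real harness would call the model's API here.
    return ""

def is_valid_plan(plan: str, goal: str) -> bool:
    # Stub: a real checker would simulate the proposed plan against the goal state.
    return goal in plan

def accuracy(model_name: str, problems: list[dict]) -> float:
    hits = sum(
        is_valid_plan(query_model(model_name, p["prompt"]), p["goal"])
        for p in problems
    )
    return hits / len(problems)

# The same underlying problem, phrased two ways: the standard wording and a
# reworded version that preserves the logic but changes the surface language.
standard = [{"prompt": "Stack block A on block B, then B on C.", "goal": "A on B"}]
reworded = [{"prompt": "Place object alpha atop object beta, then beta atop gamma.", "goal": "A on B"}]

for model in ("o1-preview", "o1-mini"):
    print(model, accuracy(model, standard), accuracy(model, reworded))
```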
According to Kambhampati, the Apple paper and his own research suggest that the reasoning models aren’t really reasoning — at least not in the way humans do. They do well on problems they recognize, ones that they have been trained on, but not on ones that are new to them, he said.
“They are not learning the algorithm and just executing it the way you would expect,” Kambhampati said.
It’s hard to know exactly what’s going on, because OpenAI and the other major model developers don’t release their training data, he said. But what appears to be happening is that as developers train ever-larger models on ever more data, they include in that data more complex versions of the same kinds of problems the models are being tested on. So newer models can handle more complicated problems — but only because those problems are in their training data.
“This is an inefficient way to do things, but this is pretty much what happens in LLM reasoning,” Kambhampati said.
The other — and what Kambhampati considers bigger — concern is that even well-trained models don’t guarantee 100% accuracy, he said. Instead of retrieving the actual, accurate data from a database, these systems essentially make stuff up based on probabilities, he said.
That could be a big problem in life-threatening situations, he said.
“If, in fact, it’s a mission-critical plan, if safety matters, then you have to double check the answer,” he said.