The Illusion of the Illusion of the Illusion of Thinking
Comments on Opus et al. (2025)
G. Pro V. Dantas
June 16, 2025
Abstract
A recent paper by Shojaee et al. (2025), The Illusion of Thinking, presented evidence of an
“accuracy collapse” in Large Reasoning Models (LRMs), suggesting fundamental limitations
in their reasoning capabilities when faced with planning puzzles of increasing complexity. A
compelling critique by Opus and Lawsen (2025), The Illusion of the Illusion of Thinking,
argued these findings are not evidence of reasoning failure but rather artifacts of flawed
experimental design, such as token limits and the use of unsolvable problems. This paper
provides a tertiary analysis, arguing that while Opus and Lawsen correctly identify critical
methodological flaws that invalidate the most severe claims of the original paper, their own
counter-evidence and conclusions may oversimplify the nature of model limitations. By shifting
the evaluation from sequential execution to algorithmic generation, their work illuminates a
different, albeit important, capability. We conclude that the original “collapse” was indeed an
illusion created by experimental constraints, but that Shojaee et al.’s underlying observations
hint at a more subtle, yet real, challenge for LRMs: a brittleness in sustained, high-fidelity,
step-by-step execution. The true illusion is the belief that any single evaluation paradigm can
definitively distinguish between reasoning, knowledge retrieval, and pattern execution.
1 Introduction
The debate over the true reasoning capabilities of Large Language and Reasoning Models (LLMs
and LRMs) is central to the field of artificial intelligence. A significant contribution to this
discourse came from Shojaee et al. [1], who used controlled puzzle environments to test state-
of-the-art LRMs. Their primary finding was a dramatic “accuracy collapse” on tasks like the
Tower of Hanoi and River Crossing as complexity scaled, which they interpreted as evidence of
fundamental limitations in generalizable reasoning.
This conclusion was swiftly challenged by Opus and Lawsen [2] in a direct commentary. They
contended that the observed failures were illusory, stemming directly from experimental design
choices. Specifically, they identified three major issues: (1) Tower of Hanoi experiments exceeded
model output token limits, (2) the evaluation framework misclassified output truncation as
reasoning failure, and (3) the River Crossing benchmarks included mathematically impossible
instances for which models were unfairly penalized.
This paper, The Illusion of the Illusion of the Illusion of Thinking, provides a critical analysis
of Opus and Lawsen’s rebuttal. We find their critique of the original paper’s methodology to
be largely sound and vital for the research community. However, we argue that their own
conclusions and the alternative evaluation they propose—while insightful—may inadvertently
shift the goalposts of what is being measured. This analysis aims to synthesize the findings
from both papers, arguing that the truth is more nuanced than either “fundamental failure”
or “simple artifact.” We will show that while the dramatic “collapse” is indeed an illusion, the
original experiments, when viewed through the lens of the critique, still point toward genuine
and important limitations in the execution of complex, sequential tasks.
2 Deconstructing the Critique
The commentary by Opus and Lawsen [2] provides an essential course correction for evaluation
methodology in AI. Their findings are clear, verifiable, and address the most striking claims of
the original work.
2.1 Points of Agreement
• Unsolvable Problems: The most definitive flaw identified is the use of unsolvable River
Crossing problems (N ≥ 6 with boat capacity b = 3). Scoring a model as a failure for not
solving an impossible puzzle is a fundamental error in experimental design [5]; a simple
solvability check, sketched in the first listing following this list, would have flagged these
instances before they reached the benchmark.
• Token Limits: Opus and Lawsen correctly demonstrate that the “accuracy collapse” in
the Tower of Hanoi task correlates with where the token requirement for the full solution
exceeds the models’ output limits. This is a crucial practical constraint that was not
adequately controlled for in the original study; the second listing following this list gives
a back-of-the-envelope illustration of the crossover.
• Mischaracterization of Complexity: The critique rightly points out that solution
length (Shojaee et al.’s primary metric for complexity) is not equivalent to computational
difficulty. The Tower of Hanoi has an exponential solution length but a trivial O(1)
decision process for each move, whereas River Crossing has a shorter solution but is NP-
hard, requiring complex search and constraint satisfaction.
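
To make the first point concrete, the following minimal Python sketch (ours, not part of either
paper) performs an exhaustive breadth-first search over bank configurations of an actor/agent
River Crossing puzzle. The safety rule encoded here, that an actor may never share a bank with
another agent unless its own agent is also present, checked after every crossing, is a reconstruction
of the benchmark’s constraint and may differ in detail from Shojaee et al.’s implementation; under
these assumed rules the search reports N = 3 and N = 5 as solvable and N = 6 as unsolvable for
boat capacity 3, consistent with [5].

from collections import deque
from itertools import combinations

def river_crossing_solvable(n_pairs: int, boat_capacity: int) -> bool:
    """Exhaustive BFS over bank configurations; True iff some legal
    sequence of crossings moves everyone from the left to the right bank."""
    people = frozenset({("actor", i) for i in range(n_pairs)} |
                       {("agent", i) for i in range(n_pairs)})

    def safe(group):
        # Assumed rule: an actor may not be with other agents unless its own agent is present.
        agents = {i for kind, i in group if kind == "agent"}
        return all(not agents or i in agents
                   for kind, i in group if kind == "actor")

    start = (people, "left")              # everyone, and the boat, begins on the left bank
    seen, queue = {start}, deque([start])
    while queue:
        left, boat = queue.popleft()
        if not left:                      # left bank empty: everyone has crossed
            return True
        bank = left if boat == "left" else people - left
        for k in range(1, boat_capacity + 1):
            for movers in combinations(bank, k):
                crossing = frozenset(movers)
                new_left = left - crossing if boat == "left" else left | crossing
                state = (new_left, "right" if boat == "left" else "left")
                # Both banks must satisfy the safety rule after the crossing.
                if state not in seen and safe(new_left) and safe(people - new_left):
                    seen.add(state)
                    queue.append(state)
    return False

if __name__ == "__main__":
    for n in (3, 5, 6):
        print(n, river_crossing_solvable(n, boat_capacity=3))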
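
The second point is largely a matter of arithmetic. The short sketch below uses purely illustrative
figures, roughly five output tokens per printed move and a 64,000-token completion budget, neither
taken from either paper, to show how quickly an exhaustive Tower of Hanoi move listing outgrows
a realistic output window; the exact crossover depends on per-move verbosity and on each model’s
actual limit.

# Illustrative constants only; neither paper's exact figures.
TOKENS_PER_MOVE = 5        # assumed average tokens needed to print one move
FIXED_OVERHEAD = 1_000     # assumed tokens for preamble and formatting
OUTPUT_BUDGET = 64_000     # assumed maximum completion tokens for a model

for n_disks in range(10, 16):
    moves = 2 ** n_disks - 1                          # minimal solution length
    est_tokens = TOKENS_PER_MOVE * moves + FIXED_OVERHEAD
    verdict = "exceeds budget" if est_tokens > OUTPUT_BUDGET else "fits"
    print(f"N={n_disks:2d}  moves={moves:6d}  ~tokens={est_tokens:7d}  {verdict}")

Under these numbers the full listing stops fitting between N = 13 and N = 14, before any question
of reasoning difficulty arises.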
2.2 Critique of the Rebuttal
Despite the validity of these points, the conclusions drawn by Opus and Lawsen warrant further
scrutiny, particularly regarding their primary piece of counter-evidence.
2.2.1 Changing the Task: Execution vs. Generation
Opus and Lawsen’s key experiment involved prompting models to “Output a Lua function that
prints the solution” for the Tower of Hanoi with 15 disks. They report very high accuracy,
which they present as evidence of “intact reasoning capabilities.”
However, this is a fundamentally different task. The original experiment, for all its flaws, was
designed to test sustained, sequential execution and state tracking. The alternative experiment
tests algorithmic knowledge and code generation. A model can successfully generate a recursive
function if it has seen the Tower of Hanoi algorithm in its training data—which is exceptionally
likely for such a canonical computer science problem. This demonstrates an ability to retrieve
and represent a compressed, abstract solution. It does not, however, prove that the model
could successfully execute or trace the 32,767 moves required for N = 15 without error if given
an unlimited token budget. The original paper’s finding that performance also collapsed even
when the algorithm was explicitly provided in the prompt suggests the bottleneck may indeed
be related to execution fidelity, not just solution discovery.
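
To make this distinction concrete, the following Python sketch (Python rather than the Lua of the
original prompt) is the kind of compressed, algorithmic answer that Opus and Lawsen’s experiment
rewards: a dozen lines that denote the full solution without enumerating it.

def hanoi_moves(n: int, src: str = "A", aux: str = "B", dst: str = "C"):
    """Yield the optimal Tower of Hanoi move sequence for n disks: the compact,
    recursive description of the solution rather than its full enumeration."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, dst, aux)   # park the n-1 smaller disks on the spare peg
    yield (n, src, dst)                            # move disk n to its destination
    yield from hanoi_moves(n - 1, aux, src, dst)   # bring the n-1 smaller disks back on top

moves = list(hanoi_moves(15))
print(len(moves))   # 32767, i.e. 2**15 - 1: the listing the original benchmark required in full

Emitting the function is a retrieval-and-representation task; producing, or faithfully tracing, the
32,767 moves it denotes is the sustained-execution task the original benchmark was probing, and
the two need not stand or fall together.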
2.2.2 The “Choice” of a Model and Experimental Rigor
Opus and Lawsen suggest the evaluation should distinguish between “cannot solve” and “choose
not to enumerate exhaustively.” This reframing implies a level of intentionality that may be
misleading. Model outputs like “to avoid making this too long, I’ll stop here” are more likely
learned responses from RLHF to gracefully handle API limits, rather than a conscious decision
to truncate. The model still failed the task as specified: to output the full sequence of moves.
Furthermore, the authors of the critique acknowledge their own limitations, stating, “Due
to budget constraints, we were unable to conduct enough trials for a highly powered statistical
sample.” This weakens their otherwise strong position. They critique a detailed (though flawed)
experimental paper with a preliminary test of their own.
3 Synthesis: A Nuanced View of Model Limitations
By integrating the findings of both papers, a more nuanced picture emerges. The “accuracy
collapse” described by Shojaee et al. [1] is not a sudden cliff in a model’s core reasoning faculty.
It is, as Opus and Lawsen [2] argue, a failure induced by hitting practical API and token limits.
However, these limits are not entirely divorced from a model’s capabilities. The phenomena
observed by Shojaee et al. are still informative:
1. The Fragility of Sustained Execution: The fact that models fail at high-iteration
sequential tasks, even when the underlying logic is simple (like Tower of Hanoi), points
to a weakness in sustained, step-by-step processing. While the hard token limit is the
ultimate cause of failure in the experiment, the enormous token cost itself is a symptom
of how LLMs represent and execute such problems. A system with more robust internal
state-tracking might execute the steps more efficiently (a move-by-move validator of the kind
needed to measure such execution fidelity is sketched after this list).
2. Unexplained Behavioral Tics: Opus and Lawsen’s critique does not fully account
for one of Shojaee et al.’s most intriguing findings: that near the collapse point, LRMs
“begin reducing their reasoning effort (measured by inference-time tokens)”. If the issue
were merely hitting a hard output limit, one might expect models to consistently reason
until that limit is reached. This counter-intuitive decline in effort on harder problems,
also noted in other contexts [3], suggests a more complex behavioral scaling property that
warrants further investigation.
3. Data Contamination and Generalization: Shojaee et al. observed that models could
handle a 100+ move Tower of Hanoi problem but failed a far shorter River Crossing
problem. They speculate this is due to the prevalence of the former in training data. This
highlights a key challenge in evaluation: distinguishing true, generalizable reasoning from
sophisticated pattern matching of familiar problems, a core issue in compositionality [4].
Opus and Lawsen’s “generate a function” test, when applied to a very common problem,
falls into this same ambiguity.
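
One way to probe item 1 above directly, rather than through all-or-nothing scoring, is to validate
a model’s emitted trace move by move. The sketch below is a hypothetical evaluation helper, not
drawn from either paper: it reports how far into a Tower of Hanoi trace the first illegal move
occurs, so partial execution fidelity can be credited even when the full listing is truncated.

def validate_hanoi_trace(n_disks, moves):
    """Return None if `moves` is a complete legal solution; otherwise the index of
    the first illegal move, or len(moves) if the trace is legal but unfinished."""
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}   # bottom -> top
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return i                              # illegal: moving from an empty peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return i                              # illegal: larger disk onto a smaller one
        pegs[dst].append(pegs[src].pop())
    return None if pegs["C"] == list(range(n_disks, 0, -1)) else len(moves)

# Example: a 3-disk trace that breaks the rules on its second move.
print(validate_hanoi_trace(3, [("A", "C"), ("A", "C")]))   # -> 1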
4 Conclusion
The back-and-forth between The Illusion of Thinking and its critique serves as a powerful case
study in the challenges of evaluating AI reasoning.
Opus and Lawsen [2] convincingly demonstrate that the dramatic “accuracy collapse” re-
ported by Shojaee et al. [1] is an illusion, primarily created by ignoring token limits and testing
on unsolvable puzzles. Their work provides invaluable lessons on the importance of meticulous
and fair experimental design.
However, in deconstructing this illusion, they may have created another: the impression that the
identified flaws dispel any real limitation. By shifting the task from methodical execution to
abstract code generation, their counter-argument measures a different, and likely more familiar,
skill for the models.
The most accurate conclusion lies in the synthesis of both views. LRMs do not suffer from a
hard “collapse” of a general reasoning faculty. They are, however, brittle when it comes to
sustained, high-fidelity logical execution over many steps. Their performance is heavily influenced
by practical constraints (like token limits) and the familiarity of the problem domain (as seen in
the Tower of Hanoi vs. River Crossing disparity). The original paper, despite its methodological
errors, successfully highlighted the symptoms of these limitations. The subsequent critique
correctly diagnosed the immediate cause but may have understated the underlying condition.
Future work must move beyond this debate by designing evaluations that can disentangle
algorithmic knowledge, long-chain execution fidelity, and true out-of-distribution problem-solving,
all while rigorously controlling for the practical realities of the platforms being tested.
References
[1] Shojaee, P., Mirzadeh, I., Alizadeh, K., et al. (2025). The Illusion of Thinking: Understanding
the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.
arXiv:2506.06941.
[2] Opus, C., & Lawsen, A. (2025). The Illusion of the Illusion of Thinking: A Comment on
Shojaee et al. (2025). arXiv:2506.09250.
[3] Ballon, M., Algaba, A., & Ginis, V. (2025). The Relationship Between Reasoning and
Performance in Large Language Models: o3 (mini) Thinks Harder, Not Longer. arXiv:2502.15631.
[4] Dziri, N., Lu, X., Sclar, M., et al. (2023). Faith and Fate: Limits of Transformers on
Compositionality. Advances in Neural Information Processing Systems, 36.
[5] Efimova, E. A. (2018). River Crossing Problems: Algebraic Approach. arXiv:1802.09369.