Advancing Reasoning in Large Language Models: Promising Methods and Approaches

Abstract—Large Language Models (LLMs) have succeeded remarkably in various natural language processing (NLP) tasks, yet their reasoning capabilities remain a fundamental challenge. While LLMs exhibit impressive fluency and factual recall, their ability to perform complex reasoning—spanning logical deduction, mathematical problem-solving, commonsense inference, and multi-step reasoning—often falls short of human expectations. This survey provides a comprehensive review of emerging techniques for enhancing reasoning in LLMs. We categorize existing methods into key approaches, including prompting strategies (e.g., Chain-of-Thought reasoning, Self-Consistency, and Tree-of-Thought reasoning), architectural innovations (e.g., retrieval-augmented models, modular reasoning networks, and neuro-symbolic integration), and learning paradigms (e.g., fine-tuning with reasoning-specific datasets, reinforcement learning, and self-supervised reasoning objectives). Additionally, we explore evaluation frameworks used to assess reasoning in LLMs and highlight open challenges, such as hallucinations, robustness, and reasoning generalization across diverse tasks. By synthesizing recent advancements, this survey aims to provide insights into promising directions for future research and practical applications of reasoning-augmented LLMs. The recently released LLM DeepSeek-R1 [1] excels in complex tasks such as mathematics and coding, showcasing advanced reasoning capabilities. It effectively simulates human-like analytical thinking, enhancing multi-step reasoning in areas like math, logic, and programming.

Index Terms—Large Language Models (LLMs), Reasoning, Logical Deduction, Mathematical Problem-Solving, Commonsense Inference, Multi-Step Reasoning, Prompting Strategies, Chain-of-Thought Reasoning, Self-Consistency, Tree-of-Thought Reasoning, Retrieval-Augmented Models, Modular Reasoning Networks, Neuro-Symbolic Integration, Reinforcement Learning, Self-Supervised Learning, Hallucinations, AI Reasoning.

I. INTRODUCTION

Large Language Models (LLMs) have revolutionized the field of Natural Language Processing (NLP), enabling breakthroughs in machine translation, text generation, question-answering, and other complex linguistic tasks. Despite their remarkable fluency and knowledge retention, these models often struggle with systematic reasoning—an essential capability for tasks requiring logical inference, problem-solving, and decision-making [2]. While LLMs can generate plausible-sounding responses, they frequently exhibit reasoning errors, inconsistencies, and hallucinations, limiting their reliability in critical domains such as scientific discovery, law, and medicine [3], [4].

Reasoning in AI broadly encompasses multiple cognitive processes, including deductive, inductive, abductive, and commonsense reasoning [5]–[9]. Unlike retrieval-based knowledge synthesis, reasoning requires multi-step logical transformations, contextual generalization, and structured problem-solving. Classical AI approaches have addressed reasoning through rule-based symbolic systems [10], [11], yet integrating such structured reasoning with the data-driven paradigm of LLMs remains an ongoing challenge.

Recent research has explored diverse methodologies to enhance the reasoning abilities of LLMs. These approaches can be categorized into three domains: (1) Prompting Strategies, such as Chain-of-Thought (CoT) reasoning [12], Self-Consistency [13], and Tree-of-Thought [14] methods, which leverage structured prompts to guide step-by-step reasoning; (2) Architectural Innovations, including retrieval-augmented models [15], neuro-symbolic hybrid frameworks [16], and modular reasoning architectures that integrate structured knowledge and logic [17]; and (3) Learning Paradigms, involving fine-tuning with specialized datasets [18], reinforcement learning for reasoning consistency [1], and self-supervised objectives that encourage logical generalization [19].

Among recent advancements, the newly released LLM DeepSeek-R1 [1] has demonstrated superior reasoning performance, particularly in complex domains such as mathematics and coding. By effectively simulating human-like analytical thinking, DeepSeek-R1 enhances multi-step reasoning in mathematical problem-solving, logical inference, and programming tasks, showcasing the potential of fine-tuned architectures and novel training paradigms to improve structured reasoning in LLMs.

This survey systematically reviews these advancements in LLM reasoning, assessing their effectiveness, limitations, and applications. It covers evaluation benchmarks and key challenges such as adversarial robustness, cross-domain generalization, and reasoning biases. By synthesizing recent progress, we provide a comprehensive overview of promising techniques and future research directions.

The paper is structured as follows: Section 2 covers the foundations of reasoning, while Section 3 explores prompt-based reasoning enhancements. Section 4 discusses architectural innovations, and Section 5 examines learning-based approaches. Section 6 focuses on evaluation and benchmarking, Section 7 highlights challenges and open research directions, and Section 8 concludes the paper.
II. FOUNDATIONS OF REASONING IN AI AND LLMS

A. Definitions and Types of Reasoning

Reasoning is the cognitive process of deriving conclusions from premises or evidence. It can be classified into the following types:

• Deductive Reasoning: Drawing specific conclusions from general premises. If the premises are true, the conclusion must be true. This method is fundamental in formal logic and automated theorem proving.
• Inductive Reasoning: Deriving general principles from specific examples or observations. This approach is common in machine learning for pattern recognition and forecasting.
• Abductive Reasoning: Inferring the most likely explanation for a given set of observations, frequently used in diagnostics and hypothesis formation.
• Commonsense Reasoning: Applying general world knowledge to infer reasonable conclusions, which is crucial for understanding implicit meanings in human communication.
• Probabilistic Reasoning: Handling uncertainty in logical inference using probability theory, often implemented in Bayesian networks and Markov models.

B. Classical AI Approaches to Reasoning

Traditional AI research has long focused on formal reasoning techniques incorporating structured knowledge representations. Key classical approaches include [10], [11]:

• Symbolic Logic: Formal rule-based systems that use first-order logic (FOL) and propositional logic to derive conclusions.
• Rule-Based Systems: AI models that apply predefined rules to infer logical conclusions, used in expert systems and decision trees.
• Knowledge Graphs: Structured representations of entities and their relationships, supporting reasoning through graph traversal and inference mechanisms.
• Automated Theorem Proving (ATP): Algorithms designed to prove mathematical theorems using logical deduction, such as the resolution principle in propositional logic.
• Bayesian Networks: Probabilistic graphical models that enable reasoning under uncertainty by representing dependencies between variables.

While these classical approaches provide strong logical foundations, they struggle with scalability and adaptability when applied to open-ended, unstructured problems such as natural language understanding.
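To make the rule-based paradigm concrete, the following is a minimal forward-chaining inference sketch in Python; the facts and rules are invented toy examples rather than part of any cited system.

```python
# Minimal forward-chaining rule engine: repeatedly applies
# if-then rules to a set of known facts until no new fact is derived.

def forward_chain(facts: set[str], rules: list[tuple[set[str], str]]) -> set[str]:
    """Each rule is (premises, conclusion): if all premises hold, assert the conclusion."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)  # fire the rule
                changed = True
    return derived

# Hypothetical knowledge base illustrating deductive inference.
facts = {"socrates_is_human"}
rules = [
    ({"socrates_is_human"}, "socrates_is_mortal"),
    ({"socrates_is_mortal"}, "socrates_will_die"),
]
print(forward_chain(facts, rules))
# derives: socrates_is_mortal, socrates_will_die
```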
C. Reasoning in Large Language Models

Large Language Models (LLMs) such as GPT-4, PaLM, and LLaMA utilize deep learning architectures, primarily transformers, to process and generate human-like text. However, their reasoning capabilities differ significantly from traditional AI approaches [5]–[9]:

• Statistical Learning vs. Symbolic Logic: Unlike symbolic AI, which follows explicit logical rules, LLMs learn probabilistic patterns in language data, making their reasoning implicit and non-deterministic.
• Emergent Reasoning Abilities: Studies suggest that scaling LLMs improves their ability to perform multi-step reasoning tasks despite the lack of explicit logical constraints.
• Contextual and Prompt-Driven Reasoning: LLMs rely heavily on context windows and external prompt engineering techniques (e.g., Chain-of-Thought prompting) to generate reasoned responses.
• Limitations in Logical Deduction: While LLMs excel at recognizing language patterns, they struggle with formal logic, mathematical proofs, and systematically verifying conclusions.

D. Challenges of Reasoning in LLMs

Despite their progress, LLMs face several challenges when it comes to robust and reliable reasoning [20]–[22]:

• Hallucinations: LLMs sometimes generate plausible but incorrect information, leading to unreliable reasoning.
• Lack of Explicit Memory: Unlike knowledge graphs or rule-based systems, LLMs lack structured long-term memory, making reasoning consistency difficult.
• Difficulty with Multi-Step Reasoning: Although techniques like Chain-of-Thought prompting help, LLMs often fail to follow multi-step logical structures correctly.
• Bias and Interpretability Issues: Since LLMs train on vast text corpora, they inherit biases from data, which can influence reasoning outputs in unpredictable ways.
• Limited Generalization Across Domains: LLMs trained on diverse datasets still struggle with transferring reasoning skills across vastly different domains (e.g., legal reasoning vs. scientific inference).

E. Bridging the Gap Between AI Reasoning and LLMs

To enhance reasoning in LLMs, recent research [1], [15], [16], [23] has explored hybrid models that integrate traditional reasoning techniques with deep learning. Key directions include:

• Fine-Tuning with Structured Reasoning Data: Training LLMs on specialized datasets that explicitly focus on logical inference and mathematical problem-solving.
• Retrieval-Augmented Reasoning: Enhancing LLMs with knowledge retrieval mechanisms, allowing them to ground their responses in external facts.
• Neuro-Symbolic AI: Combining neural networks with symbolic reasoning frameworks to leverage the strengths of both approaches.
• Self-Supervised and Reinforcement Learning Techniques: Encouraging models to refine their reasoning through iterative self-training and reward mechanisms.

These advancements aim to push LLMs toward more reliable, explainable, and human-like reasoning capabilities.
III. PROMPTING-BASED REASONING ENHANCEMENT

Large Language Models (LLMs) demonstrate emergent reasoning through structured prompts, bypassing the need for fine-tuning [3], [24]. This section examines key prompting techniques, illustrated in Figure 1 and summarized in Table I.

Fig. 1. Approaches to Prompting-Based Reasoning Enhancement.
A. Chain-of-Thought (CoT) Reasoning

Chain-of-Thought (CoT) reasoning is a prompting technique used in large language models (LLMs) to improve their ability to solve complex reasoning problems. It involves breaking down a problem into a series of intermediate steps, allowing the model to reason more effectively and arrive at accurate conclusions [12]. This technique has been particularly effective for complex mathematical problem-solving, logical reasoning, and commonsense inference.

• Step-by-Step Reasoning: Instead of answering immediately, the model generates a sequence of logical steps to work through the problem, improving accuracy in multi-step problem-solving.
• Intermediate Reasoning: The approach mimics human problem-solving by considering subproblems before reaching the final answer.
• Performance Gains: Studies show that CoT prompting improves performance on arithmetic and logical tasks compared to standard prompting [12].
• Limitations: While CoT enhances interpretability, its effectiveness depends on prompt design and model size. In some cases, models may still generate incorrect intermediate steps [13].
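As an illustration only, the sketch below assembles a few-shot CoT prompt in Python. The llm_generate function is a hypothetical stand-in for any text-completion API, and the exemplar follows the style popularized in [12].

```python
# Few-shot Chain-of-Thought prompting: the exemplar demonstrates
# intermediate steps, nudging the model to reason before answering.

def llm_generate(prompt: str) -> str:
    """Hypothetical wrapper around any text-completion API."""
    raise NotImplementedError

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def cot_answer(question: str) -> str:
    # The exemplar plus a step-by-step cue elicits intermediate reasoning.
    prompt = COT_EXEMPLAR + f"Q: {question}\nA: Let's think step by step."
    return llm_generate(prompt)
```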
B. Self-Consistency Prompting

Self-Consistency prompting is an advanced prompting technique that improves reasoning accuracy by generating multiple diverse reasoning paths and selecting the most consistent answer [13]. This method is useful in complex reasoning tasks where a single Chain-of-Thought (CoT) might be prone to errors. The technique reduces variability in responses and increases accuracy by aggregating outputs.

• Multiple Reasoning Paths: Instead of generating a single step-by-step solution, the model produces multiple different reasoning chains.
• Diverse Thought Processes: Each reasoning chain might follow a different logical approach, reducing biases in a single trajectory.
• Majority Voting on Final Answer: The final response is determined based on the most frequently occurring answer across generated samples.
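A minimal self-consistency decoding sketch, assuming a hypothetical sample_chain call that draws one CoT completion at non-zero temperature and ends with the phrase "The answer is X.":

```python
# Self-consistency: sample several CoT chains, extract each final
# answer, and return the majority answer across samples.
import re
from collections import Counter

def sample_chain(question: str) -> str:
    """Hypothetical call that samples one CoT completion (temperature > 0)."""
    raise NotImplementedError

def self_consistent_answer(question: str, n_samples: int = 10) -> str:
    answers = []
    for _ in range(n_samples):
        chain = sample_chain(question)
        # Assumes chains end with "The answer is X." as in the CoT exemplar.
        match = re.search(r"The answer is (.+?)\.", chain)
        if match:
            answers.append(match.group(1).strip())
    # Majority vote over the sampled final answers.
    return Counter(answers).most_common(1)[0][0] if answers else ""
```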
C. Tree-of-Thought (ToT) Reasoning

Tree-of-Thought (ToT) reasoning is an advanced problem-solving framework that extends CoT reasoning by exploring multiple possible reasoning paths in a tree-like structure [14]. Instead of following a single linear reasoning path, ToT allows branching and evaluation at each step, leading to more robust and optimal solutions.

• Structured Exploration: The model explores different paths in a tree-like structure, selecting the optimal reasoning route.
• Decision Evaluation & Pruning: Intermediate reasoning states are evaluated and unpromising branches are pruned, making ToT particularly effective in combinatorial and planning tasks.
• Final Answer Selection: The best reasoning path is selected based on a scoring or majority selection process [14].
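The following is a highly simplified beam-search-style ToT sketch; propose_steps and score_state are hypothetical LLM calls (one proposing candidate next thoughts, one rating partial solutions), loosely following the search loop described in [14].

```python
# Toy Tree-of-Thought search: expand each partial reasoning state,
# score the candidates, and keep only the best few (pruning).

def propose_steps(state: str, k: int = 3) -> list[str]:
    """Hypothetical LLM call: propose k candidate next thoughts."""
    raise NotImplementedError

def score_state(state: str) -> float:
    """Hypothetical LLM call: rate a partial solution in [0, 1]."""
    raise NotImplementedError

def tree_of_thought(question: str, depth: int = 3, beam: int = 2) -> str:
    frontier = [question]
    for _ in range(depth):
        candidates = [s + "\n" + step for s in frontier for step in propose_steps(s)]
        # Prune: keep only the `beam` highest-scoring partial solutions.
        frontier = sorted(candidates, key=score_state, reverse=True)[:beam]
    return frontier[0]  # best complete reasoning path
```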
D. Program-aided Language Models (PAL)

Program-Aided Language Models (PAL) is a technique that enhances a language model's reasoning capabilities by allowing it to call external computational tools—such as Python or symbolic solvers—to perform calculations, execute logic-based steps, or verify solutions. Instead of relying purely on internal token-based reasoning, PAL leverages external code execution for improved accuracy and reliability [25].

• Execution-Based Verification: The model generates reasoning steps in code format, which is executed to verify correctness.
• Higher Accuracy in Mathematical Reasoning: PAL has demonstrated superior performance in tasks requiring precise calculations.
• Dependence on External Tools: This approach requires integration with external computing environments, limiting its scalability [25].
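A bare-bones PAL-style sketch: the model is asked to emit Python rather than prose, and the program's output becomes the answer. llm_generate is again a hypothetical completion call, and the exec-based runner is purely illustrative (a real system would sandbox execution).

```python
# PAL sketch: have the model write a small program, then run it and
# read the answer from a conventional `result` variable.

def llm_generate(prompt: str) -> str:
    """Hypothetical wrapper around any text-completion API."""
    raise NotImplementedError

def pal_answer(question: str):
    prompt = (
        "Write Python code that solves the problem and stores the final "
        f"value in a variable named result.\n\nProblem: {question}\n"
    )
    code = llm_generate(prompt)
    namespace: dict = {}
    exec(code, namespace)           # illustrative only: sandbox in practice
    return namespace.get("result")  # answer computed by the program
```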
Empirical studies indicate that CoT and self-consistency prompting significantly improve reasoning performance, particularly in structured domains such as mathematics and logic [12], [13].

TABLE I
Comparison of Chain-of-Thought (CoT), Self-Consistency CoT (SC-CoT), Tree-of-Thought (ToT), and Program-Aided Language Models (PAL)

IV. ARCHITECTURAL INNOVATIONS FOR ENHANCED REASONING

While prompting-based techniques have improved the reasoning capabilities of Large Language Models (LLMs), architectural innovations play a crucial role in enhancing their ability to perform structured and complex reasoning. This section explores various model architectures and modifications to improve logical inference, multi-step reasoning, and knowledge integration.
A. Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is an AI framework that combines information retrieval with text generation. It enhances LLM reasoning by incorporating external knowledge sources. This approach improves the accuracy, relevance, and factual grounding of responses compared to relying solely on parametric memory [15].

• Query Processing: The input query is processed and embedded into a vector space. The model searches for relevant documents using a retrieval system (e.g., dense passage retrieval, BM25). The retrieved documents are appended to the input.
• Knowledge-Enhanced Reasoning: RAG-based models supplement their reasoning process based on both the query and retrieved information.
• Reduction of Hallucinations: By grounding responses in external data, RAG helps mitigate hallucinations often observed in purely generative models [26].
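A minimal embedding-based RAG sketch; embed and llm_generate are hypothetical stand-ins for an embedding model and a completion API, with cosine similarity playing the role of the retriever.

```python
# Toy RAG pipeline: embed the query, retrieve the most similar
# documents, and prepend them to the prompt as grounding context.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical sentence-embedding call returning a unit vector."""
    raise NotImplementedError

def llm_generate(prompt: str) -> str:
    """Hypothetical wrapper around any text-completion API."""
    raise NotImplementedError

def rag_answer(query: str, documents: list[str], top_k: int = 3) -> str:
    q = embed(query)
    # Rank documents by cosine similarity to the query embedding
    # (in practice, document embeddings would be precomputed and indexed).
    scored = sorted(documents, key=lambda d: float(embed(d) @ q), reverse=True)
    context = "\n".join(scored[:top_k])
    prompt = f"Context:\n{context}\n\nAnswer using only the context above.\nQ: {query}\nA:"
    return llm_generate(prompt)
```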
B. Neuro-Symbolic Hybrid Models

Neuro-Symbolic Hybrid Models combine neural networks (which excel at pattern recognition and learning from data) with symbolic AI (which enables reasoning, logic, and explicit knowledge representation). This fusion aims to create more explainable, generalizable, and robust AI systems [16].

• Integration of Logic and Learning: These models use neural networks to process unstructured text while employing symbolic logic for rule-based reasoning. Neural models extract features, while symbolic systems provide logical inference.
• Enhanced Interpretability: Symbolic components improve transparency, making reasoning steps more explainable. Rule-based systems, knowledge graphs, and formal logic enable structured reasoning.
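One common neuro-symbolic pattern is to let a neural model extract structured facts and hand deduction to a symbolic engine. The sketch below assumes a hypothetical extract_facts LLM call and a toy rule base.

```python
# Neuro-symbolic sketch: a neural model extracts symbolic facts from
# text; a rule engine then performs the actual deduction.

def extract_facts(text: str) -> set[str]:
    """Hypothetical LLM call mapping free text to symbolic facts,
    e.g. "Socrates is a human" -> {"human(socrates)"}."""
    raise NotImplementedError

def neuro_symbolic_infer(text: str, rules: list[tuple[set[str], str]]) -> set[str]:
    facts = set(extract_facts(text))                  # neural component
    changed = True
    while changed:                                    # symbolic component
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

# Hypothetical rule: every human is mortal.
RULES = [({"human(socrates)"}, "mortal(socrates)")]
```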
C. Memory-Augmented Neural Networks

Memory-Augmented Neural Networks (MANNs) are AI models that integrate external memory with neural networks, enabling them to store, retrieve, and manipulate information dynamically. MANNs can read from and write to an external memory module, making them more adaptable for reasoning consistency over long sequences, lifelong learning, and few-shot learning tasks [21].

• Controller (Neural Network Core): A neural network (typically an RNN or Transformer) that processes inputs and manages interactions with memory, determining when and how to read/write data.
• External Memory Storage: A structured memory component (e.g., a differentiable memory matrix or key-value store) that holds information over time. Unlike standard RNNs, which rely only on hidden states, MANNs explicitly retrieve and update memory.
• Memory Access Mechanism: Read/write operations in memory-augmented neural networks are typically differentiable, enabling gradient-based learning. Addressing mechanisms include content-based addressing, which retrieves memory by assessing similarity to stored data, and location-based addressing, which accesses memory based on positional or sequential order.
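A sketch of the content-based addressing described above: attention weights over memory slots are a softmax of similarity between a query key and each slot, giving a differentiable read. This is a simplified form of the mechanism; the dimensions are illustrative.

```python
# Content-based memory read: softmax over key/slot similarities
# yields a differentiable, weighted sum of memory rows.
import numpy as np

def content_read(memory: np.ndarray, key: np.ndarray, beta: float = 5.0) -> np.ndarray:
    """memory: (slots, dim); key: (dim,); beta: sharpness of addressing."""
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    sim = memory @ key / norms          # cosine similarity per slot
    w = np.exp(beta * sim)
    w = w / w.sum()                     # attention weights over slots
    return w @ memory                   # weighted read vector

memory = np.random.randn(8, 16)         # 8 slots, 16-dim (illustrative)
key = np.random.randn(16)
print(content_read(memory, key).shape)  # (16,)
```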
D. Graph Neural Networks (GNNs) and Knowledge Graphs

Graph Neural Networks (GNNs) offer a structured framework for reasoning by explicitly representing entities and their relationships, enabling logical inference and multi-hop question-answering.

• Structured Representation: Graph Neural Networks are neural models designed to operate on graph-structured data. Unlike traditional deep learning models (which work on grids like images or sequences like text), GNNs can model complex relationships between interconnected entities [27].
• Reasoning over Knowledge Graphs: Knowledge Graphs represent facts as entities and relationships in a structured format, typically as a triple (subject, predicate, object). When GNNs are applied to Knowledge Graphs, they enable reasoning, inference, and discovery of hidden relationships [28].
• Improvements in Explainability: Knowledge graph-based reasoning enhances transparency by making inference paths explicit.
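A toy message-passing round over a knowledge graph stored as (subject, predicate, object) triples, illustrating how GNN-style aggregation propagates information along edges; the graph and feature sizes are invented for illustration.

```python
# One round of mean-aggregation message passing over a tiny
# knowledge graph stored as (subject, predicate, object) triples.
import numpy as np

triples = [("paris", "capital_of", "france"),
           ("france", "member_of", "eu")]         # hypothetical KG

entities = {"paris", "france", "eu"}
h = {e: np.random.randn(4) for e in entities}     # 4-dim node features

def message_passing(h: dict, triples) -> dict:
    msgs = {e: [] for e in h}
    for s, _, o in triples:                       # messages flow along edges
        msgs[o].append(h[s])
        msgs[s].append(h[o])
    # Each node averages its neighbors' features with its own state.
    return {e: np.mean([h[e]] + m, axis=0) if m else h[e] for e, m in msgs.items()}

h = message_passing(h, triples)  # after enough rounds, "paris" carries
                                 # signal from "eu" (multi-hop inference)
```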
E. Tool-Use and API Augmentations

LLMs can be augmented with external tools and APIs to improve reasoning capabilities, leveraging specialized computational resources beyond language modeling [29].

• Programmatic Reasoning: Models invoke external calculators, theorem solvers, or search engines to validate reasoning steps.
• Dynamic Data Integration: As illustrated in Table II, APIs enable real-time access to updated knowledge, improving the factual accuracy of reasoning [30].
• Limitations: Dependence on external services introduces latency and requires access control mechanisms.

TABLE II
Common API Types Used in AI Systems

API Type             Example Use Cases
Web Search APIs      Bing, Google, Weather API for live information
Computation APIs     Wolfram Alpha for advanced mathematical reasoning
Database APIs        SQL, NoSQL for structured queries
Cloud Services APIs  AWS, Google Cloud, OpenAI API for cloud services
Automation APIs      Zapier, IFTTT for automating workflows
Financial APIs       Stock market APIs (Alpha Vantage, Yahoo Finance)
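A minimal tool-dispatch sketch following the pattern above: the model is prompted to emit a tool call, which a thin router executes before the conversation continues. The tool names and the llm_generate call are hypothetical.

```python
# Toy tool router: the model emits "CALL <tool>: <argument>",
# the router executes it, and the result is fed back to the model.
import json

def llm_generate(prompt: str) -> str:
    """Hypothetical wrapper around any text-completion API."""
    raise NotImplementedError

TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # demo only
    "search": lambda q: json.dumps({"results": []}),  # stub for a web-search API
}

def answer_with_tools(question: str) -> str:
    reply = llm_generate(f"Use 'CALL <tool>: <arg>' if needed.\nQ: {question}")
    if reply.startswith("CALL"):
        name, arg = reply[len("CALL "):].split(":", 1)
        observation = TOOLS[name.strip()](arg.strip())  # run the tool
        reply = llm_generate(f"Q: {question}\nTool result: {observation}\nA:")
    return reply
```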
V. LEARNING-BASED APPROACHES

B. Reinforcement Learning from Human Feedback

Methods such as Reinforcement Learning from Human Feedback (RLHF) train models to align their reasoning with human preferences [36]. A PPO-based RLHF training procedure is outlined in Algorithm 1.

• Reward Models for Logical Consistency: RLHF optimizes model outputs based on human evaluators' feedback, reducing errors in logical reasoning [37].
• Reward Model (RM) Training: Human annotators assess multiple model outputs based on preference. A dedicated neural network, known as the Reward Model, is trained on these rankings to capture human preferences. The models generate and assess their reasoning steps, refining correct solutions through iterative learning [18].
• Reinforcement Learning via Proximal Policy Optimization (PPO): PPO, a reinforcement learning algorithm, is used to optimize the model while preventing drastic deviations from its base performance [1].
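As a hedged illustration of the reward-modeling step, the sketch below computes the standard pairwise ranking loss used to train reward models from human preference data, as in [37]; the reward function is a hypothetical placeholder for the Reward Model's forward pass.

```python
# Pairwise reward-model loss: the reward of the human-preferred
# response should exceed the reward of the rejected one.
# loss = -log(sigmoid(r_chosen - r_rejected)), averaged over pairs.
import numpy as np

def reward(response: str) -> float:
    """Hypothetical reward-model forward pass (a scalar score)."""
    raise NotImplementedError

def pairwise_rm_loss(pairs: list[tuple[str, str]]) -> float:
    """pairs: (chosen, rejected) response pairs from human annotators."""
    losses = []
    for chosen, rejected in pairs:
        margin = reward(chosen) - reward(rejected)
        losses.append(np.log1p(np.exp(-margin)))  # -log sigmoid(margin)
    return float(np.mean(losses))
```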
D. Automated Verifiers and Critic Models

To further enhance reasoning accuracy, LLMs can be paired with automated verifiers that critically assess their outputs [39].

• Secondary Verification Models: A separate model evaluates the reasoning output of an LLM, filtering out incorrect inferences.
• Formal Proof Checking: Integration with theorem provers allows models to verify logical deductions rigorously [40].
• Limitations: Automated verification remains challenging due to the difficulty of formalizing natural language reasoning.
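A generate-then-verify sketch of the critic pattern: one model proposes candidate solutions and a second, hypothetical verify call filters them.

```python
# Critic-model sketch: sample candidates, keep the first one the
# verifier accepts; fall back to the last candidate otherwise.

def propose(question: str) -> str:
    """Hypothetical generator-model call."""
    raise NotImplementedError

def verify(question: str, answer: str) -> bool:
    """Hypothetical critic-model call returning accept/reject."""
    raise NotImplementedError

def verified_answer(question: str, attempts: int = 5) -> str:
    candidate = ""
    for _ in range(attempts):
        candidate = propose(question)
        if verify(question, candidate):  # critic filters bad inferences
            return candidate
    return candidate                     # best effort after all attempts
```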
VI. EVALUATION AND BENCHMARKING OF REASONING IN LLMS

Assessing the reasoning capabilities of Large Language Models (LLMs) requires systematic evaluation using standardized benchmarks and performance metrics. This section explores various evaluation methodologies, including reasoning benchmarks and the metrics used to score them.

A. Reasoning Benchmarks

• Adversarial NLI (ANLI) – Tests the robustness of natural language inference through adversarially generated reasoning tasks [44].
• HellaSwag – A benchmark designed to test commonsense natural language inference. It requires the model to predict the most likely ending of a sentence [33].
• Measuring Massive Multitask Language Understanding (MMLU) – Evaluates general knowledge and problem-solving abilities across 57 subjects, including elementary mathematics, US history, computer science, and law [45].

B. Metrics for Measuring Reasoning Performance

Evaluating reasoning in LLMs involves multiple performance metrics tailored to different reasoning tasks.
• Accuracy: Measures the correctness of model responses, often evaluated using Exact Match (EM) and F1-score, particularly in mathematical and logical reasoning tasks [32].
• Logical Consistency: Assesses whether a model's reasoning follows coherent logical steps across multiple queries, often evaluated using theorem-proving datasets such as ProofWriter [39].
• Explainability and Interpretability: Evaluates the transparency of reasoning steps, especially in Chain-of-Thought (CoT) models, by assessing the faithfulness of intermediate steps to the final answer [12].
• Self-Consistency: Measures reasoning reliability by generating multiple independent responses to the same query and assessing agreement among outputs [13].
• Multi-Hop Reasoning Score: Used in datasets like HotpotQA to assess the model's ability to integrate multiple pieces of evidence in complex reasoning tasks [35].
• Adversarial Robustness: Tests the model's ability to maintain reasoning accuracy under adversarial perturbations, as evaluated in the ANLI dataset [44].
• Faithfulness and Verifiability: Measures whether the model-generated reasoning steps can be independently verified and logically aligned with the final answer [40].
• Confidence Calibration: Evaluates whether the model's confidence in its predictions correlates with correctness, commonly measured using log-likelihood scores and the Brier score [46].
• Reasoning Generalization: Assesses how well the model performs on out-of-distribution (OOD) reasoning tasks, testing adaptability beyond its training data [47].
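Two of these metrics are simple enough to state directly in code. The sketch below computes exact-match accuracy and the Brier score, under the assumed convention that model confidence is a probability in [0, 1].

```python
# Exact-match accuracy and Brier score (mean squared error between
# stated confidence and actual correctness).
import numpy as np

def exact_match(preds: list[str], golds: list[str]) -> float:
    return float(np.mean([p.strip() == g.strip() for p, g in zip(preds, golds)]))

def brier_score(confidences: list[float], correct: list[bool]) -> float:
    c = np.asarray(confidences, dtype=float)
    y = np.asarray(correct, dtype=float)  # 1.0 if the answer was right
    return float(np.mean((c - y) ** 2))   # lower is better calibrated

print(exact_match(["42", "7"], ["42", "8"]))   # 0.5
print(brier_score([0.9, 0.8], [True, False]))  # ((0.1)^2 + (0.8)^2) / 2 = 0.325
```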
VII. CHALLENGES AND OPEN RESEARCH DIRECTIONS

Despite significant advancements in enhancing the reasoning capabilities of Large Language Models (LLMs), several challenges persist. These limitations hinder their reliability, robustness, and applicability in high-stakes domains. This section discusses key challenges and proposes open research directions to address them.

A. Hallucinations and Misinformation

One of the critical challenges in LLM reasoning is the generation of hallucinated or factually incorrect information [20].

• Unverified Reasoning Steps: LLMs sometimes generate plausible but incorrect reasoning chains, leading to logical inconsistencies [48].
• Fact-Checking Mechanisms: Existing fact-checking techniques fail to filter misinformation in multi-step reasoning tasks [30].
• Open Research Direction: Developing automated verifiers and integrating LLMs with structured databases to improve factual accuracy.

B. Generalization Across Domains

LLMs often struggle to generalize reasoning capabilities across different domains, limiting their adaptability to novel scenarios [49].

• Domain-Specific Overfitting: Fine-tuning on specific reasoning datasets may improve performance in targeted tasks but hinders adaptability to unseen domains [32].
• Cross-Domain Transfer Learning: Current transfer learning approaches have limitations in maintaining reasoning coherence across diverse contexts [19].
• Open Research Direction: Investigating meta-learning and continual learning strategies for cross-domain generalization.

C. Robustness to Adversarial Attacks

LLMs are vulnerable to adversarial perturbations that exploit reasoning weaknesses, leading to incorrect or misleading outputs [44].

• Sensitivity to Input Variations: Small modifications in prompts can lead to significantly different reasoning outputs, impacting reliability.
• Adversarial Robustness Testing: Existing benchmarks do not sufficiently evaluate LLMs against adversarial reasoning challenges [27].
• Open Research Direction: Developing robust adversarial training techniques to improve resistance to input manipulations.

D. Integrating Symbolic and Neural Reasoning

LLMs rely on statistical pattern recognition rather than formal logical reasoning, leading to errors in complex inferencing tasks [16].

• Limitations of Purely Neural Approaches: LLMs struggle with structured logic, formal proofs, and abstract symbolic reasoning [40].
• Neuro-Symbolic AI: Combining neural networks with symbolic reasoning frameworks enhances logical consistency and interpretability [16].
• Open Research Direction: Advancing hybrid neuro-symbolic architectures for reasoning-augmented AI models.
VIII. CONCLUSION

Advancing reasoning in Large Language Models (LLMs) is a key milestone in AI development. Despite improvements in prompting, architecture, and learning-based methods, challenges remain in logical consistency, generalization, robustness, and interpretability. This survey reviews key approaches to enhancing LLM reasoning, categorized into prompting techniques, architectural innovations, and learning-driven strategies.

A. Summary of Key Findings

The key takeaways from this survey can be summarized as follows:

• Prompting Strategies: Techniques such as Chain-of-Thought (CoT) prompting, Self-Consistency, and Tree-of-Thought (ToT) reasoning have shown significant improvements in structured problem-solving, logical inference, and multi-step reasoning [12]–[14].
• Architectural Innovations: Enhancements such as Retrieval-Augmented Generation (RAG), Neuro-Symbolic AI, Memory-Augmented Models, and Graph Neural Networks (GNNs) contribute to better structured and explainable reasoning [15], [16].
• Learning-Based Approaches: Fine-tuning on reasoning-specific datasets, Reinforcement Learning from Human Feedback (RLHF), self-supervised learning, and automated verifiers improve logical consistency and generalization [18], [32], [37].
• Evaluation and Benchmarking: Current benchmarks such as GSM8K, MATH, LogiQA, and ARC provide valuable insights into LLM reasoning capabilities, but existing evaluation methodologies require improvements in adversarial robustness and dynamic reasoning assessment [31], [32], [41].
• Challenges and Open Research Directions: Key challenges include hallucinations, reasoning generalization, adversarial robustness, computational efficiency, ethical considerations, and the need for explainable reasoning models [16], [20], [49].

B. Final Thoughts

The future of AI reasoning depends on developing models that generate fluent text while ensuring robust, verifiable, and adaptable reasoning across domains. Advancements in prompting, architecture, and learning can bring LLMs closer to human-like reasoning. However, addressing challenges requires collaboration among AI researchers, cognitive scientists, ethicists, and domain experts. The goal is to create AI systems that reason accurately, ethically, and transparently for safer real-world deployment.

IX. ACKNOWLEDGMENTS

We thank the research community for their contributions to reasoning in LLMs and developing benchmarking datasets. This survey has been informed by a wide range of studies, and we acknowledge the valuable work that has advanced the field.

REFERENCES

[1] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, L. Li, Z. Shao, P. Wang et al., "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning," arXiv preprint arXiv:2501.12948, 2025.
[2] Z. Wu, L. Qiu, A. Ross, E. Akyürek, B. Chen, B. Wang, N. Kim, J. Andreas, and Y. Kim, "Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks," in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, pp. 1819–1862.
[3] T. Brown et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, 2020.
[4] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, "Large language models are zero-shot reasoners," Advances in Neural Information Processing Systems, vol. 35, pp. 22199–22213, 2022.
[5] P. Clark, O. Tafjord, and K. Richardson, "Transformers as soft reasoners over language," in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2021, pp. 3882–3890.
[6] Z. Yang, L. Dong, X. Du, H. Cheng, E. Cambria, X. Liu, J. Gao, and F. Wei, "Language models as inductive reasoners," in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 209–225.
[7] C. Bhagavatula, R. L. Bras, C. Malaviya, K. Sakaguchi, A. Holtzman, H. Rashkin, D. Downey, S. W.-t. Yih, and Y. Choi, "Abductive commonsense reasoning," arXiv preprint arXiv:1908.05739, 2019.
[8] X. Zhou, Y. Zhang, L. Cui, and D. Huang, "Evaluating commonsense in pre-trained language models," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, 2020, pp. 9733–9740.
[9] Q. Liu, H. Jiang, A. Evdokimov, Z.-H. Ling, X. Zhu, S. Wei, and Y. Hu, "Probabilistic reasoning via deep learning: Neural association models," arXiv preprint arXiv:1603.07704, 2016.
[10] R. R. Yager, "Approximate reasoning as a basis for rule-based expert systems," IEEE Transactions on Systems, Man, and Cybernetics, no. 4, pp. 636–643, 1984.
[11] R. Sun, "Robust reasoning: integrating rule-based and similarity-based reasoning," Artificial Intelligence, vol. 75, no. 2, pp. 241–295, 1995.
[12] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
[13] X. Wang et al., "Self-consistency improves chain of thought reasoning in language models," arXiv preprint arXiv:2203.11171, 2022.
[14] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan, "Tree of thoughts: Deliberate problem solving with large language models," Advances in Neural Information Processing Systems, vol. 36, 2024.
[15] P. Lewis et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," Advances in Neural Information Processing Systems, 2020.
[16] A. d. Garcez and L. C. Lamb, "Neurosymbolic AI: The 3rd wave," Artificial Intelligence Review, vol. 56, no. 11, pp. 12387–12406, 2023.
[17] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap, "A simple neural network module for relational reasoning," in Advances in Neural Information Processing Systems, vol. 30, 2017.
[18] E. Zelikman, Y. Wu, J. Mu, and N. Goodman, "STaR: Bootstrapping reasoning with reasoning," Advances in Neural Information Processing Systems, vol. 35, pp. 15476–15488, 2022.
[19] A. Talmor, O. Tafjord, P. Clark, Y. Goldberg, and J. Berant, "Leap-of-thought: Teaching pre-trained models to systematically reason over implicit knowledge," Advances in Neural Information Processing Systems, vol. 33, pp. 20227–20237, 2020.
[20] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin et al., "A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions," ACM Transactions on Information Systems, 2024.
[21] W. Wang, L. Dong, H. Cheng, X. Liu, X. Yan, J. Gao, and F. Wei, "Augmenting language models with long-term memory," Advances in Neural Information Processing Systems, vol. 36, 2024.
[22] Z. C. Lipton, "The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery," Queue, vol. 16, no. 3, pp. 31–57, 2018.
[23] A. Kumar, V. Zhuang, R. Agarwal, Y. Su, J. D. Co-Reyes, A. Singh, K. Baumli, S. Iqbal, C. Bishop, R. Roelofs et al., "Training language models to self-correct via reinforcement learning," arXiv preprint arXiv:2409.12917, 2024.
[24] J. Wei et al., "Emergent abilities of large language models," arXiv preprint arXiv:2206.07682, 2022.
[25] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig, "PAL: Program-aided language models," in International Conference on Machine Learning. PMLR, 2023, pp. 10764–10799.
[26] K. Shuster, S. Poff, M. Chen, D. Kiela, and J. Weston, "Retrieval augmentation reduces hallucination in conversation," in Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 3784–3803.
[27] S. Ji, S. Pan, E. Cambria, P. Marttinen, and S. Y. Philip, "A survey on knowledge graphs: Representation, acquisition, and applications," IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 2, pp. 494–514, 2021.
[28] W. L. Hamilton et al., "Inductive representation learning on large graphs," Advances in Neural Information Processing Systems, 2017.
[29] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, "Toolformer: Language models can teach themselves to use tools," Advances in Neural Information Processing Systems, vol. 36, pp. 68539–68551, 2023.
[30] G. Mialon, R. Dessì, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Rozière, T. Schick, J. Dwivedi-Yu, A. Celikyilmaz, E. Grave, Y. LeCun, and T. Scialom, "Augmented language models: a survey," Transactions on Machine Learning Research, 2023, Survey Certification. [Online]. Available: https://openreview.net/forum?id=jh7wH2AzKK
[31] K. Cobbe et al., "Training verifiers to solve math word problems," arXiv preprint arXiv:2110.14168, 2021.
[32] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, "Measuring mathematical problem solving with the MATH dataset," arXiv preprint arXiv:2103.03874, 2021.
[33] R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi, "SWAG: A large-scale adversarial dataset for grounded commonsense inference," arXiv preprint arXiv:1808.05326, 2018.
[34] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, "Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge," arXiv preprint arXiv:1803.05457, 2018.
[35] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning, "HotpotQA: A dataset for diverse, explainable multi-hop question answering," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 2369–2380.
[36] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
[37] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., "Training language models to follow instructions with human feedback," Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
[38] C. Durkan, I. Murray, and G. Papamakarios, "On contrastive learning for likelihood-free inference," in International Conference on Machine Learning. PMLR, 2020, pp. 2771–2781.
[39] O. Tafjord, B. Dalvi, and P. Clark, "ProofWriter: Generating implications, proofs, and abductive statements over natural language," in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 3621–3634.
[40] E. First, M. N. Rabe, T. Ringer, and Y. Brun, "Baldur: Whole-proof generation and repair with large language models," in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023, pp. 1229–1241.
[41] J. Liu, L. Cui, H. Liu, D. Huang, Y. Wang, and Y. Zhang, "LogiQA: A challenge dataset for machine reading comprehension with logical reasoning," in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2021, pp. 3622–3628.
[42] A. Srivastava et al., "Beyond the imitation game: Quantifying and extrapolating the capabilities of language models," arXiv preprint arXiv:2206.04615, 2022.
[43] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
[44] Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela, "Adversarial NLI: A new benchmark for natural language understanding," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
[45] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, "Measuring massive multitask language understanding," arXiv preprint arXiv:2009.03300, 2020.
[46] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, "On calibration of modern neural networks," in International Conference on Machine Learning (ICML), 2017, pp. 1321–1330.
[47] B. Lake and M. Baroni, "Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks," in International Conference on Machine Learning. PMLR, 2018, pp. 2873–2882.
[48] M. Mitchell and D. C. Krakauer, "The debate over understanding in AI's large language models," Proceedings of the National Academy of Sciences, vol. 120, no. 13, p. e2215907120, 2023.
[49] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill et al., "On the opportunities and risks of foundation models," arXiv preprint arXiv:2108.07258, 2021.