Under review as a conference paper at ICLR 2025

Training Large Language Models to Reason in a Continuous Latent Space

Anonymous authors
Paper under double-blind review

Abstract
Large language models are restricted to reason in the “language space”, where they typically express the reasoning process with a chain-of-thought (CoT) to solve a complex reasoning problem. However, we argue that language space may not be the optimal reasoning space. For example, most word tokens are primarily for textual coherence and not essential for reasoning, while some critical tokens require complex planning and pose huge challenges to LLMs. To explore the potential of LLM reasoning in an unrestricted latent space instead of using human language, we introduce a new paradigm, Coconut (Chain of Continuous Thought). We utilize the last hidden state of the LLM as a representation of the reasoning state (termed “continuous thought”). Rather than decoding this into a word token, we feed it back to the LLM as the subsequent input embedding directly in the continuous space. Experiments show that Coconut can effectively augment the LLM on several reasoning tasks. It even outperforms CoT in certain logical reasoning tasks that require substantial planning, despite generating fewer tokens during inference. More interestingly, we observe an advanced reasoning pattern emerging from latent reasoning: the continuous thought can encode multiple potential next reasoning steps, allowing the model to perform a breadth-first search (BFS) to solve the problem, rather than prematurely committing to a single deterministic path like CoT. These findings demonstrate the promise of latent reasoning and offer valuable insights for future research on latent reasoning methods.
1 Introduction
Large language models (LLMs) have demonstrated remarkable reasoning abilities, emerging from extensive pretraining on human language (Dubey et al., 2024; Achiam et al., 2023). While next-token prediction is an effective training objective, it imposes a fundamental constraint on the LLM as a reasoning machine: the reasoning process of LLMs must be generated in word tokens. For example, a prevalent approach, known as chain-of-thought (CoT) reasoning (Wei et al., 2022), involves prompting or training LLMs to generate solutions step-by-step using natural language. However, this stands in stark contrast to human cognition. Neuroimaging studies have consistently shown that the language network – a set of brain regions responsible for language comprehension and production – remains largely inactive during various reasoning tasks (Amalric & Dehaene, 2019; Monti et al., 2012; 2007; 2009; Fedorenko et al., 2011). More evidence has indicated that human language is optimized for communication rather than reasoning (Fedorenko et al., 2024).
A significant problem arises when LLMs are required to output language during reasoning: the “reasoning amount” behind each token varies greatly, yet current LLM architectures allocate nearly the same computing budget for predicting every token. Most tokens in a reasoning chain are generated solely for fluency, contributing little to the actual reasoning process. On the contrary, some critical tokens require complex planning and pose huge challenges to LLMs. While previous work has attempted to fix these problems by prompting LLMs to generate succinct reasoning chains (Madaan & Yazdanbakhsh, 2022), or performing additional reasoning before generating some critical tokens (Zelikman et al., 2024), these solutions remain constrained within the language space and do not solve the problems fundamentally. Ideally, LLMs should be allowed to reason freely in an unconstrained latent space and only translate the outcomes into language once the reasoning process is complete.

Figure 1: A comparison of CoT and Coconut. In CoT, the model generates the reasoning process as a word token sequence (e.g., [x_i, x_{i+1}, ..., x_{i+j}] in the figure). Coconut (Chain of Continuous Thoughts) regards the last hidden state as a representation of the reasoning state (termed “continuous thought”), and directly uses it as the next input embedding. This allows the LLM to reason in an unrestricted latent space instead of language space.
We aim to explore LLM reasoning in the latent space by introducing a novel paradigm, Coconut (Chain of Continuous Thought). It involves a simple modification to the traditional CoT process. Instead of mapping between hidden states and language tokens using the language model head and embedding layer, Coconut directly feeds the last hidden state (a continuous thought) as the input embedding for the next token (Figure 1). This modification frees the reasoning from language space, and the architecture can be optimized end-to-end by gradient descent, as continuous thoughts are fully differentiable. To enhance the training of these continuous thoughts, we employ a multi-stage training strategy inspired by Deng et al. (2024), which effectively utilizes language reasoning chains to guide the training process.
The experiments demonstrate that Coconut successfully enhances the reasoning capabilities of LLMs. Specifically, on math reasoning problems (GSM8k, Cobbe et al., 2021), using more continuous thoughts is shown to be beneficial to reasoning accuracy, mirroring the effects of language reasoning chains. This indicates the potential to scale and solve increasingly challenging problems by chaining more continuous thoughts. On logical reasoning problems, including ProntoQA (Saparov & He, 2022) and our newly proposed ProsQA (Section 4.1), which requires stronger planning ability, Coconut and some of its variants even surpass language-based CoT methods, while generating significantly fewer tokens during inference.
Interestingly, the removal of language space constraints has led to a novel reasoning pattern. By manipulating the Coconut model to switch between latent reasoning and language reasoning, we are able to unveil the latent reasoning process. Unlike language-based reasoning, continuous thoughts in Coconut can encode multiple potential next steps simultaneously, allowing for a reasoning process akin to breadth-first search (BFS). While the model may not initially make the correct decision, it can maintain all possible options within the continuous thoughts and progressively eliminate incorrect paths through reasoning, guided by some implicit value functions. This advanced reasoning mechanism surpasses traditional CoT approaches, even though the model is not explicitly trained or instructed to operate in this manner, as seen in previous works (Yao et al., 2023; Hao et al., 2023). We believe that these findings underscore the potential of latent reasoning and could provide valuable insights for future research.
2 Related Work
Chain-of-thought (CoT) reasoning. We use the term chain-of-thought broadly to refer to methods that generate an intermediate reasoning process in language before outputting the final answer. This includes prompting LLMs (Wei et al., 2022; Khot et al., 2022; Zhou et al., 2022), or training LLMs to generate reasoning chains, either with supervised fine-tuning (Yue et al., 2023; Yu et al., 2023) or reinforcement learning (Wang et al., 2024; Havrilla et al., 2024; Shao et al., 2024; Yu et al., 2024a). Madaan & Yazdanbakhsh (2022) classified the tokens in CoT into symbols, patterns, and text, and proposed to guide the LLM to generate concise CoT based on an analysis of their roles. Recent theoretical analyses have demonstrated the usefulness of CoT from the perspective of model expressivity (Feng et al., 2023; Merrill & Sabharwal, 2023; Li et al., 2024). By employing CoT, the effective depth of the transformer increases because the generated outputs are looped back to the input (Feng et al., 2023). These analyses, combined with the established effectiveness of CoT, motivated our exploration of continuous thoughts, in contrast to other latent reasoning methods. While CoT has proven effective for certain tasks, its autoregressive generation nature makes it challenging to mimic human reasoning on more complex problems (LeCun, 2022; Hao et al., 2023), which typically require planning and search. There are works that equip LLMs with explicit tree search algorithms (Xie et al., 2023; Yao et al., 2023; Hao et al., 2023), or train the LLM on search dynamics and trajectories (Lehnert et al., 2024; Gandhi et al., 2024). In our analysis, we find that after removing the constraint of language space, a new reasoning pattern similar to BFS emerges, even though the model is not explicitly trained in this way.
Latent reasoning of LLMs. Previous works mostly define latent reasoning of LLMs as the hidden computation in transformers (Yang et al., 2024; Biran et al., 2024). Yang et al. (2024) constructed a dataset of two-hop reasoning problems and discovered that it is possible to recover the intermediate variable from the hidden representation of LLMs. Biran et al. (2024) further proposed to intervene in the latent reasoning by “back-patching” the hidden representation. Another line of work has discovered that, even if the model generates a CoT to reason, the model may actually utilize a different latent reasoning process. This phenomenon is known as the unfaithfulness of CoT reasoning (Wang et al., 2022; Turpin et al., 2024). To enhance the latent reasoning of LLMs, previous research proposed to augment it with additional tokens. Goyal et al. (2023) pretrained a model by randomly inserting a learnable <pause> token into the corpus. This improves the LLM’s performance on a variety of tasks, especially when followed by supervised finetuning with <pause> tokens. On the other hand, Pfau et al. (2024) further explored the usage of filler tokens, e.g., “...”, and concluded that they work well for highly parallelizable problems. However, these methods do not extend the expressivity of the LLM like CoT does (Pfau et al., 2024); hence, they may not scale to more general and complex reasoning problems. Recently, it has also been found that one can “internalize” chain-of-thought reasoning into latent reasoning with knowledge distillation (Deng et al., 2023) or a special training curriculum which gradually shortens the CoT (Deng et al., 2024). Yu et al. (2024b) also proposed to distill a model that can reason latently from data generated with complex reasoning algorithms. These training methods can be combined with our framework, and specifically, we find that breaking down the learning of continuous thoughts into multiple stages, inspired by iCoT (Deng et al., 2024), is very beneficial for the training.
3 Coconut: Chain of Continuous Thoughts

In this section, we introduce our new paradigm Coconut (Chain of Continuous Thoughts) for reasoning outside the language space. We begin by introducing the background and notation of language models. For an input sequence x = (x_1, ..., x_T), the standard large language model M can be described as:
H_t = \mathrm{Transformer}(E_t + P_t)
\mathcal{M}(x_{t+1} \mid x_{\le t}) = \mathrm{softmax}(W h_t)
where E_t = [e(x_1), e(x_2), ..., e(x_t)] is the sequence of token embeddings up to position t; P_t = [p(1), p(2), ..., p(t)] is the sequence of positional embeddings up to position t; H_t ∈ R^{t×d} is the matrix of the last hidden states for all tokens up to position t; h_t is the last hidden state of position t, i.e., h_t = H_t[t, :]; e(·) is the token embedding function; p(·) is the positional embedding function; and W is the parameter of the language model head.
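To make this notation concrete, the following minimal sketch (our illustration, not code released with this paper) extracts h_t and the next-token distribution softmax(W h_t) from the HuggingFace GPT-2 model that is later used as the base model; the prompt string is only an example.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

x = tokenizer("3 * 3 * 60 =", return_tensors="pt")   # input sequence x_1, ..., x_t
with torch.no_grad():
    out = model(**x, output_hidden_states=True)

H_t = out.hidden_states[-1]          # last hidden states of all positions, shape (1, t, d)
h_t = H_t[:, -1, :]                  # h_t = H_t[t, :], the last hidden state of position t
p_next = torch.softmax(model.lm_head(h_t), dim=-1)   # M(x_{t+1} | x_<=t) = softmax(W h_t)
```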
Method Overview. In the proposed Coconut method, the LLM switches between the “language mode” and “latent mode” (Figure 1). In language mode, the model operates as a standard language model, autoregressively generating the next token. In latent mode, it directly utilizes the last hidden state as the next input embedding. This last hidden state represents the current reasoning state, termed a “continuous thought”.
Special tokens <bot> and <eot> are employed to mark the beginning and end of the latent mode, respectively. As an example, we assume latent reasoning occurs between positions i and j, i.e., x_i = <bot> and x_j = <eot>. When the model is in the latent mode (i < t < j), we use the last hidden state from the previous token to replace the input embedding, i.e., E_t = [e(x_1), e(x_2), ..., e(x_i), h_i, h_{i+1}, ..., h_{t−1}]. After the latent mode finishes (t ≥ j), the input reverts to using the token embeddings, i.e., E_t = [e(x_1), e(x_2), ..., e(x_i), h_i, h_{i+1}, ..., h_{j−1}, e(x_j), ..., e(x_t)]. It is noteworthy that M(x_{t+1} | x_{≤t}) is not defined when i < t < j, since the latent thought is not intended to be mapped back to language space. However, softmax(W h_t) can still be calculated for probing purposes (see Section 4).

Figure 2: The training procedure of Coconut. At each stage, we integrate c additional continuous thoughts (c = 1 in this example) and remove one reasoning step from the training data. The cross-entropy loss is then calculated on the remaining tokens after the continuous thoughts.
Training Procedure. In this work, we focus on a problem-solving setting where the model receives a question as input and is expected to generate an answer through a reasoning process. We leverage language CoT data to supervise continuous thoughts by implementing a multi-stage training curriculum inspired by Deng et al. (2024). As shown in Figure 2, in the initial stage, the model is trained on regular CoT instances. In the subsequent stages, at the k-th stage, the first k reasoning steps in the CoT are replaced with k × c continuous thoughts¹, where c is a hyperparameter controlling the number of latent thoughts replacing a single language reasoning step. Following Deng et al. (2024), we also reset the optimizer state when training stages switch. We insert <bot> and <eot> tokens to encapsulate the continuous thoughts.

¹ If a reasoning chain is shorter than k steps, then all the language thoughts will be removed.

During the training process, we mask the loss on questions and latent thoughts. It is important to note that the objective does not encourage the continuous thought to compress the removed language thought, but rather to facilitate the prediction of future reasoning. Therefore, it is possible for the LLM to learn a more effective representation compared to language reasoning steps.
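The following sketch shows one way a stage-k training example could be assembled under this curriculum. It is our reconstruction, not the authors' released code: the <bot>, <eot>, and placeholder ids that mark where continuous thoughts are injected at forward time are hypothetical token ids, and the loss mask (-100) covers the question and the thoughts as described above.

```python
IGNORE_INDEX = -100  # positions excluded from the cross-entropy loss

def build_stage_example(question_ids, step_ids_list, answer_ids,
                        k, c, bot_id, eot_id, thought_id):
    """question_ids: list[int]; step_ids_list: one token list per language reasoning step."""
    n_thoughts = min(k, len(step_ids_list)) * c        # footnote 1: short chains lose all steps
    kept_steps = [t for step in step_ids_list[k:] for t in step]
    input_ids = (question_ids + [bot_id] + [thought_id] * n_thoughts + [eot_id]
                 + kept_steps + answer_ids)
    # Supervise only the remaining language reasoning and the answer.
    labels = ([IGNORE_INDEX] * (len(question_ids) + 1 + n_thoughts)
              + [eot_id] + kept_steps + answer_ids)
    return input_ids, labels
```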
Training Details. Our proposed continuous thoughts are fully differentiable, allowing backpropagation. We perform n + 1 forward passes when n latent thoughts are scheduled in the current training stage, computing a new latent thought with each pass and then conducting an additional forward pass to obtain a loss on the remaining text sequence. While we can save any repetitive computing by using KV cache, the sequential nature of the multiple forward passes poses challenges for parallelism. Further optimizing the training efficiency of Coconut remains an important direction for future research.
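A sketch of these n + 1 passes is shown below (our reading of the procedure, with the KV cache omitted for clarity, so shared prefixes are simply recomputed). The helper assumes prefix_ids covers the question and <bot>, and suffix_ids/suffix_labels cover <eot>, the remaining language steps, and the answer, with masked positions set to -100.

```python
import torch
import torch.nn.functional as F

def coconut_forward(model, prefix_ids, suffix_ids, suffix_labels, n_thoughts):
    emb = model.get_input_embeddings()
    seq = emb(prefix_ids)                              # (1, P, d): question + <bot>
    for _ in range(n_thoughts):                        # n passes, one continuous thought each
        out = model(inputs_embeds=seq, output_hidden_states=True)
        thought = out.hidden_states[-1][:, -1:, :]     # last hidden state = continuous thought
        seq = torch.cat([seq, thought], dim=1)         # fed back as the next input embedding
    full = torch.cat([seq, emb(suffix_ids)], dim=1)    # final pass over the remaining text
    logits = model(inputs_embeds=full).logits
    S = suffix_ids.size(1)
    pred = logits[:, -(S + 1):-1, :]                   # positions that predict the suffix tokens
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                           suffix_labels.reshape(-1), ignore_index=-100)
```

Because the continuous thoughts are never detached from the computation graph, gradients flow through all n + 1 passes, which is what makes the thoughts trainable end-to-end.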
Inference Process. The inference process for Coconut is analogous to standard language model decoding, except that in latent mode, we directly feed the last hidden state as the next input embedding. A challenge lies in determining when to switch between latent and language modes. As we focus on the problem-solving setting, we insert a <bot> token immediately following the question tokens. For <eot>, we consider two potential strategies: a) train a binary classifier on latent thoughts to enable the model to autonomously decide when to terminate the latent thoughts, or b) always pad the latent thoughts to a constant length. We found that both approaches work comparably well. Therefore, we use the second option in our experiments for simplicity, unless specified otherwise.
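For concreteness, the sketch below implements option (b) with greedy decoding. It is our reconstruction, and the <bot>/<eot> strings are assumed to be special tokens already added to the tokenizer and model vocabulary.

```python
import torch

@torch.no_grad()
def coconut_generate(model, tokenizer, question, n_thoughts,
                     bot="<bot>", eot="<eot>", max_new_tokens=64):
    emb = model.get_input_embeddings()
    seq = emb(tokenizer(question + bot, return_tensors="pt").input_ids)
    for _ in range(n_thoughts):                        # latent mode: constant-length padding
        h = model(inputs_embeds=seq, output_hidden_states=True).hidden_states[-1][:, -1:, :]
        seq = torch.cat([seq, h], dim=1)
    seq = torch.cat([seq, emb(tokenizer(eot, return_tensors="pt").input_ids)], dim=1)
    answer = []
    for _ in range(max_new_tokens):                    # language mode: greedy decoding
        next_id = model(inputs_embeds=seq).logits[:, -1, :].argmax(dim=-1, keepdim=True)
        if next_id.item() == tokenizer.eos_token_id:
            break
        answer.append(next_id.item())
        seq = torch.cat([seq, emb(next_id)], dim=1)
    return tokenizer.decode(answer)
```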

4 Experiments
In this section, we validate the feasibility of LLM reasoning in latent space through experiments on three datasets. We mainly evaluate the accuracy by comparing the model-generated answers with the ground truth. The number of newly generated tokens per question is also listed as a measure of reasoning efficiency.²

² One continuous thought is counted as one token since the computational cost is essentially the same.
4.1 Datasets
Math Reasoning. We use GSM8k (Cobbe et al., 2021) as the dataset for math reasoning. It consists of grade-school-level math problems. Compared to the other datasets in our experiments, the problems are more diverse and open-domain, closely resembling real-world use cases. Through this task, we explore the potential of latent reasoning in practical applications. To train the model, we use a synthetic dataset generated by Deng et al. (2023).
Logical Reasoning. Logical reasoning involves the proper application of known conditions to prove or disprove a conclusion using logical rules. This requires the model to choose from multiple possible reasoning paths, where the correct decision often relies on exploration and planning ahead. It serves as a simplified simulation of more advanced reasoning tasks, such as automated theorem proving (Chen et al., 2023; DeepMind, 2024). We use 5-hop ProntoQA (Saparov & He, 2022) questions with fictional concept names. For each problem, a tree-structured ontology is randomly generated and described in natural language as a set of known conditions. The model is asked to judge whether a given statement is correct based on these conditions.

We found that the generation process of ProntoQA was overly simplified, especially since the size of distracting branches in the ontology is always small, reducing the need for complex planning. To fix that, we apply a new dataset construction pipeline using randomly generated DAGs to structure the known conditions. The resulting dataset requires the model to perform substantial planning and searching over the graph to find the correct reasoning chain. We refer to this new dataset as ProsQA (Proof with Search Question-Answering). A visualized example is shown in Figure 6. More details of the datasets can be found in Appendix A.
4.2 Experimental Setup
We use a pre-trained GPT-2 (Radford et al., 2019) as the base model for all experiments. The learning rate is set to 1 × 10^−4 and the effective batch size is 128. Following Deng et al. (2024), we also reset the optimizer when training stages switch.
Math Reasoning. By default, we use 2 latent thoughts (i.e., c = 2) for each reasoning step. We analyze the correlation between performance and c in Section 4.4. The model goes through 3 stages besides the initial stage. We then add an additional stage, where we still use 3 × c continuous thoughts as in the last stage, but remove all the remaining language reasoning chain. This handles the long-tail distribution of reasoning chains longer than 3 steps. We train the model for 6 epochs in the initial stage, and 3 epochs in each remaining stage.
Logical Reasoning. We use one continuous thought for every reasoning step (i.e., c = 1). The model goes through 6 training stages in addition to the initial stage, because the maximum number of reasoning steps is 6 in these two datasets, and the model fully reasons with continuous thoughts to solve the problems in the last stage. We train the model for 5 epochs per stage.
For all datasets, after the standard schedule, the model stays in the final training stage until the 50th epoch. We select the checkpoint based on the accuracy on the validation set. For inference, we manually set the number of continuous thoughts to be consistent with the final training stage. We use greedy decoding for all experiments.
4.3 Baselines and Ablations
We consider the following baselines: (1) CoT: We use the complete reasoning chains to train the language model with supervised finetuning, and during inference, the model generates a reasoning chain before outputting an answer. (2) No-CoT: The LLM is trained to directly generate the answer without using a reasoning chain. (3) iCoT (Deng et al., 2024): The model is trained with language reasoning chains and follows a carefully designed schedule that “internalizes” CoT. As the training goes on, tokens at the beginning of the reasoning chain are gradually removed until only the answer remains. During inference, the model directly predicts the answer. (4) Pause token (Goyal et al., 2023): The model is trained using only the question and answer, without a reasoning chain. However, different from No-CoT, special <pause> tokens are inserted between the question and answer, which are believed to provide the model with additional computational capacity to derive the answer. For a fair comparison, the number of <pause> tokens is set to be the same as the number of continuous thoughts in Coconut.

We also evaluate some variants of our method: (1) w/o curriculum: Instead of the multi-stage training, we directly use the data from the last stage, which only includes questions and answers, to train Coconut. The model uses continuous thoughts to solve the whole problem. (2) w/o thought: We keep the multi-stage training, which removes initial reasoning steps gradually, but do not use any continuous latent thoughts. While this is similar to iCoT in the high-level idea, the exact training schedule is set to be consistent with Coconut, instead of iCoT. This ensures a stricter comparison. (3) Pause as thought: We use special <pause> tokens to replace the continuous thoughts, and apply the same multi-stage training scheme as Coconut.

Method               GSM8k                ProntoQA             ProsQA
                     Acc. (%)   # Tokens  Acc. (%)   # Tokens  Acc. (%)   # Tokens
CoT                  42.9 ±0.2    25.0    98.8 ±0.8    92.5    77.5 ±1.9    49.4
No-CoT               16.5 ±0.5     2.2    93.8 ±0.7     3.0    76.7 ±1.0     8.2
iCoT                 30.0*         2.2    99.8 ±0.3     3.0    98.2 ±0.3     8.2
Pause Token          16.4 ±1.8     2.2    77.7 ±21.0    3.0    75.9 ±0.7     8.2
Coconut (Ours)       34.1 ±1.5     8.2    99.8 ±0.2     9.0    97.0 ±0.3    14.2
- w/o curriculum     14.4 ±0.8     8.2    52.4 ±0.4     9.0    76.1 ±0.2    14.2
- w/o thought        21.6 ±0.5     2.3    99.9 ±0.1     3.0    95.5 ±1.1     8.2
- pause as thought   24.1 ±0.7     2.2   100.0 ±0.1     3.0    96.6 ±0.8     8.2

Table 1: Results on three datasets. Higher accuracy indicates stronger reasoning ability, while generating fewer tokens indicates better efficiency. *The result of iCoT is from Deng et al. (2024).
4.4 Results and Discussion
We show the overall results on all datasets in Table 1. Continuous thoughts effectively enhance LLM reasoning, as shown by the consistent improvement over No-CoT. Coconut even shows better performance than CoT on ProsQA. We describe several key conclusions from the experiments as follows.
“Chaining” continuous thoughts enhances reasoning. In conventional CoT, the output token serves as the next input, which is believed to increase the effective depth of LLMs and enhance their expressiveness (Feng et al., 2023). We explore whether latent space reasoning retains this property, as it would suggest that this method could scale to solve increasingly complex problems by chaining multiple latent thoughts.

In our experiments with GSM8k, we found that Coconut outperformed other architectures trained with similar strategies, particularly surpassing the latest baseline, iCoT (Deng et al., 2024). The performance is significantly better than Coconut (pause as thought), which also enables more computation in the LLMs. While Pfau et al. (2024) empirically show that filler tokens, such as the special <pause> tokens, can benefit highly parallelizable problems, our results show that the Coconut architecture is more effective for general problems, e.g., math word problems, where a reasoning step often heavily depends on previous steps. Additionally, we experimented with adjusting the hyperparameter c, which controls the number of latent thoughts corresponding to one language reasoning step. As we increased c from 0 to 1 to 2, the model’s performance steadily improved (Figure 3). These results strongly suggest that a chaining effect similar to CoT can be observed in the latent space.

Figure 3: Accuracy on GSM8k with different numbers of continuous thoughts.
In the two other synthetic tasks, we found that the variants of Coconut (w/o thoughts or pause as thought) and iCoT also achieve impressive accuracy. This indicates that in these tasks, the model’s computational capacity may not be the bottleneck. In contrast, GSM8k, being an open-domain question-answering task, likely involves more complex contextual understanding and modeling, placing higher demands on computational capability.
Latent reasoning excels over language reasoning in planning. Some complex reasoning tasks require the model to “look ahead” to assess whether a particular step is the right choice. Among the datasets used in our experiments, GSM8k consists of grade-school-level math word problems, allowing for intuitive judgment of the next reasoning step; ProntoQA has distracting branches of small sizes, which makes it relatively easy to determine the next step too. In contrast, ProsQA is based on a randomly generated DAG structure, posing a significant challenge to the model’s planning abilities. Reasoning in language space cannot effectively solve the problem. As shown in the table, CoT does not show significant improvement over No-CoT. On the contrary, Coconut, some of its variants, and iCoT significantly improve reasoning on ProsQA. This suggests an advantage in using latent space over language tokens for tasks requiring extensive planning. We conduct an in-depth analysis of the latent reasoning process in Section 5.
The LLM still needs guidance to learn continuous thoughts. In the ideal case, the model should learn the most effective continuous thoughts automatically through gradient descent on questions and answers (i.e., Coconut w/o curriculum). However, from the experimental results, we found that models trained this way do not perform any better than No-CoT.
With the multi-stage curriculum, which decomposes the training into easier objectives, Coconut is able to achieve top performance across various tasks. The multi-stage training also integrates well with pause tokens (Coconut - pause as thought). Despite using the same architecture and similar multi-stage training objectives, we observed a small gap between the performance of iCoT and Coconut (w/o thoughts). The finer-grained removal schedule (token by token) and a few other tricks in iCoT may ease the training process. We leave combining iCoT and Coconut as future work. While the multi-stage training used for Coconut has proven effective, further research is definitely needed to develop better and more general strategies for learning reasoning in latent space, especially without supervision from language reasoning chains.

Figure 4: A case study where we decode the continuous thought into language tokens.
Continuous thoughts are efficient representations of reasoning. Though the continuous thoughts are not intended to be decoded to language tokens, we can still use them as an intuitive interpretation of the latent reasoning. We show a case study in Figure 4 of a math word problem solved by Coconut (c = 1). The first continuous thought can be decoded into tokens like “180”, “ 180” (with a space), and “9”. Note that the reasoning trace for this problem should be 3 × 3 × 60 = 9 × 60 = 540, or 3 × 3 × 60 = 3 × 180 = 540. The interpretations of the first thought happen to be the first intermediate variables in the calculation. Moreover, it encodes a distribution over different traces in the continuous thoughts. As shown in Section 5.3, this feature enables a more advanced reasoning pattern for planning-intensive reasoning tasks.
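The decoding used for this kind of case study can be reproduced with a simple probe: project a continuous thought through the language-model head (the softmax(W h_t) noted in Section 3) and inspect the top candidates. The helper below is our sketch of that probing step, not the authors' exact tooling.

```python
import torch

def probe_thought(model, tokenizer, thought, top_k=5):
    """thought: a continuous thought, i.e., a last hidden state of shape (1, d)."""
    probs = torch.softmax(model.lm_head(thought), dim=-1)   # distribution over the vocabulary
    top = torch.topk(probs, top_k, dim=-1)
    return [(tokenizer.decode([idx]), p.item())
            for idx, p in zip(top.indices[0].tolist(), top.values[0])]
```

Applied to the first thought in the case study above, such a probe surfaces tokens like “180” and “9”, i.e., the intermediate variables of the two valid traces.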
Figure 5: The answer accuracy and graph metrics (see Section 5.1) of multiple variants of Coconut and baselines on ProsQA.
5 Understanding the Latent Reasoning in Coconut
Since Coconut enables the switch between language and continuous space reasoning, we are able to manipulate the model to output language at a certain point, and then infer the preceding latent reasoning process from it. This is especially helpful to understand why Coconut and some other latent reasoning methods can outperform CoT on ProsQA while generating far fewer tokens. Our analysis surprisingly indicates that Coconut allows the LLM to develop a fundamentally different reasoning pattern than CoT. The continuous thought not only encodes multiple partial reasoning paths, but also enables a latent search process similar to BFS.
5.1 Experimental Setup
Method. We slightly modify the original training curriculum, so that at any training stage, data from other stages is mixed in with a certain probability (p = 0.3). This prevents the model from forgetting earlier stages after the complete training schedule. Therefore, it allows us to control the number of latent thoughts during inference by manually setting the <eot> token. When we enforce Coconut to use k continuous thoughts during inference, the model should output the remaining reasoning chain in language, starting from the (k + 1)-th step. In our experiments, we test variants with k ∈ {0, 1, 2, 3, 4, 5, 6}. Note that all these variants only differ at inference time while sharing the same model weights. Besides, we report the performance of CoT and No-CoT as references.
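A minimal sketch of this stage mixing (our reading of the description; the exact sampling scheme may differ) is:

```python
import random

def sample_training_stage(scheduled_stage, max_stage, p_mix=0.3):
    # With probability p_mix, format the example for a random stage instead of the
    # scheduled one, so the model keeps supporting any number of continuous thoughts.
    if random.random() < p_mix:
        return random.randint(0, max_stage)
    return scheduled_stage
```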
Metrics. We apply two sets of evaluation metrics. One of them is based on the correctness of the final answer, regardless of the reasoning process; it is the metric used in the main results (Section 4.4). To enable fine-grained analysis, we define another metric on the reasoning process. Assuming we have a complete language reasoning chain which specifies a path in the graph, we can classify it into (1) Correct Path: the output is one of the shortest paths to the correct answer; (2) Longer Path: a valid path that correctly answers the question but is longer than the shortest path; (3) Hallucination: the path includes nonexistent edges or is disconnected; (4) Wrong Target: a valid path in the graph, but the destination node is not the one being asked. These four categories naturally apply to the output from Coconut (k = 0) and CoT, which generate the full path. For Coconut with k > 0, which outputs only partial paths in language (with the initial steps in continuous reasoning), we classify the reasoning as a Correct Path if a valid explanation can complete it. We define Longer Path and Wrong Target for partial paths similarly. If no valid explanation completes the path, it is classified as Hallucination. In No-CoT and Coconut with larger k, the model may output only the final answer without any partial path; these cases fall into (5) Correct Label or (6) Incorrect Label. These six categories cover all cases without overlap.
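The sketch below (our implementation of these definitions, assuming the ontology is available as a directed graph) classifies a fully verbalized path; for Coconut with k > 0, the same checks would be applied to the best valid completion of the partial path, as described above.

```python
import networkx as nx

def classify_full_path(graph: nx.DiGraph, path, source, target):
    """path: the list of nodes named in the generated reasoning chain."""
    if not path or path[0] != source or \
       any(not graph.has_edge(u, v) for u, v in zip(path, path[1:])):
        return "Hallucination"                 # nonexistent edge or disconnected path
    if path[-1] != target:
        return "Wrong Target"                  # valid path, but wrong destination
    if len(path) - 1 == nx.shortest_path_length(graph, source, target):
        return "Correct Path"
    return "Longer Path"
```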

Figure 6: A case study of ProsQA. The model trained with CoT hallucinates an edge (Every yumpus is a rempus) after getting stuck in a dead end. Coconut (k = 1) outputs a path that ends with an irrelevant node. Coconut (k = 2) answers the question correctly.
5.2 Results
Figure 5 shows a comparative analysis of different reasoning methods on ProsQA. As more continuous thoughts are used, both the answer accuracy and the probability of predicting the correct path gradually increase. The rate of hallucination, which often occurs when the model makes a wrong move and gets stuck in a dead end, also decreases. A case study is shown in Figure 6, where CoT hallucinates a nonexistent edge, Coconut (k = 1) leads to a wrong target, but Coconut (k = 2) successfully solves the problem. In this example, the model cannot accurately determine which edge to choose at the earlier step. However, as latent reasoning can avoid making a hard choice upfront, the model can progressively eliminate incorrect options in subsequent steps and achieve higher accuracy at the end of reasoning. We show more evidence and details of this reasoning process in Section 5.3.
The comparison between CoT and Coconut (k = 0) reveals another interesting fact: even when Coconut is forced to generate a complete reasoning chain, the accuracy of its answers is still higher than that of CoT. The generated reasoning paths are also more accurate, with less hallucination. From this, we can infer that the training method of mixing different stages improves the model’s ability to plan ahead. The training objective of CoT always concentrates on the generation of the immediate next step, making the model “shortsighted”. In later stages of Coconut training, the first few steps are excluded, allowing the model to focus more on future steps. This is similar to the principle of multi-token prediction pretraining (Gloeckle et al., 2024), which also helps improve the LLM’s ability to plan ahead. We leave a more detailed analysis of this phenomenon to future work.
5.3 Interpreting the Latent Tree Search
We can infer the reasoning encoded in latent thoughts from the model’s subsequent outputs (Figure 7). For instance, if we force the model to switch back to the language space after one latent thought (by placing <eot>), under greedy decoding, the model predicts “every lempus is a scrompus” as the next step. In this case, we have to assume that the latent thought encodes “Alex is a lempus”. Furthermore, if we do not use greedy decoding, and instead output the probability distribution of the token predicted by the model at the position of “lempus”, we can obtain a distribution over the reasoning steps encoded by the latent thought. Similarly, we can get the probability of nodes in the second reasoning step (Figure 7, right). This probability distribution can be viewed as the model’s implicit value function, that is, the estimated potential of a node to lead to the correct target. As shown in the figure, “lempus”, “zhorpus”, “grimpus”, and “sterpus” have a probability of 0.33, 0.16, 0.32, and 0.01, respectively. This indicates that in the first continuous thought, the model has mostly ruled out “sterpus” as an option but is still unable to determine which of the remaining three is the correct choice.
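The implicit value estimate described here can be computed as sketched below (our reconstruction, following the procedure in the Figure 7 caption: the probability of a candidate concept is the product of its token probabilities, conditioned on the preceding context, which here ends with the continuous thoughts and the beginning of the next step).

```python
import torch

@torch.no_grad()
def concept_probability(model, tokenizer, context_embeds, concept_text):
    """context_embeds: (1, T, d) embeddings ending right before where the concept would appear."""
    emb = model.get_input_embeddings()
    ids = tokenizer(concept_text, return_tensors="pt").input_ids
    seq, prob = context_embeds, 1.0
    for i in range(ids.size(1)):
        p_next = torch.softmax(model(inputs_embeds=seq).logits[:, -1, :], dim=-1)
        prob *= p_next[0, ids[0, i]].item()            # P(next token = i-th token of the concept)
        seq = torch.cat([seq, emb(ids[:, i:i + 1])], dim=1)
    return prob
```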

Figure 7: An illustration of the latent search trees. The example is consistent with Figure 6. The height of a node (denoted as h in the figure) is defined as the longest distance to any leaf node in the graph. We calculate the probability of the first concept predicted by the model following the latent thoughts (e.g., “lempus” in the left figure). It is calculated as the product of the probabilities of all tokens within the concept, conditioned on the previous context (omitted in the figure for brevity). This metric can be interpreted as an implicit value function estimated by the model, assessing the potential of each node to lead to the correct answer.
A significant difference between “sterpus” and the other three options is that it is a leaf node (see Figure 6). In contrast, the other three nodes can still be further explored, which makes them harder to evaluate. We measure the height of each node, i.e., the shortest distance to any leaf nodes in the graph, as a proxy for the room for exploration. Based on the case discussed above, a natural hypothesis is that the lower a node is, the easier it is to estimate its value accurately. Indeed, in this case, the model is confused between “grimpus” and “lempus”, both of which have a height of 2.
To validate this hypothesis, we analyze the first and second latent steps on the whole test set. We can clearly see a trend in Figure 8. Generally, the model can effectively differentiate between correct and incorrect nodes (defined by whether they lead to the correct target node) when their heights are small, i.e., assigning a small value to incorrect nodes. However, it tends to become less accurate as the node heights increase. We can conclude that the model is not capable of doing an exhaustive search to evaluate the potential of a node, but relies more on heuristic features like the heights.

Figure 8: The correlation between prediction probability of concepts and their heights.
Therefore, it is intuitive to understand why more latent thoughts make reasoning easier: as the search tree expands, the nodes under consideration are expected to have smaller heights. When the height is small, the model is better at distinguishing correct nodes from incorrect nodes, and is more likely to output a correct answer.
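The height statistic used in this analysis can be computed directly on the ProsQA graph. The sketch below is our code; it uses the longest-distance-to-a-leaf convention from the Figure 7 caption, and replacing max with min gives the shortest-distance variant mentioned above.

```python
import networkx as nx

def node_heights(graph: nx.DiGraph):
    heights = {}
    # Process children before parents so every successor's height is already known.
    for v in reversed(list(nx.topological_sort(graph))):
        succ = list(graph.successors(v))
        heights[v] = 0 if not succ else 1 + max(heights[u] for u in succ)
    return heights
```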
6 Conclusion
In this paper, we presented Coconut, a novel paradigm for reasoning in a continuous latent space, aimed at addressing the inherent inefficiencies associated with traditional language-based reasoning in large language models. Through extensive experimentation on various datasets, we demonstrated that Coconut significantly enhances LLM reasoning capabilities. Notably, our detailed analysis highlighted how an unconstrained latent space allows the model to develop an effective reasoning pattern similar to BFS. We anticipate that our findings will inspire further research into latent reasoning methods, contributing to the development of more intelligent machine reasoning systems.

References
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Marie Amalric and Stanislas Dehaene. A distinct cortical network for mathematical knowledge in the human brain. NeuroImage, 189:19–31, 2019.

Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, and Amir Globerson. Hopping too late: Exploring the limitations of large language models on multi-hop queries. arXiv preprint arXiv:2406.12775, 2024.

Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7889–7901, 2023.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Google DeepMind. AI achieves silver-medal standard solving International Mathematical Olympiad problems, 2024. URL https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/.

Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, and Stuart Shieber. Implicit chain of thought reasoning via knowledge distillation. arXiv preprint arXiv:2311.01460, 2023.

Yuntian Deng, Yejin Choi, and Stuart Shieber. From explicit cot to implicit cot: Learning to internalize cot step by step. arXiv preprint arXiv:2405.14838, 2024.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Evelina Fedorenko, Michael K Behr, and Nancy Kanwisher. Functional specificity for high-level linguistic processing in the human brain. Proceedings of the National Academy of Sciences, 108(39):16428–16433, 2011.

Evelina Fedorenko, Steven T Piantadosi, and Edward AF Gibson. Language is primarily a tool for communication rather than thought. Nature, 630(8017):575–586, 2024.

Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36, 2023.

Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D Goodman. Stream of search (sos): Learning to search in language. arXiv preprint arXiv:2404.03683, 2024.

Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737, 2024.

Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226, 2023.

Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023.

Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, and Roberta Raileanu. Teaching large language models to reason with reinforcement learning. arXiv preprint arXiv:2403.04642, 2024.

Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022.

Yann LeCun. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 62(1):1–62, 2022.

Lucas Lehnert, Sainbayar Sukhbaatar, Paul Mcvay, Michael Rabbat, and Yuandong Tian. Beyond a*: Better planning with transformers via search dynamics bootstrapping. arXiv preprint arXiv:2402.14083, 2024.

Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. arXiv preprint arXiv:2402.12875, 2024.

Aman Madaan and Amir Yazdanbakhsh. Text and patterns: For effective chain of thought, it takes two to tango. arXiv preprint arXiv:2209.07686, 2022.

William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923, 2023.

Martin M Monti, Daniel N Osherson, Michael J Martinez, and Lawrence M Parsons. Functional neuroanatomy of deductive inference: a language-independent distributed network. Neuroimage, 37(3):1005–1016, 2007.

Martin M Monti, Lawrence M Parsons, and Daniel N Osherson. The boundaries of language and thought in deductive inference. Proceedings of the National Academy of Sciences, 106(30):12554–12559, 2009.

Martin M Monti, Lawrence M Parsons, and Daniel N Osherson. Thought beyond language: neural dissociation of algebra and natural language. Psychological Science, 23(8):914–922, 2012.

Jacob Pfau, William Merrill, and Samuel R Bowman. Let's think dot by dot: Hidden computation in transformer language models. arXiv preprint arXiv:2404.15758, 2024.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Abulhair Saparov and He He. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. arXiv preprint arXiv:2210.01240, 2022.

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Yu Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don't always say what they think: unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36, 2024.

Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer, and Huan Sun. Towards understanding chain-of-thought prompting: An empirical study of what matters. arXiv preprint arXiv:2212.10001, 2022.

Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9426–9439, 2024.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
12
Under review as a conference paper at ICLR 2025

648
Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, James Xu Zhao, Min-Yen Kan, Junxian He, and Michael
649 Xie. Self-evaluation guided beam search for reasoning. Advances in Neural Information Process-
650 ing Systems, 36, 2023.
651
652 Sohee Yang, Elena Gribovskaya, Nora Kassner, Mor Geva, and Sebastian Riedel. Do large language
653 models latently perform multi-hop reasoning? arXiv preprint arXiv:2402.16837, 2024.
654 Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik
655 Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Ad-
656 vances in Neural Information Processing Systems, 36, 2023.
657
658 Fangxu Yu, Lai Jiang, Haoqiang Kang, Shibo Hao, and Lianhui Qin. Flow of reasoning: Efficient
659
training of llm policy with divergent thinking. arXiv preprint arXiv:2406.05673, 2024a.
660 Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhen-
661 guo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions
662 for large language models. arXiv preprint arXiv:2309.12284, 2023.
663
664
Ping Yu, Jing Xu, Jason Weston, and Ilia Kulikov. Distilling system 2 into system 1. arXiv preprint
arXiv:2407.06023, 2024b.
665
666 Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen.
667 Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint
668 arXiv:2309.05653, 2023.
669
Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D Goodman.
670
Quiet-star: Language models can teach themselves to think before speaking. arXiv preprint
671
arXiv:2403.09629, 2024.
672
673 Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuur-
674 mans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex
675 reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
# Nodes    # Edges    Len of Shortest Path    # Shortest Paths
23.0       36.0       3.8                     1.6

Table 2: Statistics of the graph structure in ProsQA.

Dataset     Training    Validation    Test
GSM8k       385,620     500           1,319
ProntoQA    9,000       200           800
ProsQA      17,886      300           500

Table 3: Statistics of the datasets.
A Datasets

A.1 Construction of ProsQA
To construct the dataset, we need to define a set of entities (typical names like “Alex”, “Jack”, etc.) and a set of concepts (fictional words like “lorpus”, “rorpus”, etc., following Saparov & He (2022)).

The desired problem form is “Is [Entity] a [Concept A] or [Concept B]?”. Assuming the correct answer is [Concept A], we need to construct a graph such that we can find a path between [Entity] and [Concept A], and make sure [Entity] and [Concept B] are not connected.
The overall idea to build the DAG is to gradually add more nodes. Every time a new node comes in, we randomly add edges from existing nodes to the new node. We first sample the in-degree following a Poisson distribution with a mean equal to 1.5, then sample the parents for this node. In this process, we need to make sure that no entity or concept can be the ancestor of both [Concept A] and [Concept B], in order to make a valid binary-choice problem. Besides, we want to keep the families of [Concept A] and [Concept B] at similar sizes; otherwise the model may learn shortcuts.
family of [Concept A] and [Concept B] of similar sizes, otherwise the model may learn shortcuts.
730
731 Therefore, we implement a graph construction pipeline as follows: First, we initialize two nodes
732 with labels 1 and 2. Then, for each new node, there is a probability p (p = 0.35) that it can only
733 accept edges from nodes with label 1; and another probability p (p = 0.35) that it can only accept
734
edges from nodes with label 2; otherwise the node can accepts edges from any nodes. After sampling
the incoming edges for the node, it will be assigned a label: 1 if all the parent nodes have label 1; 2
735
if all the parent nodes have label 2; 3 if there are both parent nodes with label 1 and 2; 0 if there are
736
no parent nodes or all parent nodes are labeled 0.
737
738 All nodes without parents will be assigned an entity name, while others are given a concept names.
739 These form the known conditions. To get the question, we use the first node as the [Entity], a node
740
labeled with 1 as [Concept A], a node labeled with 2 as [Concept B]. The construction will ensure
there is always a path from [Entity] to [Concept A] but not [Concept B]. We will find the [Concept A]
741
and [Concept B] that makes the reasoning chain relatively long. Note that after rendering the graph
742
into natural language, we will permute the position of [Concept A] and [Concept B] randomly.
743 Given the symmetry of label 1 and 2, there is no risk for shortcut in the position of choice.
744
745 The statistics of the resulting dataset is listed in Table 2
746
747 A.2 S TATISTICS
748
We show the size of all datasets in Table 3.
749
750
751
752
753
754
755

14

You might also like