
Token-Budget-Aware LLM Reasoning

Tingxu Han¹, Chunrong Fang*¹, Shiyu Zhao², Shiqing Ma³, Zhenyu Chen¹, Zhenting Wang†²
¹Nanjing University  ²Rutgers University  ³UMass Amherst

*Corresponding Author. †Project Lead.
Pre-print with preliminary results, work in progress.
arXiv:2412.18547v1 [cs.CL] 24 Dec 2024

Abstract

Reasoning is critical for large language models (LLMs) to excel in a wide range of tasks. While methods like Chain-of-Thought (CoT) reasoning enhance LLM performance by decomposing problems into intermediate steps, they also incur significant overhead in token usage, leading to increased costs. We find that the reasoning process of current LLMs is unnecessarily lengthy and can be compressed by including a reasonable token budget in the prompt, but the choice of token budget plays a crucial role in the actual compression effectiveness. We then propose a token-budget-aware LLM reasoning framework, which dynamically estimates token budgets for different problems based on reasoning complexity and uses the estimated token budgets to guide the reasoning process. Experiments show that our method effectively reduces token costs in CoT reasoning with only a slight performance reduction, offering a practical solution to balance efficiency and accuracy in LLM reasoning. Code: https://github.com/GeniusHTX/TALE.

"It is not enough to have a good mind; the main thing is to use it well." − René Descartes

1 Introduction

Reasoning plays a crucial role in enabling large language models (LLMs) to perform effectively across a wide range of tasks (Zhou et al., 2022; Hao et al., 2023, 2024a). A variety of methods have been proposed to enhance the reasoning capabilities of large language models (Suzgun et al., 2022; Wang et al., 2023; Feng et al., 2023; Yao et al., 2024a; Xie et al., 2024). Among these, Chain-of-Thought (CoT) (Wei et al., 2022) is the most representative and widely adopted approach. It enhances the reliability of the model's answers by guiding large language models with the prompt "Let's think step by step", encouraging them to decompose the problem into intermediate steps and solve each before arriving at the final answer. Figure 1a and Figure 1b illustrate an intuitive example. Observe that without CoT, the LLM produces an incorrect answer to the question. With a CoT-enhanced prompt, the LLM systematically breaks the question into multiple steps and reasons through each step sequentially. By addressing each step incrementally, the LLM eventually arrives at the correct answer.

Although reasoning enhancement approaches such as CoT impressively improve LLM performance, they produce substantial additional overhead, specifically in the form of an increased number of output tokens (Wei et al., 2022; Feng et al., 2023; Yao et al., 2024a). As shown in Figure 1b, the answer to the CoT prompt has a notably higher token cost due to the detailed intermediate reasoning steps included in the output. Such high token costs can lead to significant expenses, including increased computational resource usage and longer running times during the LLM inference phase, ultimately resulting in significant additional monetary and energy costs.

This raises an important question: "Is the reasoning process of current LLMs unnecessarily lengthy, and how can it be compressed?" Nayab et al. (2024) demonstrate that LLMs have the potential to follow a length constraint in the prompt. Building on this, we find that including a token budget (see Table 1) in the prompt is a promising approach to compressing the CoT reasoning tokens. However, the choice of token budget plays a crucial role in the actual compression effectiveness. For example, Figure 1d illustrates that including a reasonable token budget (e.g., 50 tokens in this case) in the instructions reduces the token cost in the chain-of-thought (CoT) process from 258 output tokens to 86 output tokens, while still enabling the LLM to arrive at the correct answer. However, when the token budget is set to a smaller value (e.g., 10 tokens), the output token reduction is less effective, resulting in 157 output tokens, nearly twice as many as with a 50-token budget. In other words, when the token budget is relatively small, LLMs often fail to follow the given token budget. In such cases, the actual token usage significantly exceeds the given budget, sometimes even exceeding the token costs observed with larger token budgets. We refer to this phenomenon as "Token Elasticity" in the CoT process with token budgeting. To address this, the optimal token budget for a specific LLM and a particular question can be searched by gradually reducing the budget specified in the prompt, identifying the smallest token budget that achieves both the correct answer and the lowest actual token cost.

Based on the above observations and analysis, we designed a prototype for token-budget-aware reasoning in large language models (LLMs). Our approach leverages the token budget to guide the reasoning process, dynamically allocating different token budgets to problems based on an estimation of their reasoning complexity. We call our method TALE (Token-Budget-Aware LLM rEasoning). For a given problem and a specific LLM, TALE first estimates an appropriate token budget and then uses it to guide the reasoning process. We discuss different implementations of TALE in Section 5. Experiment results show that TALE significantly reduces token costs in LLM chain-of-thought (CoT) reasoning while largely maintaining the correctness of the answers. On average, TALE achieves a 68.64% reduction in token usage while keeping the accuracy decrease below 5%.

Figure 1: Examples of different problem-solving paradigms. The reasoning processes are highlighted in the original figure.

(a) Direct answering (15 output tokens).
Q: Peyton scheduled after-work activities of a one-hour yoga class on Monday, a cooking class that lasts 9 times as long as Monday's yoga on Tuesday, a half-hour cheese-tasting event on Wednesday, a museum tour that takes half as long as the cooking class on Thursday, and two hours of errands on Friday. How many hours will all Peyton's after-work activities take?
A: Peyton's after-work activities will take 12.5 hours.

(b) Vanilla CoT (258 output tokens).
Q: Peyton ... How many hours will all Peyton's after-work activities take?
Let's think step by step:
A: Let's break down the activities step by step:
1. Monday: Yoga class lasts 1 hour. Total for Monday: 1 hour.
2. Tuesday: Cooking class lasts 9 times as long as Monday's yoga class. Cooking class duration = 9 * 1 hour = 9 hours. Total for Tuesday: 9 hours.
3. Wednesday: Cheese-tasting event lasts half an hour. Total for Wednesday: 0.5 hours.
4. Thursday: Museum tour is half the cooking class. Museum tour duration = 0.5 * 9 hours = 4.5 hours. Total for Thursday: 4.5 hours.
5. Friday: Errands take 2 hours. Total for Friday: 2 hours.
Now, let's add up all the hours from each day: Monday: 1 hour; Tuesday: 9 hours; Wednesday: 0.5 hours; Thursday: 4.5 hours; Friday: 2 hours.
Now, we sum these amounts: 1 + 9 + 0.5 + 4.5 + 2 = 17 hours.
Therefore, the total time is 17 hours.

(c) CoT with an unreasonable budget (157 output tokens).
Q: Peyton ... How many hours will all Peyton's after-work activities take?
Let's think step by step and use less than 10 tokens:
A: Let's break down each activity:
- Monday: Yoga class = 1 hour.
- Tuesday: Cooking class = 9 times Monday's yoga, so 9 × 1 = 9 hours.
- Wednesday: Cheese-tasting event = 0.5 hours.
- Thursday: Museum tour = half of Tuesday's cooking class, so 9 ÷ 2 = 4.5 hours.
- Friday: Errands = 2 hours.
Now, let's add them up: 1 + 9 + 0.5 + 4.5 + 2 = 17.
Peyton's after-work activities will take 17 hours.

(d) CoT with a reasonable budget (86 output tokens).
Q: Peyton ... How many hours will all Peyton's after-work activities take?
Let's think step by step and use less than 50 tokens:
A: - Monday: 1 hour (yoga)
- Tuesday: 9 hours (cooking class)
- Wednesday: 0.5 hours (cheese-tasting)
- Thursday: 4.5 hours (museum tour)
- Friday: 2 hours (errands)
- Total: 1 + 9 + 0.5 + 4.5 + 2 = 17 hours.
2 Related Work

LLM Reasoning. Reasoning in LLMs has seen substantial advancements through techniques that generate intermediate steps, enabling more accurate and effective performance across diverse domains (Wu et al., 2022; Yang et al., 2022; Zhou et al., 2022; Sun et al., 2024; OpenAI, 2024c). Various LLM reasoning techniques have been proposed to improve LLM performance. Chen et al. (2024) formulate reasoning as sampling from a latent distribution and optimizing it via variational approaches. Ho et al. (2022) utilize LLMs as reasoning teachers, improving the reasoning abilities of smaller models through knowledge distillation. Among them, Chain-of-Thought (CoT) prompting has emerged as a key technique for improving LLM reasoning by breaking problems into intermediate steps, enabling better performance on multiple tasks (Wei et al., 2022; Lyu et al., 2023; Li et al., 2023; Feng et al., 2024). Extensions of CoT include self-consistency, which aggregates multiple reasoning paths to improve robustness (Wang et al., 2022), and Tree-of-Thoughts, which explores reasoning steps in a tree-like structure for more complex tasks (Yao et al., 2024b). Reflexion introduces iterative refinement, where the model critiques and updates its intermediate steps (Shinn et al., 2024).

Token Cost of LLM. Although the above methods enhance reasoning accuracy, they often increase token usage, posing challenges to efficiency (Wang et al., 2024; Chiang and Lee, 2024; Bhargava et al., 2023). Consequently, it is important to mitigate token consumption while maintaining model performance. To address this issue, Li et al. (2021) introduce a multi-hop processing technique designed to filter out irrelevant reasoning. While effective, this approach is limited to traditional neural networks, such as PALM (Bi et al., 2020), and lacks adaptability to large language models (LLMs). Zheng et al. (2024) aim to improve LLM inference speed by predicting response lengths and applying a scheduling algorithm to enhance efficiency. However, their method is constrained to inference scheduling and does not address the reduction of actual token costs. Hao et al. (2024b) reduce token usage by substituting decoded text tokens with continuous latent tokens. However, its application is currently restricted to small-scale, early language models like GPT-2 (Radford et al., 2019). Additionally, it significantly impacts reasoning accuracy, resulting in over a 20% relative accuracy reduction on benchmarks such as GSM8K (Cobbe et al., 2021).

3 Token Redundancy in LLM Reasoning

Token Budget. Previous research (Nayab et al., 2024) demonstrates that LLMs have the potential to follow a length constraint in the prompt. Table 1 shows the difference between the vanilla CoT prompt and the CoT prompt with a token budget. For instance, by including a token budget (50 tokens) within the prompt, as illustrated in Figure 1d, the LLM adjusts the length of its output (86 output tokens), trying to align with the specified budget. This indicates that LLMs have a certain capability in following prompts with an explicit token budget.

Table 1: Illustrations of the vanilla CoT prompt and the token-budget-aware prompt.

Prompt method         | Content
Vanilla CoT           | Let's think step by step:
CoT with Token Budget | Let's think step by step and use less than budget tokens:
Example               | Let's think step by step and use less than 50 tokens:

Token Redundancy Phenomenon. We find that providing a reasonable token budget can significantly reduce the token cost during reasoning. As shown in Figure 1d, including a token budget in the instructions reduces the token cost of the chain-of-thought (CoT) process by several times, while the LLM still gets the correct answer. Our results in Figure 2 and Table 2 also confirm that there are a large number of redundant tokens in the reasoning process of state-of-the-art LLMs.

Causes of Token Redundancy in LLM Reasoning. A possible explanation for this token redundancy is that during the post-training phase, such as the RLHF process (Ouyang et al., 2022), annotators might favor more detailed responses from LLMs, marking them as preferred. As a result, the model learns to associate longer, more detailed responses with alignment to human preferences and tends to produce such outputs during reasoning. However, in many scenarios, we primarily need LLMs to provide the correct answer and make accurate decisions, rather than elaborate extensively with detailed explanations. This motivates the need to eliminate redundant tokens in the LLM reasoning process in many cases.
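To make the prompt formats in Table 1 concrete, the following is a minimal Python sketch of how the vanilla and budgeted prompts can be assembled. The helper name build_prompt is ours for illustration and is not from the paper's code release.

from typing import Optional

def build_prompt(question: str, budget: Optional[int] = None) -> str:
    # Vanilla CoT prompt when no budget is given; budgeted CoT prompt otherwise (Table 1).
    if budget is None:
        instruction = "Let's think step by step:"
    else:
        instruction = f"Let's think step by step and use less than {budget} tokens:"
    return f"{question}\n{instruction}"

# Example: the 50-token prompt from Figure 1d.
print(build_prompt("Q: Peyton ... How many hours will all Peyton's after-work activities take?", budget=50))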
   

Figure 2: Token elasticity phenomenon. Panels: (a) GPT-4o-mini budget search; (b) GPT-4o-mini token cost; (c) Yi-lightning budget search; (d) Yi-lightning token cost. The x-axis denotes the budget search iteration. The y-axis denotes the searched budget (Figure 2a and Figure 2c) or the real token costs for each searched budget (Figure 2b and Figure 2d). Different colors denote different samples. The token cost is significantly lower in a reasonable token budget range. When the token budget is smaller than the reasonable range, the token cost gradually increases.
4 Searching Optimal Token Budget

As demonstrated in Figure 1, different token budgets have different effects. Therefore, it is natural to investigate the following question: "How to search the optimal token budget for a specific question and a particular LLM?"

Vanilla Method for Optimal Budget Search. An intuitive method is finding the minimal needed tokens as the budget, ensuring that the LLM can still produce correct and accurate responses within this constraint. To find the minimal token budget required for each question, we utilize a binary-search-based minimal budget search algorithm. Algorithm 1 showcases the details. Before initiating the search process, we first apply the vanilla CoT to generate an answer for each question, as illustrated in Figure 1b. The number of tokens in the resulting answer is then calculated and designated as the right boundary for the search, denoted by right.

Algorithm 1 Budget Search
Input: feasibility checking function isFeasible, a large language model M, a given question x, and the ground-truth label y
Output: searched budget β
1: function SEARCH(isFeasible, M, x, y)
2:   right ← the actual token cost of M with the vanilla CoT prompt on x
3:   β ← ⌊(0 + right)/2⌋
4:   β0 ← right
5:   while True do
6:     if isFeasible(M, x, y, β0, β) then
7:       β ← ⌊(0 + right)/2⌋        ▷ Update the searched budget
8:       β0 ← right                  ▷ Record the previously searched budget
9:       right ← β                   ▷ Update the search range
10:    else
11:      break
12:  return β

The function isFeasible is used to determine the feasibility of a budget. A budget is considered feasible here if the CoT prompt with that budget preserves the correctness of the answer. Given the feasibility function, the large language model M, the question x, and the label y as input, Algorithm 1 first calculates the right boundary of the search (line 2). With 0 as the left boundary, the current candidate budget β is computed as the midpoint of 0 and right (line 3). We use β0 to record the previously searched budget (line 4). While the current β is feasible, the algorithm updates β by recalculating the midpoint (line 7) and adjusts the search bounds accordingly to narrow the range (line 9). Once the loop ends, the final budget β is returned as the searched result (line 12). Algorithm 1 is designed to find the minimal budget efficiently. However, we observe that the minimal budget required to produce a correct answer is not necessarily the optimal budget. When the budget is unreasonably small, the actual token cost often exceeds that of cases where a larger budget is used.

Figure 3: The effects of the optimal searched budget. CoT with our optimal searched budget reduces the token costs significantly without influencing the accuracy. (Bar chart: number of token cost versus question index, comparing Vanilla CoT with CoT using the optimal searched budget.)

Observation of Token Elasticity. During our minimal budget search process, we observe a "token elasticity" phenomenon as we approach the minimal budget. Specifically, as Algorithm 1 progresses, we aim to identify the minimal budget that still ensures the answer's correctness. However, we find that if the budget is reduced beyond a certain range, the token cost increases, indicating that further reductions in the budget lead to increasing token consumption. Figure 2 showcases the evidence. The x-axis represents the iterations of the budget binary search, with the budget values decreasing progressively. The y-axes in Figure 2b and Figure 2d show the corresponding token costs at each budget search iteration. We can observe that when the searched budget crosses the range of reasonable budgets, the token costs begin to gradually increase. Figure 1c also shows an example. As observed, when a small token budget (e.g., 10 tokens) is used, the real token cost is significantly higher compared to scenarios where a reasonable token budget is allocated (i.e., Figure 1d).
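For illustration, here is a minimal Python sketch of the halving search described in Algorithm 1, using the correctness-only feasibility criterion. The helpers query_llm, count_tokens, and extract_answer stand in for an LLM call, a tokenizer, and an answer parser; they are assumptions of this sketch and not part of the released TALE code.

# Sketch of Algorithm 1: keep halving the budget while the budgeted CoT answer stays correct.
def is_correct(question, label, budget, query_llm, extract_answer):
    prompt = f"{question}\nLet's think step by step and use less than {budget} tokens:"
    return extract_answer(query_llm(prompt)) == label

def budget_search(question, label, query_llm, count_tokens, extract_answer):
    # Right boundary: the token cost of the vanilla CoT answer (Algorithm 1, line 2).
    vanilla_answer = query_llm(f"{question}\nLet's think step by step:")
    budget = count_tokens(vanilla_answer)   # last budget known to keep the answer correct
    candidate = budget // 2                 # midpoint of [0, right], as in line 3
    while candidate > 0 and is_correct(question, label, candidate, query_llm, extract_answer):
        budget = candidate                  # accept the smaller feasible budget
        candidate = candidate // 2          # keep narrowing the search range
    return budget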
Token Elasticity based Optimal Budget Search. The token elasticity observation shows that while a minimal budget may keep the correctness of the answer, it does not necessarily minimize the token cost. Figure 1c and Figure 1d illustrate an intuitive example. To address this, we enhance Algorithm 1 by incorporating a greedy search strategy aimed at finding the optimal budget that simultaneously minimizes token cost and preserves answer correctness. Specifically, we introduce an additional constraint to the isFeasible condition. Beyond ensuring correctness, the updated budget must result in a lower token cost compared to the previously searched budget. Algorithm 2 outlines the feasibility function employed during the search process. Initially, the actual token cost is computed for both the current and previously evaluated budgets (line 2). Next, feasibility is assessed based on two criteria: the correctness of the answer and a reduction in token cost following a greedy strategy (line 3). The search process is terminated if either condition fails.

Algorithm 2 Greedy Feasibility Function
Input: a large language model M, a question x and the ground-truth label y, previous and current budgets β0, β
Output: True if the budget satisfies the requirements, False otherwise
1: function isFeasible(M, x, y, β0, β)
2:   t, t0 ← the actual token costs under budgets β and β0
3:   if M(x, β) == y and t < t0 then
4:     return True
5:   return False
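A minimal Python sketch of the greedy feasibility check in Algorithm 2 is given below, written to plug into the budget_search sketch above. As before, query_llm, count_tokens, and extract_answer are assumed placeholder helpers, not functions from the paper.

# Sketch of Algorithm 2: a budget is feasible only if the budgeted answer is still
# correct AND its actual token cost is lower than the cost under the previous budget.
def greedy_is_feasible(question, label, beta_prev, beta,
                       query_llm, count_tokens, extract_answer):
    def run(budget):
        prompt = f"{question}\nLet's think step by step and use less than {budget} tokens:"
        return query_llm(prompt)

    current_answer = run(beta)
    t = count_tokens(current_answer)        # cost under the current budget
    t0 = count_tokens(run(beta_prev))       # cost under the previous budget
    correct = extract_answer(current_answer) == label
    return correct and t < t0

Used in place of the correctness-only check, this condition stops the search as soon as shrinking the budget no longer reduces the real token cost, which is exactly the behavior the token elasticity observation calls for.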
5 Methodology

5.1 Overview

Based on the above observations and analysis, we designed our method TALE for token-budget-aware reasoning in LLMs. Figure 4 provides an overview of TALE's workflow. TALE aims to craft a token-budget-aware prompt that achieves performance comparable to vanilla CoT while reducing token costs. To strike this balance, TALE follows a two-phase approach: budget estimation and prompt construction. The observation of token elasticity, as discussed in Section 4, suggests that only an appropriate budget located in a reasonable range can effectively reduce token costs and preserve LLM performance simultaneously. The optimal budget searched by Algorithm 1 and Algorithm 2 is located in such a reasonable budget range and achieves a satisfying trade-off between token costs and LLM performance. In this case, TALE first estimates a reasonable token budget that is close to the searched optimal budget for the given question. Using the estimated token budget, TALE then crafts a token-budget-aware prompt and feeds it into the LLM to generate the final answer. Figure 7 illustrates an example of TALE.

Figure 4: The workflow of TALE. Given a question, TALE first estimates the token budget using a budget estimator. It then crafts a token-budget-aware prompt by combining the question with the estimated budget. Finally, the prompt is input to the LLM to generate the answer as the final output.

5.2 Budget Estimation

To estimate an appropriate budget within the reasonable budget range, two possible solutions are taken into consideration: a zero-shot-based mechanism and budget regression. We also discuss an approach that internalizes the budget awareness of the LLM by fine-tuning it. All approaches focus on developing an estimator that effectively approximates the optimal budget.

Zero-shot Estimator. For the zero-shot mechanism to predict a reasonable budget, TALE leverages the reasoning LLM itself as the estimator. Before querying the LLM with a question, TALE first prompts the LLM to estimate the number of output tokens needed to answer the question. Figure 5 illustrates the budget estimation prompt. The key intuition behind this is the human-like thinking paradigm. When presented with a mathematical question, although it may take humans a few minutes to calculate the answer, they can typically estimate the time or effort required to solve it after just briefly reviewing the question. For instance, when presented with a question from primary school arithmetic and another from college-level calculus, a human may not immediately provide the answers. Still, it is easy to infer that the former can be solved in seconds, while the latter requires a significantly longer time, even with only a brief glance. RQ2 in Section 6 showcases the performance of budget estimation. Observe that a large portion of the estimated budgets are around the optimal searched budget and achieve competitive performance.

Figure 5: The prompt for the zero-shot estimator.

"Task: Analyze the given question and estimate the minimum number of tokens required to generate a complete and accurate response. Please give the response by strictly following this format: [[budget]], for example, Budget: [[12]]."
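Putting the two phases together, the following is a minimal sketch of the zero-shot version of TALE: the model is first asked for a budget with the Figure 5 prompt, the [[...]] value is parsed, and the budgeted CoT prompt is then issued. query_llm is the same assumed placeholder for an LLM call; the regular expression and the fallback budget are our own illustrative choices.

import re

ESTIMATION_PROMPT = (
    "Task: Analyze the given question and estimate the minimum number of tokens "
    "required to generate a complete and accurate response. Please give the "
    "response by strictly following this format: [[budget]], for example, Budget: [[12]]."
)

def tale_zero_shot(question, query_llm, default_budget=50):
    # Phase 1: zero-shot budget estimation (Figure 5).
    estimate = query_llm(f"{ESTIMATION_PROMPT}\n\n{question}")
    match = re.search(r"\[\[(\d+)\]\]", estimate)
    budget = int(match.group(1)) if match else default_budget  # fallback value is our assumption
    # Phase 2: token-budget-aware CoT prompt (Table 1).
    return query_llm(f"{question}\nLet's think step by step and use less than {budget} tokens:")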
Regression Estimator. For the regression-based estimator, we aim to train or fine-tune another LLM f(θ) to serve as the estimator, such that f(θ) estimates the optimal token budget given a specific LLM and a particular question. Given D = {(p, x_i, β*_i)}_{i=1}^{N}, p is the instruction prompt, x_i is a question, β*_i is our searched optimal budget (searched by Algorithm 1 and Algorithm 2) for x_i, and N is the dataset size. We assign the instruction prompt p as "Estimate the token budget for the following question". Next, we initialize f(θ) using a pre-trained LLM, such as LLaMA 3-8B (AI@Meta, 2024). Then, we craft the target output y_i for x_i using β*_i. For example, given β*_0 as 14, the corresponding target output y_0 is "The token budget is 14". We aim to make the model output as close as possible to the target output y_i by minimizing a loss function, which can be formalized as:

\theta^{*} = \arg\min_{\theta} \sum_{i=1}^{N} \mathcal{L}\big(f_{\theta}(p_i, x_i),\, y_i\big)    (1)

where θ denotes the model parameters and θ* the optimized parameters. The loss function is defined as follows:

\mathcal{L} = -\sum_{i=1}^{N} \log P(y_i \mid p_i, x_i; \theta)    (2)

Through Equation 2, the model parameters θ are optimized to maximize the probability P(y_i | p_i, x_i; θ) of the target output y_i given the input x_i and instruction prompt p across all N training samples.

Token-Budget Awareness Internalization. To obtain an LLM with token-budget awareness, we fine-tune the LLM to internalize the budget estimation into the inference process and produce token-efficient reasoning responses. Specifically, we fine-tune the LLM M(θ) so that it generates token-budget-aware answers. This process is divided into two key stages: target output generation and LLM fine-tuning. In the target output generation stage, we craft the target output y_i by prompting M(θ) with a Chain-of-Thought (CoT) prompt that incorporates our searched optimal token budget. The prompt is formatted as follows:

"Let's think step by step and use less than β*_i tokens:"

where β*_i is the searched optimal budget for the given question x_i (see the search process in Algorithm 1 and Algorithm 2). Figure 1d illustrates an example. The resulting LLM output, constrained by the token budget specified in the prompt, is taken as the crafted target output y_i. In the LLM fine-tuning stage, we train the LLM M(θ) using the crafted target outputs from the first stage. The instruction prompt for fine-tuning is standardized as follows:

"Let's think step by step:"

The training sample for fine-tuning consists of the given question x_i paired with the crafted target output y_i. The training objective aligns with the loss function described in Equation 2, where (x_i, y_i) indicates the training sample and p_i means the above instruction prompt. During fine-tuning, the LLM is encouraged to internalize the token budget constraint and prefer a compact reasoning process, following the target outputs generated in the first stage. This two-stage process ensures that the LLM produces concise yet correct responses, effectively balancing reasoning quality with token efficiency during inference.
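The two-stage internalization pipeline amounts to building a fine-tuning dataset whose targets are budget-constrained CoT answers. A minimal sketch of that data construction is given below, assuming the searched optimal budgets are already available; query_llm and the record format are our own illustrative choices, not the paper's released code.

# Sketch of the target-output-generation stage for budget-awareness internalization.
def build_finetuning_dataset(questions, optimal_budgets, query_llm):
    records = []
    for question, budget in zip(questions, optimal_budgets):
        # Stage 1: generate a budget-constrained CoT answer as the target output.
        budgeted_prompt = f"{question}\nLet's think step by step and use less than {budget} tokens:"
        target_output = query_llm(budgeted_prompt)
        # Stage 2 input: the standardized, budget-free instruction prompt, so the
        # fine-tuned model learns to reason concisely without seeing an explicit budget.
        records.append({
            "prompt": f"{question}\nLet's think step by step:",
            "target": target_output,
        })
    return records

Fine-tuning on these pairs with the negative log-likelihood objective in Equation 2 then encourages the model to produce compact reasoning by default.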
6 Evaluation

In this section, we provide the preliminary results of the zero-shot estimator version of our method TALE. This project is ongoing, and we will update more results soon. Three state-of-the-art LLMs (i.e., GPT-4o (OpenAI, 2024b), GPT-4o-mini (OpenAI, 2024a), and Yi-lightning (Wake et al., 2024)) are involved in the experiments. Our evaluation centers on the following research questions (RQs):

RQ1. How effective is TALE in reducing token costs while maintaining LLM performance?
RQ2. How effective is TALE at estimating the token budget for a given question?
RQ3. How general is TALE across different state-of-the-art LLMs?

6.1 Experiment Setup

Metrics. The goal of TALE is to balance LLM performance against redundant token costs. Specifically, TALE seeks to minimize token consumption while maintaining comparable LLM performance. To evaluate LLM performance, three challenging mathematical datasets are taken into consideration: GSM8K (Cobbe et al., 2021), GSM8K-Zero (Chiang and Lee, 2024), and MathBench (Liu et al., 2024). GSM8K-Zero, derived from the GSM8K dataset, specifically targets the analysis of over-reasoning and redundancy in LLM-generated outputs. In short, GSM8K-Zero is designed so that the answers are embedded within the questions themselves, so LLMs can easily generate correct responses without complicated additional reasoning or redundant calculations.

Accuracy (Acc). This metric is calculated as follows: Accuracy = (1/N) \sum_{i=1}^{N} \mathbb{I}\{M(x_i) = y_i\}, where (x_i, y_i) ∈ X, x_i is the math question from dataset X, and y_i is the ground-truth answer. M(·) returns the answer for a given question. I{·} represents an indicator function that evaluates whether the given condition holds: it returns 1 if the condition is true and 0 if the condition is false. For a better evaluation, we format the LLM output using the instruction detailed in Figure 6.

Figure 6: The instruction prompt used to format the LLM output on multiple-choice questions.

"Q: Peyton ... How many hours will all Peyton's after-work activities take?
Please strictly follow the format: [[choice]], for example: Choice: [[A]]."

Number of Output Tokens. We evaluate the token costs by calculating the average output token consumption for each specific task. The output token costs are measured as follows: Number of Output Tokens = (1/N) \sum_{i=1}^{N} T(M(x_i)), where x_i represents the given question and T is a function that measures the number of tokens. Intuitively, the more output tokens, the higher the costs incurred by M. To evaluate costs more precisely, we also calculate the average expense per sample for querying the LLM. The total token expense includes both input and output tokens used during the query process.

6.2 RQ1. Effectiveness of TALE.

Table 2 presents a comparison of TALE with other prompt engineering methods, including "Directly Answering" and "Vanilla CoT", across six datasets. The results are evaluated in terms of ACC, Number of Output Tokens (Output Tokens for short), and Expense. A well-designed prompt engineering approach should induce the LLM to generate a correct response with as few tokens as possible. TALE consistently demonstrates significant improvements in efficiency while maintaining competitive accuracy. Directly Answering achieves the lowest output tokens (14.57 on average) and expenses (25.37 on average) but has the lowest accuracy (52.31% on average). Vanilla CoT achieves the highest accuracy (83.75% on average) but incurs a significant token cost (461.25 on average) and expense (289.78 on average). TALE demonstrates a balanced trade-off between performance and cost: it achieves competitive accuracy (81.03%) while significantly reducing token costs (to 32% of Vanilla CoT) and expenses (to 41% of Vanilla CoT). For accuracy, notably, on GSM8K, TALE even improves accuracy to 84.46%, surpassing Vanilla CoT and demonstrating its ability to adapt well to complex reasoning tasks while remaining efficient. For output tokens on GSM8K-Zero, TALE achieves an impressive reduction in output token costs from 252.96 to 22.67 while maintaining high accuracy (98.72%), showcasing its capability to optimize reasoning efficiency in such tasks. For expenses, TALE demonstrates its cost-effectiveness, reducing expenses from 78.58 to 18.62 while achieving reasonable accuracy (73.67% vs. 75.00%) on MathBench-Arithmetic. Overall, TALE demonstrates that incorporating token-budget awareness allows for a significant reduction in token costs and monetary expenses without a major compromise in accuracy. TALE reduces output token costs by 68.64% on average, making it a more efficient solution for budget-constrained reasoning tasks while retaining competitive performance. These results highlight TALE's generalizability across tasks with varying complexity, demonstrating its potential to scale in real-world scenarios while managing computational and financial resources effectively.
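The Accuracy and Number of Output Tokens metrics defined in Section 6.1 reduce to simple per-sample bookkeeping. A small Python sketch follows; the (question, ground_truth, model_output) record format and the helpers extract_answer and count_tokens are our assumptions.

# Sketch of the evaluation metrics: accuracy and average number of output tokens.
def evaluate(samples, extract_answer, count_tokens):
    n = len(samples)
    correct = sum(1 for _, truth, output in samples if extract_answer(output) == truth)
    avg_output_tokens = sum(count_tokens(output) for _, _, output in samples) / n
    return {"accuracy": correct / n, "avg_output_tokens": avg_output_tokens}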
Table 2: Comparison of TALE (Zero-shot Estimator Version) and other prompt engineering methods. "Directly Answering" means prompting the LLM without any reasoning process. "Vanilla CoT" means the vanilla CoT prompting without a token budget. The model used in our evaluation is GPT-4o-mini (OpenAI, 2024a). Observe that TALE achieves an average accuracy (ACC) of 80.22%, with an average output token cost of 138.53 and an average expense of 118.46. TALE reduces output token costs by 67%, lowers expenses by 59%, and maintains competitive performance compared to the vanilla CoT approach. ACC ↑, Output Tokens ↓, Expense (10^-5 $ / sample) ↓.

Dataset | Directly Answering (ACC / Output Tokens / Expense) | Vanilla CoT (ACC / Output Tokens / Expense) | TALE (Ours) (ACC / Output Tokens / Expense)
GSM8K | 28.29% / 12.46 / 39.43 | 81.35% / 318.10 / 541.09 | 84.46% / 77.26 / 279.84
GSM8K-Zero | 97.21% / 18.85 / 91.69 | 99.50% / 252.96 / 886.79 | 98.72% / 22.67 / 276.12
MathBench-Arithmetic | 59.67% / 41.10 / 9.78 | 75.00% / 313.51 / 78.58 | 73.67% / 39.60 / 18.62
MathBench-Middle | 33.33% / 5.00 / 3.58 | 84.67% / 553.93 / 68.22 | 79.33% / 238.14 / 42.95
MathBench-High | 51.33% / 5.00 / 4.07 | 84.00% / 653.24 / 82.44 | 80.00% / 254.82 / 47.61
MathBench-College | 44.00% / 5.00 / 3.68 | 78.00% / 675.78 / 81.56 | 70.00% / 259.85 / 45.60
Average | 52.31% / 14.57 / 25.37 | 83.75% / 461.25 / 289.78 | 81.03% / 148.72 / 118.46

6.3 RQ2. Effectiveness of Budget Estimation.

In this RQ, we evaluate the effectiveness of the budget estimation. An ideal estimated budget should be located around the optimal searched budget, i.e., in the bottom area of Figure 2. We define such an area as the ideal budget range and give the formal definition in Section A.1. A good budget should be located in the ideal budget range. Two metrics are taken into consideration: in-range accuracy and out-of-range distance. In-range accuracy determines whether the predicted budget β̂ falls within the ideal budget range W*_k. Mathematically, it can be expressed as:

\mathbb{I}\{\hat{\beta} \in W_k^{*}\} = \begin{cases} 1, & \text{if } \hat{\beta} \in W_k^{*}, \\ 0, & \text{otherwise.} \end{cases}

Out-of-range distance quantifies the distance between β̂ and W*_k if the predicted budget β̂ falls outside the ideal budget range W*_k. Let dist(β̂, W*_k) represent the distance, defined as:

\mathrm{dist}(\hat{\beta}, W_k^{*}) = \begin{cases} 0, & \text{if } \hat{\beta} \in W_k^{*}, \\ \min_{\beta \in W_k^{*}} |\hat{\beta} - \beta|, & \text{if } \hat{\beta} \notin W_k^{*}. \end{cases}

Intuitively, a higher in-range accuracy and a lower out-of-range distance indicate a better estimated budget. In our evaluation, the in-range accuracy is 60.61% and the out-of-range distance is 109.64, indicating that more than half of the estimated budgets are located in the ideal range. The out-of-range samples have an offset of 109.64 tokens on average.

6.4 RQ3. Generalization of TALE.

Table 3 demonstrates the generalization of TALE across Yi-lightning (Wake et al., 2024), GPT-4o-mini (OpenAI, 2024a), and GPT-4o (OpenAI, 2024b) on MathBench-College, showing its ability to significantly reduce output tokens and expenses while maintaining competitive accuracy. TALE achieves substantial token savings, reducing output tokens by 62.6% for Yi-lightning, 61.5% for GPT-4o-mini, and 69.8% for GPT-4o, compared to Vanilla CoT. Expense reductions are equally notable, with costs decreasing from 21.55 to 17.25 for Yi-lightning, 81.56 to 45.60 for GPT-4o-mini, and 1359.42 to 759.95 for GPT-4o. Despite these cost savings, TALE maintains strong accuracy, achieving 76.67% on Yi-lightning, 70.00% on GPT-4o-mini, and 80.00% on GPT-4o, comparable to Vanilla CoT. These results highlight TALE's effectiveness in balancing cost efficiency and reasoning performance across diverse LLM architectures. The observed accuracy drop is most significant for GPT-4o-mini. This could be attributed to its smaller number of parameters, which makes it more challenging to answer correctly within a limited reasoning length.

Table 3: The generalization of TALE (Zero-shot Estimator Version) across different LLMs. Yi-lightning (Wake et al., 2024), GPT-4o-mini (OpenAI, 2024a), and GPT-4o (OpenAI, 2024b) are taken into consideration. We conduct the evaluation on MathBench-College. ACC ↑, Output Tokens ↓, Expense (10^-5 $ / sample) ↓.

LLM | Directly Answering (ACC / Output Tokens / Expense) | Vanilla CoT (ACC / Output Tokens / Expense) | TALE (Ours) (ACC / Output Tokens / Expense)
Yi-lightning | 66.67% / 80.01 / 3.09 | 79.33% / 998.10 / 21.55 | 76.67% / 373.52 / 17.25
GPT-4o-mini | 44.00% / 5.00 / 3.68 | 78.00% / 675.78 / 81.56 | 70.00% / 259.85 / 45.60
GPT-4o | 57.33% / 5.00 / 61.34 | 84.00% / 602.29 / 1359.42 | 80.00% / 181.61 / 759.95

7 Conclusion

In this paper, we introduce TALE, a framework that reduces token redundancy in Chain-of-Thought (CoT) reasoning by incorporating token-budget awareness. TALE dynamically estimates token budgets based on task complexity to guide reasoning, balancing efficiency and accuracy. Experiments show that TALE reduces token usage by 68.64% on average with less than a 5% accuracy loss, outperforming Vanilla CoT in cost-effectiveness while generalizing well across various LLMs.
References

AI@Meta. 2024. Llama 3 model card.

Aman Bhargava, Cameron Witkowski, Shi-Zhuo Looi, and Matt Thomson. 2023. What's the magic word? A control theory of LLM prompting. arXiv preprint arXiv:2310.04444.

Bin Bi, Chenliang Li, Chen Wu, Ming Yan, Wei Wang, Songfang Huang, Fei Huang, and Luo Si. 2020. PALM: Pre-training an autoencoding & autoregressive language model for context-conditioned generation. arXiv preprint arXiv:2004.07159.

Haolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky Ho, Phil Mui, Silvio Savarese, Caiming Xiong, et al. 2024. Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding. arXiv preprint arXiv:2411.04282.

Cheng-Han Chiang and Hung-yi Lee. 2024. Over-reasoning and redundant calculation of large language models. arXiv preprint arXiv:2401.11467.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. 2024. Towards revealing the mystery behind chain of thought: A theoretical perspective. Advances in Neural Information Processing Systems, 36.

Xidong Feng, Ziyu Wan, Muning Wen, Ying Wen, Weinan Zhang, and Jun Wang. 2023. AlphaZero-like tree-search can guide large language model decoding and training. In NeurIPS 2023 Foundation Models for Decision Making Workshop.

Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992.

Shibo Hao, Yi Gu, Haotian Luo, Tianyang Liu, Xiyan Shao, Xinyuan Wang, Shuhua Xie, Haodi Ma, Adithya Samavedhi, Qiyue Gao, et al. 2024a. LLM reasoners: New evaluation, library, and analysis of step-by-step reasoning with large language models. arXiv preprint arXiv:2404.05221.

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. 2024b. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769.

Namgyu Ho, Laura Schmid, and Se-Young Yun. 2022. Large language models are reasoning teachers. arXiv preprint arXiv:2212.10071.

Chenliang Li, Bin Bi, Ming Yan, Wei Wang, and Songfang Huang. 2021. Addressing semantic drift in generative question answering with auxiliary extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 942–947.

Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. 2023. Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5315–5333.

Hongwei Liu, Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou, Wenwei Zhang, Songyang Zhang, Dahua Lin, and Kai Chen. 2024. MathBench: Evaluating the theory and application proficiency of LLMs with a hierarchical mathematics benchmark. arXiv preprint arXiv:2405.12209.

Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. 2023. Faithful chain-of-thought reasoning. arXiv preprint arXiv:2301.13379.

Sania Nayab, Giulio Rossolini, Giorgio Buttazzo, Nicolamaria Manes, and Fabrizio Giacomelli. 2024. Concise thoughts: Impact of output length on LLM reasoning and cost. arXiv preprint arXiv:2407.19825.

OpenAI. 2024a. GPT-4o mini: Advancing cost-efficient intelligence. Technical report, OpenAI. Accessed: July 18, 2024.

OpenAI. 2024b. Hello GPT-4o. Technical report, OpenAI. Accessed: May 13, 2024.

OpenAI. 2024c. Learning to reason with LLMs. Technical report, OpenAI.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36.

Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Ying Nian Wu, Yongfeng Zhang, and Dongfang Liu. 2024. Visual agents as fast and slow thinkers. arXiv preprint arXiv:2408.08862.

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. 2022. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.

Alan Wake, Albert Wang, Bei Chen, CX Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, et al. 2024. Yi-Lightning technical report. arXiv preprint arXiv:2412.01253.

Junlin Wang, Siddhartha Jain, Dejiao Zhang, Baishakhi Ray, Varun Kumar, and Ben Athiwaratkun. 2024. Reasoning in token economies: Budget-aware evaluation of LLM reasoning strategies. arXiv preprint arXiv:2406.06461.

Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. 2023. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2609–2634.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Tongshuang Wu, Michael Terry, and Carrie Jun Cai. 2022. AI chains: Transparent and controllable human-AI interaction by chaining large language model prompts. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pages 1–22.

Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P Lillicrap, Kenji Kawaguchi, and Michael Shieh. 2024. Monte Carlo tree search boosts reasoning via iterative preference learning. arXiv preprint arXiv:2405.00451.

Jingfeng Yang, Haoming Jiang, Qingyu Yin, Danqing Zhang, Bing Yin, and Diyi Yang. 2022. SeqZero: Few-shot compositional semantic parsing with sequential prompts and zero-shot models. arXiv preprint arXiv:2205.07381.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024a. Tree of Thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024b. Tree of Thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36.

Zangwei Zheng, Xiaozhe Ren, Fuzhao Xue, Yang Luo, Xin Jiang, and Yang You. 2024. Response length perception and sequence scheduling: An LLM-empowered LLM inference pipeline. Advances in Neural Information Processing Systems, 36.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625.
A Appendix

Figure 7: An intuitive example to illustrate the workflow of TALE.

(a) Direct answering (10 output tokens).
Q: Tom's ship can travel at 50 miles per hour. He is sailing from 1 to 4 PM. He then travels back at a rate of 30 mph. How long does it take him to get back?
A: It takes Tom 3 hours to get back.

(b) Vanilla CoT (271 output tokens).
Q: Tom ... How long does it take him to get back?
Let's think step by step:
A: Let's break the problem down step by step:
Step 1: Calculate the distance Tom travels. Tom is traveling from 1:00 PM to 4:00 PM, which is a 3-hour journey. Since Tom's ship travels at a speed of 50 miles per hour, we can calculate the distance he covers during this time: Distance = Speed × Time = 50 miles/hour × 3 hours = 150 miles.
Step 2: Calculate the time it takes to travel back. Tom travels back at a rate of 30 miles per hour, and the distance is the same (150 miles). We can calculate the time it takes him to return by dividing the distance by his return speed: Time to return = Distance / Speed = 150 miles / 30 miles/hour = 5 hours.
Final Answer: It takes Tom 5 hours to get back.

(c) TALE (68 output tokens).
Q: Tom's ship can travel at 50 miles per hour. He is sailing from 1 to 4 PM. He then travels back at a rate of 30 mph. How long does it take him to get back?
Budget estimation. Task: Analyze the given question and estimate the minimum number of tokens required to generate a complete and accurate response. Please give the response by strictly following this format: [[budget]], for example, Budget: [[12]].
Estimated budget: Budget: [[26]]
Token-budget-aware CoT. Please answer the above question. Let's think step by step and use less than 26 tokens:
A: 1. Time out: 3 hours (1 to 4 PM). 2. Distance = speed × time = 50 mph × 3 hours = 150 miles. 3. Time back = distance ÷ speed = 150 miles ÷ 30 mph = 5 hours.
Answer: 5 hours.

A.1 Definition of Ideal Budget Range

Ideal Budget Range. Based on the observation of token elasticity, a token-cost bottom range exists during the search for the optimal budget. In this range, the token costs approach the lowest token cost. Before or after this range, the token cost increases. We define such a bottom range as the "ideal budget range". It is worth noting that the budget decreases monotonically during the search; only the token cost rebounds. That is why we refer to this observation as token elasticity. To summarize, the ideal budget range is a range that minimizes actual token consumption. Let β = {β_1, β_2, ..., β_N} denote all possible budgets that can maintain answer correctness. A rolling window W over β is applied iteratively. Let k represent the range size, which is adaptively determined during our evaluation as N/3, where N is the total number of possible budgets. A budget range is defined as:

W_k(i) = \{\beta_j \mid i \le j \le i + k - 1\}, \quad 1 \le i \le |\beta| - k + 1

The ideal budget range W*_k is defined as:

W_k^{*} = \arg\min_{i} \sum_{\beta_j \in W_k(i)} T(\beta_j)    (3)

where T denotes the actual token consumption for a given budget in β. We aim to estimate a budget located in the ideal budget range without any search process. In that case, TALE obtains an ideal budget at an acceptable sacrifice.
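Equation 3 amounts to a sliding-window minimization over the feasible budgets. A small Python sketch follows; the parallel-list input format for budgets and their measured token costs is our own assumption.

# Sketch of Equation 3: slide a window of size k over the feasible budgets and
# pick the window with the smallest total actual token cost.
def ideal_budget_range(budgets, token_costs, k=None):
    n = len(budgets)
    if k is None:
        k = max(1, n // 3)   # window size N/3, as in the paper's evaluation
    best_start, best_total = 0, float("inf")
    for i in range(n - k + 1):
        total = sum(token_costs[i:i + k])
        if total < best_total:
            best_start, best_total = i, total
    return budgets[best_start:best_start + k]   # the ideal budget range W*_k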
