Token-Budget Optimization in LLMs
Q: Peyton scheduled after-work activities of a one-hour yoga class on Monday, a cooking class that lasts 9 times as long as Monday's yoga on Tuesday, a half-hour cheese-tasting event on Wednesday, a museum tour that takes half as long as the cooking class on Thursday, and two hours of errands on Friday. How many hours will all Peyton's after-work activities take?
Peyton's after-work activities will take 12.5 hours.

(a) Direct answering (15 output tokens).

Q: Peyton … How many hours will all Peyton's after-work activities take?
Let's think step by step:
Let's break down activities step by step:
1. Monday: Yoga class lasts 1 hour. - Total for Monday: 1 hour
2. Tuesday: Cooking class lasts 9 times as long as Monday's yoga class. - Cooking class duration = 9 * 1 hour = 9 hours - Total for Tuesday: 9 hours
3. Wednesday: Cheese-tasting event lasts half an hour. - Total for Wednesday: 0.5 hours
4. Thursday: Museum tour is half the cooking class. - Museum tour duration = 0.5 * 9 hours = 4.5 hours - Total for Thursday: 4.5 hours
5. Friday: Errands take 2 hours. - Total for Friday: 2 hours
Now, let's add up all the hours from each day:
- Monday: 1 hour - Tuesday: 9 hours - Wednesday: 0.5 hours - Thursday: 4.5 hours - Friday: 2 hours
Now, we sum these amounts: 1 + 9 + 0.5 + 4.5 + 2 = 17 hours.
Therefore, the total time is 17 hours.

(b) Vanilla CoT (258 output tokens).

Q: Peyton … How many hours will all Peyton's after-work activities take?
Let's think step by step and use less than 10 tokens:
Let's break down each activity:
• Monday: Yoga class = 1 hour.
• Tuesday: Cooking class = 9 times Monday's yoga, so 9 × 1 = 9 hours.
• Wednesday: Cheese-tasting event = 0.5 hours.
• Thursday: Museum tour = Half of Tuesday's cooking class, so 9 ÷ 2 = 4.5 hours.
• Friday: Errands = 2 hours.
Now, let's add them up: 1 + 9 + 0.5 + 4.5 + 2 = 17
Peyton's after-work activities will take 17 hours.

(c) CoT with an unreasonable budget (157 output tokens).

Q: Peyton … How many hours will all Peyton's after-work activities take?
Let's think step by step and use less than 50 tokens:
- Monday: 1 hour (yoga) - Tuesday: 9 hours (cooking class)
- Wednesday: 0.5 hours (cheese-tasting) - Thursday: 4.5 hours (museum tour) - Friday: 2 hours (errands)
- Total: 1 + 9 + 0.5 + 4.5 + 2 = 17 hours.

(d) CoT with a reasonable budget (86 output tokens).

Figure 1: Examples of different problem-solving paradigms. The reasoning processes are highlighted.

86 output tokens, while still enabling the LLM to arrive at the correct answer. However, when the token budget is set to a different, smaller value (e.g., 10 tokens), the output token reduction is less effective, resulting in 157 output tokens, nearly twice as many as with a 50-token budget. In other words, when the token budget is relatively small, LLMs often fail to follow the given token budget. In such cases, the actual token usage significantly exceeds the given budget and can even be much larger than the token costs observed with larger token budgets. We refer to this phenomenon as “Token Elasticity” in the CoT process with token budgeting. To address this, the optimal token budget for a specific LLM and a particular question can be searched by gradually reducing the budget specified in the prompt, identifying the smallest token budget that achieves both the correct answer and the lowest actual token cost.

Based on the above observations and analysis, we designed a prototype for token-budget-aware reasoning in large language models (LLMs). Our approach leverages the token budget to guide the reasoning process, dynamically allocating different token budgets to problems based on an estimation of their reasoning complexity. We call our method TALE (Token-Budget-Aware LLM rEasoning). For a given problem and a specific LLM, TALE first estimates an appropriate token budget and then uses it to guide the reasoning process. We discuss different implementations of TALE in Section 5. Experimental results show that TALE significantly reduces token costs in LLM chain-of-thought (CoT) reasoning while largely maintaining the correctness of the answers. On average, TALE achieves a 68.64% reduction in token usage with less than a 5% decrease in accuracy.
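To make the estimate-then-reason flow concrete, the following is a minimal sketch of how a TALE-style pipeline could be wired together. It only illustrates the two-step idea described above, not the actual implementations discussed in Section 5; the llm callable and the estimate_budget and tale_answer helpers are hypothetical stand-ins, and the way the budget is estimated here (by querying the model itself) is an assumption.

    def estimate_budget(llm, question):
        # Assumption: the budget is estimated by asking the model itself how many
        # output tokens the reasoning for this question is likely to need.
        prompt = ("Estimate how many output tokens are needed to reason through "
                  "the following question. Reply with a single integer.\n"
                  f"Question: {question}")
        return int(llm(prompt).strip())

    def tale_answer(llm, question):
        # Step 1: estimate an appropriate token budget for this question.
        budget = estimate_budget(llm, question)
        # Step 2: let the budget guide reasoning via the budget-aware CoT prompt.
        prompt = (f"{question}\n"
                  f"Let's think step by step and use less than {budget} tokens:")
        return llm(prompt)

Here llm is assumed to be any function that sends a prompt to the model and returns its text response.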
2 Related Work

LLM Reasoning. Reasoning in LLMs has seen substantial advancements through techniques that generate intermediate steps, enabling more accurate and effective performance across diverse domains (Wu et al., 2022; Yang et al., 2022; Zhou et al., 2022; Sun et al., 2024; OpenAI, 2024c). Various LLM reasoning techniques have been proposed to improve LLM performance. Chen et al. (2024) formulates reasoning as sampling from a latent distribution and optimizing it via variational approaches. Ho et al. (2022) utilizes LLMs as reasoning teachers, improving the reasoning abilities of smaller models through knowledge distillation. Among them, Chain-of-Thought (CoT) prompting has emerged as a key technique for improving LLM reasoning by breaking problems into intermediate steps, enabling better performance on multiple tasks (Wei et al., 2022; Lyu et al., 2023; Li et al., 2023; Feng et al., 2024).
Extensions of CoT include self-consistency, which aggregates multiple reasoning paths to improve robustness (Wang et al., 2022), and Tree-of-Thoughts, which explores reasoning steps in a tree-like structure for more complex tasks (Yao et al., 2024b). Reflexion introduces iterative refinement, where the model critiques and updates its intermediate steps (Shinn et al., 2024).

Token Cost of LLM. Although the above methods enhance reasoning accuracy, they often increase token usage, posing challenges to efficiency (Wang et al., 2024; Chiang and Lee, 2024; Bhargava et al., 2023). Consequently, it is important to mitigate token consumption while maintaining model performance. To address this issue, Li et al. (2021) introduces a multi-hop processing technique designed to filter out irrelevant reasoning. While effective, this approach is limited to traditional neural networks, such as PALM (Bi et al., 2020), and lacks adaptability to large language models (LLMs). Zheng et al. (2024) aims to improve LLM inference speed by predicting response lengths and applying a scheduling algorithm to enhance efficiency. However, their method is constrained to scheduling-level efficiency gains and does not address the reduction of actual token costs. Hao et al. (2024b) reduces token usage by substituting decoded text tokens with continuous latent tokens. However, its application is currently restricted to small-scale, early language models like GPT-2 (Radford et al., 2019). Additionally, it significantly impacts reasoning accuracy, resulting in over a 20% relative accuracy reduction on benchmarks such as GSM8K (Cobbe et al., 2021).

3 Token Redundancy in LLM Reasoning

Token Budget. Previous research (Nayab et al., 2024) demonstrates that LLMs have the potential to follow a length constraint in the prompt. Table 1 shows the difference between the vanilla CoT prompt and the CoT prompt with a token budget. For instance, by including a token budget (50 tokens) within the prompt, as illustrated in Figure 1d, the LLM adjusts the length of its output (86 output tokens), trying to align with the specified budget. This indicates that LLMs have a certain capability in following prompts with an explicit token budget.

Table 1: Illustrations of the vanilla CoT prompt and the token-budget-aware prompt.

Prompt method         | Content
----------------------|-------------------------------------------------------------
Vanilla CoT           | Let's think step by step:
CoT with Token Budget | Let's think step by step and use less than <budget> tokens:
Example               | Let's think step by step and use less than 50 tokens:
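The two templates in Table 1 are straightforward to parameterize. The small helpers below are only a sketch of how the <budget> placeholder might be filled in code; the function names are ours, and the value 50 mirrors the Example row.

    def vanilla_cot_prompt(question):
        # "Vanilla CoT" row of Table 1.
        return f"{question}\nLet's think step by step:"

    def budget_cot_prompt(question, budget):
        # "CoT with Token Budget" row of Table 1; budget is an integer.
        return f"{question}\nLet's think step by step and use less than {budget} tokens:"

    # The "Example" row of Table 1 corresponds to budget = 50:
    # budget_cot_prompt("Q: Peyton ... after-work activities take?", 50)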
Token Redundancy Phenomenon. We find that providing a reasonable token budget can significantly reduce the token cost during reasoning. As shown in Figure 1d, including a token budget in the instructions reduces the token cost of the chain-of-thought (CoT) process several-fold, while the LLM still reaches the correct answer. Our results in Figure 2 and Table 2 also confirm that there are a large number of redundant tokens in the reasoning process of state-of-the-art LLMs.

Causes of Token Redundancy in LLM Reasoning. A possible explanation for this token redundancy is that during the post-training phase, such as the RLHF process (Ouyang et al., 2022), annotators might favor more detailed responses from LLMs, marking them as preferred. As a result, the model learns to associate longer, more detailed responses with alignment to human preferences and tends to produce such outputs during reasoning. However, in many scenarios, we primarily need LLMs to provide the correct answer and make accurate decisions, rather than elaborate extensively with detailed explanations. This motivates the need to eliminate redundant tokens in the LLM reasoning process in many cases.

4 Searching Optimal Token Budget

As demonstrated in Figure 1, different token budgets have different effects. Therefore, it is natural to investigate the following question: “How can we search for the optimal token budget for a specific question and a particular LLM?”

Vanilla Method for Optimal Budget Search. An intuitive method is to find the minimal number of tokens needed as the budget, ensuring that the LLM can still produce correct and accurate responses within this constraint. To find the minimal token budget required for each question, we utilize a binary search-based minimal budget search algorithm. Algorithm 1 showcases the details. Before initiating the search process, we first apply the vanilla CoT to generate an answer for each question, as illustrated in Figure 1b. The number of tokens in the resulting answer is then calculated and designated as the right boundary for the search, denoted by right.
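As a concrete illustration of this search procedure, the sketch below binary-searches for the smallest budget that still yields a correct answer, using the vanilla CoT answer's token count as the initial right boundary. It is only an approximation of the procedure described above, not the paper's Algorithm 1 verbatim; query_llm, count_tokens, and is_correct are hypothetical helpers, and the actual algorithm may include additional checks (e.g., on the actual token cost) that are not shown here.

    def budgeted_prompt(question, budget):
        # Token-budget-aware CoT prompt (Table 1).
        return f"{question}\nLet's think step by step and use less than {budget} tokens:"

    def search_minimal_budget(query_llm, count_tokens, is_correct, question):
        # Right boundary: token count of the vanilla CoT answer (Figure 1b).
        vanilla_answer = query_llm(f"{question}\nLet's think step by step:")
        left, right = 1, count_tokens(vanilla_answer)
        best = right
        while left <= right:
            mid = (left + right) // 2
            answer = query_llm(budgeted_prompt(question, mid))
            if is_correct(answer):
                best = mid        # still correct: try an even smaller budget
                right = mid - 1
            else:
                left = mid + 1    # answer degraded: the budget must grow
        return best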
(a) GPT-4o-mini budget search. (b) GPT-4o-mini token cost. (c) Yi-lightning budget search. (d) Yi-lightning token cost.
Figure 2: Token elasticity phenomenon. The x-axis denotes the budget search iteration. The y-axis denotes the
searched budget (Figure 2a and Figure 2c) or the real token costs for each searched budget (Figure 2b and Figure 2d).
Different colors denote different samples. The token cost is significantly lower in a reasonable token budget range.
When the token budget is smaller than the reasonable range, the token cost gradually increases.