
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17602

I intentionally kept the bar simple, without specifying part numbers
(which ultimately don't matter much); the only thing we care about is
tracking progress.

Signed-off-by: Adrien Gallouët <[email protected]>
@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #368

Overview

PR #368 introduces multi-threaded progress bar functionality to common/download.cpp, adding mutex-based synchronization and ANSI terminal sequences for concurrent download progress display. The modification affects the print_progress function, which is not part of the inference pipeline.

Key Findings

Impacted Function:

  • print_progress: Response time increased by roughly 10.5 µs (604 ns → 11113 ns) in llama-cvector-generator and by 10433 ns (608 ns → 11041 ns) in llama-tts. Throughput increased by 301 ns and 275 ns respectively.

Code Changes:
The implementation adds thread-safe progress tracking using std::mutex and std::map<std::thread::id, int> for line assignment, plus ANSI escape sequences for cursor positioning. The response time increase stems from mutex acquisition (20-50 ns), map lookup operations (30-50 ns), and console I/O for ANSI sequences (5-8 microseconds across multiple std::cout calls).

Inference Impact:
No impact on tokens per second. The print_progress function operates during model loading and downloading operations, not during inference execution. Functions responsible for tokenization and inference (llama_decode, llama_encode, llama_tokenize) remain unmodified. The performance change is isolated to progress reporting, which occurs outside the token generation pipeline.

Power Consumption:

  • llama-tts: 0.321% increase (720 nJ total)
  • llama-cvector-generator: 0.159% increase (350 nJ total)

The power increase reflects the cumulative throughput changes in progress reporting functions. Since progress updates occur infrequently during downloads rather than continuously during inference, the total energy impact per operation remains minimal.

Context:
The 18x response time increase is confined to user-facing progress display during file operations. The added synchronization overhead enables clean multi-threaded progress bars without affecting model inference performance or token generation rates.

@loci-dev loci-dev force-pushed the main branch 2 times, most recently from e4a4e1d to d0b408b on November 30, 2025 at 02:46