Can someone explain the meaning of these timings? #1323
-
The timing print format was changed in #370, if that helps.
-
When it says 38.57 ms per token, it means it evaluated 24 tokens in 925.78 ms. Lower is better. Another common metric is tokens per second, which is used in ML projects other than llama.cpp. You can get that by calculating 1000 / 38.57 = 25.93 t/s. Higher is better.

Why is eval time slower than prompt eval time? Because we can only predict the single next token, so after the sampler chooses the next token the model has to be run again, but with a batch size of 1. In the prompt eval phase, the model can evaluate large batches (512 max), meaning less overhead and more efficiency, especially with BLAS or a GPU. See the sketch below for the arithmetic.

EDIT: "sample time" is not tokenization time; it is actually the "sampling" time, i.e. running the RNG, sorting and filtering candidates, etc.
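As a quick illustration of the conversion, here is a minimal sketch (not code from llama.cpp; the constants are just the example numbers from this thread):

```cpp
#include <cstdio>

int main() {
    // Example values from the timing report discussed above.
    const double eval_ms_per_token = 38.57;  // "eval time" per token
    const int    eval_tokens       = 24;     // tokens generated

    // Total eval time: 24 * 38.57 = 925.68 ms (matches the ~925.78 ms reported).
    const double total_eval_ms = eval_ms_per_token * eval_tokens;

    // Convert ms/token to tokens/second: 1000 / 38.57 = 25.93 t/s.
    const double tokens_per_second = 1000.0 / eval_ms_per_token;

    std::printf("total eval time: %.2f ms\n", total_eval_ms);
    std::printf("throughput:      %.2f tokens/s\n", tokens_per_second);
    return 0;
}
```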
-
What do load / sample / prompt eval / eval / total time mean? What does 0.58 ms per run mean? Why can't I see a timings report like the one the README shows?
This is much clearer! What is the mapping between them? I want to know the predict rate: tokens per ms.