Conversation

@DamonFool
Contributor

Thanks to #16547, we can enable Hexagon NPUs in llama.cpp.

When running inference on CPU & NPU together, llama.cpp was observed in our experiments to randomly hang forever in ggml_barrier.

The situation is:

compute graph split n    ---> scheduled to NPU with 1 active worker thread
     |
     v
compute graph split n+1  ---> scheduled to CPU with 6 active worker threads
     |  (only 5 worker threads wake up; the remaining thread goes back to sleep)
     v
 hang forever in ggml_barrier (the barrier expects 6 workers to finish, but only 5 arrive, so it hangs forever)

Just imagine that graph split n was scheduled to the NPU with 1 active worker thread, and graph split n+1 was then scheduled to the CPU with 6 active worker threads.
However, only 5 worker threads wake up; the remaining thread goes back to sleep because the ggml_graph_compute_thread_active check fails.

Why does the ggml_graph_compute_thread_active check fail?
Because the thread reads a stale value of n_threads_cur: 1, when it should be 6.

// check if thread is active
static inline bool ggml_graph_compute_thread_active(struct ggml_compute_state * state) {
    struct ggml_threadpool * threadpool = state->threadpool;
    int n_threads = atomic_load_explicit(&threadpool->n_threads_cur, memory_order_relaxed);  // <--- may read the stale value n_threads_cur = 1, which should be 6
    return (state->ith < n_threads);
}

The case can be reduced to the following code.

main-thread: 
  atomic_store_explicit(&threadpool->n_threads_cur, g_n_threads, memory_order_relaxed);
  atomic_fetch_add_explicit(&threadpool->n_graph, 1, memory_order_seq_cst);

worker-thread:
  int local_n_graph = atomic_load_explicit(&threadpool->n_graph, memory_order_relaxed);
  int local_n_threads = atomic_load_explicit(&threadpool->n_threads_cur, memory_order_relaxed);

On systems with a weak memory model, the worker thread may read a stale value of n_threads_cur (not the value the main thread just stored) when both loads use memory_order_relaxed.

The suggested fix is to use a store-release/load-acquire pattern.
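Applied to the pseudocode above, the suggested pattern looks like this (a sketch only; the actual patch may differ in detail):

main-thread:
  atomic_store_explicit(&threadpool->n_threads_cur, g_n_threads, memory_order_relaxed);
  atomic_fetch_add_explicit(&threadpool->n_graph, 1, memory_order_seq_cst);  // seq_cst includes release semantics

worker-thread:
  int local_n_graph = atomic_load_explicit(&threadpool->n_graph, memory_order_acquire);
  int local_n_threads = atomic_load_explicit(&threadpool->n_threads_cur, memory_order_relaxed);
  // if the acquire load observed the increment of n_graph, the earlier
  // store to n_threads_cur is guaranteed to be visible here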

@DamonFool DamonFool requested a review from ggerganov as a code owner November 26, 2025 06:47
@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Nov 26, 2025
@jeffbolznv
Collaborator

This is because the thread read the old value of n_threads_cur as 1 (which should be 6).

This explanation/fix doesn't make sense to me. If n_threads_cur is atomic, then changing relaxed to release/acquire only affects the consistency of other values that are synchronized by the store/release pair. It doesn't affect the atomicity or ordering of n_threads_cur itself.

Is the issue about the ordering of n_threads_cur vs n_graph?

@max-krasnyansky
Collaborator

@DamonFool
Please share the command you run.
In particular the number of threads used for the CPU backend and which device you used for the test.

I bet the issue is simply that you have too many threads. The OS/device scheduler probably offlined some of the CPU cores for power saving and there are not enough active cores to actively run all threads.

Try reducing the number of threads. My guess is that you're using the default 6, and maybe 4 is better on your device.
Also, if you're using the scripts/snapdragon/adb/run-... scripts, you might want to disable the cpu-mask settings or bump the priority with --prio 1. In other words, play with those settings to find a good combo.

I agree with @jeffbolznv that making those atomic reads more strict is not a "fix".
Those are supposed to be relaxed by design.

@DamonFool
Contributor Author

Is the issue about the ordering of n_threads_cur vs n_graph?

See the following code

main-thread: 
  atomic_store_explicit(&threadpool->n_threads_cur, g_n_threads, memory_order_relaxed);
  atomic_fetch_add_explicit(&threadpool->n_graph, 1, memory_order_seq_cst);

worker-thread:
  int local_n_graph = atomic_load_explicit(&threadpool->n_graph, memory_order_relaxed);
  int local_n_threads = atomic_load_explicit(&threadpool->n_threads_cur, memory_order_relaxed);

Although n_graph is updated after n_threads_cur in the main thread, the worker thread may still read the updated value of n_graph together with a stale value of n_threads_cur when both loads are relaxed.
However, if we change relaxed to acquire in the worker thread, the acquire load of n_graph pairs with the main thread's seq_cst increment (which includes release semantics), so the worker thread is guaranteed to read the updated value of n_threads_cur whenever it has read the updated value of n_graph.

So before this patch, the 6 worker threads may behave like this:

thread 0: the main thread
thread 1:  read the updated n_graph, read n_threads_cur = 6
thread 2:  read the updated n_graph, read n_threads_cur = 6
thread 3:  read the updated n_graph, read n_threads_cur = 1    <=== got a stale value of n_threads_cur; the ggml_graph_compute_thread_active check fails and the thread falls asleep immediately
thread 4:  read the updated n_graph, read n_threads_cur = 6
thread 5:  read the updated n_graph, read n_threads_cur = 6

After this patch, every worker thread is guaranteed to read n_threads_cur = 6 once it has read the updated n_graph.

@DamonFool
Contributor Author

I bet the issue is simply that you have too many threads. The OS/device scheduler probably offlined some of the CPU cores for power saving and there are not enough active cores to actively run all threads.

No, I don't think so.

Before this patch, we could reproduce the hang in less than five minutes.
After this patch, we have tested for more than 6 hours with no hang.

@DamonFool
Contributor Author

Please share the command you run.
In particular the number of threads used for the CPU backend and which device you used for the test.

@max-krasnyansky
The following is the running script.

export LD_LIBRARY_PATH=/data/local/tmp/llama.cpp/./lib
export ADSP_LIBRARY_PATH=/data/local/tmp/llama.cpp/./lib          
export GGML_HEXAGON_NDEV=1          
export GGML_HEXAGON_VERBOSE=0

/data/local/tmp/llama.cpp/bin/llama-cli \
    -m /data/local/tmp/gguf/hy1.8b_Q8_0.gguf \
    --no-mmap \
    --poll 1000 \
    -t 6 \
    --cpu-mask 0xfc \
    --cpu-strict 1 \
    --ctx-size 8192 \
    --batch-size 128 \
    -ctk q8_0 -ctv q8_0 \
    -fa on \
    -ngl 99 \
    --device HTP0  \
    --no-warmup \
    -no-cnv -p "你好"

You can get the model here https://huggingface.co/tencent/Hunyuan-1.8B-Instruct .

The device is Snapdragon 8 Gen 3, with Hexagon Arch version v75.

@jeffbolznv
Collaborator

Thanks for the details, the change makes sense to me now.

@DamonFool
Contributor Author

Thanks for the details, the change makes sense to me now.

Thanks @jeffbolznv for your review.

@DamonFool
Contributor Author

Hi @max-krasnyansky, may I ask whether you can reproduce the hang in your environment?
Thanks.
