I built with cuBLAS, quantized my 7B model to q4_0, and offloaded all of the model's layers to the GPU with ./main. Compute clearly happens on the GPU and about 4 GB of VRAM is in use, yet about 4 GB of CPU memory also stays allocated and is never released.

Is this the intended behavior? Are the weights offloaded directly to the GPU, or loaded into CPU RAM first and then copied to VRAM? And if the latter, why isn't the CPU memory released, or at least released promptly?

I also tried the server/chat.sh program built with cuBLAS, and there the CPU memory is released shortly after the server is up and running.

Please help me understand.
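To make the question concrete, here is the two-step staging pattern I suspect is happening. This is a hypothetical sketch with made-up names, not llama.cpp's actual loader code:

```cpp
// Hypothetical sketch of the "stage in host RAM, then copy to VRAM" pattern;
// names are made up for illustration, this is not llama.cpp's loader code.
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t n = 4ull << 30; // ~4 GB of q4_0 weights (illustrative size)

    // 1. Weights land in host RAM first (via read() or mmap() of the model file).
    void *host_weights = malloc(n);
    if (!host_weights) return 1;

    // 2. They are copied to device memory for GPU compute.
    void *dev_weights = nullptr;
    cudaMalloc(&dev_weights, n);
    cudaMemcpy(dev_weights, host_weights, n, cudaMemcpyHostToDevice);

    // 3. Unless the host copy is freed here (or the mapping is file-backed
    //    page cache the kernel can reclaim), the process keeps holding ~4 GB
    //    of RAM even though compute now runs entirely on the GPU.
    free(host_weights);

    cudaFree(dev_weights);
    return 0;
}
```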
How should I know? Which OS, git revision, and CLI arguments are you using, and what method are you even using to determine whether or not the memory has been released?
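The measurement matters because plain RSS can be misleading. A quick sketch, assuming Linux (not part of this repo): /proc/<pid>/status splits resident memory into RssAnon (heap allocations) and RssFile (file-backed pages). An mmap'd model file counts toward RssFile, which is page cache the kernel reclaims lazily, so the memory can look "not released" even when nothing is wrong:

```cpp
// Sketch: print the Rss* breakdown for a process (default: this process).
// Usage: ./rss_breakdown [pid]
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char **argv) {
    std::string pid = argc > 1 ? argv[1] : "self";
    std::ifstream f("/proc/" + pid + "/status");
    for (std::string line; std::getline(f, line); )
        if (line.find("Rss") != std::string::npos) // VmRSS, RssAnon, RssFile, RssShmem
            std::cout << line << '\n';
    return 0;
}
```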