-
AFAIK, all the research says that quantizing below 4 bits degrades the results too much. There are values within each block (called outliers) that can't be quantized below some threshold without breaking the model.
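A minimal sketch (my own toy example, not llama.cpp's actual quantization format) of why those outliers matter: with simple symmetric absmax block quantization, one large value in a block widens the shared scale, so at 3 or 2 bits the remaining weights collapse onto very few levels.

```python
import numpy as np

def quantize_block(block: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric absmax quantization of one block, then dequantize."""
    levels = 2 ** (bits - 1) - 1           # 7 for 4-bit, 3 for 3-bit, 1 for 2-bit
    scale = np.abs(block).max() / levels   # one scale shared by the whole block
    return np.clip(np.round(block / scale), -levels, levels) * scale

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=32)   # typical small weights in one block
with_outlier = weights.copy()
with_outlier[0] = 0.5                      # a single outlier stretches the scale

for bits in (4, 3, 2):
    err_plain = np.abs(quantize_block(weights, bits) - weights).mean()
    # error on the *other* 31 values once the outlier has widened the scale
    err_outlier = np.abs(quantize_block(with_outlier, bits) - with_outlier)[1:].mean()
    print(f"{bits}-bit  no outlier: {err_plain:.4f}   with outlier: {err_outlier:.4f}")
```

The exact numbers don't matter; the point is that one large value forces a coarse shared scale for the whole block, and the damage gets worse as the bit width drops.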
-
My Lenovo ThinkPad laptop has an NVIDIA GA107M (GeForce RTX 3050 Mobile) GPU with 4 GB of VRAM. The 4-bit quantized version of the 7B model with Python/torch requires 6 GB of VRAM. Is there any chance that a 3-bit or 2-bit quantized version of the 7B model would fit in 4 GB? llama.cpp works on my laptop, but I wonder how much faster the GPU would be. (I would also have to figure out how to get Fedora 37 X.org to run solely on the Intel GPU and not touch the NVIDIA GPU. I have X configured to use the Intel GPU as the primary, but it still allocates almost all of the VRAM on the NVIDIA GPU.)
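For rough sizing, here is a back-of-the-envelope weight-only estimate (it ignores per-block scales, the KV cache, activations, and runtime overhead, which is part of why the real 4-bit torch build needs ~6 GB rather than the naive figure below):

```python
# Back-of-the-envelope weight-memory estimate for a 7B model.
# Assumes exactly 7e9 parameters and counts only the packed weights;
# per-block scales, KV cache, activations, and framework overhead
# all push the real footprint noticeably higher.
PARAMS = 7e9

for bits in (4, 3, 2):
    gib = PARAMS * bits / 8 / 2**30
    print(f"{bits}-bit weights alone: ~{gib:.1f} GiB")
# 4-bit: ~3.3 GiB, 3-bit: ~2.4 GiB, 2-bit: ~1.6 GiB
```

So on paper 3-bit or 2-bit weights would fit in 4 GB, but the KV cache, activations, and whatever the desktop already holds on the card eat into that margin quickly.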
-
I'm not sure that's possible with the current setup. I wasn't successful at 2-bit quantizing the 7B model; it actually came out bigger than the 4-bit result. There has been a little bit of discussion about going smaller here: https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and