-
AFAIK, all the research says that quantizing below 4 bits degrades the results too much. There are values within each block (called outliers) that can't be quantized below some threshold without breaking the model.
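A minimal sketch (my own toy example, not llama.cpp's actual quantization format) of why those outliers matter: with simple symmetric absmax block quantization, one large value in a block widens the shared scale, so at 3 or 2 bits the remaining weights collapse onto very few levels.

```python
import numpy as np

def quantize_block(block: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric absmax quantization of one block, then dequantize."""
    levels = 2 ** (bits - 1) - 1           # 7 for 4-bit, 3 for 3-bit, 1 for 2-bit
    scale = np.abs(block).max() / levels   # one scale shared by the whole block
    return np.clip(np.round(block / scale), -levels, levels) * scale

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=32)   # typical small weights in one block
with_outlier = weights.copy()
with_outlier[0] = 0.5                      # a single outlier stretches the scale

for bits in (4, 3, 2):
    err_plain = np.abs(quantize_block(weights, bits) - weights).mean()
    # error on the *other* 31 values once the outlier has widened the scale
    err_outlier = np.abs(quantize_block(with_outlier, bits) - with_outlier)[1:].mean()
    print(f"{bits}-bit  no outlier: {err_plain:.4f}   with outlier: {err_outlier:.4f}")
```

The exact numbers don't matter; the point is that one large value forces a coarse shared scale for the whole block, and the damage gets worse as the bit width drops.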
-
My Lenovo ThinkPad laptop has an NVIDIA GA107M (GeForce RTX 3050 Mobile) GPU with 4 GB of VRAM. The 4-bit quantized version of the 7B model with Python/torch requires 6 GB of VRAM. Is there any chance that a 3-bit or 2-bit quantized version of the 7B model would fit in 4 GB? llama.cpp works on my laptop, but I wonder how much faster the GPU would be. (I would also have to figure out how to get Fedora 37 X.org to run solely on the Intel GPU and not touch the NVIDIA GPU. I have X configured to use the Intel GPU as the primary, but it still allocates almost all of the VRAM on the NVIDIA GPU.)
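For rough sizing, here is a back-of-the-envelope weight-only estimate (it ignores per-block scales, the KV cache, activations, and runtime overhead, which is part of why the real 4-bit torch build needs ~6 GB rather than the naive figure below):

```python
# Back-of-the-envelope weight-memory estimate for a 7B model.
# Assumes exactly 7e9 parameters and counts only the packed weights;
# per-block scales, KV cache, activations, and framework overhead
# all push the real footprint noticeably higher.
PARAMS = 7e9

for bits in (4, 3, 2):
    gib = PARAMS * bits / 8 / 2**30
    print(f"{bits}-bit weights alone: ~{gib:.1f} GiB")
# 4-bit: ~3.3 GiB, 3-bit: ~2.4 GiB, 2-bit: ~1.6 GiB
```

So on paper 3-bit or 2-bit weights would fit in 4 GB, but the KV cache, activations, and whatever the desktop already holds on the card eat into that margin quickly.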
-
I'm not sure that's possible with the current setup. I wasn't successful at 2-bit quantizing the 7B model; it actually came out bigger than the 4-bit result. There has been a little bit of discussion about going smaller here: https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and