Multi-part GGML files: do they still work? And how hard would it be to modify convert.py to create them? #1503

Closed
TheBloke opened this issue May 17, 2023 · 18 comments


@TheBloke
Contributor

TheBloke commented May 17, 2023

Hi all

Hugging Face has a maximum file size of 50GB, which is a bit annoying. This means it's not possible to upload a q8_0 GGML of a 65B model, or a float16 GGML of a 30B model.

I've had two people ask me to upload q8_0's for my 65B uploads. One of them asked if I could use another file sharing site like Google Drive or something like that. But the other mentioned the possibility of multi-part GGMLs.

I know that llama.cpp used to support multi-part models. It still shows n_parts = 1 in the header info, implying that it might support 2 or more parts as well?

So I'd love to know:

  1. Does llama.cpp still support multi-part GGMLs?
  2. And if so, should it be fairly straightforward to modify convert.py to create one?

Here's the method convert.py uses to write the GGML file:

    @staticmethod
    def write_all(fname_out: Path, params: Params, model: LazyModel, vocab: Vocab) -> None:
        check_vocab_size(params, vocab)
        of = OutputFile(fname_out)
        of.write_file_header(params)
        print("Writing vocab...")
        of.write_vocab(vocab)

        def do_item(item: Tuple[str, LazyTensor]) -> NDArray:
            name, lazy_tensor = item
            return lazy_tensor.load().to_ggml().ndarray

        ndarrays = bounded_parallel_map(do_item, model.items(), concurrency=8)
        for i, ((name, lazy_tensor), ndarray) in enumerate(zip(model.items(), ndarrays)):
            size = ' x '.join(f"{dim:6d}" for dim in lazy_tensor.shape)
            padi = len(str(len(model)))
            print(f"[{i+1:{padi}d}/{len(model)}] Writing tensor {name:38s} | size {size:16} | type {lazy_tensor.data_type}")
            of.write_tensor_header(name, lazy_tensor.shape, lazy_tensor.data_type)
            ndarray.tofile(of.fout)
        of.fout.close()

Would it just be a case of writing the file header twice, putting the first X layers in the first file, and the rest in the other?

What about the vocab - would that go in both files, or only in the first?

Thanks in advance for any info!

@sw
Contributor

sw commented May 17, 2023

The split files were only used because the original LLaMA models came like that, but AFAIK they are no longer supported. Actually, looking again, llama_model_loader still seems to support n_parts > 1.

But you could just split the file by some other means for distribution, using the split command from the GNU coreutils, or an archiver (RAR, ZIP etc.)
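For reference, the plain byte-split would look something like this (file names and chunk size are illustrative):

    # split into 48GB chunks: ggml-model-q8_0.bin.partaa, .partab, ...
    $ split -b 48G ggml-model-q8_0.bin ggml-model-q8_0.bin.part
    # the user re-joins them with cat (or copy /b on Windows):
    $ cat ggml-model-q8_0.bin.part* > ggml-model-q8_0.bin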

@ggerganov
Member

We used to support split files, and maybe we still do (I'm not sure either).

But I agree with @sw - some alternative, common method for splitting a large file independently of ggml would make more sense

@slaren
Member

slaren commented May 17, 2023

The loader should still support multi-part files, so if you are able to hack convert.py to split the models, it could work. Or just find the old converter somewhere in the repository history, but that will prevent mmap from working.
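For anyone attempting that hack, here's a rough, untested sketch of what a multi-part write_all could look like, reusing convert.py's existing OutputFile/Params/LazyModel/Vocab helpers. The part naming (fname, fname.1, ...) and repeating the header and vocab in every part are assumptions that would need verifying against llama_model_loader - in particular, the original multi-part LLaMA files sharded individual tensors across parts rather than distributing whole tensors, so the loader may well reject this layout:

    # Untested sketch: naive multi-part writer for convert.py.
    @staticmethod
    def write_all_multipart(fname_out: Path, params: Params, model: LazyModel,
                            vocab: Vocab, n_parts: int = 2) -> None:
        check_vocab_size(params, vocab)
        items = list(model.items())
        per_part = (len(items) + n_parts - 1) // n_parts  # ceil division
        for part in range(n_parts):
            # assumption: extra parts are found as fname.1, fname.2, ...
            fname = fname_out if part == 0 else Path(f"{fname_out}.{part}")
            of = OutputFile(fname)
            of.write_file_header(params)  # assumption: header repeated per part
            of.write_vocab(vocab)         # assumption: vocab repeated per part
            for name, lazy_tensor in items[part * per_part:(part + 1) * per_part]:
                of.write_tensor_header(name, lazy_tensor.shape, lazy_tensor.data_type)
                lazy_tensor.load().to_ggml().ndarray.tofile(of.fout)
            of.fout.close()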

@LostRuins
Collaborator

Actually, since HF doesn't really enforce model format, uploading it as a multi-part .7z or .rar will probably be best. And as a bonus you might get some compression too.

@TheBloke
Contributor Author

OK, thanks very much everyone for the info.

If the loader does still support multi-part files then when I have some time I'll see about hacking convert.py. It'd be nice to do this "properly", but certainly not the end of the world if I can't. It's only a nice-to-have.

And yeah, if that doesn't work then I'll either do a manual split, which the user can re-join themselves, or else use a compressed archive.

Thanks again.

@LostRuins
Collaborator

@TheBloke one risk of using multi-part files is that they get mixed up, since the formats are prone to change. So if you have a 4-part file and somehow mix a q5_0 part with a q4_1 part, or worse, a q4_1 part from ggjtv2 with a q4_1 part from ggjtv3, it will fail to work and you won't know why.

@TheBloke
Contributor Author

TheBloke commented May 18, 2023

@LostRuins I guess. However I always use separate branches when I do version updates. So eg right now my GGML repos have only ggjtv2 files in main, and the v1 format files are in branch previous_llama. So the user would have to somehow grab one file from one and one from the other.

I guess I'm a little reluctant to use an archive because I have past experience of downloading a couple of models that used them, and it taking forever to uncompress them.

Then again they do have the advantage of it being immediately understandable to all users how to access them. If I split the files I'd have to provide a script to join them, and then check it works on Windows...

Yeah maybe I should just use an archive :)

@EliEron

EliEron commented May 18, 2023

> @LostRuins I guess. However I always use separate branches when I do version updates. So eg right now my GGML repos have only ggjtv2 files in main, and the v1 format files are in branch previous_llama. So the user would have to somehow grab one file from one and one from the other.
>
> I guess I'm a little reluctant to use an archive because I have past experience of downloading a couple of models that used them, and it taking forever to uncompress them.
>
> Then again they do have the advantage of it being immediately understandable to all users how to access them. If I split the files I'd have to provide a script to join them, and then check it works on Windows...
>
> Yeah maybe I should just use an archive :)

It's worth noting that 7z (and probably other archivers) support using no compression when archiving files, referred to as compression level 0 in 7z. In my experience there is essentially no speed penalty when extracting large files that are stored with no compression. So if speed is a big concern then that is the approach I would take.
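For example, with the 7z CLI that would be something like this (archive name illustrative):

    $ 7z a -mx=0 ggml-model-q8_0.7z ggml-model-q8_0.bin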

@TheBloke
Contributor Author

TheBloke commented May 18, 2023

> It's worth noting that 7z (and probably other archivers) support using no compression when archiving files, referred to as compression level 0 in 7z. In my experience there is essentially no speed penalty when extracting large files that are stored with no compression. So if speed is a big concern then that is the approach I would take.

But in this case the whole reason to compress is to reduce the size of the file 😆 The issue is that HF won't store files larger than 50GB, so any files larger than that either need to be split, or compressed, to get under that limit.

But yeah it might be worth experimenting with compression levels to find the optimum one that both reduces the file size below 50GB, while also being as fast as possible to decompress.

@EliEron

EliEron commented May 18, 2023

> > It's worth noting that 7z (and probably other archivers) support using no compression when archiving files, referred to as compression level 0 in 7z. In my experience there is essentially no speed penalty when extracting large files that are stored with no compression. So if speed is a big concern then that is the approach I would take.
>
> But in this case the whole reason to compress is to reduce the size of the file 😆 The issue is that HF won't store files larger than 50GB, so any files larger than that either need to be split, or compressed, to get under that limit.
>
> But yeah it might be worth experimenting with compression levels to find the optimum one that both reduces the file size below 50GB, while also being as fast as possible to decompress.

Oh I was thinking you were just going to use the split archive feature that is natively supported by 7zip/WinRAR itself. Those wouldn't require any external script to split or join the files. But yeah if you don't want split archives at all then compression is unavoidable of course. Sorry for the misunderstanding 😄.
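For instance, 7z can emit the volumes directly at creation time via -v; something like this (sizes and names illustrative):

    # store-only, split into 48GB volumes: .7z.001, .7z.002, ...
    $ 7z a -mx=0 -v48g ggml-model-q8_0.7z ggml-model-q8_0.bin
    # extracting from the first volume reassembles the whole file:
    $ 7z x ggml-model-q8_0.7z.001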

@TheBloke
Contributor Author

Oh! I didn't even think of that! :)

That sounds like the best of all worlds. Then I don't need any script to join the files, it's one simple command for the user to extract the file, and there's no compression slowdown either.

And sorry to you in turn for my misunderstanding - thank you! :)

@Green-Sky
Collaborator

Green-Sky commented May 18, 2023

I could get up to a 20% file size reduction with zstd, but that was a while back, so I don't remember the specifics.

Please just use normal multi-part zips - no need for rar or lzma.

$ zip -s 200 7b-q4_0.zip ggml-model-q4_0.bin
  adding: ggml-model-q4_0.bin (deflated 8%)

You can use -0 for store-only.

$ ll
drwxrwxr-x 2 green green 4.0K May 18 13:37 ./
drwxrwxr-x 7 green green 4.0K May 17 13:12 ../
-rw-rw-r-- 1 green green 200M May 18 13:37 7b-q4_0.z01
-rw-rw-r-- 1 green green 200M May 18 13:34 7b-q4_0.z02
-rw-rw-r-- 1 green green 200M May 18 13:35 7b-q4_0.z03
-rw-rw-r-- 1 green green 200M May 18 13:35 7b-q4_0.z04
-rw-rw-r-- 1 green green 200M May 18 13:35 7b-q4_0.z05
-rw-rw-r-- 1 green green 200M May 18 13:35 7b-q4_0.z06
-rw-rw-r-- 1 green green 200M May 18 13:35 7b-q4_0.z07
-rw-rw-r-- 1 green green 200M May 18 13:35 7b-q4_0.z08
-rw-rw-r-- 1 green green 200M May 18 13:36 7b-q4_0.z09
-rw-rw-r-- 1 green green 200M May 18 13:36 7b-q4_0.z10
-rw-rw-r-- 1 green green 200M May 18 13:36 7b-q4_0.z11
-rw-rw-r-- 1 green green 200M May 18 13:36 7b-q4_0.z12
-rw-rw-r-- 1 green green 200M May 18 13:36 7b-q4_0.z13
-rw-rw-r-- 1 green green 200M May 18 13:37 7b-q4_0.z14
-rw-rw-r-- 1 green green 200M May 18 13:37 7b-q4_0.z15
-rw-rw-r-- 1 green green 200M May 18 13:37 7b-q4_0.z16
-rw-rw-r-- 1 green green 200M May 18 13:37 7b-q4_0.z17
-rw-rw-r-- 1 green green 200M May 18 13:37 7b-q4_0.z18
-rw-rw-r-- 1 green green 100M May 18 13:37 7b-q4_0.zip
-rw-rw-r-- 1 green green 4.0G May 13 13:42 ggml-model-q4_0.bin

@TheBloke
Contributor Author

OK thanks for the test! I agree ZIP is easiest.

@Green-Sky
Collaborator

Yeah, just use -0 - since zip has no multithreading, it takes ages to compress otherwise.

@TheBloke
Contributor Author

TheBloke commented May 25, 2023

So I finally got around to trying this, for Tim Dettmers' Guanaco 65B in q8_0. And it doesn't work?

[pytorch2] root@64c767772631:/workspace/process/TheBloke_guanaco-65B-GGML/ggml # zip -0 -s 49000m zip/guanaco-65B.ggmlv3.q8_0.zip guanaco-65B.ggmlv3.q8_0.bin
  adding: guanaco-65B.ggmlv3.q8_0.bin (stored 0%)

[pytorch2] root@64c767772631:/workspace/process/extract # unzip zip/guanaco-65B.ggmlv3.q8_0.zip
Archive:  ../TheBloke_guanaco-65B-GGML/ggml/zip/guanaco-65B.ggmlv3.q8_0.zip
warning [../TheBloke_guanaco-65B-GGML/ggml/zip/guanaco-65B.ggmlv3.q8_0.zip]:  zipfile claims to be last disk of a multi-part archive;
  attempting to process anyway, assuming all parts have been concatenated
  together in order.  Expect "errors" and warnings...true multi-part support
  doesn't exist yet (coming soon).
file #1:  bad zipfile offset (local header sig):  4

EDIT: figured it out. unzip is trash, I need to use something like 7zip

@TheBloke
Contributor Author

TheBloke commented May 25, 2023

OK! Update: the ZIP is actually fine. I tested with 7z on macOS and it uncompressed it no problem.

And then on Linux I did apt install 7zip and now it works:

[pytorch2] root@64c767772631:/workspace/process/TheBloke_guanaco-65B-GGML/ggml/xtract # 7zz x ../zip/guanaco-65B.ggmlv3.q8_0.zip

7-Zip (z) 21.07 (x64) : Copyright (c) 1999-2021 Igor Pavlov : 2021-12-26
 64-bit locale=C.UTF-8 Threads:48

Scanning the drive for archives:
1 file, 15629462861 bytes (15 GiB)

Extracting archive: ../zip/guanaco-65B.ggmlv3.q8_0.zip
--
Path = ../zip/guanaco-65B.ggmlv3.q8_0.zip
Type = zip
Physical Size = 15629462861
Embedded Stub Size = 4
64-bit = +
Characteristics = Zip64
Total Physical Size = 67009686861
Multivolume = +
Volume Index = 1
Volumes = 2

  2% - guanaco-65B.ggmlv3.q8_0.bin

So I guess unzip is just old trash :)

Panic over!
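(For anyone else who hits this: Info-ZIP's zip should also be able to rejoin a split archive into a single file that old unzip can read, along the lines of the following - worth testing before relying on it:)

    $ zip -s 0 guanaco-65B.ggmlv3.q8_0.zip --out joined.zip
    $ unzip joined.zip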

@TheBloke
Contributor Author

TheBloke commented May 26, 2023

Finally got a 65B q8_0 uploaded :) Thanks again for the ideas!


I'd still love to do this natively in GGML sometime, with a two-part GGML. But for now this is fine and much better than not uploading a q8_0.

@Green-Sky
Collaborator

Green-Sky commented May 26, 2023

> So I guess unzip is just old trash :)

Personally I use unar, which e.g. delegates to 7z for .7z files, etc., so I don't have to choose the right tool 😄
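For example (filename illustrative):

    $ unar guanaco-65B.ggmlv3.q8_0.zip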
