Multi-part GGML files: do they still work? And how hard would it be to modify convert.py to create them? #1503

Closed
TheBloke opened this issue May 17, 2023 · 18 comments


@TheBloke
Contributor

TheBloke commented May 17, 2023

Hi all

Hugging Face has a maximum file size of 50GB, which is a bit annoying. This means it's not possible to upload a q8_0 GGML of a 65B model, or a float16 GGML of a 30B model.

I've had two people ask me to upload q8_0's for my 65B uploads. One of them asked if I could use another file sharing site like Google Drive or something like that. But the other mentioned the possibility of multi-part GGMLs.

I know that llama.cpp used to support multi-part models. It still shows n_parts = 1 in the header info, implying that it might support 2 or more parts as well?

So I'd love to know:

  1. Does llama.cpp still support multi-part GGMLs?
  2. And if so, should it be fairly straightforward to modify convert.py to create one?

Here's the method convert.py uses to write the GGML file:

    @staticmethod
    def write_all(fname_out: Path, params: Params, model: LazyModel, vocab: Vocab) -> None:
        check_vocab_size(params, vocab)
        of = OutputFile(fname_out)
        of.write_file_header(params)
        print("Writing vocab...")
        of.write_vocab(vocab)

        def do_item(item: Tuple[str, LazyTensor]) -> NDArray:
            name, lazy_tensor = item
            return lazy_tensor.load().to_ggml().ndarray

        ndarrays = bounded_parallel_map(do_item, model.items(), concurrency=8)
        for i, ((name, lazy_tensor), ndarray) in enumerate(zip(model.items(), ndarrays)):
            size = ' x '.join(f"{dim:6d}" for dim in lazy_tensor.shape)
            padi = len(str(len(model)))
            print(f"[{i+1:{padi}d}/{len(model)}] Writing tensor {name:38s} | size {size:16} | type {lazy_tensor.data_type}")
            of.write_tensor_header(name, lazy_tensor.shape, lazy_tensor.data_type)
            ndarray.tofile(of.fout)
        of.fout.close()

Would it just be a case of writing the file header twice, putting the first X layers in the first file, and the rest in the other?

What about the vocab - would that go in both files, or only in the first?

Thanks in advance for any info!

@sw
Contributor

sw commented May 17, 2023

The split files were only used because the original LLaMA models came like that, but AFAIK they are no longer supported. Actually, looking again, llama_model_loader still seems to support n_parts > 1.

But you could just split the file by some other means for distribution, using the split command from the GNU coreutils, or an archiver (RAR, ZIP etc.)
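For reference, the plain byte-split would look something like this (file names and chunk size are illustrative):

    # split into 48GB chunks: ggml-model-q8_0.bin.partaa, .partab, ...
    $ split -b 48G ggml-model-q8_0.bin ggml-model-q8_0.bin.part
    # the user re-joins them with cat (or copy /b on Windows):
    $ cat ggml-model-q8_0.bin.part* > ggml-model-q8_0.bin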

@ggerganov
Member

We used to support split files, and maybe we still do (I'm not sure either).

But I agree with @sw - some alternative, common method for splitting a large file independently of ggml would make more sense

@slaren
Member

slaren commented May 17, 2023

The loader should still support multi-part files, so if you are able to hack convert.py to split the models, it could work. Or just find the old converter somewhere in the repository history, but that will prevent mmap from working.
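For anyone attempting that hack, here's a rough, untested sketch of what a multi-part write_all could look like, reusing convert.py's existing OutputFile/Params/LazyModel/Vocab helpers. The part naming (fname, fname.1, ...) and repeating the header and vocab in every part are assumptions that would need verifying against llama_model_loader - in particular, the original multi-part LLaMA files sharded individual tensors across parts rather than distributing whole tensors, so the loader may well reject this layout:

    # Untested sketch: naive multi-part writer for convert.py.
    @staticmethod
    def write_all_multipart(fname_out: Path, params: Params, model: LazyModel,
                            vocab: Vocab, n_parts: int = 2) -> None:
        check_vocab_size(params, vocab)
        items = list(model.items())
        per_part = (len(items) + n_parts - 1) // n_parts  # ceil division
        for part in range(n_parts):
            # assumption: extra parts are found as fname.1, fname.2, ...
            fname = fname_out if part == 0 else Path(f"{fname_out}.{part}")
            of = OutputFile(fname)
            of.write_file_header(params)  # assumption: header repeated per part
            of.write_vocab(vocab)         # assumption: vocab repeated per part
            for name, lazy_tensor in items[part * per_part:(part + 1) * per_part]:
                of.write_tensor_header(name, lazy_tensor.shape, lazy_tensor.data_type)
                lazy_tensor.load().to_ggml().ndarray.tofile(of.fout)
            of.fout.close()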

@LostRuins
Collaborator

Actually, since HF doesn't really enforce model format, uploading it as a multi-part .7z or .rar will probably be best. And as a bonus you might get some compression too.

@TheBloke
Contributor Author

OK, thanks very much everyone for the info.

If the loader does still support multi-part files then when I have some time I'll see about hacking convert.py. It'd be nice to do this "properly", but certainly not the end of the world if I can't. It's only a nice-to-have.

And yeah, if that doesn't work then I'll either do a manual split, which the user can re-join themselves, or else use a compressed archive.

Thanks again.

@LostRuins
Collaborator

@TheBloke one risk of using multi-part files is that they get mixed up, since the formats are prone to change. So if you have a 4-part file and somehow mix a q5_0 part with a q4_1 part, or worse, a q4_1 part from ggjtv2 with a q4_1 part from ggjtv3, it will fail to work and you won't know why.

@TheBloke
Contributor Author

TheBloke commented May 18, 2023

@LostRuins I guess. However I always use separate branches when I do version updates. So eg right now my GGML repos have only ggjtv2 files in main, and the v1 format files are in branch previous_llama. So the user would have to somehow grab one file from one and one from the other.

I guess I'm a little reluctant to use an archive because I have past experience of downloading a couple of models that used them, and it taking forever to uncompress them.

Then again they do have the advantage of it being immediately understandable to all users how to access them. If I split the files I'd have to provide a script to join them, and then check it works on Windows...

Yeah maybe I should just use an archive :)

@EliEron

EliEron commented May 18, 2023

> @LostRuins I guess. However I always use separate branches when I do version updates. So eg right now my GGML repos have only ggjtv2 files in main, and the v1 format files are in branch previous_llama. So the user would have to somehow grab one file from one and one from the other.
>
> I guess I'm a little reluctant to use an archive because I have past experience of downloading a couple of models that used them, and it taking forever to uncompress them.
>
> Then again they do have the advantage of it being immediately understandable to all users how to access them. If I split the files I'd have to provide a script to join them, and then check it works on Windows...
>
> Yeah maybe I should just use an archive :)

It's worth noting that 7z (and probably other archivers) support using no compression when archiving files, referred to as compression level 0 in 7z. In my experience there is essentially no speed penalty when extracting large files that are stored with no compression. So if speed is a big concern then that is the approach I would take.
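For example, with the 7z CLI that would be something like this (archive name illustrative):

    $ 7z a -mx=0 ggml-model-q8_0.7z ggml-model-q8_0.bin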

@TheBloke
Contributor Author

TheBloke commented May 18, 2023

> It's worth noting that 7z (and probably other archivers) support using no compression when archiving files, referred to as compression level 0 in 7z. In my experience there is essentially no speed penalty when extracting large files that are stored with no compression. So if speed is a big concern then that is the approach I would take.

But in this case the whole reason to compress is to reduce the size of the file 😆 The issue is that HF won't store files larger than 50GB, so any files larger than that either need to be split, or compressed, to get under that limit.

But yeah it might be worth experimenting with compression levels to find the optimum one that both reduces the file size below 50GB, while also being as fast as possible to decompress.

@EliEron

EliEron commented May 18, 2023

> > It's worth noting that 7z (and probably other archivers) support using no compression when archiving files, referred to as compression level 0 in 7z. In my experience there is essentially no speed penalty when extracting large files that are stored with no compression. So if speed is a big concern then that is the approach I would take.
>
> But in this case the whole reason to compress is to reduce the size of the file 😆 The issue is that HF won't store files larger than 50GB, so any files larger than that either need to be split, or compressed, to get under that limit.
>
> But yeah it might be worth experimenting with compression levels to find the optimum one that both reduces the file size below 50GB, while also being as fast as possible to decompress.

Oh I was thinking you were just going to use the split archive feature that is natively supported by 7zip/WinRAR itself. Those wouldn't require any external script to split or join the files. But yeah if you don't want split archives at all then compression is unavoidable of course. Sorry for the misunderstanding 😄.
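For instance, 7z can emit the volumes directly at creation time via -v; something like this (sizes and names illustrative):

    # store-only, split into 48GB volumes: .7z.001, .7z.002, ...
    $ 7z a -mx=0 -v48g ggml-model-q8_0.7z ggml-model-q8_0.bin
    # extracting from the first volume reassembles the whole file:
    $ 7z x ggml-model-q8_0.7z.001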

@TheBloke
Contributor Author

Oh! I didn't even think of that! :)

That sounds like the best of all worlds. Then I don't need any script to join the files, it's one simple command for the user to extract the file, and there's no compression slowdown either.

And sorry to you in turn for my misunderstanding - thank you! :)

@Green-Sky
Collaborator

Green-Sky commented May 18, 2023

I could get up to a 20% file size reduction with zstd, but that was a while back, so I don't remember the specifics.

Please just use normal multi-part zips - no need for rar or lzma.

$ zip -s 200 7b-q4_0.zip ggml-model-q4_0.bin
  adding: ggml-model-q4_0.bin (deflated 8%)

You can use -0 for store-only.

$ ll
drwxrwxr-x 2 green green 4.0K May 18 13:37 ./
drwxrwxr-x 7 green green 4.0K May 17 13:12 ../
-rw-rw-r-- 1 green green 200M May 18 13:37 7b-q4_0.z01
-rw-rw-r-- 1 green green 200M May 18 13:34 7b-q4_0.z02
-rw-rw-r-- 1 green green 200M May 18 13:35 7b-q4_0.z03
-rw-rw-r-- 1 green green 200M May 18 13:35 7b-q4_0.z04
-rw-rw-r-- 1 green green 200M May 18 13:35 7b-q4_0.z05
-rw-rw-r-- 1 green green 200M May 18 13:35 7b-q4_0.z06
-rw-rw-r-- 1 green green 200M May 18 13:35 7b-q4_0.z07
-rw-rw-r-- 1 green green 200M May 18 13:35 7b-q4_0.z08
-rw-rw-r-- 1 green green 200M May 18 13:36 7b-q4_0.z09
-rw-rw-r-- 1 green green 200M May 18 13:36 7b-q4_0.z10
-rw-rw-r-- 1 green green 200M May 18 13:36 7b-q4_0.z11
-rw-rw-r-- 1 green green 200M May 18 13:36 7b-q4_0.z12
-rw-rw-r-- 1 green green 200M May 18 13:36 7b-q4_0.z13
-rw-rw-r-- 1 green green 200M May 18 13:37 7b-q4_0.z14
-rw-rw-r-- 1 green green 200M May 18 13:37 7b-q4_0.z15
-rw-rw-r-- 1 green green 200M May 18 13:37 7b-q4_0.z16
-rw-rw-r-- 1 green green 200M May 18 13:37 7b-q4_0.z17
-rw-rw-r-- 1 green green 200M May 18 13:37 7b-q4_0.z18
-rw-rw-r-- 1 green green 100M May 18 13:37 7b-q4_0.zip
-rw-rw-r-- 1 green green 4.0G May 13 13:42 ggml-model-q4_0.bin

@TheBloke
Contributor Author

OK thanks for the test! I agree ZIP is easiest.

@Green-Sky
Collaborator

Yeah, just use -0 - since zip has no multithreading, it takes ages to compress otherwise.

@TheBloke
Contributor Author

TheBloke commented May 25, 2023

So I finally got around to trying this, for Tim Dettmers' Guanaco 65B in q8_0. And it doesn't work?

[pytorch2] root@64c767772631:/workspace/process/TheBloke_guanaco-65B-GGML/ggml # zip -0 -s 49000m zip/guanaco-65B.ggmlv3.q8_0.zip guanaco-65B.ggmlv3.q8_0.bin
  adding: guanaco-65B.ggmlv3.q8_0.bin (stored 0%)

[pytorch2] root@64c767772631:/workspace/process/extract # unzip zip/guanaco-65B.ggmlv3.q8_0.zip
Archive:  ../TheBloke_guanaco-65B-GGML/ggml/zip/guanaco-65B.ggmlv3.q8_0.zip
warning [../TheBloke_guanaco-65B-GGML/ggml/zip/guanaco-65B.ggmlv3.q8_0.zip]:  zipfile claims to be last disk of a multi-part archive;
  attempting to process anyway, assuming all parts have been concatenated
  together in order.  Expect "errors" and warnings...true multi-part support
  doesn't exist yet (coming soon).
file #1:  bad zipfile offset (local header sig):  4

EDIT: figured it out. unzip is trash, I need to use something like 7zip

@TheBloke
Contributor Author

TheBloke commented May 25, 2023

OK! Update: the ZIP is actually fine. I tested with 7z on macOS and it uncompressed it no problem.

And then on Linux I did apt install 7zip and now it works:

[pytorch2] root@64c767772631:/workspace/process/TheBloke_guanaco-65B-GGML/ggml/xtract # 7zz x ../zip/guanaco-65B.ggmlv3.q8_0.zip

7-Zip (z) 21.07 (x64) : Copyright (c) 1999-2021 Igor Pavlov : 2021-12-26
 64-bit locale=C.UTF-8 Threads:48

Scanning the drive for archives:
1 file, 15629462861 bytes (15 GiB)

Extracting archive: ../zip/guanaco-65B.ggmlv3.q8_0.zip
--
Path = ../zip/guanaco-65B.ggmlv3.q8_0.zip
Type = zip
Physical Size = 15629462861
Embedded Stub Size = 4
64-bit = +
Characteristics = Zip64
Total Physical Size = 67009686861
Multivolume = +
Volume Index = 1
Volumes = 2

  2% - guanaco-65B.ggmlv3.q8_0.bin

So I guess unzip is just old trash :)

Panic over!
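(For anyone else who hits this: Info-ZIP's zip should also be able to rejoin a split archive into a single file that old unzip can read, along the lines of the following - worth testing before relying on it:)

    $ zip -s 0 guanaco-65B.ggmlv3.q8_0.zip --out joined.zip
    $ unzip joined.zip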

@TheBloke
Contributor Author

TheBloke commented May 26, 2023

Finally got a 65B q8_0 uploaded :) Thanks again for the ideas!


I'd still love to do this natively in GGML sometime, with a two-part GGML. But for now this is fine and much better than not uploading a q8_0.

@Green-Sky
Collaborator

Green-Sky commented May 26, 2023

> So I guess unzip is just old trash :)

Personally I use unar, which e.g. delegates to 7z for .7z files, etc., so I don't have to choose the right tool 😄
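For example (filename illustrative):

    $ unar guanaco-65B.ggmlv3.q8_0.zip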
