
How to identify LZMA files

leafmoon
2008-08-25
2020-12-30
  • leafmoon

    leafmoon - 2008-08-25

    Hello,

    I'm making a tar.lzma extractor using the LZMA SDK.
    But I have a question: how can I identify LZMA files?
    LZMA files do not have a signature like those of gzip or bzip2, so identifying them is a little difficult, I think.

    My current method is: decompress the first 64 KB (OUT_BUF_SIZE in LzmaUtil.c) and check whether the decompression fails...
    Any advice is welcome.

     
    • Igor Pavlov

      Igor Pavlov - 2008-08-25

      1) byte 0 < 5 * 5 * 9 (LZMA properties)
      2) bytes 1 - 4 = UInt32 (LZMA dictionary size) = (1 << N) in 99% cases (maybe more than 99%).
      3) bytes 5 - 12 = Int64 = -1 or uncompressed size (can not be too big in most cases).
      4) byte 13 = 0 always.
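
The four checks above can be sketched in a few lines of Python. This is only an illustration of the heuristic, not code from the LZMA SDK or 7-Zip: the function name `looks_like_lzma_header` and the cutoff `MAX_PLAUSIBLE_SIZE` are assumptions made up for the example (the post only says the size "can not be too big"), and this version accepts only (1 << N) dictionary sizes.

```python
import struct

# Hypothetical cutoff for "not too big"; the thread does not give a number.
MAX_PLAUSIBLE_SIZE = 1 << 38  # 256 GiB

def looks_like_lzma_header(buf: bytes) -> bool:
    """Apply the four header checks to the first 14 bytes of a candidate file."""
    if len(buf) < 14:
        return False
    props = buf[0]
    (dict_size,) = struct.unpack_from("<I", buf, 1)      # bytes 1-4, little-endian UInt32
    (unpacked_size,) = struct.unpack_from("<q", buf, 5)  # bytes 5-12, little-endian Int64
    # 1) properties byte must be below 5 * 5 * 9 = 225
    if props >= 5 * 5 * 9:
        return False
    # 2) dictionary size must be a power of two (1 << N)
    if dict_size == 0 or dict_size & (dict_size - 1):
        return False
    # 3) size is -1 (unknown) or a plausible non-negative value
    if unpacked_size != -1 and not (0 <= unpacked_size <= MAX_PLAUSIBLE_SIZE):
        return False
    # 4) byte 13 (first byte of the compressed stream) must be 0
    return buf[13] == 0
```

A gzip file, for example, fails check 2, because bytes 1 - 4 of a gzip header do not form a power of two.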

       
      • spaglia

        spaglia - 2008-08-25

        Igor,

        I noticed that the first byte output by CEncoder::Code() is always 0. Is this by design? What does 0 mean?

         
    • leafmoon

      leafmoon - 2008-08-25

      Then how does 7z.exe identify LZMA files?
      "99%" is not complete...

       
      • Igor Pavlov

        Igor Pavlov - 2008-08-25

        7-Zip checks dictSize: (1 << N) and (3 << N) are accepted.
        The .lzma format allows any dictSize, but my original lzma program always uses (1 << N).
        I suppose you can also compare dictSize with (1 << N) and (3 << N).
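
A minimal sketch of that dictSize comparison in Python, assuming nothing beyond what is said above: the value must be (1 << N) or (3 << N). The name `plausible_dict_size` is made up for illustration, and whether 7-Zip restricts the range of N is not stated in this thread, so no range is enforced here other than fitting in a UInt32.

```python
def plausible_dict_size(dict_size: int) -> bool:
    """Return True if dict_size equals (1 << N) or (3 << N) for some N >= 0."""
    if not 0 < dict_size < (1 << 32):
        return False
    # Divide out the lowest set bit, i.e. strip all trailing zero bits.
    odd_part = dict_size // (dict_size & -dict_size)
    # What remains is 1 for (1 << N) and 3 for (3 << N).
    return odd_part in (1, 3)
```

For example, 8 MiB (1 << 23) and 12 MiB (3 << 22) pass, while 5 MiB (5 << 20) does not.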

         
    • leafmoon

      leafmoon - 2008-08-25

      Your method can sometimes mis-identify non-LZMA files as LZMA, can't it?

       
      • Igor Pavlov

        Igor Pavlov - 2008-08-25

        Maybe, but the probability of that is low.

         
    • Igor Pavlov

      Igor Pavlov - 2008-08-25

      BTW, why don't you use the .lzma file extension for identification?

       
      • Ricardo Amanda

        Ricardo Amanda - 2020-12-30

        yes, it looks like the best option, thanks and keep up the good work

         
    • leafmoon

      leafmoon - 2008-08-25

      Umm... what I'm making is a multi-format archiver, so I want LZMA files to be recognized even if they don't have a .lzma or .tlz extension.

      First, I can use:
      > byte 0 < 5 * 5 * 9 (LZMA properties)
      > bytes 1 - 4 = UInt32 (LZMA dictionary size) = (1 << N) or ( 3 << N ) 
      > byte 13 = 0 always.

      Next... what I have thought of is decompressing the first 64 KB (IN_BUF_SIZE). What is the probability of mis-identifying a non-LZMA file if the first 64 KB was decompressed without error?

      Another solution I have thought of is:
      File size < 10 MB: decompress everything; if there is no error, it is an LZMA file
      File size > 10 MB: check the extension...
      Yes, I know people do not use a different extension for large files...

       
      • leafmoon

        leafmoon - 2008-08-25

        Oops, OUT_BUF_SIZE, not IN_BUF_SIZE, because I use the LZMA SDK with a gzip-like wrapper (internal buffering).

        Ah... I admit that a 64 KB file can be compressed to 10 KB or less. So lzmaread(h, 65536) is not enough?
        How long in output-size should I decompress in order not to mis-identify?

         
        • Igor Pavlov

          Igor Pavlov - 2008-08-25

          - How long in output-size should I decompress in order not to mis-identify?

          I suppose 1 KB COMPRESSED / 64 KB uncompressed (what is first reached) is enough.

          - What is the probability to mis-identify non-LZMA files if first 64KB was decompressed
          without error?

          I suppose it has very low probability, if there are non-zero bytes in compressed stream.
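
The "1 KB compressed / 64 KB uncompressed" probe can be sketched with Python's standard `lzma` module. Note the assumptions: that module is backed by liblzma rather than the LZMA SDK, so its error behaviour may differ from LzmaUtil.c, and the names `trial_decode`, `max_in` and `max_out` are made up for this example.

```python
import lzma

def trial_decode(head: bytes, max_in: int = 1024, max_out: int = 64 * 1024) -> bool:
    """Return False if decoding the start of a candidate .lzma stream raises an error.

    At most max_in compressed bytes are fed in and at most max_out
    decompressed bytes are produced, whichever limit is reached first.
    """
    decoder = lzma.LZMADecompressor(format=lzma.FORMAT_ALONE)
    try:
        decoder.decompress(head[:max_in], max_length=max_out)
    except lzma.LZMAError:
        return False
    # A truncated but otherwise valid prefix does not raise; it simply
    # leaves the decoder waiting for more input.
    return True
```

Usage would be something like `trial_decode(open(path, "rb").read(1024))`. A `True` result only means no error was seen in the probed prefix; it is not proof that the whole file is valid.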

           
    • leafmoon

      leafmoon - 2008-08-27

      Excuse me, but the maintainer of the software I'm going to patch wants to know the reasoning behind these:
      >I suppose 1 KB COMPRESSED / 64 KB uncompressed (what is first reached) is enough.
      >I suppose it has very low probability, if there are non-zero bytes in compressed stream.

       
      • Igor Pavlov

        Igor Pavlov - 2008-08-27

        A sequence of ZERO bytes starting from byte 13:
        00 00 00 00 00 00 00 00 00 ...
        looks like a normal LZMA stream.
        The LZMA decoder doesn't show any error.

         
    • leafmoon

      leafmoon - 2008-08-27

      How do you detect a decompression error? Usually decompression goes on (and raises a CRC error).
      And what are the first ZEROs (I can check for a non-ZERO byte, though)?

       
      • Igor Pavlov

        Igor Pavlov - 2008-08-28

        You try to decompress 1 KB of compressed data. If all of those bytes are zeros, the LZMA decoder doesn't show an error, but in 99.999% of cases it means that it's not an LZMA stream (or it's a broken LZMA stream).
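
The all-zeros case can be screened out before calling the decoder. A minimal sketch, assuming the compressed data starts at byte 13 (right after the 13-byte header) and that the first 1 KB of it is what gets probed; the name `payload_has_nonzero` and its parameters are made up for illustration.

```python
def payload_has_nonzero(buf: bytes, header_len: int = 13, window: int = 1024) -> bool:
    """Return True if the first `window` bytes of compressed data contain a non-zero byte."""
    # Iterating over bytes yields integers, so any() is True as soon as one is non-zero.
    return any(buf[header_len:header_len + window])
```

A candidate for which this returns `False` would be rejected as "almost certainly not LZMA", even though a decoder might accept it without complaint.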

         
    • leafmoon

      leafmoon - 2008-08-28

      All right, then I'll check for the non-ZERO byte when opening.

      So what I want to know now is the reason behind:
      > I suppose it has very low probability, if there are non-zero bytes in compressed stream.

       
    • leafmoon

      leafmoon - 2008-09-01

      Sorry, but lzma.exe 4.60 did not show an error.

       
      • Igor Pavlov

        Igor Pavlov - 2008-09-02

        The next version will show an error for such files.
        In the C decoder you must use finishMode = LZMA_FINISH_END.

         
