7-Zip / Discussion / Open Discussion: How to identify LZMA files

leafmoon - 2008-08-25

Hello,

I'm writing a tar.lzma extractor using the LZMA SDK.
But I have a question: how can I identify LZMA files?
LZMA files do not have a signature like gzip or bzip2 files do, so identifying them is a little difficult, I think.

My current method is to decompress the first 64 KB (OUT_BUF_SIZE in LzmaUtil.c) and check whether the decompression fails.
Any advice is welcome.

- Igor Pavlov - 2008-08-25
  
  1) byte 0 < 5 * 5 * 9 (LZMA properties)
  2) bytes 1 - 4 = UInt32 (LZMA dictionary size) = (1 << N) in 99% cases (maybe more than 99%).
  3) bytes 5 - 12 = Int64 = -1 or uncompressed size (can not be too big in most cases).
  4) byte 13 = 0 always.
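
The four header checks above can be sketched roughly as follows. This is an illustration in Python rather than SDK code: the function name `looks_like_lzma_header` and the `MAX_PLAUSIBLE_SIZE` cutoff are assumptions made for the example (the post only says the size "can not be too big"), and the strict power-of-two test will reject the rare valid files that use another dictionary size.

```python
import struct

# Assumed cutoff for a "not too big" uncompressed size (hypothetical: 1 TiB).
MAX_PLAUSIBLE_SIZE = 1 << 40


def looks_like_lzma_header(data: bytes) -> bool:
    """Apply the four header heuristics to the first 14 bytes of a file."""
    if len(data) < 14:
        return False
    props = data[0]
    (dict_size,) = struct.unpack_from("<I", data, 1)    # bytes 1-4, little-endian UInt32
    (unpack_size,) = struct.unpack_from("<q", data, 5)  # bytes 5-12, little-endian Int64
    # 1) the properties byte must be below 5 * 5 * 9 = 225
    if props >= 5 * 5 * 9:
        return False
    # 2) the dictionary size is (1 << N) in almost all files
    if dict_size == 0 or dict_size & (dict_size - 1):
        return False
    # 3) the size field is -1 (unknown) or a plausible uncompressed size
    if unpack_size != -1 and not 0 <= unpack_size <= MAX_PLAUSIBLE_SIZE:
        return False
    # 4) the first byte of the compressed stream is always 0
    return data[13] == 0
```

A file that passes is only a candidate; the checks are cheap enough to run before any trial decompression.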
  
  - spaglia - 2008-08-25
    
    Igor,
    
    I noticed that the first byte output by CEncoder::Code() is always 0. Is this by design? What does 0 mean?
    
- leafmoon - 2008-08-25
  
  Then how does 7z.exe identify LZMA files?
  "99%" is not complete...
  
  - Igor Pavlov - 2008-08-25
    
    7-Zip checks dictSize: (1 << N) and (3 << N) are accepted.
    The .lzma format allows any dictSize, but my original lzma program always uses (1 << N).
    I suppose you can also compare dictSize with (1 << N) and (3 << N).
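
A small sketch of that dictSize comparison, written in Python for brevity. The helper name `is_plausible_dict_size` is made up for this example, and the check is a heuristic, not 7-Zip's actual code.

```python
def is_plausible_dict_size(dict_size: int) -> bool:
    """Return True if dict_size is (1 << N) or (3 << N) and fits in a UInt32."""
    if not 0 < dict_size <= 0xFFFFFFFF:
        return False
    return any(dict_size in (1 << n, 3 << n) for n in range(32))
```

For example, 8 MiB (`1 << 23`) and 12 MiB (`3 << 22`) pass, while 5 MiB does not.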
    
- leafmoon - 2008-08-25
  
  Your method can sometimes mis-identify non-LZMA files as LZMA, can't it?
  
  - Igor Pavlov - 2008-08-25
    
    Maybe, but the probability of that is low.
    
- Igor Pavlov - 2008-08-25
  
  BTW, why don't you use the .lzma file extension for identification?
  
  - Ricardo Amanda - 2020-12-30
    
    yes, it looks like the best option, thanks and keep up the good work
    
- leafmoon - 2008-08-25
  
  Umm... what I'm making is a multi-format archiver, so I want LZMA files to be recognized even if they don't have a .lzma or .tlz extension.
  
  First, I can use:
  > byte 0 < 5 * 5 * 9 (LZMA properties)
  > bytes 1 - 4 = UInt32 (LZMA dictionary size) = (1 << N) or (3 << N)
  > byte 13 = 0 always.
  
  Next, my idea is to decompress the first 64 KB (IN_BUF_SIZE). What is the probability of misidentifying a non-LZMA file if the first 64 KB was decompressed without error?
  
  Another solution I have thought of is:
  File size < 10 MB: decompress everything; if there is no error, it is an LZMA file
  File size > 10 MB: check the extension...
  Yes, I know people do not use a different extension for large files...
  
  - leafmoon - 2008-08-25
    
    Oops, OUT_BUF_SIZE, not IN_BUF_SIZE, because I use the LZMA SDK with a gzip-like wrapper (internal buffering).
    
    Ah... I admit that a 64 KB file can be compressed to 10 KB or even less. So lzmaread(h, 65536) is not enough?
    How long in output-size should I decompress in order not to mis-identify?
    
    - Igor Pavlov - 2008-08-25
      
      > How long in output-size should I decompress in order not to mis-identify?
      
      I suppose 1 KB COMPRESSED / 64 KB uncompressed (what is first reached) is enough.
      
      > What is the probability to mis-identify non-LZMA files if first 64KB was decompressed without error?
      
      I suppose it has very low probability, if there are non-zero bytes in compressed stream.
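
A rough sketch of such a bounded trial decode, using Python's standard `lzma` module in place of the LZMA SDK (a substitution made only so the example is self-contained). The function name `trial_decode` and its parameters are assumptions for illustration; the limits are the 1 KB compressed / 64 KB uncompressed suggested above.

```python
import lzma

HEADER_LEN = 13  # properties byte + dictionary size + uncompressed size


def trial_decode(data: bytes, max_in: int = 1024, max_out: int = 65536) -> bool:
    """Return True if the start of a .lzma candidate decodes without error.

    At most max_in compressed bytes are fed in, and at most max_out
    decompressed bytes are produced, whichever limit is reached first.
    """
    dec = lzma.LZMADecompressor(format=lzma.FORMAT_ALONE)
    try:
        dec.decompress(data[:HEADER_LEN + max_in], max_length=max_out)
    except lzma.LZMAError:
        return False  # bad header or corrupt stream
    return True
```

Running out of input is not an error here, so a truncated but otherwise valid prefix still passes. Note that liblzma applies its own header validation, which may differ from the SDK's.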
      
- leafmoon - 2008-08-27
  
  Excuse me, but the maintainer of the software I'm going to patch wants to know the reasoning behind these statements:
  >I suppose 1 KB COMPRESSED / 64 KB uncompressed (what is first reached) is enough.
  >I suppose it has very low probability, if there are non-zero bytes in compressed stream.
  
  - Igor Pavlov - 2008-08-27
    
    A sequence of ZERO bytes starting from byte 13:
    00 00 00 00 00 00 00 00 00 ...
    looks like a normal LZMA stream.
    The LZMA decoder doesn't report any error for it.
    
- leafmoon - 2008-08-27
  
  How do you detect a decompression error? Usually decompression just goes on (and raises a CRC error).
  And what do you mean by the leading ZEROs (I can check for a non-ZERO byte, though)?
  
  - Igor Pavlov - 2008-08-28
    
    You try to decompress 1 KB of compressed data. If all of those bytes are zeros, the LZMA decoder doesn't report an error, but in 99.999% of cases it means that it's not an LZMA stream (or it's a broken LZMA stream).
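
The all-zeros case can be screened out before decoding. A minimal sketch in Python; the name `has_nonzero_payload` and the 13-byte header offset as a default argument are choices made for this example.

```python
def has_nonzero_payload(data: bytes, header_len: int = 13, probe: int = 1024) -> bool:
    """Return True if the first `probe` compressed bytes contain a non-zero byte.

    Byte 13 (the first stream byte) is always 0 in a valid file, so it
    never counts as evidence either way.
    """
    payload = data[header_len:header_len + probe]
    return any(payload)
```

A candidate whose probed payload is all zeros (or empty) would be rejected even though the decoder itself reports no error.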
    
- leafmoon - 2008-08-28
  
  All right, then I'll check for the non-ZERO byte when opening.
  
  So what I want to know now is the reasoning behind this:
  > I suppose it has very low probability, if there are non-zero bytes in compressed stream.
  
- leafmoon - 2008-09-01
  
  Sorry, but lzma.exe 4.60 did not show an error.
  
  - Igor Pavlov - 2008-09-02
    
    The next version will show an error for such files.
    In the C decoder you must use finishMode = LZMA_FINISH_END.
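
The same idea (accept a stream only if the decoder actually reaches its end) can be sketched with Python's standard `lzma` module. This is a loose analogue, not the SDK's LZMA_FINISH_END mode: it relies on the `eof` attribute of `LZMADecompressor`, and the function name `decodes_to_clean_end` is invented for the example.

```python
import lzma


def decodes_to_clean_end(data: bytes) -> bool:
    """Return True only if the whole .lzma stream decodes and reaches its end.

    The entire output is held in memory, so this suits small files only.
    """
    dec = lzma.LZMADecompressor(format=lzma.FORMAT_ALONE)
    try:
        dec.decompress(data)
    except lzma.LZMAError:
        return False
    # eof becomes True only once the decoder has seen the end of the stream.
    return dec.eof
```

A truncated file decodes without raising an error but never sets `eof`, so it is rejected.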
    