Hello,
I'm making a tar.lzma extractor using the LZMA SDK.
But I have a question: how can I identify LZMA files?
LZMA files do not have a signature like gzip or bzip2 files do, so identifying them is a little difficult, I think.
My current method is to decompress the first 64 KB (OUT_BUF_SIZE in LzmaUtil.c) and check whether the decompression failed...
Any advice is welcome.
1) byte 0 < 5 * 5 * 9 (LZMA properties)
2) bytes 1 - 4 = UInt32 (LZMA dictionary size) = (1 << N) in 99% of cases (maybe more than 99%).
3) bytes 5 - 12 = Int64 = -1 or uncompressed size (cannot be too big in most cases).
4) byte 13 = 0 always.
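The four checks above can be sketched in C roughly as follows. This is only an illustration, not code from the LZMA SDK: the function name `lzma_header_plausible` and the `MAX_PLAUSIBLE_SIZE` cap (256 GiB here) are made up for the example. It also treats check 2 as strict, so it would reject the rare valid file whose dictionary size is not a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: anything above 256 GiB is "too big". */
#define MAX_PLAUSIBLE_SIZE ((uint64_t)1 << 38)

/* Hypothetical helper: returns 1 if buf could be the start of a .lzma
 * file according to checks 1-4, and 0 otherwise. */
static int lzma_header_plausible(const unsigned char *buf, size_t len)
{
    uint32_t dict = 0;
    uint64_t size = 0;
    int i;

    if (len < 14)                     /* need bytes 0..13 */
        return 0;

    /* 1) properties byte = (pb * 5 + lp) * 9 + lc, so it is < 5 * 5 * 9 */
    if (buf[0] >= 5 * 5 * 9)
        return 0;

    /* 2) bytes 1..4: little-endian dictionary size, expected (1 << N) */
    for (i = 0; i < 4; i++)
        dict |= (uint32_t)buf[1 + i] << (8 * i);
    if (dict == 0 || (dict & (dict - 1)) != 0)
        return 0;

    /* 3) bytes 5..12: little-endian uncompressed size, or -1 if unknown */
    for (i = 0; i < 8; i++)
        size |= (uint64_t)buf[5 + i] << (8 * i);
    if (size != UINT64_MAX && size > MAX_PLAUSIBLE_SIZE)
        return 0;

    /* 4) byte 13 is always 0 */
    return buf[13] == 0;
}
```

A header that passes only says the file might be LZMA; it does not replace a trial decompression.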
Igor,
I noticed that the first byte output by CEncoder::Code() is always 0. Is this by design? What does 0 mean?
Then how does 7z.exe identify LZMA files?
"99%" is not complete...
7-Zip checks dictSize: only (1 << N) and (3 << N) are accepted.
The .lzma format allows any dictSize, but my original lzma program always uses (1 << N).
I suppose you can also compare dictSize with (1 << N) and (3 << N).
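As a rough sketch of that comparison (not 7-Zip's actual code; the function name `dict_size_plausible` is invented for this example), the dictionary size read from bytes 1 - 4 could be tested like this:

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if dict equals (1 << N) or (3 << N)
 * for some N that fits in 32 bits, and 0 otherwise. */
static int dict_size_plausible(uint32_t dict)
{
    unsigned n;

    for (n = 0; n < 32; n++) {
        if (dict == ((uint32_t)1 << n))
            return 1;
        /* (3 << 31) does not fit in 32 bits, so stop at N = 30 */
        if (n < 31 && dict == ((uint32_t)3 << n))
            return 1;
    }
    return 0;
}
```

For example, 8 MiB (1 << 23) and 12 MiB (3 << 22) both pass, while 0 and 5 are rejected.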
Your method can sometimes mis-identify non-LZMA files as LZMA, can't it?
Maybe, but the probability of that is low.
BTW, why don't you use the .lzma file extension for identification?
yes, it looks like the best option, thanks and keep up the good work
Umm... what I'm making is a multi-format archiver, so I want LZMA files to be recognized even if they don't have a .lzma or .tlz extension.
First, I can use:
> byte 0 < 5 * 5 * 9 (LZMA properties)
> bytes 1 - 4 = UInt32 (LZMA dictionary size) = (1 << N) or ( 3 << N )
> byte 13 = 0 always.
Next, I am thinking of decompressing the first 64 KB (IN_BUF_SIZE). What is the probability of misidentifying a non-LZMA file if the first 64 KB decompresses without error?
Another solution I have thought of:
Filesize < 10MB: decompress everything; if there is no error, it is an LZMA file
Filesize > 10MB: check the extension...
Yes, I know people do not use a different extension for large files...
Oops, I meant OUT_BUF_SIZE, not IN_BUF_SIZE, because I use the LZMA SDK with a gzip-like wrapper (internal buffering).
Ah... I admit that a 64 KB file can be compressed into 10 KB or even less. So lzmaread(h, 65536) is not enough?
How much output should I decompress to avoid misidentification?
- How much output should I decompress to avoid misidentification?
I suppose 1 KB COMPRESSED / 64 KB uncompressed (whichever is reached first) is enough.
- What is the probability of misidentifying a non-LZMA file if the first 64 KB decompresses without error?
I suppose it has very low probability, if there are non-zero bytes in compressed stream.
Excuse me, but the maintainer of the software I'm going to patch wants to know the reasoning behind these:
>I suppose 1 KB COMPRESSED / 64 KB uncompressed (whichever is reached first) is enough.
>I suppose it has very low probability, if there are non-zero bytes in compressed stream.
A sequence of ZERO bytes starting from byte 13:
00 00 00 00 00 00 00 00 00 ...
looks like a normal LZMA stream.
The LZMA decoder doesn't show any error.
How do you detect a decompression error? Usually decompression goes on (and raises a CRC error).
And what do you mean by the initial ZEROs? (I can check for a non-ZERO byte, though.)
You try to decompress 1 KB of compressed data. If all of those bytes are zeros, the LZMA decoder doesn't show an error, but in 99.999% of cases it means that it's not an LZMA stream (or it's a broken LZMA stream).
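A minimal sketch of that zero-byte check in C, assuming the file's first bytes are already in memory. The helper name `lzma_payload_all_zero` and the `LZMA_PROBE_BYTES` constant are invented for this example; the 1 KB window and the starting offset of byte 13 come from the posts above.

```c
#include <stddef.h>

#define LZMA_PROBE_BYTES 1024   /* 1 KB of compressed data, as suggested */

/* Hypothetical helper: returns 1 if every compressed byte in the probe
 * window (starting at byte 13) is zero, which suggests that the data is
 * not a real LZMA stream. Returns 0 as soon as a non-zero byte is found. */
static int lzma_payload_all_zero(const unsigned char *buf, size_t len)
{
    size_t end, i;

    if (len <= 13)
        return 1;               /* nothing to inspect: treat as suspicious */

    end = len < 13 + LZMA_PROBE_BYTES ? len : 13 + LZMA_PROBE_BYTES;
    for (i = 13; i < end; i++) {
        if (buf[i] != 0)
            return 0;
    }
    return 1;
}
```

A caller would reject the file when this returns 1, and otherwise go on to the trial decompression.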
All right, then I'll check for the non-ZERO byte when opening.
So what I want to know now is the reasoning behind:
>I suppose it has very low probability, if there are non-zero bytes in compressed stream.
Sorry, but lzma.exe 4.60 did not show an error.
The next version will show an error for such files.
In the C decoder you must use finishMode = LZMA_FINISH_END.