Extracting data from an unknown compressed file


November 2018




I have a binary file that I need to extract information from. I know that it is a compressed file, and its first 3 characters are "zip". I am fairly sure LZ substitution and/or Huffman coding is being used to compress it. However, the file does not follow any regular archive format such as .rar or .zip.

I have tried to read the file and worked out the following layout. (Part (C) is truncated to make it fit in the figure.)

The file has 3 parts:

Part (A) is the header, which is 16 bytes long and begins with an 8-byte signature with the following byte values: 122, 105, 112, 1, 0, 12, 0, 0. The first three are the ASCII characters "zip".

Part (B) is a list of addresses (271), each pointing to a particular offset in the file. I believe these are the starting points of the records in part (C).

Part (C) is the actual data.

The first address (716 in the figure) is the address of the first record (chunk) in part (C). Since part (C) begins exactly where part (B) ends, the first address is also the address where part (B) ends. And since the file ends when part (C) finishes, the last address in part (B) points to the end of the file, where the last record (chunk) in part (C) ends.
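The layout described above can be sketched in Python. This is only a sketch under stated assumptions: the entries of part (B) are taken to be unsigned 4-byte little-endian integers (a guess, not confirmed anywhere in the file description), and `read_offsets` and `split_chunks` are hypothetical helper names. Because the first address marks where part (B) ends, the number of entries is derived from it rather than hard-coded.

```python
import struct

HEADER_SIZE = 16  # size of part (A), as described above


def read_offsets(data, entry_size=4):
    """Read the address table of part (B).

    Assumption: entries are unsigned little-endian integers of
    `entry_size` bytes. The first entry points at the start of part (C),
    which is also the end of the table, so it tells us the table length.
    """
    fmt = {2: "<H", 4: "<I", 8: "<Q"}[entry_size]
    (first,) = struct.unpack_from(fmt, data, HEADER_SIZE)
    table_bytes = first - HEADER_SIZE
    if table_bytes < entry_size or table_bytes % entry_size:
        raise ValueError("first address does not fit the assumed entry size")
    count = table_bytes // entry_size
    return [
        struct.unpack_from(fmt, data, HEADER_SIZE + i * entry_size)[0]
        for i in range(count)
    ]


def split_chunks(data, offsets):
    """Slice part (C) into records; the last address marks the end of the file."""
    return [data[start:end] for start, end in zip(offsets, offsets[1:])]
```

If the derived entry count or the last address (which should equal the file size) looks wrong, the assumed entry size or byte order is probably wrong too, and is worth changing before anything else.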

To make it fit in the figure I had to cut the records (chunks) in part (C) short; they contain many more bytes. As you can see in the figure, the first record (chunk) is 472 bytes long.

The chunks are not equal in length. The length of the biggest record is stored in the header (bytes 13, 14, 15, 16) and is 955 (187, 3, 0, 0). I don't know why that would come in handy when reading a compressed file.
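The bytes 187, 3, 0, 0 give 955 when read as a 4-byte little-endian integer (187 + 3 × 256). A minimal sketch of reading the header on that basis follows; `parse_header` is a hypothetical helper, and the meaning of header bytes 9 to 12 is unknown, so they are left uninterpreted. One plausible use of the maximum length is sizing a single buffer large enough for any record, though that is a guess.

```python
import struct

HEADER_SIZE = 16  # size of part (A)


def parse_header(data):
    """Return (signature, max_record_len) from the 16-byte header.

    Assumption: the maximum record length is an unsigned 32-bit
    little-endian integer stored in the last four header bytes.
    """
    if len(data) < HEADER_SIZE:
        raise ValueError("file is shorter than the 16-byte header")
    signature = data[:8]
    if signature[:3] != b"zip":
        raise ValueError("signature does not start with 'zip'")
    (max_record_len,) = struct.unpack_from("<I", data, 12)
    return signature, max_record_len


# Header values quoted above; bytes 9 to 12 are unknown and zeroed here.
example = bytes([122, 105, 112, 1, 0, 12, 0, 0, 0, 0, 0, 0, 187, 3, 0, 0])
signature, max_record_len = parse_header(example)
print(max_record_len)  # 955
```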

As you can see, all records start with the same two bytes (120, 218). The ending bytes do not repeat from record to record; they look quite random.

I don't see anything resembling a Huffman tree or Huffman table at the end of the records, but so that you can take a look at the file, I have uploaded it here.

Any help extracting the compressed data from the file would be really appreciated.

Download Part (C)

1 answer


Each record in part (C) is a zlib-compressed chunk of the file.

The first two bytes, 120 and 218 (0x78 0xDA), are the zlib header, and the last 4 bytes should be the Adler-32 checksum.
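If that is right, each record can be passed to Python's standard `zlib` module as-is. The sketch below assumes every record is one complete zlib stream; `inflate_chunk` is a hypothetical helper name. `zlib.decompress` already verifies the trailing checksum, so the explicit comparison only makes the layout visible (the zlib format stores the Adler-32 value big-endian).

```python
import zlib


def inflate_chunk(chunk):
    """Decompress one record, assuming it is a complete zlib stream."""
    if chunk[:2] != bytes([120, 218]):
        raise ValueError("record does not start with the bytes 120, 218")
    plain = zlib.decompress(chunk)  # raises zlib.error on corrupt data
    # The last four bytes of a zlib stream hold the Adler-32 of the
    # uncompressed data, most significant byte first.
    stored = int.from_bytes(chunk[-4:], "big")
    if stored != zlib.adler32(plain):
        raise ValueError("Adler-32 mismatch")
    return plain


# Self-check with a synthetic record: compression level 9 produces
# the same 120, 218 header seen in the file.
sample = zlib.compress(b"example record " * 20, 9)
assert inflate_chunk(sample) == b"example record " * 20
```

If `zlib.decompress` raises an error on the real records, they may be split differently than assumed, or the records may be pieces of one continuous stream rather than independent ones.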