Bukys Software

Specialized Practice in File Recovery - Case Study

The Problem

An organization compresses its Apache web server access logs with the "gzip" program and archives them to another system using an FTP client for file transfer. An error in the script caused the transfers to be performed in "text" mode instead of "binary" mode, corrupting all of the compressed archives for a few months before the error was noticed.

Estimating the difficulty of repair

Approximately 1 in 256 bytes is known to be corrupted, and the corruption is known to occur only in bytes with the value '\012'. So the byte error rate is 1/256 (0.39% of input). A smashed byte cannot be told apart from a byte that legitimately holds the same value, and such bytes are equally common, so 2/256 bytes (0.78% of input) are suspect. But since only three bits per smashed byte are affected, the bit error rate is only 3/(256*8): 0.15% of bits are bad, and 0.29% are suspect.
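The arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes the damage is a swap between the carriage-return value '\015' and the line-feed value '\012'. The text names only '\012', so the second value is an assumption, chosen because it agrees with the three-bit figure.

```python
CR, LF = 0o015, 0o012   # assumption: the damage swaps these two byte values

# Number of bit positions in which the two values differ.
bits_changed = bin(CR ^ LF).count("1")

byte_error_rate = 1 / 256                        # smashed bytes
suspect_byte_rate = 2 / 256                      # smashed plus legitimate look-alikes
bit_error_rate = bits_changed / (256 * 8)        # bad bits
suspect_bit_rate = 2 * bits_changed / (256 * 8)  # suspect bits

print(bits_changed)                  # 3
print(f"{byte_error_rate:.2%}")      # 0.39%
print(f"{suspect_byte_rate:.2%}")    # 0.78%
print(f"{bit_error_rate:.2%}")       # 0.15%
print(f"{suspect_bit_rate:.2%}")     # 0.29%
```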

The typical log file in this set is a million lines, 250 megabytes uncompressed, and 20 megabytes compressed. So the typical damaged compressed file contains 156250 suspect bytes. A brute-force search through all 2^156250 possible repairs is not feasible; it would take longer than the known age of the universe.
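To give a feel for the size of that search space, the sketch below recomputes the suspect-byte count and the number of decimal digits in the count of candidate repairs. The test rate of one billion candidates per second is a made-up figure used only for scale; for comparison, the universe is on the order of 10^10 years old.

```python
import math

compressed_bytes = 20_000_000                 # "20 megabytes compressed"
suspect_bytes = compressed_bytes * 2 // 256   # 2 in 256 bytes are suspect

# Each suspect byte is either damaged or not, giving 2**suspect_bytes
# candidate repairs. Work with logarithms instead of building the number.
log10_candidates = suspect_bytes * math.log10(2)
digits = math.floor(log10_candidates) + 1

# Hypothetical test rate, chosen only for illustration.
tests_per_second = 1e9
seconds_per_year = 365.25 * 24 * 3600
log10_years = log10_candidates - math.log10(tests_per_second * seconds_per_year)

print(suspect_bytes)   # 156250
print(digits)          # 47036
print(f"roughly 10^{log10_years:.0f} years of brute-force testing")
```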

Using domain knowledge to reduce the search space

An error in the compressed input disrupts the decompression process for all subsequent bytes. Attempting to decompress a damaged file produces data that is obviously bad, to the human eye or to a web log analyzer, from the first line of output.

For example, uncorrupted output that should look like this:

    - - [04/May/2004:00:00:00 -0400] "GET /learning HTTP/1.1" 301 324 - "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7b) Gecko/20040311 Firefox/0.8" GET "/learning" "HTTP/1.1"

comes out like this:

    - - [04/May/2004:00:00:00 -0400] "GET /learning HTTP/1.1" 301 324 - "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7b) Gecko/20040311 Firefox/0.8" GET "/learning" "HTTP/1.1":1512^@^@^@ing HTTP/1.1" 30[f110.19 - - [04/May/2004:00:00:00 -0400] "GET /learning HTTP/1.1" 301 324 - "M5s8^@^@^@_r^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@.7b) Gecko/20040311 Firefox/0.8"TT0:00:00 -0400] "GET /learning HTTP/1.1" 301 324 - "M5s8^@^@^@_r^@^@^@^@^@^@^@^@^@^@^@6^@^@^@^@^@^@^@^@^@^@^@^@^@.jpg0:00 -0400] "GE2728 http://www.exam^@^@^@.com/2004:00:00ing HTTP/1.1" 301 324 - "M5s8^@^@^@_r^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@.7b) Gecko/200403_r^@^@^@^@^@^@^@^@^@^@^@6^@^@^@^@^@^@^@^@^@^@^@^@^@.jpgx/0.8"TT0:00:007-048.152.5T /learning HTTP/1.1" 301 324 - "M5s8^@^@^@0:00 -0400] "GE7561arniEchop6^@^@^@5] "Go/200403x/0.8"TT0:00:00 -0400] "GET /learning HTTP/1.1" 301 324 - "M5s8^@^@^@_r^@^@^@^@^@^@^@^@^@^@^@6^@^@^@tit^@^@^@^@^@^@.jpg0:00 -0400] "GE18380 http://www.exam^@^@^@.com/2004:00:00ing ...

The fact that the decompressed output is recognizably bad so quickly is cause for hope -- a search for the correct answer can identify wrong answers quickly.
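One way to make "recognizably bad" mechanical is to decompress a candidate file a little at a time and stop at the first output line that does not look like a log entry. The following is a minimal sketch of that idea, not the tool used in this case: the function name, the loose LOG_LINE pattern, the synthetic sample data, and the simulated damage (flipping the low three bits of one byte) are all assumptions made for illustration.

```python
import gzip
import re
import zlib

# Hypothetical, deliberately loose pattern for one access-log line.
LOG_LINE = re.compile(rb'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+')


def first_bad_offset(gz_bytes, chunk_size=256):
    """Return the decompressed-output offset of the first line that fails
    validation, or None if the whole stream decodes and validates."""
    inflater = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect gzip framing
    pending = b""   # bytes of the current, not yet complete, output line
    offset = 0      # output offset at which `pending` starts
    for start in range(0, len(gz_bytes), chunk_size):
        try:
            pending += inflater.decompress(gz_bytes[start:start + chunk_size])
        except zlib.error:
            return offset          # the decoder itself rejected the stream
        *lines, pending = pending.split(b"\n")
        for line in lines:
            if not LOG_LINE.match(line):
                return offset      # output stopped looking like a log line
            offset += len(line) + 1
    if pending or not inflater.eof:
        return offset              # partial last line or truncated stream
    return None


def sample_log(n_lines=200):
    """Build synthetic log lines (made-up hosts and paths) for the demo."""
    lines = [
        f'host{i % 17} - - [04/May/2004:00:{i // 60:02d}:{i % 60:02d} -0400] '
        f'"GET /page/{i * 7919 % 1000} HTTP/1.1" 200 {1000 + i * 37}'
        for i in range(n_lines)
    ]
    return ("\n".join(lines) + "\n").encode("ascii")


good = gzip.compress(sample_log())
bad = bytearray(good)
bad[10] ^= 0x07   # simulated damage to the first byte after the gzip header

print(first_bad_offset(good))        # None: every line validated
print(first_bad_offset(bytes(bad)))  # an offset: damage was noticed
```

A repair search could call a check like this on each candidate and abandon the candidate as soon as an offset comes back, rather than decompressing the whole file. A stricter line pattern would reject wrong candidates sooner, at the cost of rejecting unusual but legitimate log lines.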


Ultimately, several techniques were combined to successfully extract reasonable data from these files:

These techniques identify 75% of the necessary repairs with certainty, and the remainder are explored highest-probability-first, so that plausible reconstructions are identified immediately. For this client, these first reconstructions satisfied their requirements.
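The source does not describe how the remaining candidates were ranked, so the following is only one possible sketch of a highest-probability-first exploration. It assumes each unresolved suspect byte has an independent, already estimated probability of being damaged; the function name and the example probabilities are hypothetical. Candidates are produced lazily with a priority queue, so a caller can stop at the first one that decompresses plausibly.

```python
import heapq
import math


def repairs_by_probability(p_corrupt):
    """Yield one tuple of booleans per candidate repair, most probable first.

    p_corrupt[i] is the (assumed, independent) probability that suspect
    byte i is damaged; True in a result means "treat byte i as damaged".
    """
    if not all(0.0 < p < 1.0 for p in p_corrupt):
        raise ValueError("settle certain bytes first; need 0 < p < 1")
    n = len(p_corrupt)
    likely = [p >= 0.5 for p in p_corrupt]
    # Log-odds penalty for overriding the likelier choice at each position.
    penalty = [abs(math.log(p) - math.log(1.0 - p)) for p in p_corrupt]
    order = sorted(range(n), key=penalty.__getitem__)   # cheapest first
    cost = [penalty[i] for i in order]

    def build(flips):
        choice = list(likely)
        for k in flips:
            choice[order[k]] = not choice[order[k]]
        return tuple(choice)

    yield build(())                      # the single most probable repair
    heap = [(cost[0], (0,))] if n else []
    while heap:
        total, flips = heapq.heappop(heap)
        yield build(flips)
        last = flips[-1]
        if last + 1 < n:
            step = cost[last + 1]
            # Either override one more position, or move the last override on.
            heapq.heappush(heap, (total + step, flips + (last + 1,)))
            heapq.heappush(heap, (total - cost[last] + step,
                                  flips[:-1] + (last + 1,)))


for repair in repairs_by_probability([0.9, 0.6, 0.2]):
    print(repair)
```

Bytes whose status is already certain are expected to be fixed beforehand and left out of the list, which keeps the queue small.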
