Free project idea: The File Decoder

Screwtape

Posted on 19-01-30, 11:29

Sometimes I get a file that's in a weird format that I don't understand, and for whatever reason nobody's published file-format docs, or I can't find them for some reason. If I really want to understand the format, step one is to open it up in a hex editor and poke around.

Of course, a hex editor isn't enough on its own. Binary file formats have all kinds of tropes, like fields that are the offset to the start of some array of values, or the count of values in the array, fields that change meaning depending on flags in other fields, that kind of thing. So as well as a hex editor, I need to open up a text editor, or even a real physical notebook, to jot down hypotheses about which bytes are themselves data, and which bytes are just descriptions of where to find data.

Once I have notes, it doesn't stop there. Usually if there's one file in a given format, there's several, and it's a good test of understanding to check whether the same rules that apply to one file apply to others. Manually checking each file is a ton of work, though, so I really need to write an extraction tool based on my understanding, and try it on each input. Writing a tool takes time too, though, especially since so few languages come with convenient ways to describe packed and padded binary data, and even those that do require me to come up with sensible data structures with sensible field names.
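To give a feel for what one of these throwaway tools looks like, here's a minimal sketch in Python using the standard `struct` module. The layout is entirely made up for illustration (a 16-bit little-endian offset, a 16-bit little-endian count, then that many 8-byte records), and `extract_records` and `fits` are hypothetical names, not part of any real tool.

```python
import struct

# Assumed layout, for illustration only:
# u16 little-endian offset, u16 little-endian record count,
# then `count` fixed-size records starting at `offset`.
HEADER = struct.Struct("<HH")
RECORD_SIZE = 8  # assumed record size in bytes


def extract_records(data: bytes) -> list[bytes]:
    """Slice out the records described by the header."""
    offset, count = HEADER.unpack_from(data, 0)
    end = offset + count * RECORD_SIZE
    if end > len(data):
        raise ValueError("record block runs past the end of the file")
    return [data[i:i + RECORD_SIZE] for i in range(offset, end, RECORD_SIZE)]


def fits(data: bytes) -> bool:
    """Report whether a file is at least consistent with the guessed layout."""
    try:
        extract_records(data)
    except (ValueError, struct.error):
        return False
    return True
```

Running `fits` over every file in a directory is a cheap way to see whether the hypothesis holds up, but every new format means rewriting the header, the sizes and the field names by hand.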

Of course the *ultimate* goal is to decode the original file format, and get whatever data it contains out into a more usable format. Once I've got a bespoke extraction tool, it's not too hard to make it write everything out into a sensible file format... but it's still a bunch of work that needs to be done nearly from scratch every time.

To make my life easier, somebody needs to build a tool that's like a hex editor, but lets you interactively describe the file format of the file you're looking at, and save that file-format description. Once you have a file-format description, you can apply it to other files to see how well it fits, browse the decoded data, and export it to JSON.
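As a rough sketch of the "saved description" part, the description could be plain data rather than code. Here it's assumed to be a list of (field name, `struct` format code) pairs read in order from the start of the file; the field names and the `decode`/`to_json` helpers are hypothetical, and a real tool would need a much richer vocabulary than this.

```python
import json
import struct

# Hypothetical description: (field name, struct format code), read in order.
DESCRIPTION = [
    ("block offset", "<H"),  # unsigned 16-bit little-endian integer
    ("record count", "<H"),
]


def decode(description, data: bytes) -> dict:
    """Apply a description to raw bytes and return the named values."""
    result = {}
    pos = 0
    for name, fmt in description:
        # unpack_from raises struct.error if the file is too short,
        # which is one way a description can fail to fit a file.
        (value,) = struct.unpack_from(fmt, data, pos)
        result[name] = value
        pos += struct.calcsize(fmt)
    return result


def to_json(description, data: bytes) -> str:
    """Export the decoded fields as JSON text."""
    return json.dumps(decode(description, data), indent=2)
```

Because the description is just data, the same list can be saved, loaded and applied to any number of files without touching the decoding code.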

I imagine it working something like this:

- when it first starts up, it looks just like a regular hex editor, with an undifferentiated stream of hex bytes on the left and ASCII on the right.
- you can select some bytes in the stream and format them as, say, "unsigned 16-bit little-endian integer"
- Now the regular hex-dump goes up to the bytes you selected, and stops. On the next line is the integer you marked, in decimal, followed by "(unnamed 1)". The line after that, the hex-dump resumes from the left-hand column.
- further down, you see some bytes that look like they might be a repeating pattern of some kind. You select the pattern, and it turns out the starting offset of the pattern is equal to the integer you selected earlier, so the integer is highlighted. You format the selection as "Block of 125 bytes at offset [(unnamed 1)]" (or however many bytes it is). It still looks like a hex-dump, but it's on its own lines, instead of sharing with the bytes before and after.
- the block has a repeating pattern, but it's hard to see exactly what it is, since the hex-dump always wraps at 32 bytes, and the pattern's length is not an even multiple of 32. Inside the block, you format the contents as "Array of 27-byte records".
- now the hex dump starts a new line every 27 bytes, regardless of the width of the screen. That's not the right width either, though, so you try a few different record widths until you discover that 36 makes the patterns in each record line up.
- It turns out there's exactly 52 records in the array, so you select the first chunk of the file and search for byte patterns that could be interpreted as 52. It turns out there's one right next to the array offset integer you discovered earlier, so you mark that as an integer, name it "record count", and amend the block to be "Block of [record count] * 36 bytes at offset [(unnamed 1)]"
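Two of the steps above are mechanical enough to sketch in Python: searching for byte patterns that could be a given integer, and scoring a candidate record width. Both functions are hypothetical helpers, and the scoring rule (the fraction of bytes that equal the byte one record later) is only one assumed way of measuring whether patterns "line up".

```python
import struct


def find_int_candidates(data: bytes, value: int) -> list[tuple[int, str]]:
    """Return (offset, format) pairs where `value` appears as a
    16- or 32-bit integer in either byte order.
    `value` must fit in 16 bits for the 16-bit formats to pack."""
    hits = []
    for fmt in ("<H", ">H", "<I", ">I"):
        needle = struct.pack(fmt, value)
        start = data.find(needle)
        while start != -1:
            hits.append((start, fmt))
            start = data.find(needle, start + 1)
    return sorted(hits)


def width_score(block: bytes, width: int) -> float:
    """Fraction of bytes that match the byte `width` positions later.
    A higher score suggests records of that width repeat a pattern."""
    pairs = len(block) - width
    if pairs <= 0:
        return 0.0
    same = sum(block[i] == block[i + width] for i in range(pairs))
    return same / pairs
```

Something like `max(range(2, 65), key=lambda w: width_score(block, w))` would suggest a width to try first, though multiples of the true width can score just as well, so a human still has to look at the result.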

It would be a lot of work to make something general enough to handle *every* binary file format; something like beat with variable-length integers would be a nightmare. But in my experience there's a *lot* of file-formats that just use fixed-size and length-prefixed blocks, and a tool that could help mark them up in a machine-readable fashion would make this stuff a *lot* easier.

I've got enough weird and goofy projects on my todo list as it is, but maybe somebody else here is looking for something to work on, or has an even better idea to improve it. Go nuts.

The ending of the words is ALMSIVI.

neologix

Posted on 19-01-31, 01:00

There are a couple of hex editors that let you describe file/struct formats manually and/or highlight a set of bytes and see what the values are in various different data types. I know that on the Mac, Hex Fiend implemented it after I suggested it, and 010 Editor and Synalyze have it as well. I think a couple of Windows hex editors have it, but I don't often find myself doing hex editing on Windows. I have absolutely no idea what Linux and Unix have in this regard.

Screwtape

Posted on 19-01-31, 07:26

I should have known there'd already be tools like this. I did look around a bit, but all the editors I could find were pretty generic traditional hex editors, or very specialised for x86/ARM binaries, rather than generic binary data files.

That Hex Fiend PR looks pretty nice, except that if I'm reading the PR correctly, they're using a Turing-complete language to parse rather than something purely declarative, which seems a shame. On the other hand, the Synalyze screenshot looks like they've got a proper declarative grammar with a full GUI editor, which makes it look a lot scarier than it probably is.
