Runtime/bespoke format support like kaitai struct

Question

remy opened this issue 2 years ago · 13 comments

Not a bug report, but a suggestion that might make supporting any format useful (I tend to work with really weird bespoke binary formats).

Any thoughts on adding support for arbitrary formats? Something like "if the requested decode format isn't found, then try to resolve it as a file and load that".

i.e. fq --decode zxtap.go ".headers" game.tap - where zxtap.go follows your format file convention. Then I could make my own formats without having to rely on native support inside of fq (also solves those requests asking for other formats that have been raised already).
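
The "built-in first, then fall back to a file" lookup described above could be sketched roughly as follows. This is not fq's actual code: the builtin map, the Resolution type and the resolveFormat helper are all hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// builtin is a hypothetical stand-in for a registry of compiled-in formats.
var builtin = map[string]bool{"mp4": true, "png": true, "flac": true}

// Resolution describes where a decoder name was resolved to.
type Resolution struct {
	Kind string // "builtin" or "file"
	Name string // format name, or cleaned path to a format definition
}

// resolveFormat prefers a built-in format and only then treats name as a path.
// The exists callback is injected so the lookup can be tested without a disk.
func resolveFormat(name string, exists func(string) bool) (Resolution, error) {
	if builtin[name] {
		return Resolution{Kind: "builtin", Name: name}, nil
	}
	if exists(name) {
		return Resolution{Kind: "file", Name: filepath.Clean(name)}, nil
	}
	return Resolution{}, fmt.Errorf("unknown format %q: not built in and no such file", name)
}

// fileExists reports whether path names an existing regular file.
func fileExists(path string) bool {
	info, err := os.Stat(path)
	return err == nil && info.Mode().IsRegular()
}

func main() {
	for _, name := range []string{"mp4", "zxtap.ksy"} {
		r, err := resolveFormat(name, fileExists)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s (%s)\n", name, r.Name, r.Kind)
	}
}
```

Checking built-ins first keeps existing format names stable even if a file with the same name happens to sit in the working directory.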

Answer 1 · 2023-03-30T13:16:08.000Z

Hey! yes i would love to support some kind of runtime script or declarative format, and i have done some experiments that i can summarise. What kind of format would you prefer?

I've thought about/done these experiments:

Some kind of DSL using jq syntax, think decode({magic: str(4), len: u32}). Has some issues with how to reference a length field etc., maybe use jq bindings somehow? A downside is that few people are probably familiar with jq. Maybe a DSL for just very simple structures and types could be useful?
Kaitai struct. I have a very hacky prototype that used to be able to decode mp4 and png files. I'm currently rewriting it now that i know more about how it should actually work :) You currently use it like fq -d kaitai -o source=@file.ksy <query> file but, as you already mentioned, i would like fq -d file.ksy <query> file to work as well.
TCL, Go, Lua, ... would require including some sort of interpreter and defining a stable public API i guess?
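
To make the first option more concrete, here is a minimal sketch of what a declarative layout such as decode({magic: str(4), len: u32}) could boil down to. The FieldSpec and Field types and the decode function are hypothetical, and big-endian byte order for u32 is an assumption. It also shows the limitation mentioned above: a fixed list of specs has no way for one field's size to depend on another field's value.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// FieldSpec is one entry of a hypothetical declarative layout.
type FieldSpec struct {
	Name string
	Kind string // "str" or "u32"
	Size int    // byte length, used by "str" only
}

// Field is a decoded value plus the byte range it came from.
type Field struct {
	Name       string
	Value      any
	Start, End int // half-open byte range [Start, End)
}

// decode walks the specs in order and consumes bytes from buf.
func decode(spec []FieldSpec, buf []byte) ([]Field, error) {
	var out []Field
	pos := 0
	for _, s := range spec {
		n := s.Size
		if s.Kind == "u32" {
			n = 4
		}
		if n < 0 || pos+n > len(buf) {
			return nil, fmt.Errorf("field %q: need %d bytes at offset %d", s.Name, n, pos)
		}
		raw := buf[pos : pos+n]
		var v any
		switch s.Kind {
		case "str":
			v = string(raw)
		case "u32":
			v = binary.BigEndian.Uint32(raw) // byte order is an assumption
		default:
			return nil, fmt.Errorf("field %q: unknown kind %q", s.Name, s.Kind)
		}
		out = append(out, Field{s.Name, v, pos, pos + n})
		pos += n
	}
	return out, nil
}

func main() {
	spec := []FieldSpec{{"magic", "str", 4}, {"len", "u32", 0}}
	fields, err := decode(spec, []byte("RIFF\x00\x00\x00\x08"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, f := range fields {
		fmt.Printf("%s=%v bytes[%d:%d]\n", f.Name, f.Value, f.Start, f.End)
	}
}
```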

Think i would prefer to prioritise support for some already defined standard like kaitai, 010 templates, ImHex patterns etc.

Another thing i like about fq is that it has format decoders builtin, but i guess one could add infrastructure to make e.g. a .ksy file be embedded and behave as a "native" decoder somehow. My idea is that standard/well-known formats should be builtin if possible so they can be used nested etc.

If you're in a hurry and really want to write your own decoders in go you can check this twitter thread where i describe how you can do it https://twitter.com/mwader/status/1600879549612707840 not great but works.

BTW there is an older issue about this #24 that might be interesting, but this was an update.

Answer 2 · 2023-03-30T14:58:39.000Z

I imagined the lowest hanging fruit would be to allow users to author their own formatters based on your own existing structure: https://github.com/wader/fq/blob/master/format/csv/csv.go - I've only skimmed through a couple of these, but they seemed to have the same structure.

That would get the ball rolling so that you can see whether it's actually of any use to more people than the odd few.

With my dev thinking hat, I personally prefer something like Kaitai (not familiar with it, but going by the quick start it makes sense), then as an author I've got a declarative way of defining a format - it also looks like it would be easier for developers to test their formatters using this method.

If you go via the TCL, Lua route, I'd be more worried about having to support all kinds of extra languages and parsers causing bloat and potential support headache. But that's just me :)

Answer 3 · 2023-03-30T16:01:32.000Z

I imagined the lowest hanging fruit would be to allow users to author their own formatters based on your own existing structure: https://github.com/wader/fq/blob/master/format/csv/csv.go - I've only skimmed through a couple of these, but they seemed to have the same structure.

That would get the ball rolling so that you can see whether it's actually of any use to more people than the odd few.

Aha yeah csv, json, yaml etc are slightly weird formats in fq. They more or less decode to one root value that is a big raw blob covering the whole input, and that root value has a JSON value. So you will not have bit ranges per JSON value etc. One could write a "proper" fq decoder for these text formats but the output would probably be horrible to use if you want to model whitespace and everything.

With my dev thinking hat, I personally prefer something like Kaitai (not familiar with it, but going by the quick start it makes sense), then as an author I've got a declarative way of defining a format - it also looks like it would be easier for developers to test their formatters using this method.

Yeap i'm leaning towards getting basic kaitai support working first. I haven't used it much myself, but now when testing things while developing i think it has a good balance between not being too big and complex but still very expressive.

Could you elaborate on test their formatter? you mean they could also use kaitai's official compiler and tools to verify?

If you go via the TCL, Lua route, I'd be more worried about having to support all kinds of extra languages and parsers causing bloat and potential support headache. But that's just me :)

Agreed! :) i'm quite reluctant to add whole language implementations, and even more reluctant to add cgo dependencies if that would be needed.

And as i said earlier, ideally if someone wants to add a format that is well-known i would like it to be included in fq itself. I wonder what i could do to make that easier? better documentation/examples?

Answer 4 · 2023-03-30T16:30:58.000Z

Could you elaborate on test their formatter? you mean they could also use kaitai's official compiler and tools to verify?

I'm thinking just during the development process of a formatter. I made a Hex Fiend binary formatter in TCL last year (not a language I'd used before) and the hardest part was when the formatter was failing silently and not parsing properly.

It looked like (on the surface) Kaitai can compile out to different languages, which I assume means I can test a serialisation file on my own, throwing test data at it and seeing the output.

Agree on well known formats being in fq - definitely the route I'd take. I think documenting a couple of examples would be a great way to go. I'd offer up the basic "hello world" - the simplest data format, and then to complement that something much more complicated - so you've got starting points for different types of devs. I personally learn from code, but that's just one person's perspective.

Answer 5 · 2023-03-30T16:51:46.000Z

I'm thinking just during the development process of a formatter. I made a Hex Fiend binary formatter in TCL last year (not a language I'd used before) and the hardest part was when the formatter was failing silently and not parsing properly.

HexFiend 🥳 i've used it a lot and worked on the templating a bit, it was a big inspiration for fq, especially the TCL decode DSL. And yeah the developer experience is not great, that is quite improved by using go :)

It looked like (on the surface) Kaitai can compile out to different languages, which I assume means I can test a serialisation file on my own, throwing test data at it and seeing the output.

Exactly, and there is an IDE at https://ide.kaitai.io and there is a ksdump tool that can run a ksy file and dump to JSON etc. I'm using it via docker run -v "$PWD:/share" -it --entrypoint=ksdump kaitai/ksv -f json file.bin file.ksy atm to try things and generate expected output for tests.

Agree on well known formats being in fq - definitely the route I'd take. I think documenting a couple of examples would be a great way to go. I'd offer up the basic "hello world" - the simplest data format, and then to complement that something much more complicated - so you've got starting points for different types of devs. I personally learn from code, but that's just one person's perspective.

Thanks for the feedback. So maybe i should pick a good basic existing decoder and document it more carefully with a beginner in mind and then refer to it from the dev documentation.

Answer 6 · 2023-03-31T11:18:38.000Z

Hi again, i can ping you in this issue if you want when i have something to play around with

Answer 7 · 2023-03-31T12:57:04.000Z

Definitely 👍

Answer 8 · 2023-06-22T11:43:13.000Z

Just an update, still making progress on kaitai but got a bit derailed by other things and nice weather

Answer 9 · 2023-09-25T18:14:48.000Z

Is there a reason why the fq language itself couldn't be used to implement custom decoders?

For instance, suppose I'm investigating DOOM WAD files with fq. These are a collection of named "lumps" with a format that looks vaguely like this:

struct pwad {
    char type[4];
    uint32_t n_lumps;
    uint32_t lump_dir_offset;
};
struct lump {
    uint32_t data_offset;
    uint32_t size;
    char name[8];
};

Right now, I can almost implement this in the fq language, but not quite. I can write something like

def num_le: tobytes|explode|reverse|map(. band 0xff)|tobytes|tonumber;

tobytes as $file | {} 
| .type = $file[:4]
| .n_lumps = ($file[4:8]|num_le)
| .offset = ($file[8:12]|num_le)
| ($file[.offset:]) as $lumpdir 
| .lumps = [
    range(.n_lumps)
    | ($lumpdir[16* . :16*( . +1)]) as $rec | {}
    | .offset = ($rec[:4]|num_le)
    | .size = ($rec[4:8]|num_le)
    | .name = ($rec[8:16]|gsub("\u0000";""))
    | .data = $file[.offset:.offset+.size]
]

This is clunky because "Decode Values" can't be created by ordinary fq functions, so the above would only just be a bare JSON tree with no underlying connection to the bytes underneath. Further, each filter needs to keep track of both the current stream element and the buffer being parsed, so I have to do x as $full_file | y as $symbolic_record | ... a bunch. That's ugly.
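
For comparison, the same walk over the file in plain imperative Go might look like the sketch below. It follows the byte slices used in the jq attempt above (4-byte little-endian integers and 16-byte directory records), and the WAD, Lump and parseWAD names are made up for this example. Like the jq version, it yields plain values with no link back to decode values or bit ranges.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// Lump is one directory entry plus the data it points at.
type Lump struct {
	Name         string
	Offset, Size uint32
	Data         []byte
}

// WAD holds the header type and all lumps.
type WAD struct {
	Type  string
	Lumps []Lump
}

// parseWAD reads a 12-byte header followed by a lump directory of
// 16-byte records located at the offset given in the header.
func parseWAD(file []byte) (*WAD, error) {
	if len(file) < 12 {
		return nil, fmt.Errorf("short header: %d bytes", len(file))
	}
	le := binary.LittleEndian
	w := &WAD{Type: string(file[:4])}
	n := uint64(le.Uint32(file[4:8]))
	dir := uint64(le.Uint32(file[8:12]))
	if dir+16*n > uint64(len(file)) {
		return nil, fmt.Errorf("lump directory out of bounds")
	}
	for i := uint64(0); i < n; i++ {
		rec := file[dir+16*i : dir+16*(i+1)]
		l := Lump{
			Offset: le.Uint32(rec[:4]),
			Size:   le.Uint32(rec[4:8]),
			// Trailing NUL padding is trimmed from the 8-byte name.
			Name: string(bytes.TrimRight(rec[8:16], "\x00")),
		}
		end := uint64(l.Offset) + uint64(l.Size)
		if end > uint64(len(file)) {
			return nil, fmt.Errorf("lump %q out of bounds", l.Name)
		}
		l.Data = file[l.Offset:end]
		w.Lumps = append(w.Lumps, l)
	}
	return w, nil
}

func main() {
	// Tiny hand-built file: header, 4 bytes of data, one directory record.
	file := []byte("PWAD\x01\x00\x00\x00\x10\x00\x00\x00" +
		"hi\x00\x00" +
		"\x0c\x00\x00\x00\x02\x00\x00\x00MAP01\x00\x00\x00")
	w, err := parseWAD(file)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, l := range w.Lumps {
		fmt.Printf("%s offset=%d size=%d data=%q\n", l.Name, l.Offset, l.Size, l.Data)
	}
}
```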

I think the core reason behind this impedance mismatch is that the fq language is stream-based and stateless, while the internal go API is imperative and stateful. Ideally, it might make sense to extend the fq language to support the generation of "Decode Types" using a stateful DSL similar to what's inside the internal go API.

We don't need to change the language very much to do this. Here's a hypothetical example for what this could look like in my doom WAD parser:

decode tobytes into
  | .type = read(4) # advances the parser four bytes
  | .n_lumps = uint16
  | .offset = uint16
  | .lumps = seek .offset in [ # another special form; see below
    range(.n_lumps) as $n | {}
    | .offset = uint16
    | .size = uint16
    | .name = (read(8)|string0)
    | .data = seek .offset in read(.size)
]

In this example, we would introduce two special forms:

decode $buffer into $filters;. This form lexically binds $buffer into some stateful parser for the duration of $filters. Within $filters, stateful functions like int16 or read advance the currently-bound parser by a certain number of bytes, like d.FieldUTF8RawLen. These functions couldn't be used outside $filters. For convenience, the filters would be evaluated on a blank object as input so they can mutate it as above.
seek $offset in $filters This form temporarily seeks the currently-bound parser to $offset, evaluates $filters, and then seeks back to the previous location.

The resulting mapping would then be converted to a Decode Value type for future filters to play with.
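
The behaviour of the two proposed forms can be illustrated with a small stateful cursor. This is only a sketch of the semantics described above: the Cursor type and its Read, Uint16 and SeekIn methods are hypothetical, little-endian byte order is assumed, and nothing here reflects fq's real decode API.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Cursor is a hypothetical stateful parser bound to one buffer.
type Cursor struct {
	buf []byte
	pos int
}

var errShort = errors.New("read past end of buffer")

// Read returns the next n bytes and advances the cursor.
func (c *Cursor) Read(n int) ([]byte, error) {
	if n < 0 || c.pos+n > len(c.buf) {
		return nil, errShort
	}
	b := c.buf[c.pos : c.pos+n]
	c.pos += n
	return b, nil
}

// Uint16 reads a little-endian 16-bit integer and advances by two bytes.
func (c *Cursor) Uint16() (uint16, error) {
	b, err := c.Read(2)
	if err != nil {
		return 0, err
	}
	return binary.LittleEndian.Uint16(b), nil
}

// SeekIn moves to offset, runs fn, and then restores the previous
// position, mirroring the proposed "seek $offset in $filters" form.
func (c *Cursor) SeekIn(offset int, fn func() error) error {
	if offset < 0 || offset > len(c.buf) {
		return fmt.Errorf("seek to %d out of range", offset)
	}
	saved := c.pos
	defer func() { c.pos = saved }()
	c.pos = offset
	return fn()
}

func main() {
	c := &Cursor{buf: []byte{4, 0, 0xff, 0xff, 'h', 'i'}}
	off, err := c.Uint16()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	var data []byte
	err = c.SeekIn(int(off), func() error {
		var e error
		data, e = c.Read(2)
		return e
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("offset=%d data=%q position after seek=%d\n", off, data, c.pos)
}
```

Restoring the position in a deferred function means the cursor ends up where it was even if the nested reads fail.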

This is just a sketch of an idea and would take some work to implement, but I think getting away from the artificial "define parsers in go and work with them in fq" duality feels way more elegant to me.

Answer 10 · 2023-09-26T17:45:57.000Z

@gcr Hey! glad to hear someone else is interested in decode in jq 🥳 i've been quite sick this week so a bit slow, but be sure i will answer at more length with my attempts and ideas.

But short summary: I've tried to implement some kind of jq decoder API/DSL a couple of times but none of them have felt very nice or neat. And as you also noted, to allow the full powers of jq in a decoder some kind of decode context thingy and an efficient copy-on-write:ish decode value structure would probably be needed, not sure how to do that. Another problem with jq decoders might be performance, it's one of the reasons i wanted a compiled language for some format implementations like flac and mp4... also i wanted type checking and nice IDE support :)

Answer 11 · 2023-09-28T12:56:43.000Z

All great points. I imagine performance could be quite limiting if you had dozens of jq-like filters probing for file type support, heh :)

Answer 12 · 2023-10-02T13:16:36.000Z

Is there a reason why the fq language itself couldn't be used to implement custom decoders?

Not that i know of :) the main reason is mostly that i personally haven't had much use for complex custom decoders yet. I usually work mostly with standardised formats that already have go decoders (mp4, flac etc) or just need to decode a single field etc. I've made various attempts at it but haven't found a way that felt neat enough and would not require lots of effort and rewrites.

I think the core reason behind this impedance mismatch is that the fq language is stream-based and stateless, while the internal go API is imperative and stateful. Ideally, it might make sense to extend the fq language to support the generation of "Decode Types" using a stateful DSL similar to what's inside the internal go API.

Yeap i think that summarises the issue quite well.

We don't need to change the language very much to do this. Here's a hypothetical example for what this could look like in my doom WAD parser:

decode tobytes into
  | .type = read(4) # advances the parser four bytes
  | .n_lumps = uint16
  | .offset = uint16
  | .lumps = seek .offset in [ # another special form; see below
    range(.n_lumps) as $n | {}
    | .offset = uint16
    | .size = uint16
    | .name = (read(8)|string0)
    | .data = seek .offset in read(.size)
]

In this example, we would introduce two special forms:

decode $buffer into $filters;. This form lexically binds $buffer into some stateful parser for the duration of $filters. Within $filters, stateful functions like int16 or read advance the currently-bound parser by a certain number of bytes, like d.FieldUTF8RawLen. These functions couldn't be used outside $filters. For convenience, the filters would be evaluated on a blank object as input so they can mutate it as above.

seek $offset in $filters This form temporarily seeks the currently-bound parser to $offset, evaluates $filters, and then seeks back to the previous location.

The resulting mapping would then be converted to a Decode Value type for future filters to play with.

This is just a sketch of an idea and would take some work to implement, but I think getting away from the artificial "define parsers in go and work with them in fq" duality feels way more elegant to me.

Thanks for the examples and your thoughts. I think we're quite close to each other on how it could be done:

To keep the syntax jq compatible and also to not have to modify gojq more than needed (fq's fork is here https://github.com/wader/gojq) i've experimented with something similar but using syntax that looks like this:

decode(
  ( .type = utf8(4)
  | read32 as $length
  | .length = $length
  | .data = raw($length-4)
  )
)

One way to implement this is to do an AST rewrite of all decode/1 calls where the first argument is not a string. So for example the above could end up as something like:

_decode(
  ( _new_stateful_decoder as $_decoder0
  | ( .type = _decode_utf8($_decoder0; 4)
    | _decode_read32($_decoder0) as $length
    | .length = $length
    | .data = _decode_raw($_decoder0; $length-4)
    )
  )
)

So it would be a combination of an AST rewrite (probably done in jq) and a bunch of native go functions. "Intermediate" decode values should be normal decode values so they can be used as normal in expressions etc. utf8/1 and raw/1 are names that are only allowed inside decode(...) to not clutter the normal namespace. The rewrite passes the state thingy as an argument instead of as input, to not clutter input.
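
The rewrite step itself can be shown on a toy expression tree. The sketch below is not gojq's AST: the Node type, the decodeOnly set and the rewrite function are hypothetical and only demonstrate how decode-only names could be renamed and given the decoder variable as an extra first argument.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a toy call expression: a name with optional arguments.
// Leaves (literals, variables, zero-argument calls) have no Args.
type Node struct {
	Name string
	Args []*Node
}

// decodeOnly lists names that only make sense inside decode(...).
var decodeOnly = map[string]bool{"utf8": true, "raw": true, "read32": true}

// rewrite returns a copy of n where every decode-only call is renamed
// to _decode_<name> and receives decoderVar as its first argument.
func rewrite(n *Node, decoderVar string) *Node {
	out := &Node{Name: n.Name}
	for _, a := range n.Args {
		out.Args = append(out.Args, rewrite(a, decoderVar))
	}
	if decodeOnly[n.Name] {
		out.Name = "_decode_" + n.Name
		out.Args = append([]*Node{{Name: decoderVar}}, out.Args...)
	}
	return out
}

// String prints the tree using jq-style "f(a; b)" call syntax.
func (n *Node) String() string {
	if len(n.Args) == 0 {
		return n.Name
	}
	parts := make([]string, len(n.Args))
	for i, a := range n.Args {
		parts[i] = a.String()
	}
	return n.Name + "(" + strings.Join(parts, "; ") + ")"
}

func main() {
	exprs := []*Node{
		{Name: "utf8", Args: []*Node{{Name: "4"}}},
		{Name: "read32"},
		{Name: "raw", Args: []*Node{{Name: "$length-4"}}},
	}
	for _, e := range exprs {
		fmt.Printf("%s  =>  %s\n", e, rewrite(e, "$_decoder0"))
	}
}
```

Because the rewrite builds new nodes instead of mutating its input, the original expression stays available, for example for error messages that quote what the user wrote.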

Some good things, problems and unknowns with this:

How to handle fork/backtrack, e.g. decode(.a = (read8, read16))? Should it produce two outputs? If not, the jq constructs allowed inside decode(...) would have to be a subset somehow, which is not so neat. Also, a binding to a decode value needs to be figured out.
Would be nice if decode(be32) worked i think?
What would decode(.a = {a: read8, b: read16}) mean? One thing to take into account might be gojq's object key order (try gojq -n '{b: ("b" | debug), a: ("a" | debug)}'). Maybe it can be worked around with an AST rewrite into some kind of array of pairs?
Symbol mapping would be nice. Sadly jq only has string keys. A workaround could be to have some kind of fuzzy type, decode(.a = read8({"0": "a", "0x1": "b"})) etc? Or support mapping using a function, decode(.a = read8(.+10))?
Some kind of copy-on-write decode value is needed; also things like gap fields complicate things. When should they be filled in, if at all? Only for decode values that are output from decode but not for "intermediate" ones somehow?
Sub format decode using nested .a = decode("id3v2")?
Seek i haven't thought about yet
Lots more, can add more once i remember things
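
On the gap fields point in the list above, a rough sketch of the computation follows, assuming a gap means a byte range that no decoded field covers. The Range type and gaps function are hypothetical. Running this only once, on the final output of a decode, would be one way to avoid filling gaps into "intermediate" values.

```go
package main

import (
	"fmt"
	"sort"
)

// Range is a half-open byte range [Start, End).
type Range struct{ Start, End int }

// gaps returns the ranges in [0, total) that no field covers.
// Fields may be given in any order and may overlap.
func gaps(fields []Range, total int) []Range {
	sorted := append([]Range(nil), fields...) // copy so the input is not reordered
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Start < sorted[j].Start })
	var out []Range
	pos := 0
	for _, f := range sorted {
		if f.Start > pos {
			out = append(out, Range{pos, f.Start})
		}
		if f.End > pos {
			pos = f.End
		}
	}
	if pos < total {
		out = append(out, Range{pos, total})
	}
	return out
}

func main() {
	fields := []Range{{4, 8}, {0, 2}}
	for _, g := range gaps(fields, 10) {
		fmt.Printf("gap bytes[%d:%d]\n", g.Start, g.End)
	}
}
```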

Some other simpler alternatives i've played around with are allowing some jq subset like decode({a: u32, b: utf8(4)}) and not much more. But it feels very limited. Maybe one could add limited support for referencing other fields using some kind of .data = raw($parent.length+4) etc? Feels weird.

Note that this also disregards that fq's internals have quite a lot of shortcuts and hacks that might need to be fixed along the way :)

Also i think there will still be a case for having decoders written in go for various reasons: performance, complex format details like mp4 sample tables might be easier to deal with in go, the pcap format reuses existing go code for tcp-reassembly etc.