efficient/catbench

jaguar chokes with MemoryError on large (multi-gigabyte) JSON files

solb opened this issue · 1 comment

solb commented

This makes it really hard to deal with these files, because jq and Vim also tend to choke on them. The coreutils are mostly tolerant of these files, but it's still not obvious how to use them to extract the logs easily/quickly: base64 adds line breaks, which jaguar saves as \n escapes. Normally jaguar interprets those escapes on the query side, but doing that efficiently with other tools is hard, and base64 -d doesn't handle the escapes itself.
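For the record, the decode step by itself is doable with other tools. Something like the following works, assuming the escaped base64 string has already been pulled out into a file (escaped.b64 and log.bin are just stand-in names) and that GNU sed is used, so the \n in the replacement becomes a real newline:

$ sed 's/\\n/\n/g' escaped.b64 | base64 -d >log.bin

The part this doesn't help with is isolating that string value from a multi-gigabyte JSON file in the first place.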

Then of course there's the obvious problem that our data extraction/processing scripts barf when jaguar does. Currently I've been working around this by replacing the network_rawnums/timescale_and_cdf superhero dream team with:
$ grep -F "Completed after:" ><temp 1>    # keep only the "Completed after:" lines
$ cut -d" " -f4 <temp 1> ><temp 2>    # keep only the 4th space-separated field
$ tail -n+"$((1 + n * 30000000))" <temp 2> | head -n30000000 | tail -n20000000 | <...>    # chunk n: skip n*30M lines, take the next 30M, keep their last 20M

and

$ ./largescale_and_cdf <...>

solb commented

Thomas had a brainstorm: we now work around this by compressing large logs before base-64 encoding them! To tell whether a Jaguar file does this, try reading meta.unpack; if this key exists, execute its value instead of a simple base64 -d to unpack the meta.log entry.
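A rough sketch of that check, assuming the file is small enough for jq to parse (which, per the above, the really big ones aren't), that meta is a top-level key, that meta.unpack holds a shell command reading the base64 text on stdin and writing the decoded/decompressed log on stdout, and that results.json and log.txt are stand-in names:

unpack="$(jq -r '.meta.unpack // empty' results.json)"
if [ -n "$unpack" ]; then
    # new-style file: meta.unpack says how to turn the base64 text back into the log
    jq -r '.meta.log' results.json | sh -c "$unpack" >log.txt
else
    # old-style file: the log is plain base64
    jq -r '.meta.log' results.json | base64 -d >log.txt
fi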