GitHub Archive dump updates error out with JSON parsing error
Closed this issue · 1 comments
tja commented
The latest GitHub archive dump updates fail with the following error:
invalid character '\\x00' looking for beginning of value
Investigation shows that the dump contains a block of null bytes (0x00
). One theory is that the dump was "cleaned up" in a post-processing step, by "wiping out" an existing record.
tja commented
This has been fixed by introducing a wrapping io.Reader
that simply converts all null bytes (0x00
) to line-feeds (0x0a
). Line-feeds are ignored by the JSON parser, i.e. the "wiped out" record is skipped.