Alternative source format JSON extraction bug
Summary
The JSONL files I've been using to write the parquet files are fine for the paper files and for the mentions extracted from PDF parses, but the files holding mentions extracted from alternative (non-PDF) source formats seem to be cut off at the end. This prevents us from converting these datasets to parquet. Approximately 40% of these files exhibit the problem.
Observed behavior
- .jsonl files for non-PDF source files appear cut off at the end, ending in the middle of an object. The files are otherwise valid.
- gzip reports no problems with the file integrity
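A quick way to confirm this pattern across many files is to scan each one and record where the first unparseable line sits. The sketch below is only an illustration: the function name `first_bad_line` is made up for this issue, and it assumes the files are gzip-compressed JSONL with one object per line.

```python
import gzip
import json


def first_bad_line(path):
    """Return (line_number, is_last_line) for the first line that fails to
    parse as JSON, or None if every non-blank line parses.

    Assumes `path` is a gzip-compressed JSONL file (one object per line).
    """
    bad = None
    total = 0
    # errors="replace" keeps a multi-byte character cut in half by the
    # truncation from raising UnicodeDecodeError before we reach json.loads.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for total, line in enumerate(f, start=1):
            if bad is None and line.strip():
                try:
                    json.loads(line)
                except json.JSONDecodeError:
                    bad = total
    if bad is None:
        return None
    return bad, bad == total
```

A result of `(n, True)` matches the behavior described above (only the final object is cut off), while `(n, False)` would point to damage in the middle of the file.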
Possible causes
Case 1. File streams were not flushed properly before being closed.
Case 2. The files in the original dataset are incomplete.
Case 3. The files were extracted incompletely.
Case 1 seems the most likely, though this can be confirmed right away by inspecting the contents of the original dataset. What's strange here is that the files pass the gzip integrity checks. That means the gzip stream was closed properly, but for some reason the buffered writer on top of it was not flushed or closed properly first. This is very strange behavior.
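To see how a file can pass gzip's integrity check and still end mid-object, here is a small simulation of Case 1. It is not the actual extraction code, which isn't shown in this issue: `TinyBufferedWriter` and `write_jsonl_gz` are hypothetical stand-ins for whatever buffering layer sat above the gzip stream.

```python
import gzip
import io
import json


class TinyBufferedWriter:
    """Hypothetical stand-in for a buffered writer layered over a gzip stream.

    It forwards data to `sink` only in full `size`-byte chunks and holds the
    remainder until flush() is called.
    """

    def __init__(self, sink, size=64):
        self.sink = sink
        self.size = size
        self.pending = b""

    def write(self, data):
        self.pending += data
        while len(self.pending) >= self.size:
            self.sink.write(self.pending[:self.size])
            self.pending = self.pending[self.size:]

    def flush(self):
        self.sink.write(self.pending)
        self.pending = b""


def write_jsonl_gz(records, flush_before_close):
    """Write records as gzip-compressed JSONL and return the compressed bytes."""
    raw = io.BytesIO()
    gz = gzip.GzipFile(fileobj=raw, mode="wb")
    writer = TinyBufferedWriter(gz)
    for record in records:
        writer.write((json.dumps(record) + "\n").encode("utf-8"))
    if flush_before_close:
        writer.flush()
    # Closing the gzip stream writes a valid trailer (CRC and length) for
    # whatever bytes it received, even if the buffer above still held data.
    gz.close()
    return raw.getvalue()


records = [{"id": i, "text": "x" * 20} for i in range(5)]
good = gzip.decompress(write_jsonl_gz(records, flush_before_close=True))
bad = gzip.decompress(write_jsonl_gz(records, flush_before_close=False))
```

Both outputs decompress without error, but `bad` is a strict prefix of `good` that stops partway through the last record. That is the same symptom as in the affected files: a valid gzip container around truncated JSONL.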
Case 2 seems unlikely as in all manually inspected cases, it was the last object in the file that was cut off, never one in the middle.
Case 3 isn't easily verifiable, but is a likely explanation if the flushing code has no issues.
So we'll proceed as if 1 is the case.
Remediation
About 40% of the mentions files for alternative source files are corrupted in this way. This is a large enough percentage that it's plausible (though unlikely) that some other files were also cut off, but in a place that doesn't cause a JSON parsing error (if the stream happened to stop writing exactly at the end of an object before closing). So we'll proceed as if any of the alternative parse files may be corrupted. This also saves us from having to inspect the other parses for errors, since we assume any of the files may be incomplete.
1. Get the alternative parse files. This can be done by:
   - Listing the files in the archive and writing the listing to a file.
   - Filtering the listing's lines to keep only the JSON files with mentions in alternative paper parses.
   - Extracting these files one by one.
2. Re-concatenate the compressed JSONL files.
3. Check for issues. If there are any, debug and go back to step 2. Otherwise, done.
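The filtering and re-concatenation steps could look roughly like the sketch below. Everything specific in it is an assumption: the archive's path layout isn't given in this issue, so the `required` substrings are placeholders, and the code assumes each extracted file holds a single JSON document. The function names are hypothetical too.

```python
import gzip
import json


def filter_listing(lines, suffix=".json", required=("mentions", "alternative")):
    """Keep only the listing entries that look like alternative-parse mentions.

    `suffix` and `required` are placeholder patterns; adjust them to the
    archive's real path layout.
    """
    keep = []
    for line in lines:
        path = line.strip()
        if path.endswith(suffix) and all(token in path for token in required):
            keep.append(path)
    return keep


def concat_to_jsonl_gz(json_paths, out_path):
    """Write one JSON object per line into a gzip-compressed JSONL file.

    Assumes each input file contains exactly one JSON document. Returns the
    number of records written.
    """
    count = 0
    # The `with` block closes the text wrapper first, which flushes it into
    # the gzip stream before that stream writes its trailer.
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for path in json_paths:
            with open(path, encoding="utf-8") as f:
                obj = json.load(f)
            out.write(json.dumps(obj) + "\n")
            count += 1
    return count
```

Comparing the returned count with the number of lines that parse when the output is read back gives a check that doesn't depend on spotting a broken final object.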
I'm not sure about the performance of the plan in step 1. Since the archive has 100 million files and we're individually extracting 2 million of them, this may be extremely slow. I'll run timing tests on batches in powers of 10, starting with 1 file and going up to 10,000, and then decide whether to proceed. If the projected time is longer than 6 days, then it's worth re-extracting the entire archive and trimming out the standard paper and PDF mentions files before proceeding to step 2.
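For reference, 6 days is 518,400 seconds, so extracting 2 million files within that budget allows about 0.26 seconds per file on average. A rough timing harness for the powers-of-10 test might look like this. `extract_one` is a hypothetical callable standing in for whatever command extracts a single file, and the projection assumes the per-file cost stays constant as the batch grows.

```python
import time

TOTAL_FILES = 2_000_000   # alternative-parse mentions files to extract
BUDGET_DAYS = 6           # cutoff beyond which a full re-extraction is preferred
SECONDS_PER_DAY = 86_400


def projected_days(seconds_per_file, total_files=TOTAL_FILES):
    """Extrapolate a measured per-file time to the whole job, in days."""
    return seconds_per_file * total_files / SECONDS_PER_DAY


def benchmark(extract_one, paths, sizes=(1, 10, 100, 1_000, 10_000)):
    """Time `extract_one` over growing batches.

    Returns a list of (batch_size, seconds_per_file, projected_days) tuples.
    Stops early if `paths` has fewer entries than the next batch size.
    """
    results = []
    for n in sizes:
        batch = paths[:n]
        if len(batch) < n:
            break
        start = time.perf_counter()
        for path in batch:
            extract_one(path)
        per_file = (time.perf_counter() - start) / n
        results.append((n, per_file, projected_days(per_file)))
    return results
```

If the projected days for the larger batches stay under `BUDGET_DAYS`, individual extraction is viable. If the per-file time grows with batch size, the linear extrapolation will underestimate the total.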