bxparks/bigquery-schema-generator

Skip bad records instead of throwing an exception?

VictorLeP opened this issue · 8 comments

I have a newline delimited JSON file with a few bad (i.e. undecodable) lines. Currently this results in a JSONDecodeError halting execution.
Given that BigQuery can cope with bad records (--max_bad_records parameter) by skipping them, would it be useful to have a similar option in the schema generator? (This could be useful for e.g. CSV files with missing trailing columns as well.)
Concretely, the issue with my JSON file could be resolved by adding an (optional) try/except to

def json_reader(file):
    """A generator that converts an iterable of newline-delimited JSON objects
    ('file' could be a 'list' for testing purposes) into an iterable of Python
    dict objects.
    """
    for line in file:
        yield json.loads(line)
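
For illustration, a minimal sketch of what such an optional try/except might look like (the skip_invalid parameter and the logging call are hypothetical, not part of the current code):

import json
import logging

def json_reader(file, skip_invalid=False):
    """A generator that yields one Python dict per newline-delimited JSON
    line, optionally skipping lines that fail to parse."""
    for line_number, line in enumerate(file, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            if not skip_invalid:
                raise
            logging.warning('Skipping invalid JSON on line %d', line_number)
            continue
        yield record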

Can you filter out the invalid lines of your JSON input file before processing it through generate_schema.py? I mean it's a 12-line python script:

(Edit: Fixed line counter, originally started as an error counter)

#!/usr/bin/env python3
import sys
import json

line_number = 0
for line in sys.stdin:
    line_number += 1
    try:
        json.loads(line)
        print(line, end='')
    except json.JSONDecodeError:
        print(f'#{line_number}: invalid JSON', file=sys.stderr)

Put this in a script called filter_json.py. Then you would just use normal Unix pipes:

$ ./filter_json.py < mydata.json | generate_schema.py

I am hesitant about adding a --max_bad_records flag because then I have to make sure that what bigquery-schema-generator considers to be "bad" matches exactly what bq load considers bad, and I'd rather not be chasing bq load compatibility for something like this. Plus, different people probably want to handle those failures in different ways. Some people just want to drop them, some people want to make a small transformation to bring the data into compliance.

I want to keep generate_schema.py as simple as possible, so I would prefer off-loading the data cleansing and data transformations to the people who know their data best. But I'm not completely opposed to something like --max_bad_records for lines which completely fail to parse as JSON. Maybe you can give me some more compelling reasons why the flag is a good idea.

The core reason for me would be performance, i.e. not having to traverse and parse the JSON file twice.
I agree that naming it max_bad_records may be confusing, especially if it wouldn't cover all types of errors that bq load can ignore. Maybe something like ignore_invalid_json would be better?
I note that there is already some resilience to errors, with certain edge cases being caught and logged using log_error(). There's also --sanitize_names, which is another built-in data-cleaning step; I'm not even sure how useful that one is, given that I'd think you would still have to change the column names in the original data as well?

Performance is a fair point for large input files. So something like --ignore_invalid_json or --ignore_invalid_lines seems reasonable. I'll noodle over this for a bit, I might get to it sometime this week. With regards to --sanitize_names, that was contributed by someone else (I think?), and I can't remember what it does exactly. :-)

For JSON files, it seems to me that the easiest/shortest solution would be to catch the exception in the json_reader() method mentioned above, although I'm not sure this solution is the cleanest in terms of where the exception is caught (deduce_schema() might be better).
However, it seems that handling the exception elsewhere requires a more convoluted solution, with catching the exception and yielding it to the consumer (1, 2).
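
As a rough sketch of that pattern (the names and the consumer loop are illustrative only):

import json

def json_reader(file):
    """Yield parsed dicts; on a decode error, yield the exception itself so
    the consumer can decide whether to skip the line or re-raise."""
    for line in file:
        try:
            yield json.loads(line)
        except json.JSONDecodeError as e:
            yield e

lines = ['{"a": 1}', 'not json', '{"b": 2}']  # 'file' can be any iterable of lines
for item in json_reader(lines):
    if isinstance(item, Exception):
        print(f'skipping invalid line: {item}')  # or re-raise when the flag is off
        continue
    print('got record:', item)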

If the option becomes more generic, i.e. ignore_invalid_lines, would this also imply handling 'invalid' lines in CSV files? For now, any line with too few or too many columns is handled gracefully by csv.DictReader (too few: the missing columns are filled with the default value given by restval, usually None; too many: the surplus values are collected under the key given by restkey, which defaults to None).
To me, supporting ignore_invalid_lines for CSV files as well implies that by default, the schema generator should check whether a line is invalid (i.e. no missing columns filled in with restval and no surplus values stored under restkey) and explicitly raise an error if it is. It would make the code a bit more complicated though :-)
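
A small self-contained demonstration of that DictReader behavior (the restkey/restval values here are chosen for illustration; by default both are None):

import csv
import io

# The header declares three columns; the data rows have too few and too many values.
data = "a,b,c\n1,2\n1,2,3,4\n"
for row in csv.DictReader(io.StringIO(data), restkey='extra', restval=None):
    print(row)
# {'a': '1', 'b': '2', 'c': None}                  <- missing column filled with restval
# {'a': '1', 'b': '2', 'c': '3', 'extra': ['4']}   <- surplus value collected under restkey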

Implemented --ignore_invalid_lines. It seemed easiest to catch the exception, yield it back to the caller, and test for that. My intention was that this would also work transparently for CSV files, but as soon as I checked the code, I realized that the exception would be thrown inside the for loop over the DictReader, so the flag won't work for CSV. Hmm, I think I can fix that in a subsequent commit.
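
One possible way to make it work for CSV as well (just a sketch, not necessarily how the follow-up commit will do it) is to drive the DictReader by hand so that an error raised while reading a particular line can be caught and yielded to the caller, like the JSON case:

import csv

def csv_reader(file):
    """Sketch: iterate the DictReader manually so that a csv.Error raised
    while reading a single line can be yielded to the caller instead of
    escaping from a plain 'for' loop."""
    reader = csv.DictReader(file)
    while True:
        try:
            row = next(reader)
        except StopIteration:
            return
        except csv.Error as e:
            yield e
            continue
        yield row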

Side comment about --sanitize_names: After reviewing that code (yes, contributed by someone else), I added a note to the README.md explaining that this flag is most useful for CSV files, because apparently bq load automatically applies that transformation of column names for CSV files only. For JSON files, it throws an error and refuses to load the file if the column names do not meet the requirements.
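
For reference, the kind of transformation involved is roughly the following (an illustrative approximation only; the exact rules applied by --sanitize_names and bq load may differ):

import re

def sanitize_name(name):
    """Replace characters that are not letters, digits, or underscores, and
    avoid names starting with a digit. Illustrative only."""
    clean = re.sub(r'[^a-zA-Z0-9_]', '_', name)
    if clean and clean[0].isdigit():
        clean = '_' + clean
    return clean

print(sanitize_name('order date'))   # order_date
print(sanitize_name('1st column'))   # _1st_column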

You can try out my code by checking out the develop branch.

Looks like it works well, thanks!
I noticed that even without --ignore_invalid_lines the behavior is slightly different: instead of printing an exception and producing no schema at all, it prints the "Problem on line ..." error but still outputs the schema generated up to that point, because it now breaks instead of throwing an exception. It's a small difference; I don't know if it's worth documenting.

Thanks for testing! The change in behavior was not intended, I was tired yesterday and was not thinking clearly. I fixed it so that it follows the old behavior when --ignore_invalid_lines is missing, in other words, it stops and throws an exception as soon as the first invalid line is encountered.

I pulled the latest version and I can confirm that without the flag, the script behaves as it did originally (throwing an exception); with the flag, it works as expected (skipping the line).
Once again thanks for the addition!