bxparks/bigquery-schema-generator

add configurable csv.field_size_limit in SchemaGenerator

mzapukhlyak opened this issue · 5 comments

File "/lib/python3.11/site-packages/bigquery_schema_generator/generate_schema.py", line 190, in deduce_schema
for json_object in reader:
File "/lib/python3.11/csv.py", line 111, in next
row = next(self.reader)
^^^^^^^^^^^^^^^^^
_csv.Error: field larger than field limit (131072)

version = '1.5.1'

You have a CSV file with a field that is longer than 128 KiB? Are you sure you don't have a delimiter problem?

In any case, if you are using SchemaGenerator as a Python library, you can call csv.field_size_limit() directly from your own code; there is no need to add plumbing through the SchemaGenerator class.
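
For anyone landing here with the same error, a minimal sketch of that library-level workaround follows. It assumes bigquery-schema-generator is installed and importable; the 10 MiB value is only an example, not a recommended setting.

import csv

# Raise the csv module's default per-field limit (131072 bytes, i.e. 128 KiB).
# The limit is process-wide, so it also applies to the csv reader that
# SchemaGenerator creates internally; no changes to the library are needed.
csv.field_size_limit(10 * 1024 * 1024)

# ... then construct SchemaGenerator and call deduce_schema() as usual.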

If you are calling the generate_schema wrapper script from the command line, then we will need to add a command line flag, add tests, add documentation, etc., etc. It's not a high priority for me, so I recommend you hack your own copy of the source code. If you are able to polish your changes, maybe send me a PR.

Thanks for the prompt answer.
Yes, unfortunately this CSV file is the result of flattening the top few levels of a very large and deeply nested JSON file. OK, I'll experiment with csv.field_size_limit() locally in a user-level script first.

Instead of flattening to CSV, can you flatten to JSON, using something like jq?

Also, you don't have to use the generate_schema wrapper script included in my package. It's very easy to create your own wrapper script. Take a look at https://github.com/bxparks/bigquery-schema-generator/blob/develop/examples/csvreader.py. You can just copy that script, add your csv.field_size_limit() call, and customize it as you wish.
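
For reference, such a wrapper might look like the sketch below. It is not a copy of examples/csvreader.py (see that file for the real example); the error-log handling, the 10 MiB limit, and the script name are assumptions made for illustration.

#!/usr/bin/env python3
"""Sketch of a custom wrapper: read CSV on stdin, print a BigQuery schema."""
import csv
import json
import sys

from bigquery_schema_generator.generate_schema import SchemaGenerator


def main():
    # Lift the default 128 KiB per-field limit before any CSV parsing happens.
    csv.field_size_limit(10 * 1024 * 1024)

    generator = SchemaGenerator(input_format='csv')
    schema_map, error_logs = generator.deduce_schema(sys.stdin)

    # Surface any rows the generator could not fully interpret.
    for error in error_logs:
        print(error, file=sys.stderr)

    schema = generator.flatten_schema(schema_map)
    json.dump(schema, sys.stdout, indent=2)
    print()


if __name__ == '__main__':
    main()

Invoked as something like ./my_generate_schema.py < flattened.csv > schema.json (names hypothetical), it behaves much like the bundled script but with the higher field limit baked in.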

Indeed, jq is used in the intermediate filtering and flattening steps.
CSV was chosen as the final format because it simplifies QC of the resulting schema :)

I had already added csv.field_size_limit() to our pipeline script, but it took me some time to figure out what was going on and why csv.field_size_limit() was needed at all.

So my suggestion (to add it to your script/package) is just to save other people's time.

I think it would be incredibly rare for someone to produce a CSV file with a field longer than 128 KiB. I don't think it is worth spending time and effort on such a rare edge case, since they can work around it by creating a custom script. But thanks for reporting it. This ticket will provide them with useful information.