bxparks/bigquery-schema-generator

add configurable csv.field_size_limit in SchemaGenerator

mzapukhlyak opened this issue · 5 comments

File "/lib/python3.11/site-packages/bigquery_schema_generator/generate_schema.py", line 190, in deduce_schema
for json_object in reader:
File "/lib/python3.11/csv.py", line 111, in next
row = next(self.reader)
^^^^^^^^^^^^^^^^^
_csv.Error: field larger than field limit (131072)

version = '1.5.1'

You have a CSV file with a field that is longer than 128 KiB? Are you sure you don't have a delimiter problem?

In any case, if you are using SchemaGenerator as a Python library, you can call csv.field_size_limit() directly from your own code; there is no need to add plumbing through the SchemaGenerator class.
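
For anyone landing here with the same error, a minimal sketch of that library-level workaround follows. It assumes bigquery-schema-generator is installed and importable; the 10 MiB value is only an example, not a recommended setting.

import csv

# Raise the csv module's default per-field limit (131072 bytes, i.e. 128 KiB).
# The limit is process-wide, so it also applies to the csv reader that
# SchemaGenerator creates internally; no changes to the library are needed.
csv.field_size_limit(10 * 1024 * 1024)

# ... then construct SchemaGenerator and call deduce_schema() as usual.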

If you are calling the generate_schema wrapper script from the command line, then we will need to add a command line flag, add tests, add documentation, etc., etc. It's not a high priority for me, so I recommend you hack your own copy of the source code. If you are able to polish your changes, maybe send me a PR.

Thanks for the prompt answer.
Yes, unfortunately this CSV file is the result of flattening the top few levels of a very large and deeply nested JSON file. OK, I'll experiment with csv.field_size_limit() locally in a user-level script first.

Instead of flattening to CSV, can you flatten to JSON, using something like jq?

Also, you don't have to use the generate_schema wrapper script included in my package. It's very easy to create your own wrapper script. Take a look at https://github.com/bxparks/bigquery-schema-generator/blob/develop/examples/csvreader.py. You can just copy that script, add your csv.field_size_limit() call, and customize it as you wish.
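
For reference, such a wrapper might look like the sketch below. It is not a copy of examples/csvreader.py (see that file for the real example); the error-log handling, the 10 MiB limit, and the script name are assumptions made for illustration.

#!/usr/bin/env python3
"""Sketch of a custom wrapper: read CSV on stdin, print a BigQuery schema."""
import csv
import json
import sys

from bigquery_schema_generator.generate_schema import SchemaGenerator


def main():
    # Lift the default 128 KiB per-field limit before any CSV parsing happens.
    csv.field_size_limit(10 * 1024 * 1024)

    generator = SchemaGenerator(input_format='csv')
    schema_map, error_logs = generator.deduce_schema(sys.stdin)

    # Surface any rows the generator could not fully interpret.
    for error in error_logs:
        print(error, file=sys.stderr)

    schema = generator.flatten_schema(schema_map)
    json.dump(schema, sys.stdout, indent=2)
    print()


if __name__ == '__main__':
    main()

Invoked as something like ./my_generate_schema.py < flattened.csv > schema.json (names hypothetical), it behaves much like the bundled script but with the higher field limit baked in.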

Indeed, jq is used in the intermediate filtering and flattening steps.
CSV was chosen as the final format because it simplifies QC of the resulting schema :)

I had already added csv.field_size_limit() to our pipeline script, but it took me some time to figure out what was going on and why csv.field_size_limit() was needed at all.

So my suggestion (to add it to your script/package) is just to save other people's time.

I think it would be incredibly rare for someone to produce a CSV file with a field longer than 128 KiB. I don't think it is worth spending time and effort on such a rare edge case, since they can work around it by creating a custom script. But thanks for reporting it. This ticket will provide them with useful information.