cldf/csvw

Field limit of 131072 characters?

Closed this issue · 3 comments

I am getting _csv.Error: field larger than field limit (131072) when validating a CLDF dataset.

Sure, I am storing large strings (whole book chapters in mixed markdown/HTML) in a single CSV cell.
If a ChapterTable in a CLDF dataset is not the proper way to store such data, what is?

If that is the proper way and this error is not supposed to happen, is the expected fix something like:

import csv
import sys

csv.field_size_limit(sys.maxsize)

The cldf command has an option to enlarge the field size limit, see:

$ cldf -h
usage: cldf [-h] [--log-level LOG_LEVEL] [-z FIELD_SIZE_LIMIT] COMMAND ...

optional arguments:
  -h, --help            show this help message and exit
  --log-level LOG_LEVEL
                        log level [ERROR|WARN|INFO|DEBUG] (default: 20)
  -z FIELD_SIZE_LIMIT, --maxfieldsize FIELD_SIZE_LIMIT
                        Maximum length of a single field in any input CSV
                        file. (default: 131072)
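
For example, to validate a dataset with a raised limit (the metadata path here is a placeholder; validate is pycldf's validation subcommand):

$ cldf -z 10000000 validate path/to/cldf-metadata.json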

I thought about making a larger value the default, but this would have meant

  • overriding Python's default
  • potentially removing a check that is often useful (e.g. when using the wrong cell delimiter for reading); see the sketch below.
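
A small, constructed illustration of that second point: pipe-delimited data read with Python's default comma dialect, where a stray quote makes the parser swallow the rest of the file into a single field until the limit trips.

import csv
import io

# Pipe-delimited data whose first cell happens to start with a double quote.
data = '"oops|1|2\n' + 'a|b|c\n' * 100_000

# Read with the wrong dialect (comma-delimited, double-quoted), the opening
# quote is never closed, so the parser keeps accumulating the rest of the
# file into one field and hits the 131072-character default limit.
reader = csv.reader(io.StringIO(data))
next(reader)  # raises _csv.Error: field larger than field limit (131072)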

If you call Dataset.validate from Python code, then yes, explicitly setting the field size limit via csv.field_size_limit would be the recommended way.
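
A minimal sketch of that approach, assuming a pycldf Dataset described by a metadata file (the path and the chosen limit are placeholders):

import csv

from pycldf import Dataset

# Raise the csv module's per-field limit before any table is read.
# Note that sys.maxsize can overflow the underlying C long on some
# platforms (e.g. 64-bit Windows), so an explicit value is safer.
csv.field_size_limit(10_000_000)

dataset = Dataset.from_metadata('path/to/cldf-metadata.json')  # placeholder path
dataset.validate()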

Ah, I didn't even check there. Thanks!