Replace infer_schema_length by infer_schema
josevalim opened this issue · 4 comments
Today infer_schema_length has an awkward API, since setting it to nil is used to infer all columns and 0 is used to disable it.
I propose:
infer_schema: true | false | non_neg_integer()
Where true enables, false disables, and the integer configures the length. The default can be the same as today.
I like this, but what would we use for all rows? IIUC true -> default (1000 rows).
true means all rows.
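A minimal sketch of how the proposed option could translate to today's convention, written in Rust since that is where the native readers live. The `InferSchema` enum and `to_legacy_length` function are hypothetical names for illustration, not part of Explorer or Polars. It assumes, per the reply above, that true means all rows, and that the legacy side keeps the meaning described in the issue (nil for everything, 0 for disabled).

```rust
/// Hypothetical model of the proposed `infer_schema` option.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InferSchema {
    /// `infer_schema: true`
    Enabled,
    /// `infer_schema: false`
    Disabled,
    /// `infer_schema: n` for some non-negative integer `n`
    Rows(usize),
}

/// Translate the proposed option into the existing `infer_schema_length`
/// convention, where `None` (nil) scans everything and `Some(0)` disables
/// inference. Assumption: `true` means "all rows", as stated in the thread.
pub fn to_legacy_length(option: InferSchema) -> Option<usize> {
    match option {
        InferSchema::Enabled => None,
        InferSchema::Disabled => Some(0),
        InferSchema::Rows(n) => Some(n),
    }
}
```

With this mapping, an explicit `Rows(0)` and `Disabled` end up as the same legacy value, so the integer form remains backwards compatible with callers that already pass 0.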
Thanks for improving this! Just sharing how DuckDB does it.
It has two parameters:
- auto_detect: true | false
- sample_size: BIGINT (-1 means all rows, default 20480)
ref:
CSV Import – DuckDB
CSV Auto Detection – DuckDB
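For comparison, the two-parameter shape can be sketched as follows. This is an illustration only: `Sample` and `interpret` are hypothetical names, and the rules that `auto_detect = false` overrides any `sample_size` and that values below -1 are rejected are assumptions, not documented DuckDB behaviour.

```rust
/// How many rows a reader should sample when detecting column types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Sample {
    Disabled,
    AllRows,
    FirstRows(u64),
}

/// Hypothetical helper that interprets a DuckDB-style option pair.
/// Assumptions: disabling detection wins over `sample_size`, -1 means
/// all rows (as in the comment above), and anything below -1 is invalid.
pub fn interpret(auto_detect: bool, sample_size: i64) -> Result<Sample, String> {
    if !auto_detect {
        return Ok(Sample::Disabled);
    }
    match sample_size {
        -1 => Ok(Sample::AllRows),
        n if n >= 0 => Ok(Sample::FirstRows(n as u64)),
        n => Err(format!("invalid sample_size: {n}")),
    }
}
```

The trade-off relative to a single `infer_schema` option is that two parameters can contradict each other (detection off, sample size set), which is why the sketch has to pick a precedence rule.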
I am more than happy to take a stab at this
> Today infer_schema_length has an awkward API, since setting it to nil is used to infer all columns and 0 is used to disable it.
>
> I propose:
>
> infer_schema: true | false | non_neg_integer()
>
> Where true enables, false disables, and the integer configures the length. The default can be the same as today.
- Is it only for csv, or should we also change it on load_ndjson?
- Also, one strange thing I didn't get: the polars side doesn't seem to have an option to disable schema inference for ndjson.

👉🏼 Given Option<NonZeroUsize> to infer the schema, what I understand is:
- if it's None, it will use the entire file
- else it will use len(given) rows
- it will fail at compile time if you give 0
/// Set the JSON reader to infer the schema of the file. Currently, this is only used when reading from
/// [`JsonFormat::JsonLines`], as [`JsonFormat::Json`] reads in the entire array anyway.
///
/// When using [`JsonFormat::JsonLines`], `max_records = None` will read the entire buffer in order to infer the
/// schema, `Some(1)` would look only at the first record, `Some(2)` the first two records, etc.
///
/// It is an error to pass `max_records = Some(0)`, as a schema cannot be inferred from 0 records when deserializing
/// from JSON (unlike CSVs, there is no header row to inspect for column names).
pub fn infer_schema_len(mut self, max_records: Option<NonZeroUsize>) -> Self {
    self.infer_schema_len = max_records;
    self
}
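A small sketch of why this matters for the proposal. `Scan`, `scan_for`, and `max_records_from` are hypothetical names that only mirror the documented meaning of `max_records`; they are not Polars APIs. The one standard-library fact relied on is that `NonZeroUsize::new(0)` returns `None`.

```rust
use std::num::NonZeroUsize;

/// What a reader would look at during schema inference.
#[derive(Debug, PartialEq, Eq)]
pub enum Scan {
    EntireBuffer,
    FirstRecords(usize),
}

/// Mirrors the doc comment above: `None` reads everything,
/// `Some(n)` looks at the first `n` records.
pub fn scan_for(max_records: Option<NonZeroUsize>) -> Scan {
    match max_records {
        None => Scan::EntireBuffer,
        Some(n) => Scan::FirstRecords(n.get()),
    }
}

/// Hypothetical naive conversion from a plain row count.
/// `NonZeroUsize::new(0)` yields `None`, so a count of 0 silently
/// becomes "read the entire buffer" rather than "disabled".
pub fn max_records_from(rows: usize) -> Option<NonZeroUsize> {
    NonZeroUsize::new(rows)
}
```

So if the ndjson loader were to accept infer_schema: false or 0, it would presumably need to reject or special-case that value before converting, since passing it through naively would mean the opposite of what the caller asked for.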