shshemi/tabiew

[ BUG ] running with separator ¶

greyHairChooseLife opened this issue · 1 comment

Hi, there.

$ tw --separator '¶' ./my.csv
$ tw --infer-schema safe --separator '¶' ./my.csv
$ tw --infer-schema no --separator '¶' ./my.csv

All three commands return the same error:

Error: ComputeError(ErrString("could not parse `1009880005252�` as dtype `str` at column '�' (column number 1)\n\nThe current offset in the file is 154 bytes.\n\nYou might want to try:\n- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),\n- specifying correct dtype with the `dtypes` argument\n- setting `ignore_errors` to `True`,\n- adding `1009880005252�` to the `null_values` list.\n\nOriginal error: ```invalid utf-8 sequence```"))

The error says the file contains an invalid UTF-8 sequence, but the file is valid UTF-8. Any clue, please?

Regards

Hi,

Thank you for reporting this.
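The problem is that the underlying CSV library (Polars) assumes that the separator and quote characters are single-byte ASCII. The pilcrow (¶, U+00B6) is not ASCII: its UTF-8 encoding is the two bytes 0xC2 0xB6. When the character is cast to u8, only 0xB6 survives, so the parser splits on that byte and leaves a stray 0xC2 at the end of each field. That leftover byte is the invalid UTF-8 sequence in the error.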
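
To make the truncation concrete, here is a minimal, standalone Rust sketch. It assumes the separator is narrowed with a plain `as u8` cast, as described above. The helper names `truncate_separator` and `first_field` are hypothetical and are not taken from tabiew or Polars.

```rust
/// Hypothetical helper: narrows a separator `char` to a single byte,
/// keeping only the low 8 bits of its code point.
fn truncate_separator(sep: char) -> u8 {
    sep as u8
}

/// Hypothetical helper: returns the bytes of the first field when `line`
/// is split on the single byte `sep`.
fn first_field(line: &str, sep: u8) -> &[u8] {
    line.as_bytes()
        .split(|b| *b == sep)
        .next()
        .unwrap_or(line.as_bytes())
}

fn main() {
    let sep = '¶';

    // The pilcrow occupies two bytes in UTF-8.
    let mut buf = [0u8; 4];
    let encoded = sep.encode_utf8(&mut buf).as_bytes();
    println!("UTF-8 bytes of separator: {:02X?}", encoded);

    // After the cast, only the second of those bytes is left.
    let byte = truncate_separator(sep);
    println!("separator after `as u8`: {:02X}", byte);

    // Splitting on that byte leaves the first UTF-8 byte of the pilcrow
    // dangling at the end of the field, so the field is no longer valid UTF-8.
    let field = first_field("1009880005252¶next", byte);
    println!("valid UTF-8: {}", std::str::from_utf8(field).is_ok());
    println!("lossy view:  {}", String::from_utf8_lossy(field));

    // A simple guard that a CLI could apply before passing the separator on.
    if !sep.is_ascii() {
        eprintln!("separator {:?} is not ASCII and cannot be used as a single byte", sep);
    }
}
```

The lossy view ends in the replacement character, the same shape as the `1009880005252�` value in the reported error. A check like `char::is_ascii` is one way to reject such separators before they reach the CSV reader.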
A more user-friendly message will be shown in the next version.
Let me know if I can help with anything else.

Best regards