cube2222/octosql

`extraneous or missing " in quoted-field` on CSV file

indigoviolet opened this issue · 5 comments

❯ octosql 'select * from ./free_company_dataset.csv limit 10'      
+------------------+---------+----------------------------------------+--------------------------+-------------------------------------------------------------+------------------+--------------------------+--------------+----------+-----------------------------------+
|     country      | founded |                   id                   |         industry         |                        linkedin_url                         |     locality     |           name           |    region    |   size   |              website              |
+------------------+---------+----------------------------------------+--------------------------+-------------------------------------------------------------+------------------+--------------------------+--------------+----------+-----------------------------------+
| <null>           | <null>  | 'arnold-busck'                         | 'retail'                 | 'linkedin.com/company/arnold-busck'                         | <null>           | 'arnold busck'           | <null>       | '51-200' | 'arnoldbusck.dk'                  |
| <null>           | <null>  | 'pegasus-rental-cars-auckland-airport' | 'automotive'             | 'linkedin.com/company/pegasus-rental-cars-auckland-airport' | <null>           | 'pegasus rental cars     | <null>       | '11-50'  | 'carrentalsaucklandairport.co.nz' |


❯ octosql 'select count(*) from ./free_company_dataset.csv'         
...
Error: couldn't run query: couldn't run source: couldn't run source: couldn't decode message: parse error on line 248, column 130: extraneous or missing " in quoted-field

❯ octosql -v                      
octosql version 0.9.2

Hey @indigoviolet!

This part of the error

parse error on line 248, column 130: extraneous or missing " in quoted-field

is the CSV decoder notifying you that your CSV file is invalid, specifically in line 248.

The first query works because the LIMIT 10 interrupts execution before the CSV decoder reaches that line.

You can find more details in this issue: golang/go#24422

Thanks, that makes sense. I would suggest improving the error messaging here as well -- in this case, it would help to actually dump out the offending snippet of data, and losing the usage message.

Note that this was an escaped quote, perhaps this should be handled?

"contacto \" corporación colina"

@indigoviolet it's unintuitive to me as well, but the spec-compliant escape character for csv double quotes is another double quote.

In other words, the line should read:

"contacto "" corporación colina"

Hmm, TIL. I noticed that some csv parsers allow you to specify the escape character, delimiter, quote character etc (eg. xsv input) - perhaps worth adding to the CSV options (https://github.com/cube2222/octosql#file-access). Unfortunately this is a dataset that someone else provided, so I had to figure out how to normalize it (which I eventually did with pandas, and also xsv), but it might be good for octosql to support these non-standard variants.

In any case, thanks!

That's definitely a good idea for the future!

Unfortunately, having tried to add precisely this, I discovered that the Go CSV encoding/decoding package doesn't support specifying custom escape characters, so this addition would be way too much work for way too little gain for the time being. I've seen that i.e. the Python CSV coding library is much more liberal about this.