d99kris/rapidcsv

Quotes in Text

jtaylorme opened this issue · 7 comments

If one has data such as below (note the leading quote in "Norway), doing a GetCell() on ADDRESS1 pulls the data for both ADDRESS1 and SALARY1. If the quote is anywhere else in the text it works as expected; the problem only occurs when it is in the first position. This is an issue whether you use AutoQuoting or not. The text is TAB delimited.

Thanks.

Jim

ID_NO NAME1 AGE1 ADDRESS1 SALARY1
2 Allen 25 |Texas| 15000.0
3 Teddy 23 "Norway 20000.0
4 Mark 25 Rich-Mond 65000.0
6 Paul 32 California 20000.0
7 Allen 25 Texas 15000.0

Hi - this type of CSV formatting is not supported today by rapidcsv.

I suppose we can look into
(a) making it configurable to disable the handling of double-quoted cells,
or
(b) maybe making the quoting character configurable (default ").

Is the | around the Texas cell considered "quotation marks" in this type of CSV data? If yes, maybe option (b) is more suitable.

Don't worry about it unless you think it is worth accounting for. Just wanted to bring it to your attention. The odd characters were simply stress testing for my application: I wanted to see what would happen with odd characters in the data, as I won't have control over the data used.

Also, another question. If I set LabelParams(0, -1) like this, should I be able to see column names using a row index? I keep getting this turned around in my head. I was thinking that if I was using column names rather than indexes, the row index wouldn't apply to the column header row.

Thanks again for such a great library.

Jim

Thanks for the quick reply! For the leading " character I may skip adding any special functionality for that for now, as it's very common for CSV parsers/readers to treat such data as a quoted cell (in rapidcsv's case until end double quote or end of file).
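To illustrate why a leading " swallows the following cells, here is a minimal sketch of a quote-aware splitter. It is not rapidcsv's actual parsing code; the helper name SplitQuoted is hypothetical and the logic is simplified (no escaped quotes, no \r handling). It only assumes the behavior described above: a double quote opens a quoted cell that runs until the closing quote or end of input.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of rapidcsv): split text into rows of cells,
// where a double quote toggles an "inside quoted cell" state. While inside
// quotes, separators and newlines are kept as part of the cell.
std::vector<std::vector<std::string>> SplitQuoted(const std::string& text, char sep)
{
  std::vector<std::vector<std::string>> rows;
  std::vector<std::string> row;
  std::string cell;
  bool inQuotes = false;

  for (char ch : text)
  {
    if (ch == '"')
    {
      inQuotes = !inQuotes; // opening or closing quote
      cell += ch;
    }
    else if ((ch == sep) && !inQuotes)
    {
      row.push_back(cell); // cell boundary
      cell.clear();
    }
    else if ((ch == '\n') && !inQuotes)
    {
      row.push_back(cell); // row boundary
      cell.clear();
      rows.push_back(row);
      row.clear();
    }
    else
    {
      cell += ch; // ordinary character, or anything inside quotes
    }
  }

  // Flush a trailing cell/row that was not terminated by a newline,
  // including a quoted cell that never saw its closing quote.
  if (!cell.empty() || !row.empty())
  {
    row.push_back(cell);
    rows.push_back(row);
  }

  return rows;
}
```

Calling SplitQuoted on the Teddy row followed by the Mark row (tab-separated) yields a single row of four cells, where the fourth cell starts with "Norway and contains everything up to the end of the input, because the quote is never closed.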

The way I reason about LabelParams(pColumnNameIdx, pRowNameIdx) is that it allows me to specify where in the raw (zero-index based) CSV data are the "column and row labels".

If the pColumnNameIdx is specified as 0 (as in your example), it means at row 0 there are labels to be treated as column names (this is a common CSV layout).

With pRowNameIdx specified as -1 (again, as in your example), it means that there are no row labels.

When using functions like GetCell() we are accessing the data excluding the configured labels. So for a file like https://github.com/d99kris/rapidcsv/blob/master/examples/colrowhdr.csv, which has both column and row labels (LabelParams(0, 0)), GetCell("Open", "2017-02-24") would be equivalent to calling GetCell(0, 0).

If you want to access the entire raw CSV file using GetCell, GetColumn, GetRow, you need to tell rapidcsv to not treat any data as labels, by setting LabelParams(-1, -1).
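As a rough mental model of the index mapping described above, here is a small self-contained sketch. It does not use rapidcsv itself; the names Grid, LabelModel, DataCell and SampleGrid are made up for illustration, and the offsets simply assume that labels, when present, sit at raw index 0 as in the examples in this thread.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using Grid = std::vector<std::vector<std::string>>;

// Hypothetical model of the label settings (not rapidcsv's own type).
struct LabelModel
{
  int columnNameIdx; // raw row holding the column names, -1 for none
  int rowNameIdx;    // raw column holding the row names, -1 for none
};

// Map a zero-based data index (labels excluded) to the raw grid position.
std::string DataCell(const Grid& raw, const LabelModel& labels,
                     std::size_t col, std::size_t row)
{
  const std::size_t rawRow = row + static_cast<std::size_t>(labels.columnNameIdx + 1);
  const std::size_t rawCol = col + static_cast<std::size_t>(labels.rowNameIdx + 1);
  return raw.at(rawRow).at(rawCol); // at() throws std::out_of_range on bad indices
}

// First rows and columns of the sample data from this issue.
Grid SampleGrid()
{
  return {
    { "ID_NO", "NAME1", "AGE1" },
    { "2", "Allen", "25" },
    { "3", "Teddy", "23" },
  };
}
```

In this model, DataCell(SampleGrid(), {0, -1}, 0, 0) gives "2" since raw row 0 is treated as column names, while DataCell(SampleGrid(), {-1, -1}, 0, 0) gives "ID_NO" because nothing is treated as a label and the header row is returned like any other row.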

Hope this clarifies.

Understood. I just thought if I told it to use column names then I wouldn't be able to use an index number in the row and see the column returned as a value. Another thing that isn't a big deal as long as I know what to expect. Just caught me off guard when I was returning rows and it returned the column headers as a row.

Thanks again.

Jim

Ok, yeah, so rapidcsv allows you to use a column index even after specifying that there is, for example, a column with row labels. But that column index will be zero-based and start after the column of row labels.

No problem. Expectations are the main problem, rather than how it works... was just expecting it to act a certain way :-)

jim

Totally understandable. I've been thinking of some graphical illustration to show how it works (and how labels map to indices), but I've not gotten around to putting something together yet.

I'll close this issue for now. Feel free to re-open for any follow-up questions.