davedelong/CHCSVParser

Fails to parse record with unescaped parenthesis

marvin-yorke opened this issue · 10 comments

Given the record

16681;6;Orehovyj boulevard, ul. Musy Dzhalilja (odd side);20;out;55.6141571054;37.7460757208;800;34;34;0;0;0;0;0;1

the library fails to parse the 3rd field with the following error:

Unexpected delimiter. Expected ';' (0x3B), but got '(' (0x28)

Is there any way to parse this data without altering it (e.g. adding quotes)?

Thanks for reporting this. I added a unit test to parse the exact text you provided, and it seems to have no problem with it. I tried parsing it as an in-memory string, and as a file written to disk (which is similar to what your code was doing). Both tests pass without modification to the parser, so I'm not sure what the issue is here.

Are the URLs you're parsing remote (coming in over a network connection) or local file URLs? Do you have an example of either that I could try?

Hi Dave,
I'm downloading an archive from the server, unpacking it into the Documents directory, and supplying a URL to the file in the Documents directory.
You can find the files I'm using in the following archive: http://metro4all.org/data/msk.zip
The file I've encountered the problem in is portals_ru.csv

Thanks @marvin-yorke. I incorporated the portals.csv file into the unit tests, but they're still passing on my machine. 😕

Hm, ok, I've cloned the repo and run the tests, and it works on my machine too. I should have mentioned that the original case was observed on iOS, not OS X. Could this make any difference? Also, I installed the library from CocoaPods, not from GitHub, although there's no major difference from the latest code. Anyway, I'll try again with my iOS app and let you know about the results.

I've checked the issue again, and here's the line that breaks the parsing:
17530;2;"Крокус Экспо" (павильон 1, 2);215;both;55.8235522598;37.3855503584;800;56;0;0;0;400;950;23;0
Turns out that it's not parentheses that cause the issue, but quotes. And now I'm not quite sure whether it's a parser problem or my data is malformed. What do you think?

Yes, that is a problem with the data. When the parser encounters a field that starts with ", it assumes the field ends with the corresponding closing ". And then since the next character after the closing " isn't a delimiter (;), it aborts with an error.
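
To make the rule above concrete, here is a minimal Python sketch of a splitter that follows the same logic: a field that starts with " is read up to its closing ", and anything other than the delimiter after that is an error. This is only an illustration, not CHCSVParser's actual code; the function name split_record and the error wording are made up for the example, and treating "" as an escaped quote is an assumption borrowed from RFC 4180.

```python
def split_record(line, delimiter=";"):
    """Split one record; a leading '"' starts a quoted field (hypothetical helper)."""
    fields = []
    i, n = 0, len(line)
    while True:
        if i < n and line[i] == '"':
            # Quoted field: read up to the closing quote.
            # Assumption: a doubled quote ("") inside stands for one literal quote.
            i += 1
            chars = []
            while True:
                if i >= n:
                    raise ValueError("Unterminated quoted field")
                if line[i] == '"':
                    if i + 1 < n and line[i + 1] == '"':
                        chars.append('"')
                        i += 2
                        continue
                    i += 1  # consume the closing quote
                    break
                chars.append(line[i])
                i += 1
            # After the closing quote only a delimiter (or end of line) is valid.
            if i < n and line[i] != delimiter:
                raise ValueError(
                    f"Expected {delimiter!r} after closing quote, got {line[i]!r}"
                )
            fields.append("".join(chars))
        else:
            # Unquoted field: everything up to the next delimiter, taken as-is.
            j = line.find(delimiter, i)
            if j == -1:
                j = n
            fields.append(line[i:j])
            i = j
        if i >= n:
            break
        i += 1  # skip the delimiter
    return fields
```

With this sketch, the "Orehovyj boulevard" record splits cleanly because its parentheses sit in an unquoted field, while the "Крокус Экспо" record raises as soon as the closing quote is followed by something other than ;.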

Then could you please help me figure out how to correct my data?

The solution seems pretty clear: don't start a field with quoted text; or, if a field starts with quoted text, wrap the whole field in quotes.
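
If the file can't be fixed at the source, one way to apply the second option is a preprocessing pass before handing the text to the parser. The Python sketch below is only an illustration under two loudly stated assumptions: no field in the data contains the ; delimiter (so a naive split is safe), and the parser reading the result accepts RFC 4180 style doubled quotes inside a quoted field. The names repair_line and is_well_formed_quoted are hypothetical.

```python
def is_well_formed_quoted(field):
    """True if the field is wrapped in quotes and every inner quote is doubled."""
    if len(field) < 2 or not (field.startswith('"') and field.endswith('"')):
        return False
    inner = field[1:-1]
    # After removing escaped pairs, no stray quote may remain.
    return '"' not in inner.replace('""', '')


def repair_line(line, delimiter=";"):
    """Wrap fields that start with a quote but are not properly quoted.

    Assumption: no field contains the delimiter, so a plain split is safe.
    """
    repaired = []
    for field in line.split(delimiter):
        if field.startswith('"') and not is_well_formed_quoted(field):
            # Double the existing quotes, then wrap the whole field in quotes.
            field = '"' + field.replace('"', '""') + '"'
        repaired.append(field)
    return delimiter.join(repaired)
```

For the failing record, the third field becomes """Крокус Экспо"" (павильон 1, 2)", which a quote-aware reader should decode back to the original text. Fields that don't start with a quote, or that are already properly quoted, pass through unchanged.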

Is there a property that can turn off such behavior? Or some workaround that doesn't require me to edit the file I am parsing?

Edit: added one, seems to work fine now :)