twosigma/ngrid

ngrid coughs on UTF-8 with BOM in Python 2.7

djma opened this issue · 11 comments

djma commented

$ ngrid ngrid_fails.csv
Traceback (most recent call last):
  File "/usr/local/bin/ngrid", line 9, in <module>
    load_entry_point('ngrid==0.0.0', 'console_scripts', 'ngrid')()
  File "build/bdist.macosx-10.10-x86_64/egg/ngrid/main.py", line 76, in main
  File "build/bdist.macosx-10.10-x86_64/egg/ngrid/grid.py", line 998, in show_model
  File "build/bdist.macosx-10.10-x86_64/egg/ngrid/grid.py", line 687, in show
  File "build/bdist.macosx-10.10-x86_64/egg/ngrid/grid.py", line 807, in __print
  File "build/bdist.macosx-10.10-x86_64/egg/ngrid/text.py", line 80, in palide
  File "build/bdist.macosx-10.10-x86_64/egg/ngrid/text.py", line 70, in elide
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

$ cat ngrid_fails.csv
"Physician_ID","Physician_Profile_ID","Physician_Profile_First_Name","Physician_Profile_Middle_Name","Physician_Profile_Last_Name","Physician_Profile_Suffix_Name","Physician_Profile_Address1","Physician_Profile_Address2","Physician_Profile_City","Physician_Profile_State","Physician_Profile_Zip_Code","Physician_Profile_Country","Physician_Profile_Province","Physician_Registration_Address1","Physician_Registration_Address2","Physician_Registration_City","Physician_Registration_State","Physician_Registration_Zip_Code","Physician_Registration_Country","Physician_Registration_Province","Physician_Specialty","Physician_Additional_Specialty1","Physician_Additional_Specialty2","Physician_Additional_Specialty3","Physician_Additional_Specialty4","Physician_Additional_Specialty5","Physician_States_on_Licenses1","Physician_States_on_Licenses2","Physician_States_on_Licenses3","Physician_States_on_Licenses4","Physician_States_on_Licenses5"
"302447","531101","SAUD","M",". FAROOQI",,"ON025 WINFIELD RD.",,"WINFIELD","IL","60190","United States",,,,,,,,,"Other Service Providers/ Specialist",,,,,,"IL",,,,
"93922","85212","ENENGE",,"A'BODJEDI",,"32 STRAWBERRY HILL CT","BENNETT BEHAVIORAL HEALTH CENTER","STAMFORD","CT","06902","United States",,,,,,,,,"Allopathic & Osteopathic Physicians/ Psychiatry & Neurology/ Psychiatry",,,,,,"CT",,,,
"331431","358185","THOMAS","MARSHALL","AABERG","JR.","2757 LEONARD ST NE","SUITE 200","GRAND RAPIDS","MI","49525","United States",,,,,,,,,"Allopathic & Osteopathic Physicians/ Ophthalmology",,,,,,"MI","CA",,,
"714","526333","AAZY","A","AABY",,"1955 NW NORTHRUP ST",,"PORTLAND","OR","97209","United States",,,,,,,,,"Allopathic & Osteopathic Physicians/ Ophthalmology",,,,,,"OR",,,,
"904","329191","ABDUL","AZIZ","AADAM",,"WASHINGTON UNIVERSITY SCHOOL OF MEDICINE","660 S. EUCLID AVE. CAMPUS BOX 8121","ST. LOUIS","MO","63110","United States",,,,,,,,,"Allopathic & Osteopathic Physicians/ Internal Medicine",,,,,,"MO","IL",,,
"168","404325","AARON","ARTHUR","AADLAND",,"1729 S CLIFF AVE",,"SIOUX FALLS","SD","57105","United States",,,,,,,,,"Dental Providers/ Dentist/ General Practice",,,,,,"SD",,,,
"162995","57417","JON","P","AAGAARD",,"2001 W WIESBROOK RD",,"WHEATON","IL","60189","United States",,,,,,,,,"Allopathic & Osteopathic Physicians/ Family Medicine",,,,,,"IL",,,,
"280763","419137","ROBERT","O","AAGARD",,"120 N 1220 E","STE 7","AMERICAN FORK","UT","84003","United States",,,,,,,,,"Allopathic & Osteopathic Physicians/ Obstetrics & Gynecology",,,,,,"UT",,,,
"107629","182277","GEETHA","N","AAKALU",,"11 WILBUR RD",,"THIELLS","NY","10984","United States",,,,,,,,,"Allopathic & Osteopathic Physicians/ Psychiatry & Neurology/ Psychiatry",,,,,,"NY",,,,

Hi David. Would you mind sending me the output of "printenv | grep LC"?

djma commented

It's empty

And I imagine "printenv LANG" also prints nothing? It's odd that you don't have a UTF-8 terminal set up. But I will fix it so that it can accommodate.

djma commented

$ printenv LANG
en_US.UTF-8

OK one more:

python -c 'import sys; print(sys.version); print(sys.stdout.encoding)'

djma commented

$ python -c 'import sys; print(sys.version); print(sys.stdout.encoding)'
2.7.8 (default, Oct 19 2014, 16:02:00)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.54)]
UTF-8

I suspect your file has some non-ASCII character in it that is getting lost when you cat it. Do you think you could email it to me directly?
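One way to check for such bytes without relying on what the terminal displays is to scan the raw file contents. The following is a minimal sketch, not part of ngrid; the helper name `find_non_ascii` and the sample bytes are made up for illustration.

```python
def find_non_ascii(data):
    """Return (offset, byte value) pairs for every byte >= 0x80 in `data`."""
    # bytearray yields integers in both Python 2 and Python 3.
    return [(i, b) for i, b in enumerate(bytearray(data)) if b >= 0x80]


# Hypothetical sample: a name containing one non-ASCII character (e-acute),
# which UTF-8 encodes as the two bytes 0xC3 0xA9.
sample = b'Ren\xc3\xa9e'
hits = find_non_ascii(sample)
for offset, value in hits:
    print("offset %d: 0x%02x" % (offset, value))
```

For a real file, the bytes could be read with `open(path, 'rb').read()` and passed to the same helper.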

djma commented

https://dl.dropboxusercontent.com/u/7618199/ngrid_fails.csv

Yeah, I think you're right.
$ xxd -p ngrid_fails.csv
efbbbf2250687973696369616e5f4944222c2250687973696369616e5f50
...

What's this efbbbf?!
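For reference, the three leading bytes in the dump (ef bb bf) can be compared against the UTF-8 byte order mark constant in Python's standard `codecs` module. This is only a quick check, and the variable names are illustrative.

```python
import binascii
import codecs

# The first three bytes from the xxd dump above.
leading = binascii.unhexlify("efbbbf")

# codecs.BOM_UTF8 is the UTF-8 encoding of U+FEFF (the byte order mark).
is_bom = leading == codecs.BOM_UTF8
decoded = leading.decode("utf-8")

print(is_bom)
print(repr(decoded))
```

Decoded as plain UTF-8, the bytes become the single code point U+FEFF, which most terminals render as nothing. That would explain why `cat` showed no visible difference.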

djma commented

@jylin fixed the file by removing the BOM bytes:
dd if=ngrid_fails.csv bs=1c skip=3 > fixed.csv

But it might make sense to support it. I'm not familiar with standards/best practices.
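The dd command above drops the first three bytes unconditionally, so it would corrupt a file that has no BOM. A slightly safer variant, sketched in Python under the assumption that the whole file fits in memory, removes the bytes only when they are actually a UTF-8 BOM. The helper name `strip_utf8_bom` is hypothetical.

```python
import codecs


def strip_utf8_bom(data):
    """Return `data` without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Illustrative inputs: the same header bytes with and without a BOM.
with_bom = codecs.BOM_UTF8 + b'"Physician_ID"'
without_bom = b'"Physician_ID"'

stripped_a = strip_utf8_bom(with_bom)
stripped_b = strip_utf8_bom(without_bom)
print(stripped_a == stripped_b)
```

To apply it to a file, read the input in binary mode (`'rb'`), pass the bytes through the helper, and write the result in binary mode (`'wb'`).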

Yeah, that file is UTF-8 encoded with BOM. It'll take a bit of work to make the implementation work correctly with UTF-8 inputs in both Python 2 and Python 3.

It's easy enough to strip the BOM, but we want it to work with arbitrary Unicode characters.
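As a rough illustration of handling both concerns at once (this is not ngrid's actual fix), Python's standard `utf-8-sig` codec decodes UTF-8 and drops a leading BOM if there is one. It behaves the same way in Python 2.7 and Python 3. The function name `decode_csv_bytes` and the sample data are assumptions made for this sketch.

```python
import codecs


def decode_csv_bytes(data):
    """Decode raw UTF-8 bytes to text, dropping a leading BOM if present."""
    # "utf-8-sig" accepts input both with and without a BOM.
    return data.decode("utf-8-sig")


# Hypothetical CSV content with a non-ASCII character (U+00E9).
body = u'"Name"\n"Ren\u00e9e"\n'.encode("utf-8")
with_bom = codecs.BOM_UTF8 + body
without_bom = body

text_a = decode_csv_bytes(with_bom)
text_b = decode_csv_bytes(without_bom)
print(text_a == text_b)
```

Working with decoded text internally, and encoding only when writing to the terminal, would also avoid the implicit ASCII decode that triggers a `UnicodeDecodeError` in Python 2.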