FYI on OAI non-image data
kuhlaid opened this issue · 1 comments
kuhlaid commented
Hi @epierson9, this is not an issue with your code, but I downloaded the latest OAICompleteData_ASCII from OAI and several of the MIF text files contain non-ASCII characters, which caused errors when running the code. Specifically, non_image_data_processing.py throws the following error when it tries to open those files:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 15: invalid start byte
I found the non-ASCII characters by running:
# change to the non-image data directory (this assumes it contains only the .txt files)
cd <your non-images data directory>
# list every file with its count of lines that contain non-ASCII bytes (files reporting 0 are clean)
LC_ALL=C find . -type f -exec grep -c -P "[^\x00-\x7F]" {} +
# show where the non-ASCII bytes occur in one file, with line numbers (NOTE: copy the results to an empty text file or somewhere to reference)
LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" somefile.txt
I thought I would share this since this will likely come up again for others unless OAI removes the non-ASCII characters from those files.
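For anyone who would rather run the same check from Python, here is a minimal sketch. The function names (`find_non_ascii`, `scan_directory`, `read_text_lenient`) are made up for illustration and are not part of the repository, and the `*.txt` pattern assumes the directory layout described above. It reads each file as raw bytes, so it never hits the UnicodeDecodeError, and records where every byte above 0x7F sits.

```python
from pathlib import Path


def find_non_ascii(path):
    """Return (line_number, byte_offset_in_line, byte_value) for each non-ASCII byte."""
    hits = []
    # Binary mode avoids decoding entirely, so no UnicodeDecodeError can occur.
    with open(path, "rb") as f:
        for line_no, line in enumerate(f, start=1):
            for offset, byte in enumerate(line):
                if byte > 0x7F:
                    hits.append((line_no, offset, byte))
    return hits


def scan_directory(directory, pattern="*.txt"):
    """Map file name -> list of non-ASCII hits, skipping files that are clean."""
    report = {}
    for path in sorted(Path(directory).glob(pattern)):
        hits = find_non_ascii(path)
        if hits:
            report[path.name] = hits
    return report


def read_text_lenient(path, encoding="utf-8"):
    """Read a file as text, replacing undecodable bytes with U+FFFD."""
    # Assumption: losing the offending characters is acceptable for your analysis.
    with open(path, encoding=encoding, errors="replace") as f:
        return f.read()
```

Calling `scan_directory("<your non-images data directory>")` gives a per-file list of offending positions. `read_text_lenient` is only a possible workaround: the true encoding of the affected files is not known here, so replacing the bytes may hide characters that matter.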
epierson9 commented
Thanks, good to know!