FYI on OAI non-image data

Question

kuhlaid opened this issue 3 years ago · 1 comment

Hi @epierson9, this is not an issue with your code, but I downloaded the latest OAICompleteData_ASCII from OAI, and several of the MIF text files contain non-ASCII characters that cause errors when running the code. Specifically, the non_image_data_processing.py script throws UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 15: invalid start byte when it tries to open those files. I found the non-ASCII characters by running:

cd <your non-images data directory>    # change to the non-image data directory
LC_ALL=C find . -type f -exec grep -c -P "[^\x00-\x7F]" {} +    # for each file, print the number of lines that contain non-ASCII bytes (this assumes you only have the .txt files in this directory)
LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" somefile.txt    # show the lines, with line numbers, where non-ASCII bytes occur in one file (NOTE: copy the results to an empty text file or somewhere to reference)
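For anyone without GNU grep's -P option (for example on macOS), here is a rough Python equivalent of the two commands above. It is only a sketch: the function names find_non_ascii and scan_directory are my own and not part of the repository, and it assumes the data files match *.txt.

```python
from pathlib import Path


def find_non_ascii(path):
    """Return (line_number, byte_offset_in_line, byte_value) for each non-ASCII byte."""
    hits = []
    # Read raw bytes so nothing has to be decoded (and nothing can raise
    # UnicodeDecodeError).
    with open(path, "rb") as handle:
        for line_number, line in enumerate(handle, start=1):
            for offset, value in enumerate(line):
                if value > 0x7F:  # anything outside the 7-bit ASCII range
                    hits.append((line_number, offset, value))
    return hits


def scan_directory(directory, pattern="*.txt"):
    """Map each matching file name to its non-ASCII hits, skipping clean files."""
    report = {}
    for path in sorted(Path(directory).glob(pattern)):
        hits = find_non_ascii(path)
        if hits:
            report[path.name] = hits
    return report
```

Calling scan_directory on your non-image data directory returns a dict keyed by file name, so files with no non-ASCII bytes simply do not appear. Line numbers are 1-based and byte offsets are 0-based within the line.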

I thought I would share this since this will likely come up again for others unless OAI removes the non-ASCII characters from those files.
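Until the files are fixed upstream, one possible workaround is to write cleaned copies with the non-ASCII bytes dropped and point the script at those. This is a minimal sketch, not something from the repository: strip_non_ascii is a hypothetical helper, and it assumes the offending bytes are stray characters that are safe to discard, which you should confirm by inspecting them first.

```python
def strip_non_ascii(source_path, cleaned_path):
    """Copy a file, dropping every byte above 0x7F; return how many were dropped."""
    with open(source_path, "rb") as source:
        raw = source.read()
    # Keep only 7-bit ASCII bytes. This discards data, so write to a new
    # file and leave the original download untouched.
    cleaned = bytes(value for value in raw if value <= 0x7F)
    with open(cleaned_path, "wb") as target:
        target.write(cleaned)
    return len(raw) - len(cleaned)
```

The return value gives a quick sanity check: if it is much larger than the number of non-ASCII bytes you expected, the file is probably in a different encoding rather than containing a few stray bytes, and stripping is the wrong fix.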

Answer 1 · 2021-10-03T13:42:04.000Z

Thanks, good to know!