callahantiff/PheKnowLator

Handling Unicode Encoding Errors in Ontology Metadata

Closed this issue · 3 comments

Problem: Unicode errors occur when writing knowledge graph metadata out locally, depending on the OS and Python version used.

Script: metadata.py

Current Solution: encode/decode ontology term labels, definitions, and synonyms and explicitly ignore UnicodeEncodeError.

Proposed Solution: Add functionality to handle UnicodeEncodeError properly instead of ignoring it.
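
For context, the current workaround can be sketched roughly as below. This is only an illustration, not the code in metadata.py: the helper name `strip_unencodable` and the choice of ASCII as the target encoding are assumptions. It shows why ignoring the error is lossy, since any character that cannot be encoded is silently dropped.

```python
def strip_unencodable(text, encoding="ascii"):
    """Hypothetical helper: drop characters that `encoding` cannot represent."""
    # errors="ignore" suppresses UnicodeEncodeError by discarding the
    # offending characters instead of raising.
    return text.encode(encoding, errors="ignore").decode(encoding)


label = "na\u00efve T cell"      # contains a non-ASCII character
print(strip_unencodable(label))  # the non-ASCII character is lost
```

Because the label comes back altered with no warning, the metadata file can end up with subtly wrong term names, which is the motivation for the proposed change.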

I've generally had success with this (ugly) method of making UTF-8 the default encoding:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

It's worth noting this is generally discouraged for reasons well explained here.

Thanks for the suggestion! I think this really only applies to Python 2, but it's good to know about!

I believe I have a solid solution now (testing at scale as we speak) and will post it here once the test finishes.

OK, I have the solution, which will work for all unicode characters, including characters in foreign languages. The changes I made are described below for each changed script.


Dockerfile

  • Add the following line to ensure that the Python environment within the Docker container has the correct encoding
ENV PYTHONIOENCODING=utf-8

pkt_kg/metadata.py

  • Modify the output_knowledge_graph_metadata() method to:
    • Force file writing to use utf-8 encoding
    • Add error handling that properly encodes and decodes variables that trigger a UnicodeEncodeError
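
A minimal sketch of the two changes listed above, assuming Python 3. It is not the actual body of output_knowledge_graph_metadata(); the helper names `to_utf8_safe` and `write_metadata`, the tab-separated layout, and the sample rows are all hypothetical and used only for illustration.

```python
import os
import tempfile


def to_utf8_safe(text):
    """Hypothetical helper: return `text` in a form that UTF-8 can encode."""
    try:
        text.encode("utf-8")
        return text
    except UnicodeEncodeError:
        # Only reached for strings UTF-8 cannot encode (e.g. lone surrogates);
        # the offending characters are replaced with "?".
        return text.encode("utf-8", errors="replace").decode("utf-8")


def write_metadata(path, rows):
    """Hypothetical writer: one tab-separated line per row, always UTF-8."""
    # Passing encoding explicitly avoids depending on the OS locale default.
    with open(path, "w", encoding="utf-8") as out:
        for row in rows:
            out.write("\t".join(to_utf8_safe(field) for field in row) + "\n")


# Usage: labels with accented and CJK characters are written unchanged.
rows = [("term_1", "caf\u00e9"), ("term_2", "\u7d30\u80de")]
out_path = os.path.join(tempfile.mkdtemp(), "metadata.txt")
write_metadata(out_path, rows)
```

Opening the file with an explicit encoding is what removes the OS and Python version dependence; the fallback branch should rarely fire once the file itself is UTF-8.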

Will close this issue now; feel free to re-open if need be.