Handling Unicode Encoding Errors in Ontology Metadata

Question

Closed this issue 4 years ago · 3 comments

Problem: Unicode errors occur when writing knowledge graph metadata to a local file, depending on the OS and Python version used.

Script: metadata.py

Current Solution: encode/decode ontology term labels, definitions, and synonyms and explicitly ignore UnicodeEncodeError.
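To illustrate why ignoring the error is only a stopgap, here is a minimal sketch of what an encode/decode round trip with errors="ignore" does to a label. The label is a made-up example and the ASCII codec is an assumption; the actual code in metadata.py is not shown in this thread.

```python
# Hypothetical ontology label containing non-ASCII characters.
label = "Sjögren syndrome (β-cell)"

# With errors="ignore", characters the codec cannot represent are silently dropped.
lossy = label.encode("ascii", errors="ignore").decode("ascii")

# A UTF-8 round trip keeps every character.
lossless = label.encode("utf-8").decode("utf-8")

print(lossy)              # ASCII-only text with the "ö" and "β" removed
print(lossless == label)
```

The lossy version no longer raises, but the written metadata no longer matches the ontology either, which is the motivation for handling the error properly.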

Proposed Solution: Add functionality that handles UnicodeEncodeError properly instead of ignoring it.

Answer 1 · 2020-05-14T00:38:01.000Z

I've generally had success with this (ugly) method, which makes UTF-8 the default encoding:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

It's worth noting this is generally discouraged for reasons well explained here.

Answer 2 · 2020-05-14T17:39:53.000Z

Thanks for the suggestion! I think this really only applies to Python 2, but it's good to know about!
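For reference, a small sketch of how this looks on Python 3, where the hook above is gone. It only inspects the interpreter; which locale encoding gets printed depends on the machine.

```python
import locale
import sys

# On Python 3 the default for str.encode() / bytes.decode() is always UTF-8,
# and the Python 2 hook for changing it no longer exists.
print(sys.getdefaultencoding())
print(hasattr(sys, "setdefaultencoding"))

# What still varies by OS is the locale encoding, which open() typically
# falls back to in text mode when no encoding argument is passed.
print(locale.getpreferredencoding(False))
```

If the last line prints something other than a UTF-8 variant, writing non-ASCII text with a bare open(path, "w") can fail on that machine.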

I believe I have a solid solution now (testing at scale as we speak) and will post it here once the test finishes.

Answer 3 · 2020-05-14T22:45:27.000Z

OK, I have a solution that should work for all Unicode characters, including those in non-English text. The changes I made are described below for each changed file.

Dockerfile

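Add the following line so that the Python environment inside the Docker container uses the correct I/O encoding. Use ENV rather than RUN export, because a variable exported in a RUN step is lost when that build step ends:

ENV PYTHONIOENCODING=utf-8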
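One way to check that the variable has the intended effect is to start a child interpreter with it set and ask for its stdout encoding. This is a sketch for verification only, not part of the project. Note that PYTHONIOENCODING covers stdin, stdout, and stderr; it does not change the encoding used by open().

```python
import os
import subprocess
import sys

# Copy the current environment and add the variable from the Dockerfile.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

# Ask a fresh interpreter which encoding its stdout uses.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
child_encoding = result.stdout.strip()
print(child_encoding)
```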

pkt_kg/metadata.py

Modified the output_knowledge_graph_metadata() method to:
- Force file writing to use UTF-8 encoding
- Add error handling that properly encodes and decodes variables that trigger a UnicodeEncodeError
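The two changes listed above could look roughly like the sketch below. The function name write_metadata, the tab-separated layout, and the sample rows are all hypothetical; the real output_knowledge_graph_metadata() is not reproduced in this thread. The errors="backslashreplace" handler is one possible choice for the rare text UTF-8 cannot encode (lone surrogates), so that the write does not abort.

```python
import os
import tempfile


def write_metadata(path, rows):
    """Write tab-separated rows as UTF-8, independent of the OS locale."""
    # An explicit encoding means the result no longer depends on the platform default.
    with open(path, "w", encoding="utf-8", errors="backslashreplace") as out:
        for row in rows:
            out.write("\t".join(row) + "\n")


# Made-up identifiers and labels, for illustration only.
rows = [("EX_0000001", "Sjögren syndrome"), ("EX_0000002", "β-catenin binding")]
path = os.path.join(tempfile.mkdtemp(), "metadata.txt")
write_metadata(path, rows)

# Read the file back with the same encoding to confirm nothing was lost.
with open(path, encoding="utf-8") as f:
    round_trip = [tuple(line.rstrip("\n").split("\t")) for line in f]
print(round_trip == rows)
```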

I will close this issue now; feel free to re-open it if need be.