handle non-unicode characters
georgid opened this issue · 8 comments
When decoding characters, make sure their meaning is not lost. Right now they are ignored as a workaround here, so they are not shown in the .lab output.
E.g. try different encodings, or try to guess the encoding: http://unicodebook.readthedocs.io/guess_encoding.html
Make sure they are encoded properly here; maybe use .encode('utf-8').strip() instead of str().
def isUTF8(data):
    try:
        data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        return True
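The "try different encodings" idea can be combined with the check above. The following is only a sketch: the function name decode_with_fallback and the candidate list are assumptions, not code from this repo. It expects bytes as input, and latin-1 has to come last because it accepts any byte sequence and therefore never raises.

```python
def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate encoding that decodes cleanly.

    Hypothetical helper: the name and the default candidates are assumptions.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # No candidate worked: keep going, but mark undecodable bytes with U+FFFD
    # instead of silently dropping them.
    return data.decode("utf-8", errors="replace"), None


text, used = decode_with_fallback("no m\u00e1s".encode("latin-1"))
```

Returning the encoding that was actually used makes it possible to log which files are not UTF-8.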
This code might be useful, too:
# s = list(words_ortho)
#@@@ combine two-char diacritics:
# TODO: not optimal, has to loop over the word for each diacritic type
# # turkish diaeresis
# s = combineDiacriticsChars(s, u'\u0308')
#
# # telugu macron
# s = combineDiacriticsChars(s, u'\u0304')
#
# # telugu acute
# s = combineDiacriticsChars(s, u'\u0301')
#
# # telugu dot below
# s = combineDiacriticsChars(s, u'\u0323')
#
# # telugu dot above
# s = combineDiacriticsChars(s, u'\u0307')
This is true for any issues with acute and similar accents, as in Spanish and French.
Convert letters with such accents to the same letter without the accent.
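One way to do this conversion without listing every accented letter by hand is Unicode decomposition. This is a minimal sketch using the standard unicodedata module; the name strip_accents is an assumption. It only covers letters that decompose into a base letter plus combining marks, so letters such as ø or ł would pass through unchanged.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


plain = strip_accents("no m\u00e1s")
```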
The error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
is solved by removing
/usr/local/lib/python2.7/site-packages/mir_eval-0.4-py2.7.egg-info
and leaving only
/usr/local/lib/python2.7/site-packages/mir_eval-0.3-py2.7.egg-info
Version 0.3 is installed correctly (it has an installed-files.txt file), unlike version 0.4.
/Users/joro/Documents/VOICE_magix/smule/dataset/692653830_3071180/timed_lyrics.txt has á in the phrase ‘no más’
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 5: ordinal not in range(128)
Decode error solved in commit by assuming the encoding is latin-1 and manually replacing all characters with accents, macrons, etc. by their respective characters without them, e.g. á is replaced by a. Since there is no Spanish dictionary, this results in e.g. más becoming mas, which is then replaced by the closest English word, mask.
Read this for a full understanding of Unicode in Python 3.
TODO: represent all diacritics by their sign, so that we do not need to handle all cases manually, as in the code given on 5th January above.
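A possible generic take on this TODO, as a sketch only: instead of one pass per diacritic (U+0308, U+0304, U+0301, U+0323, U+0307), attach any combining mark to the character before it. The name group_combining_marks is hypothetical and this is not the combineDiacriticsChars from the commented code, whose implementation is not shown in this thread.

```python
import unicodedata


def group_combining_marks(text):
    """Split text into units, keeping each combining mark with its base character."""
    units = []
    for ch in text:
        if units and unicodedata.combining(ch):
            # Combining mark: append it to the preceding unit.
            units[-1] += ch
        else:
            # Base character (or a stray mark at the very start): new unit.
            units.append(ch)
    return units


units = group_combining_marks("a\u0304b")
```

This loops over the word once, whatever the number of diacritic types.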