handle non-unicode characters

Question

georgid opened this issue 7 years ago · 8 comments

When decoding characters, make sure their meaning is not lost. For now they are ignored as a workaround here, so they are not shown in the .lab output.
E.g. try different encodings, or try to guess the encoding: http://unicodebook.readthedocs.io/guess_encoding.html
Make sure they are encoded properly here.

Maybe use .encode('utf-8').strip() instead of str().
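
A minimal sketch of the "try different encodings" idea, assuming the input arrives as raw bytes. The function name decode_with_fallback and the candidate list are hypothetical and not taken from the project code.

```python
def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Try each candidate encoding in order and return (text, encoding).

    Assumption: `data` is a bytes object. latin-1 maps every byte to a
    character, so it never fails and should stay last in the list.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Only reached with a custom list in which every codec failed:
    # keep the readable part and mark the undecodable bytes.
    return data.decode("utf-8", errors="replace"), None
```

Note that a successful latin-1 decode does not prove the file really is latin-1. It only guarantees that no bytes are silently dropped.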

Answer 1 · 2017-12-25T11:32:20.000Z

def isUTF8(data):
    # Return True if the byte string `data` decodes cleanly as UTF-8.
    try:
        data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        return True

Answer 2 · 2018-01-05T16:30:07.000Z

This code might be useful, too:

#     s = list(words_ortho)
    
        #@@@ combine two-char diacritics: 
        # TODO: not optimal; it has to loop over the word for each diacritic type
        
#         # turkish diaeresis
#         s = combineDiacriticsChars(s, u'\u0308')
#          
#         # telugu macron
#         s = combineDiacriticsChars(s, u'\u0304')
#          
#         # telugu acute
#         s = combineDiacriticsChars(s, u'\u0301') 
#          
#         # telugu dot below
#         s = combineDiacriticsChars(s, u'\u0323')                      
#          
#         # telugu dot above
#         s = combineDiacriticsChars(s, u'\u0307')
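
The body of combineDiacriticsChars is not shown in this thread, so the following is only a sketch of a possible alternative. It uses the standard library's NFC normalization, which merges a base letter and a following combining mark into one precomposed character in a single pass whenever Unicode defines one. The name combine_diacritics and the sample word are hypothetical.

```python
import unicodedata


def combine_diacritics(text):
    """Compose base letter + combining mark pairs into single characters.

    NFC handles all mark types at once (diaeresis, macron, acute, dot
    below, dot above, ...). Pairs without a precomposed form are left
    as two code points.
    """
    return unicodedata.normalize("NFC", text)


decomposed = u"gu\u0308l"  # 'u' followed by a combining diaeresis
composed = combine_diacritics(decomposed)
```

This would avoid looping over the word once per diacritic type, at the cost of leaving uncomposable pairs untouched.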

Answer 3 · 2018-01-22T16:50:11.000Z

The same applies to acute and other accents, as in Spanish and French.
Convert letters with such accents to the same letter without the accent.
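
One way to do this without listing every accented letter by hand is to decompose the text (NFD) and drop the combining marks. This is a sketch with a hypothetical helper name, strip_accents, and not the code used in the project.

```python
import unicodedata


def strip_accents(text):
    """Replace accented letters with their unaccented base letters."""
    # NFD splits e.g. 'á' into 'a' plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks have a non-zero combining class; drop them.
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Letters that have no decomposition (for example ø or ß) pass through unchanged, so they would still need separate handling.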

Answer 4 · 2018-02-07T10:56:28.000Z

The error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
is solved by removing
/usr/local/lib/python2.7/site-packages/mir_eval-0.4-py2.7.egg-info
and leaving only
/usr/local/lib/python2.7/site-packages/mir_eval-0.3-py2.7.egg-info.
Version 0.3 is installed correctly (it has an installed-files.txt file), unlike version 0.4.

Answer 5 · 2018-03-08T16:08:14.000Z

/Users/joro/Documents/VOICE_magix/smule/dataset/692653830_3071180/timed_lyrics.txt has an á in the phrase ‘no más’

Answer 6 · 2018-03-08T16:08:38.000Z

https://community.esri.com/thread/149400

Answer 7 · 2018-05-26T17:11:21.000Z

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 5: ordinal not in range(128)
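
This kind of error typically appears when text containing é is converted with the ascii codec, which in Python 2 happens implicitly in str(). A small sketch of encoding explicitly instead, in line with the .encode('utf-8') suggestion in the question. The helper name and the sample word are hypothetical.

```python
def encode_for_output(text, encoding="utf-8"):
    """Encode text explicitly instead of relying on the default codec."""
    return text.encode(encoding)


word = u"caf\xe9s"  # hypothetical sample containing 'é'

# The ascii codec cannot represent 'é', which is what triggers the error.
try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

utf8_bytes = encode_for_output(word)
```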

Answer 8 · 2018-05-29T14:54:43.000Z

Decode error solved in commit by assuming latin-1 encoding and manually replacing all accented characters (acute, macron, etc.) with their respective characters without the diacritic, e.g. á is replaced by a. Since there is no Spanish dictionary, this results in e.g. más becoming mas, which is then replaced by the closest English word, mask.

Read this for a full understanding of Unicode in Python 3.

TODO: represent all diacritics by their sign, so that we do not need to handle all cases manually,
as in the code given on 5th January above.