georgid/AlignmentDuration

handle non-unicode characters

georgid opened this issue · 8 comments

  1. when decoding characters, make sure their meaning is not lost (currently they are ignored as a workaround here, but then they do not show up in the .lab output)
    e.g. try different encodings or try to guess the encoding: http://unicodebook.readthedocs.io/guess_encoding.html

  2. make sure they are encoded properly here
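
A minimal sketch of the "try different encodings" idea from point 1. The helper name `decode_with_fallback` and the candidate list are assumptions for illustration, not code from this repo. latin-1 has to come last because it accepts any byte sequence and therefore never raises.

```python
def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes `data`.

    `data` is a byte string. The candidate order matters: strict codecs
    such as UTF-8 go first, permissive ones such as latin-1 go last.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")
```

Valid UTF-8 input is returned as UTF-8; bytes that are not valid UTF-8 (e.g. a lone 0xe1) fall through to latin-1. This only guesses, so a wrong guess still yields wrong characters without any error.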

maybe use .encode('utf-8').strip() instead of str()
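
Background on why this might help: in Python 2, calling str() on a unicode object encodes it implicitly with the ascii codec, which raises UnicodeEncodeError for any accented character. A possible helper, where the name `to_utf8` is hypothetical and the input is assumed to be either text or a byte string:

```python
def to_utf8(value):
    """Encode text explicitly as UTF-8 bytes instead of relying on str()."""
    if isinstance(value, bytes):
        # Already a byte string (this is `str` in Python 2): only strip it.
        return value.strip()
    # Text (`unicode` in Python 2, `str` in Python 3): encode explicitly.
    return value.encode("utf-8").strip()
```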

def isUTF8(data):
    """Return True if the byte string `data` is valid UTF-8."""
    try:
        data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        return True

This code might be useful, too:

# s = list(words_ortho)
#
# # combine two-char diacritics:
# # TODO: not optimal, has to loop over the word once for each diacritic type
#
# # turkish diaeresis
# s = combineDiacriticsChars(s, u'\u0308')
#
# # telugu macron
# s = combineDiacriticsChars(s, u'\u0304')
#
# # telugu acute
# s = combineDiacriticsChars(s, u'\u0301')
#
# # telugu dot below
# s = combineDiacriticsChars(s, u'\u0323')
#
# # telugu dot above
# s = combineDiacriticsChars(s, u'\u0307')
 

The same applies to any issues with acute and other accents, as in Spanish and French.
Convert letters with such accents to the same letter without the accent.
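
One way this could be done without listing every accented letter by hand, sketched with the standard `unicodedata` module. The function name `strip_accents` is an assumption. It expects already-decoded text (unicode), not raw bytes.

```python
import unicodedata


def strip_accents(text):
    """Replace accented letters by their base letter, e.g. a-acute -> a."""
    # NFD splits each precomposed letter into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (non-zero canonical combining class).
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Letters that have no canonical decomposition (for example ø or ł) pass through unchanged, so those would still need a manual mapping.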

The error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
is solved by removing
/usr/local/lib/python2.7/site-packages/mir_eval-0.4-py2.7.egg-info
and leaving only
/usr/local/lib/python2.7/site-packages/mir_eval-0.3-py2.7.egg-info.
Version 0.3 is installed correctly (it has an installed-files.txt file), unlike version 0.4.

/Users/joro/Documents/VOICE_magix/smule/dataset/692653830_3071180/timed_lyrics.txt has á in the phrase ‘no más’

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 5: ordinal not in range(128)

Decode error solved in a commit by assuming latin-1 encoding and manually replacing all accented characters (acute, macron, etc.) with their respective unaccented characters, e.g. á is replaced by a. Since there is no Spanish dictionary, this results in e.g. más becoming mas, which is then replaced by the closest English word, mask.

Read this for a full understanding of Unicode in Python 3.

TODO: represent all diacritics by their sign, so that we do not need to handle all cases manually,
as in the code given on 5th January above.
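
A possible generic version, sketched under the assumption that the goal is to attach every combining sign to its preceding base character in a single pass, instead of calling a helper once per diacritic type. The name `group_combining_marks` is hypothetical and this is not the repo's combineDiacriticsChars.

```python
import unicodedata


def group_combining_marks(text):
    """Split text into a list where each combining mark stays with its base char."""
    clusters = []
    for ch in text:
        # A non-zero canonical combining class marks a combining diacritic
        # such as u'\u0308' (diaeresis) or u'\u0323' (dot below).
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters
```

For example, u'a\u0308b' would give a two-element list instead of three separate characters. Marks whose combining class is 0 are not attached, so this is not full grapheme cluster segmentation.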