UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)
fabrahman opened this issue · 2 comments
fabrahman commented
Hi,
I tried using the pretrained model to annotate a corpus. I first tried a small example where the sentences.txt file has only 5 sentences, and it worked well.
Then I switched to my own dataset, which is a lot bigger, and I get this error in the first step, when running target ID prediction:
Any suggestion?
_____________________
COMMAND: /home/hannah/open-sesame/sesame/targetid.py --mode predict --model_name fn1.7-pretrained-targetid --raw_input stories.dev
MODEL FOR TEST / PREDICTION: logs/fn1.7-pretrained-targetid/best-targetid-1.7-model
PARSING MODE: predict
_____________________
Reading data/neural/fn1.7/fn1.7.fulltext.train.syntaxnet.conll ...
# examples in data/neural/fn1.7/fn1.7.fulltext.train.syntaxnet.conll : 19391 in 3413 sents
# examples with missing arguments : 526
Combined 19391 instances in data into 3413 instances.
Reading the lexical unit index file: data/fndata-1.7/luIndex.xml
# unique targets = 9421
# total targets = 13572
# targets with multiple LUs = 4151
# max LUs per target = 5
Reading pretrained embeddings from data/glove.6B.100d.txt ...
Traceback (most recent call last):
File "/home/hannahbrahman/anaconda3/envs/py27/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/home/hannahbrahman/anaconda3/envs/py27/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/hannah/open-sesame/sesame/targetid.py", line 87, in <module>
instances = [make_data_instance(line, i) for i,line in enumerate(fin)]
File "sesame/raw_data.py", line 18, in make_data_instance
for i in range(len(tokenized))]
File "/home/hannahbrahman/anaconda3/envs/py27/lib/python2.7/site-packages/nltk/stem/wordnet.py", line 41, in lemmatize
lemmas = wordnet._morphy(word, pos)
File "/home/hannahbrahman/anaconda3/envs/py27/lib/python2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1909, in _morphy
forms = apply_rules([form])
File "/home/hannahbrahman/anaconda3/envs/py27/lib/python2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1889, in apply_rules
if form.endswith(old)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)
fabrahman commented
So I realized the problem only exists under Python 2, in the NLTK lemmatizer.
I resolved it by adding:
import io
and changing line 86 in sesame/targetid.py to:
with io.open(options.raw_input, "r", encoding='utf8') as fin:
I'm closing the issue.
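For reference, a minimal sketch of the failure mode and the fix (the byte string below is hypothetical, not from the original corpus): Python 2's plain open() yields raw byte strings, and when NLTK's lemmatizer mixes them with unicode it implicitly decodes with the default 'ascii' codec, which fails on any non-ASCII byte such as 0xc2. Decoding explicitly as UTF-8, which is what io.open(..., encoding='utf8') does at read time, avoids the implicit decode entirely.

```python
# Hypothetical UTF-8 input line containing a non-ASCII byte (0xc2).
raw = b"na\xc2\xafve"

# What the implicit decode inside NLTK effectively attempts:
try:
    raw.decode("ascii")
except UnicodeDecodeError as e:
    print("ascii decode fails:", e)

# What io.open(..., encoding='utf8') does instead: decode once, up front.
text = raw.decode("utf-8")
print(text)
```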
edivadiranatnom commented
Hi there,
I fixed line 86 in targetid.py, but now I'm getting another encoding error, this time on line 21 of raw_data.py:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 8: ordinal not in range(128)
File "/home/davide/open-sesame/sesame/targetid.py", line 88, in <module>
instances = [make_data_instance(line, i) for i,line in enumerate(fin)]
File "sesame/raw_data.py", line 21, in make_data_instance
i+1, tokenized[i], lemmatized[i], pos_tagged[i], index) for i in range(len(tokenized))]
Any idea?
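This looks like the mirror image of the original problem (a sketch with a hypothetical token, since the contents of raw_data.py line 21 aren't shown here): once the input is decoded as UTF-8, the lines are unicode strings, and under Python 2 any code path that pushes them back through a byte-oriented interface re-encodes with the default 'ascii' codec, which fails on characters like u'\xae' (the registered-trademark sign). The usual remedies are to encode explicitly as UTF-8, or, if downstream tools only accept ASCII, to replace the offending characters losslessly up front.

```python
# Hypothetical token containing the registered-trademark sign (U+00AE).
word = "Kellogg's\xae"

# What the implicit re-encode under Python 2 effectively attempts:
try:
    word.encode("ascii")
except UnicodeEncodeError as e:
    print("ascii encode fails:", e)

# Lossless fix: encode explicitly as UTF-8 before byte-level code sees it.
print(word.encode("utf-8"))

# Lossy fallback: substitute '?' for anything outside ASCII.
print(word.encode("ascii", errors="replace").decode())
```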