ValueError: max() arg is an empty sequence
victoriastuart opened this issue · 2 comments
Two issues:
- Others (e.g. issues #20, #41) asked what a 'tokenized sentence' is; that puzzled me too.
  Answer: any plain sentence counts as 'tokenized' (the script simply splits each line on whitespace), e.g. Victoria was born in 1961 in Halifax, Nova Scotia, Canada.
- If your input file contains blank lines, e.g.

      Victoria was born in 1961 in Halifax, Nova Scotia, Canada.

      Victoria used to work at NIEHS in North Carolina.

  then tagger.py | utils.py throws an error:
...
max_length = max([len(word) for word in words])
ValueError: max() arg is an empty sequence
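To see why a blank line triggers this, here is a minimal standalone sketch in Python 3. It only mimics the two steps visible in the traceback and in the loop below; `longest_word_length` and `raises_value_error` are hypothetical names for illustration, not functions from the repository.

```python
def longest_word_length(line):
    # Same two steps as in the issue: split the line, then take the max.
    words = line.rstrip().split()
    return max([len(word) for word in words])


def raises_value_error(line):
    # Report whether the computation above fails for this line.
    try:
        longest_word_length(line)
    except ValueError:
        return True
    return False


# A normal sentence works; a blank line yields an empty word list,
# and max() of an empty list raises ValueError.
print(raises_value_error("Victoria was born in 1961 in Halifax, Nova Scotia, Canada.\n"))
print(raises_value_error("\n"))
```

The first call prints `False` and the second prints `True`: a line holding only a newline splits into an empty list, and `max()` has nothing to compare.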
You can fix that simply by changing the following lines in tagger.py:
Original:

    print 'Tagging...'
    with codecs.open(opts.input, 'r', 'utf-8') as f_input:
        count = 0
        for line in f_input:
            words = line.rstrip().split()
Modified:

    print 'Tagging...'
    with codecs.open(opts.input, 'r', 'utf-8') as f_input:
        count = 0
        for line in f_input:
            # Treat a line that holds only the newline character as empty
            if len(line) <= 1:
                line = ''
            words = line.rstrip().split()
Added lines:

    if len(line) <= 1:
        line = ''
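A variation on the same guard, as a standalone Python 3 sketch rather than a patch to tagger.py. The fix above presumably works because later code in tagger.py checks `line` before processing it; that code is not shown here, so treat this as an assumption. Note that `len(line) <= 1` misses whitespace-only lines and blank lines ending in `\r\n` (length 2), whereas testing the split result covers both. `longest_word_lengths` is a hypothetical helper, not part of the repository.

```python
def longest_word_lengths(lines):
    """Return the longest word length per line, using 0 for blank lines."""
    results = []
    for line in lines:
        words = line.split()
        if not words:
            # Blank or whitespace-only line: nothing to tag, so skip max().
            results.append(0)
            continue
        results.append(max(len(word) for word in words))
    return results


sample = ["Victoria was born\n", "\n", "  \r\n", "NIEHS\n"]
print(longest_word_lengths(sample))  # prints [8, 0, 0, 5]
```

Checking `words` instead of the raw line length keeps the guard next to the call that would otherwise fail.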
@victoriastuart Thanks a lot, you just saved me a lot of time!
Hi @victoriastuart @nkruglikov, I am new to Python. Could you please help me train the model using the GoogleNews word embeddings? I am trying to train with this command:
python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb --lr_method=adam --tag_scheme=iob --pre_emb=GoogleNews-vectors-negative300.bin --all_emb=300
I have been stuck on this issue for about 2 months and haven't been able to resolve it. Thanks in advance.