anoopkunchukuttan/indic_nlp_library

Tokenization failing for IITB Monolingual corpus

Closed this issue · 2 comments

Getting the error below while trying to tokenize the IITB monolingual corpus; the same command works fine for the parallel corpus (target language: Hindi).

Traceback (most recent call last):
  File "indic_tokenize.py", line 67, in <module>
    for line in ifile.readlines():
  File "/usr/lib/python2.7/codecs.py", line 676, in readlines
    return self.reader.readlines(sizehint)
  File "/usr/lib/python2.7/codecs.py", line 585, in readlines
    data = self.read()
  File "/usr/lib/python2.7/codecs.py", line 474, in read
    newchars, decodedbytes = self.decode(data, self.errors)
MemoryError

The problem is that the command-line interface reads the entire file into memory before processing. I have now changed it to read the file line by line, which avoids the memory problem for large files. Let me know if that solves the problem.
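
For reference, a minimal sketch of that kind of change (the function name tokenize_file and the paths are illustrative, not the actual indic_tokenize.py code; trivial_tokenize is the library's basic word tokenizer): iterating over the file object directly streams one line at a time instead of materializing every line with readlines().

import codecs

from indicnlp.tokenize import indic_tokenize

def tokenize_file(in_path, out_path, lang='hi'):
    # Iterating over the file object reads one line at a time,
    # so memory use stays bounded even for multi-GB corpora.
    with codecs.open(in_path, 'r', 'utf-8') as ifile, \
         codecs.open(out_path, 'w', 'utf-8') as ofile:
        for line in ifile:  # was: for line in ifile.readlines()
            tokens = indic_tokenize.trivial_tokenize(line.strip(), lang)
            ofile.write(u' '.join(tokens) + u'\n')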

Thanks, Anoop, for the quick fix. I have tried it and it's working now without the memory error. I am closing the issue.