zelenski/stanford-cpp-library

premature EOF when reading DawgLexicon file

jlutgen opened this issue · 1 comment

My copy of EnglishWords.dat (a lexicon data file in the DAWG format) contains the byte 0x1A at offset 0x80 (fairly early in the file, in one of the first few edge structs). On Windows, when a file is opened in "text" mode, 0x1A (Ctrl-Z) is treated as an end-of-file marker, so input.read((char*) edges, numBytes) in DawgLexicon::readBinaryFile(std::istream& input) stops early and does not read all numBytes bytes. The edge array is then only partially filled, which leads to a segmentation fault in DawgLexicon::countDawgWords(Edge* ep).

My fix is to pass in the std::ios::binary flag when opening files in Lexicon::addWordsFromFile(string &filename) and in DawgLexicon::addWordsFromFile(string &filename).
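A minimal sketch of what opening the file in binary mode looks like with the standard library. The function names readBytesBinary and writeSampleFile are hypothetical and exist only for this example; the actual change in the library is the one referenced in the fix commit.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: writes a file of `size` bytes filled with 'A',
// with a single 0x1A byte at offset ctrlZPos (mimicking the problem file).
bool writeSampleFile(const std::string& filename, std::size_t size,
                     std::size_t ctrlZPos) {
    std::ofstream out(filename.c_str(),
                      std::ios::out | std::ios::binary | std::ios::trunc);
    if (!out) return false;
    std::vector<char> data(size, 'A');
    if (ctrlZPos < size) data[ctrlZPos] = 0x1A;
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    return out.good();
}

// Hypothetical helper: reads up to numBytes from the file into buffer and
// returns the number of bytes actually read. std::ios::binary disables
// text-mode translation, so 0x1A is read as ordinary data on Windows too.
std::size_t readBytesBinary(const std::string& filename,
                            std::vector<char>& buffer,
                            std::size_t numBytes) {
    std::ifstream input(filename.c_str(), std::ios::in | std::ios::binary);
    if (!input) return 0;
    buffer.resize(numBytes);
    input.read(buffer.data(), static_cast<std::streamsize>(numBytes));
    std::size_t got = static_cast<std::size_t>(input.gcount());
    buffer.resize(got);
    return got;
}
```

On platforms that do not distinguish text and binary streams, the flag has no effect, so passing it unconditionally should be safe.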

Fixed as part of 6ef5b5c