-
Google has collection of consecutive words from scanned books. Question is to what extent large chunks of the book text can be reconstructed from this. It is a sequence assembly problem, but on a big alphabet (words)
-
Our system downloads books from Project Gutenberg, splits these books into NGrams (N=2 to 10) and then runs a sequence assembly program to reconstruct the books
-
Our system achieved a peak performance by assembling 22 million 10Grams in 2 min with 100% accuracy on a machine with 2.1 Ghz CPU and 4 GB of memory
-
Used C, C++ STLs, FNV Hash, Trie, STL hacks, Linux
himanshujindal/Reconstructing-Books-from-Google-Ngrams
Reconstructs whole books from the ngrams provided by google at http://books.google.com/ngrams/datasets
C