Information theory on sequences (probably mostly language modeling and transformers)
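The text corpus comes from Project Gutenberg. See the Project Gutenberg page on robot access for more information about downloading.
For now we download only the plain-text (txt) versions of the English-language books. In the command below, -w 2 waits two seconds between requests, -m mirrors the harvest listing recursively, and -H lets wget follow links to other hosts: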
wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"
Current work: tokenization, in particular byte-pair encoding (BPE).
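
As a rough illustration of the BPE idea, here is a minimal character-level sketch in Python. It is not this project's implementation: the function names train_bpe and apply_merge are hypothetical, and it starts from characters rather than bytes and skips the pre-tokenization that production tokenizers usually do. It repeatedly finds the most frequent adjacent pair of tokens and merges it into a single token.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges from `text` (toy, hypothetical helper)."""
    # Assumption: start from single characters; real tokenizers often use bytes.
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs in the current sequence.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not shorten anything
        merges.append(best)
        tokens = apply_merge(tokens, best)
    return merges, tokens


if __name__ == "__main__":
    merges, tokens = train_bpe("aaabdaaabac", 3)
    print(merges)
    print(tokens)
```

Each merge shortens the token sequence while keeping the underlying text recoverable by concatenating the tokens, which is why the learned merge list can serve as a vocabulary for a language model.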