Use stylometry to determine the author of a given book. The first time you run the project, all authors on Project Gutenberg are scraped. Then n authors are selected and their corpora are downloaded. A few books are set aside from each corpus for validation. The algorithms then process the remaining corpus in these steps:
- Merge all raw text
- Find the x most frequent features (words)
- Find the mean and standard deviation for each feature
- For each feature and subcorpus (all books written by one author), calculate the z-value: z = (frequency of the feature in the subcorpus - mean) / standard deviation
- For each book in the validation set, compare its z-values with those of each subcorpus. The author of the subcorpus that gives the lowest delta is the most likely to have written the book
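
The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the project's actual code: the names `tokenize`, `shares`, and `rank_authors` are hypothetical, features are assumed to be lowercase words, the statistics are assumed to be taken across the per-author subcorpora, and delta is assumed to be the mean absolute difference between z-values (as in Burrows' Delta).

```python
import re
from collections import Counter
from statistics import mean, stdev


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def shares(tokens, features):
    """Relative frequency of each feature within one (non-empty) token list."""
    counts = Counter(tokens)
    return {f: counts[f] / len(tokens) for f in features}


def rank_authors(subcorpora, book, n_features=30):
    """Return (author, delta) pairs, most likely author first.

    subcorpora maps an author name to the merged text of that author's books
    (at least two authors are needed to compute a standard deviation).
    """
    tokens = {author: tokenize(text) for author, text in subcorpora.items()}

    # Merge all raw text and keep the most frequent words as features.
    merged = Counter(w for toks in tokens.values() for w in toks)
    features = [w for w, _ in merged.most_common(n_features)]

    # Mean and standard deviation of each feature across the subcorpora.
    freq = {author: shares(toks, features) for author, toks in tokens.items()}
    mu = {f: mean([freq[a][f] for a in freq]) for f in features}
    sigma = {f: stdev([freq[a][f] for a in freq]) for f in features}

    # Features with no spread cannot be standardized, so skip them.
    features = [f for f in features if sigma[f] > 0]

    def z_values(frequencies):
        return {f: (frequencies[f] - mu[f]) / sigma[f] for f in features}

    # z-values of the book under test, using the corpus mean and deviation.
    book_z = z_values(shares(tokenize(book), features))

    # Delta: mean absolute difference between the two sets of z-values.
    deltas = []
    for author in freq:
        author_z = z_values(freq[author])
        delta = mean([abs(book_z[f] - author_z[f]) for f in features])
        deltas.append((author, delta))
    return sorted(deltas, key=lambda pair: pair[1])
```

Calling `rank_authors({"author_a": text_a, "author_b": text_b}, book_text)` returns the candidate authors ordered by delta, so the first entry is the most likely author under these assumptions.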
- Download the books in txt format, for instance using the torrent files here
- Create a file config.json in the root folder with the following structure:
{
"scp": {
"ip": "<IP>",
"port": PORT,
"username": "USERNAME",
"password": "PASSWORD",
"path": "/path/to/remote/txt/folder/"
},
"local_book_lib": "/path/to/local/txt/folder/",
"min_books": 10,
"max_books": 10000,
"number_of_authors": 5
}
I had the Gutenberg text files downloaded on a remote computer, so I used SCP to fetch them. If you prefer to keep the books locally, omit the scp section of the JSON and set local_book_lib to your local path instead.
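
As a rough sketch of how such a configuration could be read, the Python below loads config.json and picks the book source. The helper names `load_config` and `book_source` are hypothetical, and the rule "use SCP only when an scp section is present" is an assumption based on the description above, not the project's confirmed behavior.

```python
import json
from pathlib import Path


def load_config(path="config.json"):
    """Read the JSON configuration file into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def book_source(config):
    """Return ("scp", settings) if an scp section exists, else ("local", path).

    Assumption: leaving out the "scp" section means the books are read
    directly from the folder named by "local_book_lib".
    """
    if "scp" in config:
        return "scp", config["scp"]
    return "local", Path(config["local_book_lib"])


# Example of a local-only configuration (no "scp" section).
local_only = {
    "local_book_lib": "/path/to/local/txt/folder/",
    "min_books": 10,
    "max_books": 10000,
    "number_of_authors": 5,
}
```

With `local_only`, `book_source` returns the local folder as a `Path`; with the full example above it returns the SCP settings instead.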
- The source for the raw book texts is Project Gutenberg
- Introduction to stylometry by François Dominic Laramée