Introduction

Use stylometry to determine the most likely author of a given book. The first time you run the project, all authors from Project Gutenberg are scraped. Then n authors are selected and their corpora are downloaded. A few books are set aside from the corpus for validation. The algorithm then runs on the remaining corpus, using these steps:

Merge all raw text
Find the x most frequent features (words)
Find the mean and standard deviation for each feature
For each feature and subcorpus (all books written by an author), calculate the z-value: z = (frequency of the feature in the subcorpus - mean) / standard deviation
For each book in the validation set, compare its z-values with those of each subcorpus. The author of the subcorpus that gives the lowest delta is the most likely to have written the book
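
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function names (tokenize, relative_frequencies, burrows_delta), the regex tokenizer, and the choice of mean absolute z-score difference as the delta are all assumptions made for the example.

```python
import re
from collections import Counter
from statistics import mean, stdev


def tokenize(text):
    """Lowercase the text and return its word tokens (simple regex, an assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequencies(tokens, features):
    """Share of each feature among all tokens of one text."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {f: counts[f] / total for f in features}


def burrows_delta(subcorpora, unknown_text, n_features=30):
    """Return {author: delta}; the lowest delta marks the most likely author.

    subcorpora maps each author to the merged text of that author's books.
    At least two authors are needed so a standard deviation exists.
    """
    tokens_by_author = {a: tokenize(t) for a, t in subcorpora.items()}

    # Merge all raw text and find the most frequent features (words).
    merged = Counter()
    for tokens in tokens_by_author.values():
        merged.update(tokens)
    features = [w for w, _ in merged.most_common(n_features)]

    # Mean and standard deviation of each feature across the subcorpora.
    freqs = {a: relative_frequencies(t, features)
             for a, t in tokens_by_author.items()}
    means = {f: mean(freqs[a][f] for a in freqs) for f in features}
    stds = {f: stdev(freqs[a][f] for a in freqs) for f in features}
    # Features that never vary carry no signal and would divide by zero.
    features = [f for f in features if stds[f] > 0]

    def z_scores(freq):
        return {f: (freq[f] - means[f]) / stds[f] for f in features}

    # z-value for each feature and subcorpus, then for the unknown book.
    author_z = {a: z_scores(freqs[a]) for a in freqs}
    unknown_z = z_scores(relative_frequencies(tokenize(unknown_text), features))

    # Delta: mean absolute difference between the two sets of z-values.
    return {
        a: mean(abs(unknown_z[f] - z[f]) for f in features)
        for a, z in author_z.items()
    }
```

Calling burrows_delta with a dictionary of merged author texts and the text of one validation book returns one delta per author; taking the minimum, for example with min(deltas, key=deltas.get), gives the predicted author.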

Setup

Download the books in txt format, for instance using the torrent files here
Create a file config.json in the root folder with the following structure:

{
  "scp": {
    "ip": "<IP>",
    "port": PORT,
    "username": "USERNAME",
    "password": "PASSWORD",
    "path": "/path/to/remote/txt/folder/"
  },
  "local_book_lib": "/path/to/local/txt/folder/",
  "min_books": 10,
  "max_books": 10000,
  "number_of_authors": 5
}

I had the Gutenberg text files on a remote computer, so I used SCP to fetch them. If you prefer to keep the text files locally, omit the scp section of the JSON and set local_book_lib to your local path instead.
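
As a rough sketch of how the two setups could be told apart when reading the configuration: the helper name resolve_book_source and the rule "use SCP whenever an scp section is present" are assumptions for illustration, not necessarily how the project decides.

```python
import json


def resolve_book_source(config):
    """Return ("scp", settings) if an scp section exists, else ("local", path).

    Hypothetical helper: the decision rule is an assumption for this example.
    """
    if "scp" in config:
        return "scp", config["scp"]
    return "local", config["local_book_lib"]


# A local-only configuration, i.e. config.json without the scp section.
local_only = json.loads("""
{
  "local_book_lib": "/path/to/local/txt/folder/",
  "min_books": 10,
  "max_books": 10000,
  "number_of_authors": 5
}
""")

source, location = resolve_book_source(local_only)
```

In a real run the dictionary would come from json.load on the config.json file in the root folder rather than from an inline string.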

References

The raw book texts come from Project Gutenberg
Introduction to stylometry by François Dominic Laramée

andvra/bookdiff
