Authorship Attribution

  • This software statistically analyzes an unattributed text and compares the results with the statistics of known authors. Based on this comparison, it determines the most likely author of the unattributed text.

  • A system for inputting works by known authors to train the database is in development.

  • The method is primarily based on techniques described in "Text Classification For Authorship Attribution Analysis" by M. Sudheep Elayidom, Chinchu Jose, Anitta Puthussery, and Neenu K Sasi.
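
As a rough illustration of the comparison step, the sketch below picks the known author whose statistics are closest to those of the unattributed text. It is an assumption, not the project's actual code or the paper's exact method: the function name closest_author, the relative-distance measure, and the sample numbers are all hypothetical.

```python
import math

def closest_author(unknown, profiles):
    """Return the name of the author whose profile is nearest to `unknown`.

    `unknown` maps feature names to values; `profiles` maps author names
    to dicts with the same feature names. Assumes no profile value is zero.
    """
    def distance(sample, profile):
        # Use relative differences so that large-scale features
        # (e.g. words per sentence) do not dominate small-scale ratios.
        return math.sqrt(
            sum(((sample[k] - profile[k]) / profile[k]) ** 2 for k in profile)
        )
    return min(profiles, key=lambda name: distance(unknown, profiles[name]))

# Made-up numbers, purely for illustration.
profiles = {
    "Author A": {"avg_word_length": 4.0, "avg_words_per_sentence": 20.0},
    "Author B": {"avg_word_length": 5.0, "avg_words_per_sentence": 12.0},
}
unknown = {"avg_word_length": 4.1, "avg_words_per_sentence": 19.0}
best = closest_author(unknown, profiles)
```

With these sample values, best is "Author A", since both of its statistics are within a few percent of the unknown text's.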

Characteristics considered:

  • Average word length: the average number of characters per word, calculated after punctuation has been stripped.
  • Type-Token Ratio: the number of distinct words in a text divided by the total number of words. It measures how repetitive the vocabulary is; a lower ratio means more repetition.
  • Hapax Legomena Ratio: the number of words that appear exactly once in the text divided by the total number of words.
  • Average words per sentence: the total number of words divided by the number of sentences.
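
The four characteristics above could be computed along the following lines. This is a minimal sketch under stated assumptions, not the project's implementation: the function names tokenize and text_features are hypothetical, words are taken to be lowercased whitespace-separated tokens with surrounding punctuation stripped, and sentences are assumed to end at '.', '!' or '?'.

```python
import re
import string
from collections import Counter

def tokenize(text):
    # Lowercase, split on whitespace, and strip punctuation from the
    # ends of each token; tokens that were pure punctuation are dropped.
    stripped = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in stripped if w]

def text_features(text):
    words = tokenize(text)
    if not words:
        raise ValueError("text contains no words")
    counts = Counter(words)
    # Assumption: a sentence ends at one or more of . ! ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total = len(words)
    return {
        "avg_word_length": sum(len(w) for w in words) / total,
        "type_token_ratio": len(counts) / total,
        "hapax_legomena_ratio": sum(1 for c in counts.values() if c == 1) / total,
        "avg_words_per_sentence": total / len(sentences),
    }

features = text_features("The cat sat. The dog ran fast!")
```

For this sample there are 7 words, 6 distinct words, 5 words that occur exactly once, and 2 sentences, so the Type-Token Ratio is 6/7, the Hapax Legomena Ratio is 5/7, and the average is 3.5 words per sentence. A real tokenizer would need more care with abbreviations, hyphenated words, and quotations.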