Test task
Write a routine that takes a set of email body texts and assigns a spam probability to each one based on its similarity to the other emails in the set. The more similar an email is to the others, the more likely it is spam.
Implementation
Text similarity is measured with the w-shingling algorithm.
More about the algorithm: http://en.wikipedia.org/wiki/W-shingling
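As a rough illustration of the idea, here is a minimal sketch of w-shingling in JavaScript. It is not the code from index.js: the function names shingles and resemblance, the tokenization rule (lowercased runs of letters and digits) and the handling of texts that are too short to produce a shingle are all assumptions made for this example.

```javascript
// Hypothetical helper: build the set of unique w-word shingles of a text.
function shingles(text, w = 4) {
  // Assumed tokenization: lowercase, words are runs of letters or digits.
  const words = text.toLowerCase().match(/[\p{L}\p{N}]+/gu) || [];
  const result = new Set();
  for (let i = 0; i + w <= words.length; i++) {
    result.add(words.slice(i, i + w).join(' '));
  }
  return result;
}

// Hypothetical helper: resemblance of two shingle sets,
// i.e. the Jaccard coefficient |A ∩ B| / |A ∪ B|.
function resemblance(a, b) {
  // Assumption: two texts without any shingles are treated as not similar.
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  for (const shingle of a) {
    if (b.has(shingle)) shared++;
  }
  return shared / (a.size + b.size - shared);
}

const first = shingles('a rose is a rose is a rose');
const second = shingles('a rose is a rose');
console.log(resemblance(first, second));
```

A resemblance of 1 means both texts have the same set of shingles, and 0 means they share none.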
How to use
Help: node index.js --help
CLI: node index.js -p ./testFiles/
There are 4 options for the CLI command:
- Path to the directory with the texts
- Similarity threshold in the range [0..1], 0.5 by default: how similar two texts must be to be treated as the same; 1 means an exact copy
- Minimum number of duplicates for a text to be marked as not unique, 3 by default
- Length of one shingle in words, 4 by default, minimum 3
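To show how the similarity threshold and the minimum number of duplicates could work together, here is a small sketch. It is one plausible reading of the option descriptions, not the actual logic of index.js: the function name markDuplicates, the shape of its input (a matrix of pairwise similarities) and the use of inclusive comparisons are assumptions.

```javascript
// Hypothetical helper: `similarity[i][j]` is the similarity of texts i and j in [0, 1].
// The defaults mirror the documented defaults of the CLI options.
function markDuplicates(similarity, { threshold = 0.5, minDuplicates = 3 } = {}) {
  return similarity.map((row, i) => {
    // Count the other texts that are at least `threshold` similar to text i.
    const duplicates = row.filter((value, j) => j !== i && value >= threshold).length;
    // Assumption: a text stops being unique once it has `minDuplicates` duplicates.
    return { index: i, duplicates, unique: duplicates < minDuplicates };
  });
}

// Made-up similarity values for five texts, for illustration only.
const exampleSimilarity = [
  [1.0, 0.8, 0.7, 0.6, 0.1],
  [0.8, 1.0, 0.9, 0.5, 0.0],
  [0.7, 0.9, 1.0, 0.4, 0.2],
  [0.6, 0.5, 0.4, 1.0, 0.1],
  [0.1, 0.0, 0.2, 0.1, 1.0],
];
console.log(markDuplicates(exampleSimilarity));
```

With the default values, the first two texts each have three neighbours at or above the threshold and are marked as not unique, while the remaining three stay unique. The shingle length does not appear here because it only affects how the similarity values themselves are computed.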
How to test
- Run the tests: npm run test
- There are 4 scenarios for quick testing: npm run one, npm run two, npm run three, npm run four