This project showcases a simple document distance computation implemented in Python. The method used for evaluating similarity (distance) is the popular SSD (Sum of Squares Distance) algorithm, however the Cosine Distance method is available as well.
Stop words are always excluded as defined by the stop-word-list.txt
virtualenv venv -p python2.7
. venv/Scripts/activate # Git bash in Windows
pip install -r requirements.txt
# Compare inline text.
./dist.py -i "First sentence." "Second sentence" ... "Nth sentence."
# Compare files
./dist.py -f file1 file2 ... fileN
- to use cosine distance, use the
-m cos
option - to output the computed vectors, use the
-v
option for verbosity - to specify different stop-words file, use the
-s path_to_file
option - you can combine files and inline inputs (files are indexed before inlines)
./dist.py \
--method cos \
--verbose \
--inline \
"Today is a nice day!" \
"Tomorrow will also be a nice day..." \
"This sentence has nothing in common with the other two."
Sample output
Comparing documents 1 and 2. The cos distance is 0.3333
Comparing documents 1 and 3. The cos distance is 1.0000
Comparing documents 2 and 3. The cos distance is 1.0000
sentence day common tomorrow today nice
0 1 0 0 1 1
0 1 0 1 0 1
1 0 1 0 0 0
You can test the program on email spam messages. An example file spam_samples.txt is provided. You can also use the provided examples a.txt, b.txt and c.txt.
--
The problem author is Dr.sc. Marko Horvat, dipl. ing.