multithreaded-programming-exercise

利用Multi-threading來同時計算餘弦相似係數(Cosine similarity coefficient)

在多份文件中尋找關鍵文件，所謂關鍵文件說明如下，在一個有M個文件的文件集合 $D={d_1,d_2 , ... , d_M}$ 中，關鍵文件 $d_k$ 就是與其他文件的相似度平均值最高的文件。

先算出每一篇文章中的詞在文章中出現的數量，這個數量稱之為詞頻（term frequency）。
按照所有文章的全部詞彙將每一篇文章轉換成一個詞頻向量。
接著算出來源文章向量（ $V_s$ ）與其他每一篇文章向量（ $V_x$ ）的向量cosine 值。這個公式如下： $$Sim(V_s,V_x)=cos(V_s,V_x)=\frac{V_s\cdot V_x}{|V_s|\times|V_x|}=\frac{ \sum_{i=1}^n{v_{s,i}\times v_{x,i}}}{\sqrt{\sum_{i=1}^n v_{s,i}^2}\times\sqrt{\sum_{i=1}^n v_{x,i}^2}}$$
最後就可以算出平均餘弦相似係數。

文件內容
yuan ze university is a good university this university has many very good students
there are many students in yuan ze university many students are very good
there are good books in yuan ze university students love to read these books
there are many good teachers in yuan ze university these teachers concern these students

計算出的詞頻（term frequency）

各別的文章向量

$$V_s=[1,1,3,1,1,2,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0]$$

$$V_1=[1,1,1,0,0,1,0,0,2,2,1,2,1,1,0,0,0,0,0,0,0]$$

$$V_2=[1,1,1,0,0,1,0,0,0,1,1,1,1,0,2,1,1,1,1,0,0]$$

$$V_3=[1,1,1,0,0,1,0,0,1,1,1,1,1,0,0,0,0,0,2,2,1]$$

文件ID
文件內容

0001 
this is a book 
0002
this is a pen
0003
a good book is a book
0004
that book is good

> g++ filename.cpp -o filename -lpthread
> ./filename.cpp input.txt