PageCompare

This is the third program. It compares the target page with its historical pages to tag the paragraphs with timestamps.

How to use

Compile: mvn compile
Package: mvn clean package -Dmaven.test.skip=true
The tests are skipped because AppTest.class is not implemented.
Run: java -jar PageCompare.jar /CluewebDocFolder /HistoricalDocFolder /ResultFolder [Length Threshold] [Similarity Threshold]

/CluewebDocFolder is the path of the folder containing the ClueWeb docs that need to be tagged.

/HistoricalDocFolder is the path of the folder containing all the historical docs crawled by IA-Downloader.

/ResultFolder is the path of the folder that will contain the tagged results.

Length Threshold is the length threshold for sub-documents. Only sub-docs longer than the threshold are considered informative.

Similarity Threshold is the similarity threshold for the comparison. If the similarity of two sub-docs is higher than the threshold, they are treated as the same. The comparison method is JaroWinklerTFIDF in SecondString.
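The way the two thresholds interact can be sketched as below. This is a minimal illustration, not the PageCompare source: the class and method names are hypothetical, length is assumed to be counted in characters, and the similarity function is passed in as a parameter so the sketch runs without SecondString. In the real tool the score comes from JaroWinklerTFIDF.

```java
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper; names are illustrative and not taken from PageCompare.
class ThresholdSketch {

    // A sub-doc is informative only if it is strictly longer than the
    // length threshold (assumption: length is measured in characters).
    static boolean isInformative(String subDoc, int lengthThreshold) {
        return subDoc.length() > lengthThreshold;
    }

    // Two sub-docs are treated as the same when their similarity is
    // strictly higher than the similarity threshold. The similarity
    // function is a stand-in for SecondString's JaroWinklerTFIDF score.
    static boolean isSame(String a, String b, double similarityThreshold,
                          ToDoubleBiFunction<String, String> similarity) {
        return similarity.applyAsDouble(a, b) > similarityThreshold;
    }

    // Only informative sub-docs are worth comparing at all.
    static boolean matches(String target, String historical,
                           int lengthThreshold, double similarityThreshold,
                           ToDoubleBiFunction<String, String> similarity) {
        return isInformative(target, lengthThreshold)
                && isInformative(historical, lengthThreshold)
                && isSame(target, historical, similarityThreshold, similarity);
    }
}
```

With a length threshold of 50 and a similarity threshold of 0.7, a 40-character paragraph would be skipped regardless of its score, and a longer paragraph would match a historical one only if the score exceeds 0.7.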

Example

java -jar PageCompare.jar /Clueweb12_Crawled/clueweb12-0000tw/ /AWSCrawled/Disk1_ClueWeb12_00_0000tw-CRAWLED/ /Desktop/TestFolder 50 0.7

Update

0.1.2

Added handling for historical page files that are gzipped.
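One way to read a file that may or may not be gzipped is sketched below using only the Java standard library. How PageCompare itself detects gzipped files is not described here. Checking the gzip magic bytes (0x1f 0x8b) is an assumption of this sketch, and the class and method names are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; not taken from the PageCompare source.
class GzipSketch {

    // Returns a stream of decompressed bytes if the input starts with the
    // gzip magic bytes, otherwise returns the input unchanged.
    static InputStream openMaybeGzipped(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);              // remember the start so we can rewind
        int b1 = in.read();
        int b2 = in.read();
        in.reset();              // rewind to the first byte
        boolean gzipped = (b1 == 0x1f && b2 == 0x8b);
        return gzipped ? new GZIPInputStream(in) : in;
    }

    // Reads the whole (possibly gzipped) content as UTF-8 text.
    static String readText(byte[] data) {
        try (InputStream in = openMaybeGzipped(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: gzip a string so the reader can be exercised in memory.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

For files on disk, the same openMaybeGzipped method could wrap a java.io.FileInputStream. The UTF-8 decoding is also an assumption, since crawled pages may use other encodings.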

0.1.1