COMS6111 Project1
Project 1 Group 27
Qianwen Zheng (qz2271)
Jiajun Zhang (jz2793)
└── group27-proj1
├── libs
│ ├── google-api-client-1.22.0.jar
│ ├── google-api-client-jackson2-1.22.0.jar
│ ├── google-api-services-customsearch-v1-rev56-1.22.0.jar
│ ├── google-http-client-1.22.0.jar
│ ├── google-http-client-jackson2-1.22.0.jar
│ ├── httpclient-4.0.1.jar
│ └── jackson-core-2.1.3.jar
├── sup
│ ├── proj1-stopword.txt
│ └── formula.png
├── Makefile
├── Doc.java
├── Filter.java
├── GoogleSearch
├── GoogleSearch.java
├── Query.java
├── Rocchio.java
├── README.md
└── transcript.txt
-
Install Java
Run under Ubuntu 14.04 LTS
sudo apt-get update
Install git if you haven't
sudo apt-get install git
Install make if you haven't
sudo apt-get install build-essential
Install oracle java 8 if you haven't
sudo add-apt-repository ppa:webupd8team/java
# make sure you do this
sudo apt-get update
# remember to agree to the license when prompted
sudo apt-get install oracle-java8-installer
Set default version if you have other versions
sudo update-alternatives --config java
sudo update-alternatives --config javac
sudo update-alternatives --config javaws
After this, check the Java version with
java -version
The reported version should be 8:
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
-
Clone project
git clone https://github.com/petercanmakit/group27-proj1.git
-
Navigate to folder
cd ./group27-proj1
-
Install dependencies
make
-
Run
./GoogleSearch <google api key> <google engine id> <precision> <query>
<google api key> -- your Google Custom Search API Key
<google engine id> -- your Google Custom Search Engine ID
<precision> -- the target value for precision@10, a real number between 0 and 1
<query> -- your query; separate multiple words with spaces
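
As an illustration of how these four arguments might be validated, here is a minimal sketch. The class `CliArgs` and its `parse` method are hypothetical names and are not taken from the project's GoogleSearch.java; the sketch also assumes that every argument after the third is a word of the query.

```java
import java.util.Arrays;

/** Hypothetical holder for the command-line arguments (illustration only). */
final class CliArgs {
    final String apiKey;
    final String engineId;
    final double precision;
    final String query;

    private CliArgs(String apiKey, String engineId, double precision, String query) {
        this.apiKey = apiKey;
        this.engineId = engineId;
        this.precision = precision;
        this.query = query;
    }

    /** Parses: api key, engine id, precision, then one or more query words. */
    static CliArgs parse(String[] args) {
        if (args.length < 4) {
            throw new IllegalArgumentException(
                "usage: <google api key> <google engine id> <precision> <query>");
        }
        // NumberFormatException (a subclass of IllegalArgumentException) on bad input.
        double precision = Double.parseDouble(args[2]);
        // Written this way so that NaN is rejected as well.
        if (!(precision >= 0.0 && precision <= 1.0)) {
            throw new IllegalArgumentException("precision must be between 0 and 1");
        }
        // Assumption: all remaining arguments are query words, joined by single spaces.
        String query = String.join(" ", Arrays.copyOfRange(args, 3, args.length));
        return new CliArgs(args[0], args[1], precision, query);
    }
}
```

With this sketch, `CliArgs.parse(new String[] {"KEY", "ENGINE", "0.9", "per", "se"})` would yield the query string `per se` and a target precision of 0.9.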
-
Google Custom Search API Key
AIzaSyARFSgO3Kiuu3IOtEL8UwdIbrS7SiB43qo
-
Google Custom Search Engine ID
018258045116810257593:z1fmkqqt_di
-
We read the original query from the user's input and store it as a custom Query type. In this class, we compute the initial qi vector (tf-idf weights).
-
After the query is searched in Google CSE, the user gets 10 results, each containing a URL, a title, and a summary. The results are displayed one by one, and after each result the user is asked whether it is relevant to what they want to find, entering “Y” for yes or “N” for no.
-
We combine each result’s title and summary into a single String and then use a filter method (defined in the Filter class) to eliminate stopwords. Note that instead of strictly eliminating every word that appears in the "proj1-stopword.txt" file, we keep stopwords that are contained in the original query. The filtered string is then stored as a Doc (a custom class). For each Doc, we compute its term frequencies and tf-idf weights, which are stored in two HashMaps.
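
The stopword rule described above can be sketched as follows. This is an illustration under stated assumptions, not the project's Filter class: the class name `StopwordFilterSketch`, the method `filter`, and the tokenization (lower-casing and splitting on non-alphanumeric characters) are all hypothetical choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stopword filter (illustration only, not the project's Filter class). */
final class StopwordFilterSketch {

    /**
     * Lower-cases the text, splits it on runs of non-alphanumeric characters,
     * and drops stopwords unless they also occur in the original query.
     */
    static List<String> filter(String text, Set<String> stopwords, Set<String> queryTerms) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with punctuation
            }
            if (stopwords.contains(token) && !queryTerms.contains(token)) {
                continue; // a stopword that the user did not ask for
            }
            kept.add(token);
        }
        return kept;
    }
}
```

Keeping query words that happen to be stopwords matters for queries made up mostly of common words, where a strict filter would remove the very terms the user searched for.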
-
Based on the user’s feedback, we build two ArrayLists: one stores the relevant Docs, and the other stores the non-relevant Docs. From these two lists, we compute two HashMaps: one stores word-weight pairs from the relevant Docs, and the other stores word-weight pairs from the non-relevant Docs.
-
The main computation runs in a while loop, which exits only when the desired precision is reached or the precision is 0. If the precision is still below the desired precision, we apply Rocchio’s Algorithm to the q vector and the two word-weight HashMaps to get a new query. The expanded query is then automatically searched in Google CSE, and a new round starts.
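
The exit condition of this loop can be sketched as a small decision function. The names `FeedbackLoopSketch`, `Decision`, `precision`, and `decide` are hypothetical, and the sketch assumes precision is the fraction of results the user marked as relevant.

```java
/** Hypothetical loop-control helper (illustration only). */
final class FeedbackLoopSketch {

    enum Decision { TARGET_REACHED, NO_RELEVANT_RESULTS, EXPAND_QUERY }

    /** Fraction of results marked relevant; 0.0 when there are no results. */
    static double precision(boolean[] feedback) {
        if (feedback.length == 0) {
            return 0.0;
        }
        int relevant = 0;
        for (boolean isRelevant : feedback) {
            if (isRelevant) {
                relevant++;
            }
        }
        return (double) relevant / feedback.length;
    }

    /** Decides whether to stop or to run another round with an expanded query. */
    static Decision decide(boolean[] feedback, double target) {
        double p = precision(feedback);
        if (p >= target) {
            return Decision.TARGET_REACHED;      // desired precision reached
        }
        if (p == 0.0) {
            return Decision.NO_RELEVANT_RESULTS; // nothing to learn from, so stop
        }
        return Decision.EXPAND_QUERY;            // run Rocchio and search again
    }
}
```

A driver loop would call `decide` after each round of feedback and continue only while it returns `EXPAND_QUERY`.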
-
Given the original query, first compute the qi vector, in which each element is the tf-idf weight of a query word.
-
Based on the Google Search results and the user’s feedback, two vectors are computed: one stores the tf-idf weights of words in relevant results, and the other stores the tf-idf weights of words in non-relevant results. Each vector is then normalized separately.
The formula to compute the tf-idf weight is:
tf-idf = tf * (1 + idf) = tf * (1 + log(N/df))
Reference:
Tobias Liland Bjormyr, Deep Learning with emphasis on extracting information from text data, Section 3.2.2.2
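
The tf-idf formula above can be sketched in Java as follows. This is a minimal illustration, not the project's Doc class: the names `TfIdfSketch` and `tfIdf` are hypothetical, N is assumed to be the number of documents passed in, tf is assumed to be the raw term count, and the logarithm is assumed to be the natural logarithm (the formula does not state a base). Normalization is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical tf-idf helper (illustration only). */
final class TfIdfSketch {

    /** Returns one term-to-weight map per document, using tf * (1 + log(N / df)). */
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        int n = docs.size();

        // df: number of documents that contain each term at least once.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            // tf: raw count of each term in this document.
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey())); // natural log (assumption)
                weights.put(e.getKey(), e.getValue() * (1.0 + idf));
            }
            result.add(weights);
        }
        return result;
    }
}
```

Because of the `1 + idf` term, a word that appears in every document still keeps a weight equal to its term frequency instead of dropping to zero.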
-
Apply Rocchio’s Algorithm to compute the new weight of each word (vector qi+1).
That is,
new weight of a word = Alpha * (original weight in vector qi) + Beta * (tf-idf weight in relevant results) - Gamma * (tf-idf weight in non-relevant results)
In our project, we use Alpha = 1.0, Beta = 0.75, Gamma = 0.15.
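
The update rule above can be sketched per term as follows. The class `RocchioSketch` and the method `update` are hypothetical names, not the project's Rocchio.java. The sketch treats a term missing from a map as having weight 0 and does not clamp negative weights, since the text does not say whether the project does.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical Rocchio update (illustration only). */
final class RocchioSketch {
    static final double ALPHA = 1.0;
    static final double BETA = 0.75;
    static final double GAMMA = 0.15;

    /** Computes ALPHA * q + BETA * relevant - GAMMA * nonRelevant for every known term. */
    static Map<String, Double> update(Map<String, Double> query,
                                      Map<String, Double> relevant,
                                      Map<String, Double> nonRelevant) {
        // The new vector covers every term seen in any of the three inputs.
        Set<String> vocabulary = new HashSet<>(query.keySet());
        vocabulary.addAll(relevant.keySet());
        vocabulary.addAll(nonRelevant.keySet());

        Map<String, Double> next = new HashMap<>();
        for (String term : vocabulary) {
            double weight = ALPHA * query.getOrDefault(term, 0.0)
                    + BETA * relevant.getOrDefault(term, 0.0)
                    - GAMMA * nonRelevant.getOrDefault(term, 0.0);
            next.put(term, weight);
        }
        return next;
    }
}
```

For example, a term with weight 1.0 in qi and 0.4 in the relevant vector, absent from the non-relevant vector, gets the new weight 1.0 * 1.0 + 0.75 * 0.4 = 1.3.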
-
From the new vector, choose the two terms that have the highest weights and are not in the old query. Sort the two new terms together with the old query terms by weight. As a result, the order of the keywords may vary from one iteration to the next.
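
The term selection and reordering step can be sketched as follows. The names `QueryExpansionSketch` and `expand` are hypothetical, and breaking ties alphabetically is an assumption made only to keep the result deterministic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical query expansion step (illustration only). */
final class QueryExpansionSketch {

    /** Adds the two heaviest terms not in the old query, then orders all terms by weight. */
    static List<String> expand(List<String> oldQuery, Map<String, Double> weights) {
        // Descending weight; ties broken alphabetically (assumption).
        Comparator<String> byWeightDesc = Comparator
                .comparingDouble((String t) -> weights.getOrDefault(t, 0.0))
                .reversed()
                .thenComparing(Comparator.<String>naturalOrder());

        // Candidates are all weighted terms that the old query does not contain.
        List<String> candidates = new ArrayList<>();
        for (String term : weights.keySet()) {
            if (!oldQuery.contains(term)) {
                candidates.add(term);
            }
        }
        candidates.sort(byWeightDesc);

        List<String> next = new ArrayList<>(oldQuery);
        next.addAll(candidates.subList(0, Math.min(2, candidates.size())));
        next.sort(byWeightDesc); // old and new terms are reordered together
        return next;
    }
}
```

Because old and new terms are sorted together, a newly added term can move ahead of an original query word when its weight is higher.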
-
Start a new round until the desired precision is reached.