COMS6111 Project 1

COMS6111 Project1

Group Name

Project 1 Group 27

Group Member

Qianwen Zheng (qz2271)

Jiajun Zhang (jz2793)

Files

└── group27-proj1
    ├── libs
    │   ├── google-api-client-1.22.0.jar
    │   ├── google-api-client-jackson2-1.22.0.jar
    │   ├── google-api-services-customsearch-v1-rev56-1.22.0.jar
    │   ├── google-http-client-1.22.0.jar
    │   ├── google-http-client-jackson2-1.22.0.jar
    │   ├── httpclient-4.0.1.jar
    │   └── jackson-core-2.1.3.jar
    ├── sup
    │   ├── proj1-stopword.txt
    │   └── formula.png
    ├── Makefile
    ├── Doc.java
    ├── Filter.java
    ├── GoogleSearch
    ├── GoogleSearch.java
    ├── Query.java
    ├── Rocchio.java
    ├── README.md
    └── transcript.txt

Run

Install Java

Run under Ubuntu 14.04 LTS

 sudo apt-get update

Install git if you haven't

 sudo apt-get install git

Install make if you have't

 sudo apt-get install build-essential

Install oracle java 8 if you haven't

 sudo add-apt-repository ppa:webupd8team/java
 	# make sure you do this
 sudo apt-get update
 	# remeber to choose to agree the license
 sudo apt-get install oracle-java8-installer

Set default version if you have other versions

 sudo update-alternatives --config java
 sudo update-alternatives --config javac
 sudo update-alternatives --config javaws

After this when you check java version by

 java -version

you can see the version is 8

 java version "1.8.0_144"
 Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
 Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

Clone project

 git clone https://github.com/petercanmakit/group27-proj1.git

Navigate to folder
```
 cd ./group27-proj1
```
Install dependencies
```
 make
```
Run
```
 ./GoogleSearch <google api key> <google engine id> <precision> <query>
```
<google api key> -- your Google Custom Search API Key

<google engine id> -- your Google Custom Search Engine ID

<precision> -- the target value for precision@10, a real number between 0 and 1

<query> -- your query, separated by space

Keys

Google Custom Search API Key

  AIzaSyARFSgO3Kiuu3IOtEL8UwdIbrS7SiB43qo

Google Custom Search Engine ID
```
  018258045116810257593:z1fmkqqt_di
```

Internal Design

We get the original query from users input. Then, we stores query as a customized Query type. In this class, we compute the initial q_i vector (tf-idf weight).
After searching the query in Google CSE, user can get 10 results and each result contains URL, title and its summary. The result will display one by one, and at the end of each result, user is requested to determine whether it is relevant to what he wants to search by inputing “Y” as yes and “N” as no.
We combine each result’s title and summary as a String, and then use a filter method (which is defined in Filter class) to eliminate stopwords. Note that instead of strictly eliminating all words that show in "proj1-stopword.txt" file, we skip those words which are contained in the original query. Then store the filtered string as a Doc type (Doc is a customized class). For each Doc, we compute its term frequency and tf-idf weight, which are stored in two HashMap.
Based on user’s feedback, we have two ArrayLists. One list stores relevant Docs, and the other store non-relevant Docs. According to these two list, we compute two HashMap. One stores word-weight pairs in relevant docs, and the other stores pairs in non-relevant docs.
We put the main computing part in a while loop, and the loop will only be broken when the desired precision is reached or the precision is 0. Therefore, if the precision is still below the desired precision, given the q vector and two word-weight HashMap, we implement Rocchio’s Algorithm to get new query. The expanded query will be automatically searched in Google CSE and a new round starts.

Query modification method

Given the original query, firstly compute the q_i vector, in which each element is the tf-idf weight of each word.
Based on Google Search results and user’s feedback, two vectors are computed. One stores tf-idf weight of words in relevant results and the other stores tf-idf weight of words in non-relevant results. Then normalized them separately.

The formula to compute tf-idf weight is :
```
  tf-idf = tf * (1 + idf) = tf * (1 + log(N/df))
```

Reference:

Tobias Liland Bjormyr, Deep Learning with emphasis on extracting information from text data, Section 3.2.2.2

Implement Rocchio’s Algorithm to compute new weight of words ( vector q_i+1).

That is,

  new weight of a word =  Alpha * (original weight in vector qi) 
                          + Beta * (tf-idf weight in relevant results) 
 		 	  - Gamma * (tf-idf weight in non-relevant results)

In our project, we implement Alpha = 1.0, Beta = 0.75, Gamma = 0.15.

According to the new vector, choose two terms that have the heavies weight and are not in the old query. Sort two new terms and old query according to their weight. The order of keywords may vary during each iteration.
Start a new round until the desired precision is reached.