HW: Search Engine

In this assignment you will create a highly scalable web search engine.

Due Date: Sunday, 9 May

Learning Objectives:

Learn to work with a moderate large software project
Learn to parallelize data analysis work off the database
Learn to work with WARC files and the multi-petabyte common crawl dataset
Increase familiarity with indexes and rollup tables for speeding up queries

Task 0: project setup

Fork this github repo, and clone your fork onto the lambda server
Ensure that you'll have enough free disk space by:
1. bring down any running docker containers
2. run the command
```
$ docker system prune
```

Task 1: getting the system running

In this first task, you will bring up all the docker containers and verify that everything works.

There are three docker-compose files in this repo:

docker-compose.yml defines the database and pg_bouncer services
docker-compose.override.yml defines the development flask web app
docker-compose.prod.yml defines the production flask web app served by nginx

Your tasks are to:

Modify the docker-compose.override.yml file so that the port exposed by the flask service is different.
Run the script scripts/create_passwords.sh to generate a new production password for the database.
Build and bring up the docker containers.
Enable ssh port forwarding so that your local computer can connect to the running flask app.
Use firefox on your local computer to connect to the running flask webpage. If you've done the previous steps correctly, all the buttons on the webpage should work without giving you any error messages, but there won't be any data displayed when you search.
Run the script
```
$ sh scripts/check_web_endpoints.sh
```
to perform automated checks that the system is running correctly. All tests should report [pass].

Task 2: loading data

There are two services for loading data:

downloader_warc loads an entire WARC file into the database; typically, this will be about 100,000 urls from many different hosts.
downloader_host searches the all WARC entries in either the common crawl or internet archive that match a particular pattern, and adds all of them into the database

Task 2a

We'll start with the downloader_warc service. There are two important files in this service:

services/downloader_warc/downloader_warc.py contains the python code that actually does the insertion
downloader_warc.sh is a bash script that starts up a new docker container connected to the database, then runs the downloader_warc.py file inside that container

Next follow these steps:

Visit https://commoncrawl.org/the-data/get-started/
Find the url of a WARC file. On the common crawl website, the paths to WARC files are referenced from the Amazon S3 bucket. In order to get a valid HTTP url, you'll need to prepend https://commoncrawl.s3.amazonaws.com/ to the front of the path.
Then, run the command
```
$ ./download_warc.sh $URL
```
where $URL is the url to your selected WARC file.
Run the command
```
$ docker ps
```
to verify that the docker container is running.
Repeat these steps to download at least 5 different WARC files, each from different years. Each of these downloads will spawn its own docker container and can happen in parallel.

You can verify that your system is working with the following tasks. (Note that they are listed in order of how soon you will start seeing results for them.)

Running docker logs on your download_warc containers.
Run the query
```
SELECT count(*) FROM metahtml;
```
in psql.
Visit your webpage in firefox and verify that search terms are now getting returned.

Task 2b

The download_warc service above downloads many urls quickly, but they are mostly low-quality urls. For example, most URLs do not include the date they were published, and so their contents will not be reflected in the ngrams graph. In this task, you will implement and run the download_host service for downloading high quality urls.

The file services/downloader_host/downloader_host.py has 3 FIXME statements. You will have to complete the code in these statements to make the python script correctly insert WARC records into the database.

HINT: The code will require that you use functions from the cdx_toolkit library. You can find the documentation here. You can also reference the download_warc service for hints, since this service accomplishes a similar task.
Run the query
```
SELECT * FROM metahtml_test_summary_host;
```
to display all of the hosts for which the metahtml library has test cases proving it is able to extract publication dates. Note that the command above lists the hosts in key syntax form, and you'll have to convert the host into standard form.
Select 5 hostnames from the list above, then run the command
```
$ ./downloader_host.sh "$HOST/*"
```
to insert the urls from these 5 hostnames.

Task 3: speeding up the webpage

Since everyone seems pretty overworked right now, I've done this step for you.

There are two steps:

create indexes for the fast text search
create materialized views for the count(*) queries

Submission

Edit this README file with the results of the following queries in psql. The results of these queries will be used to determine if you've completed the previous steps correctly.
1. This query shows the total number of webpages loaded:
```
select count(*) from metahtml;
```
1. This query shows the number of webpages loaded / hour:
```
select * from metahtml_rollup_insert order by insert_hour desc limit 100;
```
1. This query shows the hostnames that you have downloaded the most webpages from:
```
select * from metahtml_rollup_host order by hostpath desc limit 100;
```
Take a screenshot of an interesting search result. Add the screenshot to your git repo, and modify the <img> tag below to point to the screenshot.
Commit and push your changes to github.
Submit the link to your github repo in sakai.

anan-ara/search_engine