HW: Search Engine

Due date: Wednesday 27 April (no collaboration extension possible)

In this assignment you will create a highly scalable web search engine.

Learning Objectives:

Learn to work with a moderate large software project
Learn to parallelize data analysis work off the database
Learn to work with WARC files and the multi-petabyte common crawl dataset
Increase familiarity with indexes and rollup tables for speeding up queries
Learn to debug database performance problems

Task 0: project setup

Fork this github repo, and clone your fork onto the lambda server
Ensure that you'll have enough free disk space by:
1. delete the contents of the databases in your $HOME/bigdata folder by
  1. see the instructions in the twitter_postgres_indexes repo for deleting the contents of the folders created in those assignments
2. delete your named volumes and previously built docker images by
  1. bringing down any running docker containers
  2. running the command
```
$ docker system prune -a
```

Task 1: getting the system running

In this first task, you will bring up all the docker containers and verify that everything works.

There are three docker-compose files in this repo:

docker-compose.yml defines the database and pg_bouncer services
docker-compose.override.yml defines the development flask web app
docker-compose.prod.yml defines the production flask web app served by nginx

Your tasks are to:

Run the script scripts/create_passwords.sh to generate the file .env.prod containing production credentials for the database. Recall that this file is sensitive and should not be added to your git repo for any reason.
Modify the docker-compose.override.yml file so that the port exposed by the flask service is the same as your userid.
Build and bring up the docker containers by running the commands
```
$ docker-compose build
$ docker-compose up -d
```
Note that the containers have many dependencies, and so building them the first time can take an hour or more. Further recall that when the -f command line flag is not specified, then docker-compose will use both the docker-compose.yml and docker-compose.override.yml configuration files, but will not use the docker-compose.prod.yml configuration file. For the purposes of this assignment, you won't need to use the production file.
Verify that you can connect to your database by running the command:
```
$ docker-compose exec pg psql --user=novichenko
```
Historical Note: Notice that the database username (as defined in the .prod.env file) is novichenko and not postgres or root. Novichenko was a Soviet military officer who saved Kim Il Sung from a grenade assassination attempt.
Enable ssh port forwarding so that your local computer can connect to the running flask app.
Use firefox on your local computer to connect to the running flask webpage. If you've done the previous steps correctly, all the buttons on the webpage should work without giving you any error messages, but there won't be any data displayed when you search.
Edit the script scripts/check_web_endpoints.sh so that the port that the script connects to matches the port that the flask server is exposed to on the docker-compose.overrider.yml file.

Then, run the script
```
$ sh scripts/check_web_endpoints.sh
```
to perform automated integration checks that the system is running correctly. All tests should report [pass].

Task 2: loading data

There are two services for loading data:

downloader_warc loads an entire WARC file into the database; typically, this will be about 100,000 urls from many different hosts.
downloader_host searches the all WARC entries in either the common crawl or internet archive that match a particular pattern, and adds all of them into the database

Task 2a

We'll start with the downloader_warc service. There are two important files in this service:

services/downloader_warc/downloader_warc.py contains the python code that actually does the insertion
downloader_warc.sh is a bash script that starts up a new docker container connected to the database, then runs the downloader_warc.py file inside that container

Next follow these steps:

Visit https://commoncrawl.org/the-data/get-started/
Find the url of a WARC file. On the common crawl website, the paths to WARC files are referenced from the Amazon S3 bucket. In order to get a valid HTTP url, you'll need to prepend https://data.commoncrawl.org/ to the front of the path.
Then, run the command
```
$ ./downloader_warc.sh $URL
```
where $URL is the url to your selected WARC file.

This command will spawn a docker container that downloads the WARC file and inserts it into the database. Run the command
```
$ docker ps
```
to verify that the docker container is running. Since it is downloading and processing a 1GB file, this container will run for a long time. Once the WARC file is fully downloaded, the container will automatically stop itself.

Note: The first time this script is run, a docker image is built. Building this image takes a long time (potentially hours), and so the first run of this script will be slow. Subsequent runs do not have to rebuild the container from scratch and will be much faster.
Repeat these steps to download at least 5 different WARC files, each from different years. Each of these downloads will spawn its own docker container and can happen in parallel.

You can verify that your system is working with the following tasks. (Note that they are listed in order of how soon you will start seeing results for them.)

Running docker logs on your downloader_warc containers.
Run the query
```
SELECT count(*) FROM metahtml;
```
in psql.
Visit your webpage in firefox and verify that search terms are now getting returned.

Task 2b

The downloader_warc service above downloads many urls quickly, but they are mostly low-quality urls. For example, most URLs do not include the date they were published, and so their contents will not be reflected in the ngrams graph. In this task, you will implement and run the downloader_host service for downloading high quality urls.

The file services/downloader_host/downloader_host.py has 3 FIXME statements. You will have to complete the code in these statements to make the python script correctly insert WARC records into the database.

HINT: The code will require that you use functions from the cdx_toolkit library. You can find the documentation here. You can also reference the downloader_warc service for hints, since this service accomplishes a similar task.
Run the query
```
SELECT * FROM metahtml_test_summary_host;
```
to display all of the hosts for which the metahtml library has test cases proving it is able to extract publication dates. Note that the command above lists the hosts in key syntax form, and you'll have to convert the host into standard form.
Select 5 hostnames from the list above, then run the command
```
$ ./downloader_host.sh "$HOST"
```
to insert the urls from these 5 hostnames.

Task 3: speeding up the webpage

Every time you run a web search on this website, several sql queries are run. In particular, the full text search is run using a query that contains something like

to_tsvector() @@ to_tsquery()

in its where clause. There is currently no index to speed up this query. So the query will use a sequential scan and the runtime will be linear in the amount of data searched.

That's bad!

Your goal in this task is to create an index that speeds up the query. It should be a RUM index to speed up the @@ operator and take advantage of a LIMIT clause using an index scan. The RUM index is already installed on this pg instance, and that's one of the reason building the images took a long time.

The problem is that I'm not going to tell you what the query is that you need to speed up. And there's a LOT of code to try to search through... so it's impractical to find the python that causes this query.

Instead, we'll use postgres to tell us what queries are slow without needing to search through the code. Postgres maintains a relation called pg_stat_statements which records the performance of all queries that get run. You can find a tutorial for using this relation at https://www.cybertec-postgresql.com/en/postgresql-detecting-slow-queries-quickly/.

Running the following query in psql will give you the most expensive queries that have been run on the database:

SELECT query,
      calls,
      round(total_exec_time::numeric, 2) AS total_time,
      round(mean_exec_time::numeric, 2) AS mean_time,
      round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS percentage
FROM pg_stat_statements
ORDER BY total_exec_time DESC;

If you run this query, you'll see that it contains MANY results inside of it because there are many queries that have been run on the database.

In order to find our text search query, modify the SELECT statement above to add a WHERE clause that requires that the query contain the @@ symbol. Now you should see as the first result the SELECT query that does full text search on your webpage. You can verify this by running a few more queries on the webpage and checking that the calls column in the SELECT query goes up.

Now that you know what the query looks like, write a RUM index to speed up this query. (One fact that you need which may not be included in the results of your SELECT query is that the language parameter to to_tsquery is 'simple'.)

Add this RUM index to your services/pg/sql/schema.sql file and run it directly in psql. You should notice when you run a search, the runtimes of the searches are now faster.

Submission

Edit this README file to contain the RUM query you created above right here:
```
CREATE INDEX metahtml_idx_rum ON metahtml USING rum(content);
```

Edit this README file with the results of the following queries in psql. The results of these queries will be used to determine if you've completed the previous steps correctly.

This query shows the total number of webpages loaded:

select count(*) from metahtml;
 count
 --------
 236896
 (1 row)

This query shows the number of webpages loaded / hour:

select * from metahtml_rollup_insert order by insert_hour desc limit 100;

 hll_count |  url   | hostpathquery | hostpath |  host  |      insert_hour
 -----------+--------+---------------+----------+--------+------------------------
  1 |  10301 |          9933 |     9585 |   8701 | 2022-05-10 02:00:00+00
  5 | 224626 |        217223 |   199426 | 131910 | 2022-05-10 01:00:00+00
 (2 rows)

This query shows the hostnames that you have downloaded the most webpages from:

select * from metahtml_rollup_host order by hostpath desc limit 100;

 url | hostpathquery | hostpath |              host
 -----+---------------+----------+---------------------------------
   70 |            70 |       70 | net,tiexue,bbs)
   69 |            69 |       69 | com,tripadvisor,no)
   65 |            65 |       65 | org,wikipedia,en)
   64 |            64 |       64 | com,cymax)
   74 |            74 |       61 | com,mlb)
   60 |            60 |       60 | net,blogmarks)
   59 |            59 |       59 | de,tripadvisor)
   54 |            54 |       54 | com,popsugar)
   52 |            52 |       52 | com,grouprecipes)
   52 |            52 |       52 | com,thestreet)
   51 |            51 |       51 | com,rentalhouses)
   51 |            51 |       51 | com,agoda)
   51 |            51 |       51 | com,threadless,keziacole)
   50 |            50 |       50 | com,bandsintown)
   50 |            50 |       50 | com,cdw)
   49 |            49 |       49 | com,gameboomers)
   49 |            49 |       49 | com,apartmenttherapy)
   48 |            48 |       48 | com,sears)
   47 |            47 |       47 | pt,encycolorpedia)
   46 |            46 |       46 | com,scribd)
   46 |            46 |       46 | com,epicsports,football)
   46 |            46 |       46 | com,ritetag)
   46 |            46 |       46 | com,clker)
   45 |            45 |       45 | cl,tripadvisor)
   44 |            44 |       44 | com,espncricinfo)
   57 |            57 |       44 | edu,illinois,carli,collections)
   44 |            44 |       44 | com,simon)
   44 |            44 |       44 | com,imgur)
   43 |            43 |       43 | com,foodily)
   43 |            43 |       43 | com,theguardian)
   43 |            43 |       42 | com,trulia)
   42 |            42 |       42 | com,funnyordie)
   42 |            42 |       42 | br,com,tripadvisor)
   42 |            42 |       42 | uk,co,huffingtonpost)
   41 |            41 |       41 | com,songlyrics)
   41 |            41 |       41 | com,smugmug,photos)
   41 |            41 |       41 | com,pandora)
   41 |            41 |       41 | org,denverlibrary,digital)
   41 |            41 |       40 | com,dafont)
   40 |            40 |       40 | com,seatgeek)
   39 |            39 |       39 | com,gamefaqs)
   39 |            39 |       39 | org,wikipedia,es)
   39 |            39 |       39 | com,tripadvisor)
   41 |            41 |       39 | com,legacy)
   39 |            39 |       39 | id,co,tripadvisor)
   38 |            38 |       38 | eg,com,tripadvisor)
   38 |            38 |       38 | com,oracle,docs)
   38 |            38 |       38 | com,scribdassets,imgv2-2)
   38 |            38 |       38 | com,engadget)
   38 |            38 |       38 | com,steepster)
   38 |            38 |       37 | com,dailypuppy)
   36 |            36 |       36 | com,newgrounds)
   36 |            36 |       36 | com,upi)
   36 |            36 |       36 | org,wikipedia,it)
   35 |            35 |       35 | org,wikipedia,de)
   35 |            35 |       35 | org,wikidata)
   35 |            35 |       35 | com,dailymotion)
   35 |            35 |       35 | org,wiktionary,en)
   35 |            35 |       35 | uk,co,expedia)
   35 |            35 |       35 | com,iheart)
   36 |            36 |       35 | com,dailykos)
   35 |            35 |       35 | com,weheartit)
   36 |            36 |       35 | com,snagajob)
   35 |            35 |       35 | me,torrentdownloads)

Take a screenshot of an interesting search result. Ensure that the timer on the bottom of the webpage is included in the screenshot. Add the screenshot to your git repo, and modify the <img> tag below to point to the screenshot.
Commit and push your changes to github.
Submit the link to your github repo in sakai.

DestrosCMC/search_engine