Due date: Wednesday 27 April (no collaboration extension possible)
In this assignment you will create a highly scalable web search engine.
Learning Objectives:
- Learn to work with a moderate large software project
- Learn to parallelize data analysis work off the database
- Learn to work with WARC files and the multi-petabyte common crawl dataset
- Increase familiarity with indexes and rollup tables for speeding up queries
- Learn to debug database performance problems
-
Fork this github repo, and clone your fork onto the lambda server
-
Ensure that you'll have enough free disk space by:
- delete the contents of the databases in your
$HOME/bigdata
folder by- see the instructions in the
twitter_postgres_indexes
repo for deleting the contents of the folders created in those assignments
- see the instructions in the
- delete your named volumes and previously built docker images by
- bringing down any running docker containers
- running the command
$ docker system prune -a
- delete the contents of the databases in your
In this first task, you will bring up all the docker containers and verify that everything works.
There are three docker-compose files in this repo:
docker-compose.yml
defines the database and pg_bouncer servicesdocker-compose.override.yml
defines the development flask web appdocker-compose.prod.yml
defines the production flask web app served by nginx
Your tasks are to:
-
Run the script
scripts/create_passwords.sh
to generate the file.env.prod
containing production credentials for the database. Recall that this file is sensitive and should not be added to your git repo for any reason. -
Modify the
docker-compose.override.yml
file so that the port exposed by the flask service is the same as your userid. -
Build and bring up the docker containers by running the commands
$ docker-compose build $ docker-compose up -d
Note that the containers have many dependencies, and so building them the first time can take an hour or more. Further recall that when the
-f
command line flag is not specified, thendocker-compose
will use both thedocker-compose.yml
anddocker-compose.override.yml
configuration files, but will not use thedocker-compose.prod.yml
configuration file. For the purposes of this assignment, you won't need to use the production file. -
Verify that you can connect to your database by running the command:
$ docker-compose exec pg psql --user=novichenko
Historical Note: Notice that the database username (as defined in the
.prod.env
file) isnovichenko
and notpostgres
orroot
. Novichenko was a Soviet military officer who saved Kim Il Sung from a grenade assassination attempt. -
Enable ssh port forwarding so that your local computer can connect to the running flask app.
-
Use firefox on your local computer to connect to the running flask webpage. If you've done the previous steps correctly, all the buttons on the webpage should work without giving you any error messages, but there won't be any data displayed when you search.
-
Edit the script
scripts/check_web_endpoints.sh
so that the port that the script connects to matches the port that the flask server is exposed to on thedocker-compose.overrider.yml
file.Then, run the script
$ sh scripts/check_web_endpoints.sh
to perform automated integration checks that the system is running correctly. All tests should report
[pass]
.
There are two services for loading data:
downloader_warc
loads an entire WARC file into the database; typically, this will be about 100,000 urls from many different hosts.downloader_host
searches the all WARC entries in either the common crawl or internet archive that match a particular pattern, and adds all of them into the database
We'll start with the downloader_warc
service.
There are two important files in this service:
services/downloader_warc/downloader_warc.py
contains the python code that actually does the insertiondownloader_warc.sh
is a bash script that starts up a new docker container connected to the database, then runs thedownloader_warc.py
file inside that container
Next follow these steps:
-
Find the url of a WARC file. On the common crawl website, the paths to WARC files are referenced from the Amazon S3 bucket. In order to get a valid HTTP url, you'll need to prepend
https://data.commoncrawl.org/
to the front of the path. -
Then, run the command
$ ./downloader_warc.sh $URL
where
$URL
is the url to your selected WARC file.This command will spawn a docker container that downloads the WARC file and inserts it into the database. Run the command
$ docker ps
to verify that the docker container is running. Since it is downloading and processing a 1GB file, this container will run for a long time. Once the WARC file is fully downloaded, the container will automatically stop itself.
Note: The first time this script is run, a docker image is built. Building this image takes a long time (potentially hours), and so the first run of this script will be slow. Subsequent runs do not have to rebuild the container from scratch and will be much faster.
-
Repeat these steps to download at least 5 different WARC files, each from different years. Each of these downloads will spawn its own docker container and can happen in parallel.
You can verify that your system is working with the following tasks. (Note that they are listed in order of how soon you will start seeing results for them.)
- Running
docker logs
on yourdownloader_warc
containers. - Run the query
in psql.
SELECT count(*) FROM metahtml;
- Visit your webpage in firefox and verify that search terms are now getting returned.
The downloader_warc
service above downloads many urls quickly, but they are mostly low-quality urls.
For example, most URLs do not include the date they were published, and so their contents will not be reflected in the ngrams graph.
In this task, you will implement and run the downloader_host
service for downloading high quality urls.
-
The file
services/downloader_host/downloader_host.py
has 3FIXME
statements. You will have to complete the code in these statements to make the python script correctly insert WARC records into the database.HINT: The code will require that you use functions from the cdx_toolkit library. You can find the documentation here. You can also reference the
downloader_warc
service for hints, since this service accomplishes a similar task. -
Run the query
SELECT * FROM metahtml_test_summary_host;
to display all of the hosts for which the metahtml library has test cases proving it is able to extract publication dates. Note that the command above lists the hosts in key syntax form, and you'll have to convert the host into standard form.
-
Select 5 hostnames from the list above, then run the command
$ ./downloader_host.sh "$HOST"
to insert the urls from these 5 hostnames.
Every time you run a web search on this website, several sql queries are run. In particular, the full text search is run using a query that contains something like
to_tsvector() @@ to_tsquery()
in its where clause. There is currently no index to speed up this query. So the query will use a sequential scan and the runtime will be linear in the amount of data searched.
That's bad!
Your goal in this task is to create an index that speeds up the query.
It should be a RUM index to speed up the @@
operator and take advantage of a LIMIT
clause using an index scan.
The RUM index is already installed on this pg instance,
and that's one of the reason building the images took a long time.
The problem is that I'm not going to tell you what the query is that you need to speed up. And there's a LOT of code to try to search through... so it's impractical to find the python that causes this query.
Instead, we'll use postgres to tell us what queries are slow without needing to search through the code.
Postgres maintains a relation called pg_stat_statements
which records the performance of all queries that get run.
You can find a tutorial for using this relation at https://www.cybertec-postgresql.com/en/postgresql-detecting-slow-queries-quickly/.
Running the following query in psql will give you the most expensive queries that have been run on the database:
SELECT query,
calls,
round(total_exec_time::numeric, 2) AS total_time,
round(mean_exec_time::numeric, 2) AS mean_time,
round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS percentage
FROM pg_stat_statements
ORDER BY total_exec_time DESC;
If you run this query, you'll see that it contains MANY results inside of it because there are many queries that have been run on the database.
In order to find our text search query, modify the SELECT
statement above to add a WHERE
clause that requires that the query contain the @@
symbol.
Now you should see as the first result the SELECT
query that does full text search on your webpage.
You can verify this by running a few more queries on the webpage and checking that the calls
column in the SELECT
query goes up.
Now that you know what the query looks like, write a RUM index to speed up this query.
(One fact that you need which may not be included in the results of your SELECT
query is that the language parameter to to_tsquery
is 'simple'
.)
Add this RUM index to your services/pg/sql/schema.sql
file and run it directly in psql.
You should notice when you run a search, the runtimes of the searches are now faster.
-
Edit this README file to contain the RUM query you created above right here:
CREATE INDEX metahtml_idx_rum ON metahtml USING rum(content);
-
Edit this README file with the results of the following queries in psql. The results of these queries will be used to determine if you've completed the previous steps correctly.
-
This query shows the total number of webpages loaded:
select count(*) from metahtml; count -------- 236896 (1 row)
-
This query shows the number of webpages loaded / hour:
select * from metahtml_rollup_insert order by insert_hour desc limit 100; hll_count | url | hostpathquery | hostpath | host | insert_hour -----------+--------+---------------+----------+--------+------------------------ 1 | 10301 | 9933 | 9585 | 8701 | 2022-05-10 02:00:00+00 5 | 224626 | 217223 | 199426 | 131910 | 2022-05-10 01:00:00+00 (2 rows)
-
This query shows the hostnames that you have downloaded the most webpages from:
select * from metahtml_rollup_host order by hostpath desc limit 100; url | hostpathquery | hostpath | host -----+---------------+----------+--------------------------------- 70 | 70 | 70 | net,tiexue,bbs) 69 | 69 | 69 | com,tripadvisor,no) 65 | 65 | 65 | org,wikipedia,en) 64 | 64 | 64 | com,cymax) 74 | 74 | 61 | com,mlb) 60 | 60 | 60 | net,blogmarks) 59 | 59 | 59 | de,tripadvisor) 54 | 54 | 54 | com,popsugar) 52 | 52 | 52 | com,grouprecipes) 52 | 52 | 52 | com,thestreet) 51 | 51 | 51 | com,rentalhouses) 51 | 51 | 51 | com,agoda) 51 | 51 | 51 | com,threadless,keziacole) 50 | 50 | 50 | com,bandsintown) 50 | 50 | 50 | com,cdw) 49 | 49 | 49 | com,gameboomers) 49 | 49 | 49 | com,apartmenttherapy) 48 | 48 | 48 | com,sears) 47 | 47 | 47 | pt,encycolorpedia) 46 | 46 | 46 | com,scribd) 46 | 46 | 46 | com,epicsports,football) 46 | 46 | 46 | com,ritetag) 46 | 46 | 46 | com,clker) 45 | 45 | 45 | cl,tripadvisor) 44 | 44 | 44 | com,espncricinfo) 57 | 57 | 44 | edu,illinois,carli,collections) 44 | 44 | 44 | com,simon) 44 | 44 | 44 | com,imgur) 43 | 43 | 43 | com,foodily) 43 | 43 | 43 | com,theguardian) 43 | 43 | 42 | com,trulia) 42 | 42 | 42 | com,funnyordie) 42 | 42 | 42 | br,com,tripadvisor) 42 | 42 | 42 | uk,co,huffingtonpost) 41 | 41 | 41 | com,songlyrics) 41 | 41 | 41 | com,smugmug,photos) 41 | 41 | 41 | com,pandora) 41 | 41 | 41 | org,denverlibrary,digital) 41 | 41 | 40 | com,dafont) 40 | 40 | 40 | com,seatgeek) 39 | 39 | 39 | com,gamefaqs) 39 | 39 | 39 | org,wikipedia,es) 39 | 39 | 39 | com,tripadvisor) 41 | 41 | 39 | com,legacy) 39 | 39 | 39 | id,co,tripadvisor) 38 | 38 | 38 | eg,com,tripadvisor) 38 | 38 | 38 | com,oracle,docs) 38 | 38 | 38 | com,scribdassets,imgv2-2) 38 | 38 | 38 | com,engadget) 38 | 38 | 38 | com,steepster) 38 | 38 | 37 | com,dailypuppy) 36 | 36 | 36 | com,newgrounds) 36 | 36 | 36 | com,upi) 36 | 36 | 36 | org,wikipedia,it) 35 | 35 | 35 | org,wikipedia,de) 35 | 35 | 35 | org,wikidata) 35 | 35 | 35 | com,dailymotion) 35 | 35 | 35 | org,wiktionary,en) 35 | 35 | 35 | uk,co,expedia) 35 | 35 | 35 | com,iheart) 36 | 36 | 35 | com,dailykos) 35 | 35 | 35 | com,weheartit) 36 | 36 | 35 | com,snagajob) 35 | 35 | 35 | me,torrentdownloads)
-
-
Take a screenshot of an interesting search result. Ensure that the timer on the bottom of the webpage is included in the screenshot. Add the screenshot to your git repo, and modify the
<img>
tag below to point to the screenshot. -
Commit and push your changes to github.
-
Submit the link to your github repo in sakai.