Wikipedia dumps raw database tables in a gzipped SQL format for the English language Wikipedia
(`enwiki`) approximately once a month (e.g., the dump from February 1, 2018). The entire database
layout is not required, so the database creation script only downloads, trims, and parses three
tables:

- `page` - Contains the ID and name (among other things) for all pages.
- `pagelinks` - Contains the source and target pages for all links.
- `redirect` - Contains the source and target pages for all redirects.
For performance reasons, the files are downloaded from the `dumps.wikimedia.your.org` mirror. By
default, the script grabs the latest dump (available at
https://dumps.wikimedia.your.org/enwiki/latest/), but you can also call the database creation
script with a download date in the format `YYYYMMDD` as the first argument.
SDOW only concerns itself with actual Wikipedia articles, which belong to namespace 0 in the Wikipedia data.
The compressed `sdow.sqlite.gz` files generated for use in this project are available for download
from "requester pays" Google Cloud Storage buckets. Check the pricing page for the full details. In
general, copying should be free within Google Cloud Platform (e.g., to another Google Cloud Storage
bucket or to a Google Compute Engine VM) and around $0.05 per compressed SQLite file otherwise.
Use the following `gsutil` and `pigz` commands to download a file, making sure to replace
`<GCP_PROJECT_ID>` with your Google Cloud Platform project ID and `<YYYYMMDD>` with the date of the
database dump:
$ gsutil -u <GCP_PROJECT_ID> cp gs://sdow-prod/dumps/<YYYYMMDD>/sdow.sqlite.gz .
$ pigz -d sdow.sqlite.gz
Here is a list of historical files currently available for download:
- gs://sdow-prod/dumps/20180201/sdow.sqlite.gz (2.99 GB)
- gs://sdow-prod/dumps/20180301/sdow.sqlite.gz (3.01 GB)
The result of running the database creation script is a single `sdow.sqlite` file which contains
the following three tables:

- `pages` - Page information for all pages, including redirects.
  - `id` - Page ID.
  - `title` - Sanitized page title.
  - `is_redirect` - Whether or not the page is a redirect (`1` means it is a redirect; `0` means it is not).
- `links` - Outgoing and incoming links for each non-redirect page.
  - `id` - The page ID of the source page, the page that contains the link.
  - `outgoing_links_count` - The number of pages to which this page links.
  - `incoming_links_count` - The number of pages which link to this page.
  - `outgoing_links` - A `|`-separated list of page IDs to which this page links.
  - `incoming_links` - A `|`-separated list of page IDs which link to this page.
- `redirects` - Source and target page IDs for all redirects.
  - `source_id` - The page ID of the source page, the page that redirects to another page.
  - `target_id` - The page ID of the target page, to which the redirect page redirects.
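For example, here is a minimal sketch of reading a page's outgoing links with Python's built-in
`sqlite3` module, assuming only the schema above and that `sdow.sqlite` sits in the working
directory (the page ID is the Dr. Strangelove example from the title table further below):

```python
import sqlite3

# Open the generated database (assumes sdow.sqlite is in the current directory).
conn = sqlite3.connect('sdow.sqlite')

# Fetch the outgoing links for a given non-redirect page ID.
page_id = 8695  # Dr. Strangelove
row = conn.execute(
    'SELECT outgoing_links_count, outgoing_links FROM links WHERE id = ?',
    (page_id,),
).fetchone()

if row is not None:
    outgoing_links_count, outgoing_links = row
    # outgoing_links is a |-separated string of page IDs.
    outgoing_ids = [int(i) for i in outgoing_links.split('|')] if outgoing_links else []
    print(outgoing_links_count, outgoing_ids[:10])

conn.close()
```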
Generating the SDOW database from a dump of Wikipedia takes approximately one hour given the following instructions:
- Create a new Google Compute Engine instance from the `sdow-db-builder` instance template, which
  is configured with the following specs:
  - Name: sdow-db-builder-1
  - Zone: us-central1-c
  - Machine Type: n1-highmem-8 (8 vCPUs, 52 GB RAM)
  - Boot disk: 256 GB SSD, Debian GNU/Linux 8 (jessie)
  - Notes: Allow full access to all Cloud APIs. Do not use Debian GNU/Linux 9 (stretch) due to degraded performance.
- SSH into the machine:
$ gcloud compute ssh sdow-db-builder-1
- Install required operating system dependencies:
$ sudo apt-get -q update
$ sudo apt-get -yq install git pigz sqlite3
- Clone this directory via HTTPS:
$ git clone https://github.com/jwngr/sdow.git
- Move to the proper directory and create a new screen in case the VM connection is lost:
$ cd sdow/database/
$ screen  # And then press <ENTER> on the screen that pops up
- Run the database creation script, providing
an optional date for the backup:
$ (time ./buildDatabase.sh [<YYYYMMDD>]) &> output.txt
- Detach from the current screen session by pressing `<CTRL> + <a>` and then `<d>`. To reattach to
  the screen, run `screen -r`. Make sure to always detach from the screen cleanly so it can be
  resumed!
- Copy the script output and the resulting SQLite file to the `sdow-prod` GCS bucket:
$ gsutil -u sdow-prod cp output.txt gs://sdow-prod/dumps/<YYYYMMDD>/
$ gsutil -u sdow-prod cp dump/sdow.sqlite.gz gs://sdow-prod/dumps/<YYYYMMDD>/
- Run the Wikipedia facts queries and update the corresponding JSON file.
- Delete the VM to prevent incurring large fees.
- Create a new Google Compute Engine instance from the `sdow-web-server` instance template, which
  is configured with the following specs:
  - Name: sdow-web-server-1
  - Zone: us-central1-c
  - Machine Type: f1-micro (1 vCPU, 0.6 GB RAM)
  - Boot disk: 16 GB SSD, Debian GNU/Linux 8 (jessie)
  - Notes: Click "Set access for each API" and use default values for all APIs except set Storage to "Read Write". Do not use Debian GNU/Linux 9 (stretch) due to degraded performance.
- SSH into the machine:
$ gcloud compute ssh sdow-web-server-1
- Install required operating system dependencies to run the Flask app:
$ sudo apt-get -q update
$ sudo apt-get -yq install git pigz sqlite3 python-pip
$ sudo pip install --upgrade pip setuptools virtualenv
# OR for Python 3:
#$ sudo apt-get -q update
#$ sudo apt-get -yq install git pigz sqlite3 python3-pip
#$ sudo pip3 install --upgrade pip setuptools virtualenv
- Clone this directory via HTTPS and navigate into the repo:
$ git clone https://github.com/jwngr/sdow.git
$ cd sdow/
- Create and activate a new virtualenv environment:
$ virtualenv -p python2 env  # OR virtualenv -p python3 env
$ source env/bin/activate
- Install the required Python libraries:
$ pip install -r requirements.txt
- Copy the latest compressed SQLite file from the `sdow-prod` GCS bucket:
$ gsutil -u sdow-prod cp gs://sdow-prod/dumps/<YYYYMMDD>/sdow.sqlite.gz sdow/
- Decompress the SQLite file:
$ pigz -d sdow/sdow.sqlite.gz
- Create the `searches.sqlite` file:
$ sqlite3 sdow/searches.sqlite ".read database/createSearchesTable.sql"
  Note: Alternatively, copy a backed-up version of `searches.sqlite` from the `sdow-prod` GCS
  bucket at gs://sdow-prod/backups/<YYYYMMDD>/searches.sql.gz.
- Ensure the VM has been assigned SDOW's static external IP address.
- Install required operating system dependencies to generate an SSL certificate (this and the
  following instructions are based on these blog posts):
$ echo 'deb http://ftp.debian.org/debian jessie-backports main' | sudo tee /etc/apt/sources.list.d/backports.list
$ sudo apt-get -q update
$ sudo apt-get -yq install nginx
$ sudo apt-get -yq install certbot -t jessie-backports
- Add this `location` block inside the `server` block in /etc/nginx/sites-available/default:
location ~ /.well-known {
  allow all;
}
- Start NGINX:
$ sudo systemctl restart nginx
- Ensure the server has the proper static IP address (`sdow-web-server-static-ip`) by editing it on
  the GCP console if necessary.
- Create an SSL certificate using Let's Encrypt's `certbot`:
$ sudo certbot certonly -a webroot --webroot-path=/var/www/html -d api.sixdegreesofwikipedia.com --email wenger.jacob@gmail.com
- Ensure auto-renewal of the SSL certificate is configured properly:
$ sudo certbot renew --dry-run
- Run `crontab -e` and add the following cron jobs to that file to auto-renew the SSL certificate,
  regularly restart the web server (to ensure it stays responsive), and back up the searches
  database weekly:
0 4 * * * /usr/bin/certbot renew --noninteractive --renew-hook "/bin/systemctl reload nginx" >> /var/log/le-renew.log
*/10 * * * * /home/jwngr/sdow/env/bin/supervisorctl -c /home/jwngr/sdow/config/supervisord.conf restart gunicorn
0 6 * * 0 /home/jwngr/sdow/database/backupSearchesDatabase.sh
- Generate a strong Diffie-Hellman group to further increase security (note that this can take a
couple minutes):
$ sudo openssl dhparam -out /etc/ssl/certs/dhparam.pem 2048
- Copy over the NGINX configuration, making sure to back up the original configuration:
$ sudo cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.backup
$ sudo cp ./config/nginx.conf /etc/nginx/nginx.conf
- Restart nginx:
$ sudo systemctl restart nginx
- Install the Stackdriver monitoring agent:
$ curl -sSO https://repo.stackdriver.com/stack-install.sh
$ sudo bash stack-install.sh --write-gcm
- Activate the virtualenv environment:
$ cd sdow/
$ source env/bin/activate
- Start the Flask web server via Supervisor, which runs Gunicorn:
$ cd config/
$ supervisord
- Use `supervisorctl` to manage the running web server:
$ supervisorctl status            # Get status of running processes
$ supervisorctl stop gunicorn     # Stop web server
$ supervisorctl start gunicorn    # Start web server
$ supervisorctl restart gunicorn  # Restart web server
  Note: `supervisord` and `supervisorctl` must either be run from the `config/` directory or be
  passed the configuration file via the `-c` argument; otherwise, they will return an obscure
  "http://localhost:9001 refused connection" error message.
  Note: Log output from `supervisord` is written to /tmp/supervisord.log and log output from
  `gunicorn` is written to /tmp/gunicorn-stdout---supervisor-<HASH>.log. Logs are also written to
  Stackdriver Logging.
Historical search results are stored in a separate `searches.sqlite` database which contains a
single `searches` table with the following schema:

- `source_id` - The page ID of the source page at which to start the search.
- `target_id` - The page ID of the target page at which to end the search.
- `duration` - How long the search took, in seconds.
- `degrees_count` - The number of degrees between the source and target pages.
- `paths_count` - The number of paths found between the source and target pages.
- `paths` - Stringified JSON representation of the paths of page IDs between the source and target pages.
- `t` - Timestamp when the search finished.
Search results are kept in a separate SQLite file to avoid locking the `sdow.sqlite` database as
well as to make it easy to update the `sdow.sqlite` database to a more recent Wikipedia dump.
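As an illustration, here is a minimal sketch that reads recent entries from `searches.sqlite` with
Python's `sqlite3` module; it assumes only the schema above (in particular, that `paths` is
stringified JSON and that `t` sorts chronologically):

```python
import json
import sqlite3

# A minimal sketch that assumes searches.sqlite uses the schema described above.
conn = sqlite3.connect('searches.sqlite')

# Fetch the ten most recent searches (assumes the t timestamp sorts chronologically).
rows = conn.execute(
    'SELECT source_id, target_id, duration, degrees_count, paths_count, paths '
    'FROM searches ORDER BY t DESC LIMIT 10'
).fetchall()

for source_id, target_id, duration, degrees_count, paths_count, paths in rows:
    # paths is a stringified JSON list of paths, each of which is a list of page IDs.
    first_path = json.loads(paths)[0] if paths_count else None
    print(f'{source_id} -> {target_id}: {paths_count} path(s) of {degrees_count} degree(s) '
          f'in {duration}s, e.g. {first_path}')

conn.close()
```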
Page titles in the `pages` table are stored in a sanitized format, as illustrated by the following
examples:

ID | Title | Sanitized Title |
---|---|---|
50899560 | 🦎 | 🦎 |
725006 | " | \" |
438953 | 4′33″ | 4′33″ |
32055176 | Λ-ring | Λ-ring |
11760 | F-110 Spectre | F-110_Spectre |
8695 | Dr. Strangelove | Dr._Strangelove |
337903 | Farmers' market | Farmers\'_market |
24781873 | Lindström (company) | Lindström_(company) |
54201777 | Disinformation (book) | Disinformation_(book) |
1514 | Albert, Duke of Prussia | Albert,_Duke_of_Prussia |
35703467 | "A," My Name is Alex - Parts I & II | \"A,\"\_My_Name_is_Alex_-_Parts_I_&_II |
54680944 | N,N,N′,N′-tetramethylethylenediamine | N,N,N′,N′-tetramethylethylenediamine |
24781871 | Jack in the Green: Live in Germany 1970–1993 | Jack_in_the_Green:_Live_in_Germany_1970–1993 |
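Based purely on the examples above, the sanitization appears to replace spaces with underscores and
backslash-escape quotes and apostrophes. Here is a rough sketch of that transformation; the
authoritative rules are whatever the database creation script actually applies:

```python
def sanitize_title(title):
    """Sanitize a page title the way the examples above suggest (an assumption,
    not necessarily the project's canonical implementation)."""
    return title.replace('"', '\\"').replace("'", "\\'").replace(' ', '_')


def unsanitize_title(sanitized_title):
    """Invert the sanitization for display purposes."""
    return sanitized_title.replace('_', ' ').replace('\\"', '"').replace("\\'", "'")


# Spot-check against rows from the table above.
assert sanitize_title('Dr. Strangelove') == 'Dr._Strangelove'
assert sanitize_title("Farmers' market") == "Farmers\\'_market"
assert sanitize_title('Albert, Duke of Prussia') == 'Albert,_Duke_of_Prussia'
```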
Below are some particularly interesting searches between source and target pages:

Source Page Title | Target Page Title | Notes |
---|---|---|
Hargrave Military Academy | Illiosentidae | Cool graph |
Arthropod | Haberdashers' Aske's Boys' School | Cool graph |
AC power plugs and sockets | Gymnobela abyssorum | 1,311 paths of 6 degrees |
Nyctipolus | Philemon Quaye | 2,331 paths of 6 degrees |
Six Degrees of Kevin Bacon | Phinney | Only 6 paths, but of 6 degrees |
Erlang (programming language) | Barbra Streisand | 2,274 paths of 4 degrees |
Lion Express | Phinney | 1,246 paths of 9 degrees |
2016 French Open | Brachmia melicephala | 11 paths of 6 degrees |
- Six Degrees of Wikipedia by Stephen Dolan.
- The general concept of six degrees of separation.
- Wikiracing and its implementations in apps and websites such as The Wiki Game by Alex Clemesha and WikiRace by 2pages.
See the contribution page for details.