This repository contains an example use case of multiprocessing in Python.
It uses the JSTOR Shakespeare citational dataset, which you can download here: https://rice.box.com/s/z9o1f09bf4n4nvkwl794xcw7hcvni5da (too large for my free GitHub account)
The basic outline of the project is this:
- We are given a list of citations of Shakespeare play lines in JSTOR articles
- We want to transform this into a co-citational list that shows us which lines are cited alongside which other lines in which articles
- We want to do this quickly, so we break the work up and spread it over an arbitrary number of processes
- We want to maximize our efficiency, so we include a few performance testing scripts
- Extra credit: turn this into a recommendation engine!
Files & functions
SOURCE DATABASE
- shakespeare.db -- download from https://rice.box.com/s/z9o1f09bf4n4nvkwl794xcw7hcvni5da -- an sqlite database that contains 3 tables
- articles: basic metadata on 71,639 JSTOR articles that quote Shakespeare
- play_lines: basic metadata on the 94,049 known Shakespearean play lines quoted by those articles
- matches: 623,428 matches (one to one) of articles to play lines
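For orientation, here is a minimal sqlite3 sketch of that three-table layout, built as an in-memory stand-in with a couple of toy rows. The column names here are assumptions inferred from this README, not the actual shakespeare.db schema, which carries fuller metadata:

```python
import sqlite3

# In-memory stand-in for shakespeare.db (column names are assumptions)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE articles   (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE play_lines (id INTEGER PRIMARY KEY, play TEXT, line TEXT);
CREATE TABLE matches    (article_id INTEGER, line_id INTEGER);
""")

# One toy article quoting two play lines, i.e. two one-to-one matches
cur.execute("INSERT INTO articles VALUES (1, 'On Hamlet')")
cur.executemany("INSERT INTO play_lines VALUES (?, ?, ?)",
                [(1, "Hamlet", "To be, or not to be"),
                 (2, "Hamlet", "The rest is silence")])
cur.executemany("INSERT INTO matches VALUES (?, ?)", [(1, 1), (1, 2)])

counts = {t: cur.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
          for t in ("articles", "play_lines", "matches")}
print(counts)  # {'articles': 1, 'play_lines': 2, 'matches': 2}
conn.close()
```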
DESTINATION DATABASE
- ariel.db -- a template into which we will pour our transformed data -- an sqlite database that contains 3 tables
- docs: corresponds directly to the "articles" table in shakespeare.db
- lines: the complete Shakespearean dramatic corpus from the Folger library, line by line (114,985 entries)
- lines_and_docs_matches: pairs of co-cited lines and the citing article
SCRIPTS
- shakespeare.py
- input: the row id in shakespeare.db for a play line
- output: an array of co-citations ready for ariel.db, each in the form (source_line,target_line,citing_article,boolean)
- transformational quirk: an artifact of the process is that, when run through the whole dataset, it produces 2 copies of the same data -- we use the boolean slot to separate those into 2 streams, rather than deleting them, because the duplication is great for validation
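The co-citation step can be sketched like this. The function name, the match format, and the ordering-based boolean flag are illustrative assumptions, not the actual shakespeare.py code -- the real script's flag just needs to separate the two symmetrical copies into two streams:

```python
def co_citations(source_line, matches):
    """Hedged sketch: matches is a list of (article_id, line_id) pairs.
    Returns (source_line, target_line, citing_article, flag) tuples."""
    # Articles that cite the source line
    citing = {a for (a, l) in matches if l == source_line}
    out = []
    for a, target in matches:
        if a in citing and target != source_line:
            # flag=True when source < target; over the full dataset the
            # mirrored pair arrives with flag=False, giving two streams
            out.append((source_line, target, a, source_line < target))
    return out

# Toy data: article 101 cites lines 1 and 2; article 202 cites lines 2 and 3
toy = [(101, 1), (101, 2), (202, 2), (202, 3)]
print(co_citations(2, toy))
# [(2, 1, 101, False), (2, 3, 202, True)]
```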
- multiproc_on_shakespeare.py
- options:
- -m = the number of lines to look up co-citations for. m=0 (default) runs through the full set of 94,049 lines
- -n = the number of processes to spread the work over
- it creates intermediary csv files to store the different processes' work in
- it creates ariel-a.db and ariel-b.db for the symmetrical outputs, injects the csv data into these, and deletes the intermediary csv files
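The fan-out-and-merge pipeline above can be sketched as follows. For clarity this runs the workers sequentially in one process (the real script spreads them over -n processes), and the round-robin split, file names, and table layout are all assumptions rather than the actual multiproc_on_shakespeare.py internals:

```python
import csv
import os
import sqlite3
import tempfile

def split(ids, n):
    # Round-robin split of line ids across n workers (an assumption;
    # the real script may chunk contiguously instead)
    return [ids[i::n] for i in range(n)]

def worker(chunk, path):
    # Stand-in for one process's share of the work; the real worker
    # would run the shakespeare.py lookup for each line id
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        for line_id in chunk:
            w.writerow([line_id, line_id + 1, 42, int(line_id % 2 == 0)])

tmp = tempfile.mkdtemp()
chunks = split(list(range(6)), 2)
paths = [os.path.join(tmp, f"part-{i}.csv") for i in range(len(chunks))]
for c, p in zip(chunks, paths):
    worker(c, p)  # sequential here; the real script forks these out

# Inject the csv rows into the two symmetrical databases (in-memory
# stand-ins for ariel-a.db / ariel-b.db), then delete the intermediaries
dbs = {0: sqlite3.connect(":memory:"), 1: sqlite3.connect(":memory:")}
for db in dbs.values():
    db.execute("CREATE TABLE lines_and_docs_matches (src INT, tgt INT, doc INT)")
for p in paths:
    with open(p, newline="") as f:
        for src, tgt, doc, flag in csv.reader(f):
            dbs[int(flag)].execute(
                "INSERT INTO lines_and_docs_matches VALUES (?, ?, ?)",
                (src, tgt, doc))
    os.remove(p)

counts = [db.execute("SELECT COUNT(*) FROM lines_and_docs_matches").fetchone()[0]
          for db in dbs.values()]
print(counts)  # 6 toy rows split evenly by flag: [3, 3]
```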
The other scripts do some performance benchmarking to help you optimize the job.