This project consists of a few parts:
- A web scraper to gather data from the CJ IRIS database
- A differencing library that compares the data gathered across scrapes
- A streamlined pipeline for gathering and differencing data
- A web application for displaying and navigating data
My routine is fairly straightforward: I run `pipeline.py` to get the most current data, then I run `flaskapp_v1.py` to view the updated data via localhost.
If you decide to try running this, make sure to rename `sample_repo` to just `repo` (trust me, it's easier this way). You also need to download the most recent chromedriver here.
`run_scraper.sh` accesses CJ IRIS. Don't be alarmed: it may open a browser window and navigate its links on its own. The program doesn't need browser focus to work, so you can do other things while it runs.
`run_flaskapp.sh` compiles the data into a simple web application, which can be accessed at 127.0.0.1:5000. The home page includes an explanation of what you're looking at.
I don't have a requirements.txt or anything like that, so you'll have to configure your environment with Flask, Waitress, and Selenium.
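If you'd rather capture that in a file, a minimal `requirements.txt` along these lines should cover it (package names only; version pins are up to you):

```
flask
waitress
selenium
```

Then `pip install -r requirements.txt` sets up the environment in one step.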
`bpl_scraper.py` is by far the most heavily documented module. It should not need to be changed at all. If you want to call the scraper module directly instead of going through `pipeline.py`, the method you want is `scrape_all_data`.
`diffy.py` contains the ORMs used to manage and compare the data gathered by the scraper. The end product is `DiffEntryHistory`, which tracks all changes over the lifespan of the archive. You can access an alternate repository with `DiffEntryHistory.open(directory)`. The directory I use is `repo/diff`.
`pipeline.py` consists primarily of the `scrape_and_diff_today_from_yesterday` method. If you specify the `directory` parameter, you can scrape into an alternate repository.
`flaskapp_v1.py` contains the implementation of a simple web application for navigating the data. It only works for a repo named `repo`, so if you want to change that you will have to edit this file. But it's just easier to give the repo the proper name in the first place, I promise.
Let's say your archive has grown to a few hundred differences. Maybe it's making the data unwieldy to look at, or maybe it's taking a bit too long to load.
The solution is to rebase your archive. To start, select and open the `DiffEntryHistory` you want to rebase:

```python
deh = DiffEntryHistory.open("path/to/diff")
```
Let's say you want to reduce your archive by 200 days. Then call the rebase function like this:

```python
deh.rebase(200)
```
And voilà, your archive has been rebased! This doesn't delete any files, and it also produces a `_meta_old.txt` file containing the pre-rebase data in case you want to undo.
If you want to automate this process, you could do something like this:

```python
if len(deh) > 365:
    deh.rebase(len(deh) - 365)
```
This will ensure your archive is always a reasonable size. It is ultimately up to you how much data you want the app to handle.
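To see the effect of that check end to end without touching a real archive, here is a minimal stand-in sketch. `FakeHistory` is entirely hypothetical: it just assumes, as the examples above suggest, that `len()` counts archived days and that `rebase(n)` drops the oldest `n` entries.

```python
class FakeHistory:
    """Hypothetical stand-in for DiffEntryHistory, for illustration only."""

    def __init__(self, days):
        self.entries = list(range(days))  # one entry per archived day

    def __len__(self):
        return len(self.entries)

    def rebase(self, n):
        # Assumed semantics: discard the oldest n entries.
        self.entries = self.entries[n:]

deh = FakeHistory(500)          # pretend the archive holds 500 days
if len(deh) > 365:
    deh.rebase(len(deh) - 365)  # trim it back down to 365
print(len(deh))                 # → 365
```

The real `rebase` also writes the `_meta_old.txt` backup described above, which this toy version skips.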