Deepform

Experimental form data extraction for journalism

Deepform is a project to extract information from TV and cable political advertising disclosure forms using deep learning. This public data, maintained by the FCC, is valuable to journalists but locked in PDFs. Our goal is to provide the 2020 dataset for NLP/AI researchers and to make our method available to future data scientists working in this field. Past projects have managed to produce similar data sets only with great manual effort or by addressing only the most common form types, ignoring the tail of hundreds of rare form types. This work uses deep learning models that are able to generalize over form types and "learn" how to find five fields:

  • Contract number (multiple documents can share the same number, since a contract for future air dates may be revised)
  • Advertiser name (often the name of a political committee, but not always)
  • Start and end air dates (often known as "flight dates")
  • Total amount paid for the ads

The initial attempt to use deep learning for this work, by Jonathan Stray in summer 2019, achieved 90% accuracy extracting total spending from the PDFs in the (held-out) test set, which shows that deep learning can generalize surprisingly well to previously unseen form types. For a discussion of how the 2019 prototype works, see this post.

Why?

This project is timely and relevant for a variety of reasons, some of them pertaining to this particular dataset and others to the method we are following.

Election transparency is an increasingly important component of the US electoral process, and making this data available to journalists at low or no cost is key to that transparency. Because the data is archived in tens of thousands of non-machine-readable PDF files in hundreds of different formats, it is beyond the capacity of journalistic entities to extract it by hand in a useful way. The data is available for purchase from private entities, but journalists we interviewed mentioned that it comes with a price tag of $100K or more per newspaper that wishes to use it.

Past projects have used volunteer labor or hand-coded form layouts to produce usable datasets. Project Deepform replicates this data extraction using modern deep learning techniques. This is desirable because not only are we positioned to produce a usable dataset in the context of the 2020 election, but the method will also be available to our team and other data science teams in the run-up to future US elections.

For our own purposes as members of the investigative data science community, Project Deepform functions as an open source springboard for future form extraction projects. Projects of this kind have become widely popular as the tools needed to make this work possible have improved over the past half decade. The general problem is known as "knowledge base construction" in the research community, and the current state of the art is achieved by multimodal systems such as Fonduer. A group at Google released a paper earlier in 2020 describing a related process, Google also offers Cloud Document AI, and others have made progress using graph convolutional networks.

Finally, we have packaged this project's dataset and goals as a benchmark project on Weights & Biases, where other data scientists are encouraged to improve on the baseline success rates we have attained.

Setting up the Environment

The project is primarily intended to be run with Docker, which eases issues with Python virtual environments, but it can also be run locally -- this is easiest to do with Poetry.

Docker

To use Docker, you'll have to be running the daemon, which you can find and install from https://www.docker.com/products/docker-desktop. Fortunately, that's all you need.

The project has a Makefile that covers most of the things you might want to do with the project. To get started, simply run

make train

or see below for other commands.

Poetry - dependency management and running locally

Deepform manages its dependencies with Poetry, which you only need if you want to run it locally or alter the project dependencies. You can install Poetry using any of the methods listed in their documentation.

If you want to run Deepform locally:

  • run poetry install to install the deepform package and all of its dependencies into a fresh virtual environment
  • enter this environment with poetry shell
  • or run a one-off command with poetry run <command>

Since deepform is an installed package inside the virtual environment Poetry creates, run the code as modules, e.g. python -m deepform.train instead of python deepform/train.py -- this ensures that imports and relative paths work the way they should.

To update project dependencies:

  • poetry add <package> adds a new python package as a requirement
  • poetry remove <package> removes a package that's no longer needed
  • poetry update updates all the dependencies to their latest non-conflicting versions

These three commands alter pyproject.toml and poetry.lock, which should be committed to git. Using them ensures that our project has reproducible builds.

Training Data

Getting the Training Data

Running make train will acquire all the data you need and train the model. The total training data for this project consists of three label manifests (discussed below in detail) and 20,000 .parquet files containing the tokens and geometry from the PDFs used in training. make train automatically runs, in sequence, a series of commands that acquire, restructure, and label the training data. These commands can also be run manually, in the same sequence.

  1. make data/tokenized downloads all the unlabeled .parquet files (training and test) from an S3 bucket to the folder data/tokenized.

  2. make data/token_frequency.csv constructs a vocabulary of tokens from all these .parquet files.

  3. make data/3_year_manifest.csv combines three label manifests from three different election years (2012, 2014 and 2020) into a single manifest (data/3_year_manifest.csv) and includes a column 'year' to differentiate between the three years' data.

  4. make data/doc_index.parquet uses the unlabeled .parquet files in the folder data/tokenized along with 3_year_manifest.csv (already in the repo) to generate a new set of labeled .parquet files in the folder data/training, containing the tokens and geometry along with a new column for each of the five target fields. Each of these columns stores the match percentage between that token and the target in question. This script also computes other relevant features, such as whether the token is a date or a dollar amount, which are fed into the model as additional features. Some targets are more than one token in length, so in these cases the column contains the likelihood that each token is a member of the target token string.

This multi-token matching process receives a value for the maximum number of tokens (n) that might match the target ("Obama For America" is 3 tokens long, while "1/12/2020" is one token long). Due to OCR errors, some dates and dollar amounts are more than one token in length. We then calculate a match percentage for all strings of tokens of lengths (n, n-1, ..., 1). The highest match is achieved when the number of tokens is correct and the tokens match the target from the label manifest. Finally, since each token will participate in many match attempts, each token is assigned the match percentage of the highest match it participated in. This table shows how "Obama for America" might be found; a code sketch of the same idea follows the table.

token      n=1   n=2   n=3   n=4   n=5   ...
contract   .1    .2    .2    .2    .1    ...
obama      .7    .6    .5    .4    .3    ...
$45,000    .03   .6    .5    .3    .65   ...
committee  .1    .6    .4    .75   .65   ...
obama      .7    .8    1.0   .75   .65   ...
for        .5    .8    1.0   .75   .65   ...
america    .67   .81   1.0   .75   .65   ...
11/23/12   .03   .4    .4    .5    .6    ...
11/29/12   .03   .03   .2    .3    .2    ...
...
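
A minimal sketch of this matching idea in Python (not the project's exact implementation), using the standard library's difflib to score token n-grams against a target string:

from difflib import SequenceMatcher

def similarity(a, b):
    # Rough string similarity in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_match_scores(tokens, target, max_n=5):
    # For each token, keep the best score of any n-gram (n <= max_n) it belongs to.
    best = [0.0] * len(tokens)
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            score = similarity(" ".join(tokens[start:start + n]), target)
            for i in range(start, start + n):
                best[i] = max(best[i], score)
    return best

print(token_match_scores(["contract", "obama", "for", "america", "11/23/12"], "Obama For America"))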

Form of the training data

All the data (training and test) for this project was originally raw PDFs, downloadable from the FCC website, with up to 100,000 PDFs per election year. The training data consists of some 20,000 of these PDFs, drawn from three different election years (2012, 2014 and 2020) according to available labels (see below), and three label manifests.

The original PDFs were OCRed, tokenized, and turned into .parquet files, one for each PDF. The .parquet files are each named with the document slug and contain all of that document's tokens and their geometry on the page. Geometry is given in 1/100ths of an inch.

The .parquet files are formatted as "tokens plus geometry" like this:

473630-116252-0-13442821773323-_-pdf.parquet contains

page,x0,y0,x1,y1,token
0,272.613,438.395,301.525,438.439,$275.00
0,410.146,455.811,437.376,455.865,Totals
0,525.84,454.145,530.288,454.189,6
0,556.892,454.145,592.476,454.189,"$1,170.00"
0,18.0,480.478,37.998,480.527,Time
0,40.5,480.478,66.51,480.527,Period
...

The document name (the slug) is a unique document identifier, ultimately from the source TSV. The page number runs from 0 to 1, and the bounding box is in the original PDF coordinate system. The actual token text is reproduced as token.
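
For example, one of these files can be inspected with pandas (assuming pyarrow or fastparquet is installed; the filename is the one shown above):

import pandas as pd

# Columns: page, x0, y0, x1, y1, token (geometry in 1/100ths of an inch)
tokens = pd.read_parquet("data/tokenized/473630-116252-0-13442821773323-_-pdf.parquet")
print(tokens.head())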

These .parquet files still lack labels, however. Labels are provided in three "label manifests" for the three election years (2012, 2014 and 2020), each of which is a .csv or .tsv containing a column of file IDs (called slugs) and columns containing labels for each of the fields of interest for each document. Each year has a slightly different set of extracted fields, sometimes including additional extracted fields not used by the model in this repo. All three manifests are combined in data/3_year_manifest.csv. All three label manifests and the combined manifest are available in the data folder. If they are not present, they can be recovered from various sources as detailed below.

Using the labels in 3_year_manifest.csv and the 20,000 unlabeled token files, labeled token files are produced in the folder data/training which have the following form. These are the training data as provided to the model.

page	x0	y0	x1	y1	token	contract_num	advertiser	flight_from	flight_to	gross_amount	tok_id	length	digitness	is_dollar	log_amount	label
0	18	17.963	48.232	26.899	Contract	0	0.27	0	0	0	53	8	0	0	0	0
0	50.456	17.963	89.584	26.899	Agreement	0	0.33	0	0	0	115	9	0	0	0	0
0	474.001	17.963	505.137	26.899	1/15/20	0.4	0.26	0.38	0.88	0.22	0	8	0.75	0	0	0
0	414.781	65.213	445.917	74.149	1475302	1	0.26	0.4	0.27	0.67	0	7	1	1	14.204374	1
0	495.842	65.213	550.978	74.149	WOC12348242	0.33	0.26	0.32	0.32	0.19	663	11	0.72727275	0	0	0
0	183.909	90.193	298.949	101.363	www.gray.tv/advertising	0	0.58	0.06	0.06	0.06	1796	23	0	0	0	0
0	309.002	90.923	326.786	99.859	Mike	0	1	0	0	0	664	4	0	0	0	2
0	329.01	90.923	371.234	99.859	Bloomberg	0	1	0	0	0	821	9	0	0	0	2
0	373.458	90.923	393.474	99.859	2020,	0.33	1	0.31	0.46	0.67	0	5	0.8	0	0	2
0	395.698	90.923	407.258	99.859	Inc	0	1	0	0	0	166	3	0	0	0	2
0	491.041	90.683	522.177	99.619	12/31/19	0.27	0.74	0.88	0.5	0.22	0	8	0.75	0	0	0
0	308.251	103.463	338.483	112.399	Contract	0	0.24	0	0	0	53	8	0	0	0	0
0	340.707	103.463	361.603	112.399	Dates	0	0.23	0	0	0	18	5	0	0	0	0
0	407.251	103.463	438.371	112.399	Estimate	0	0.26	0	0	0	23	8	0	0	0	0
0	308.251	115.703	339.387	124.639	12/30/19	0.4	0.26	1	0.5	0.33	0	8	0.75	0	0	3
0	346.499	115.703	377.635	124.639	1/12/20	0.27	0.21	0.5	1	0.22	0	8	0.75	0	0	4
...

N.B. As currently written, the model trains only on the one thousand documents of 2020 data.
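
The hand-crafted feature columns in the table above (length, digitness, is_dollar, log_amount) could be computed along these lines; this is an illustrative sketch, and the exact definitions in the repo may differ:

import math
import re

DOLLAR_RE = re.compile(r"^\$?\d{1,3}(,\d{3})*(\.\d{2})?$")  # e.g. "$1,170.00"

def token_features(token):
    # Illustrative versions of the hand-crafted per-token features.
    digits = sum(c.isdigit() for c in token)
    is_dollar = bool(DOLLAR_RE.match(token))
    amount = float(token.replace("$", "").replace(",", "")) if is_dollar else 0.0
    return {
        "length": len(token),
        "digitness": digits / len(token) if token else 0.0,
        "is_dollar": int(is_dollar),
        "log_amount": math.log(amount) if amount > 0 else 0.0,
    }

print(token_features("$1,170.00"))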

Where the labels come from

2012 Label Manifest

In 2012, ProPublica ran the Free The Files project (you can read how it worked) and hundreds of volunteers hand-entered information for over 17,000 of these forms. That data drove a bunch of campaign finance coverage and is now available from their data store.

The label manifest for 2012 data was downloaded from ProPublica and is located at data/2012_manifest.tsv (renamed from ftf-all-filings.tsv, the filename it downloads as). If the manifest is not present, it can be recovered from their website. This file contains the crowdsourced answers for some of our targets (omitting flight dates) and the PDF url.

2014 Label Manifest

In 2014 Alex Byrnes automated a similar extraction by hand-coding form layouts. This works for the dozen or so most common form types but ignores the hundreds of different PDF layouts in the long tail.

The label manifest for 2014 data, acquired from Alex's GitHub, is data/2014_manifest.tsv. If the manifest is not present, it can be recovered from his GitHub (renamed from 2014-orders.tsv, the filename it downloads as). This file contains the automatically extracted answers for some of our targets (omitting 'gross amount').

2020 Label Manifest

All 2020 PDFs

PDFs for the 2020 political ads and associated metadata were uploaded to Overview Docs. To collect the PDFs, the file names were pulled from the FCC API OPIF file search using the search terms: order, contract, invoice, and receipt. The search was run with filters for campaign year set to 2020 and source service code set to TV.

The FCC API search also returns the source service code (entity type, i.e. TV, cable), entity id, callsign, political file type (political ad or non-candidate issue ad), office type (presidential, senate, etc.), Nielsen DMA rank (TV market area), network affiliation, and the timestamps for when the ad was created and last modified. These were added to the Overview document set along with the search term, the URL for the FCC download, and the date of the search.

For these PDFs, the following steps were followed to produce training data (a rough sketch of this pipeline follows the list):

  • Convert the PDF to a series of images
  • Determine the angle of each page and rotate if needed
  • Use Tesseract to OCR each image
  • Upload the processed PDF to an S3 bucket and add its URL to Overview
  • Upload additional metadata on whether OCR was needed, the original angle of each page, and any errors that occurred during the OCR process
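
A rough sketch of that pipeline, assuming pdf2image (which requires poppler) and pytesseract; the actual scripts used may differ:

from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(path):
    # Convert a PDF to page images, straighten each page, and OCR it.
    pages = []
    for image in convert_from_path(path, dpi=300):
        # Detect page orientation and rotate upright if needed.
        osd = pytesseract.image_to_osd(image, output_type=pytesseract.Output.DICT)
        angle = osd.get("rotate", 0)
        if angle:
            image = image.rotate(-angle, expand=True)
        pages.append(pytesseract.image_to_string(image))
    return pages

text_by_page = ocr_pdf("data/PDFs/example.pdf")  # example path
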
A Subset for Training

A sample of 1000 documents was randomly chosen for hand labeling as 2020 training data.

The label manifest for 2020 data is data/2020_manifest.csv (renamed from fcc-data-2020-sample-updated.csv which is the filename it downloads as). If the manifest is not present, it can be recovered from this overview document set. This file contains our manually entered answers for all of our five targets for the 1000 randomly chosen documents.

Where the PDFs and token files come from

Acquiring .parquet files directly

The easiest way to get this data is to acquire the 20,000 .parquet files containing the tokens and geometry for each PDF in the training set directly. The token files are downloaded from our S3 bucket by running make data/tokenized. If you run make train, the program will automatically run make data/tokenized, as it is a dependency of make train. These .parquet files are then located in the folder data/tokenized.

Acquiring Raw PDFs

To find the original PDFs, it is always possible to return to the FCC website and download them directly using the proper filters (search terms: order, contract, invoice, and receipt; filters: campaign year = 2020, source service code = TV). The 2012, 2014 and 2020 PDFs used by ProPublica, Alex Byrnes, and ourselves to create the three label manifests can each also be found in a different location, as follows:

2012 Training PDFs

90% of the original PDFs from the Free the Files project are available on DocumentCloud and can be recovered by running 'curl' on url = 'https://documentcloud.org/documents/' + slug + '.pdf'. These PDFs can also be found in this folder. If you download PDFs from one of these sources, place them in the folder data/PDFs.
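
As a hypothetical example of that in Python (the slug below is a placeholder; real slugs come from the 2012 manifest):

import requests

slug = "some-document-slug"  # placeholder; use a slug from data/2012_manifest.tsv
url = "https://documentcloud.org/documents/" + slug + ".pdf"
response = requests.get(url, timeout=30)
response.raise_for_status()
with open("data/PDFs/" + slug + ".pdf", "wb") as f:
    f.write(response.content)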

2014 Training PDFs

Alex Byrnes' GitHub directs users back to the FCC website to get his data; he does not host it separately. The PDFs are also available in this Google Drive folder. If you download PDFs from one of these sources, place them in the folder data/PDFs.

2020 Training PDFs

The one thousand 2020 PDFs we hand labeled are available on Overview Docs as this dataset.

These PDFs can also be acquired from the FCC database by running make data/pdfs. This command will place all the PDFs associated with the 2020 training data in the folder data/PDFs.

Converting Raw PDFs to .parquet files

If you have a set of PDF files located in data/PDFs and would like to tokenize them, you can run a rule in the Makefile which is typically commented out. Uncomment the data/tokenized: data/pdfs rule and the associated lines below it, and comment out the other rule called data/tokenized. This command will create the folder data/tokenized containing the .parquet files of tokens and geometry corresponding to each of the PDFs in data/PDFs.
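
If you'd rather experiment outside the Makefile, the token-and-geometry extraction could look roughly like this, using pdfplumber as an assumed tokenizer (the repo's own pipeline may use a different tool, and its geometry is in 1/100ths of an inch rather than PDF points):

import pandas as pd
import pdfplumber

def tokenize_pdf(pdf_path, parquet_path):
    # Extract each word and its bounding box, one row per token.
    rows = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages):
            for word in page.extract_words():
                rows.append({
                    "page": page_number,
                    "x0": word["x0"],
                    "y0": word["top"],
                    "x1": word["x1"],
                    "y1": word["bottom"],
                    "token": word["text"],
                })
    pd.DataFrame(rows).to_parquet(parquet_path)

tokenize_pdf("data/PDFs/example.pdf", "data/tokenized/example.parquet")  # example paths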

Training

How the model works

The easiest fields are contract number and total. The model is a fully connected three-layer network trained on a window of tokens from the data, typically 20-30 tokens. Each token is hashed to an integer mod 1000, then converted to a 1-hot representation and embedded into 64 dimensions. This embedding is combined with geometry information (bounding box and page number) and also some hand-crafted "hint" features, such as whether the token matches a regular expression for dollar amounts. For details, see the talk.

We also incorporate custom "hint" features. For example, the total extractor uses an "amount" feature that is the log of the token value, if the token string is a number.
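
A highly simplified Keras sketch of this kind of architecture; the window size, layer widths, and feature count here are illustrative rather than the repo's exact configuration:

import tensorflow as tf

WINDOW = 25      # tokens per training window (the text says 20-30)
VOCAB = 1000     # tokens are hashed to an integer mod 1000
N_FEATURES = 7   # geometry plus hand-crafted hint features per token (illustrative count)

token_ids = tf.keras.Input(shape=(WINDOW,), dtype="int32")   # hashed token ids
features = tf.keras.Input(shape=(WINDOW, N_FEATURES))        # geometry and hint features

embedded = tf.keras.layers.Embedding(VOCAB, 64)(token_ids)   # 64-dimensional token embedding
x = tf.keras.layers.Concatenate()([embedded, features])
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(256, activation="relu")(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
scores = tf.keras.layers.Dense(WINDOW, activation="sigmoid")(x)  # per-token match score

model = tf.keras.Model([token_ids, features], scores)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()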

Running in Docker

  • make test to run all the unit tests for the project
  • make docker-shell will spin up a container and drop you into a bash shell after mounting the deepform folder of code, so that commands you run there reflect the code as you are editing it.
  • make train runs deepform/train.py with the default configuration. If necessary, it will download and preprocess the data it needs to train on.
  • make test-train runs the same training loop on the same data, but with greatly reduced settings (just a few documents for a few steps), so it can be used to check that everything actually works.
  • make sweep runs a hyperparameter sweep with Weights & Biases, using the configuration in sweep.yaml

Some of these commands require an .env file located at the root of the project directory.

If you don't want to use Weights & Biases, you can turn it off by setting use_wandb=0. You'll still need an .env file, but it can be empty.

Running Locally using Poetry

For each of the above commands, rather than running the make command (which automatically runs in Docker), run the python command that the make command wraps. E.g., rather than running make test-train, run python -um deepform.train --len-train=100 --steps-per-epoch=3 --epochs=2 --log-level=DEBUG --use-wandb=0 --use-data-cache=0 --save-model=0 --doc-acc-max-sample-size=20 --render-results-size=3

Code quality and pre-commit hooks

The code is currently automatically formatted with black, uses autoflake to remove unused imports, isort to sort them, and flake8 to check for PEP8 violations. These tools are configured in pyproject.toml and should Just Work™ -- you shouldn't have to worry about them at all once you install them.

To make this as painless as possible, .pre-commit-config.yaml contains rules for automatically running these tools as part of git commit. To turn these git pre-commit hooks on, run pre-commit install (after installing it and the above libraries with Poetry or pip). After that, whenever you run git commit, these tools will run and clean up your code so that "dirty" code never gets committed in the first place.

GitHub runs a "python build" Action whenever you push new code to a branch (configured in python-app.yml). This also runs black, flake8, and pytest, so it's best to just make sure things pass locally before pushing to GitHub.

Looking Forward

This is a difficult data set that is very relevant to journalism, and improvements in technique will be immediately useful to campaign finance reporting.

Our next steps include additional pre-processing steps to rotate improperly scanned documents and to identify and separate concatenated documents. The default parameter settings we are using are fairly good but can likely be improved further. We have leads on additional training data produced via hand-labeling in a couple of related projects, which we are hoping to incorporate. We believe there is potential here for some automated training data creation. Finally, we are not at present making use of the available 2012 and 2014 training data, and this data may be able to dramatically improve model accuracy.

We would love to hear from you! Contact jstray on Twitter or through his blog.