Exploratory project using data from:
- U.S. legal code
- ProPublica Congress API
- congress.gov
- https://www.govtrack.us/developers/data
See also: congressapi
Interactive results will be hosted at us-legislation-data.appspot.com
- https://www.aclweb.org/anthology/D/D16/D16-1221.pdf
- Political Science Dept. at MIT: http://cwarshaw.scripts.mit.edu/papers/CandidatePositions160810.pdf
Use congressdb.convert2mongod to import bulk Congress data into a MongoDB database. This vastly simplifies the process of generating training data sets later.
Use congressdb to build a dataset of the bills introduced in the House during the 114th Congress:
python -m congressdb.build --src=/data/congress --output=house-introduced-114 \
--type=hr --version=is --congress=114
This will create a directory, house-introduced-114, with training, validation, and test data splits and a vocab file. See congressdb/build.py for more info.
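As a rough illustration of what ends up in such a directory, the sketch below splits a list of documents into training, validation, and test sets and builds a vocabulary from the training split. It is not the code in congressdb/build.py: the function names, the 80/10/10 ratios, and the lowercased whitespace tokenization are all assumptions made for the example.

```python
import random
from collections import Counter


def split_documents(docs, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle docs and split them into (train, valid, test) lists.

    Hypothetical helper; the fractions are assumed, not taken from build.py.
    """
    docs = list(docs)
    random.Random(seed).shuffle(docs)
    n_valid = int(len(docs) * valid_frac)
    n_test = int(len(docs) * test_frac)
    valid = docs[:n_valid]
    test = docs[n_valid:n_valid + n_test]
    train = docs[n_valid + n_test:]
    return train, valid, test


def build_vocab(docs, min_count=1):
    """Return tokens sorted by descending frequency, ties broken alphabetically."""
    counts = Counter(token for doc in docs for token in doc.lower().split())
    kept = [(tok, c) for tok, c in counts.items() if c >= min_count]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [tok for tok, _ in kept]


if __name__ == "__main__":
    bills = ["A bill to amend title %d" % i for i in range(20)]
    train, valid, test = split_documents(bills)
    print(len(train), len(valid), len(test))
    print(build_vocab(train)[:5])
```

Building the vocabulary from the training split only keeps validation and test tokens from leaking into the model's input space.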
A known-good set of hyperparameters is in hparams.yaml.
python -m lm.train --data_path=house-introduced-114 --model_dir=/tmp/house-model --hparams=hparams.yaml
This will start training a model using the dataset we just generated. Snapshots of the model will go to /tmp/house-model.
To visualize/monitor the training process, start TensorBoard pointed at the model directory.
tensorboard --logdir=/tmp/house-model
To create a sample from the language model, using the latest snapshot:
python -m lm.generate --model_dir=/tmp/house-model --data_path=house-introduced-114 \
--hyperparams=hparams.yaml --max_length=1000 --temp=1.1 \
--output=sample.txt
Omitting the --output flag will print to stdout.
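Assuming --temp is a softmax sampling temperature (the usual meaning, though it is not spelled out here), the sketch below shows how such a value reshapes the distribution a token is drawn from. It is a standalone illustration, not the code in lm.generate; the function name and the plain-list logits are hypothetical.

```python
import math
import random


def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample an index from unnormalized log-probabilities scaled by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    # Walk the cumulative weights until we pass a uniform random threshold.
    threshold = rng.random() * total
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        if threshold < cumulative:
            return index
    return len(weights) - 1  # guard against floating-point round-off
```

Under this reading, a temperature above 1 (such as the 1.1 used above) flattens the distribution slightly and gives more varied samples, while values below 1 make the output more conservative.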
python -m lm.evaluate --model_dir=/tmp/house-model --data_path=house-introduced-114 \
--hparams=hparams.yaml
Using the hyperparameters in hparams, training for 274K iterations (~9 epochs), we end up with a test set perplexity of 13.1.
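For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-probability the model assigns to each token in the test set. The sketch below is a generic definition, not the code in lm.evaluate, and it assumes natural-log probabilities; which token unit the 13.1 figure is measured over is not stated here.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that is always choosing uniformly among 13 tokens has perplexity 13.
    print(perplexity([math.log(1.0 / 13)] * 50))
```

Read this way, a perplexity of 13.1 means the model is on average about as uncertain as if it were picking uniformly among roughly 13 tokens at each step.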
The model generates clauses of legislation that look like this:
(2) Preservation of actions.-- The guidelines submitted to determine all right of any action shall be resolved in the federal register on the final patent repayment plan, a imputed known as a national examiner, a media order, contact authority, and information it determines that a portion of the exemption from tax is allocated. Such center shall not receive such transfers for the total amount of payment of funds with respect to work eligibility under the alternative limit by reason of section 408b.
Compare that to a real snippet from hr1347/text-versions/ih/document.txt:
(2) Preservation of records.--The State shall ensure that the records of the independent redistricting commission are retained in the appropriate State archive in such manner as may be necessary to enable the State to respond to any civil action brought with respect to Congressional redistricting in the State.
CAVEAT: Spacing around punctuation symbols was fixed manually. Capitalization is stripped in the original model's output and was re-introduced by hand above.
Real snippet found by searching text for "Preservation of".
Fun Fact: The phrase "Preservation of actions" does not show up in any bill introduced by the 114th House of Representatives.
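A search like the one described above can be sketched as follows. This is an illustrative script, not something shipped in the repo: the function name is hypothetical, and the case-insensitive, whitespace-normalized matching is an assumption. The document.txt filename is taken from the path of the real snippet.

```python
import io
import os


def find_phrase(root, phrase, filename="document.txt"):
    """Yield paths of files named `filename` under `root` whose text contains `phrase`."""
    # Normalize case and whitespace so phrases wrapped across lines still match.
    needle = " ".join(phrase.lower().split())
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename not in filenames:
            continue
        path = os.path.join(dirpath, filename)
        with io.open(path, encoding="utf-8", errors="replace") as handle:
            text = " ".join(handle.read().lower().split())
        if needle in text:
            yield path
```

For example, `list(find_phrase("/data/congress", "Preservation of actions"))` would return the bills containing that phrase, assuming the bulk data lives under /data/congress as in the build command earlier.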
You'll need to have TensorFlow Serving installed. See https://tensorflow.github.io/serving/
First, export the model:
python -m lm.export --model_dir=/tmp/house-model --data_path=house-introduced-114 \
--hparams=hparams.yaml --export_dir=/tmp/serve/house-model --version=1
Build the default server,
bazel build //tensorflow_serving/model_servers:tensorflow_model_server
and bring it up pointing to our export directory:
bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server \
--port=9000 \
--model_base_path=/tmp/serve/house-model \
--model_name=lm
Then query the server with the command-line client:
python -m serving.lm_client --server=127.0.0.1:9000 --num_tests=1 --token=5
There's also an interactive web page in the works for presenting the models:
cd site
./start_server.sh
Then, in a browser, go to localhost:5000. You should see a poor man's bar chart.
Honestly, it's awful right now. Just followed the intro D3 tutorial. But just you wait.
It should all be contained in the app/ directory. Install everything in requirements.txt with pip inside a virtualenv:
virtualenv --python=/usr/local/lib/python2.7.13/bin/python env
source env/bin/activate
pip install -t lib -r requirements.txt
The -t lib is important!
NOTE: Because I was using Ubuntu 14.04, I followed the instructions at http://mbless.de/blog/2016/01/09/upgrade-to-python-2711-on-ubuntu-1404-lts.html to upgrade to Python 2.7.13. The virtualenv needs to use that interpreter so that the requests library works properly.
To use congressapi, you'll have to add an api_keys.py file that defines a PROPUBLICA_CONGRESS_API_KEY constant.
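A minimal api_keys.py might look like the sketch below. Only the constant name comes from this README; the placeholder value and the propublica_headers helper are hypothetical and shown just to illustrate how the key is typically sent, in the X-API-Key request header that the ProPublica Congress API expects. congressapi itself may only read the constant.

```python
# api_keys.py: keep this file out of version control.
PROPUBLICA_CONGRESS_API_KEY = "replace-with-your-key"  # placeholder, not a real key


def propublica_headers(api_key=PROPUBLICA_CONGRESS_API_KEY):
    """Hypothetical helper: build the request header carrying the API key."""
    return {"X-API-Key": api_key}
```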