(C) Copyright 2016, Allison J.B. Chaney
This software is distributed under the MIT license. See LICENSE.txt
for details.
- `dat`
- `doc`
- `out`
- `scripts`
- `src`
- `Readme.md`: this file
The `doc` folder contains the LaTeX source for our EMNLP paper, including the source for generating the figures.
This is the best resource for learning about the Capsule model and inference details,
and may be cited as follows.
@inproceedings{Chaney2016,
author = {Chaney, Allison J.B. and Wallach, Hanna and Connelly, Matthew and Blei, David M.},
title = {Detecting and Characterizing Events},
booktitle = {EMNLP},
year = {2016},
}
This folder also contains PDF slides for various presentations.
The `doc` folder contains `events.csv`, a file listing real-world events with corresponding sources; it is used to check the results of Capsule on the U.S. State Department cables data from the 1970s.
The cables data may be obtained from the History Lab at Columbia University or, if you obtain their permission, I can share my processed version of the data. While the data is publicly accessible, the History Lab's version is cleaner.
Absent this data or your own data of interest, you can simulate data using the script `dat/src/simulate_data.py`.
This script has most of the simulation parameters hard-coded on lines 5-16 and takes two command line arguments: the shape of the event decay (`step`, `linear`, or `exp`) and an integer random seed. The simulated data is written to the directory from which the script is run.
Thus, to create a simulated data set, create a directory, move into it, and run the script from there, as in the following example.
mkdir dat/sim
cd dat/sim
python ../src/simulate_data.py exp 372552
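The three decay shapes accepted by the script (`step`, `linear`, `exp`) can be pictured with a small Python sketch. The function name `event_decay` and the exact functional forms below are assumptions made for illustration, in particular the exponential rate of `duration / 5`; they are not copied from `simulate_data.py` or the C++ source, so check those files for the forms actually used.

```python
import math


def event_decay(shape: str, event_time: int, t: int, duration: int) -> float:
    """Hypothetical weight of an event starting at `event_time`, observed at time `t`."""
    elapsed = t - event_time
    if elapsed < 0 or elapsed >= duration:
        return 0.0  # the event has not started yet, or its window has passed
    if shape == "step":
        return 1.0  # constant weight for the whole window
    if shape == "linear":
        return 1.0 - elapsed / duration  # falls linearly from 1 toward 0
    if shape == "exp":
        # Assumed rate: mean lifetime of one fifth of the event duration.
        return math.exp(-elapsed / (duration / 5.0))
    raise ValueError(f"unknown decay shape: {shape}")
```

With a duration of 7, a step event contributes equally to all seven intervals, while the linear and exponential shapes concentrate the event's influence on the intervals closest to its start.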
To run Capsule using your own data, four files are needed:
- `meta.tsv`
- `train.tsv`
- `test.tsv`
- `validation.tsv`
The first file, `meta.tsv`, is a tab-separated file with three integer-valued columns:
doc.id author.id time.id
This should include the metadata for all documents included in the training, test, and validation sets. If time is continuous in your original data, it should be binned so that each time interval includes a minimum number of documents (e.g., 10). It may also be worth omitting authors who have written too few documents (e.g., <5). When processing your data, you should retain a mapping of these ids to their original values.
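The preprocessing described above can be sketched in Python as follows. This is a minimal illustration, not part of the repository: the function names `build_meta` and `write_meta`, fixed-width binning, 0-based ids, and a header-free output file are all assumptions, so confirm the indexing and header conventions against the Capsule source before relying on them.

```python
from collections import Counter


def build_meta(docs, bin_width, min_author_docs=5, min_bin_docs=10):
    """docs: iterable of (doc_key, author_key, timestamp) with numeric timestamps."""
    docs = list(docs)
    # Drop authors who have written too few documents.
    author_counts = Counter(author for _, author, _ in docs)
    kept = [d for d in docs if author_counts[d[1]] >= min_author_docs]

    # Assign integer ids (assumed 0-based) and keep maps back to the originals.
    start = min(ts for _, _, ts in kept)
    doc_ids, author_ids = {}, {}
    rows = []
    for doc_key, author, ts in kept:
        doc_id = doc_ids.setdefault(doc_key, len(doc_ids))
        author_id = author_ids.setdefault(author, len(author_ids))
        time_id = int((ts - start) // bin_width)  # fixed-width time bins
        rows.append((doc_id, author_id, time_id))

    # Report non-empty bins that fall below the minimum, so the width can be raised.
    bin_counts = Counter(time_id for _, _, time_id in rows)
    sparse_bins = sorted(b for b, n in bin_counts.items() if n < min_bin_docs)
    return rows, doc_ids, author_ids, sparse_bins


def write_meta(rows, path="meta.tsv"):
    """Write doc.id, author.id, time.id as tab-separated integers (no header assumed)."""
    with open(path, "w") as f:
        for row in rows:
            f.write("\t".join(map(str, row)) + "\n")
```

If `sparse_bins` comes back non-empty, increasing `bin_width` and rebuilding is the simplest remedy; the returned `doc_ids` and `author_ids` dictionaries are the mappings worth saving alongside the data.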
The remaining three files are for document word counts; they are also tab-separated with three integer-valued columns:
doc.id term.id count
Each `term.id` refers to the index of a particular vocabulary term; as with topic models, this vocabulary should be chosen with care. Capsule may require a larger vocabulary than a typical topic model, as terms related to events are rarer.
We recommend splitting terms into roughly 90% training, 9% testing, and 1% validation, if your data is sufficiently large.
You should split by (document, vocabulary-term) pairs, not by entire documents; this way, document-specific parameters are still learned.
If you wish to train on the full data, and do not care about a testing set, the validation and test sets are allowed to contain duplicate data.
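A split by (document, vocabulary-term) pairs might look like the following Python sketch. The function names `split_counts` and `write_tsv` are hypothetical, and the per-pair random assignment is only one reasonable way to reach the approximate 90/9/1 proportions; the output is assumed to need no header row.

```python
import random


def split_counts(triples, seed=0, train=0.90, test=0.09):
    """Assign each (doc_id, term_id, count) triple to exactly one split."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    splits = {"train": [], "test": [], "validation": []}
    for triple in triples:
        u = rng.random()
        if u < train:
            splits["train"].append(triple)
        elif u < train + test:
            splits["test"].append(triple)
        else:
            splits["validation"].append(triple)  # the remaining ~1%
    return splits


def write_tsv(triples, path):
    """Write doc.id, term.id, count as tab-separated integers (no header assumed)."""
    with open(path, "w") as f:
        for doc_id, term_id, count in triples:
            f.write(f"{doc_id}\t{term_id}\t{count}\n")
```

Because the assignment is made per pair rather than per document, most documents keep some of their terms in the training set, which is what allows their document-specific parameters to be learned.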
If you intend to use the Capsule visualization, you should check that your author, time, and vocabulary term mappings are all consistent with its required format.
- Clone the repo: `git clone https://github.com/ajbc/capsule.git`
- Navigate to the `capsule/src` directory
- Compile with `make`
- Run the executable, e.g.: `./capsule --data ~/my-data/ --out my-fit`
Compilation requires Armadillo, a C++ linear algebra library.
A note on notation: the paper uses γ (gamma) to represent event topics, but to avoid confusion with the gamma distribution, the code uses `pi` to represent this same variable.
Option | Arguments | Help | Default |
---|---|---|---|
help | | print help information | |
verbose | | print extra information while running | off |
out | dir | save directory, required | |
data | dir | data directory, required | |
svi | | use stochastic VI (instead of batch VI) | off for < 10M doc-term counts in training |
batch | | use batch VI (instead of SVI) | on for < 10M doc-term counts in training |
a_phi | a | shape hyperparameter to phi (entity general concerns) | 0.3 |
b_phi | b | rate hyperparameter to phi (entity general concerns) | 0.3 |
a_xi | a | shape hyperparameter to xi (entity-specific concern) | 0.3 |
b_xi | b | rate hyperparameter to xi (entity-specific concern) | 0.3 |
a_psi | a | shape hyperparameter to psi (event strength) | 0.3 |
b_psi | b | rate hyperparameter to psi (event strength) | 0.3 |
a_theta | a | shape hyperparameter to theta (documents' general topics) | 0.3 |
a_zeta | a | shape hyperparameter to zeta (documents' entity topics) | 0.3 |
a_epsilon | a | shape hyperparameter to epsilon (documents' event topics) | 0.3 |
a_beta | a | hyperparameter to beta (general topics) | 0.3 |
a_eta | a | shape hyperparameter to eta (entity topics) | 0.3 |
a_pi | a | hyperparameter to pi (event topics; gamma in paper) | 0.3 |
no_topics | | don't consider general topics | include general topics |
no_entity | | don't consider entity topics | include entity topics |
no_events | | don't consider event topics | include event topics |
event_dur | d | event duration | 7 |
event_decay | d | event decay shape; options: exponential, linear, step | exponential |
seed | seed | the random seed | time |
save_freq | f | the saving frequency; a negative value means intermediate results are not saved | 20 |
eval_freq | f | the intermediate evaluation frequency; a negative value means no intermediate evaluation | -1 |
conv_freq | f | the convergence check frequency | 10 |
max_iter | max | the max number of iterations | 300 |
min_iter | min | the min number of iterations | 30 |
converge | c | the change in rating log likelihood required for convergence | 1e-6 |
final_pass | | do a final pass on all users and items | no final pass |
overwrite | | overwrite old results | keep only latest |
sample | sample_size | the stochastic sample size | 1000 |
svi_delay | tau | SVI delay >= 0 to down-weight early samples | 1024 |
svi_forget | kappa | SVI forgetting rate in (0.5, 1] | 0.75 |
K | K | the number of general topics | 100 |
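To give a feel for how `svi_delay` (tau) and `svi_forget` (kappa) interact, the Python sketch below computes the step-size schedule commonly used in stochastic variational inference, rho_t = (t + tau)^(-kappa). That Capsule uses exactly this schedule is an assumption; this README does not state the formula, and the function name `svi_step_size` is hypothetical.

```python
def svi_step_size(iteration: int, delay: float = 1024.0, forget: float = 0.75) -> float:
    """Assumed SVI step size at a 1-based iteration: (iteration + delay) ** (-forget)."""
    if iteration < 1:
        raise ValueError("iteration is counted from 1")
    if delay < 0:
        raise ValueError("delay (tau) must be >= 0")
    if not 0.5 < forget <= 1.0:
        raise ValueError("forgetting rate (kappa) must lie in (0.5, 1]")
    return (iteration + delay) ** (-forget)


if __name__ == "__main__":
    # Print a few steps to see how slowly the default schedule decays.
    for t in (1, 10, 100, 1000):
        print(t, svi_step_size(t))
```

Under this schedule, a larger delay keeps the early step sizes small so that the first noisy samples carry less weight, and a forgetting rate closer to 1 makes later updates shrink faster.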