This is a repo of Python (2.7) programs for variational inference for a DP mixture of HDP ngram backoff models (generalized by the HDP topic model). See

Morita, T. (2018) "Unsupervised Learning of Lexical Subclasses from Phonotactics." MIT Ph.D. Thesis.

for details on the model.
Python 3 is not supported.
- Prepare your data.
  - A .tsv file.
  - The data column should contain strings of comma-separated symbols (e.g. `a,b,c` for the string "abc").
  - Example data are available at `toy_data/DP-ngram-mixture_simulated_data_200-samples.tsv`.
- Move to the `code/` directory.
- Run `python learning.py PATH/TO/YOUR/DATA` with the following options:
- Specifying data.
  - `-k`/`--data_column`
    - Column name for the inputs.
    - default=`'IPA_csv'`
- Saving results.
  - `-r`/`--result_path`
    - Path to the directory where you want to save results. (Several subdirectories will be created. See below.)
    - default=`'../results_debug'`
  - `-j`/`--jobid`
    - Job ID #. Used as part of the path to the directory where results are saved (useful on computing clusters).
    - default=start date & time (e.g. "18-10-20-12-30-14-551728")
- Model parameters.
  - `-n`/`--ngram`
    - Context length of the ngram. Only bigrams and longer are currently supported (i.e., no support for 1grams).
    - default=3
  - `-S`/`--shape_of_sublex_concentration`
    - Shape parameter of the Gamma prior on the concentration of the sublexicon DP.
    - default=10.0
  - `-R`/`--rate_of_sublex_concentration`
    - Rate (= inverse of scale) parameter of the Gamma prior on the concentration of the sublexicon DP.
    - default=10.0
  - `-c`/`--topic_base_counts`
    - Concentration of the top-level Dirichlet distribution.
    - default=1.0
- Variational inference.
  - `-i`/`--iterations`
    - Maximum # of iterations.
    - default=2500
  - `-T`/`--tolerance`
    - Tolerance level to detect convergence.
    - default=0.1
  - `-s`/`--sublex`
    - Max # of sublexica.
    - default=10
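As a sketch of the expected input format (the file name `toy_input.tsv` and the toy words below are invented for illustration; only the default column name `IPA_csv` and the comma-separated symbol format come from the documentation above):

```python
# Illustration of the expected input: a .tsv whose data column
# ("IPA_csv" by default) holds comma-separated symbol strings.
# The file name and the toy words here are invented for this sketch.
words = ["abc", "abd", "cab"]  # toy "words" over the symbols a, b, c, d

# "abc" -> "a,b,c": one comma-separated symbol string per row
rows = ["IPA_csv"] + [",".join(w) for w in words]
with open("toy_input.tsv", "w") as f:
    f.write("\n".join(rows) + "\n")

print(open("toy_input.tsv").read())
```

A file written this way can then be passed as `PATH/TO/YOUR/DATA`, with `-k IPA_csv` left at its default.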
The program will create subdirectories `[data_filename]/[job_id]` in the directory specified by the `--result_path` option.
For example, if
- your `--result_path` is `../results_eg`,
- your data is `../data/example.tsv`, and
- the `-j` option is `10`,

then the results will be saved in `../results_eg/example/10`.
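The directory convention in the example above can be reconstructed with standard path operations (paths reproduce the example; `posixpath` is used only to keep forward slashes on all platforms):

```python
# Reconstructing the results directory from the example above:
# results go under [result_path]/[data_filename]/[job_id].
import os
import posixpath

result_path = "../results_eg"
data_path = "../data/example.tsv"
job_id = "10"

# Strip the directory and extension from the data file to get [data_filename].
data_filename = os.path.splitext(os.path.basename(data_path))[0]
result_dir = posixpath.join(result_path, data_filename, job_id)
print(result_dir)  # -> ../results_eg/example/10
```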
You'll get four files:
- `SubLexica_assignment.csv`
  - Classification probabilities of words (indexed by "customer_id", following the CRP convention).
- `symbol_coding.csv`
  - Code map between data symbols and their integer IDs.
- `variational_parameters.h5`
  - Variational parameters of the model.
- `VI_DP_ngram.log`
  - Log of the updates.
  - The recorded "var_bound" (i.e., the ELBO) doesn't include constant terms (for computational efficiency).
  - To get the constant term, run `get_var_bound_constant.py`.
- Japanese words appearing in BCCWJ (Morita, 2018; Morita & O'Donnell, to appear).
- English words in CELEX (Morita, 2018; Morita & O'Donnell, in prep.).
- Tigrinya words collected by Dr. Kevin Scannell.
- Morita, Takashi. 2018. Unsupervised Learning of Lexical Subclasses from Phonotactics. Doctoral dissertation, MIT, Cambridge, MA.
- Morita, Takashi and Timothy J. O'Donnell. To appear. Statistical evidence for learnable lexical subclasses in Japanese. Accepted with major revisions for Linguistic Inquiry.