Snagajob/alto-boot

add getopts to preprocessing bash scripts (new_dataset.sh and train_mallet.sh, etc)


This is the script that builds a new dataset, specifically from Snagajob postings. It's still pretty quick and dirty: I need to add getopts, a shebang header, a help string, etc. I'll also add documentation to the README.

It requires the unzipped tree-TM codebase somewhere on the machine (its path is passed to the script as MALLET_HOME), a Python joblib-serialized list of mongo posting ids, and access to our mongo cluster.

It's called like:

bash scripts/new_dataset.sh \
    postings_samp . en_lang_postings_samp_small.pkl 20 ~/tree-TM/bin/ 8 \
    $MONGO_USER $MONGO_PASSWORD $MONGO_HOST $MONGO_PORT $MONGO_DB
and it creates all the necessary files on disk, creates the expected directory trees, and trains the topic model.
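
For reference, here is a rough sketch of what the getopts handling, shebang header and help string could look like. The option letters (-m, -i, -h), the IDS_PKL variable and the parse_args/usage function names are placeholders I made up for illustration, not the script's current interface; only MALLET_HOME and the pickled id list come from the description above. The remaining positional arguments and the mongo settings would need flags of their own.

```bash
#!/usr/bin/env bash
# Sketch only: option letters and names below are hypothetical placeholders.

usage() {
  cat <<'EOF'
Usage: new_dataset.sh -m MALLET_HOME -i IDS_PKL [-h]
  -m DIR   path to the unzipped tree-TM codebase (MALLET_HOME)
  -i FILE  joblib-serialized list of mongo posting ids
  -h       show this help and exit
EOF
}

parse_args() {
  # Reset OPTIND locally so the function can be called more than once.
  local opt OPTIND=1
  MALLET_HOME=""
  IDS_PKL=""
  # The leading ':' puts getopts in silent mode so we print our own errors.
  while getopts ":m:i:h" opt; do
    case "$opt" in
      m) MALLET_HOME="$OPTARG" ;;
      i) IDS_PKL="$OPTARG" ;;
      h) usage; exit 0 ;;
      :) echo "Option -$OPTARG requires an argument" >&2; usage >&2; return 1 ;;
      \?) echo "Unknown option -$OPTARG" >&2; usage >&2; return 1 ;;
    esac
  done
  # Both options are treated as required in this sketch.
  if [[ -z "$MALLET_HOME" || -z "$IDS_PKL" ]]; then
    echo "Both -m and -i are required" >&2
    usage >&2
    return 1
  fi
}

# In the real script this would be followed by something like:
#   parse_args "$@" || exit 1
```

With something like this in place, the call would become `bash scripts/new_dataset.sh -m ~/tree-TM/bin/ -i en_lang_postings_samp_small.pkl`, plus whatever flags end up covering the other arguments.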