Discrete Text Representation

This is the codebase for our ACL 2020 paper: Discrete Latent Variable Representations for Low-Resource Text Classification (ACL portal, video, slides).

@inproceedings{jin2020discrete,
    title = "Discrete Latent Variable Representations for Low-Resource Text Classification",
    author = "Shuning Jin and Sam Wiseman and Karl Stratos and Karen Livescu",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    year = "2020",
    url = "https://www.aclweb.org/anthology/2020.acl-main.437"
}

Table of Contents

1 Software Environment

2 Data

text classification datasets: AG News, DBPedia, Yelp Review Full

  • from paper: Character-level Convolutional Networks for Text Classification (Zhang et al., NIPS 2015)

  • Google Drive link shared by the author

  • alternatively, download the data by running the following commands (replace /path/to/data/dir with your data directory)

    # gdown is a library for downloading files from Google Drive
    # https://pypi.org/project/gdown/
    # already included in environment.yml
    pip install gdown
    # download data from google drive
    bash scripts/download_data.sh /path/to/data/dir
    # randomly sample 5000 examples as the dev set
    python scripts/train_dev_split.py /path/to/data/dir
  • data directory should look like this
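As an aside on the dev-split step in the commands above: a random hold-out of 5000 examples can be sketched as below. This is only an illustration of the idea, not the contents of `scripts/train_dev_split.py`. The function name `split_train_dev`, the fixed seed, and sampling without replacement are all assumptions made for the sketch.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_dev(
    rows: Sequence[T], dev_size: int = 5000, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Hold out `dev_size` randomly chosen rows as a dev set; the rest stay in train."""
    if dev_size > len(rows):
        raise ValueError("dev_size exceeds the number of available rows")
    # A fixed seed (an assumption of this sketch) keeps the split reproducible.
    rng = random.Random(seed)
    dev_indices = set(rng.sample(range(len(rows)), dev_size))
    train = [row for i, row in enumerate(rows) if i not in dev_indices]
    dev = [row for i, row in enumerate(rows) if i in dev_indices]
    return train, dev


# Toy usage with stand-in rows instead of a real dataset file.
rows = [f"example-{i}" for i in range(20)]
train_rows, dev_rows = split_train_dev(rows, dev_size=5, seed=0)
print(len(train_rows), len(dev_rows))  # 15 5
```

Both output lists keep the original row order, so the train portion stays comparable to the source file.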

3 Run Code

  • set environment variables

    export DISCRETE_DATA_DIR=/path/to/data/dir
    export DISCRETE_PROJECT_DIR=/path/to/project/dir
  • TensorBoard (optional): view the TensorBoard output of pretraining

    # local server
    tensorboard --logdir [dir: tensorboard_train, tensorboard_val]
    open http://localhost:6006
    
    # remote server
    tensorboard --logdir [dir: tensorboard_train, tensorboard_val] --bind_all
    # replace `remotehost` with the host name printed by the above command
    open http://remotehost:6006
  • command examples:

    # Caveat: use SINGLE QUOTES around the -o argument
    # double quotes can cause problems, e.g. the shell expands `${...}` before the program sees it
    
    # to resume a previously interrupted pretraining, change `ckpt_path=none` to `ckpt_path=current`
    
    # vq pretrain
    python main.py \
    -c config/base.conf \
    -o 'expname=demo, runname=ag_sentence_vq,
    quantizer.level=sentence, quantizer.M=4, quantizer.K=256, quantizer.type=vq, vq.commitment_cost=1e-3, vq.use_ema=0,
    phase=pretrain, pretrain.use_noam=0, ckpt_path=none'
    
    # vq target train: 200 examples
    python main.py \
    -c config/base.conf \
    -o 'expname=demo, runname=ag_sentence_vq,
    quantizer.level=sentence, quantizer.M=4, quantizer.K=256, quantizer.type=vq, vq.commitment_cost=1e-3, vq.use_ema=0,
    phase=target_train, target=${target-tmpl}${target-200-tmpl}{test=0}, sub_runname=cls200, ckpt_path=current'
    
    # vq output pretrained encodings
    python main.py \
    -c config/base.conf \
    -o 'expname=demo, runname=ag_sentence_vq,
    quantizer.level=sentence, quantizer.M=4, quantizer.K=256, quantizer.type=vq, vq.commitment_cost=1e-3, vq.use_ema=0,
    phase=analyze, ckpt_path=current'
    
    # em pretrain
    python main.py -c config/base.conf \
    -o 'expname=demo, runname=ag_sentence_em,
    quantizer.level=sentence, quantizer.M=4, quantizer.K=256, quantizer.type=em,
    phase=pretrain, pretrain.em_iter=3, pretrain.use_noam=1, ckpt_path=none'
    
    # TODO: cat-vae, retrieval
    # more examples coming soon
  • command explanation: quick intro, detailed intro coming soon.
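One way to sidestep the quoting caveat is to build the command as an argument list from Python, so no shell ever interprets the override string. The sketch below assumes that overrides joined on a single line with `, ` are parsed the same way as the multi-line strings in the examples above. The helper names `check_env` and `build_command` are hypothetical and not part of this codebase.

```python
import os
import shlex
from typing import Dict, List, Mapping


def check_env(env: Mapping[str, str]) -> List[str]:
    """Return the names of required environment variables that are missing or empty."""
    required = ("DISCRETE_DATA_DIR", "DISCRETE_PROJECT_DIR")
    return [name for name in required if not env.get(name)]


def build_command(overrides: Dict[str, str], config: str = "config/base.conf") -> List[str]:
    """Build the argument list for main.py with comma-separated key=value overrides."""
    override_str = ", ".join(f"{key}={value}" for key, value in overrides.items())
    return ["python", "main.py", "-c", config, "-o", override_str]


# Same settings as the "vq target train: 200 examples" command above.
overrides = {
    "expname": "demo",
    "runname": "ag_sentence_vq",
    "quantizer.level": "sentence",
    "quantizer.M": "4",
    "quantizer.K": "256",
    "quantizer.type": "vq",
    "vq.commitment_cost": "1e-3",
    "vq.use_ema": "0",
    "phase": "target_train",
    "target": "${target-tmpl}${target-200-tmpl}{test=0}",
    "sub_runname": "cls200",
    "ckpt_path": "current",
}
command = build_command(overrides)

# shlex.join shows the shell-safe form of the command (requires Python 3.8+).
print(shlex.join(command))

missing = check_env(os.environ)
if missing:
    print("Set these environment variables first:", ", ".join(missing))
# To actually launch the run from the project directory (not executed here):
#   import subprocess; subprocess.run(command, check=True)
```

Because the override string travels as a single list element, the `${...}` placeholders reach `main.py` unchanged regardless of shell quoting rules.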

Acknowledgement

The coding logic is largely borrowed from and inspired by the jiant library.

@misc{wang2019jiant,
    author = {Alex Wang and Ian F. Tenney and Yada Pruksachatkun and Phil Yeres and Jason Phang and Haokun Liu and Phu Mon Htut and Katherin Yu and Jan Hula and Patrick Xia and Raghu Pappagari and Shuning Jin and R. Thomas McCoy and Roma Patel and Yinghui Huang and Edouard Grave and Najoung Kim and Thibault F\'evry and Berlin Chen and Nikita Nangia and Anhad Mohananey and Katharina Kann and Shikha Bordia and Nicolas Patry and David Benton and Ellie Pavlick and Samuel R. Bowman},
    title = {\texttt{jiant} 1.3: A software toolkit for research on general-purpose text understanding models},
    howpublished = {\url{http://jiant.info/}},
    year = {2019}
}