Wrangl

Parallel data preprocessing and fast experiments for NLP and ML. See docs here.

Why?

I built this library to prototype ideas quickly. In essence, it combines Hydra, PyTorch Lightning, moolib, and Ray for fast data processing and (supervised/reinforcement) learning. The following are supported with command-line or config tweaks, i.e. no additional boilerplate code (a rough sketch of what these toggles replace follows the list):

  • checkpointing
  • early stopping
  • auto git diffs
  • logging to S3 (with auto-generated Seaborn plots) and to Weights & Biases (wandb)
  • Slurm launcher
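
For instance, the checkpointing and early-stopping toggles correspond to standard PyTorch Lightning callbacks. The snippet below is only a sketch of the hand-written setup such config tweaks replace; the monitored metric and option values are illustrative, not Wrangl's exact config schema.

# Rough sketch (not Wrangl's API): checkpointing and early stopping via
# standard PyTorch Lightning callbacks, which Wrangl configures for you.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping

trainer = pl.Trainer(
    max_epochs=10,
    callbacks=[
        ModelCheckpoint(monitor='val_loss', save_top_k=1),  # keep the best checkpoint
        EarlyStopping(monitor='val_loss', patience=3),       # stop when val loss plateaus
    ],
)
# trainer.fit(model, train_loader, val_loader)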

Installation

# from a local clone; add [dev] to also run tests and build docs
pip install -e .

# latest from GitHub
pip install git+https://github.com/vzhong/wrangl

# PyPI release
pip install wrangl

If the moolib install fails because you do not have CUDA, you can try installing it yourself with env USE_CUDA=0 pip install moolib.

Usage

See the documentation for how to use Wrangl. Examples of projects using Wrangl are found in wrangl.examples. In particular, wrangl.examples.learn.xor_clf shows how to quickly set up a supervised classification task, and wrangl.examples.learn.atari_rl shows reinforcement learning using IMPALA with V-trace. For parallel data preprocessing, wrangl.examples.preprocess.using_stanza shows how to use Stanford NLP's Stanza to parse text in parallel across CPU cores.
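
As a rough illustration of that preprocessing pattern, the sketch below uses Ray and Stanza directly rather than Wrangl's own wrappers; the worker count and pipeline settings are made up for the example, and the real version lives in wrangl.examples.preprocess.using_stanza.

# Illustrative only: parse text in parallel across CPU cores with Ray actors,
# each holding its own Stanza pipeline.
import ray
import stanza

@ray.remote
class Parser:
    def __init__(self):
        # assumes the English model was fetched beforehand via stanza.download('en')
        self.nlp = stanza.Pipeline('en', processors='tokenize,pos')

    def parse(self, text):
        doc = self.nlp(text)
        return [(word.text, word.upos) for sent in doc.sentences for word in sent.words]

ray.init()
workers = [Parser.remote() for _ in range(4)]
texts = ['Wrangl preprocesses data in parallel.', 'Each worker parses its own shard.']
futures = [workers[i % len(workers)].parse.remote(t) for i, t in enumerate(texts)]
print(ray.get(futures))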

If you find this work helpful, please consider citing:

@misc{zhong2021wrangl,
  author = {Zhong, Victor},
  title = {Wrangl: Parallel data preprocessing for NLP and ML},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/vzhong/wrangl}}
}