HiPool

A work-in-progress modified implementation of HiPool to support experiments on CuRIAM.

HiPool (short for Hierarchical Pooling) is described in the paper "HiPool: Modeling Long Documents Using Graph Neural Networks" from ACL 2023.

This is not the original HiPool repo, and I am not an author of the HiPool paper. Please see the original repo here.
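For readers new to HiPool, here is a toy sketch of the general idea, as I understand it from the paper: split a long document into chunks, encode each chunk, connect the chunk embeddings in a graph, and pool into a document-level representation. This is my paraphrase, not the authors' implementation; the encoder, the chain-shaped graph, and the pooling below are simplified stand-ins (HiPool's actual graph is hierarchical and its layers are learned).

import torch

def encode_chunks(chunks: list[str], dim: int = 768) -> torch.Tensor:
    # Stand-in for a real chunk encoder (e.g., BERT run over each chunk).
    return torch.randn(len(chunks), dim)

def chain_adjacency(n: int) -> torch.Tensor:
    # Connect neighboring chunks in a simple chain with self-loops.
    adj = torch.eye(n)
    idx = torch.arange(n - 1)
    adj[idx, idx + 1] = 1.0
    adj[idx + 1, idx] = 1.0
    return adj

def graph_pool(x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    # One round of neighborhood averaging, then mean-pool to one vector.
    deg = adj.sum(dim=1, keepdim=True)
    x = adj @ x / deg
    return x.mean(dim=0)

chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]
doc_vec = graph_pool(encode_chunks(chunks), chain_adjacency(len(chunks)))
print(doc_vec.shape)  # torch.Size([768])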


Setup

  1. Create conda/mamba environment.
    mamba env create -f environment.yml
    mamba activate hipool
    
  2. Install hipool locally.
    pip install --upgrade build
    pip install -e .
    
  3. Download datasets.
    • CuRIAM: Included with repo.
    • IMDB: I think this is the dataset.
      • I renamed the main CSV file to imdb_sample.csv and removed most rows to speed up debugging; this dataset is not central to what I'm experimenting with.
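After completing the steps above, a quick sanity check can confirm the install (a minimal sketch; importing the package as `hipool` is an assumption based on step 2):

import torch
import hipool  # should import without error after the editable install

print("hipool imported from:", hipool.__file__)
print("CUDA available:", torch.cuda.is_available())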

Misc

[work-in-progress]

This repo uses jaxtyping and typeguard to enforce correct tensor dimensions at runtime. If you see an unfamiliar type annotation or a decorator like the one in the example below, it's there for type checking.

import torch
from jaxtyping import Float, jaxtyped
from typeguard import typechecked as typechecker

@jaxtyped(typechecker=typechecker)
def some_function(x: Float[torch.Tensor, "10 768"]):
    pass
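For example, calling the annotated function with a mismatched tensor shape raises an error at call time instead of failing somewhere deeper in the model:

some_function(torch.zeros(10, 768))  # OK: matches the "10 768" annotation
some_function(torch.zeros(5, 768))   # raises at call time: first dim is 5, not 10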

I recommend taking a look at the jaxtyping docs.

TODOs

  • Some long documents currently exceed GPU VRAM
  • Batching should already handle single-document batches, but this needs testing
  • Evaluation needs its final pieces put together, then needs to be tested
  • Decide on consistent variable names for the tensor-shape type annotations

Cite HiPool

@inproceedings{li2023hipool,
  title={HiPool: Modeling Long Documents Using Graph Neural Networks},
  author={Li, Irene and Feng, Aosong and Radev, Dragomir and Ying, Rex},
  booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
  year={2023},
  url={https://aclanthology.org/2023.acl-short.16/}
}