kg-augmented-lm

Leveraging knowledge graphs to learn a more factually grounded language model for downstream retrieval and question-answering tasks.


Knowledge Graph-Text Fusion for enhanced Language Modeling


Description

This project fuses knowledge graph structure with text during language model training to obtain a more factually grounded model, which is then evaluated on retrieval and question answering downstream tasks.

Installation

Running on a UNIX-like system is highly recommended. If you are on Windows, you can use the Windows Subsystem for Linux (WSL) to run the code.

Clone Repository

git clone https://github.com/marcomoldovan/kg-augmented-lm
cd kg-augmented-lm

Virtual Environment & Dependencies

Option 1: Poetry

# install dependencies
poetry install

# activate environment
source $(poetry env info --path)/bin/activate

Option 2: pyenv

# create environment with specified python version (requires the pyenv-virtualenv plugin)
pyenv virtualenv <python-version> .venv
pyenv activate .venv

# install requirements
pip install -r requirements.txt

Option 3: Conda

# create conda environment and install dependencies
conda env create -f environment.yaml -n .venv

# activate conda environment
conda activate .venv

Option 4: Native Python

# create environment with specified python version
python -m venv .venv
source .venv/bin/activate

# install requirements
pip install -r requirements.txt

How to run

Download and Prepare Data

Running the download and preprocessing scripts is optional; the respective DataModules handle all of this automatically. It is only necessary if you want to run specific tests before training that need access to the data (a minimal sketch of the DataModule pattern follows the commands below).

Wikigraphs

bash scripts/download_data/download_wikigraphs.sh
bash scripts/preprocess_data/preprocess_wikigraphs.sh

Wikidata5m

bash scripts/download_data/download_wikidata5m.sh
bash scripts/preprocess_data/preprocess_wikidata5m.sh
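
For reference, the automatic handling mentioned above is the standard LightningDataModule pattern; below is a minimal sketch with an illustrative class name and elided bodies, not this repo's exact code:

import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset

class WikigraphsDataModule(pl.LightningDataModule):  # illustrative name
    def prepare_data(self):
        # runs once per node: download and preprocess the raw data,
        # i.e. what the shell scripts above do when invoked manually
        ...

    def setup(self, stage=None):
        # build the datasets from the preprocessed files
        self.train_set: Dataset = ...

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=64, shuffle=True)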

Run Tests

# run all tests
pytest

# run tests from specific file
pytest tests/test_train.py

# run all tests except the ones marked as slow
pytest -m "not slow"
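
Slow tests are deselected via a pytest marker. As a minimal sketch, a test opts in like this, assuming the slow marker is registered in the pytest configuration:

import pytest

@pytest.mark.slow  # marker name matches the -m filter above
def test_full_training_loop():  # illustrative test
    # long-running check; `pytest -m "not slow"` skips it
    ...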

Single Training Run

Train model with default configuration

# train on CPU
python src/train.py trainer=cpu

# train on GPU
python src/train.py trainer=gpu

# train on multiple GPUs
python src/train.py trainer=ddp

Train model with chosen experiment configuration from configs/experiment/

python src/train.py experiment=experiment_name.yaml

You can override any parameter from the command line like this

python src/train.py trainer.max_epochs=20 data.batch_size=64
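
The same overrides also work programmatically through Hydra's compose API, which is handy in notebooks; the config_path and config_name below assume the template's configs/train.yaml entry point:

from hydra import compose, initialize

# compose the training config with overrides, without launching a run
with initialize(version_base=None, config_path="configs"):
    cfg = compose(
        config_name="train",
        overrides=["trainer.max_epochs=20", "data.batch_size=64"],
    )

print(cfg.trainer.max_epochs)  # 20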

Hyperparameter Search

To run a hyperparameter search with Optuna, you can use the following command

python src/train.py -m hparams_search=fashion_mnist_optuna experiment=example

Running a hyperparameter sweep with Weights & Biases is also supported.

wandb sweep configs/hparams_search/fashion_mnist_wandb.yaml
wandb agent <sweep_id>
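
The commands above are the CLI flow this repo uses; for completeness, the equivalent SDK calls look roughly like this. The search space and metric name are illustrative assumptions, not the contents of configs/hparams_search/:

import wandb

# illustrative sweep definition mirroring a YAML sweep config
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val/loss", "goal": "minimize"},
    "parameters": {"data.batch_size": {"values": [32, 64, 128]}},
}

sweep_id = wandb.sweep(sweep_config, project="kg-augmented-lm")
# wandb.agent(sweep_id, function=train_fn)  # train_fn: hypothetical entry point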

SLURM

bash scripts/slurm/slurm_train.sh

Docker

docker build -t kg-augmented-lm .
docker run --gpus all -it kg-augmented-lm

Inference

The intended workflow: load a trained checkpoint, prepare the accepted model input, and link the input text to a KG.
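
A minimal sketch of that workflow using PyTorch Lightning's load_from_checkpoint API; the module names, import paths, and checkpoint path below are illustrative assumptions, not this repo's exact identifiers:

import torch

# hypothetical import paths, for illustration only
from src.models.kg_augmented_lm_module import KGAugmentedLMModule
from src.data.wikigraphs_datamodule import WikigraphsDataModule

# load a trained checkpoint (path is illustrative)
model = KGAugmentedLMModule.load_from_checkpoint("path/to/checkpoint.ckpt")
model.eval()

# the DataModule yields batches with text and linked KG entities
# encoded the way the model expects
dm = WikigraphsDataModule()
dm.prepare_data()
dm.setup(stage="test")
batch = next(iter(dm.test_dataloader()))

with torch.no_grad():
    output = model(batch)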

Results

Results and accompanying plots will be added here.

Contributing

Contributions are very welcome. If you know how to make this code better, don't hesitate to open an issue or a pull request.

License

This project is licensed under the terms of the MIT license. See LICENSE for additional details.

Acknowledgements

  • Lightning-Hydra Template (https://github.com/ashleve/lightning-hydra-template)

Citation

@article{KGTextFusion,
  title={Knowledge Graph-Text Fusion for enhanced Language Modeling},
  author={Marco Moldovan},
  journal={arXiv preprint arXiv:1001.2234},
  year={2023}
}