
[ACL 2024 Findings] This is the code for our paper "Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models".

Primary language: Python. License: MIT.

ClinGen

This is the code for our paper "Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models", to appear in Findings of ACL 2024.

Model Framework

[Figure: ClinGen model framework]

Dataset

Generated Datasets

The original train/validation/test data and the generated synthetic training data have been uploaded to the Hugging Face Dataset Hub. The Link-KG and Link-LLM columns correspond to the two ways of incorporating external knowledge (knowledge graph and LLM):

| Corpus | # Train | # Test | # Class | Task | Link-KG | Link-LLM |
|---|---|---|---|---|---|---|
| LitCovid | 24960 | 6238 | 7 | Text Classification | litcovid | litcovid |
| HOC | 3091 | 898 | 10 | Text Classification | hoc | hoc |
| GAD | 4750 | 350 | 1 | Relation Extraction | gad | gad |
| CDR | 8431 | 2522 | 1 | Relation Extraction | cdr | cdr |
| ChemProt | 8793 | 10807 | 5 | Relation Extraction | chemprot | chemprot |
| MedNLI | 11232 | 1422 | 3 | Natural Language Inference | mednli | mednli |
| MEDIQA-NLI | - | 405 | 3 | Natural Language Inference | mediqa-nli | mediqa-nli |
| MEDIQA-RQE | 8588 | 302 | 2 | Natural Language Inference | mediqa-rqe | mediqa-rqe |
| PUBHEALTH | 9804 | 1231 | 4 | Fact Verification | pubhealth | pubhealth |
| HealthVer | 10591 | 1824 | 3 | Fact Verification | healthver | healthver |
| MQP | 10 | 3033 | 2 | Sentence Similarity | mqp | mqp |
| BC5CDR-Disease | 4882 | 5085 | 1 | Named Entity Recognition | bc5cdr-disease | bc5cdr-disease |
| BC5CDR-Chemical | 4882 | 5085 | 1 | Named Entity Recognition | bc5cdr-chemical | bc5cdr-chemical |
| NCBI-Disease | 5336 | 921 | 1 | Named Entity Recognition | ncbi-disease | ncbi-disease |
| CHEMDNER | 14522 | 12430 | 1 | Named Entity Recognition | chemdner | chemdner |
| CASI | 5 | 100 | 6 | Attribute Extraction | casi | casi |

Note:

  • Due to privacy constraints, we are not able to release the training set for MedNLI/MEDIQA-NLI.
  • train.jsonl is the synthetic training set (it may contain noise).
  • train_few.jsonl contains the initial few-shot demonstrations.
  • test.jsonl contains the data from the test set.
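
As a minimal sketch of how the three files above might be read once downloaded, the snippet below parses each JSON Lines file into a list of dictionaries. The helper names `read_jsonl` and `load_splits` and the single-directory layout are assumptions made for illustration; they are not part of this repository, and the snippet makes no assumption about the field names inside each record.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_splits(data_dir):
    """Load whichever of the three split files exist under data_dir.

    Hypothetical layout: data_dir/train.jsonl, data_dir/train_few.jsonl,
    data_dir/test.jsonl. Missing files are skipped, since some corpora
    do not ship every split.
    """
    splits = {}
    for name in ("train", "train_few", "test"):
        path = Path(data_dir) / f"{name}.jsonl"
        if path.exists():
            splits[name] = read_jsonl(path)
    return splits
```

Calling `load_splits("path/to/corpus")` returns a dictionary keyed by split name, so `len(splits["train"])` gives the number of synthetic training examples that were loaded.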

Training Data Generation

First, apply for an OpenAI API key if you don't have one yet. Then, replace YOUR_API_KEY in clingen.py with your own API key. Finally, run bash run_clingen.sh with your chosen dataset name and keyword type.
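
If you would rather not commit a key to the file, one possible alternative is to fall back to an environment variable whenever the placeholder is still in place. This is only a sketch under stated assumptions: `resolve_api_key` is a hypothetical helper, the variable name `OPENAI_API_KEY` is a common convention rather than something this repository defines, and clingen.py as released expects the key to be pasted in directly.

```python
import os

# The placeholder string that the instructions above ask you to replace.
PLACEHOLDER = "YOUR_API_KEY"


def resolve_api_key(configured=PLACEHOLDER, env_var="OPENAI_API_KEY"):
    """Return the configured key, or fall back to an environment variable.

    Hypothetical helper: raises RuntimeError when neither source has a key,
    so a forgotten placeholder fails early instead of at the first API call.
    """
    if configured and configured != PLACEHOLDER:
        return configured
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"No API key found: replace {PLACEHOLDER} or set {env_var}."
        )
    return key
```

With this pattern, running `export OPENAI_API_KEY=...` in the shell before launching the script would supply the key without editing the source.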

Questions?

Feel free to contact ran.xu at emory.edu with any questions regarding this repo. Please describe the problem in detail so we can help you better and faster!

Citation

If you find this repository helpful, please kindly consider citing the corresponding paper. Thanks in advance!

@inproceedings{xu2024knowledge,
  title={Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models},
  author={Xu, Ran and Cui, Hejie and Yu, Yue and Kan, Xuan and Shi, Wenqi and Zhuang, Yuchen and Jin, Wei and Ho, Joyce and Yang, Carl},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2024},
  year={2024}
}