ClinGen

This is the code for our paper Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models, to appear in Findings of ACL 2024.

Model Framework

Dataset

Generated Datasets

The original train/validation/test data and the generated synthetic training data have been uploaded to the Hugging Face Dataset Hub (note that KG and LLM stand for the two ways of incorporating external knowledge):

Corpus	# Train	# Test	# Class	Task	Link-KG	Link-LLM
LitCovid	24960	6238	7	Text Classification	litcovid	litcovid
HOC	3091	898	10	Text Classification	hoc	hoc
GAD	4750	350	1	Relation Extraction	gad	gad
CDR	8431	2522	1	Relation Extraction	cdr	cdr
ChemProt	8793	10807	5	Relation Extraction	chemprot	chemprot
MedNLI	11232	1422	3	Natural Language Inference	mednli	mednli
MEDIQA-NLI	-	405	3	Natural Language Inference	mediqa-nli	mediqa-nli
MEDIQA-RQE	8588	302	2	Natural Language Inference	mediqa-rqe	mediqa-rqe
PUBHEALTH	9804	1231	4	Fact Verification	pubhealth	pubhealth
HealthVer	10591	1824	3	Fact Verification	healthver	healthver
MQP	10	3033	2	Sentence Similarity	mqp	mqp
BC5CDR-Disease	4882	5085	1	Named Entity Recognition	bc5cdr-disease	bc5cdr-disease
BC5CDR-Chemical	4882	5085	1	Named Entity Recognition	bc5cdr-chemical	bc5cdr-chemical
NCBI-Disease	5336	921	1	Named Entity Recognition	ncbi-disease	ncbi-disease
CHEMDNER	14522	12430	1	Named Entity Recognition	chemdner	chemdner
CASI	5	100	6	Attribute Extraction	casi	casi

Note:

Due to privacy constraints, we are not able to release the training set for MedNLI/MEDIQA-NLI.
train.jsonl contains the synthetic training set (may contain noise).
train_few.jsonl contains the initial few-shot demonstrations.
test.jsonl contains the data from the test set.
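
Once a dataset has been downloaded, the three files can be read with a few lines of Python. The following is a minimal sketch, not part of this repo: the helper names `load_jsonl` and `load_splits` are hypothetical, it assumes the files sit together in one local directory, and it makes no assumption about the field names inside each record.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of parsed records, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_splits(data_dir):
    """Load whichever of train/train_few/test .jsonl files exist in data_dir.

    The directory layout is an assumption; a missing file is simply skipped
    (e.g. a dataset whose training set is not released).
    """
    splits = {}
    for name in ("train", "train_few", "test"):
        path = Path(data_dir) / f"{name}.jsonl"
        if path.exists():
            splits[name] = load_jsonl(path)
    return splits
```

For example, `load_splits("data/hoc")` would return a dict keyed by split name, where `data/hoc` is a placeholder for wherever the files were saved.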

Training Data Generation

First, apply for an OpenAI API key if you don't have one yet. Then, replace YOUR_API_KEY in clingen.py with your own API key. Finally, run bash run_clingen.sh with your chosen dataset name and keyword type.
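
If you would rather not paste the key into a source file, one option is to read it from an environment variable. The following is a sketch of that alternative and is not how clingen.py is written: the helper `resolve_api_key` is hypothetical, and the variable name `OPENAI_API_KEY` is an assumption chosen to match the common convention.

```python
import os

PLACEHOLDER = "YOUR_API_KEY"  # the placeholder string mentioned above


def resolve_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise if it is unset.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(
            f"Set the {var_name} environment variable to your OpenAI API key."
        )
    return key
```

The returned value could then be assigned wherever YOUR_API_KEY currently appears, which keeps the key out of version control.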

Questions?

Feel free to contact ran.xu at emory.edu with any questions regarding this repo. Please describe the problem in detail so we can help you better and faster!

Citation

If you find this repository helpful, please kindly consider citing the corresponding paper. Thanks in advance!

@inproceedings{xu2024knowledge,
  title={Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models},
  author={Xu, Ran and Cui, Hejie and Yu, Yue and Kan, Xuan and Shi, Wenqi and Zhuang, Yuchen and Jin, Wei and Ho, Joyce and Yang, Carl},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2024},
  year={2024}
}