PhenoTagger-Updates

An Improved Method for Phenotype Concept Recognition Using Rich HPO Information



Dependency package

PhenoTagger has been tested with Python 3.9.19 on CentOS and runs on both CPU and GPU. Its dependencies are listed in requirements.txt.

To install all dependencies automatically, run:

$ pip install -r requirements.txt
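
To confirm the environment is usable before tagging, a quick check like the following can help (a minimal sketch, assuming the TensorFlow backend used by the original PhenoTagger; adapt the import if this repository uses a different framework):

# check_env.py -- environment sanity check (illustrative only)
import sys

print("Python:", sys.version.split()[0])  # expect 3.9.x

try:
    import tensorflow as tf  # assumption: the deep-learning backend
    print("TensorFlow:", tf.__version__)
    print("GPUs visible:", tf.config.list_physical_devices("GPU"))
except ImportError:
    print("TensorFlow is not installed; run: pip install -r requirements.txt")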

Data and model preparation

  1. To run this code, create a folder named "models" in the PhenoTagger folder, then download the model files into it.

    • First, download the original files of the pre-trained language models (PLMs): Bioformer, BioBERT, PubMedBERT
    • Then download the fine-tuned model files for HPO here. We provide BioBERT and Bioformer models for tagging.
  2. The two typo corpora are provided in */data/

Tagging

You can identify HPO concepts in biomedical texts using the tagging.py file.

The file requires 3 parameters:

  • --modeltype, -m, help="the model type (pubmedbert, biobert, or bioformer)"
  • --input, -i, help="the input prediction file"
  • --output, -o, help="output folder to save the tagged results"

Example:

$ CUDA_VISIBLE_DEVICES=0 python tagging.py -m biobert -i ../data/GSC_2024_test.tsv -o ../results/GSC_2024_test_biobert.tsv

We also provide some optional parameters in the tagging.py file for different user requirements.

para_set={
'onlyLongest':False,  # False: return overlapping concepts; True: return only the longest concept among overlapping concepts
'abbrRecog':False,    # False: do not identify abbreviations; True: identify abbreviations
'negation':False,     # True: enable negation detection
'ML_Threshold':0.95,  # the threshold of the deep learning model
}

Note: If you use typo data for noise detection, we recommend replacing bioTag() with bioTag_ml() in the recognition function.
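
To call the tagger from your own Python code rather than through tagging.py, the flow below is a minimal sketch. The module and function names (dic_ner.dic_ont, nn_model.bioTag_BERT, tagging_text.bioTag), the dictionary paths, and the result layout follow the original PhenoTagger source and are assumptions for this repository:

# Illustrative library-style tagging; names and paths are assumptions
# based on the original PhenoTagger layout.
from dic_ner import dic_ont
from nn_model import bioTag_BERT      # assumed BioBERT wrapper class
from tagging_text import bioTag

ontfiles = {'dic_file': '../dict/noabb_lemma.dic',
            'word_hpo_file': '../dict/word_id_map.json',
            'hpo_word_file': '../dict/id_word_map.json'}
biotag_dic = dic_ont(ontfiles)

nn_model = bioTag_BERT({'labelfile': '../dict/lable.vocab'})  # hypothetical config
nn_model.load_model('../models/biobert_hpo.h5')               # hypothetical file

text = "The patient presented with microcephaly and recurrent seizures."
tag_result = bioTag(text, biotag_dic, nn_model,
                    onlyLongest=False, abbrRecog=True, Threshold=0.95)
# Each result is assumed to be [start, end, hpo_id, score];
# use bioTag_ml() instead when working with typo data, as noted above.
for start, end, hpo_id, score in tag_result:
    print(text[int(start):int(end)], hpo_id, score)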

Training

1. Make typo data for training using the Build_typo_train_data.py file (a sketch of the typo-generation idea follows this step)

The file requires 2 parameters:

  • --input, -i, help="Input ontology path."
  • --output, -o, help="Output typo_ontology path."

Example:

$ python Build_typo_train_data.py -i ../ontology/hp20240208.obo -o ../ontology/typo_hpo.obo

After the program is finished, 1 file will be generated in the output path:

  • typo_hpo.obo
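
Conceptually, typo generation applies small character-level edits to ontology term strings to simulate misspellings. The sketch below illustrates that idea; the actual edit operations, rates, and .obo handling in Build_typo_train_data.py may differ:

# Illustrative character-level typo injection; the script's real edit
# operations and rates are assumptions here.
import random

def inject_typo(term, rng):
    """Apply one random edit: deletion, adjacent swap, or substitution."""
    if len(term) < 4:
        return term
    i = rng.randrange(1, len(term) - 1)
    op = rng.choice(['delete', 'swap', 'substitute'])
    if op == 'delete':
        return term[:i] + term[i + 1:]
    if op == 'swap':
        return term[:i] + term[i + 1] + term[i] + term[i + 2:]
    return term[:i] + rng.choice('abcdefghijklmnopqrstuvwxyz') + term[i + 1:]

rng = random.Random(42)
print(inject_typo("microcephaly", rng))  # prints a perturbed variant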

2. Build the ontology dictionary using the Build_dict.py file

The file requires 3 parameters:

  • --input, -i, help="input the ontology .obo file"
  • --output, -o, help="the output folder of dictionary"
  • --rootnode, -r, help="input the root node of the ontology"

Example:

$ python Build_dict.py -i ../ontology/hp.obo -o ../dict/ -r HP:0000118

After the program is finished, 6 files will be generated in the output folder.

  • id_word_map.json
  • lable.vocab
  • noabb_lemma.dic
  • obo.json
  • word_id_map.json
  • alt_hpoid.json
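
To spot-check the build, the JSON outputs can be loaded directly. A minimal sketch; the exact key layout inside each file is an assumption:

# Peek at the generated dictionary files (key layouts are assumed).
import json

with open('../dict/word_id_map.json', encoding='utf-8') as f:
    word_id_map = json.load(f)   # assumed: term string -> HPO id(s)
with open('../dict/id_word_map.json', encoding='utf-8') as f:
    id_word_map = json.load(f)   # assumed: HPO id -> term strings

print(len(word_id_map), "terms,", len(id_word_map), "HPO concepts")
for term, hpo in list(word_id_map.items())[:3]:
    print(term, '->', hpo)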

3. Build the distant supervised training dataset using the Build_distant_corpus.py file

The file requires 4 parameters:

  • --dict, -d, help="the input folder of the ontology dictionary"
  • --fileneg, -f, help="the text file used to generate the negatives" (you can use our negative text "mutation_disease.txt")
  • --negnum, -n, help="the number of negatives; we suggest using the same number as the positives."
  • --output, -o, help="the output folder of the distantly-supervised training dataset"

Example:

$ python Build_distant_corpus.py -d ../dict/ -f ../data/mutation_disease.txt -n 50000 -o ../data/distant_train_data/

After the program is finished, 3 files will be generated in the output folder:

  • distant_train.conll (distantly-supervised training data)
  • distant_train_pos.conll (distantly-supervised training positives)
  • distant_train_neg.conll (distantly-supervised training negatives)
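
The .conll files use the usual one-token-per-line layout with a blank line between sentences. A small reader like the sketch below is handy for spot checks (the exact column layout is an assumption):

# Count sentences and tokens in a CoNLL-style file (column layout assumed:
# whitespace-separated token and label, blank line between sentences).
def read_conll(path):
    sentences, current = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if line:
                current.append(line.split())
            elif current:
                sentences.append(current)
                current = []
    if current:
        sentences.append(current)
    return sentences

sents = read_conll('../data/distant_train_data/distant_train.conll')
print(len(sents), "sentences,", sum(len(s) for s in sents), "tokens")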

4. Training Ontology Vector

The ontology vectors are trained using TransE.py and TransR.py. For the ConvE method, please refer to https://github.com/TimDettmers/ConvE.

After training, the vectors are processed with emb_process.py for format handling.

Example:

$ python TransE.py
$ python emb_process.py
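
TransE represents each relation as a translation in embedding space: a triple (h, r, t) is scored by the distance ||h + r - t||, and training pushes true triples to score lower than corrupted ones by a margin. The sketch below shows that objective in NumPy; TransE.py's actual training loop, data loading, and hyperparameters are not reproduced here:

# Minimal TransE score and margin loss in NumPy (illustrative only).
import numpy as np

def transe_score(h, r, t):
    """Distance ||h + r - t||; lower means a more plausible triple."""
    return np.linalg.norm(h + r - t)

def margin_loss(pos, neg, margin=1.0):
    """Rank a true triple above a corrupted one by at least the margin."""
    return max(0.0, margin + transe_score(*pos) - transe_score(*neg))

rng = np.random.default_rng(0)
dim = 50
h, r, t = (rng.normal(size=dim) for _ in range(3))  # e.g. child, is_a, parent
t_corrupt = rng.normal(size=dim)                    # randomly corrupted tail
print(margin_loss((h, r, t), (h, r, t_corrupt)))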

5. Training using the training.py file

The file requires 4 parameters:

  • --trainfile, -t, help="the training file"
  • --devfile, -d, help="the development set file. If no dev file is provided, training stops after the specified number of epochs" (see the stopping sketch after the example below)
  • --modeltype, -m, help="the deep learning model type (cnn, biobert, pubmedbert, or bioformer)"
  • --output, -o, help="the output folder of the model"

Example:

$ CUDA_VISIBLE_DEVICES=0 python training.py -t ../data/distant_train_data/distant_train.conll -d ../data/corpus/GSC/GSC-2024_dev.tsv -m biobert -o ../models/
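
The dev-file behavior corresponds to standard early stopping: train until dev-set performance stops improving, otherwise run the fixed number of epochs. A schematic of that control flow; the metric, patience, and stubbed steps here are assumptions, not training.py's actual code:

# Schematic early-stopping loop (illustrative; stubs stand in for real steps).
import random

def run_one_epoch():            # stub for one pass over the training data
    pass

def evaluate_f1(dev_data):      # stub: real code would compute dev-set F1
    return random.random()

def train(max_epochs=50, patience=5, dev_data=None):
    best_f1, bad_epochs = -1.0, 0
    for epoch in range(max_epochs):
        run_one_epoch()
        if dev_data is None:
            continue                        # no dev file: stop at max_epochs
        f1 = evaluate_f1(dev_data)
        if f1 > best_f1:
            best_f1, bad_epochs = f1, 0     # improvement: keep this model
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                       # dev F1 stopped improving
    return best_f1

train(dev_data=[])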