PhenoTagger has been tested using Python 3.9.19 on CentOS, on both CPU and GPU. Its dependencies are listed in requirements.txt.
To install all dependencies automatically, run:
$ pip install -r requirements.txt
To run this code, you need to create a model folder named "models" in the PhenoTagger folder, then download the model files into the model folder.
- First, download the original files of the pre-trained language models (PLMs): Bioformer, BioBERT, and PubMedBERT.
- Then download the fine-tuned model files for HPO from here. We provide BioBERT and Bioformer models for tagging.
The two typo corpora are provided in */data/.
You can identify HPO concepts in biomedical texts using the tagging.py script.
The file requires 3 parameters:
- --modeltype, -m, help="the model type (pubmedbert or biobert or bioformer?)"
- --input, -i, help="the input prediction file"
- --output, -o, help="output folder to save the tagged results"
Example:
$ CUDA_VISIBLE_DEVICES=0 python tagging.py -m biobert -i ../data/GSC_2024_test.tsv -o ../results/GSC_2024_test_biobert.tsv
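To tag the same input with more than one model, the command above can be wrapped in a small script. The following is a minimal sketch, not part of PhenoTagger: the helper names `build_tagging_command` and `run_tagging` are hypothetical, and the output naming simply follows the example path above. As written it only prints the commands (a dry run); call `run_tagging` to execute them from the folder that contains tagging.py.

```python
import os
import subprocess
import sys

# Model types with fine-tuned tagging models, per the download notes above.
MODEL_TYPES = ("biobert", "bioformer")


def build_tagging_command(model_type, input_path, output_path):
    """Return the argv list for one tagging.py run, using only the documented flags."""
    return [sys.executable, "tagging.py",
            "-m", model_type, "-i", input_path, "-o", output_path]


def run_tagging(model_type, input_path, output_path, gpu="0"):
    """Run tagging.py on one GPU (hypothetical helper; raises if the run fails)."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    command = build_tagging_command(model_type, input_path, output_path)
    subprocess.run(command, env=env, check=True)


# Dry run: print each command instead of executing it.
for model_type in MODEL_TYPES:
    out = f"../results/GSC_2024_test_{model_type}.tsv"
    command = build_tagging_command(model_type, "../data/GSC_2024_test.tsv", out)
    print("python " + " ".join(command[1:]))
```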
tagging.py also provides optional parameters, set in the para_set dictionary, to suit different user requirements:
para_set={
'onlyLongest':False, # False: return overlapping concepts; True: return only the longest of the overlapping concepts
'abbrRecog':False, # False: don't identify abbreviations; True: identify abbreviations
'negation':False, # True: negation detection
'ML_Threshold':0.95, # the threshold of the deep learning model
}
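To make the effect of 'onlyLongest' and 'ML_Threshold' concrete, here is a minimal sketch of how such options could filter predictions. It is an illustration under stated assumptions, not PhenoTagger's actual implementation: the function `filter_predictions`, the tuple layout `(start, end, concept_id, score)`, the `>=` comparison, and the greedy longest-first strategy are all hypothetical, and the concept IDs are placeholders.

```python
def filter_predictions(spans, only_longest=False, threshold=0.95):
    """Filter (start, end, concept_id, score) spans; `end` is exclusive.

    Assumed behaviour: drop spans scoring below `threshold`; if
    `only_longest` is set, keep just the longest span in each overlapping group.
    """
    kept = [s for s in spans if s[3] >= threshold]
    if not only_longest:
        return sorted(kept)

    # Visit longest spans first; break ties by higher score.
    kept.sort(key=lambda s: (-(s[1] - s[0]), -s[3]))
    result = []
    for span in kept:
        overlaps = any(span[0] < r[1] and r[0] < span[1] for r in result)
        if not overlaps:
            result.append(span)
    return sorted(result)


predictions = [
    (0, 12, "ID_A", 0.99),   # long mention
    (6, 12, "ID_B", 0.97),   # nested inside the first span
    (20, 30, "ID_C", 0.80),  # below the threshold
]
print(filter_predictions(predictions, only_longest=False))
print(filter_predictions(predictions, only_longest=True))
```

With the default settings both overlapping spans survive and only the low-scoring one is dropped; with `only_longest=True` the nested span is removed as well.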
Note: If you use typo data for noise detection, we recommend replacing bioTag() in the recognition function with bioTag_ml().
Build_typo_train_data.py requires 2 parameters:
- --input, -i, help="Input ontology path."
- --output, -o, help="Output typo_ontology path."
Example:
$ python Build_typo_train_data.py -i ../ontology/hp20240208.obo -o ../ontology/typo_hpo.obo
After the program finishes, 1 file will be generated in the output path:
- typo_hpo.obo
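For intuition about what a typo-augmented term might look like, the sketch below applies one random character-level edit to a term. This is only an assumed noise model for illustration: the function `make_typo` and its three edit operations are hypothetical and may differ from what Build_typo_train_data.py actually does.

```python
import random


def make_typo(term, rng):
    """Return `term` with one random character edit (hypothetical noise model)."""
    if len(term) < 2:
        return term
    i = rng.randrange(len(term) - 1)
    operation = rng.choice(["delete", "swap", "duplicate"])
    if operation == "delete":
        # Drop the character at position i.
        return term[:i] + term[i + 1:]
    if operation == "swap":
        # Exchange the characters at positions i and i + 1.
        return term[:i] + term[i + 1] + term[i] + term[i + 2:]
    # Duplicate the character at position i.
    return term[:i] + term[i] + term[i:]


# A fixed seed per call keeps the generated variants reproducible.
for seed in range(3):
    print(make_typo("seizure", random.Random(seed)))
```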
Build_dict.py requires 3 parameters:
- --input, -i, help="input the ontology .obo file"
- --output, -o, help="the output folder of dictionary"
- --rootnode, -r, help="input the root node of the ontology"
Example:
$ python Build_dict.py -i ../ontology/hp.obo -o ../dict/ -r HP:0000118
After the program finishes, 6 files will be generated in the output folder:
- id_word_map.json
- lable.vocab
- noabb_lemma.dic
- obo.json
- word_id_map.json
- alt_hpoid.json
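The --rootnode option restricts the dictionary to one branch of the ontology. The sketch below shows one way to collect a root term and all of its descendants from OBO text by following `is_a` lines. It is a simplified illustration, not the logic of Build_dict.py: the functions `parse_obo` and `descendants` are hypothetical, the sample ontology uses made-up IDs, and synonyms, obsolete terms, and other OBO tags are ignored.

```python
from collections import defaultdict

# Tiny made-up ontology in OBO syntax (IDs are placeholders, not real HPO terms).
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: T:1
name: root

[Term]
id: T:2
name: branch
is_a: T:1 ! root

[Term]
id: T:3
name: leaf
is_a: T:2 ! branch

[Term]
id: T:4
name: unrelated
"""


def parse_obo(text):
    """Return (names, children) built from the [Term] stanzas of OBO text."""
    names = {}
    children = defaultdict(set)
    term_id = None
    in_term = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            term_id = None
        elif not in_term:
            continue
        elif line.startswith("id: "):
            term_id = line[4:]
        elif line.startswith("name: ") and term_id:
            names[term_id] = line[6:]
        elif line.startswith("is_a: ") and term_id:
            # "is_a: PARENT ! comment" -> keep only the parent ID.
            parent = line[6:].split("!")[0].strip()
            children[parent].add(term_id)
    return names, children


def descendants(root, children):
    """Return the root plus every term reachable through is_a links."""
    seen = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


names, children = parse_obo(SAMPLE_OBO)
selected = descendants("T:1", children)
print(sorted(names[term] for term in selected))
```

Here the term named "unrelated" is excluded because it is not under the chosen root, which mirrors the idea of passing HP:0000118 as the root node in the example above.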
Build_distant_corpus.py requires 4 parameters:
- --dict, -d, help="the input folder of the ontology dictionary"
- --fileneg, -f, help="the text file used to generate the negatives" (You can use our negative text "mutation_disease.txt")
- --negnum, -n, help="the number of negatives, we suggest that the number is the same as the number of positives."
- --output, -o, help="the output folder of the distantly-supervised training dataset"
Example:
$ python Build_distant_corpus.py -d ../dict/ -f ../data/mutation_disease.txt -n 50000 -o ../data/distant_train_data/
After the program finishes, 3 files will be generated in the output folder:
- distant_train.conll (distantly-supervised training data)
- distant_train_pos.conll (distantly-supervised training positives)
- distant_train_neg.conll (distantly-supervised training negatives)
The ontology vectors were trained using TransE.py and TransR.py. For the ConvE method, please refer to [https://github.com/TimDettmers/ConvE].
After training, the vectors were converted to the required format using emb_process.py.
Example:
$ python TransE.py
$ python emb_process.py
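As background on what TransE learns, the sketch below shows the standard TransE scoring idea: a triple (head, relation, tail) is plausible when head + relation lies close to tail, and training pushes true triples to score better than corrupted ones by a margin. This is a generic illustration of the published method in plain Python; the function names and toy vectors are hypothetical, and TransE.py in this repository may differ in its details.

```python
import math


def transe_distance(head, relation, tail):
    """L2 norm of head + relation - tail; smaller means a more plausible triple."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def margin_loss(positive_distance, negative_distance, margin=1.0):
    """Margin ranking loss: zero once the true triple beats the corrupted one by `margin`."""
    return max(0.0, margin + positive_distance - negative_distance)


# Toy 2-dimensional embeddings (made up for illustration).
child = [0.0, 0.0]
is_a = [1.0, 0.0]
parent = [1.0, 0.0]
corrupted_parent = [0.0, 1.0]

positive = transe_distance(child, is_a, parent)
negative = transe_distance(child, is_a, corrupted_parent)
print(positive, negative, margin_loss(positive, negative))
```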
training.py requires 4 parameters:
- --trainfile, -t, help="the training file"
- --devfile, -d, help="the development set file. If the dev file is not provided, training will stop at the specified EPOCH"
- --modeltype, -m, help="the deep learning model type (cnn, biobert, pubmedbert or bioformer?)"
- --output, -o, help="the output folder of the model"
Example:
$ CUDA_VISIBLE_DEVICES=0 python training.py -t ../data/distant_train_data/distant_train.conll -d ../data/corpus/GSC/GSC-2024_dev.tsv -m biobert -o ../models/