Add instructions for GENIA.
sheng-msft committed Apr 18, 2023
1 parent f222b77 commit 54c74ba
Showing 3 changed files with 58 additions and 1 deletion.
39 changes: 39 additions & 0 deletions conf/genia.json
@@ -0,0 +1,39 @@
{
"run_name": "base-run",
"dataset_name": "GENIA",
"model_name_or_path": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract",
"train_file": "data/genia/train.json",
"validation_file": "data/genia/dev.json",
"test_file": "data/genia/test.json",
"entity_type_file": "entity_types.json",
"dataset_entity_types": ["DNA", "RNA", "protein", "cell_line", "cell_type"],
"report_to": "none",
"max_seq_length": 128,
"use_span_width_embedding": true,
"doc_stride": 16,
"preprocessing_num_workers": 6,
"dataloader_num_workers": 6,
"overwrite_cache": false,
"overwrite_output_dir": true,
"do_train": true,
"fp16": true,
"gradient_checkpointing": true,
"optim": "adamw_torch",
"label_names": ["ner"],
"num_train_epochs": 20,
"per_device_train_batch_size": 8,
"gradient_accumulation_steps": 2,
"learning_rate": 3e-5,
"logging_steps": 50,
"load_best_model_at_end": true,
"metric_for_best_model": "f1",
"greater_is_better": true,
"save_strategy": "steps",
"save_steps": 100,
"save_total_limit": 1,
"do_eval": true,
"evaluation_strategy": "steps",
"eval_steps": 100,
"do_predict": true,
"output_dir": "/tmp/genia"
}
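
The keys in this config appear to follow Hugging Face `TrainingArguments`, so two properties are worth checking before a run: the effective batch size, and whether checkpoint saving lines up with evaluation (needed when `load_best_model_at_end` is true). The sketch below is only an illustration under that assumption; the helper names `effective_batch_size` and `checkpoints_align` are hypothetical and not part of this repository.

```python
import json

# Subset of conf/genia.json that affects batching and checkpointing;
# the values are copied from the config above.
CONFIG_TEXT = """
{
  "per_device_train_batch_size": 8,
  "gradient_accumulation_steps": 2,
  "save_strategy": "steps",
  "save_steps": 100,
  "evaluation_strategy": "steps",
  "eval_steps": 100,
  "load_best_model_at_end": true
}
"""


def effective_batch_size(cfg, num_devices=1):
    """Examples contributing to one optimizer step (hypothetical helper)."""
    return (
        cfg["per_device_train_batch_size"]
        * cfg["gradient_accumulation_steps"]
        * num_devices
    )


def checkpoints_align(cfg):
    """True if every saved checkpoint also has an evaluation score.

    Assumption: the config is consumed by Hugging Face TrainingArguments,
    which expects matching strategies and save_steps to be a multiple of
    eval_steps when load_best_model_at_end is enabled.
    """
    if not cfg.get("load_best_model_at_end"):
        return True
    return (
        cfg["save_strategy"] == cfg["evaluation_strategy"]
        and cfg["save_steps"] % cfg["eval_steps"] == 0
    )


cfg = json.loads(CONFIG_TEXT)
print("effective batch size on one device:", effective_batch_size(cfg))
print("save/eval schedule consistent:", checkpoints_align(cfg))
```

With the values above, one device sees 8 × 2 = 16 examples per optimizer step; the number scales with the device count.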
13 changes: 13 additions & 0 deletions data_preproc/README.md
@@ -15,6 +15,7 @@ The data preprocessing code is adapted from [DyGIE](https://github.com/luanyi/Dy
* ACE 2004 (https://catalog.ldc.upenn.edu/LDC2005T09)
* ACE 2005 (https://catalog.ldc.upenn.edu/LDC2006T06)
* CoNLL 2003 (We use the preprocessed version from [Yu et al., 2020](https://github.com/juntaoy/biaffine-ner/issues/16))
* GENIA (We use the preprocessed version from [Yu et al., 2020](https://github.com/juntaoy/biaffine-ner/issues/17))

# Usage

@@ -74,4 +75,16 @@ python convert_to_hf_ds_format.py conll2003/dev.json ${CoNLL2003}/dev.json --tas
python convert_to_hf_ds_format.py conll2003/test.json ${CoNLL2003}/test.json --task conll2003
```

### GENIA
```bash
cd genia
split -l 1599 train_dev.genia.jsonlines train_dev.split.
cd ..
mkdir -p ../data/genia
python convert_to_hf_ds_format.py genia/train_dev.split.aa ../data/genia/train.json --task conll2003
python convert_to_hf_ds_format.py genia/train_dev.split.ab ../data/genia/dev.json --task conll2003
python convert_to_hf_ds_format.py genia/test.genia.jsonlines ../data/genia/test.json --task conll2003
```
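
The `split -l 1599` step above cuts `train_dev.genia.jsonlines` into consecutive chunks of at most 1599 lines, named with the suffixes `aa`, `ab`, and so on. The commands then treat `.aa` as train and `.ab` as dev, which implicitly assumes the file has no more than 3198 lines so that exactly two chunks are produced. A minimal Python sketch of that chunking, with hypothetical helper names and a made-up 2000-line input used purely for illustration:

```python
from itertools import islice


def split_lines(lines, chunk_size):
    """Yield consecutive chunks of at most chunk_size lines, like `split -l`."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def split_suffix(index):
    """Two-letter suffix for the index-th chunk: 0 -> 'aa', 1 -> 'ab', ..."""
    if not 0 <= index < 26 * 26:
        raise ValueError("two-letter suffixes cover only 676 chunks")
    return chr(ord("a") + index // 26) + chr(ord("a") + index % 26)


# Hypothetical input: 2000 placeholder lines, NOT the real GENIA line count.
example_lines = [f"doc-{i}" for i in range(2000)]
chunks = list(split_lines(example_lines, 1599))

for index, chunk in enumerate(chunks):
    print(f"train_dev.split.{split_suffix(index)}: {len(chunk)} lines")

# A third chunk (suffix 'ac') would mean some documents are silently unused.
if len(chunks) > 2:
    print("warning: more than two chunks; extra data is not converted")
```

If the real file ever produced a third chunk, those lines would not be passed to `convert_to_hf_ds_format.py` by the commands above.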


If you want to use other datasets, please convert them into the same format as above.
7 changes: 6 additions & 1 deletion entity_types.json
@@ -8,4 +8,9 @@
{"dataset": "CoNLL2003", "name": "PER", "description": "a person entity is limited to human including a single individual or a group.", "description_source": "ACE2005 annotation guidelines"}
{"dataset": "CoNLL2003", "name": "ORG", "description": "organization entities are limited to companies, corporations, agencies, institutions and other groups of people.", "description_source": "ACE2005 annotation guidelines"}
{"dataset": "CoNLL2003", "name": "LOC", "description": "location entities are limited to geographical entities such as geographical areas and landmasses, mountains, bodies of water, and geological formations.", "description_source": "ACE2005 annotation guidelines"}
{"dataset": "CoNLL2003", "name": "MISC", "description": "examples of miscellaneous entities include events, nationalities, products and works of art.", "description_source": "MRC4NER Github repo"}
{"dataset": "CoNLL2003", "name": "MISC", "description": "examples of miscellaneous entities include events, nationalities, products and works of art.", "description_source": "MRC4NER Github repo"}
{"dataset": "GENIA", "name": "DNA", "description": "DNA, which consists of a polysugar-phosphate backbone possessing projections of purines and pyrimidines, forms a double helix that is held together by hydrogen bonds between these purines and pyrimidines.", "description_source": "UMLS definition"}
{"dataset": "GENIA", "name": "RNA", "description": "RNA is a polynucleotide consisting essentially of chains with a repeating backbone of phosphate and ribose units to which nitrogenous bases are attached.", "description_source": "UMLS definition"}
{"dataset": "GENIA", "name": "protein", "description": "Protein is Linear POLYPEPTIDES that are synthesized on RIBOSOMES and may be further modified, crosslinked, cleaved, or assembled into complex proteins with several subunits. The specific sequence of AMINO ACIDS determines the shape the polypeptide will take, during PROTEIN FOLDING, and the function of the protein.", "description_source": "UMLS definition"}
{"dataset": "GENIA", "name": "cell_line", "description": "Cell line are cells propagated in vitro in special media conducive to their growth. Cultured cells are used to study developmental, morphologic, metabolic, physiologic, and genetic processes, among others.", "description_source": "UMLS definition"}
{"dataset": "GENIA", "name": "cell_type", "description": "Cells are the fundamental, structural, and functional units or subunits of living organisms. They are composed of CYTOPLASM containing various ORGANELLES and a CELL MEMBRANE boundary.", "description_source": "UMLS definition"}
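
Each name listed under `dataset_entity_types` in `conf/genia.json` presumably needs a matching GENIA row in `entity_types.json`. The sketch below checks that, assuming the file holds one JSON object per line, as the diff suggests. The helper names are hypothetical, and the descriptions are shortened placeholders rather than the committed text.

```python
import json

# Abbreviated stand-in for entity_types.json; "placeholder" replaces the
# real descriptions, which are not needed for this check.
ENTITY_TYPES_JSONL = """\
{"dataset": "CoNLL2003", "name": "MISC", "description": "placeholder"}
{"dataset": "GENIA", "name": "DNA", "description": "placeholder"}
{"dataset": "GENIA", "name": "RNA", "description": "placeholder"}
{"dataset": "GENIA", "name": "protein", "description": "placeholder"}
{"dataset": "GENIA", "name": "cell_line", "description": "placeholder"}
{"dataset": "GENIA", "name": "cell_type", "description": "placeholder"}
"""

# Copied from "dataset_entity_types" in conf/genia.json.
CONFIGURED_TYPES = ["DNA", "RNA", "protein", "cell_line", "cell_type"]


def names_for_dataset(jsonl_text, dataset):
    """Return entity type names recorded for one dataset, in file order."""
    names = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record["dataset"] == dataset:
            names.append(record["name"])
    return names


def missing_descriptions(jsonl_text, dataset, configured):
    """Return configured type names that have no row for the dataset."""
    available = set(names_for_dataset(jsonl_text, dataset))
    return [name for name in configured if name not in available]


genia_names = names_for_dataset(ENTITY_TYPES_JSONL, "GENIA")
print("GENIA types:", genia_names)
print("missing:", missing_descriptions(ENTITY_TYPES_JSONL, "GENIA", CONFIGURED_TYPES))
```

The comparison is case-sensitive, so `protein` and `Protein` would count as different names.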