Creating Custom XGBoost Models for TOGA
TOGA uses a training dataset to train its XGBoost models. The dataset is a tab-separated file (.tsv) that contains various features related to genes and chains. Each line corresponds to one gene/chain intersection.
The dataset includes the following fields:
- Gene or transcript identifiers
- Chain identifiers and their associated features such as synteny, alignment score, length, etc.
- Gene or transcript features such as length, number of exons, etc.
- Intersection features such as the number of exon and intron bases intersecting chain aligning blocks
- Target class (ortholog or paralog)

Please note that not all of these features are necessary for training the model. The required features vary depending on whether the model is for single-exon (SE) genes or multi-exon (ME) ones.
Please see models/trains.tsv and train_model.py to check exactly how it works.
Features for the multi-exon (ME) model:
- Global CDS fraction (gl_exo)
- Local CDS fraction (loc_exo)
- Flank fraction (flank_cov)
- Chain synteny log10 (synt_log)
- Intron coverage percentage (intr_perc)

Features for the single-exon (SE) model:
- Global CDS fraction (gl_exo)
- Flank fraction (flank_cov)
- Exon coverage percentage (exon_perc)
- Chain synteny log10 (synt_log)
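As a rough illustration of how the two feature sets might be pulled out of a training table, the sketch below reads a TSV with Python's csv module and builds a feature matrix and label vector. The feature abbreviations come from the lists above. Assigning the five-feature list to multi-exon genes and the four-feature list to single-exon genes is inferred from the intron coverage feature. The label column name `class`, the label strings, and the helper `load_features` are assumptions for illustration, so check train_model.py for the names TOGA actually uses.

```python
import csv
import io

# Feature abbreviations taken from the lists above.
ME_FEATURES = ["gl_exo", "loc_exo", "flank_cov", "synt_log", "intr_perc"]
SE_FEATURES = ["gl_exo", "flank_cov", "exon_perc", "synt_log"]


def load_features(handle, features, label_column="class"):
    """Return (X, y) from a tab-separated training table.

    label_column and the "ortholog" label string are assumed names,
    not necessarily those used in TOGA's own training file.
    """
    X, y = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        X.append([float(row[name]) for name in features])
        y.append(1 if row[label_column] == "ortholog" else 0)
    return X, y


# Tiny made-up table, for demonstration only.
demo = io.StringIO(
    "gene\tchain\tgl_exo\tflank_cov\texon_perc\tsynt_log\tclass\n"
    "g1\t10\t0.9\t0.8\t0.95\t1.2\tortholog\n"
    "g1\t11\t0.1\t0.0\t0.40\t0.0\tparalog\n"
)
X_se, y_se = load_features(demo, SE_FEATURES)
```

The resulting lists can then be passed to a classifier of your choice, for example `xgboost.XGBClassifier`.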
To create your custom training datasets for TOGA, you can use the following resources:
- Reference annotation in bed-12 format
- Dataset containing chain features. You can obtain this by calling toga.py with the --sac flag. This dataset will be saved into the $project_dir/chain_results_df.tsv file.
- Biomart output (or any other reliable source of information) containing orthology data.
An example for rat would contain fields such as Gene stable ID, Transcript stable ID, Rat gene stable ID, Rat gene name, Rat homology type, Rat protein or transcript stable ID. Feel free to use any subset of features you like and/or create your own.
The training dataset should be structured as: gene_ID + chain_ID + features + class. The most interesting classes for classification are "ortholog" and "paralog".
(!) Please note that TOGA classifies neither genes nor chains but gene/chain intersections. Therefore, your custom training datasets should reflect this structure for the most effective results.
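The gene_ID + chain_ID + features + class layout can be sketched as a join between chain features and known labels. The example below is a minimal sketch under assumptions: the column names `gene` and `chain`, the `labels` dictionary keyed by (gene, chain) pairs, and the helper `build_training_rows` are hypothetical. Deriving the labels from Biomart or another source is left out.

```python
import csv
import io


def build_training_rows(features_handle, labels):
    """Yield one dict per labelled gene/chain intersection."""
    for row in csv.DictReader(features_handle, delimiter="\t"):
        key = (row["gene"], row["chain"])
        if key not in labels:
            continue  # skip intersections whose class is unknown
        out = dict(row)
        out["class"] = labels[key]
        yield out


# Made-up chain features; real column names may differ.
features_tsv = io.StringIO(
    "gene\tchain\tsynt_log\tgl_exo\n"
    "g1\t10\t1.2\t0.9\n"
    "g1\t11\t0.0\t0.1\n"
    "g2\t10\t1.2\t0.0\n"
)
# Hypothetical labels: the same gene is an ortholog on one chain
# and a paralog on another.
labels = {("g1", "10"): "ortholog", ("g1", "11"): "paralog"}

rows = list(build_training_rows(features_tsv, labels))

# Write the labelled rows back out as a tab-separated table.
out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)
training_tsv = out.getvalue()
```

Here gene `g1` appears twice with different classes because each row describes a gene/chain intersection, not a gene or a chain on its own.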