
Creating Custom XGBoost Models for TOGA

Bogdan Kirilenko edited this page Jul 12, 2023 · 1 revision

Understanding the Training Dataset

TOGA uses a training dataset to train its XGBoost models. The dataset is a tab-separated file (.tsv) that contains various features related to genes and chains. Each line corresponds to one gene/chain intersection.

The dataset includes the following fields:

  • Gene or transcript identifiers
  • Chain identifiers and their associated features such as synteny, alignment score, length, etc.
  • Gene or transcript features such as length, number of exons, etc.
  • Intersection features such as number of exon and intron bases intersecting chain aligning blocks
  • Target class (ortholog or paralog)

Please note that not all of these features are necessary for training the model. The required features vary depending on whether the model is for single-exon (SE) genes or multi-exon (ME) ones.

Please see models/trains.tsv and train_model.py to check exactly how this works.
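To make the one-row-per-intersection layout concrete, here is a minimal sketch that parses a tab-separated table with the Python standard library. The two sample rows and the column names gene, chain and class are assumptions made for illustration; check the header of the actual training file for the real names.

```python
import csv
import io

# Hypothetical two-row excerpt; the real header names and values may differ.
SAMPLE_TSV = (
    "gene\tchain\tgl_exo\tflank_cov\tsynt_log\tclass\n"
    "GENE_A\t101\t0.95\t0.80\t2.1\tortholog\n"
    "GENE_A\t202\t0.40\t0.10\t0.0\tparalog\n"
)


def read_intersections(handle):
    """Return a list of dicts, one per gene/chain intersection."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_intersections(io.StringIO(SAMPLE_TSV))
for row in rows:
    # The same gene can appear several times, once per intersecting chain.
    print(row["gene"], row["chain"], row["class"])
```

To read a file on disk instead of the in-memory sample, pass an open file handle to read_intersections.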

Required Features for Model Training

For the multi-exon (ME) model, the required features are:

  • Global CDS fraction (gl_exo)
  • Local CDS fraction (loc_exo)
  • Flank fraction (flank_cov)
  • Chain synteny log10 (synt_log)
  • Intron coverage percentage (intr_perc)

For the single-exon (SE) model, the required features are:

  • Global CDS fraction (gl_exo)
  • Flank fraction (flank_cov)
  • Exon coverage percentage (exon_perc)
  • Chain synteny log10 (synt_log)
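The two feature sets above can be kept as plain lists and used to pull the right columns for each model. The sketch below is only an illustration: the helper feature_matrix and the sample row are hypothetical, and the column order simply follows this page, which is not necessarily the order train_model.py uses.

```python
# Feature names as listed on this page.
ME_FEATURES = ["gl_exo", "loc_exo", "flank_cov", "synt_log", "intr_perc"]
SE_FEATURES = ["gl_exo", "flank_cov", "exon_perc", "synt_log"]


def feature_matrix(rows, features):
    """Build float feature vectors, failing early if a required column is missing."""
    matrix = []
    for row in rows:
        missing = [name for name in features if name not in row]
        if missing:
            raise KeyError(f"missing required features: {missing}")
        matrix.append([float(row[name]) for name in features])
    return matrix


# Hypothetical single-exon row with made-up values.
se_rows = [
    {"gl_exo": "0.9", "flank_cov": "0.7", "exon_perc": "1.0", "synt_log": "1.5"}
]
se_matrix = feature_matrix(se_rows, SE_FEATURES)
print(se_matrix)
```

Passing the same row with ME_FEATURES raises a KeyError, because loc_exo and intr_perc are absent. That makes it harder to train a model on the wrong feature set by accident.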

Creating Your Custom Training Datasets

To create your custom training datasets for TOGA, you can use the following resources:

  • Reference annotation in bed-12 format
  • Dataset containing chain features. You can obtain this by calling toga.py with the --sac flag. This dataset will be saved into the $project_dir/chain_results_df.tsv file.
  • Biomart output (or any other reliable source of information) containing ortholog data.

For example, Biomart output for rat would contain fields such as Gene stable ID, Transcript stable ID, Rat gene stable ID, Rat gene name, Rat homology type, and Rat protein or transcript stable ID. Feel free to use any subset of features you like and/or create your own.

The training dataset should be structured as: gene_ID + chain_ID + features + class. The most interesting classes for classification are "ortholog" and "paralog".

(!) Please note that TOGA classifies neither genes nor chains but gene/chain intersections. Therefore, your custom training datasets should reflect this structure for the most effective results.
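As a rough sketch of the gene_ID + chain_ID + features + class layout, the example below attaches a class to each gene/chain intersection and writes the result as tab-separated text. Everything specific here is an assumption: the column names gene and chain, the sample values, and the labels mapping, which stands in for whatever procedure you use to derive per-intersection classes from your ortholog source. That derivation is outside the scope of this sketch.

```python
import csv
import io

# SE feature set from this page; swap in the ME list for multi-exon genes.
FEATURES = ["gl_exo", "flank_cov", "exon_perc", "synt_log"]


def build_training_rows(chain_rows, labels, features):
    """Keep only labelled gene/chain intersections and append their class."""
    out = []
    for row in chain_rows:
        key = (row["gene"], row["chain"])
        if key not in labels:
            continue  # unlabelled intersections cannot be used for training
        out.append([row["gene"], row["chain"]]
                   + [row[name] for name in features]
                   + [labels[key]])
    return out


def to_tsv(rows, features):
    """Serialise rows as gene_ID + chain_ID + features + class."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene", "chain"] + features + ["class"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical chain features, one dict per gene/chain intersection.
chain_rows = [
    {"gene": "GENE_A", "chain": "101", "gl_exo": "0.95",
     "flank_cov": "0.80", "exon_perc": "1.0", "synt_log": "2.1"},
    {"gene": "GENE_A", "chain": "202", "gl_exo": "0.40",
     "flank_cov": "0.10", "exon_perc": "0.5", "synt_log": "0.0"},
    {"gene": "GENE_B", "chain": "303", "gl_exo": "0.70",
     "flank_cov": "0.30", "exon_perc": "0.9", "synt_log": "1.2"},
]

# Hypothetical labels keyed by (gene, chain), not by gene or by chain alone.
labels = {("GENE_A", "101"): "ortholog", ("GENE_A", "202"): "paralog"}

training_rows = build_training_rows(chain_rows, labels, FEATURES)
tsv_text = to_tsv(training_rows, FEATURES)
print(tsv_text)
```

Keying the labels by the (gene, chain) pair mirrors the note above: the same gene can be an ortholog on one chain and a paralog on another, so a gene-level label alone is not enough.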