Building multiple codon alignments

Bogdan Kirilenko edited this page Jul 4, 2023 · 1 revision

To be corrected and structured.

Script to build multiple codon alignments from multiple TOGA outputs: https://github.com/hillerlab/TOGA/blob/master/supply/extract_codon_alignment.py

Using this issue for inspiration: https://github.com/hillerlab/TOGA/issues/76

The script can be used to build a reliable codon alignment of a particular transcript, using TOGA annotations from multiple species.
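As a rough illustration of how the three positional arguments fit together, the following Python sketch writes the list of TOGA result directories to a file and assembles a command line. Only the argument names (`input_dirs`, `reference_bed`, `transcript_id`, `--output`, `--skip_dups`) come from the script's help text. The directory paths, file names, and transcript ID are placeholders. The helper functions `write_dirs_list` and `build_command` are hypothetical, and the one-directory-per-line layout of the list file is an assumption that the help text does not confirm.

```python
import shlex
import sys
import tempfile
from pathlib import Path


def write_dirs_list(toga_dirs, list_path):
    """Write TOGA result directories to a list file.

    Assumption: one directory path per line.
    """
    Path(list_path).write_text("".join(f"{d}\n" for d in toga_dirs))
    return str(list_path)


def build_command(dirs_list, reference_bed, transcript_id,
                  output=None, skip_dups=False,
                  script="extract_codon_alignment.py"):
    """Assemble the argument vector; flag names are taken from the help text."""
    cmd = [sys.executable, script, dirs_list, reference_bed, transcript_id]
    if output:
        cmd += ["--output", output]
    if skip_dups:
        cmd.append("--skip_dups")
    return cmd


# Demo with placeholder paths (nothing is executed here).
with tempfile.TemporaryDirectory() as tmp:
    dirs_file = write_dirs_list(
        ["/path/to/toga_out/species_A", "/path/to/toga_out/species_B"],
        Path(tmp) / "toga_dirs.txt",
    )
    command = build_command(dirs_file, "reference.bed", "TRANSCRIPT_ID",
                            output="alignment.fa", skip_dups=True)
    print(shlex.join(command))
```

To actually run the alignment, the resulting list could be passed to `subprocess.run(command, check=True)`, provided the script path, the reference bed file, and the aligner are available on the machine.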

usage: extract_codon_alignment.py [-h] [--output OUTPUT] [--use_raw_sequences] [--save_not_aligned SAVE_NOT_ALIGNED]
                                  [--allow_one2zero] [--skip_dups] [--align_entirely] [--seq_number_limit SEQ_NUMBER_LIMIT]
                                  [--temp_dir TEMP_DIR] [--macse_caller MACSE_CALLER] [--use_prank]
                                  [--prank_executable PRANK_EXECUTABLE] [--prank_tree PRANK_TREE] [--debug]
                                  [--reference_2bit REFERENCE_2BIT] [--intermediate_data INTERMEDIATE_DATA]
                                  [--min_percent_of_sp_with_one_orth MIN_PERCENT_OF_SP_WITH_ONE_ORTH]
                                  [--max_copies MAX_COPIES] [--exclude_UL] [--force_repair]
                                  [--save_aligner_commands SAVE_ALIGNER_COMMANDS]
                                  input_dirs reference_bed transcript_id

positional arguments:
  input_dirs            File containing a list of TOGA results directories. Directories, listed in this file, will be used to
                        produce the alignment.
  reference_bed         Bed12-file containing *reference* annotations
  transcript_id         ID of the aligned transcript (must be present in the reference bed file)

options:
  -h, --help            show this help message and exit
  --output OUTPUT, -o OUTPUT
                        Output file, default stdout
  --use_raw_sequences, --raw
                        (experimental feature) Use direct CESAR output instead of corrected sequence.
  --save_not_aligned SAVE_NOT_ALIGNED, --sna SAVE_NOT_ALIGNED
                        Path to fasta file to save input sequences used for alignment, default None. This feature works with
                        entire gene alignment only (not exon-by-exon)
  --allow_one2zero, -z  Process orthologous projections in case a species has no intact/PI/UL projections (class is one2zero)
  --skip_dups, -s       Skip not one-2-one orthologs
  --align_entirely, -a  Do not align exons separately; align the gene entirely
  --seq_number_limit SEQ_NUMBER_LIMIT
                        Exit if number of aligned sequences exceeds the threshold. For example, if you like to align sequence
                        of the gene X in 10 species, and each species has 10 copies, the total number of sequences to align
                        is 100.
  --temp_dir TEMP_DIR   Temp dir, default /dev/shm/username or /tmp/username
  --macse_caller MACSE_CALLER
                        Executable containing command to call macse2. Example of the command: java -jar /path/to/macse2.jar
                        (not just a path to macse2.jar!)
  --use_prank           Use prank instead of MACSE
  --prank_executable PRANK_EXECUTABLE, --prank PRANK_EXECUTABLE
                        Prank executable in case you like to align sequences with PRANK
  --prank_tree PRANK_TREE
                        Tree to be used with PRANK
  --debug, -d           Write debugging information
  --reference_2bit REFERENCE_2BIT
                        Reference 2bit file, by default is inferred from TOGA output
  --intermediate_data INTERMEDIATE_DATA
                        For debugging: directory name to save intermediate data
  --min_percent_of_sp_with_one_orth MIN_PERCENT_OF_SP_WITH_ONE_ORTH, --mpo MIN_PERCENT_OF_SP_WITH_ONE_ORTH
                        Minimal fraction of species with at least one ortholog that have exactly one ortholog (one2one or
                        one2many), default 0.0, max 1.0. For example, if you have 100 species, 80 of them have at least one
                        ortholog (not one2zero), and 40 of them have one2one, this value equals 0.5
  --max_copies MAX_COPIES
                        Maximal number of gene copies allowed per species, default 1. For example, if this arg equals 5, all
                        species that have >5 orthologs will be omitted.
  --exclude_UL          Do not consider UL projections as orthologous (NOT IMPLEMENTED YET)
  --force_repair, --fr  Force repair missing parts of the alignment. Please use in case the script continuously fails to
                        produce the result. Can be needed in case of massive alignments with abundant missing/corrupted
                        sequence.
  --save_aligner_commands SAVE_ALIGNER_COMMANDS
                        Save a sequence of MACSE commands to the specified location. Temporary files will not be deleted!
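The arithmetic behind `--seq_number_limit`, `--max_copies`, and `--min_percent_of_sp_with_one_orth` can be reproduced with a few lines of Python. This is only a sketch of the calculations described in the help text, not the script's actual implementation. The function names and the `copies_per_species` mapping (species name to number of orthologous copies) are hypothetical, and the order in which the script applies these filters is not stated in the help, so each quantity is computed independently here.

```python
def total_sequences(copies_per_species):
    """Number of sequences to align: the sum of copies over all species."""
    return sum(copies_per_species.values())


def species_within_max_copies(copies_per_species, max_copies=1):
    """Keep species with at most max_copies orthologs; others are omitted."""
    return {sp: n for sp, n in copies_per_species.items() if n <= max_copies}


def single_ortholog_fraction(copies_per_species):
    """Among species with at least one ortholog, the fraction with exactly one."""
    with_ortholog = [n for n in copies_per_species.values() if n >= 1]
    if not with_ortholog:
        return 0.0
    return sum(1 for n in with_ortholog if n == 1) / len(with_ortholog)


# Help-text example for --seq_number_limit: 10 species with 10 copies each.
ten_by_ten = {f"sp{i}": 10 for i in range(10)}
print(total_sequences(ten_by_ten))  # 100

# Help-text example for --mpo: 100 species, 80 with at least one ortholog,
# 40 of those with exactly one.
hundred = {f"sp{i}": 1 for i in range(40)}
hundred.update({f"sp{i}": 2 for i in range(40, 80)})
hundred.update({f"sp{i}": 0 for i in range(80, 100)})
print(single_ortholog_fraction(hundred))  # 0.5

# With --max_copies 5, a species carrying 6 copies would be omitted.
print(species_within_max_copies({"sp_a": 1, "sp_b": 6}, max_copies=5))
```

Estimating `total_sequences` before a run can help in choosing a sensible `--seq_number_limit` value for large species sets.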