Building multiple codon alignments
Bogdan Kirilenko edited this page Jul 4, 2023
To be corrected and structured.
Script to build multiple codon alignments from multiple TOGA outputs:
https://github.com/hillerlab/TOGA/blob/master/supply/extract_codon_alignment.py
Using this issue for inspiration: https://github.com/hillerlab/TOGA/issues/76
The script can be used to build a reliable codon alignment of a particular transcript using TOGA annotations from multiple species.
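As a rough sketch of how an invocation might be put together, the Python snippet below writes the `input_dirs` list file and assembles a command line from the three positional arguments named in the usage text. The directory names, BED path, transcript ID, and output file name are hypothetical placeholders. The one-directory-per-line layout of the list file and calling the script through `python3` are assumptions; the help text only says the file contains a list of TOGA results directories. The snippet prints the command and does not run it.

```python
import shlex
import tempfile
from pathlib import Path


def write_input_dirs(list_path, toga_dirs):
    """Write TOGA result directories to a list file, one per line (assumed layout)."""
    list_path = Path(list_path)
    list_path.write_text("".join(f"{d}\n" for d in toga_dirs))
    return list_path


def build_command(script, input_dirs, reference_bed, transcript_id,
                  output=None, extra_flags=()):
    """Assemble the argument list: options first, then the three positionals."""
    cmd = ["python3", str(script)]  # assumption: the script is run via python3
    if output is not None:
        cmd += ["--output", str(output)]
    cmd += list(extra_flags)
    cmd += [str(input_dirs), str(reference_bed), str(transcript_id)]
    return cmd


# Hypothetical placeholders, not real paths or IDs.
toga_dirs = ["/data/toga/species_A", "/data/toga/species_B"]
workdir = Path(tempfile.mkdtemp())
list_file = write_input_dirs(workdir / "toga_dirs.txt", toga_dirs)

cmd = build_command(
    script="supply/extract_codon_alignment.py",
    input_dirs=list_file,
    reference_bed="reference_annotation.bed",
    transcript_id="TRANSCRIPT_ID",
    output="codon_alignment.out",
    extra_flags=["--skip_dups"],  # flag taken from the usage text
)

# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so paths containing spaces survive if the list is later passed to `subprocess.run`.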
usage: extract_codon_alignment.py [-h] [--output OUTPUT] [--use_raw_sequences] [--save_not_aligned SAVE_NOT_ALIGNED]
                                  [--allow_one2zero] [--skip_dups] [--align_entirely] [--seq_number_limit SEQ_NUMBER_LIMIT]
                                  [--temp_dir TEMP_DIR] [--macse_caller MACSE_CALLER] [--use_prank]
                                  [--prank_executable PRANK_EXECUTABLE] [--prank_tree PRANK_TREE] [--debug]
                                  [--reference_2bit REFERENCE_2BIT] [--intermediate_data INTERMEDIATE_DATA]
                                  [--min_percent_of_sp_with_one_orth MIN_PERCENT_OF_SP_WITH_ONE_ORTH]
                                  [--max_copies MAX_COPIES] [--exclude_UL] [--force_repair]
                                  [--save_aligner_commands SAVE_ALIGNER_COMMANDS]
                                  input_dirs reference_bed transcript_id

positional arguments:
  input_dirs            File containing a list of TOGA results directories. Directories, listed in this file, will be used to
                        produce the alignment.
  reference_bed         Bed12-file containing *reference* annotations
  transcript_id         ID of the aligned transcript (must be present in the reference bed file)

options:
  -h, --help            show this help message and exit
  --output OUTPUT, -o OUTPUT
                        Output file, default stdout
  --use_raw_sequences, --raw
                        (experimental feature) Use direct CESAR output instead of corrected sequence.
  --save_not_aligned SAVE_NOT_ALIGNED, --sna SAVE_NOT_ALIGNED
                        Path to fasta file to save input sequences used for alignment, default None. This feature works with
                        entire gene alignment only (not exon-by-exon)
  --allow_one2zero, -z  Process orthologous projections in case a species has no intact/PI/UL projections (class is one2zero)
  --skip_dups, -s       Skip not one-2-one orthologs
  --align_entirely, -a  Do not align exons separately; align the gene entirely
  --seq_number_limit SEQ_NUMBER_LIMIT
                        Exit if number of aligned sequences exceeds the threshold. For example, if you like to align sequence
                        of the gene X in 10 species, and each species has 10 copies, the total number of sequences to align
                        is 100.
  --temp_dir TEMP_DIR   Temp dir, default /dev/shm/username or /tmp/username
  --macse_caller MACSE_CALLER
                        Executable containing command to call macse2. Example of the command: java -jar /path/to/macse2.jar
                        (not just a path to macse2.jar!)
  --use_prank           Use prank instead of MACSE
  --prank_executable PRANK_EXECUTABLE, --prank PRANK_EXECUTABLE
                        Prank executable in case you like to align sequences with PRANK
  --prank_tree PRANK_TREE
                        Tree to be used with PRANK
  --debug, -d           Write debugging information
  --reference_2bit REFERENCE_2BIT
                        Reference 2bit file, by default is inferred from TOGA output
  --intermediate_data INTERMEDIATE_DATA
                        For debugging: directory name to save intermediate data
  --min_percent_of_sp_with_one_orth MIN_PERCENT_OF_SP_WITH_ONE_ORTH, --mpo MIN_PERCENT_OF_SP_WITH_ONE_ORTH
                        Minimal fraction of species with at least one ortholog that have exactly one ortholog (one2one or
                        one2many), default 0.0, max 1.0. For example, if you have 100 species, 80 of them have at least one
                        ortholog, (not one2zero), and 40 of them have one2one, this value equals 0.5
  --max_copies MAX_COPIES
                        Maximal number of gene copies allowed per species, default 1. For example, if this arg equals 5, all
                        species that have >5 orthologs will be omitted.
  --exclude_UL          Do not consider UL projections as orthologous (NOT IMPLEMENTED YET)
  --force_repair, --fr  Force repair missing parts of the alignment. Please use in case the script continuously fails to
                        produce the result. Can be needed in case of massive alignments with abundant missing/corrupted
                        sequence.
  --save_aligner_commands SAVE_ALIGNER_COMMANDS
                        Save a sequence of MACSE commands to the specified location. Temporary files will not be deleted!
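
The arithmetic described for `--seq_number_limit`, `--max_copies`, and `--min_percent_of_sp_with_one_orth` can be illustrated with a small Python sketch. This is not the script's implementation. It only reproduces the worked examples from the help text, under the assumption that each species can be summarized by its number of orthologous copies and that "exactly one ortholog" means a copy count of 1. The function names and species names are hypothetical.

```python
def total_sequences(copies_per_species):
    """Total number of sequences to align (the quantity --seq_number_limit caps)."""
    return sum(copies_per_species.values())


def apply_max_copies(copies_per_species, max_copies):
    """Omit species with more than max_copies orthologs (the --max_copies rule)."""
    return {sp: n for sp, n in copies_per_species.items() if n <= max_copies}


def fraction_with_one_ortholog(copies_per_species):
    """Among species with at least one ortholog, the fraction with exactly one."""
    with_any = [n for n in copies_per_species.values() if n >= 1]
    if not with_any:
        return 0.0
    return sum(1 for n in with_any if n == 1) / len(with_any)


# --seq_number_limit example: 10 species, 10 copies each.
ten_by_ten = {f"species_{i}": 10 for i in range(10)}
print(total_sequences(ten_by_ten))  # 100

# --min_percent_of_sp_with_one_orth example: 100 species, 80 with at least
# one ortholog, 40 of those with exactly one.
hundred = {}
for i in range(100):
    if i < 20:
        hundred[f"species_{i}"] = 0  # no ortholog (one2zero)
    elif i < 60:
        hundred[f"species_{i}"] = 1  # exactly one ortholog
    else:
        hundred[f"species_{i}"] = 2  # more than one ortholog
print(fraction_with_one_ortholog(hundred))  # 0.5

# --max_copies example with a threshold of 5: the species with 6 copies is omitted.
copies = {"species_a": 1, "species_b": 5, "species_c": 6}
print(sorted(apply_max_copies(copies, 5)))  # ['species_a', 'species_b']
```

A run whose computed fraction falls below the `--mpo` value, or whose total exceeds `--seq_number_limit`, would presumably be rejected by the script; how it reports that is not covered here.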