Home
(serious constellations of reoccurring phylogenetically-independent origin)
Scorpio is a tool for classifying, haplotyping and defining Variants of Concern or Variants of Interest for a species. It was designed in the context of SARS-CoV-2, but is not species specific - all SARS-CoV-2 specific information can be installed via constellations.
It currently includes the following commands:
- classify - takes a set of lineage-defining constellations with rules and classifies sequences by them.
- haplotype - takes a set of constellations and writes haplotypes (either as strings or individual columns).
- list - prints the mrca_lineage and output_name of constellations as a single column to stdout.
- define - takes a CSV with a group column and a mutations column and extracts the common mutations within the group, optionally with reference to a specified outgroup.
Scorpio takes as input a reference-coordinate-based multiple sequence alignment (MSA) in FASTA format. For this reason it currently only supports typing SNPs and deletions (not insertions). This style of MSA has been commonly used during the SARS-CoV-2 pandemic because it can be generated by combining consensus-to-reference mappings instead of all-against-all mappings, and therefore scales much better to millions of sequences. This MSA can be generated from unaligned consensus sequences using the following command:
minimap2 -t <threads> -a --secondary=no -x asm20 --score-N=0 <reference_fasta> <sequence_fasta> \
| gofasta sam toMultiAlign -t <threads> --reference <reference_fasta> --pad -o alignment.fasta
Alternatively, a comparable alignment can potentially be generated using MAFFT with the --keeplength option ("Keep alignment length" in the web app).
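Because every record in a reference-coordinate-based alignment should have the same length as the reference, a quick length check can catch a malformed input before running any command. The following is a minimal sketch, not part of Scorpio; the function names `read_fasta` and `off_length_records` and the toy records are hypothetical.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def off_length_records(fasta_lines, reference_length):
    """Return names of records whose length differs from the reference length."""
    return [
        name
        for name, seq in read_fasta(fasta_lines)
        if len(seq) != reference_length
    ]


# Toy alignment (hypothetical data): s2 carries an extra column.
example = [">ref", "ACGTACGT", ">s1", "ACGTNNGT", ">s2", "ACG-ACGTA"]
print(off_length_records(example, reference_length=8))  # ['s2']
```

For a real file, the same function can be given an open file handle, for example `off_length_records(open("alignment.fasta"), reference_length)`.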
Classify counts up the number of reference, alternative, ambiguous and other alleles at each of the defining sites of each constellation, and summarizes whether each sequence can be classified as belonging to each constellation based on sets of rules.
If a sequence meets the criteria set in the rules for several constellations, the winning constellation is by default the one with the most rules met and the best support (#alt/#sites). The default output is a single summary file, with optional additional columns. Individual counts and True/False classifications for each constellation can also be output as individual CSV files.
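The counting and tie-breaking described above can be sketched roughly as follows. This is a simplified illustration, not Scorpio's implementation: it handles SNP sites only, the site tuples, the dictionary layout and the two rule names (`min_alt`, `max_ref`) are assumptions made for the example, and all IUPAC ambiguity codes are treated alike.

```python
AMBIGUOUS = set("NRYKMSWBDHV")  # IUPAC ambiguity codes (assumption: all treated alike)


def count_alleles(sequence, sites):
    """Count ref/alt/ambig/oth calls at 1-based (position, ref, alt) SNP sites."""
    counts = {"ref": 0, "alt": 0, "ambig": 0, "oth": 0}
    for position, ref, alt in sites:
        base = sequence[position - 1].upper()
        if base == ref:
            counts["ref"] += 1
        elif base == alt:
            counts["alt"] += 1
        elif base in AMBIGUOUS:
            counts["ambig"] += 1
        else:
            counts["oth"] += 1
    return counts


def classify(sequence, constellations):
    """Return the passing constellation with most rules met, then best support."""
    best = None
    for name, definition in constellations.items():
        counts = count_alleles(sequence, definition["sites"])
        rules = definition["rules"]
        # Hypothetical rules: a minimum alt count and a maximum ref count.
        if counts["alt"] < rules["min_alt"] or counts["ref"] > rules["max_ref"]:
            continue
        support = counts["alt"] / len(definition["sites"])  # #alt/#sites
        key = (len(rules), support)
        if best is None or key > best[0]:
            best = (key, name)
    return best[1] if best else None


# Toy data (hypothetical): both constellations pass, A-like has better support.
sequence = "ACGTACGTAC"
constellations = {
    "A-like": {
        "sites": [(2, "T", "C"), (4, "G", "T"), (6, "A", "C")],
        "rules": {"min_alt": 2, "max_ref": 1},
    },
    "B-like": {
        "sites": [(2, "T", "C"), (4, "G", "T"), (5, "A", "G"), (8, "C", "A")],
        "rules": {"min_alt": 2, "max_ref": 1},
    },
}
print(classify(sequence, constellations))  # A-like
```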
- Create individual count files for each of the Omicron and Delta constellations. Note that the -n flag specifies a list of names in the format specified by the label in the constellation JSON files.
scorpio classify -i alignment.fa --prefix scorpio_classify --output-counts -n "Delta (B.1.617.2-like)" "Omicron (B.1.1.529-like)" "Omicron (BA.1-like)" "Omicron (BA.2-like)" "Omicron (BA.3-like)" "Omicron (Unassigned)"
Haplotype creates barcode strings for each sample for each constellation. These strings are ordered by position in the definition files and can help to resolve why a sample is failing to be classified as a given constellation, for example because of amplicon dropout, potential recombination or contamination.
Options include combining constellations and creating a single barcode/set of haplotypes for the ordered list of defining sites of all constellations, splitting barcodes into a column per site, and outputting a file per constellation containing counts of ref, alt, ambig and other alleles.
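As a rough illustration of the barcode idea, the sketch below reads the observed character at each defining site, ordered by position, and returns it either as one string or as one column per site. It assumes the barcode is simply the observed character at each site; Scorpio's actual encoding may differ, and the function names and toy sequence are hypothetical.

```python
def barcode(sequence, positions):
    """Join the observed characters at 1-based positions, ordered by position."""
    return "".join(sequence[p - 1].upper() for p in sorted(positions))


def barcode_columns(sequence, positions):
    """Return the same calls split into one entry per site."""
    return {p: sequence[p - 1].upper() for p in sorted(positions)}


# Toy sequence (hypothetical): an N and a gap at two of the defining sites.
sequence = "ACGTNC-TAC"
sites = [8, 2, 5, 7]
print(barcode(sequence, sites))          # CN-T
print(barcode_columns(sequence, sites))  # {2: 'C', 5: 'N', 7: '-', 8: 'T'}
```

In a string like this, a run of N characters across neighbouring sites would be consistent with amplicon dropout, whereas a block of reference calls next to a block of alternative calls might point towards recombination or contamination.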
- Create a single summary file with a haplotype barcode for each of the Omicron and Delta constellations for each sample. Note that the -n flag specifies a list of names in the format specified by the label in the constellation JSON files.
scorpio haplotype -i alignment.fa --prefix scorpio_haplotypes -n "Delta (B.1.617.2-like)" "Omicron (B.1.1.529-like)" "Omicron (BA.1-like)" "Omicron (BA.2-like)" "Omicron (BA.3-like)"
List prints to stdout a single-column list of the mrca_lineage and output_name for each constellation. This can then be parsed for downstream analysis. For example, Pangolin uses it to get a list of the lineages that have constellations, in order to remove false positive lineage assignments. The output_name corresponds to the label in the constellation JSON unless another field is specified with --label.
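Since the output is a single column, downstream parsing can be very simple. The sketch below turns captured stdout text into a set of names and checks membership; the example text is made up for illustration and `parse_list_output` is a hypothetical helper, not part of Scorpio or Pangolin.

```python
def parse_list_output(text):
    """Collect the non-empty lines of single-column output into a set."""
    return {line.strip() for line in text.splitlines() if line.strip()}


# Made-up example of captured stdout, one name per line.
example_output = """B.1.617.2
Delta (B.1.617.2-like)
B.1.1.529
Omicron (B.1.1.529-like)
"""

names = parse_list_output(example_output)
print("B.1.617.2" in names)  # True
print("B.1.1.7" in names)    # False
```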
Define identifies the common mutations within a group of sequences. It assumes that the mutations for each sample have already been found and are provided as a pipe-separated list in a column called nucleotide_mutations. This is the format in which mutations are output by the COG-UK datapipe variant calling module (https://github.com/COG-UK/datapipe/blob/main/modules/align_and_variant_call.nf). If required, an outgroup can be specified; mutations which are common to this outgroup are placed in a separate ancestral site list, which is used by classify but not by haplotype, in order to retain sensitivity whilst removing noise from haplotype barcodes.
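The core of this step can be sketched as a set intersection. The example below is a simplification under stated assumptions: "common" is taken to mean present in every sample of the group (Scorpio may apply a threshold instead), the group column is assumed to be called `group`, and the function names and toy CSV are hypothetical.

```python
import csv
import io


def common_mutations(rows, group, column="nucleotide_mutations", group_column="group"):
    """Intersect the pipe-separated mutation lists of all samples in a group."""
    common = None
    for row in rows:
        if row[group_column] != group:
            continue
        mutations = {m for m in row[column].split("|") if m}
        common = mutations if common is None else common & mutations
    return common or set()


def define(rows, group, outgroup=None):
    """Split a group's common mutations into defining and ancestral lists."""
    defining = common_mutations(rows, group)
    ancestral = set()
    if outgroup is not None:
        # Mutations also common to the outgroup go to the ancestral list.
        ancestral = defining & common_mutations(rows, outgroup)
        defining = defining - ancestral
    return sorted(defining), sorted(ancestral)


# Toy CSV (hypothetical samples, groups and mutations).
EXAMPLE_CSV = """sample,group,nucleotide_mutations
s1,delta,C241T|G210T|A23403G
s2,delta,C241T|G210T|A23403G|C100T
s3,other,C241T|A23403G
s4,other,C241T|A23403G|T5G
"""

rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
print(define(rows, "delta"))                    # (['A23403G', 'C241T', 'G210T'], [])
print(define(rows, "delta", outgroup="other"))  # (['G210T'], ['A23403G', 'C241T'])
```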