nextstrain · ivan-aksamentov · Feb 15, 2022 · Feb 8, 2022 · Feb 8, 2022 · Feb 8, 2022
diff --git a/docs/user/algorithm/07-quality-control.md b/docs/user/algorithm/07-quality-control.md
@@ -10,13 +10,11 @@ Nextclade scans each query sequence for issues which may indicate problems occur
 
 For each query sequence each individual QC rule produces a quality score. These **individual QC scores** are empirically calibrated to fit the following thresholds:
 
-| Score         | Meaning                  | Color designation     |
-|---------------|--------------------------|-----------------------|
-| 0             | the best quality         | bright green          |
-| 0 to 29       | "good" quality           | green to yellow       |
-| 30            | warning threshold        | yellow                |
-| 30 to 99      | "mediocre" quality       | yellow to orange      |
-| 100 and above | "bad" quality            | red to bright red     |
+| Score         | Meaning            | Color designation |
+| ------------- | ------------------ | ----------------- |
+| 0 to 29       | "good" quality     | green             |
+| 30 to 99      | "mediocre" quality | yellow            |
+| 100 and above | "bad" quality      | red               |
 
 After all scores are calculated, the **final QC score** ``$` S `$`` is calculated as follows:
 
@@ -32,60 +30,85 @@ The final score has the same thresholds as the the individual scores.
 
 ## Individual QC Rules
 
-For SARS-CoV-2, we currently implement the following QC rules (in parentheses are the one-letter designations used in [Nextclade Web](../nextclade-web))
+For SARS-CoV-2, we currently implement the following QC rules (in parentheses are the one-letter designations used in [Nextclade Web](../nextclade-web)). For other viruses, such as influenza, the same QC rules are used, but with different parameters. The exact parameters can be found in the `qc.json` input file. Datasets provided by Nextclade can be inspected in the GitHub repo [nextstrain/nextclade_data](https://github.com/nextstrain/nextclade_data).
 
-#### Missing data (N)
+As an example, here's a [`qc.json` for a recent H3N2 dataset](https://github.com/nextstrain/nextclade_data/blob/master/data/datasets/flu_h3n2_ha/references/CY163680/versions/2022-01-18T12:00:00Z/files/qc.json). Here's the [`qc.json` for a recent SARS-CoV-2 dataset](https://github.com/nextstrain/nextclade_data/blob/master/data/datasets/sars-cov-2/references/MN908947/versions/2022-02-07T12:00:00Z/files/qc.json).
 
-If your sequence misses more than 3000 sites (`N` characters), it will be flagged as `bad` <!--- what happens if it's 2900 Ns? What's the function, linear from 0 to 100 from 0 to 3000 and capped thereafter? -->
+### Missing data (N)
 
-#### Mixed sites (M)
+If your sequence is missing more than 3000 sites (`N` characters), it will be flagged as `bad`. The first 300 missing sites are not penalized (`scoreBias`). After that, the score increases linearly from 0 to 100 as the number of `N`s goes from 300 to 3000 (`scoreBias + missingDataThreshold`).
 
-Ambiguous nucleotides (such as `R`, `Y`, etc) are often indicative of contamination (or superinfection) and more than 10 such non-ACGTN characters will result in a QC flag `bad`.
+### Mixed sites (M)
 
-#### Private mutations (P)
+Ambiguous nucleotides (such as `R`, `Y`, etc.) are often indicative of contamination (or superinfection), and more than 10 (`mixedSitesThreshold`) such non-ACGTN characters will result in a QC flag `bad`.
 
-Sequences with more than 24 mutations relative to the closest sequence in the reference tree are flagged as `bad`. We will revise this threshold as diversity of the SARS-CoV-2 population increases.
+### Private mutations (P)
 
-#### Mutation clusters (C)
+In order to assign clades, Nextclade places sequences on a reference tree that is representative of the global phylogeny (see figure below). The query sequence (dashed) is compared to all sequences (including internal nodes) of the reference tree to identify the nearest neighbor.
 
-If your sequence has clusters with 6 or more private differences within a 100-nucleotide window, it will be flagged as `bad`.
+As a by-product of this placement, Nextclade identifies the mutations, called "private mutations", that differ between the query sequence and the nearest neighbor sequence. In the figure, the yellow and dark green mutations are private mutations, as they occur in addition to the 3 mutations of the attachment node.
 
-#### Stop codons (S)
+![Identification of private mutations](//docs/user/assets/algo_private-muts.png)
+
+Many sequence quality problems are identifiable by the presence of private mutations. Sequences with unusually many private mutations are unlikely to be biological and are thus flagged as bad.
+
+Since web version 1.13.0 (CLI 1.10.0), Nextclade classifies private mutations further into 3 categories to be more sensitive to potential contamination, co-infection and recombination:
+
+1. Reversions: Private mutations that go back to the reference sequence, i.e. a mutation with respect to reference is present on the attachment node but not on the query sequence.
+2. Labeled mutations: Private mutations to a genotype that is known to be common in a clade.
+3. Unlabeled mutations: Private mutations that are neither reversions nor labeled.
+
+For an illustration of these 3 types, see the figure below.
+
+![Classification of private mutations](//docs/user/assets/algo_private-muts-classification.png)
+
+Reversions are common artefacts in some bioinformatic pipelines when there is amplicon dropout.
+They are also a sign of contamination, co-infection or recombination. Labeled mutations also commonly occur when there is contamination, co-infection or recombination.
+
+Reversions and labeled mutations are weighted several times higher than unlabeled mutations due to their higher sensitivity and specificity for quality problems (and recombination).
+In February 2022, every reversion was counted 6 times (`weightReversionSubstitutions`) while every labeled mutation was counted 4 times (`weightLabeledSubstitutions`). Unlabeled mutations get weight 1 (`weightUnlabeledSubstitutions`).
+
+From the weighted sum, 8 (`typical`) is subtracted. The score is then a linear interpolation between 0 and 100 (and above), where 100 corresponds to 24 (`cutoff`).
+
+Private deletion ranges (including reversion) are currently counted as a single unlabeled substitution, but this could change in the future.
+
+Which genotypes get "labeled" is determined in the dataset config file `virus_properties.json`, which can also be found in the [GitHub repo](https://github.com/nextstrain/nextclade_data/blob/master/data/datasets/sars-cov-2/references/MN908947/versions/2022-02-07T12:00:00Z/files/virus_properties.json).
+Currently, all mutations that appear in at least 30% of the sequences of a clade or in at least 100k sequences in a clade get that clade's label.
+
+### Mutation clusters (C)
+
+To be more sensitive to quality problems in a narrow area of a genome, the mutation cluster rule counts the number of private mutations within all possible 100-nucleotide windows (`windowSize`).
+If that number exceeds 6 (`clusterCutOff`), the window counts as a SNP cluster.
+The quality score is the number of clusters times 50 (`scoreWeight`), hence 1 cluster will cause the cluster rule to be mediocre.
+
+### Stop codons (S)
 
 Replicating viruses can not have premature stop codons in essential genes and such premature stops are hence an indicator of problematic sequences.
 
-However, some stop codons are known to be common even in functional viruses. Our stop codon rule excludes such known stop codons and assigns a QC score of 75 to each additional premature stop.
+However, some stop codons are known to be common even in functional viruses. Our stop codon rule excludes such known stop codons (`ignoredStopCodons`) and assigns a QC score of 75 to each remaining premature stop. Hence, if there are 2 unknown stop codons, the score is 150.
 
-<!--- How does this work with 2 stop codons? Score of 150? But I thought 100 is max. -->
+Beware that the position of known stop codons in `qc.json` is 0-indexed, in contrast to the 1-indexed positions that are used in Nextclade outputs.
 
-#### Frame shifts (F)
+### Frame shifts (F)
 
-Frame shifting insertions or deletions typically result in a garbled translation or a premature stop. Nextalign currently doesn't translate frame shifted coding sequences and each frame shift is assigned a QC score 75. Note, however, that clade 21H has a frame shift towards the end of ORF3a that results in a premature stop.
+Frame shifting insertions or deletions typically result in a garbled translation or a premature stop. Nextalign currently doesn't translate frame shifted coding sequences, and each frame shift is assigned a QC score of 75. Note, however, that clade 21H (Mu) has a frame shift towards the end of ORF3a that results in a premature stop. Known frame shifts (those listed in `ignoredFrameShifts` in `qc.json`) are not penalized.
 
 ## Interpretation
 
 Nextclade's QC warnings don't necessarily mean your sequences are problematic, but these issues warrant closer examination. You may explore the rest of the analysis results for the flagged sequences to make the decision.
 
 The numeric QC scores are useful for rough estimation of the quality of sequences. However, these values are empirical. They only hint on possible issues, the possible scale of these issues and call for further investigation. They do not have any other meaning or application.
 
+The [Nextstrain SARS-CoV-2 pipeline](https://github.com/nextstrain/ncov) uses similar (more lenient) QC criteria. For example, Nextstrain will exclude your sequence if it has fewer than 27000 valid bases (corresponding to roughly 3000 Ns) and doesn't check for ambiguous characters. Sequences flagged for excess divergence and SNP clusters by Nextclade are likely excluded by Nextstrain.
 
-<!-- TODO: this is a SARS-CoV-2-specific section -->
-
-The [Nextstrain SARS-CoV-2 pipeline](https://github.com/nextstrain/ncov) uses similar (more lenient) QC criteria. For example, Nextstrain will exclude your sequence if it has fewer than 27000 valid bases (corresponding to roughly 3000 Ns) and <!--- should this be `but` instead of `and`? --> doesn't check for ambiguous characters. Sequences flagged for excess divergence and SNP clusters by Nextclade are likely excluded <!--- should this be `included` or `excluded`? --> by Nextstrain.
-
-<!-- TODO: check factual correctness and spelling of the next sentence -->
-
-Note that there are many additional potential problems Nextclade does not check for. These include for example: primer sequences, adaptaters <!--- adaptors? --> , or chimeras <!--- reocmbinations? --> between divergent SARS-CoV-2 strains.
+Note that there are many additional potential problems Nextclade does not check for. These include, for example, primer sequences and adapters.
 
 ## Configuration
 
 QC checks can be enabled or disabled, and their parameters can be changed by providing a custom QC configuration file (typically `qc.json`) in the Advanced mode of [Nextclade Web](../nextclade-web) or in [Nextclade CLI](../nextclade-cli).
 
-<!-- TODO: describe the QC parameters -->
-
 ## Results
 
 QC results are presented in the "QC" column of the results table in [Nextclade Web](../nextclade-web). More information is included into mouseover tooltips.
 
 QC results are also included in the analysis results JSON, CSV and TSV files generated by [Nextclade CLI](../nextclade-cli) and in the "Download" dialog of [Nextclade Web](../nextclade-web).
-
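The scoring arithmetic described for the missing data, private mutations, mutation cluster and stop codon rules can be sketched in Python. This is a minimal sketch that follows the prose only, not Nextclade's actual implementation. The function names are hypothetical, the default `missing_data_threshold=2700` is inferred from the statement that `scoreBias + missingDataThreshold` equals 3000, and the private mutations formula assumes that an excess of `cutoff` over `typical` maps to a score of 100.

```python
"""Sketch of the individual QC score arithmetic (assumptions noted inline)."""


def missing_data_score(total_missing, score_bias=300, missing_data_threshold=2700):
    # The first `score_bias` Ns are free. After that the score grows linearly
    # and reaches 100 at score_bias + missing_data_threshold (300 + 2700 = 3000).
    # The 2700 default is inferred from the text, not read from a real qc.json.
    return max(0.0, (total_missing - score_bias) * 100.0 / missing_data_threshold)


def private_mutations_score(reversions, labeled, unlabeled,
                            weight_reversions=6, weight_labeled=4,
                            weight_unlabeled=1, typical=8, cutoff=24):
    # Weighted sum of the three categories of private mutations.
    weighted_total = (weight_reversions * reversions
                      + weight_labeled * labeled
                      + weight_unlabeled * unlabeled)
    # Assumed reading of the text: subtract `typical`, then scale so that an
    # excess of `cutoff` gives 100. The score is not capped at 100.
    return max(0.0, (weighted_total - typical) * 100.0 / cutoff)


def count_based_score(count, score_per_item):
    # Used for rules that add a fixed score per finding, for example
    # 50 per SNP cluster (`scoreWeight`) or 75 per unknown stop codon.
    return float(count * score_per_item)


def status(score):
    # Thresholds from the score table: 0 to 29 good, 30 to 99 mediocre, 100+ bad.
    if score < 30:
        return "good"
    if score < 100:
        return "mediocre"
    return "bad"
```

For instance, under these assumptions a sequence with 2 reversions, 1 labeled and 4 unlabeled private mutations has a weighted sum of 20, and `status(private_mutations_score(2, 1, 4))` evaluates to `"mediocre"`. Two unknown stop codons give `count_based_score(2, 75)`, which is above the `bad` threshold.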
diff --git a/docs/user/assets/algo_private-muts-classification.png b/docs/user/assets/algo_private-muts-classification.png
diff --git a/docs/user/assets/algo_private-muts.png b/docs/user/assets/algo_private-muts.png
diff --git a/docs/user/assets/web_advanced-ui.png b/docs/user/assets/web_advanced-ui.png
diff --git a/docs/user/assets/web_select-advanced.png b/docs/user/assets/web_select-advanced.png
diff --git a/docs/user/input-files.md b/docs/user/input-files.md
@@ -95,6 +95,8 @@ Accepted formats: JSON. Example configuration for SARS-CoV-2:
 }
 ```
 
+Note that the positions are 0-indexed and codon range ends are excluded. So `ORF3a:257-276` should be encoded as `{"begin": 256, "end": 276 }`.
+
 ## Gene map
 
 (or "genome annotations")
@@ -196,6 +198,8 @@ It is of the following schema (shortened for clarity):
 }
 ```
 
+Positions are 1-indexed.
+
 Nextclade Web (advanced mode): accepted in "Virus properties" drag & drop box.
 
 Nextclade CLI flag: `--input-virus-properties`
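The indexing note added to the QC configuration section (`ORF3a:257-276` becomes `{"begin": 256, "end": 276 }`) amounts to shifting only the start of the range. The helper below is a hypothetical sketch of that conversion. It assumes the human-readable form is `GENE:FIRST-LAST` with 1-indexed, inclusive codon numbers, and it returns only the `begin`/`end` pair shown in the docs, not a complete `qc.json` entry.

```python
def codon_range_to_config(spec):
    """Convert 'GENE:FIRST-LAST' (1-indexed, inclusive) to a 0-indexed,
    end-exclusive range as described for the QC configuration."""
    gene, _, span = spec.partition(":")
    first, last = (int(part) for part in span.split("-"))
    if first < 1 or last < first:
        raise ValueError(f"invalid codon range: {spec!r}")
    # Start shifts down by one. The end stays the same, because an inclusive
    # 1-indexed end equals an exclusive 0-indexed end.
    return gene, {"begin": first - 1, "end": last}


gene, codon_range = codon_range_to_config("ORF3a:257-276")
```

The number of codons covered is `end - begin` in the 0-indexed form, which is 20 for this example.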
diff --git a/docs/user/nextclade-web.md b/docs/user/nextclade-web.md
@@ -141,6 +141,18 @@ Once Nextclade has finished its analysis, you can download the results in a vari
 - `nextclade.errors.csv`: A file containing all errors and warnings that occurred during the analysis, like genes that failed to be translated
 - `nextclade.zip`: A `zip` file containing all files mentioned above
 
+### Advanced mode
+
+You can use a custom dataset in Nextclade Web through the advanced mode. The same input files as for Nextclade CLI can be specified; see [input files](input-files) for more details. Click on `Customize dataset files` to open advanced mode:
+
+![Advanced mode](//docs/user/assets/web_select-advanced.png)
+
+The selected dataset, for example `SARS-CoV-2`, provides the default for every input file; any file you supply overrides the corresponding default.
+
+You can provide files by drag and drop, by providing a web link, or by pasting their content into a text box:
+
+![Advanced mode UI](//docs/user/assets/web_advanced-ui.png)
+
 ### URL parameters
 
 Input files can be specified in the URL parameters. The name of the parameters match the corresponding `--input*`
@@ -152,18 +164,18 @@ If an `input-fasta` URL parameter is provided, Nextclade Web automatically start
 
 All parameters are optional.
 
-| URL parameter     | Meaning                                                                                             |
-| ----------------- | --------------------------------------------------------------------------------------------------- |
-| input-fasta       | URL to a fasta file containing query sequences. If provided, the analysis will start automatically. |
-| input-root-seq    | URL to a fasta file containing reference (root) sequence.                                           |
-| input-tree        | URL to a Auspice JSON v2 file containing reference tree.                                            |
-| input-pcr-primers | URL to a CSV file containing PCR primers.                                                           |
-| input-qc-config   | URL to a JSON file containing QC onfiguration.                                                      |
-| input-gene-map    | URL to a GFF3 file containing gene map.                                                             |
-| dataset-name      | Safe name of the dataset to use. Examples: `sars-cov-2`, `flu_h3n2_ha`                              |
-| dataset-reference | Accession of the reference sequence of the dataset to use: Examples: `MN908947`, `CY034116`.        |
-| dataset-tag       | Version tag of the dataset to use.                                                                  |
-<!--- TODO @ivan-aksamentov: Add row for virusPropertiesJson parameter -->
+| URL parameter          | Meaning                                                                                             |
+| ---------------------- | --------------------------------------------------------------------------------------------------- |
+| input-fasta            | URL to a fasta file containing query sequences. If provided, the analysis will start automatically. |
+| input-root-seq         | URL to a fasta file containing reference (root) sequence.                                           |
+| input-tree             | URL to an Auspice JSON v2 file containing reference tree.                                           |
+| input-pcr-primers      | URL to a CSV file containing PCR primers.                                                           |
+| input-qc-config        | URL to a JSON file containing QC configuration.                                                     |
+| input-gene-map         | URL to a GFF3 file containing gene map.                                                             |
+| input-virus-properties | URL to a JSON file containing labeled genotypes (`virus_properties.json`)                           |
+| dataset-name           | Safe name of the dataset to use. Examples: `sars-cov-2`, `flu_h3n2_ha`                              |
+| dataset-reference      | Accession of the reference sequence of the dataset to use: Examples: `MN908947`, `CY034116`.        |
+| dataset-tag            | Version tag of the dataset to use.                                                                  |
 
 For example, the file with input sequences hosted at `https://example.com/sequences.fasta` can be specified with:
 

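A query string using the parameters from the table above could be assembled programmatically as in the following sketch. It is not taken from the Nextclade docs: the function name is hypothetical, and the base address `https://nextclade.example/` is a placeholder for wherever your Nextclade Web instance is served. Percent-encoding of the parameter values is done by Python's standard `urllib.parse.urlencode`.

```python
from urllib.parse import urlencode

# Parameter names copied from the URL parameter table.
KNOWN_PARAMS = {
    "input-fasta", "input-root-seq", "input-tree", "input-pcr-primers",
    "input-qc-config", "input-gene-map", "input-virus-properties",
    "dataset-name", "dataset-reference", "dataset-tag",
}


def build_nextclade_web_url(base_url, params):
    """Append URL parameters to base_url, rejecting names not in the table."""
    unknown = set(params) - KNOWN_PARAMS
    if unknown:
        raise ValueError(f"unknown URL parameters: {sorted(unknown)}")
    # urlencode percent-encodes the values, so nested URLs are passed safely.
    return f"{base_url}?{urlencode(params)}"


url = build_nextclade_web_url(
    "https://nextclade.example/",  # placeholder host, an assumption
    {
        "input-fasta": "https://example.com/sequences.fasta",
        "dataset-name": "sars-cov-2",
    },
)
```

Rejecting unknown names up front catches typos such as `input-qc` early, since all parameters are optional and a misspelled one would otherwise have no visible effect.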
diff --git a/docs/user/output-files.md b/docs/user/output-files.md
@@ -94,10 +94,14 @@ Every row in tabular output corresponds to 1 input sequence. The meaning of colu
 
 ### JSON results
 
-Nextclade CLI flag: `--output-json`
+Nextclade CLI flag: `--output-json`, filename `nextclade.json`
 
 JSON results file is best for in-depth automated processing of results. It contains everything tabular files contain, plus more, in a more machine-friendly format.
 
+> ⚠️ Beware that JSON results use 0-indexed nucleotide and codon positions, whereas CSV and TSV files use 1-indexed positions. The reason is that JSON corresponds more closely to the internal representation, and 0-indexing is the default in most programming languages. For example, substitution `{refNuc: "C", pos: 2146, queryNuc: "T"}` in JSON results corresponds to substitution `C2147T` in CSV and TSV files.
+>
+> Ranges are inclusive for the start and exclusive for the end. Hence, `missing: {begin: 704, end: 726}` in JSON results corresponds to `missing: 705-726` in CSV/TSV results.
+
 ## Output phylogenetic tree
 
 Nextclade CLI flags: `--output-tree`
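The relationship between JSON and tabular positions described in the JSON results warning can be sketched with two small helpers. The function names are hypothetical. The dictionary keys (`refNuc`, `pos`, `queryNuc`, `begin`, `end`) are the ones quoted in the warning, and the output strings follow its two examples. How Nextclade itself formats other cases, such as single-position ranges, is not covered here.

```python
def substitution_to_tabular(sub):
    """Format a 0-indexed JSON substitution as a 1-indexed string like 'C2147T'."""
    return f'{sub["refNuc"]}{sub["pos"] + 1}{sub["queryNuc"]}'


def range_to_tabular(rng):
    """Format a 0-indexed, end-exclusive range as a 1-indexed, inclusive 'FIRST-LAST'."""
    # Only the start shifts: an exclusive 0-indexed end already equals the
    # inclusive 1-indexed end.
    return f'{rng["begin"] + 1}-{rng["end"]}'


substitution = substitution_to_tabular({"refNuc": "C", "pos": 2146, "queryNuc": "T"})
missing = range_to_tabular({"begin": 704, "end": 726})
```

With the inputs from the warning, `substitution` matches `C2147T` and `missing` matches `705-726`.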