Updated the documentation

sanger-tol · Jun 4, 2024 · 24e2ca2
1 parent dc0a3c6
commit 24e2ca2
Showing 1 changed file with 52 additions and 8 deletions.
diff --git a/docs/output.md b/docs/output.md
@@ -15,6 +15,9 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
 - [BlobDir](#blobdir) - Output files viewable on a [BlobToolKit viewer](https://github.com/blobtoolkit/blobtoolkit)
 - [Static plots](#static-plots) - Static versions of the BlobToolKit plots
 - [BUSCO](#busco) - BUSCO results
+- [Read alignments](#read-alignments) - Aligned reads (optional)
+- [Read coverage](#read-coverage) - Read coverage tracks
+- [Base content](#base-content) - _k_-mer statistics (for k &le; 4)
 - [MultiQC](#multiqc) - Aggregate report describing results from the whole pipeline
 - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
 
@@ -26,8 +29,8 @@ The files in the BlobDir dataset which is used to create the online interactive
 <summary>Output files</summary>
 
 - `blobtoolkit/`
-  - `<accession>/`
-    - `*.json.gz`: files generated from genome and alignment coverage statistics
+  - `<assembly-name>/`
+    - `*.json.gz`: files generated from genome and alignment coverage statistics.
 
 More information about visualising the data in the [BlobToolKit repository](https://github.com/blobtoolkit/blobtoolkit/tree/main/src/viewer)
 
@@ -53,12 +56,53 @@ BUSCO results generated by the pipeline (all BUSCO lineages that match the claas
 <details markdown="1">
 <summary>Output files</summary>
 
-- `blobtoolkit/`
-  - `busco/`
-    - `*.batch_summary.txt`: BUSCO scores as tab-separated files (1 file per lineage).
-    - `*.fasta.txt`: BUSCO scores as formatted text (1 file per lineage).
-    - `*.json`: BUSCO scores as JSON (1 file per lineage).
-    - `*/`: all output BUSCO files, including the coordinate and sequence files of the annotated genes.
+- `busco/`
+  - `<lineage-name>/`
+    - `short_summary.json`: BUSCO scores for that lineage as JSON.
+    - `short_summary.tsv`: BUSCO scores for that lineage as a tab-separated file.
+    - `short_summary.txt`: BUSCO scores for that lineage as formatted text.
+    - `full_table.tsv`: Coordinates of the annotated BUSCO genes as a tab-separated file.
+    - `missing_busco_list.tsv`: List of the BUSCO genes that could not be found.
+    - `*_busco_sequences.tar.gz`: Sequences of the annotated BUSCO genes. 1 _tar_ archive for each of the three annotation levels (`single_copy`, `multi_copy`, `fragmented`), with 1 file per gene.
+    - `hmmer_output.tar.gz`: Archive of the HMMER alignment scores.
+
+</details>
+
+### Read alignments
+
+If the pipeline is run with `--align true`, it aligns the input reads to the assembly with minimap2.
+Otherwise no BAM files are generated.
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `read_mapping/`
+  - `<datatype>/`
+    - `<sample>.bam`: Alignments of that sample's reads in BAM format.
+
+</details>
+
+### Read coverage
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `read_mapping/`
+  - `<datatype>/`
+    - `<sample>.coverage.1k.bed.gz`: bedGraph file with the coverage of that sample's alignments in 1 kbp windows.
+
+</details>
+
+### Base content
+
+_k_-mer statistics (for k &le; 4), computed in 1 kbp windows.
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `base_content/`
+  - `<assembly-name>_*nuc_windows.tsv.gz`: Tab-separated files with the counts of every _k_-mer for k &le; 4 in 1 kbp windows. The first three columns are the window coordinates (sequence name, start, end), followed by one column per _k_-mer.
+  - `<assembly-name>_freq_windows.tsv.gz`: Tab-separated files with frequencies derived from the _k_-mer counts.
 
 </details>
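
The coverage tracks documented above can be read with the Python standard library alone. The following is a minimal sketch, not part of the pipeline: it assumes the file follows the usual four-column bedGraph layout (sequence name, start, end, value) and that the value is the coverage of the window. The function name `mean_coverage_per_sequence` and the file name in the usage note are hypothetical.

```python
import gzip
from collections import defaultdict


def mean_coverage_per_sequence(path):
    """Return {sequence name: length-weighted mean coverage} for a gzipped bedGraph file.

    Assumes four tab-separated columns: sequence name, start, end, value.
    """
    weighted_sum = defaultdict(float)  # sum of (coverage * window length) per sequence
    total_length = defaultdict(int)    # sum of window lengths per sequence

    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Skip blank lines and the optional bedGraph header lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            name, start, end, value = line.rstrip("\n").split("\t")[:4]
            span = int(end) - int(start)
            weighted_sum[name] += float(value) * span
            total_length[name] += span

    return {
        name: weighted_sum[name] / total_length[name]
        for name in weighted_sum
        if total_length[name] > 0
    }
```

For example, `mean_coverage_per_sequence("read_mapping/<datatype>/<sample>.coverage.1k.bed.gz")` would return one mean value per sequence. Weighting by window length matters because the last window of each sequence is usually shorter than 1 kbp.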