Updated the output documentation
muffato committed May 24, 2024
1 parent eb9f6bf commit 97516bd
Showing 1 changed file with 15 additions and 160 deletions.
175 changes: 15 additions & 160 deletions docs/output.md
@@ -13,8 +13,7 @@ The directories comply with Tree of Life's canonical directory structure.

The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

- [Gene annotation files](#gene-annotation-files) - Assembly files, either straight from the NCBI FTP, or indices built on them
- [Repeat annotation files](#repeat-annotation-files) - Files corresponding to analyses run (by the NCBI) on the original assembly, e.g. repeat masking
- [Repeat annotation files](#repeat-annotation-files) - Files corresponding to repeat annotation produced by Ensembl
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution

All data files are compressed (and indexed) with `bgzip`.
@@ -23,170 +22,26 @@ All Fasta files are indexed with `samtools faidx`, which allows accessing any re

All BED files are indexed with tabix in both TBI and CSI modes, unless the sequences are too large.
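
Since `bgzip` output is also valid gzip, the compressed files can be read sequentially with standard tools, without using the indices. A minimal sketch in Python, assuming a BED file with at least three tab-separated columns; the function name `read_bed_intervals` is hypothetical and not part of the pipeline:

```python
import gzip


def read_bed_intervals(path):
    """Yield (sequence name, start, end) from a bgzip- or gzip-compressed BED file."""
    # A bgzip (BGZF) file is a series of gzip members, which the gzip module
    # reads transparently as one stream.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Skip blank lines and the optional header lines allowed in BED.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            fields = line.rstrip("\n").split("\t")
            # BED coordinates are 0-based and half-open.
            yield fields[0], int(fields[1]), int(fields[2])
```

This reads the whole file from the start; random access to a region would instead go through the TBI or CSI index with a tool such as `tabix`.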

### Gene annotation files

Here are the files you can expect in the `gene/` sub-directory.

```text
/lustre/scratch124/tol/projects/darwin/data/insects/Noctua_fimbriata/
└── analysis
    └── ilNocFimb1.1
        └── gene
            └── braker2
                ├── GCA_905163415.1.braker2.2022_03.cdna.fa.gz
                ├── GCA_905163415.1.braker2.2022_03.cdna.fa.gz.dict
                ├── GCA_905163415.1.braker2.2022_03.cdna.fa.gz.fai
                ├── GCA_905163415.1.braker2.2022_03.cdna.fa.gz.gzi
                ├── GCA_905163415.1.braker2.2022_03.cdna.seq_length.tsv
                ├── GCA_905163415.1.braker2.2022_03.cds.fa.gz
                ├── GCA_905163415.1.braker2.2022_03.cds.fa.gz.dict
                ├── GCA_905163415.1.braker2.2022_03.cds.fa.gz.fai
                ├── GCA_905163415.1.braker2.2022_03.cds.fa.gz.gzi
                ├── GCA_905163415.1.braker2.2022_03.cds.seq_length.tsv
                ├── GCA_905163415.1.braker2.2022_03.gff3.gz
                ├── GCA_905163415.1.braker2.2022_03.gff3.gz.csi
                ├── GCA_905163415.1.braker2.2022_03.gff3.gz.gzi
                ├── GCA_905163415.1.braker2.2022_03.pep.fa.gz
                ├── GCA_905163415.1.braker2.2022_03.pep.fa.gz.dict
                ├── GCA_905163415.1.braker2.2022_03.pep.fa.gz.fai
                ├── GCA_905163415.1.braker2.2022_03.pep.fa.gz.gzi
                └── GCA_905163415.1.braker2.2022_03.pep.seq_length.tsv
```

The directory structure includes the assembly name, e.g. `fParRan2.2`, and all files are named after the assembly accession, e.g. `GCA_900634625.2`.
The file name (and the directory name) includes the annotation method and date. Current methods are:

- `braker2` for [BRAKER2](https://academic.oup.com/nargab/article/3/1/lqaa108/6066535)
- `ensembl` for Ensembl's own annotation pipeline

The `.seq_length.tsv` files are tabular files analogous to the common `chrom.sizes` format. They contain the sequence names and their lengths.
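
A minimal sketch of loading such a file in Python. It assumes the `chrom.sizes` layout (two tab-separated columns, name then length, no header line), which the text above implies but does not spell out; `read_seq_lengths` is a hypothetical helper name:

```python
import csv


def read_seq_lengths(path):
    """Return a dict mapping sequence name to length from a .seq_length.tsv file."""
    lengths = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Assumption: column 1 is the sequence name, column 2 its length.
            if len(row) < 2:
                continue
            lengths[row[0]] = int(row[1])
    return lengths
```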

_The following documentation is copied from Ensembl's FTP_

#### Fasta files

Ensembl provide gene sequences in FASTA format in three files. The 'cdna' file contains
transcript sequences for all types of gene (including, for example,
pseudogenes and RNA genes). The 'cds' file contains the DNA sequences
of the coding regions of protein-coding genes. The 'pep' file contains
the amino acid sequences of protein-coding genes.

The headers in the 'cdna' FASTA files have the format:

```text
><transcript_stable_id> <seq_type> <assembly_name>:<seq_name>:<start>:<end>:<strand> gene:<gene_stable_id> gene_biotype:<gene_biotype> transcript_biotype:<transcript_biotype> [gene_symbol:<gene_symbol>] [description:<description>]
```

Example 'cdna' header:

```text
>ENSZVIT00000000002.1 cdna UG_Zviv_1:LG1:3600:22235:-1 gene:ENSZVIG00000000002.1 gene_biotype:protein_coding transcript_biotype:protein_coding
```

The headers in the 'cds' FASTA files have the format:

```text
><transcript_stable_id> <seq_type> <assembly_name>:<seq_name>:<coding_start>:<coding_end>:<strand> gene:<gene_stable_id> gene_biotype:<gene_biotype> transcript_biotype:<transcript_biotype> [gene_symbol:<gene_symbol>] [description:<description>]
```

Example 'cds' header:

```text
>ENSZVIT00000000002.1 cds UG_Zviv_1:LG1:5289:19862:-1 gene:ENSZVIG00000000002.1 gene_biotype:protein_coding transcript_biotype:protein_coding
```

The headers in the 'pep' FASTA files have the format:

```text
><protein_stable_id> <seq_type> <assembly_name>:<seq_name>:<coding_start>:<coding_end>:<strand> gene:<gene_stable_id> transcript:<transcript_stable_id> gene_biotype:<gene_biotype> transcript_biotype:<transcript_biotype> [gene_symbol:<gene_symbol>] [description:<description>]
```

Example 'pep' header:

```text
>ENSZVIP00000000002.1 pep UG_Zviv_1:LG1:5289:19862:-1 gene:ENSZVIG00000000002.1 transcript:ENSZVIT00000000002.1 gene_biotype:protein_coding transcript_biotype:protein_coding
```

Stable IDs for genes, transcripts, and proteins include a version
suffix. Gene symbols and descriptions are not available for all genes.
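
The header formats above can be split into fields with a few string operations. The following Python function is an illustrative sketch, not code from Ensembl or the pipeline; the name `parse_ensembl_header` and the keys of the returned dictionary are assumptions made for this example:

```python
def parse_ensembl_header(header):
    """Split an Ensembl-style FASTA header line into a dictionary of fields."""
    # The first three space-separated fields are positional; the rest are key:value pairs.
    fields = header.strip().lstrip(">").split(" ", 3)
    stable_id, seq_type, location = fields[:3]
    rest = fields[3] if len(fields) > 3 else ""

    # Location is <assembly_name>:<seq_name>:<start>:<end>:<strand>.
    # Splitting from both ends tolerates a colon inside the sequence name.
    assembly, remainder = location.split(":", 1)
    seq_name, start, end, strand = remainder.rsplit(":", 3)

    record = {
        "stable_id": stable_id,
        "seq_type": seq_type,
        "assembly": assembly,
        "seq_name": seq_name,
        "start": int(start),
        "end": int(end),
        "strand": int(strand),
    }

    # The optional description is free text and comes last, so peel it off first.
    pairs, found, description = (" " + rest).partition(" description:")
    for token in pairs.split():
        key, _, value = token.partition(":")
        record[key] = value
    if found:
        record["description"] = description
    return record
```

Applied to the example 'cdna' header above, `record["gene"]` is `ENSZVIG00000000002.1` and `record["strand"]` is `-1`.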

#### GFF3 file

A GFF3 ([specification](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md)) file is also provided.
GFF3 files are validated using [GenomeTools](http://genometools.org).

The 'type' of gene features is:

- `gene` for protein-coding genes
- `ncRNA_gene` for RNA genes
- `pseudogene` for pseudogenes

The 'type' of transcript features is:

- `mRNA` for protein-coding transcripts
- a specific type of RNA transcript such as `snoRNA` or `lnc_RNA`
- `pseudogenic_transcript` for pseudogenes

All transcripts are linked to `exon` features.
Protein-coding transcripts are linked to `CDS`, `five_prime_UTR`, and
`three_prime_UTR` features.

Attributes for feature types:
(italics indicate data which is not available for all features)

- region types:
  - `ID`: Unique identifier, format `<region_type>:<region_name>`
  - _`Alias`_: A comma-separated list of aliases, usually including the `INSDC` accession
  - _`Is_circular`_: Flag to indicate circular regions
- gene types:
  - `ID`: Unique identifier, format `gene:<gene_stable_id>`
  - `biotype`: Ensembl biotype, e.g. `protein_coding`, `pseudogene`
  - `gene_id`: Ensembl gene stable ID
  - `version`: Ensembl gene version
  - _`Name`_: Gene name
  - _`description`_: Gene description
- transcript types:
  - `ID`: Unique identifier, format `transcript:<transcript_stable_id>`
  - `Parent`: Gene identifier, format `gene:<gene_stable_id>`
  - `biotype`: Ensembl biotype, e.g. `protein_coding`, `pseudogene`
  - `transcript_id`: Ensembl transcript stable ID
  - `version`: Ensembl transcript version
  - _`Note`_: If the transcript sequence has been edited (i.e. differs from the genomic sequence), the edits are described in a note.
- exon
  - `Parent`: Transcript identifier, format `transcript:<transcript_stable_id>`
  - `exon_id`: Ensembl exon stable ID
  - `version`: Ensembl exon version
  - `constitutive`: Flag to indicate if the exon is present in all transcripts
  - `rank`: Integer that shows the 5'->3' ordering of exons
- CDS
  - `ID`: Unique identifier, format `CDS:<protein_stable_id>`
  - `Parent`: Transcript identifier, format `transcript:<transcript_stable_id>`
  - `protein_id`: Ensembl protein stable ID
  - `version`: Ensembl protein version
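
These attributes live in the ninth column of each GFF3 line. A minimal Python sketch for parsing that column follows, based on the general GFF3 rules (`tag=value` pairs separated by `;`, multiple values separated by `,`, reserved characters percent-encoded). The function name `parse_gff3_attributes` is hypothetical, and the sketch has not been checked against the pipeline's actual output:

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Parse the ninth GFF3 column into a dict of attribute name -> list of values."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # Multiple values are comma-separated; each is percent-decoded after splitting.
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attributes
```

Every attribute is returned as a list, so single-valued attributes such as `ID` are read as `attributes["ID"][0]`.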

### Repeat annotation files

Here are the files you can expect in the `repeats/` sub-directory.
Here are the files you can expect in the results directory.

```text
analysis
└── gfLaeSulp1.1
    └── repeats
        └── ncbi
            ├── GCA_927399515.1.masked.ncbi.bed.gz
            ├── GCA_927399515.1.masked.ncbi.bed.gz.gzi
            ├── GCA_927399515.1.masked.ncbi.bed.gz.tbi
            ├── GCA_927399515.1.masked.ncbi.fasta.dict
            ├── GCA_927399515.1.masked.ncbi.fasta.gz
            ├── GCA_927399515.1.masked.ncbi.fasta.gz.fai
            └── GCA_927399515.1.masked.ncbi.fasta.gz.gzi
└── repeats
    └── ensembl
        ├── GCA_907164925.1.masked.ensembl.bed.gz
        ├── GCA_907164925.1.masked.ensembl.bed.gz.csi
        ├── GCA_907164925.1.masked.ensembl.bed.gz.gzi
        ├── GCA_907164925.1.masked.ensembl.bed.gz.tbi
        ├── GCA_907164925.1.masked.ensembl.fa.dict
        ├── GCA_907164925.1.masked.ensembl.fa.gz
        ├── GCA_907164925.1.masked.ensembl.fa.gz.fai
        ├── GCA_907164925.1.masked.ensembl.fa.gz.gzi
        └── GCA_907164925.1.masked.ensembl.fa.gz.sizes
```

They all correspond to the repeat-masking analysis run by Ensembl themselves. As with the `assembly/` sub-directory,
the directory structure includes the assembly name, e.g. `gfLaeSulp1.1`, and all files are named after the assembly accession, e.g. `GCA_927399515.1`.
They all correspond to the repeat-masking analysis run by Ensembl themselves.
All files are named after the assembly accession, e.g. `GCA_907164925.1`.

- `GCA_*.masked.ncbi.fasta.gz`: Masked assembly in Fasta format
- `GCA_*.masked.ncbi.bed.gz`: BED file with the coordinates of the regions masked by the Ensembl pipeline