Commit 801c21a — Prettier
DLBPointon committed Sep 22, 2023
Showing 1 changed file (docs/usage.md) with 58 additions and 44 deletions.
The TreeVal pipeline has a few requirements before being able to run:
:warning: Please ensure you read the following sections on Directory Structure (gene_alignment_data, synteny, scripts), HiC data prep and PacBio data prep. Without these you may not be able to successfully run the TreeVal pipeline. If anything is unclear, please open an issue report.

### Directory Structure

<details>
<summary>
<font>Details</font>
<pre><code>

treeval-resources
├─ gene_alignment_data/
│  └─ { classT }
│     ├─ csv_data
│     │  └─ { Name.Accession }-data.csv   # Generated by our scripts
│     └─ { Name }                         # Here and below is generated by our scripts
│        └─ { Name.Accession }
│           ├─ cdna
│           │  └─ { Chunked fasta files }
│           ├─ rna
│           │  └─ { Chunked fasta files }
│           ├─ cds
│           │  └─ { Chunked fasta files }
│           └─ pep
│              └─ { Chunked fasta files }
├─ gene_alignment_prep/
│  ├─ scripts/              # We supply these in this repo
│  ├─ raw_fasta/            # Stores your fasta files downloaded from NCBI or Ensembl
│  └─ treeval-datasets.tsv  # Organism, common_name, clade, family, group, link_to_data, notes
├─ synteny/
│  └─ { classT }
├─ treeval_yaml/            # Storage folder for your yaml files; it's useful to keep them
└─ treeval_stats/           # Storage for your treeval output stats files, e.g. for upload to our repo

</code></pre>
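The skeleton above can be created in one go; a minimal sketch, assuming your `classT` is `bird` (substitute your own clade):

```shell
# Create the treeval-resources skeleton shown above (classT = bird here).
# The { Name }/{ Name.Accession } folders are generated later by the prep
# scripts, so they are not created by hand.
mkdir -p treeval-resources/gene_alignment_data/bird/csv_data
mkdir -p treeval-resources/gene_alignment_prep/scripts
mkdir -p treeval-resources/gene_alignment_prep/raw_fasta
mkdir -p treeval-resources/synteny/bird
mkdir -p treeval-resources/treeval_yaml
mkdir -p treeval-resources/treeval_stats
```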
<p>
`classT` can be your own system of classification, as long as it is consistent. At Sanger we use the scheme below, and we advise you to do the same. This value, which is entered into the yaml file (the file we will use to tell TreeVal where everything is), is used to locate both gene_alignment_data and syntenic genomes.
</p>

![ClassT](../docs/images/Sanger-classT.png)

</details>
</br>

### Synteny

<details>
<summary>
<font>Details</font>
</br>

### Gene Alignment and Synteny Data and Directories

<details>
<summary>
<font>Details</font>
<pre><code>
mkdir -p gene_alignment_prep/scripts/

cp treeval/bin/treeval-dataprep/* gene_alignment_prep/scripts/

mkdir -p gene_alignment_prep/raw_fasta/

mkdir -p gene_alignment_data/bird/csv_data/

mkdir -p synteny/bird/
</code></pre>

<p>
The naming of the `bird` folder here is important; keep this in mind.
</p>
<pre><code>
cd synteny/bird/
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/957/565/GCA_003957565.4_bTaeGut1.4.pri/GCA_003957565.4_bTaeGut1.4.pri_genomic.fna.gz -o bTaeGut1_4.fasta.gz

gunzip bTaeGut1_4.fasta.gz
</code></pre>

<p>
This leaves us with a file called `bTaeGut1_4.fasta`, the genomic assembly of `bTaeGut1_4` (the <a href="https://id.tol.sanger.ac.uk/">Tree of Life ID</a> for this species), also known as <i>Taeniopygia guttata</i>, the Australian Zebra Finch.
</p>
<pre><code>
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bG
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_protein.faa.gz -o GallusGallus-GRCg7b.pep.fasta.gz

curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_rna.fna.gz -o GallusGallus-GRCg7b.rna.fasta.gz
</code></pre>

<p>
Now that everything is downloaded, we need to prep it. At this point it is all still gzipped (the `.gz` extension denotes that the file is compressed), and in this format we can't use it. So let's use some bash magic.
</p>
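The decompression step itself is a plain loop; a minimal sketch, assuming the downloaded `*.fasta.gz` files sit in the current directory:

```shell
# Decompress every gzipped fasta in the current directory
for f in ./*.fasta.gz; do
  [ -e "$f" ] || continue  # skip cleanly if the glob matched nothing
  gunzip "$f"
done
```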
<pre><code>
Your using at least Version 3.6, You are good to go...
os imported
argparse imported
regex imported
WORKING ON: cds--GallusGallus-GRCg7b
Records per file: 1000
Entryfunction called
GallusGallus-GRCg7b.cds.fasta
File found at %s GallusGallus-GRCg7b.cds.fasta
Expand All @@ -215,7 +222,8 @@ File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus6005cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus7006cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus8007cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus9008cds.MOD.fa
</code></pre>

<p>
This is essentially telling us: yes, you have given me a file, and for every 1000 header/sequence pairs I have come across (I'm ignoring the number you gave me because this isn't a `pep` or `cdna` file) I have written a new file at the path shown. You'll notice that it has also generated a new set of folders; this is based on how we named the file.
</p>
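A much-simplified sketch of that chunking behaviour (the real logic lives in `GA_data_prep.py`; the chunk-file naming here is illustrative only):

```shell
# Toy FASTA with 5 records
printf '>r1\nAC\n>r2\nGT\n>r3\nAA\n>r4\nCC\n>r5\nTT\n' > toy.fasta

# Split into files of at most n records each (the script uses 1000;
# n=2 here so the toy input produces several chunks)
awk -v n=2 -v prefix=chunk '
  /^>/ { if (count % n == 0) out = prefix (count / n) ".MOD.fa"; count++ }
  { print > out }
' toy.fasta
# → chunk0.MOD.fa (r1, r2), chunk1.MOD.fa (r3, r4), chunk2.MOD.fa (r5)
```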
<pre><code>
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus28453cds.MOD.fa
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus18005cds.MOD.fa
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus6001cds.MOD.fa
</code></pre>

<p>
This is all used by the pipeline, which generates job IDs based on the org column, groups files by the org and type columns, and then pulls data from the data_file column.
</p>
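You can preview that grouping on the command line; a sketch using a toy CSV with the same org,type,data_file layout as above (the filename is hypothetical):

```shell
# Toy data CSV: three chunk files across two (org, type) groups
cat > toy-data.csv <<'EOF'
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus12000cds.MOD.fa
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus28453cds.MOD.fa
Gallus_gallus.GRCg6a,pep,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/pep/Gallus_gallus1000pep.MOD.fa
EOF

# Count data files per (org, type) pair — the grouping the pipeline performs
cut -d, -f1,2 toy-data.csv | sort | uniq -c
# → counts of 2 for the cds group and 1 for the pep group
```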
<pre><code>
alignment:
data_dir: /FULL/PATH/TO/treeval-resources/gene_alignment_data/

synteny_genome_path: /FULL/PATH/TO/treeval-resources/synteny
</code></pre>

<p>
I said earlier that the fact we called a folder `bird` was important; this is because it now becomes our `classT`:
</p>
</br>

### HiC data Preparation

<details>
<summary>
<font>Details</font>
</br>

### PacBio Data Preparation

<details>
<summary><font>Details</font></summary>
<p>
</br>

### Pretext Accessory File Ingestion

<details>
<summary><font>Details</font></summary>
<p>
<pre><code>
cd {outdir}/hic_files

bigWigToBedGraph {coverage.bigWig} /dev/stdout | PretextGraph -i { your.pretext } -n "coverage"

bigWigToBedGraph {repeat_density.bigWig} /dev/stdout | PretextGraph -i { your.pretext } -n "repeat_density"

cat {telomere.bedgraph} | awk -v OFS="\t" '{$4 = 1000; print}'|PretextGraph -i { your.pretext } -n "telomere"

cat {gap.bedgraph} | awk -v OFS="\t" '{$4= 1000; print}'| PretextGraph -i { your.pretext } -n "gap"
</code></pre>
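The `awk` step in the telomere and gap commands simply overwrites column 4 (the bedgraph value) with a constant of 1000, presumably so these features appear as uniform-height marks in the Pretext map. For example:

```shell
# Force the bedgraph value column to 1000; OFS="\t" keeps the output tab-separated
printf 'chr1\t0\t100\t7\n' | awk -v OFS="\t" '{$4 = 1000; print}'
# → chr1	0	100	1000
```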
</details>
</br>

The following is an example YAML file we have used during production: [nxOscDF50
- `busco`
- `lineages_path`: path to folder above lineages folder
- `lineage`: Example is nematode_odb10
</br>

<details>
<summary><font size="+1">Notes on using BUSCO</font></summary>
The typical command for running the pipeline is as follows:

```console
nextflow run sanger-tol/treeval --input assets/treeval.yaml --outdir <OUTDIR> -profile singularity,sanger
```

With the `treeval.yaml` containing the information from the YAML_CONTENTS section above.
