Commit 801c21a — Prettier
DLBPointon committed Sep 22, 2023
Showing 1 changed file (docs/usage.md) with 58 additions and 44 deletions.
The TreeVal pipeline has a few requirements before being able to run:
:warning: Please ensure you read the following sections on Directory Structure (gene_alignment_data, synteny, scripts), HiC data prep and PacBio data prep. Without these you may not be able to successfully run the TreeVal pipeline. If anything is unclear, please open an issue report.

### Directory Structure

<details>
<summary>
<font>Details</font>
<pre><code>

treeval-resources
├─ gene_alignment_data/
│  └─ { classT }
│     ├─ csv_data
│     │  └─ { Name.Accession }-data.csv   # Generated by our scripts
│     └─ { Name }                         # Here and below is generated by our scripts
│        └─ { Name.Accession }
│           ├─ cdna
│           │  └─ { Chunked fasta files }
│           ├─ rna
│           │  └─ { Chunked fasta files }
│           ├─ cds
│           │  └─ { Chunked fasta files }
│           └─ pep
│              └─ { Chunked fasta files }
├─ gene_alignment_prep/
│  ├─ scripts/              # We supply these in this repo
│  ├─ raw_fasta/            # Stores your fasta files downloaded from NCBI or Ensembl
│  └─ treeval-datasets.tsv  # Organism, common_name, clade, family, group, link_to_data, notes
├─ synteny/
│  └─ { classT }
├─ treeval_yaml/            # Storage folder for your yaml files; it's useful to keep them
└─ treeval_stats/           # Storage for your treeval output stats files, e.g. for upload to our repo

</code></pre>
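The skeleton above can be created in one go; a minimal sketch, assuming your `classT` is `bird` (substitute your own clade):

```shell
# Create the treeval-resources skeleton shown above (classT = bird here).
# The { Name }/{ Name.Accession } folders are generated later by the prep
# scripts, so they are not created by hand.
mkdir -p treeval-resources/gene_alignment_data/bird/csv_data
mkdir -p treeval-resources/gene_alignment_prep/scripts
mkdir -p treeval-resources/gene_alignment_prep/raw_fasta
mkdir -p treeval-resources/synteny/bird
mkdir -p treeval-resources/treeval_yaml
mkdir -p treeval-resources/treeval_stats
```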
<p>
`classT` can be your own system of classification, as long as it is consistent. At Sanger we use the scheme below, and we advise you to do the same. This value, which is entered into the yaml file (the file we will use to tell TreeVal where everything is), is used to locate both gene_alignment_data and syntenic genomes.
</p>

![ClassT](../docs/images/Sanger-classT.png)

</details>
</br>

### Synteny

<details>
<summary>
<font>Details</font>
</br>

### Gene Alignment and Synteny Data and Directories

<details>
<summary>
<font>Details</font>
<pre><code>
mkdir -p gene_alignment_prep/scripts/

cp treeval/bin/treeval-dataprep/* gene_alignment_prep/scripts/

mkdir -p gene_alignment_prep/raw_fasta/

mkdir -p gene_alignment_data/bird/csv_data/

mkdir -p synteny/bird/
</code></pre>

<p>
The naming of the `bird` folder here is important; keep this in mind.
</p>
<pre><code>
cd synteny/bird/
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/957/565/GCA_003957565.4_bTaeGut1.4.pri/GCA_003957565.4_bTaeGut1.4.pri_genomic.fna.gz -o bTaeGut1_4.fasta.gz

gunzip bTaeGut1_4.fasta.gz
</code></pre>

<p>
This leaves us with a file called `bTaeGut1_4.fasta`, the genomic assembly of `bTaeGut1_4` (the <a href="https://id.tol.sanger.ac.uk/">Tree of Life ID</a> for this species), also known as <i>Taeniopygia guttata</i>, the Australian Zebra Finch.
</p>
<pre><code>
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bG
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_protein.faa.gz -o GallusGallus-GRCg7b.pep.fasta.gz

curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_rna.fna.gz -o GallusGallus-GRCg7b.rna.fasta.gz
</code></pre>

<p>
Now that everything is downloaded, we need to prep it. At this point it is all still gzipped (the `.gz` extension denotes that the file is compressed), and in this format we can't use it. So let's use some bash magic.
</p>
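The decompression step itself is a plain loop; a minimal sketch, assuming the downloaded `*.fasta.gz` files sit in the current directory:

```shell
# Decompress every gzipped fasta in the current directory
for f in ./*.fasta.gz; do
  [ -e "$f" ] || continue  # skip cleanly if the glob matched nothing
  gunzip "$f"
done
```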
<pre><code>
Your using at least Version 3.6, You are good to go...
os imported
argparse imported
regex imported
WORKING ON: cds--GallusGallus-GRCg7b
Records per file: 1000
Entryfunction called
GallusGallus-GRCg7b.cds.fasta
File found at %s GallusGallus-GRCg7b.cds.fasta
Expand All @@ -215,7 +222,8 @@ File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus6005cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus7006cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus8007cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus9008cds.MOD.fa
</code></pre>

<p>
This is essentially telling us: yes, you have given me a file, and for every 1000 header/sequence pairs I have come across (I'm ignoring the number you gave me because this isn't a `pep` or `cdna` file) I have written a new file at the path shown. You'll notice that it has also generated a new set of folders; this is based on how we named the file.
</p>
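A much-simplified sketch of that chunking behaviour (the real logic lives in `GA_data_prep.py`; the chunk-file naming here is illustrative only):

```shell
# Toy FASTA with 5 records
printf '>r1\nAC\n>r2\nGT\n>r3\nAA\n>r4\nCC\n>r5\nTT\n' > toy.fasta

# Split into files of at most n records each (the script uses 1000;
# n=2 here so the toy input produces several chunks)
awk -v n=2 -v prefix=chunk '
  /^>/ { if (count % n == 0) out = prefix (count / n) ".MOD.fa"; count++ }
  { print > out }
' toy.fasta
# → chunk0.MOD.fa (r1, r2), chunk1.MOD.fa (r3, r4), chunk2.MOD.fa (r5)
```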
<pre><code>
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus28453cds.MOD.fa
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus18005cds.MOD.fa
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus6001cds.MOD.fa
</code></pre>

<p>
This is all used by the pipeline, which generates job IDs based on the org column, groups files by the org and type columns, and then pulls data from the data_file column.
</p>
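You can preview that grouping on the command line; a sketch using a toy CSV with the same org,type,data_file layout as above (the filename is hypothetical):

```shell
# Toy data CSV: three chunk files across two (org, type) groups
cat > toy-data.csv <<'EOF'
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus12000cds.MOD.fa
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus28453cds.MOD.fa
Gallus_gallus.GRCg6a,pep,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/pep/Gallus_gallus1000pep.MOD.fa
EOF

# Count data files per (org, type) pair — the grouping the pipeline performs
cut -d, -f1,2 toy-data.csv | sort | uniq -c
# → counts of 2 for the cds group and 1 for the pep group
```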
<pre><code>
alignment:
data_dir: /FULL/PATH/TO/treeval-resources/gene_alignment_data/

synteny_genome_path: /FULL/PATH/TO/treeval-resources/synteny
</code></pre>

<p>
I said earlier that the fact we called a folder `bird` was important; this is because it now becomes our `classT`:
</p>
</br>

### HiC data Preparation

<details>
<summary>
<font>Details</font>
</br>

### PacBio Data Preparation

<details>
<summary><font>Details</font></summary>
<p>
</br>

### Pretext Accessory File Ingestion

<details>
<summary><font>Details</font></summary>
<p>
<pre><code>
cd {outdir}/hic_files

bigWigToBedGraph {coverage.bigWig} /dev/stdout | PretextGraph -i { your.pretext } -n "coverage"

bigWigToBedGraph {repeat_density.bigWig} /dev/stdout | PretextGraph -i { your.pretext } -n "repeat_density"

cat {telomere.bedgraph} | awk -v OFS="\t" '{$4 = 1000; print}'|PretextGraph -i { your.pretext } -n "telomere"

cat {gap.bedgraph} | awk -v OFS="\t" '{$4= 1000; print}'| PretextGraph -i { your.pretext } -n "gap"
</code></pre>
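The `awk` step in the telomere and gap commands simply overwrites column 4 (the bedgraph value) with a constant of 1000, presumably so these features appear as uniform-height marks in the Pretext map. For example:

```shell
# Force the bedgraph value column to 1000; OFS="\t" keeps the output tab-separated
printf 'chr1\t0\t100\t7\n' | awk -v OFS="\t" '{$4 = 1000; print}'
# → chr1	0	100	1000
```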
</details>
</br>

The following is an example YAML file we have used during production: [nxOscDF50
- `busco`
- `lineages_path`: path to folder above lineages folder
- `lineage`: Example is nematode_odb10
</br>

<details>
<summary><font size="+1">Notes on using BUSCO</font></summary>
The typical command for running the pipeline is as follows:

```console
nextflow run sanger-tol/treeval --input assets/treeval.yaml --outdir <OUTDIR> -profile singularity,sanger
```

With the `treeval.yaml` containing the information from the YAML_CONTENTS section above.
