From 801c21a08952223ff73e6c6ce5f981c277ca524a Mon Sep 17 00:00:00 2001
From: DLBPointon
Date: Fri, 22 Sep 2023 11:45:09 +0100
Subject: [PATCH] Prettier

---
 docs/usage.md | 102 ++++++++++++++++++++++++++++----------------------
 1 file changed, 58 insertions(+), 44 deletions(-)

diff --git a/docs/usage.md b/docs/usage.md
index e196a525..1b0ac05d 100755
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -19,6 +19,7 @@ The TreeVal pipeline has a few requirements before being able to run:

 :warning: Please ensure you read the following sections on Directory Structure (gene_alignment_data, synteny, scripts), HiC data prep and Pacbio data prep. Without these you may not be able to successfully run the TreeVal pipeline. If nothing is clear then leave an issue report.

 ### Directory Structure
+
Details
@@ -35,44 +36,46 @@ The TreeVal pipeline has a few requirements before being able to run:

 
 treeval-resources
-  │
-  ├─ gene_alignment_data/
-  │   └─ { classT }
-  │         ├─ csv_data
-  │         │       └─ { Name.Assession }-data.csv # Generated by our scripts
-  │         └─ { Name } # Here and below is generated by our scripts
-  │                 └─ { Name.Assession }
-  │                             ├─ cdna
-  │                             │   └─ { Chunked fasta files }
-  │                             ├─ rna
-  │                             │   └─ { Chunked fasta files }
-  │                             ├─ cds
-  │                             │   └─ { Chunked fasta files }
-  │                             └─ pep
-  │                                 └─ { Chunked fasta files }
-  │
-  ├─ gene_alignment_prep/
-  │   ├─ scripts/             # We supply these in this repo
-  │   ├─ raw_fasta/           # Storing your fasta downloaded from NCBI or Ensembl
-  │   └─ treeval-datasets.tsv # Organism, common_name, clade, family, group, link_to_data, notes
-  │
-  ├─ synteny/
-  │   └─ {classT}
-  │
-  ├─ treeval_yaml/ # Storage folder for you yaml files, it's useful to keep them
-  │
-  └─ treeval_stats/ # Storage for you treeval output stats file whether for upload to our repo
+│
+├─ gene_alignment_data/
+│ └─ { classT }
+│ ├─ csv_data
+│ │ └─ { Name.Assession }-data.csv # Generated by our scripts
+│ └─ { Name } # Here and below is generated by our scripts
+│ └─ { Name.Assession }
+│ ├─ cdna
+│ │ └─ { Chunked fasta files }
+│ ├─ rna
+│ │ └─ { Chunked fasta files }
+│ ├─ cds
+│ │ └─ { Chunked fasta files }
+│ └─ pep
+│ └─ { Chunked fasta files }
+│
+├─ gene_alignment_prep/
+│ ├─ scripts/ # We supply these in this repo
+│ ├─ raw_fasta/ # Storing your fasta downloaded from NCBI or Ensembl
+│ └─ treeval-datasets.tsv # Organism, common_name, clade, family, group, link_to_data, notes
+│
+├─ synteny/
+│ └─ {classT}
+│
+├─ treeval_yaml/ # Storage folder for you yaml files, it's useful to keep them
+│
+└─ treeval_stats/ # Storage for you treeval output stats file whether for upload to our repo
+
 

`classT` can be your own system of classification, as long as it is consistent. At Sanger we use the system below, and we advise you to do the same. Again, this value, which is entered into the YAML (the file we use to tell TreeVal where everything is), is used to find `gene_alignment_data` as well as syntenic genomes.

- ![ClassT](../docs/images/Sanger-classT.png)
+![ClassT](../docs/images/Sanger-classT.png)

### Synteny
+
Details
@@ -87,6 +90,7 @@ treeval-resources
### Gene Alignment and Synteny Data and Directories
+
Details
@@ -102,14 +106,15 @@ treeval-resources

 mkdir -p gene_alignment_prep/scripts/
 
-cp treeval/bin/treeval-dataprep/* gene_alignment_prep/scripts/
+cp treeval/bin/treeval-dataprep/\* gene_alignment_prep/scripts/
 
 mkdir -p gene_alignment_prep/raw_fasta/
 
 mkdir -p gene_alignment_data/bird/csv_data/
 
 mkdir -p synteny/bird/
-  
+
+

The naming of the bird folder here is important, keep this in mind.

@@ -141,7 +146,8 @@ cd synteny/bird/
 curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/957/565/GCA_003957565.4_bTaeGut1.4.pri/GCA_003957565.4_bTaeGut1.4.pri_genomic.fna.gz -o bTaeGut1_4.fasta.gz
 gunzip bTaeGut1_4.fasta.gz
-
+
+

This leaves us with a file called `bTaeGut1_4.fasta`, the genomic assembly of `bTaeGut1_4` (the Tree of Life ID for this species), also known as Taeniopygia guttata, the Australian zebra finch.

@@ -158,7 +164,8 @@ curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bG
 curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_protein.faa.gz -o GallusGallus-GRCg7b.pep.fasta.gz
 curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_rna.fna.gz -o GallusGallus-GRCg7b.rna.fasta.gz
-
+
+

Now that everything is downloaded, we need to prep it. At this point it is all still gzipped (the `.gz` on the end denotes that the file is compressed), and we can't use it in this format. So let's use some bash magic.

@@ -199,8 +206,8 @@ Your using at least Version 3.6, You are good to go...
 os imported
 argparse imported
 regex imported
-WORKING ON: cds--GallusGallus-GRCg7b
-Records per file: 1000
+WORKING ON: cds--GallusGallus-GRCg7b
+Records per file: 1000
 Entryfunction called
 GallusGallus-GRCg7b.cds.fasta
 File found at %s GallusGallus-GRCg7b.cds.fasta
@@ -215,7 +222,8 @@ File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus6005cds.MOD.fa
 File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus7006cds.MOD.fa
 File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus8007cds.MOD.fa
 File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus9008cds.MOD.fa
-
+
+

This is pretty much telling us: yes, you have given me a file, and for every 1000 header-sequence pairs I have come across (I'm ignoring the number you gave me because this isn't a `pep` or `cdna` file) I have made a new file, saved at the path shown. You'll notice that it has also generated a new set of folders. These are based on how we named the file.
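
The chunking behaviour described above can be illustrated with a small stand-alone sketch. This is not the script shipped with TreeVal; it is a plain `awk` approximation, and the directory name `fasta_chunk_demo`, the file names and the chunk size of 2 are hypothetical values chosen to keep the example small.

```shell
# Sketch only: split a FASTA file into chunks of n header-sequence pairs.
# All names here are hypothetical and not taken from the TreeVal scripts.
mkdir -p fasta_chunk_demo

# Build a tiny example FASTA with 5 records.
for i in 1 2 3 4 5; do
  printf '>seq%s\nACGT\n' "$i"
done > fasta_chunk_demo/example.cds.fasta

# Open a new output file each time n headers have been written.
awk -v n=2 -v prefix="fasta_chunk_demo/example_chunk" '
  /^>/ {
    if (count % n == 0) {
      if (out != "") close(out)
      idx++
      out = prefix idx ".fa"
    }
    count++
  }
  { print > out }
' fasta_chunk_demo/example.cds.fasta

# Expect three files: two holding 2 records each and one holding the last record.
ls fasta_chunk_demo/example_chunk*.fa
```

With 1000 in place of 2, the same logic would start a new file after every 1000 records, which matches the `Records per file: 1000` line in the output above.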

@@ -265,7 +273,8 @@ Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.G
 Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus28453cds.MOD.fa
 Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus18005cds.MOD.fa
 Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus6001cds.MOD.fa
-
+
+

This is all useful for the pipeline, which generates job IDs based on the `org` column, groups files by the `org` and `type` columns, and then pulls data from the file listed in the `data_file` column.
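
To make the grouping concrete, here is a minimal sketch that counts chunk files per `org` and `type` pair in a CSV of this shape. It assumes the file starts with an `org,type,data_file` header row, and the directory `csv_group_demo`, the organism name and the paths are hypothetical; it only illustrates the grouping idea and is not how the pipeline itself reads the file.

```shell
# Sketch only: count data files per (org, type) pair in a TreeVal-style CSV.
# The header row, organism name and paths below are assumptions for illustration.
mkdir -p csv_group_demo
cat > csv_group_demo/Example.Assembly-data.csv <<'EOF'
org,type,data_file
Example.Assembly,cds,/path/to/cds/Example1000cds.MOD.fa
Example.Assembly,cds,/path/to/cds/Example2001cds.MOD.fa
Example.Assembly,pep,/path/to/pep/Example1000pep.MOD.fa
EOF

# Skip the header, key each row on "org,type", and print one count per key.
awk -F',' 'NR > 1 { n[$1 "," $2]++ } END { for (k in n) print k "," n[k] }' \
  csv_group_demo/Example.Assembly-data.csv | sort > csv_group_demo/groups.txt

# Prints:
# Example.Assembly,cds,2
# Example.Assembly,pep,1
cat csv_group_demo/groups.txt
```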

@@ -284,7 +293,8 @@ alignment:
 data_dir: /FULL/PATH/TO/treeval-resources/gene_alignment_data/
 synteny_genome_path: /FULL/PATH/TO/treeval-resources/synteny
-
+
+

I said earlier that the fact we called a folder `bird` was important. This is because it now becomes our `classT`:

@@ -313,6 +323,7 @@ alignment:
### HiC data Preparation
+
Details
@@ -334,6 +345,7 @@ samtools index {prefix}.cram
### PacBio Data Preparation
+
Details

@@ -393,6 +405,7 @@ samtools index {prefix}.cram
### Pretext Accessory File Ingestion
+

Details

@@ -405,14 +418,15 @@ samtools index {prefix}.cram


   cd {outdir}/hic_files
 
-  bigWigToBedGraph {coverage.bigWig} /dev/stdout | PretextGraph -i { your.pretext } -n "coverage"
+bigWigToBedGraph {coverage.bigWig} /dev/stdout | PretextGraph -i { your.pretext } -n "coverage"
 
-  bigWigToBedGraph {repeat_density.bigWig} /dev/stdout | PretextGraph -i { your.pretext } -n "repeat_density"
+bigWigToBedGraph {repeat_density.bigWig} /dev/stdout | PretextGraph -i { your.pretext } -n "repeat_density"
 
-  cat {telomere.bedgraph} | awk -v OFS="\t" '{$4 = 1000; print}'|PretextGraph -i { your.pretext } -n "telomere"
+cat {telomere.bedgraph} | awk -v OFS="\t" '{$4 = 1000; print}'|PretextGraph -i { your.pretext } -n "telomere"
+
+cat {gap.bedgraph} | awk -v OFS="\t" '{$4= 1000; print}'| PretextGraph -i { your.pretext } -n "gap"
+
- cat {gap.bedgraph} | awk -v OFS="\t" '{$4= 1000; print}'| PretextGraph -i { your.pretext } -n "gap"
-
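
The `awk` step in the telomere and gap commands above overwrites the fourth column of every bedGraph line with 1000 before the data reaches PretextGraph. A minimal sketch of that step in isolation, using a made-up input line (the scaffold name, coordinates and original value are hypothetical):

```shell
# Sketch only: show what the awk step does to a single bedGraph line.
# The input line is invented for illustration.
result=$(printf 'scaffold_1\t100\t200\t3\n' | awk -v OFS="\t" '{$4 = 1000; print}')

# Prints the same line with the fourth column replaced by 1000,
# with fields still separated by tabs.
printf '%s\n' "$result"
```

Assigning to `$4` makes `awk` rebuild the line using `OFS`, which is why `-v OFS="\t"` is needed to keep the output tab-separated.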

@@ -451,7 +465,7 @@ The following is an example YAML file we have used during production: [nxOscDF50
 - `busco`
   - `lineages_path`: path to folder above lineages folder
   - `lineage`: Example is nematode_odb10
-
+
Notes on using BUSCO
@@ -526,7 +540,7 @@ The typical command for running the pipeline is as follows:
 ```console
 nextflow run sanger-tol/treeval --input assets/treeval.yaml --outdir -profile singularity, sanger
-````
+```

 With the `treeval.yaml` containing the information from the above YAML_CONTENTS section