treeval-resources
- │
- ├─ gene_alignment_data/
- │ └─ { classT }
- │ ├─ csv_data
- │ │ └─ { Name.Assession }-data.csv # Generated by our scripts
- │ └─ { Name } # Here and below is generated by our scripts
- │ └─ { Name.Assession }
- │ ├─ cdna
- │ │ └─ { Chunked fasta files }
- │ ├─ rna
- │ │ └─ { Chunked fasta files }
- │ ├─ cds
- │ │ └─ { Chunked fasta files }
- │ └─ pep
- │ └─ { Chunked fasta files }
- │
- ├─ gene_alignment_prep/
- │ ├─ scripts/ # We supply these in this repo
- │ ├─ raw_fasta/ # Storing your fasta downloaded from NCBI or Ensembl
- │ └─ treeval-datasets.tsv # Organism, common_name, clade, family, group, link_to_data, notes
- │
- ├─ synteny/
- │ └─ {classT}
- │
- ├─ treeval_yaml/ # Storage folder for you yaml files, it's useful to keep them
- │
- └─ treeval_stats/ # Storage for you treeval output stats file whether for upload to our repo
+│
+├─ gene_alignment_data/
+│ └─ { classT }
+│ ├─ csv_data
+│ │ └─ { Name.Assession }-data.csv # Generated by our scripts
+│ └─ { Name } # Here and below is generated by our scripts
+│ └─ { Name.Assession }
+│ ├─ cdna
+│ │ └─ { Chunked fasta files }
+│ ├─ rna
+│ │ └─ { Chunked fasta files }
+│ ├─ cds
+│ │ └─ { Chunked fasta files }
+│ └─ pep
+│ └─ { Chunked fasta files }
+│
+├─ gene_alignment_prep/
+│ ├─ scripts/ # We supply these in this repo
+│ ├─ raw_fasta/ # Storing your fasta downloaded from NCBI or Ensembl
+│ └─ treeval-datasets.tsv # Organism, common_name, clade, family, group, link_to_data, notes
+│
+├─ synteny/
+│ └─ {classT}
+│
+├─ treeval_yaml/ # Storage folder for your yaml files; it's useful to keep them
+│
+└─ treeval_stats/ # Storage for your treeval output stats file, whether or not it is for upload to our repo
+
`classT` can be your own system of classification, as long as it is consistent. At Sanger we use the scheme below, and we advise you to do the same. Again, this value is entered into the yaml (the file we will use to tell TreeVal where everything is) and is used to find gene_alignment_data as well as syntenic genomes.
- ![ClassT](../docs/images/Sanger-classT.png)
+![ClassT](../docs/images/Sanger-classT.png)
mkdir -p gene_alignment_prep/scripts/
-cp treeval/bin/treeval-dataprep/* gene_alignment_prep/scripts/
+cp treeval/bin/treeval-dataprep/\* gene_alignment_prep/scripts/
mkdir -p gene_alignment_prep/raw_fasta/
mkdir -p gene_alignment_data/bird/csv_data/
mkdir -p synteny/bird/
-
+
+
The naming of the `bird` folder here is important; keep this in mind.
@@ -141,7 +146,8 @@ cd synteny/bird/
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/957/565/GCA_003957565.4_bTaeGut1.4.pri/GCA_003957565.4_bTaeGut1.4.pri_genomic.fna.gz -o bTaeGut1_4.fasta.gz
gunzip bTaeGut1_4.fasta.gz
-
+
+
This leaves us with a file called `bTaeGut1_4.fasta`, the genomic assembly of `bTaeGut1_4` (the Tree of Life ID for this species), also known as Taeniopygia guttata, the Australian zebra finch.
@@ -158,7 +164,8 @@ curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bG
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_protein.faa.gz -o GallusGallus-GRCg7b.pep.fasta.gz
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_rna.fna.gz -o GallusGallus-GRCg7b.rna.fasta.gz
-
+
+
Now that's all downloaded, we need to prep it. At this point it is all still gzipped (the `.gz` on the end denotes that the file is compressed), and in this format we can't use it. So let's use some bash magic.
@@ -199,8 +206,8 @@ Your using at least Version 3.6, You are good to go...
os imported
argparse imported
regex imported
-WORKING ON: cds--GallusGallus-GRCg7b
-Records per file: 1000
+WORKING ON: cds--GallusGallus-GRCg7b
+Records per file: 1000
Entryfunction called GallusGallus-GRCg7b.cds.fasta
File found at %s GallusGallus-GRCg7b.cds.fasta
@@ -215,7 +222,8 @@ File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus6005cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus7006cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus8007cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus9008cds.MOD.fa
-
+
+
This is pretty much telling us: yes, you have given me a file, and for every 1000 header-sequence pairs I have come across (I'm ignoring the number you gave me because this isn't a `pep` or `cdna` file) I have made a new file, saved at the path shown. You'll notice that it has also generated a new set of folders. This is based on how we have named the file.
@@ -265,7 +273,8 @@ Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.G
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus28453cds.MOD.fa
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus18005cds.MOD.fa
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus6001cds.MOD.fa
-
+
+
This is all useful for the pipeline, which generates job ids based on the org column, groups files by the org and type columns, and then pulls data from the data_file.
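As a rough illustration of that grouping (this is not the pipeline's own code), the rows of such a csv can be counted per org and type with `awk`. The function name `count_by_org_and_type` and the temporary example file are hypothetical, and the sketch assumes three comma-separated columns (org, type, data_file) with an optional header row starting with `org`.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of TreeVal: count chunked files per
# (org, type) pair in a {Name.Assession}-data.csv style file.
# Assumes comma-separated columns org,type,data_file; a header row whose
# first field is "org" is skipped if present.
count_by_org_and_type() {
    awk -F',' '
        NR == 1 && $1 == "org" { next }   # skip an optional header row
        NF >= 3 { count[$1 "," $2]++ }     # group key is "org,type"
        END { for (key in count) print key "," count[key] }
    ' "$1" | sort
}

# Build a small example file from the rows shown in the listing above.
example_csv="$(mktemp)"
cat > "$example_csv" <<'EOF'
org,type,data_file
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus28453cds.MOD.fa
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus18005cds.MOD.fa
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus6001cds.MOD.fa
EOF

# Prints one line per group in the form org,type,number_of_files.
count_by_org_and_type "$example_csv"
```

Running this kind of check before launching the pipeline is a quick way to confirm that every org has the data types you expect.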
@@ -284,7 +293,8 @@ alignment:
data_dir: /FULL/PATH/TO/treeval-resources/gene_alignment_data/
synteny_genome_path: /FULL/PATH/TO/treeval-resources/synteny
-
+
+
I said earlier that the fact we called a folder `bird` was important; this is because it now becomes our `classT`:
@@ -313,6 +323,7 @@ alignment:
### HiC data Preparation
+
@@ -393,6 +405,7 @@ samtools index {prefix}.cram
### Pretext Accessory File Ingestion
+
@@ -405,14 +418,15 @@ samtools index {prefix}.cram
Details
- cat {gap.bedgraph} | awk -v OFS="\t" '{$4= 1000; print}'| PretextGraph -i { your.pretext } -n "gap"
-
cd {outdir}/hic_files
- bigWigToBedGraph {coverage.bigWig} /dev/stdout | PretextGraph -i { your.pretext } -n "coverage"
+bigWigToBedGraph {coverage.bigWig} /dev/stdout | PretextGraph -i { your.pretext } -n "coverage"
- bigWigToBedGraph {repeat_density.bigWig} /dev/stdout | PretextGraph -i { your.pretext } -n "repeat_density"
+bigWigToBedGraph {repeat_density.bigWig} /dev/stdout | PretextGraph -i { your.pretext } -n "repeat_density"
- cat {telomere.bedgraph} | awk -v OFS="\t" '{$4 = 1000; print}'|PretextGraph -i { your.pretext } -n "telomere"
+cat {telomere.bedgraph} | awk -v OFS="\t" '{$4 = 1000; print}'|PretextGraph -i { your.pretext } -n "telomere"
+
+cat {gap.bedgraph} | awk -v OFS="\t" '{$4= 1000; print}'| PretextGraph -i { your.pretext } -n "gap"
+
Notes on using BUSCO
@@ -526,7 +540,7 @@ The typical command for running the pipeline is as follows:
```console
nextflow run sanger-tol/treeval --input assets/treeval.yaml --outdir