Update usage documentation for synteny and gene alignment
weaglesBio committed Nov 27, 2024
1 parent 370c8dc commit a9fa618
Showing 1 changed file with 16 additions and 129 deletions.
145 changes: 16 additions & 129 deletions docs/usage.md
@@ -8,15 +8,15 @@

The TreeVal pipeline has a few requirements before being able to run:

- The `gene_alignment_data` (this refers to the alignment.data_dir and alignment.geneset data noted in the yaml, which we will explain later) and `synteny_data` follow a particular directory structure.
- The `gene_alignment_data` requires a specific .csv format.

- HiC CRAM files must be pre-indexed in the same location as the CRAM file, e.g., `samtools index {cram file}`. A check and automated indexing of the cram file will be added in the future.

- Finally, the yaml file (described below in Full Samplesheet) must contain all of the information related to the assembly for the pipeline to run.
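
The CRAM indexing requirement above can be checked before a run. The following is a minimal sketch, not part of TreeVal: the function names `list_unindexed_crams` and `index_missing_crams` are hypothetical, and it assumes the index sits beside each CRAM as `<file>.cram.crai`, which is where `samtools index` writes it by default.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every *.cram in a directory that has
# no .crai index sitting next to it.
list_unindexed_crams() {
    local dir="$1" cram
    for cram in "$dir"/*.cram; do
        [ -e "$cram" ] || continue                 # the glob matched nothing
        [ -e "$cram.crai" ] || printf '%s\n' "$cram"
    done
}

# Hypothetical helper: index whatever is missing (needs samtools on PATH).
index_missing_crams() {
    local cram
    list_unindexed_crams "$1" | while IFS= read -r cram; do
        samtools index "$cram"
    done
}
```

Calling `index_missing_crams /path/to/hic_cram/` would then index only the files that still need it, leaving existing indexes untouched.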

## Prior to running TreeVal

:warning: Please ensure you read the following sections on Directory Structure (`gene_alignment_data`, `synteny`, scripts), HiC data prep and Pacbio data prep. Without these you may not be able to successfully run the TreeVal pipeline. If anything is unclear then please leave an issue report.
:warning: Please ensure you read the following sections on Directory Structure (`gene_alignment_data` and scripts), HiC data prep and Pacbio data prep. Without these you may not be able to successfully run the TreeVal pipeline. If anything is unclear then please leave an issue report.

We now also support (and encourage) using the nf-co2footprint plugin (on Nextflow versions >= 23.07), which generates statistics on how much energy your pipeline uses as well as the amount of CO2 it helps produce. As it is pre-release, you will need to compile this plugin yourself and store it in your `$NXF_HOME/plugins` directory, which you can find with `echo $NXF_HOME`. We have included the relevant config file `co2footprint.config` in this repo. The plugin can be used by including `-plugins nf-co2footprint@{VERSION} -c co2footprint.config` in your nextflow command. Please head to the website to find out more: [NF-CO2FOOTPRINT](https://nextflow-io.github.io/nf-co2footprint/contributing/setup/).

@@ -44,110 +44,18 @@ You should now be able to run the pipeline as you see fit.

</details>

### Directory Structure

<details markdown="1">
<summary>Details</summary>

The working example in the Gene alignment and synteny sections below covers setting up the `synteny` and `gene_alignment_data` directories as well as downloading some example data.

These two sub-workflows, for now, need the use of the variables `defined_class`, `synteny_genome_path`, `data_dir` and `geneset`. These variables are found inside the yaml ( this is the file that will tell TreeVal what and where everything is ). Currently, we don't use `common_name`, e.g., `bee`, `wasp`, `moth`, etc. However, we hope to make use of it in the future as our `gene_alignment_data` "database" grows and requires a higher degree of organisation.

First, you should set up a directory in our recommended structure:

```
treeval-resources
├─ gene_alignment_data/
│ ├─ { defined_class }
│ ├─ csv_data
│ │ └─ { Organism.Accession }-data.csv # Generated by our scripts
│ ├─ { Organism } # Here and below is generated by our scripts
│ ├─ { Organism.Accession }
│ ├─ cdna
│ │ └─ { Chunked fasta files }
│ ├─ rna
│ │ └─ { Chunked fasta files }
│ ├─ cds
│ │ └─ { Chunked fasta files }
│ └─ peps
│ └─ { Chunked fasta files }
├─ gene_alignment_prep/
│ ├─ scripts/ # We supply these in this repo
│ ├─ raw_fasta/ # Storing your fasta downloaded from NCBI or Ensembl
│ └─ treeval-datasets.tsv # Organism, common_name, clade, family, group, link_to_data, notes
├─ synteny/
│ └─ {defined_class}
├─ treeval_yaml/ # Storage folder for your yaml files, it's useful to keep them
└─ treeval_stats/ # Storage for your treeval output stats file whether for upload to our repo
```

The above naming is explained further below.

`defined_class` can be your own system of classification, as long as it is consistent. At Sanger we use the scheme below, and we advise you to do the same. Again, the value entered into the yaml (the file we will use to tell TreeVal where everything is) is used to find `gene_alignment_data` as well as syntenic genomes.

![defined_class](../docs/images/Sanger-classT.png)

</details>

### Synteny

<details markdown="1">
<summary>Details</summary>

For synteny you should store the full genomic fasta file of any high-quality genome you want to compare against in the directory created above.

For bird we recommend the Golden Eagle (_Aquila chrysaetos_) and the Zebrafinch (_Taeniopygia guttata_), which can be downloaded from NCBI. Rename these files to something more human readable and drop them into the `synteny/bird/` folder. Any TreeVal run you now perform where the `defined_class` is bird will run a syntenic alignment against all genomes in that folder. It would be best to keep this to around three unless needed. Again, this is something we could expand on with the `common_name` field if requested in the future.

</details>

### Gene Alignment and Synteny Data and Directories
### Gene Alignment and Synteny Data

<details markdown="1">
<summary>Details</summary>

Seeing as this can be quite complicated to set up, here's a walkthrough.

#### Step 1 - Set up the directories

Let's set up the directory structure as if we want to run it on a bird genome.

```
mkdir -p gene_alignment_prep/scripts/
#### Step 1 -- Preparing Synteny data

cp treeval/bin/treeval-dataprep/* gene_alignment_prep/scripts/
For synteny you should provide the full genomic fasta file of any high-quality genome you want to compare against.

mkdir -p gene_alignment_prep/raw_fasta/
mkdir -p gene_alignment_data/bird/csv_data/
mkdir -p synteny/bird/
```

The naming of the bird folder here is important, keep this in mind.

So now we have this structure:

```
~/treeval-resources
├─ synteny/
│ └─ bird/
├─ gene_alignment_data/
│ └─ bird/
│ └─ csv_data/
└─ gene_alignment_prep/
├─ scripts/
└─ raw_fasta/
```

#### Step 2 - Download some data
For bird we recommend the Golden Eagle ( _Aquila chrysaetos_ ) and the Zebrafinch (_Taeniopygia guttata_), which can be downloaded from NCBI.

Now, let's download our syntenic alignment data. I think the Zebrafinch (_Taeniopygia guttata_) would be good against the Chicken (_Gallus gallus_).

@@ -163,6 +71,8 @@ This leaves us with a file called `bTaeGut1_4.fasta` the genomic assembly of `bT

Now let's move into the `raw_data` folder and download some data. This may take some time.

#### Step 1 -- Preparing Gene alignment data

```
cd ../../gene_alignment_prep/raw_data/
@@ -276,40 +186,19 @@ This is all useful for the pipeline which generates job ids based on the org col

#### Step 4 -- Understand where we are at

So we have now generated the directory structure for `gene_alignment_data`. So now let's use what we know to fill out the yaml.
Now let's use what we know to fill out the yaml.

The yaml is a file that we need in order to tell the pipeline where everything is, an example can be found [here](https://raw.githubusercontent.com/sanger-tol/treeval/dev/assets/local_testing/nxOscDF5033.yaml).

Here we can see a number of fields that need to be filled out, the easiest being `synteny_genome_path` and `data_dir`. These refer to the directories we made earlier so we can replace them as such:

```yaml
alignment:
data_dir: /FULL/PATH/TO/treeval-resources/gene_alignment_data/

synteny_genome_path: /FULL/PATH/TO/treeval-resources/synteny
```
I said earlier that the fact we called a folder `bird` was important. This is because it now becomes our `defined_class`:

```yaml
defined_class: bird
```

During pipeline execution, this is appended onto the end of `data_dir` and `synteny_genome_path` in order to find the correct files to use. So now all of the files inside `/FULL/PATH/TO/treeval-resources/synteny/bird/` will be used for syntenic alignments. Likewise, TreeVal will turn our `alignment.data_dir` into `/FULL/PATH/TO/treeval-resources/gene_alignment_data/bird/` and then append `csv_data/`.

In Step 3, we generated some files which will be living in our `/FULL/PATH/TO/treeval-resources/gene_alignment_data/bird/csv_data/` folder and look like `GallusGallus.GRCg7b-data.csv`. These (minus the `-data.csv`) will be what we enter into the `geneset` field in the yaml. The `common_name` is a field we don't currently use.

```yaml
alignment:
data_dir: /FULL/PATH/TO/treeval-resources/gene_alignment_data/
common_name: "" # For future implementation (adding bee, wasp, ant etc)
geneset: "GallusGallus.GRCg7b"
genesets:
- /FULL/PATH/TO/<geneset_name>-data.csv
synteny:
synteny_genomes:
- /FULL/PATH/TO/<genome_name>.fasta
```
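
Because the `genesets` and `synteny_genomes` fields take explicit file paths, it can help to confirm that each path points at a real, non-empty file before launching a run. The following is a minimal sketch; `report_missing_files` is a hypothetical helper and not part of TreeVal.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report each argument that is not a readable,
# non-empty file. Returns non-zero if any path fails the check.
report_missing_files() {
    local path missing=0
    for path in "$@"; do
        if [ ! -s "$path" ] || [ ! -r "$path" ]; then
            printf 'missing or empty: %s\n' "$path" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Pass it the same paths you list in the yaml, quoting each one, and only start the pipeline if it returns successfully.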
However, what is cool about this field (geneset) is that you can add as many as you want. So say you have the `alignment.data_dir` for the Finch saved as `TaeniopygiaGuttata.bTaeGut1_4`. The geneset field becomes: `geneset: "GallusGallus.GRCg7b,TaeniopygiaGuttata.bTaeGut1_4"`

Hopefully this explains things a bit better and you understand how this sticks together!

</details>
### HiC data Preparation
@@ -445,17 +334,15 @@ The following is an example YAML file we have used during production: [nxOscDF50
- `hic_cram`: path (ending with `/`) to folder containing cram files.
- `hic_aligner`: choice between `bwamem2` and `minimap2`
- `alignment`
- `data_dir`: Gene alignment data path (ending with `/`).
- `common_name`: For future implementation (adding bee, wasp, ant etc)
- `geneset_id`: a csv list of geneset data to be used
- `genesets`: List of Gene alignment data .csv file paths.
- `kmer_profile`:
- `kmer_length`: length of kmer to be used in plotting
- `dir`: directory containing old plot to be regenerated if applicable
- `self_comp`
- `motif_len`: Length of motif to be used in self complementary sequence finding
- `mummer_chunk`: Size of chunks used by MUMMER module.
- `synteny`
- `synteny_genome_path`: Path to syntenic genomes grouped by clade.
- `synteny_genomes`: List of paths to syntenic genomes grouped by clade.
- `outdir`: Will be required in future development.
- `intron:`
- `size`: base pair size of introns default is 50k