Merge pull request #291 from sanger-tol/galaxy_dev

1.1.0 - Ancient Aurora
sanger-tol · Apr 8, 2024 · c3ecafe · c3ecafe
2 parents 1740f93 + bc6c484
commit c3ecafe
Show file tree

Hide file tree

Showing 236 changed files with 5,374 additions and 1,338 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -29,8 +29,11 @@ jobs:
           - "22.10.1"
           - "latest-everything"
     steps:
-      - name: Check out pipeline code
-        uses: actions/checkout@v3
+      - name: Get branch names
+        # Pulls the names of current branches in repo
+        # steps.branch-names.outputs.current_branch is used later and returns the name of the branch the PR is made FROM not to
+        id: branch-names
+        uses: tj-actions/branch-names@v8
 
       - name: Install Nextflow
         uses: nf-core/setup-nextflow@v1
@@ -45,22 +48,28 @@ jobs:
           mkdir -p $NXF_SINGULARITY_CACHEDIR
           mkdir -p $NXF_SINGULARITY_LIBRARYDIR
 
-      - name: Download test data
+      - name: Install Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+
+      - name: Install nf-core
+        run: |
+          pip install nf-core
+
+      - name: NF-Core Download - download singularity containers
+        # Forcibly download repo on active branch and download SINGULARITY containers into the CACHE dir if not found
+        # Must occur after singularity install or will crash trying to dl containers
+        # Zip up this fresh download and run the checked out version
+        run: |
+          nf-core download sanger-tol/treeval --revision ${{ steps.branch-names.outputs.current_branch }} --compress none -d --force --outdir sanger-treeval --container-cache-utilisation amend --container-system singularity
+
+      - name: Download Tiny test data
         # Download A fungal test data set that is full enough to show some real output.
         run: |
           curl https://tolit.cog.sanger.ac.uk/test-data/resources/treeval/TreeValTinyData.tar.gz | tar xzf -
 
-      #- name: Docker - Run RAPID pipeline with test data
-      # Remember that you can parallelise this by using strategy.matrix
-      #  run: |
-      #    nextflow run ${GITHUB_WORKSPACE} -entry RAPID -profile test_github,docker --outdir ./results-rapid
-
-      - name: Singularity - Run RAPID pipeline with test data
+      - name: Singularity - Run FULL pipeline with test data
         # Remember that you can parallelise this by using strategy.matrix
         run: |
-          nextflow run ${GITHUB_WORKSPACE} -entry RAPID -profile test_github,singularity --outdir ./results-rapid
-
-      #- name: Run FULL pipeline with test data
-      #  # Remember that you can parallelise this by using strategy.matrix
-      #  run: |
-      ##    nextflow run ${GITHUB_WORKSPACE} -profile test_github,docker --outdir ./results-full
+          nextflow run ./sanger-treeval/${{ steps.branch-names.outputs.current_branch }}/main.nf -profile test_github,singularity --outdir ./Sing-Full
diff --git a/.github/workflows/linting.yml b/.github/workflows/linting.yml
@@ -32,7 +32,7 @@ jobs:
       - uses: actions/setup-node@v3
 
       - name: Install Prettier
-        run: npm install -g prettier
+        run: npm install -g prettier@3.0.3
 
       - name: Run Prettier --check
         run: prettier --check ${GITHUB_WORKSPACE}

diff --git a/.nf-core.yml b/.nf-core.yml
@@ -5,6 +5,9 @@ lint:
     - conf/test_full.config
     - docs/images/nf-core-treeval_logo_light.png
     - docs/images/nf-core-treeval_logo_dark.png
+    - conf/igenomes.config
+    - .github/workflows/awstest.yml
+    - .github/workflows/awsfulltest.yml
   files_unchanged:
     - .github/workflows/linting.yml
     - .github/CONTRIBUTING.md

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,96 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [1.1.0] - Ancient Aurora - [2024-04-26]
+
+The second release for sanger-tol, created with the [nf-core](https://nf-co.re/) template.
+
+This builds on the initial release by adding subworkflows which generate kmer based coverage tracks and a kmer spectra graph. There are also a number of updates to logic used throughout the pipeline, as well as to the resources required by a significant number of modules.
+
+### Enhancements & Fixes
+
+- Updates to the resource allocation methods used by a number of modules in the base.config.
+- Added a flag to stop the usage of Juicer.
+- Subworkflow to generate a kmer based coverage track.
+- Subworkflow to generate/update a kmer spectra graph.
+- Subworkflow to use minimap2 for HiC mapping, if selected.
+- Subworkflow to use BWAmem2 for HiC mapping, if selected.
+- Subworkflow to ingest Pretext accessory files into the Pretext file, simplifying post-TreeVal data manipulation.
+- Updated the logic in use throughout the pipeline.
+- Updated the modules.config to include some of the logic, cleaning the code.
+- Updated the HiC subworkflow to include subsampling the HiC data for Juicer due to resource requirements with large amounts of data.
+- Updated the YAML_INPUT subworkflow, this now contains "flags" to change some software options.
+- Updated the data names in the input YAML to reduce confusion.
+- Updated software (Pretext{View,Snapshot,Graph}) to allow for use on large genomes with big data.
+  - Added associated patch files and cpu architecture files.
+- Updated the minimap2 align module to remove samtools view in preference of paftools for our usecase.
+- Updated the test.yml inline with the above changes.
+- Updated the SELFCOMP subworkflow to allow for the parallelisation of the work on large genomes.
+- Updated the READ_COVERAGE subworkflow to produce the scaffold based AVG coverage and STND coverage
+- Updated Modules from NF-Core - mostly relates to module structure rather than software.
+- Updated the SummaryStats output to include HiC container counts.
+- Added -T / -t flags where possible to minimise the use of the /tmp directory.
+- Replaced CONCAT_MUMMER with CATCAT for simplicity.
+- Removed JUICER from the RAPID entrypoint.
+- Removed the csi or tbi logic. CSI is now used by default, this simplified the workflow and enlarges the capacity to handle much larger genomes. The logic block previously required was then moved.
+- Added NF-DOWNLOAD to the CI-CD due to an error that causes incomplete downloaded when downloading a number of images at the same time.
+- Added the RAPID_TOL entry point which is more geared towards the requirements of Sanger.
+- Fix a bug in build_alignment_blocks.py to avoid indexing errors happening in large genomes.
+- Change output BEDGRAPH from EXTRACT_TELO module.
+
+### Parameters
+
+| Old Parameter | New Parameter |
+| ------------- | ------------- |
+| -             | --juicer      |
+
+### Software dependencies
+
+Note, since the pipeline is using Nextflow DSL2, each process will be run with its own Biocontainer. This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference.
+
+| Module                                 | Old Version  | New Versions     |
+| -------------------------------------- | ------------ | ---------------- |
+| bamtobed_sort ( bedtools + samtools )  | -            | 2.31.0 + 1.17    |
+| bedtools                               | 2.31.0       | 2.31.1           |
+| busco                                  | 5.4.3        | 5.5.0            |
+| bwa-mem2                               | -            | 2.2.1            |
+| cat                                    | -            | 2.3.4            |
+| chunk_fasta ( pyfasta )                | -            | 0.5.2-1          |
+| cooler                                 | -            | 0.9.2            |
+| cram_filter_align_bwamem2_fixmate_sort | -            |                  |
+| ^ ( samtools + bwamem2 ) ^             | -            | 1.17 + 2.2.1     |
+| coreutils                              | -            | 9.1              |
+| fastk                                  | -            | 1.0.1            |
+| gcc                                    | 7.1.0        | 10.4.0           |
+| find_telomere_windows ( java-jdk )     | -            | 8.0.112          |
+| generate_cram_csv ( samtools )         | -            | 1.17             |
+| gnu-sort                               | -            | 8.25             |
+| juicer_tools_pre ( java-jdk )          | -            | 8.0.112          |
+| perl                                   | -            | 5.26.2           |
+| merquryfk                              | -            | 1.0.1            |
+| minimap2 + samtools                    | -            | 2.24 + 1.14      |
+| miniprot                               | -            | 0.11--he4a0461_2 |
+| mummer                                 | -            | 3.23             |
+| paftools ( minimap2 + samtools )       | -            | 2.24 + 1.14      |
+| pretextmap + samtools                  | 0.1.9 + 1.17 | 0.0.2 + 1.17     |
+| python                                 | 3.9          | -                |
+| - pandas                               | 1.5.2        | -                |
+| samtools                               | 1.17         | 1.18             |
+| selfcomp_splitfasta ( perl-bioperl )   | -            | 1.7.8-1          |
+| seqtk                                  | -            | 1.4              |
+| tabix                                  | -            | 1.11             |
+| ucsc                                   | -            | 377              |
+| windowmasker (blast)                   | -            | 2.14.0           |
+
+### Fixed
+
+- Resource allocations being calculated incorrectly.
+- Pretext bugs related to large data.
+
+### Dependencies
+
+### Deprecated
+
 ## [1.0.0] - Ancient Atlantis - [2023-06-27]
 
 Initial release of sanger-tol/treeval, created with the [nf-core](https://nf-co.re/) template.

diff --git a/README.md b/README.md
@@ -7,7 +7,7 @@
 
 ## Introduction
 
-**sanger-tol/treeval** is a bioinformatics best-practice analysis pipeline for the generation of data supplemental to the curation of reference quality genomes. This pipeline has been written to generate flat files compatible with [JBrowse2](https://jbrowse.org/jb2/).
+**sanger-tol/treeval [1.1.0 - Ancient Aurora]** is a bioinformatics best-practice analysis pipeline for the generation of data supplemental to the curation of reference quality genomes. This pipeline has been written to generate flat files compatible with [JBrowse2](https://jbrowse.org/jb2/) as well as HiC maps for use in Juicebox, PretextView and HiGlass.
 
 The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from [nf-core/modules](https://github.com/nf-core/modules) in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!
 
@@ -29,6 +29,8 @@ The treeval pipeline has a sister pipeline currently named [curationpretext](htt
 11. Generate a telomere track based on input motif ( TELO_FINDER )
 12. Run Busco and convert results into bed format ( BUSCO_ANNOTATION )
 13. Ancestral Busco linkage if available for clade ( BUSCO_ANNOTATION:ANCESTRAL_GENE )
+14. Count KMERs with FastK and plot the spectra using MerquryFK ( KMER )
+15. Generate a coverge track using KMER data ( KMER_READ_COVERAGE )
 
 ## Usage
 

diff --git a/assets/github_testing/TreeValTinyTest.yaml → ...s/github_testing/TreeValTinyFullTest.yaml b/assets/github_testing/TreeValTinyTest.yaml → ...s/github_testing/TreeValTinyFullTest.yaml
@@ -6,26 +6,33 @@ assembly:
   assem_version: 1
   project_id: DTOL
 reference_file: /home/runner/work/treeval/treeval/TreeValTinyData/assembly/draft/grTriPseu1.fa
+map_order: length
 assem_reads:
-  longread_type: hifi
-  longread_data: /home/runner/work/treeval/treeval/TreeValTinyData/genomic_data/pacbio/
-  hic_data: /home/runner/work/treeval/treeval/TreeValTinyData/genomic_data/hic-arima/
+  read_type: hifi
+  read_data: /home/runner/work/treeval/treeval/TreeValTinyData/genomic_data/pacbio
   supplementary_data: path
+hic_data:
+  hic_cram: /home/runner/work/treeval/treeval/TreeValTinyData/genomic_data/hic-arima/
+  hic_aligner: bwamem2
+kmer_profile:
+  # kmer_length will act as input for kmer_read_cov fastk and as the name of folder in profile_dir
+  kmer_length: 31
+  dir: /home/runner/work/treeval/treeval/TreeValTinyData/
 alignment:
   data_dir: /home/runner/work/treeval/treeval/TreeValTinyData/gene_alignment_data/
   common_name: "" # For future implementation (adding bee, wasp, ant etc)
-  geneset: "LaetiporusSulphureus.gfLaeSulp1"
+  geneset_id: "LaetiporusSulphureus.gfLaeSulp1"
   #Path should end up looking like "{data_dir}{classT}/{common_name}/csv_data/{geneset}-data.csv"
 self_comp:
   motif_len: 0
   mummer_chunk: 10
-synteny:
-  synteny_genome_path: /home/runner/work/treeval/treeval/TreeValTinyData/synteny/
-outdir: "NEEDS TESTING"
 intron:
   size: "50k"
 telomere:
   teloseq: TTAGGG
+synteny:
+  synteny_path: /home/runner/work/treeval/treeval/treeval/TreeValTinyData/synteny
+  synteny_genomes: "LaetiporusSulphureus"
 busco:
   lineages_path: /home/runner/work/treeval/treeval/TreeValTinyData/busco/subset/
   lineage: fungi_odb10
diff --git a/assets/local_testing/nxOscDF5033-BGA.yaml b/assets/local_testing/nxOscDF5033-BGA.yaml
@@ -21,7 +21,8 @@ intron:
 telomere:
   teloseq: TTAGGG
 synteny:
-  synteny_genome_path: /workspace/treeval-curation/synteny/ # Will not exist
+  synteny_path: /nfs/treeoflife-01/teams/tola/users/dp24/treeval/TreeValTinyData/synteny/
+  synteny_genomes: "LaetiporusSulphureus"
 busco:
   lineages_path: /workspace/treeval-curation/busco/v5
   lineage: nematoda_odb10
diff --git a/assets/local_testing/nxOscDF5033.yaml b/assets/local_testing/nxOscDF5033.yaml
@@ -6,11 +6,18 @@ assembly:
   defined_class: nematode
   project_id: DTOL
 reference_file: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_DF5033/assembly/draft/DF5033.hifiasm.noTelos.20211120/DF5033.noTelos.hifiasm.purged.noCont.noMito.fasta
+map_order: length
 assem_reads:
-  longread_type: hifi
-  longread_data: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_DF5033/genomic_data/nxOscSpes1/pacbio/fasta/
-  hic_data: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_DF5033/genomic_data/nxOscSpes1/hic-arima2/full/
+  read_type: hifi
+  read_data: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_DF5033/genomic_data/nxOscSpes1/pacbio/fasta/
   supplementary_data: path
+hic_data:
+  hic_cram: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_DF5033/genomic_data/nxOscSpes1/hic-arima2/full/
+  hic_aligner: minimap2
+kmer_profile:
+  # kmer_length will act as input for kmer_read_cov fastk and as the name of folder in profile_dir
+  kmer_length: 31
+  dir: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_DF5033/genomic_data/nxOscSpes1/pacbio/
 alignment:
   data_dir: /lustre/scratch123/tol/resources/treeval/gene_alignment_data/
   common_name: "" # For future implementation (adding bee, wasp, ant etc)
@@ -24,7 +31,8 @@ intron:
 telomere:
   teloseq: TTAGGG
 synteny:
-  synteny_genome_path: /lustre/scratch123/tol/resources/treeval/synteny/
+  synteny_path: /nfs/treeoflife-01/teams/tola/users/dp24/treeval/TreeValTinyData/synteny/
+  synteny_genomes: ""
 busco:
   lineages_path: /lustre/scratch123/tol/resources/busco/v5
   lineage: nematoda_odb10
diff --git a/assets/local_testing/nxOscSUBSET.yaml b/assets/local_testing/nxOscSUBSET.yaml
@@ -1,30 +1,38 @@
 assembly:
   assem_level: scaffold
+  assem_version: 1
   sample_id: OscheiusSUBSET
   latin_name: to_provide_taxonomic_rank
   defined_class: nematode
-  assem_version: 1
   project_id: DTOL
 reference_file: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_SUBSET/assembly/draft/SUBSET_genome/Oscheius_SUBSET.fasta
-#/lustre/scratch123/tol/resources/treeval/nextflow_test_data/Oscheius_DF5033/assembly/draft/DF5033.hifiasm.noTelos.20211120/DF5033.noTelos.hifiasm.purged.noCont.noMito.fasta
+map_order: length
 assem_reads:
-  pacbio: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_SUBSET/genomic_data/pacbio/
-  hic: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_DF5033/genomic_data/nxOscSpes1/hic-arima2/subset/
-  supplementary: path
+  read_type: hifi
+  read_data: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_SUBSET/genomic_data/pacbio/
+  supplementary_data: path
+hic_data:
+  hic_cram: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_DF5033/genomic_data/nxOscSpes1/hic-arima2/subset/
+  hic_aligner: minimap2
+kmer_profile:
+  # kmer_length will act as input for kmer_read_cov fastk and as the name of folder in profile_dir
+  kmer_length: 31
+  dir: /lustre/scratch123/tol/resources/treeval/treeval-testdata/TreeValSmallData/Oscheius_DF5033/genomic_data/nxOscSpes1/pacbio/
 alignment:
   data_dir: /lustre/scratch123/tol/resources/treeval/gene_alignment_data/
   common_name: "" # For future implementation (adding bee, wasp, ant etc)
-  geneset: "Gae_host.Gae"
+  geneset_id: "Gae_host.Gae"
   #Path should end up looking like "{data_dir}{classT}/{common_name}/csv_data/{geneset}-data.csv"
 self_comp:
   motif_len: 0
-  mummer_chunk: 4
+  mummer_chunk: 10
 intron:
   size: "50k"
 telomere:
   teloseq: TTAGGG
 synteny:
-  synteny_genome_path: /lustre/scratch123/tol/resources/treeval/synteny/
+  synteny_path: /nfs/treeoflife-01/teams/tola/users/dp24/treeval/TreeValTinyData/synteny/
+  synteny_genomes: ""
 busco:
   lineages_path: /lustre/scratch123/tol/resources/busco/v5
   lineage: nematoda_odb10
diff --git a/assets/nematode/csv_data/s3_Gae_Host.Gae-data.csv b/assets/nematode/csv_data/s3_Gae_Host.Gae-data.csv
diff --git a/bin/FKprof b/bin/FKprof
diff --git a/bin/assign_anc.py b/bin/assign_anc.py
@@ -2,8 +2,11 @@
 import pandas as pd
 import optparse
 
-
 # Script originally developed by Yumi Sims ([email protected])
+# -------------------
+# Update for BUSCO 5.5.0 - by we3 (Will Eagles)
+# Reorder start and end so smallest always second column. Also, trim range from scaffold name in first column.
+# -------------------
 
 parser = optparse.OptionParser(version="%prog 1.0")
 parser.add_option(
@@ -65,4 +68,15 @@
 
 df_final = df_final.astype({"Gene End": "int", "Gene Start": "int"})
 
+df_final["Sequence"] = df_final["Sequence"].str.replace(r":.*", "", regex=True)
+
+df_final[["Gene Start", "Gene End"]] = df_final.apply(
+    lambda row: (
+        (row["Gene Start"], row["Gene End"])
+        if row["Gene Start"] < row["Gene End"]
+        else (row["Gene End"], row["Gene Start"])
+    ),
+    axis=1,
+    result_type="expand",
+)
 df_final.to_csv(csvfile, index=False, header=False, sep="\t")
diff --git a/bin/awk_filter_reads.sh b/bin/awk_filter_reads.sh
@@ -0,0 +1 @@
+awk 'BEGIN{OFS="\t"}{if($1 ~ /^\@/) {print($0)} else {$2=and($2,compl(2048)); print(substr($0,2))}}'
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1 @@
		awk 'BEGIN{OFS="\t"}{if($1 ~ /^\@/) {print($0)} else {$2=and($2,compl(2048)); print(substr($0,2))}}'