Commit

Merge pull request #121 from tkchafin/db_params

Db params

tkchafin authored Nov 22, 2024
2 parents 8d2b83f + 7cfbcc0 commit 301bcff
Showing 42 changed files with 527 additions and 64 deletions.
12 changes: 1 addition & 11 deletions .github/workflows/ci.yml
@@ -35,19 +35,9 @@ jobs:
with:
version: "${{ matrix.NXF_VER }}"

- name: Download the NCBI taxdump database
run: |
mkdir ncbi_taxdump
curl -L https://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz | tar -C ncbi_taxdump -xzf -
- name: Download the BUSCO lineage database
run: |
mkdir busco_database
curl -L https://tolit.cog.sanger.ac.uk/test-data/resources/busco/blobtoolkit.GCA_922984935.2.2023-08-03.lineages.tar.gz | tar -C busco_database -xzf -
- name: Run pipeline with test data
# You can customise CI pipeline run tests as required
# For example: adding multiple test runs with different parameters
# Remember that you can parallelise this by using strategy.matrix
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test,docker --taxdump $PWD/ncbi_taxdump --busco $PWD/busco_database --outdir ./results
nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -3,7 +3,7 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [[0.7.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.7.0)] – Psyduck – [2024-10-02]
## [[0.7.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.7.0)] – Psyduck – [2024-11-20]

The pipeline is now considered to be a complete and suitable replacement for the Snakemake version.

@@ -13,6 +13,7 @@ The pipeline is now considered to be a complete and suitable replacement for the
to indicate in the samplesheet whether the reads are paired or single.
- Updated the Blastn settings to allow 7 days runtime at most, since that
covers 99.7% of the jobs.
- Allow database inputs to be optionally compressed (`.tar.gz`)

### Software dependencies

Binary file removed assets/test/mMelMel3.1.buscogenes.dmnd
Binary file not shown.
Binary file removed assets/test/mMelMel3.1.buscoregions.dmnd
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/nt_mMelMel3.1.ndb
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/nt_mMelMel3.1.nhr
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/nt_mMelMel3.1.nin
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/nt_mMelMel3.1.nog
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/nt_mMelMel3.1.nos
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/nt_mMelMel3.1.not
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/nt_mMelMel3.1.nsq
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/nt_mMelMel3.1.ntf
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/nt_mMelMel3.1.nto
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/taxonomy4blast.sqlite3
Binary file not shown.
Binary file removed assets/test_full/gfLaeSulp1.1.buscogenes.dmnd
Binary file not shown.
Binary file removed assets/test_full/gfLaeSulp1.1.buscoregions.dmnd
Binary file not shown.
Binary file not shown.
Binary file removed assets/test_full/nt_gfLaeSulp1.1/nt_gfLaeSulp1.1.nhr
Binary file not shown.
Binary file removed assets/test_full/nt_gfLaeSulp1.1/nt_gfLaeSulp1.1.nin
Binary file not shown.
Binary file removed assets/test_full/nt_gfLaeSulp1.1/nt_gfLaeSulp1.1.nog
Binary file not shown.
Binary file removed assets/test_full/nt_gfLaeSulp1.1/nt_gfLaeSulp1.1.nos
Binary file not shown.
Binary file removed assets/test_full/nt_gfLaeSulp1.1/nt_gfLaeSulp1.1.not
Binary file not shown.
Binary file removed assets/test_full/nt_gfLaeSulp1.1/nt_gfLaeSulp1.1.nsq
Binary file not shown.
Binary file not shown.
Binary file removed assets/test_full/nt_gfLaeSulp1.1/nt_gfLaeSulp1.1.nto
Binary file not shown.
Binary file not shown.
10 changes: 5 additions & 5 deletions conf/test.config
@@ -30,11 +30,11 @@ params {
taxon = "Meles meles"

// Databases
taxdump = "/lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump"
busco = "/lustre/scratch123/tol/resources/nextflow/busco/blobtoolkit.GCA_922984935.2.2023-08-03"
blastp = "${projectDir}/assets/test/mMelMel3.1.buscogenes.dmnd"
blastx = "${projectDir}/assets/test/mMelMel3.1.buscoregions.dmnd"
blastn = "${projectDir}/assets/test/nt_mMelMel3.1"
taxdump = "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz"
busco = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/blobtoolkit.GCA_922984935.2.2023-08-03.tar.gz"
blastp = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/mMelMel3.1.buscogenes.dmnd.tar.gz"
blastx = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/mMelMel3.1.buscoregions.dmnd.tar.gz"
blastn = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/nt_mMelMel3.1.tar.gz"

// Need to be set to avoid overfilling /tmp
use_work_dir_as_temp = true
8 changes: 4 additions & 4 deletions conf/test_full.config
@@ -25,11 +25,11 @@ params {
taxon = "Laetiporus sulphureus"

// Databases
taxdump = "/lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump"
taxdump = "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz"
busco = "/lustre/scratch123/tol/resources/busco/latest"
blastp = "${projectDir}/assets/test_full/gfLaeSulp1.1.buscogenes.dmnd"
blastx = "${projectDir}/assets/test_full/gfLaeSulp1.1.buscoregions.dmnd"
blastn = "${projectDir}/assets/test_full/nt_gfLaeSulp1.1"
blastp = "https://tolit.cog.sanger.ac.uk/test-data/Laetiporus_sulphureus/resources/gfLaeSulp1.1.buscogenes.dmnd.tar.gz"
blastx = "https://tolit.cog.sanger.ac.uk/test-data/Laetiporus_sulphureus/resources/gfLaeSulp1.1.buscoregions.dmnd.tar.gz"
blastn = "https://tolit.cog.sanger.ac.uk/test-data/Laetiporus_sulphureus/resources/nt_gfLaeSulp1.1.tar.gz"

// Need to be set to avoid overfilling /tmp
use_work_dir_as_temp = true
10 changes: 5 additions & 5 deletions conf/test_raw.config
@@ -31,11 +31,11 @@ params {
taxon = "Meles meles"

// Databases
taxdump = "/lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump"
busco = "/lustre/scratch123/tol/resources/nextflow/busco/blobtoolkit.GCA_922984935.2.2023-08-03"
blastp = "${projectDir}/assets/test/mMelMel3.1.buscogenes.dmnd"
blastx = "${projectDir}/assets/test/mMelMel3.1.buscoregions.dmnd"
blastn = "${projectDir}/assets/test/nt_mMelMel3.1/"
taxdump = "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz"
busco = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/blobtoolkit.GCA_922984935.2.2023-08-03.tar.gz"
blastp = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/mMelMel3.1.buscogenes.dmnd.tar.gz"
blastx = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/mMelMel3.1.buscoregions.dmnd.tar.gz"
blastn = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/nt_mMelMel3.1.tar.gz"

// Need to be set to avoid overfilling /tmp
use_work_dir_as_temp = true
28 changes: 27 additions & 1 deletion docs/usage.md
@@ -78,15 +78,20 @@ The BlobToolKit pipeline can be run in many different ways. The default way requ

It is a good idea to put a date suffix for each database location so you know at a glance whether you are using the latest version. We are using the `YYYY_MM` format as we do not expect the databases to be updated more frequently than once a month. However, feel free to use `DATE=YYYY_MM_DD` or a different format if you prefer.

Note that any of the input databases may optionally be passed to the pipeline as a compressed `.tar.gz` archive; the pipeline will handle decompression.
The instructions below show how to build each input database in _two_ forms: decompressed _and_ compressed. You may not need to do both. Select the one that is most appropriate for how you want to use the pipeline.

#### 1. NCBI taxdump database

Create the database directory, retrieve and decompress the NCBI taxonomy:

```bash
DATE=2024_10
TAXDUMP=/path/to/databases/taxdump_${DATE}
TAXDUMP_TAR=/path/to/databases/taxdump_${DATE}.tar.gz
mkdir -p "$TAXDUMP"
curl -L ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz | tar -xzf - -C "$TAXDUMP"
curl -L ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz -o $TAXDUMP_TAR
tar -xzf $TAXDUMP_TAR -C "$TAXDUMP"
```

#### 2. NCBI nucleotide BLAST database
@@ -96,6 +101,7 @@ Create the database directory and move into the directory:
```bash
DATE=2024_10
NT=/path/to/databases/nt_${DATE}
NT_TAR=/path/to/databases/nt_${DATE}.tar.gz
mkdir -p $NT
cd $NT
```
@@ -113,6 +119,11 @@ done
wget "https://ftp.ncbi.nlm.nih.gov/blast/db/v5/taxdb.tar.gz" &&
tar xf taxdb.tar.gz -C $NT &&
rm taxdb.tar.gz

# Compress and cleanup
cd ..
tar -cvzf $NT_TAR $NT
rm -r $NT
```

#### 3. UniProt reference proteomes database
@@ -126,6 +137,7 @@ Create the database directory and move into the directory:
```bash
DATE=2024_10
UNIPROT=/path/to/databases/uniprot_${DATE}
UNIPROT_TAR=/path/to/databases/uniprot_${DATE}.tar.gz
mkdir -p $UNIPROT
cd $UNIPROT
```
@@ -152,6 +164,12 @@ diamond makedb -p 16 --in reference_proteomes.fasta.gz --taxonmap reference_prot
# clean up
mv extract/{README,STATS} .
rm -r extract
rm -r $TAXDUMP

# Compress final database and cleanup
cd ..
tar -cvzf $UNIPROT_TAR $UNIPROT
rm -r $UNIPROT
```

#### 4. BUSCO databases
@@ -161,6 +179,7 @@ Create the database directory and move into the directory:
```bash
DATE=2024_10
BUSCO=/path/to/databases/busco_${DATE}
BUSCO_TAR=/path/to/databases/busco_${DATE}.tar.gz
mkdir -p $BUSCO
cd $BUSCO
```
@@ -181,6 +200,13 @@ If you have [GNU parallel](https://www.gnu.org/software/parallel/) installed, yo
find v5/data -name "*.tar.gz" | parallel "cd {//}; tar -xzf {/}"
```

Finally, re-compress the database directory and clean up the uncompressed files:

```bash
tar -cvzf $BUSCO_TAR $BUSCO
rm -r $BUSCO
```
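
Once the databases are built, either form can be supplied to the pipeline. The sketch below is illustrative only and is not part of this commit: the samplesheet, assembly, and taxon values are placeholders, and it assumes the `*_TAR` archives built above.

```bash
# Illustrative only: point every database parameter at the compressed archives built above.
# All non-database values (samplesheet, assembly, taxon, outdir) are placeholders.
nextflow run sanger-tol/blobtoolkit \
    -profile docker \
    --input samplesheet.csv \
    --fasta assembly.fasta.gz \
    --taxon "Meles meles" \
    --taxdump "$TAXDUMP_TAR" \
    --blastn "$NT_TAR" \
    --blastp "$UNIPROT_TAR" \
    --blastx "$UNIPROT_TAR" \
    --busco "$BUSCO_TAR" \
    --outdir ./results
```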

## Changes from Snakemake to Nextflow

### Commands
5 changes: 5 additions & 0 deletions modules.json
@@ -87,6 +87,11 @@
"installed_by": ["modules"],
"patch": "modules/nf-core/seqtk/subseq/seqtk-subseq.diff"
},
"untar": {
"branch": "master",
"git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
"installed_by": ["modules"]
},
"windowmasker/mkcounts": {
"branch": "master",
"git_sha": "32cac29d4a92220965dace68a1fb0bb2e3547cac",
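
For reference, an entry like the `untar` record above is what `nf-core/tools` writes to `modules.json` when a module is installed; the command below is the usual way to add it (general nf-core tooling, not something shown in this diff).

```bash
# Install the nf-core untar module into the pipeline; nf-core/tools records it in modules.json.
nf-core modules install untar
```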
18 changes: 8 additions & 10 deletions modules/local/generate_config.nf
Expand Up @@ -10,13 +10,11 @@ process GENERATE_CONFIG {
val taxon_query
val busco_lin
path lineage_tax_ids
tuple val(meta2), path(blastn)
val reads
// The following are passed as "val" because we just want to know the full paths. No staging necessary
val blastp_path
val blastx_path
val blastn_path
val taxdump_path
tuple val(meta2), path(blastp)
tuple val(meta3), path(blastx)
tuple val(meta4), path(blastn)
tuple val(meta5), path(taxdump)

output:
tuple val(meta), path("*.yaml") , emit: yaml
@@ -43,10 +41,10 @@
$accession_params \\
--nt $blastn \\
$input_reads \\
--blastp ${blastp_path} \\
--blastx ${blastx_path} \\
--blastn ${blastn_path} \\
--taxdump ${taxdump_path} \\
--blastp ${blastp} \\
--blastx ${blastx} \\
--blastn ${blastn} \\
--taxdump ${taxdump} \\
--output_prefix ${prefix}
cat <<-END_VERSIONS > versions.yml
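
The input changes above replace bare `val` path strings with `tuple val(meta), path(...)` inputs, so the database locations are now staged into the task rather than interpolated as plain strings. A minimal sketch of how a caller might build one such channel is shown below; the channel name and meta map are assumptions, not code from this commit.

```nextflow
// Hypothetical example: wrap a database location in a [ meta, path ] tuple so that
// Nextflow stages it into the GENERATE_CONFIG work directory.
ch_taxdump = Channel.value( [ [ id: 'taxdump' ], file(params.taxdump, checkIfExists: true) ] )
```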
7 changes: 7 additions & 0 deletions modules/nf-core/untar/environment.yml

84 changes: 84 additions & 0 deletions modules/nf-core/untar/main.nf

49 changes: 49 additions & 0 deletions modules/nf-core/untar/meta.yml
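
The vendored `untar` module is what allows `.tar.gz` database inputs to be unpacked before use. One plausible wiring, not shown in this diff, is to untar only when the supplied path is an archive; the sketch below assumes the standard nf-core module interface (its `untar` output) and uses made-up channel and workflow names.

```nextflow
// Hypothetical wiring: decompress a database input only when it is a .tar.gz archive.
include { UNTAR } from '../modules/nf-core/untar/main'   // include path is an assumption

workflow PREPARE_BLASTN_DB {
    main:
    ch_blastn_in = Channel.value( [ [ id: 'blastn_db' ], file(params.blastn) ] )

    if ( params.blastn.toString().endsWith('.tar.gz') ) {
        ch_blastn_db = UNTAR ( ch_blastn_in ).untar   // standard nf-core untar emits 'untar'
    } else {
        ch_blastn_db = ch_blastn_in
    }

    emit:
    db = ch_blastn_db
}
```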
