
Parallel manifest.txt file download for rsync_from_ncbi.pl #890

Open
wants to merge 1 commit into
base: master

Conversation

jonathancosme

Changed the rsync code block slightly to allow parallel file downloads (the number of parallel downloads is set by the environment variable KRAKEN2_THREAD_CT). First, manifest.txt is split into KRAKEN2_THREAD_CT temporary files, then rsync is called on each temporary file, and finally all temporary files are removed. The Linux CLI package 'parallel' must be installed.
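A rough sketch of the split / parallel rsync / cleanup flow described above, written in shell rather than the Perl of rsync_from_ncbi.pl. The function name split_manifest, the manifest.part prefix, and the SERVER/PATH placeholder are illustrative assumptions, not the code in this PR:

```shell
#!/bin/sh
# Sketch only: split a manifest into N chunks so that one rsync can run per chunk.
# split_manifest and the chunk prefix are hypothetical names, not taken from the PR.

split_manifest() {
    manifest=$1   # file listing one remote path per line
    n=$2          # number of chunks, e.g. "$KRAKEN2_THREAD_CT"
    prefix=$3     # prefix for the temporary chunk files
    # Round-robin assignment: line i goes to chunk (i - 1) mod n.
    awk -v n="$n" -v p="$prefix" '{ print > (p "." ((NR - 1) % n)) }' "$manifest"
}

# Intended use (left as comments because it needs network access and 'parallel';
# SERVER/PATH is a placeholder, not a real endpoint):
#   split_manifest manifest.txt "${KRAKEN2_THREAD_CT:-1}" manifest.part
#   ls manifest.part.* | parallel -j "${KRAKEN2_THREAD_CT:-1}" \
#       rsync --no-motd --files-from={} rsync://SERVER/PATH/ .
#   rm -f manifest.part.*
```

Round-robin splitting keeps the chunks within one line of each other in length, so no single rsync process ends up with a much longer file list than the others.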

Collaborator

ch4rr0 commented Nov 21, 2024

We wrote a k2 wrapper, which is part of the Kraken 2 repo and provides parallel downloads: k2 download-library --library archaea --threads 6 --db foo. Can you try it out and see how it compares to this PR? Suggestions welcome.

@jonathancosme
Author

@ch4rr0
running

k2 download-library --library archaea --threads 48 --fast-build --no-mask --db test

gives me

usage: kraken2 [-h] {add-to-library,download-library,download-taxonomy,build,classify,inspect,clean} ...
kraken2: error: unrecognized arguments: --threads 48 --fast-build

Just FYI, I installed this with conda using the following command...

mamba create --name kraken2 -c nvidia -c bioconda -c conda-forge python=3.11 cudatoolkit kraken2 parallel awscli -y

activating it...

conda activate kraken2

creating a directory...

mkdir test

adding the taxonomy...

k2 download-taxonomy --db test

and finally, trying to add a library:

k2 download-library --library archaea --threads 48 --fast-build --no-mask --db test

@ch4rr0
Collaborator

ch4rr0 commented Nov 21, 2024 via email

@jonathancosme
Author

the --fast-build argument is irrelevant.
Even if I use your original command

k2 download-library --library archaea --threads 6 --db test

I still get this error

kraken2: error: unrecognized arguments: --threads 6

The problem is that k2 download-library doesn't accept the --threads parameter, which means it will default to using 1 thread, i.e. downloading one file at a time.
For the following command (your suggestion)

k2 download-library --library archaea --db test

it took 4 minutes to download 171 files out of 620 (then it got stuck).
This command (with my suggested changes)

kraken2-build --download-library archaea --threads 48 --db test

took 27 seconds to download all the files and process them (I stopped it when it started the masking task).
So if I had been downloading only 171 files instead of 620, it would have taken about 7.5 seconds.
That's over 95% less time to download the files.
Archaea is small, but it made a HUGE difference when I had to download the bacteria library (which was around 200 GB of data).
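A quick back-of-the-envelope check of those figures, assuming download time scales linearly with the number of files (171 of 620 files in 4 minutes for the single-threaded run, 27 seconds for all 620 files in parallel). The helper names scaled_seconds and percent_saved are made up for this illustration:

```shell
#!/bin/sh
# Sketch only: reproduces the arithmetic quoted in the comment above.

scaled_seconds() {
    # $1 = seconds for the full run, $2 = files in the partial run, $3 = total files
    awk -v t="$1" -v part="$2" -v total="$3" 'BEGIN { printf "%.1f\n", t * part / total }'
}

percent_saved() {
    # $1 = baseline seconds, $2 = new seconds
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (1 - new / base) * 100 }'
}

scaled_seconds 27 171 620   # parallel run scaled down to 171 files
percent_saved 240 7.5       # compared with 4 minutes (240 s) single-threaded
```

The first call gives roughly 7.4 seconds, in line with the quoted 7.5, and the second gives a saving of roughly 97%, consistent with "over 95%".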

@ch4rr0
Collaborator

ch4rr0 commented Nov 22, 2024

Ah, I know why. It's very likely that you're using the k2 packaged with the latest release of Kraken2. You will need to fetch the latest changes from the Kraken2 repository. In any event, with 12 threads it took 29 seconds to download and process the archaea library with masking disabled, and 97 seconds with masking turned on.

@ch4rr0
Collaborator

ch4rr0 commented Nov 22, 2024

Here is proof:
time ./k2 download-library --library archaea --threads 12 --db /tmp/archaea --no-masking --log foo.out
33.31 real 53.74 user 4.10 sys

archaea.txt

Edit: removed incorrect log

@jonathancosme
Author

Hmm i see....
well this is the kraken2 version I've got:
image
and this is what I'm seeing on the kraken github home page
image
Maybe something is not being updated on the bioconda side?

In any case, seems like k2 has excellent download performance.
I didn't even know about the k2 wrapper; didn't see any documentation on it.

@ch4rr0
Collaborator

ch4rr0 commented Nov 22, 2024

I have not done a good job at marketing the script to the user base. It will be included in the next release of kraken2. I think the conda recipe also has to be updated to reference k2. In the meantime please test and let us know how we can improve on the script.

@aruaud

aruaud commented Nov 25, 2024

Had the same issue with k2: the --threads parameter was not recognized. What worked for me was replacing the k2 script with the one from the git repo and creating a symlink to k2mask:

  1. Download the latest k2 and replace the copy in your conda env:
    wget https://raw.githubusercontent.com/DerrickWood/kraken2/master/scripts/k2 -O k2
    chmod +x k2
    mv k2 ~/miniconda3/envs/YOUR_ENV/bin/k2

  2. I had to create a symlink to k2mask; otherwise files could be downloaded but the masking would fail:
    # Locate k2mask (for me it was ~/miniconda3/envs/YOUR_ENV/share/kraken2-2.1.3-2/libexec/k2mask)
    find ~/miniconda3/envs/YOUR_ENV/ -name k2mask
    ln -s /path/to/k2mask ~/miniconda3/envs/YOUR_ENV/bin/k2mask
    chmod +x ~/miniconda3/envs/YOUR_ENV/bin/k2mask

All good afterwards!

3 participants