Support draft assemblies #97
I've added some code to achieve the goal of As far as I understand the requirements, this is the last bit that was missing to complete support for draft assemblies. I'll mark this pull request as ready.
@eeaunin, I've rebased this branch. It now includes the fixes I've made for blast
Release 0.5
I had a closer look at how
So the exclusion of taxids is supposed to be optional and configurable by the user. https://github.com/blobtoolkit/pipeline/blob/master/v1/example.yaml However, I couldn't find a setting for the same thing in the Snakemake pipeline v2 code. Maybe the authors just forgot to include it? In my runs with the Snakemake pipeline negative taxids were not used but there are suppressed error messages buried in the run logs relating to that. In a run with a Plasmodium yoelii yoelii assembly there is this error in the logs (
So it ran into the error but then just quietly continued running. It is unclear to me what caused this error, as the taxid used there (352914) is at strain level. In another run it has skipped using the taxid filter due to another error:
So the filtering doesn't work if the supplied database is V4 instead of V5, but this doesn't crash the Snakemake pipeline either; it just produces an error message in the logs. I guess it would be okay if the
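To make the point about optional exclusion concrete, here is a minimal sketch of how a wrapper could build the blastn command line so that the taxid filter is only added when the user asks for it. The function name and its parameters are hypothetical and are not taken from either pipeline; `-negative_taxids` is the BLAST+ option, which requires a version 5 database.

```python
from typing import List, Optional, Sequence


def build_blastn_args(
    query: str,
    db: str,
    exclude_taxids: Optional[Sequence[int]] = None,
) -> List[str]:
    """Return a blastn argument list (hypothetical helper, not pipeline code)."""
    args = ["blastn", "-query", query, "-db", db, "-outfmt", "6"]
    if exclude_taxids:
        # Only add the filter when the user explicitly configured it.
        args += ["-negative_taxids", ",".join(str(t) for t in exclude_taxids)]
    return args
```

Passing the resulting list to `subprocess.run(args, check=True)` would turn a failing filter into a hard error rather than a message buried in the logs, assuming blastn exits with a non-zero status in that case.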
The filter in SEQTK_SUBSEQ is not sufficient because some BLOBTOOLKIT_CHUNK further excludes masked regions
Skip blastn if there are no chunks
…y-classifications/
…NA taxon_ids NCBI is still the first database we query
@eeaunin, I've added a I've rebased the branch onto the latest stable release 0.5.1
That's good then! I think it's fine to merge the
On this branch, there is no input YAML file. The only mandatory parameters are:
- --taxon
- --fasta
- --input, to list the read files

--accession is optional and is used to pull assembly information from ENA into the blobDir's meta.json.

I haven't restructured the pipeline much. All the blobtools commands at the end still require a YAML file. My solution is to add a script at the beginning of the pipeline that generates the minimal YAML file required (as per #77 (comment)). It still allows getting some parameters cleanly in the input-check sub-workflow and keeps the busco sub-workflow more focused on running busco + blastp.
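As a rough illustration of the "generate a minimal YAML file from the command-line parameters" idea, the sketch below turns the three values into a small nested config and writes it out as block-style YAML. The key names (`assembly`, `file`, `accession`, `taxon`, `name`) and both function names are assumptions made for this example; they are not copied from the script in this branch.

```python
import json
from typing import Any, Dict, Optional


def minimal_config(taxon: str, fasta: str, accession: Optional[str] = None) -> Dict[str, Any]:
    """Collect the command-line values into a nested dict (key names are assumptions)."""
    assembly: Dict[str, Any] = {"file": fasta}
    if accession:
        # --accession is optional, so the key is only written when it is given.
        assembly["accession"] = accession
    return {"assembly": assembly, "taxon": {"name": taxon}}


def to_yaml(data: Dict[str, Any], indent: int = 0) -> str:
    """Serialise nested dicts of scalars as block-style YAML, without PyYAML."""
    pad = "  " * indent
    lines = []
    for key, value in data.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(value, indent + 1))
        else:
            # JSON scalars are valid YAML scalars, so json.dumps handles quoting.
            lines.append(f"{pad}{key}: {json.dumps(value)}")
    return "\n".join(lines)
```

For example, `print(to_yaml(minimal_config("Plasmodium yoelii", "genome.fa")))` writes an `assembly` block with a `file` entry and a `taxon` block with a `name` entry, and omits `accession` because none was supplied.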
Busco lineages are inferred from the taxonomy directly here. Like in the genome-note pipeline, I've moved away from using GoaT as GoaT is just a proxy to the NCBI taxonomy. This way, I can keep control of both the version of Busco and the list of lineages in the same place.
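A minimal sketch of what inferring lineages "from the taxonomy directly" could look like: walk the taxon's ancestors and keep those that have a BUSCO dataset. The table below is an illustrative subset and the function name is hypothetical; the real pipeline's list, and how it obtains the ancestor IDs, may differ.

```python
from typing import Dict, List

# Illustrative subset only: NCBI taxon ID -> BUSCO odb10 lineage dataset.
TAXID_TO_BUSCO: Dict[int, str] = {
    2: "bacteria_odb10",
    2157: "archaea_odb10",
    2759: "eukaryota_odb10",
    33208: "metazoa_odb10",
    7742: "vertebrata_odb10",
}


def busco_lineages(ancestor_taxids: List[int]) -> List[str]:
    """Return the matching lineages, keeping the order of `ancestor_taxids`.

    `ancestor_taxids` is assumed to run from the most specific ancestor
    up to the root of the NCBI taxonomy.
    """
    return [TAXID_TO_BUSCO[t] for t in ancestor_taxids if t in TAXID_TO_BUSCO]
```

Keeping the table next to the pinned Busco version is what lets both be updated together, which is the design choice described above.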
I've also introduced the --busco_lineages parameter to allow precisely selecting the lineages that are used, rather than the taxonomy-based defaults.

Still a draft for now as I want to review /nfs/team135/yy5/btk_config/taxonomiser_v2.py and maybe incorporate some elements of it.

PR checklist
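- Make sure your code lints (nf-core lint).
- Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
- Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
- Usage documentation in docs/usage.md is updated.
- Output documentation in docs/output.md is updated.
- CHANGELOG.md is updated.
- README.md is updated (including new tool citations and authors/contributors).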