Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suppress or override rule "prepare_nextclade"? "CURLE_PEER_FAILED_VERIFICATION: SSL peer certificate or SSH remote key was not OK." #875

Closed
jacaravas opened this issue Mar 1, 2022 · 7 comments
Assignees
Labels
bug Something isn't working

Comments

@jacaravas
Copy link

I'm having difficulty with the latest ncov workflow due to the addition of the automatic update of the nextclade dataset in the "prepare nextclade" rule. The "nextclade dataset get" command does not use the correct certificate store in our conda environment and fails. This issue is documented on the nextclade issues page and no fix is forthcoming.

To bypass this issue in other workflows, I have a script that identifies the latest data release and downloads the files without calling the nextclade datasets get command. It is a showstopper here, though, because that command kills the workflow.

Short of maintaining a separate copy of "main_workflow.smk" that has a customized version of this rule, is there any way to bypass or replace this step?

For searchability, this is the specific error message

[ERROR] Nextclade: When fetching a file "https://data.clades.nextstrain.org/index.json": CURLE_PEER_FAILED_VERIFICATION: SSL peer certificate or SSH remote key was not OK. Please verify the correctness of the address, check your internet connection and try again later. Make sure that you have a permission to access the requested resource.
[Tue Mar 1 14:00:29 2022]
Error in rule prepare_nextclade:
jobid: 31
output: data/sars-cov-2-nextclade-defaults
shell:

    nextclade dataset get --name sars-cov-2 --output-dir data/sars-cov-2-nextclade-defaults

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
@jacaravas jacaravas added the enhancement New feature or request label Mar 1, 2022
@huddlej huddlej added bug Something isn't working and removed enhancement New feature or request labels Mar 2, 2022
@huddlej huddlej self-assigned this Mar 2, 2022
huddlej added a commit that referenced this issue Mar 2, 2022
Replaces a reference to the _rule_ that produces Nextclade data needed
for alignment with the _literal path_ to the corresponding output. This
approach allows users to define custom rules that generate the same
output and override the workflow's default rules. This kind of
functionality is important, as it allows users to easily modify the
workflow DAG from their custom build configuration and avoid maintaining
a fork of the main workflow that can get outdated easily.

Related to #875
@corneliusroemer
Copy link
Member

There's a working workaround: download datasets from our Github dataset repo: https://github.com/nextstrain/nextclade_data/tree/master/data/datasets

I know this is not user friendly, but at least it should work until we've fixed nextstrain/nextclade#744

The general issue of nextclade dataset get may be addressed in a future version of Nextclade, but not in the near term since it's difficult with the current C++ CLI implementation.

@huddlej
Copy link
Contributor

huddlej commented Mar 2, 2022

Thank you for the detailed issue description, @jacaravas! While we consider a solution from the Nextclade side of the problem (as @corneliusroemer mentions above), I do have a workaround that should allow you to define your own Snakemake rule that overrides the main workflow's prepare_nextclade rule with your own logic. This custom Snakemake logic can live in your build profile directory (or anywhere relative to the top-level of the ncov directory), so you don't need to maintain a separate fork of the repo.

If you pull the latest version of the ncov workflow (including the change from #877), you can add the following lines to your builds YAML to include custom rules into the workflow:

custom_rules:
  - "get_nextclade_data.smk"

Then, you can define get_nextclade_data.smk with a custom rule that overrides the main workflow’s rule to get Nextclade data (replacing the shell command below with your custom script):

# Tell Snakemake to prefer the custom rule in this file over the rule in the main workflow
# that produces the same output.
ruleorder: custom_prepare_nextclade > prepare_nextclade

rule custom_prepare_nextclade:
    output:
        nextclade_dataset = directory("data/sars-cov-2-nextclade-defaults"),
    params:
        name = "sars-cov-2",
    conda: config["conda_environment"]
    shell:
        """
        echo "This is a custom rule!"
        nextclade dataset get --name {params.name} --output-dir {output.nextclade_dataset}
        """

Theoretically, you can use this approach to rewire almost any parts of the workflow with your own custom rules. In practice, the workflow still defines inputs using variables that point to the output of specific rules instead of using strings that point to literal paths. This implementation choice on our end prevents easy overriding of rules in the way shown above and is what #877 addressed for this specific Nextclade rule.

Do you want to give this a try and let us know here if the proposed solution above works?

@huddlej huddlej moved this from New to In Progress in Nextstrain planning (archived) Mar 2, 2022
@jacaravas
Copy link
Author

@huddlej Fantastic! That is exactly what I was looking for. I already have an edited version of the rule in place in main_workflow that does curl based download of the data files, but it will be much easier to maintain the solution as an external custom rule. I was not looking forward to editing "main_workflow.smk" every time I updated.

Repository owner moved this from In Progress to Done in Nextstrain planning (archived) Mar 2, 2022
@jacaravas
Copy link
Author

I made a quick repo for the script(s) I'm using to download nextclade datasets in case anyone else needs help with this error.
https://github.com/jacaravas/update-nextclade-dataset

@corneliusroemer
Copy link
Member

corneliusroemer commented Mar 2, 2022

Very cool! @jacavaras How did you manage to work around this curl/wget bug both @huddlej and I encountered when trying the manual route?

I'll link to your script from Nextclade, too, as a workaround. This is fab.

@huddlej
Copy link
Contributor

huddlej commented Mar 2, 2022

Excellent! I'm glad this solution works for your setup, @jacaravas. Thank you also for the example downloads scripts. This does feel like a place that shell + curl is better-suited to the problem at hand than C++.

@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented Mar 7, 2022

Related: the problem with zip bundles (nextstrain/nextclade#744) was fixed in nextstrain/nextclade_data#24.

These zips are not used by nextclade and neither recommended or supported for end users. Neither parsing index.json is supported. However if nothing else works, zip bundles might be a decent workaround.

The proxy support may be coming, depending on prioritization of other features and dev resources available. Contributions are welcome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
No open projects
Development

No branches or pull requests

4 participants