Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Input Data to the FastMatch Pipeline #2

Merged
merged 9 commits into from
Dec 12, 2024
Merged

Input Data to the FastMatch Pipeline #2

merged 9 commits into from
Dec 12, 2024

Conversation

sgsutcliffe
Copy link
Collaborator

@sgsutcliffe sgsutcliffe commented Dec 10, 2024

Description:
We want to provide a query and reference sample selection to the FastMatch pipeline along with a selection of relevant metadata fields and parameters, so that users can obtain matched distances and context while avoiding unnecessary referencing of the entire database.

Acceptance Criteria:

  1. The FastMatch pipeline accepts one or more query samples for running comparisons on
  2. The FastMatch pipeline accepts one or more reference samples for running comparisons on
  3. The FastMatch pipeline accepts the parameters set by the user within IRIDA Next:
    1. Comparison score threshold value

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile ,test,docker --outdir <OUTDIR>).

Copy link

github-actions bot commented Dec 10, 2024

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit a70d030

+| ✅ 144 tests passed       |+
#| ❔  28 tests were ignored |#
!| ❗   4 tests had warnings |!

❗ Test warnings:

❔ Tests ignored:

  • files_exist - File is ignored: assets/nf-core-fastmatchirida_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-fastmatchirida_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-fastmatchirida_logo_dark.png
  • files_exist - File is ignored: .github/workflows/awstest.yml
  • files_exist - File is ignored: .github/workflows/awsfulltest.yml
  • files_exist - File is ignored: lib/Utils.groovy
  • files_exist - File is ignored: lib/WorkflowMain.groovy
  • files_exist - File is ignored: lib/NfcoreTemplate.groovy
  • files_exist - File is ignored: lib/Workflowfastmatchirida.groovy
  • nextflow_config - Config variable ignored: manifest.name
  • nextflow_config - Config variable ignored: manifest.homePage
  • nextflow_config - Config variable ignored: params.max_cpus
  • files_unchanged - File ignored due to lint config: LICENSE or LICENSE.md or LICENCE or LICENCE.md
  • files_unchanged - File ignored due to lint config: .github/CONTRIBUTING.md
  • files_unchanged - File ignored due to lint config: .github/ISSUE_TEMPLATE/bug_report.yml
  • files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md
  • files_unchanged - File ignored due to lint config: .github/workflows/branch.yml
  • files_unchanged - File ignored due to lint config: assets/email_template.html
  • files_unchanged - File ignored due to lint config: assets/email_template.txt
  • files_unchanged - File ignored due to lint config: assets/sendmail_template.txt
  • files_unchanged - File does not exist: assets/nf-core-fastmatchirida_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-fastmatchirida_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-fastmatchirida_logo_dark.png
  • files_unchanged - File ignored due to lint config: docs/README.md
  • files_unchanged - File ignored due to lint config: .gitignore or .prettierignore
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/fastmatchirida/fastmatchirida/.github/workflows/awstest.yml
  • actions_awsfulltest - actions_awsfulltest
  • pipeline_name_conventions - pipeline_name_conventions

✅ Tests passed:

Run details

  • nf-core/tools version 3.0.1
  • Run at 2024-12-12 16:09:15

Copy link
Member

@apetkau apetkau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much @sgsutcliffe. This is amazing (and fast) work 😄

I have some in-line comments below.

"default": "",
"properties": {
"threshold": {
"type": "number",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! 8de39dd

@@ -25,6 +25,13 @@
"pattern": "^\\S+\\.mlst(\\.subtyping)?\\.json(\\.gz)?$",
"errorMessage": "MLST JSON file from locidex report, cannot contain spaces and must have the extension: '.mlst.json', '.mlst.json.gz', '.mlst.subtyping.json', or 'mlst.subtyping.json.gz'"
},
"fastmatch_category": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the specific behaviour of this column may need a bit further discussion later on, but is something we can leave for this PR (and likely for this sprint to get feedback from others).

Specifically, on the nextflow side, the data in this column is being moved into the meta object for each sample. However, we cannot use the keyword "meta" in this schema JSON file, since that is used by IRIDA Next to load data from the metadata table in IRIDA Next.

I think it would make most sense to actually use the "meta" keyword in this JSON file, but maybe change the behaviour of IRIDA Next somehow? Or, to allow loading of a metadata column OR user-entered values to set query/reference samples.

However, as this is a more complex use case it requires further discussion. So this is good as-is now. I just wanted to make a note here about this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the my question, the issue was sort of raised. Might be worth a formal discussion, I agree. I did not like my work around.

+ " Please either set '--pd_distm scaled' or remove fractions from distance thresholds.")
}
} else if (params.pd_distm == 'scaled') {
if (gm_thresholds_list.any { it != null && (it as Float < 0.0 || it as Float > 100.0) }) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of these if-else statements in the gasclustering pipeline was to have some additional error checking on distance threshold values depending on the distance unit selected by the user. That is:

  1. If scaled is selected, the threshold should be >= 0 and <= 100 (the threshold is a percent value).

The complexity of the if/else statements here was because we were passing a list of comma-separated thresholds as a string (e..g, "1,2,3").

For fastmatching, I think it would make sense to keep some of these checks, but they can be simplified (since the threshold is passed as a number instead of a string that needs to be parsed). Specifically:

  1. If hamming is selected, the threshold is >= 0 (this would already be supported by adding the constraint in the schema_input.json file from another of my comments).
  2. If scaled is selected, the threshold is >= 0 and <= 100 (since it's a percentage value).

What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly profile_dist will still output a percentage or integer based on pd_distm so the threshold cut-off will vary? I will implement this. I think it makes sense. Could be useful if user forgets to check between scaled and hamming.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operational! a70d030

Copy link
Member

@emarinier emarinier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Steven. Aaron covered a lot of things. I have just one small note, no further suggestions on changes.

@@ -43,6 +43,9 @@ params {
validationShowHiddenParams = false
validate_params = true

// FastMatch
threshold = 1.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a big thing, but I was thinking about how my output script is handling this and I was assuming it was an integer (hamming distances). We'll have to remember to accommodate both integers and floats with this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's if scaled is provided it can be a float. Am I correct @apetkau?

@sgsutcliffe sgsutcliffe requested a review from apetkau December 12, 2024 16:11
@sgsutcliffe sgsutcliffe merged commit 25f22dd into dev Dec 12, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants