Input Data to the FastMatch Pipeline #2
Thanks so much @sgsutcliffe. This is amazing (and fast) work 😄
I have some in-line comments below.
"default": "",
"properties": {
"threshold": {
"type": "number",
Could you set a "minimum" threshold of 0 here: https://nextflow-io.github.io/nf-validation/nextflow_schema/nextflow_schema_specification/#minimum-maximum
Good idea! 8de39dd
@@ -25,6 +25,13 @@
"pattern": "^\\S+\\.mlst(\\.subtyping)?\\.json(\\.gz)?$",
"errorMessage": "MLST JSON file from locidex report, cannot contain spaces and must have the extension: '.mlst.json', '.mlst.json.gz', '.mlst.subtyping.json', or 'mlst.subtyping.json.gz'"
},
"fastmatch_category": {
I think the specific behaviour of this column may need further discussion later on, but it is something we can leave as-is for this PR (and likely for this sprint, while we get feedback from others).
Specifically, on the Nextflow side, the data in this column is being moved into the meta object for each sample. However, we cannot use the keyword "meta" in this schema JSON file, since that is used by IRIDA Next to load data from the metadata table in IRIDA Next.
I think it would make the most sense to actually use the "meta" keyword in this JSON file, but maybe change the behaviour of IRIDA Next somehow? Or allow loading either a metadata column or user-entered values to set query/reference samples.
However, as this is a more complex use case, it requires further discussion. So this is good as-is for now. I just wanted to make a note of it here.
Based on my question, the issue was sort of raised. It might be worth a formal discussion, I agree. I did not like my workaround.
+ " Please either set '--pd_distm scaled' or remove fractions from distance thresholds.")
}
} else if (params.pd_distm == 'scaled') {
if (gm_thresholds_list.any { it != null && (it as Float < 0.0 || it as Float > 100.0) }) {
The purpose of these if-else statements in the gasclustering pipeline was to add some error checking on distance threshold values depending on the distance unit selected by the user. That is:
- If scaled is selected, the threshold should be >= 0 and <= 100 (the threshold is a percent value).
The complexity of the if/else statements here was because we were passing a list of comma-separated thresholds as a string (e.g., "1,2,3").
For fastmatching, I think it would make sense to keep some of these checks, but they can be simplified (since the threshold is passed as a number instead of a string that needs to be parsed). Specifically:
- If hamming is selected, the threshold is >= 0 (this would already be supported by adding the constraint in the schema_input.json file from another of my comments).
- If scaled is selected, the threshold is >= 0 and <= 100 (since it's a percentage value).
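To make the simplified checks concrete, here is a minimal Groovy sketch of what they might look like for a single numeric threshold. The helper name validateThreshold and the error messages are hypothetical, not taken from the pipeline; only the 'hamming' and 'scaled' values of pd_distm and the bounds come from the discussion above.

```groovy
// Hypothetical helper (not the pipeline's actual code): validates one numeric
// threshold against the distance method selected with --pd_distm.
def validateThreshold(Number threshold, String pdDistm) {
    double value = threshold.doubleValue()

    // Both hamming and scaled thresholds must be non-negative.
    if (value < 0) {
        throw new IllegalArgumentException(
            "Threshold must be >= 0, got ${threshold}")
    }

    // A scaled threshold is a percentage, so it is also capped at 100.
    if (pdDistm == 'scaled' && value > 100) {
        throw new IllegalArgumentException(
            "With '--pd_distm scaled' the threshold must be <= 100, got ${threshold}")
    }
    return true
}

// Example calls that pass the checks
assert validateThreshold(1.0, 'hamming')
assert validateThreshold(250, 'hamming')
assert validateThreshold(99.5, 'scaled')
```

In a workflow this would presumably be called once with params.threshold and params.pd_distm before any processes run, so a mismatch between the two fails early.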
What do you think?
If I understand correctly, profile_dist will still output a percentage or an integer depending on pd_distm, so the threshold cut-off will vary? I will implement this. I think it makes sense. It could be useful if a user forgets to check between scaled and hamming.
Operational! a70d030
Thanks Steven. Aaron covered a lot of things. I have just one small note, no further suggestions on changes.
@@ -43,6 +43,9 @@ params {
validationShowHiddenParams = false
validate_params = true

// FastMatch
threshold = 1.0
This isn't a big thing, but I was thinking about how my output script is handling this and I was assuming it was an integer (hamming distances). We'll have to remember to accommodate both integers and floats with this.
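As a rough illustration of accommodating both cases, here is a minimal Groovy sketch. The helper name isWithinThreshold and the "less than or equal" comparison are assumptions for illustration only; the actual output script may be written in a different language and may compare differently.

```groovy
// Hypothetical helper: compares a distance to the threshold as generic Numbers,
// so integer hamming distances and fractional scaled percentages both work.
def isWithinThreshold(Number distance, Number threshold) {
    // Converting both sides to double avoids assuming either value is an integer.
    return distance.doubleValue() <= threshold.doubleValue()
}

assert isWithinThreshold(3, 3)        // integer distance, integer threshold
assert isWithinThreshold(2, 2.5)      // integer distance, fractional threshold
assert !isWithinThreshold(12.5, 10)   // fractional distance above the threshold
```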
I think it can be a float if scaled is provided. Am I correct, @apetkau?
Description:
We want to provide a query and reference sample selection to the FastMatch pipeline along with a selection of relevant metadata fields and parameters, so that users can obtain matched distances and context while avoiding unnecessary referencing of the entire database.
Acceptance Criteria:
PR checklist
- nf-core lint
- nextflow run . -profile test,docker --outdir <OUTDIR>
- nextflow run . -profile ,test,docker --outdir <OUTDIR>