-
Notifications
You must be signed in to change notification settings - Fork 403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gray out early clades #1132
Gray out early clades #1132
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this looks like a great improvement! I'm excited to try the same approach in the seasonal flu workflow.
My main comment is that I think the default value should be an explicit value that ties to the meaning of the clade_recency
functionality instead of an empty string. For example, a string value of “all” or “all-time” would more clearly communicate what the config option is doing. An empty string doesn’t immediately communicate “assign colors to all clades” and looks more like a typo in the config YAML (like maybe someone forgot to enter a value?).
The clade recency argument appears in two different parts of the workflow: the workflow configuration/rules and the assign-colors.py
script. We could modify either of these parts to support a default argument of "all".
In the context of the script, we could keep the default behavior of “assign colors to all clades” when the --clade-recency
argument is not provided and that would not be surprising behavior. Providing that argument would be a way to opt into the behavior of filtering by date. When the user provides that argument, the script would require that the value is an integer representing number of months in the past. In the context of the workflow, this approach would require us to build a clade recency argument string to pass to the script. So, instead of this:
--clade-recency {params.clade_recency}
we would build this:
{params.clade_recency_argument}
When the user provides a value of “all” or the workflow can’t find a clade recency in the config, the clade recency argument is an empty string. When the user provides an integer value, the clade recency argument is --clade-recency 3
, for example. We already use this pattern in the ncov workflow (like the get_priority_argument function).
Another pattern we use elsewhere is a custom argparse type that accepts a specific string value (“all”) or an integer like we do for Augur’s nthreads arguments. We could use this pattern to modify the assign-colors
script to accept --clade-recency all
as a valid input. Then the workflow can keep the same pattern in this PR where the colors
rule hardcodes the --clade-recency
argument with a {params.clade_recency}
reference instead of building the argument as described above. When the user provides a value of all
, the script happily performs the default behavior of assigning colors to all clades.
I lean slightly toward the params.clade_recency_argument
pattern, since it doesn’t add complexity to the assign-colors
script that would only exist for the sake of the workflow, but either approach would work.
1fd29e4
to
3f8bb08
Compare
I've (finally) attended to @huddlej's helpful comments. This included swapping to I'm re-running trial builds in https://github.com/nextstrain/ncov/actions/runs/11059102557. This should update the URLs referenced in the original description like https://nextstrain.org/staging/ncov/gisaid/trial/gray-out-early-clades/global/6m. The resulting dataset coloring should be the same as the original PR, but just with a cleaner interface. @victorlin: I've tagged you for review here because I know that @huddlej is very busy with VCM and Options. If you could look things over, I'd very much appreciate it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't get to review much but sending in a comment before I sign off for the day.
Will plan to continue review on Monday.
ea85954
to
1ae7447
Compare
I pushed a couple more fixes: b685fbb, 3273dcc I also contemplated renaming |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I think all the commits can get squashed into one, or at least all my commits squashed into one along the lines of "Use ISO duration format for clade recency, error on invalid values".
def _get_clade_recency_for_wildcards(wildcards): | ||
# check if builds.yaml contains colors:{build_name}:clade_recency | ||
if wildcards.build_name in config["colors"] and 'clade_recency' in config["colors"][wildcards.build_name]: | ||
return config["colors"][wildcards.build_name]["clade_recency"] | ||
# check if builds.yaml or parameters.yaml contains colors:default:clade_recency | ||
elif "colors" in config and "clade_recency" in config["colors"]["default"]: | ||
return config["colors"]["default"]["clade_recency"] | ||
# else return sensible default value | ||
else: | ||
return "all" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking: I've already picked on this elsewhere and don't want implementation details to get in the way of having improved coloring
In #1131 we settled on "strategy 2" which would translate to:
def _get_clade_recency_for_wildcards(wildcards):
# check if builds.yaml contains colors:{build_name}:clade_recency
if wildcards.build_name in config["colors"] and 'clade_recency' in config["colors"][wildcards.build_name]:
return config["colors"][wildcards.build_name]["clade_recency"]
# use workflow default
else:
return config["colors"]["default"]["clade_recency"]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catch here. I agree that cleaning up default values to source directly from parameters.yaml
is the right way to do things. But that said, I think I'd like to tackle this cleanup in a separate PR after this PR and #1130 are merged.
Thanks for the review @victorlin! Could you squash your commits however you'd like and I'll then update |
When assigning colors gray out clades that are not in recent samples. Use clade_recency parameter to consider how many months back to consider "recent".
Set up a build-specific override for colors: clade_recency parameter. Make the default to be empty string "" and have this keep all clades colored. Override for the Nextstrain GISAID profile to specify clade_recency of - `1` month for "1m" build - `2` months for "2m" build - `6` months for "6m" build - `""` (all months) for "all-time" build
Per rationale in issue 1131 (#1131), switch to following a three step look up for setting build-specific parameters of colors:clade_recency.
Rather than previous behavior of setting --clade-recency 6 by building --clade-recency {params.clade_recency}, instead build the entire argument via {params.clade_recency_argument}. This allows setting colors.global_all-time.clade_recency to "all" in builds.yaml and in this case handing assign-colors.py nothing for clade_recency. This keeps the behavior of assign-colors.py as expected and backwards compatible, while simultaneously allowing for the "all" sensible default in parameters.yaml rather than the previous empty string to accomplish this same behavior.
The colors.clade_recency parameter can be specified in a build-specific fashion. If specifying a default for use, use the pattern of specifying colors.default.clade_recency rather than the previous colors.clade_recency.
Show a meaningful error message if the value is improperly formatted.
Keeps consistent with other existing usage of this format.
6ee0016
to
0bd8251
Compare
@trvrb done. Also rebased onto latest |
Thanks again @victorlin! |
Description of proposed changes
This PR is attempting to solve the same problem as #1129, but in a different fashion. In this case, we keep the subsampling in the 6m, 2m and 1m builds the same as before and so keep the long contextual tail going from 6m / 2m / 1m back to the beginning of the pandemic. However, I've added new functionality to the
assign-colors.py
script that takes in a--clade-recency
parameter. This specifies the number of months back to look for circulating clades. So for example,--clade-recency 6
would use tips in the previous 6 months and only keepclade_membership
in this time period. I added parameter logic to specify--clade-recency
of6
vs2
vs1
for6m
,2m
and1m
builds. Theall-time
builds include all clades.Here's some comparisons.
global/6m
Live
PR
global/2m
Live
PR
global/1m
Live
PR
I think this is a big improvement. We no longer have 3 highly similar similar shades of red for currently circulating clades. It also seems semantically appropriate to gray out these early clades to correspond to "context". This the same thing we do for regions outside of the focal region ala south-america/6m?c=country
My preference would be to merge this PR (after code review) and reconsider the previous time-based sampling PR #1129. I think we'll eventually need to move the
min_date
up from 2020, but with clade colors fixed it's not so urgent. Additionally, having this strategy of gray clades outside of focal window would work with alternative time-based sampling windows.Testing
Tested locally and via trial build (URLs linked above). However, this adds a new
parameters.yaml
option. I tested and it shouldn't break existing builds that don't specifycolors:clade_recency
, but attention here wouldn't hurt.Release checklist
If this pull request introduces new features, complete the following steps:
docs/src/reference/change_log.md
in this pull request to document these changes by the date they were added.