Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add water type + speed up fetchATTAINS(); {arcgislayers}-ify fetchNHD #536

Merged
merged 30 commits into from
Nov 5, 2024

Conversation

cristinamullin
Copy link
Collaborator

kathryn-willi

Hi EPA TADA team!

This PR addresses the following:

Reverted fetchNHD() code back to using {arcgislayers}. This CRAN package should now be stable, and speeds up the download quite drastically!
Added an argument to fetchATTAINS() to only download raw ATTAINS features if explicitly asked for. This speeds up instances in TADA_GetATTAINS() where that raw data is not requested.
Included new code in fetchATTAINS() that adds water type info to each assessment unit (closes #520).
Ensured {test_that} tests work across our geospatial functions.
Looking forward to your review!
Katie

kathryn-willi and others added 27 commits October 17, 2024 12:01
WaterType update + {arcgislayers}-ification
Minor mods based on MB suggestions
- adds missing dependencies
- clarify TADA_GetATTAINS and TADA_MakeSpatial docs
- small improvements to TADAModule2.Rmd, including removal of eval = F on certain chunks (this part can be reverted if needed)
@cristinamullin
Copy link
Collaborator Author

Added two commits to bring in upstream changes from the develop branch and in addition:

  • adds missing dependencies
  • clarify TADA_GetATTAINS and TADA_MakeSpatial docs
  • small improvements to TADAModule2.Rmd, including removal of eval = F on certain chunks (this part can be reverted if needed)

@cristinamullin
Copy link
Collaborator Author

TADA_GetATTAINS may cause the dataframe to grow (contain more rows than the original TADA dataframe) because Water Quality Portal observations can fall within an NHD catchment that contains more than one ATTAINS assessment unit. The new, “index”, column links these duplicate observations for sites that fall within more than one AU. This is important. Should we print a message for users when this occurs? I can foresee users running into subsequent analysis issues if they are not aware (don’t really read the documentation). Also, is the column name, “index”, too vague? Any ideas on how to make this more user friendly? How about TADA.SameResultMultipleAUs ? Currently, it is placed in at the start of the df, is that where we want it to go?

@wokenny13
Copy link
Collaborator

wokenny13 commented Oct 30, 2024

TADA_GetATTAINS may cause the dataframe to grow (contain more rows than the original TADA dataframe) because Water Quality Portal observations can fall within an NHD catchment that contains more than one ATTAINS assessment unit. The new, “index”, column links these duplicate observations for sites that fall within more than one AU. This is important. Should we print a message for users when this occurs? I can foresee users running into subsequent analysis issues if they are not aware (don’t really read the documentation). Also, is the column name, “index”, too vague? Any ideas on how to make this more user friendly? How about TADA.SameResultMultipleAUs ? Currently, it is placed in at the start of the df, is that where we want it to go?

Some topics that @cristinamullin and I discussed to think through:

  • If impairment decisions with assessments are always done on an AU level, where it always including an AUID column, then it may not cause issues.
  • If done on just a monitoring site level, that then gets rolled up to an AU, these duplicate observations would then be counted more than once, which in this case becomes problematic.
  • I want to give users some flexibility on how to summarize any impairment for a parameter & designated use combination, while also handling of "site-specific" cases that may contain more than 1 monitoring location, etc. The current view that I have for this summary (based only on discrete, individual observations, can be seen below. Once a user defines the standards value for a parameter and designated use, we count the number of monitoring location sites found in the user's TADA dataframe pull, along with discrete counts of observation (each row with ResultMeasureValue), and the number of those discrete observations that have exceeded that standard value. Any thoughts on this current 'standards exceedance' summary layout? (There are other columns that are not shown in this view)
  • Flagging these instances in which they occur should be addressed. The column name "Index" does seem too vague and a name change for this would make it more user friendly in my view.
image

@hillarymarler
Copy link
Collaborator

TADA_GetATTAINS may cause the dataframe to grow (contain more rows than the original TADA dataframe) because Water Quality Portal observations can fall within an NHD catchment that contains more than one ATTAINS assessment unit. The new, “index”, column links these duplicate observations for sites that fall within more than one AU. This is important. Should we print a message for users when this occurs? I can foresee users running into subsequent analysis issues if they are not aware (don’t really read the documentation). Also, is the column name, “index”, too vague? Any ideas on how to make this more user friendly? How about TADA.SameResultMultipleAUs ? Currently, it is placed in at the start of the df, is that where we want it to go?

Are the sites falling within more than one AU? Or just falling within a catchment that contains more than one AU? I am wondering if we could modify the function so that the AUID/catchment are added, but no duplicates are created. I am still working on my review of this PR, but I'll keep thinking about this as I do.

@kathryn-willi
Copy link
Collaborator

TADA_GetATTAINS may cause the dataframe to grow (contain more rows than the original TADA dataframe) because Water Quality Portal observations can fall within an NHD catchment that contains more than one ATTAINS assessment unit. The new, “index”, column links these duplicate observations for sites that fall within more than one AU. This is important. Should we print a message for users when this occurs? I can foresee users running into subsequent analysis issues if they are not aware (don’t really read the documentation). Also, is the column name, “index”, too vague? Any ideas on how to make this more user friendly? How about TADA.SameResultMultipleAUs ? Currently, it is placed in at the start of the df, is that where we want it to go?

I agree that renaming "index" to something more explanatory could help. Would this sort of identification-tracking be useful across all of TADA, and therefore something that could be added to all TADA WQP pulls? Something like "TADA.ID" that is the very first column... just a thought! Otherwise, I think moving the column to be with the other ATTAINS columns makes sense.

I also really like the idea of printing a message when this duplicate observation-situation happens and would be an easy thing to add!

@kathryn-willi
Copy link
Collaborator

kathryn-willi commented Oct 30, 2024

TADA_GetATTAINS may cause the dataframe to grow (contain more rows than the original TADA dataframe) because Water Quality Portal observations can fall within an NHD catchment that contains more than one ATTAINS assessment unit. The new, “index”, column links these duplicate observations for sites that fall within more than one AU. This is important. Should we print a message for users when this occurs? I can foresee users running into subsequent analysis issues if they are not aware (don’t really read the documentation). Also, is the column name, “index”, too vague? Any ideas on how to make this more user friendly? How about TADA.SameResultMultipleAUs ? Currently, it is placed in at the start of the df, is that where we want it to go?

Are the sites falling within more than one AU? Or just falling within a catchment that contains more than one AU? I am wondering if we could modify the function so that the AUID/catchment are added, but no duplicates are created. I am still working on my review of this PR, but I'll keep thinking about this as I do.

To reply quickly/shortly to this - you are correct that this happens when sites are falling within a catchment that contains more that one AU, not necessarily that the site is falling into multiple raw ATTAINS features.

@cristinamullin
Copy link
Collaborator Author

cristinamullin commented Oct 30, 2024

@katiehealy @wokenny13 @hillarymarler
FYI I created a new issue here: #539 to continue working on the index column issue. This PR does not have to resolve that before being merged (though it could if that is easier).

@cristinamullin
Copy link
Collaborator Author

TADA_GetATTAINS may cause the dataframe to grow (contain more rows than the original TADA dataframe) because Water Quality Portal observations can fall within an NHD catchment that contains more than one ATTAINS assessment unit. The new, “index”, column links these duplicate observations for sites that fall within more than one AU. This is important. Should we print a message for users when this occurs? I can foresee users running into subsequent analysis issues if they are not aware (don’t really read the documentation). Also, is the column name, “index”, too vague? Any ideas on how to make this more user friendly? How about TADA.SameResultMultipleAUs ? Currently, it is placed in at the start of the df, is that where we want it to go?

I agree that renaming "index" to something more explanatory could help. Would this sort of identification-tracking be useful across all of TADA, and therefore something that could be added to all TADA WQP pulls? Something like "TADA.ID" that is the very first column... just a thought! Otherwise, I think moving the column to be with the other ATTAINS columns makes sense.

I also really like the idea of printing a message when this duplicate observation-situation happens and would be an easy thing to add!

Actually now that you mention it, there is a WQP column, "ResultIdentifier", that includes a unique ID for every observation. I believe it should serve the same function as the new "index" column.

@cristinamullin
Copy link
Collaborator Author

TADA_GetATTAINS may cause the dataframe to grow (contain more rows than the original TADA dataframe) because Water Quality Portal observations can fall within an NHD catchment that contains more than one ATTAINS assessment unit. The new, “index”, column links these duplicate observations for sites that fall within more than one AU. This is important. Should we print a message for users when this occurs? I can foresee users running into subsequent analysis issues if they are not aware (don’t really read the documentation). Also, is the column name, “index”, too vague? Any ideas on how to make this more user friendly? How about TADA.SameResultMultipleAUs ? Currently, it is placed in at the start of the df, is that where we want it to go?

Are the sites falling within more than one AU? Or just falling within a catchment that contains more than one AU? I am wondering if we could modify the function so that the AUID/catchment are added, but no duplicates are created. I am still working on my review of this PR, but I'll keep thinking about this as I do.

To reply quickly/shortly to this - you are correct that this happens when sites are falling within a catchment that contains more that one AU, not necessarily that the site is falling into multiple raw ATTAINS features.

I originally assumed that this occurs where states have overlapping AU's (overlapping raw ATTAINS features). That also does occur.

However this scenario, when sites are falling within a catchment that contains more that one AU, could happen frequently (e.g. at tributaries). I think is a point in the workflow where we need a flag of some sort, so a user is able to decide what to do with a site that could be matched with multiple AU's (same catchment). A) Do they pick one of the multiple AU's to assign that site to (would the current function facilitate being able to do that in a future R Shiny app?)), B) Do they reuse the data for both AUs (that seems to be what the current functionality would enable). Note: We plan to leverage these functions for the development of a companion R Shiny app in the future that would help users review/QC these associations.

@hillarymarler
Copy link
Collaborator

I have been thinking quite a bit about whether we should be incorporating the catchment information from the ATTAINS webservices as the default option in TADA. As the catchments themselves are not involved in the assessment process (states are free to define assessment units at the catchment level, but this would then be reflected in the assessment unit geometry). I can understand how having the catchment info available would be useful for other purposes, but don't think it belongs in the assessment workflow, especially if it is creating duplicate records that need to be identified and dealt with.

I chatted about this briefly with Wendy and she verified that catchments are used for the purposes of summarizing assessment data, but it is a behind the scenes process and should not interfere with state assessment methods.

@kathryn-willi
Copy link
Collaborator

I feel that using the catchments to grab AUs is a good way to be more confident that the raw AUs are within the same drainage area/more likely associated with the same NHD feature as the WQP observation. But, I do agree that the raw ATTAINS features are the most important features returned in TADA_GetATTAINS(). One way to resolve this issue could be to set the default for TADA_GetATTAINS()'s return_sf argument to TRUE. We could also add/modify steps to the workflow like -

1. Grab the AUs that fall within the same catchment as the WQP observation (basically, as the function exists currently).
2. We could add a new column containing the distance between each WQP observation and the the raw ATTAINS AUs they are associated with in the TADA_with_ATTAINS dataframe.
3. Add a new argument, `return_nearest = TRUE`. This could filter the AUs grabbed in step 1 to the nearest AUs only (i.e., for WQP observations that fall within a catchment with multiple AUs, we select only the one closest to it). 

A completely different approach could be to get rid of the use of catchments altogether. In this workflow we could grab the nearest raw ATTAINS features to each WQP observation based on some search radius (maybe user-defined?) and/or all the AUs within that search radius, then create a column identifying their distances away like in the approach above...

Also happy to discuss this over a call if that would be easier!

@hillarymarler
Copy link
Collaborator

I feel that using the catchments to grab AUs is a good way to be more confident that the raw AUs are within the same drainage area/more likely associated with the same NHD feature as the WQP observation. But, I do agree that the raw ATTAINS features are the most important features returned in TADA_GetATTAINS(). One way to resolve this issue could be to set the default for TADA_GetATTAINS()'s return_sf argument to TRUE. We could also add/modify steps to the workflow like -

1. Grab the AUs that fall within the same catchment as the WQP observation (basically, as the function exists currently).
2. We could add a new column containing the distance between each WQP observation and the the raw ATTAINS AUs they are associated with in the TADA_with_ATTAINS dataframe.
3. Add a new argument, `return_nearest = TRUE`. This could filter the AUs grabbed in step 1 to the nearest AUs only (i.e., for WQP observations that fall within a catchment with multiple AUs, we select only the one closest to it). 

A completely different approach could be to get rid of the use of catchments altogether. In this workflow we could grab the nearest raw ATTAINS features to each WQP observation based on some search radius (maybe user-defined?) and/or all the AUs within that search radius, then create a column identifying their distances away like in the approach above...

Also happy to discuss this over a call if that would be easier!

@kathryn-willi I think your proposed solution of a return_nearest = TRUE argument sounds great, as it would both keep the benefits of using catchments to grab AUs and remove the issue of duplicate results when a WQP observation falls in a catchment with multiple AUs. Most of our assessment automating users will want only the nearest AU returned, so this sets things up nicely for them while still allowing other users to return the additional catchment/AU information if they would like to for another purpose.

@cristinamullin - what do you think?

@cristinamullin
Copy link
Collaborator Author

I feel that using the catchments to grab AUs is a good way to be more confident that the raw AUs are within the same drainage area/more likely associated with the same NHD feature as the WQP observation. But, I do agree that the raw ATTAINS features are the most important features returned in TADA_GetATTAINS(). One way to resolve this issue could be to set the default for TADA_GetATTAINS()'s return_sf argument to TRUE. We could also add/modify steps to the workflow like -

1. Grab the AUs that fall within the same catchment as the WQP observation (basically, as the function exists currently).
2. We could add a new column containing the distance between each WQP observation and the the raw ATTAINS AUs they are associated with in the TADA_with_ATTAINS dataframe.
3. Add a new argument, `return_nearest = TRUE`. This could filter the AUs grabbed in step 1 to the nearest AUs only (i.e., for WQP observations that fall within a catchment with multiple AUs, we select only the one closest to it). 

A completely different approach could be to get rid of the use of catchments altogether. In this workflow we could grab the nearest raw ATTAINS features to each WQP observation based on some search radius (maybe user-defined?) and/or all the AUs within that search radius, then create a column identifying their distances away like in the approach above...
Also happy to discuss this over a call if that would be easier!

@kathryn-willi I think your proposed solution of a return_nearest = TRUE argument sounds great, as it would both keep the benefits of using catchments to grab AUs and remove the issue of duplicate results when a WQP observation falls in a catchment with multiple AUs. Most of our assessment automating users will want only the nearest AU returned, so this sets things up nicely for them while still allowing other users to return the additional catchment/AU information if they would like to for another purpose.

@cristinamullin - what do you think?

I agree and think we should proceed with the proposed solutions! In addition, we should make sure to capture our logic/decisions made here either in the function documentation or the Module 2 vignette. Thank you all for the review & discussion!

@kathryn-willi
Copy link
Collaborator

Should we merge this PR and I can submit a new one with these updates, or should I try to incorporate them into this one?

@cristinamullin cristinamullin merged commit 69792d3 into develop Nov 5, 2024
7 checks passed
@cristinamullin cristinamullin deleted the review-533 branch November 8, 2024 19:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding assessment unit type to GetATTAINS features
4 participants