add water type + speed up fetchATTAINS(); {arcgislayers}-ify fetchNHD #536

cristinamullin · 2024-10-29T15:50:55Z

Hi EPA TADA team!

This PR addresses the following:

Reverted fetchNHD() code back to using {arcgislayers}. This CRAN package should now be stable, and speeds up the download quite drastically!
Added an argument to fetchATTAINS() to only download raw ATTAINS features if explicitly asked for. This speeds up instances in TADA_GetATTAINS() where that raw data is not requested.
Included new code in fetchATTAINS() that adds water type info to each assessment unit (closes #520).
Ensured {test_that} tests work across our geospatial functions.
Looking forward to your review!
Katie

Co-authored-by: Matt Brousil <[email protected]>

WaterType update + {arcgislayers}-ification

Minor mods based on MB suggestions

Co-authored-by: Matt Brousil <[email protected]>

- adds missing dependencies - clarify TADA_GetATTAINS and TADA_MakeSpatial docs - small improvements to TADAModule2.Rmd, including removal of eval = F on certain chunks (this part can be reverted if needed)

cristinamullin · 2024-10-29T15:52:10Z

Added two commits to bring in upstream changes from the develop branch and in addition:

adds missing dependencies
clarify TADA_GetATTAINS and TADA_MakeSpatial docs
small improvements to TADAModule2.Rmd, including removal of eval = F on certain chunks (this part can be reverted if needed)

cristinamullin · 2024-10-29T19:35:49Z

TADA_GetATTAINS may cause the dataframe to grow (contain more rows than the original TADA dataframe) because Water Quality Portal observations can fall within an NHD catchment that contains more than one ATTAINS assessment unit. The new, “index”, column links these duplicate observations for sites that fall within more than one AU. This is important. Should we print a message for users when this occurs? I can foresee users running into subsequent analysis issues if they are not aware (don’t really read the documentation). Also, is the column name, “index”, too vague? Any ideas on how to make this more user friendly? How about TADA.SameResultMultipleAUs ? Currently, it is placed in at the start of the df, is that where we want it to go?

wokenny13 · 2024-10-30T16:36:59Z

TADA_GetATTAINS may cause the dataframe to grow (contain more rows than the original TADA dataframe) because Water Quality Portal observations can fall within an NHD catchment that contains more than one ATTAINS assessment unit. The new, “index”, column links these duplicate observations for sites that fall within more than one AU. This is important. Should we print a message for users when this occurs? I can foresee users running into subsequent analysis issues if they are not aware (don’t really read the documentation). Also, is the column name, “index”, too vague? Any ideas on how to make this more user friendly? How about TADA.SameResultMultipleAUs ? Currently, it is placed in at the start of the df, is that where we want it to go?

Some topics that @cristinamullin and I discussed to think through:

If impairment decisions with assessments are always done on an AU level, where it always including an AUID column, then it may not cause issues.
If done on just a monitoring site level, that then gets rolled up to an AU, these duplicate observations would then be counted more than once, which in this case becomes problematic.
I want to give users some flexibility on how to summarize any impairment for a parameter & designated use combination, while also handling of "site-specific" cases that may contain more than 1 monitoring location, etc. The current view that I have for this summary (based only on discrete, individual observations, can be seen below. Once a user defines the standards value for a parameter and designated use, we count the number of monitoring location sites found in the user's TADA dataframe pull, along with discrete counts of observation (each row with ResultMeasureValue), and the number of those discrete observations that have exceeded that standard value. Any thoughts on this current 'standards exceedance' summary layout? (There are other columns that are not shown in this view)
Flagging these instances in which they occur should be addressed. The column name "Index" does seem too vague and a name change for this would make it more user friendly in my view.

hillarymarler · 2024-10-30T19:00:38Z

TADA_GetATTAINS may cause the dataframe to grow (contain more rows than the original TADA dataframe) because Water Quality Portal observations can fall within an NHD catchment that contains more than one ATTAINS assessment unit. The new, “index”, column links these duplicate observations for sites that fall within more than one AU. This is important. Should we print a message for users when this occurs? I can foresee users running into subsequent analysis issues if they are not aware (don’t really read the documentation). Also, is the column name, “index”, too vague? Any ideas on how to make this more user friendly? How about TADA.SameResultMultipleAUs ? Currently, it is placed in at the start of the df, is that where we want it to go?

Are the sites falling within more than one AU? Or just falling within a catchment that contains more than one AU? I am wondering if we could modify the function so that the AUID/catchment are added, but no duplicates are created. I am still working on my review of this PR, but I'll keep thinking about this as I do.

kathryn-willi · 2024-10-30T19:11:23Z

TADA_GetATTAINS may cause the dataframe to grow (contain more rows than the original TADA dataframe) because Water Quality Portal observations can fall within an NHD catchment that contains more than one ATTAINS assessment unit. The new, “index”, column links these duplicate observations for sites that fall within more than one AU. This is important. Should we print a message for users when this occurs? I can foresee users running into subsequent analysis issues if they are not aware (don’t really read the documentation). Also, is the column name, “index”, too vague? Any ideas on how to make this more user friendly? How about TADA.SameResultMultipleAUs ? Currently, it is placed in at the start of the df, is that where we want it to go?

I agree that renaming "index" to something more explanatory could help. Would this sort of identification-tracking be useful across all of TADA, and therefore something that could be added to all TADA WQP pulls? Something like "TADA.ID" that is the very first column... just a thought! Otherwise, I think moving the column to be with the other ATTAINS columns makes sense.

I also really like the idea of printing a message when this duplicate observation-situation happens and would be an easy thing to add!

kathryn-willi · 2024-10-30T19:22:47Z

TADA_GetATTAINS may cause the dataframe to grow (contain more rows than the original TADA dataframe) because Water Quality Portal observations can fall within an NHD catchment that contains more than one ATTAINS assessment unit. The new, “index”, column links these duplicate observations for sites that fall within more than one AU. This is important. Should we print a message for users when this occurs? I can foresee users running into subsequent analysis issues if they are not aware (don’t really read the documentation). Also, is the column name, “index”, too vague? Any ideas on how to make this more user friendly? How about TADA.SameResultMultipleAUs ? Currently, it is placed in at the start of the df, is that where we want it to go?

Are the sites falling within more than one AU? Or just falling within a catchment that contains more than one AU? I am wondering if we could modify the function so that the AUID/catchment are added, but no duplicates are created. I am still working on my review of this PR, but I'll keep thinking about this as I do.

To reply quickly/shortly to this - you are correct that this happens when sites are falling within a catchment that contains more that one AU, not necessarily that the site is falling into multiple raw ATTAINS features.

cristinamullin · 2024-10-30T19:41:46Z

@katiehealy @wokenny13 @hillarymarler
FYI I created a new issue here: #539 to continue working on the index column issue. This PR does not have to resolve that before being merged (though it could if that is easier).

cristinamullin · 2024-10-30T19:44:46Z

TADA_GetATTAINS may cause the dataframe to grow (contain more rows than the original TADA dataframe) because Water Quality Portal observations can fall within an NHD catchment that contains more than one ATTAINS assessment unit. The new, “index”, column links these duplicate observations for sites that fall within more than one AU. This is important. Should we print a message for users when this occurs? I can foresee users running into subsequent analysis issues if they are not aware (don’t really read the documentation). Also, is the column name, “index”, too vague? Any ideas on how to make this more user friendly? How about TADA.SameResultMultipleAUs ? Currently, it is placed in at the start of the df, is that where we want it to go?

I agree that renaming "index" to something more explanatory could help. Would this sort of identification-tracking be useful across all of TADA, and therefore something that could be added to all TADA WQP pulls? Something like "TADA.ID" that is the very first column... just a thought! Otherwise, I think moving the column to be with the other ATTAINS columns makes sense.

I also really like the idea of printing a message when this duplicate observation-situation happens and would be an easy thing to add!

Actually now that you mention it, there is a WQP column, "ResultIdentifier", that includes a unique ID for every observation. I believe it should serve the same function as the new "index" column.

cristinamullin · 2024-10-30T19:55:14Z

TADA_GetATTAINS may cause the dataframe to grow (contain more rows than the original TADA dataframe) because Water Quality Portal observations can fall within an NHD catchment that contains more than one ATTAINS assessment unit. The new, “index”, column links these duplicate observations for sites that fall within more than one AU. This is important. Should we print a message for users when this occurs? I can foresee users running into subsequent analysis issues if they are not aware (don’t really read the documentation). Also, is the column name, “index”, too vague? Any ideas on how to make this more user friendly? How about TADA.SameResultMultipleAUs ? Currently, it is placed in at the start of the df, is that where we want it to go?

Are the sites falling within more than one AU? Or just falling within a catchment that contains more than one AU? I am wondering if we could modify the function so that the AUID/catchment are added, but no duplicates are created. I am still working on my review of this PR, but I'll keep thinking about this as I do.

To reply quickly/shortly to this - you are correct that this happens when sites are falling within a catchment that contains more that one AU, not necessarily that the site is falling into multiple raw ATTAINS features.

I originally assumed that this occurs where states have overlapping AU's (overlapping raw ATTAINS features). That also does occur.

However this scenario, when sites are falling within a catchment that contains more that one AU, could happen frequently (e.g. at tributaries). I think is a point in the workflow where we need a flag of some sort, so a user is able to decide what to do with a site that could be matched with multiple AU's (same catchment). A) Do they pick one of the multiple AU's to assign that site to (would the current function facilitate being able to do that in a future R Shiny app?)), B) Do they reuse the data for both AUs (that seems to be what the current functionality would enable). Note: We plan to leverage these functions for the development of a companion R Shiny app in the future that would help users review/QC these associations.

hillarymarler · 2024-10-31T16:19:09Z

I have been thinking quite a bit about whether we should be incorporating the catchment information from the ATTAINS webservices as the default option in TADA. As the catchments themselves are not involved in the assessment process (states are free to define assessment units at the catchment level, but this would then be reflected in the assessment unit geometry). I can understand how having the catchment info available would be useful for other purposes, but don't think it belongs in the assessment workflow, especially if it is creating duplicate records that need to be identified and dealt with.

I chatted about this briefly with Wendy and she verified that catchments are used for the purposes of summarizing assessment data, but it is a behind the scenes process and should not interfere with state assessment methods.

kathryn-willi · 2024-10-31T17:12:03Z

I feel that using the catchments to grab AUs is a good way to be more confident that the raw AUs are within the same drainage area/more likely associated with the same NHD feature as the WQP observation. But, I do agree that the raw ATTAINS features are the most important features returned in TADA_GetATTAINS(). One way to resolve this issue could be to set the default for TADA_GetATTAINS()'s return_sf argument to TRUE. We could also add/modify steps to the workflow like -

1. Grab the AUs that fall within the same catchment as the WQP observation (basically, as the function exists currently).
2. We could add a new column containing the distance between each WQP observation and the the raw ATTAINS AUs they are associated with in the TADA_with_ATTAINS dataframe.
3. Add a new argument, `return_nearest = TRUE`. This could filter the AUs grabbed in step 1 to the nearest AUs only (i.e., for WQP observations that fall within a catchment with multiple AUs, we select only the one closest to it).

A completely different approach could be to get rid of the use of catchments altogether. In this workflow we could grab the nearest raw ATTAINS features to each WQP observation based on some search radius (maybe user-defined?) and/or all the AUs within that search radius, then create a column identifying their distances away like in the approach above...

Also happy to discuss this over a call if that would be easier!

hillarymarler · 2024-10-31T19:07:13Z

I feel that using the catchments to grab AUs is a good way to be more confident that the raw AUs are within the same drainage area/more likely associated with the same NHD feature as the WQP observation. But, I do agree that the raw ATTAINS features are the most important features returned in TADA_GetATTAINS(). One way to resolve this issue could be to set the default for TADA_GetATTAINS()'s return_sf argument to TRUE. We could also add/modify steps to the workflow like -
1. Grab the AUs that fall within the same catchment as the WQP observation (basically, as the function exists currently).
2. We could add a new column containing the distance between each WQP observation and the the raw ATTAINS AUs they are associated with in the TADA_with_ATTAINS dataframe.
3. Add a new argument, `return_nearest = TRUE`. This could filter the AUs grabbed in step 1 to the nearest AUs only (i.e., for WQP observations that fall within a catchment with multiple AUs, we select only the one closest to it). 
A completely different approach could be to get rid of the use of catchments altogether. In this workflow we could grab the nearest raw ATTAINS features to each WQP observation based on some search radius (maybe user-defined?) and/or all the AUs within that search radius, then create a column identifying their distances away like in the approach above...

Also happy to discuss this over a call if that would be easier!

@kathryn-willi I think your proposed solution of a return_nearest = TRUE argument sounds great, as it would both keep the benefits of using catchments to grab AUs and remove the issue of duplicate results when a WQP observation falls in a catchment with multiple AUs. Most of our assessment automating users will want only the nearest AU returned, so this sets things up nicely for them while still allowing other users to return the additional catchment/AU information if they would like to for another purpose.

@cristinamullin - what do you think?

cristinamullin · 2024-11-01T12:11:40Z

I feel that using the catchments to grab AUs is a good way to be more confident that the raw AUs are within the same drainage area/more likely associated with the same NHD feature as the WQP observation. But, I do agree that the raw ATTAINS features are the most important features returned in TADA_GetATTAINS(). One way to resolve this issue could be to set the default for TADA_GetATTAINS()'s return_sf argument to TRUE. We could also add/modify steps to the workflow like -
1. Grab the AUs that fall within the same catchment as the WQP observation (basically, as the function exists currently).
2. We could add a new column containing the distance between each WQP observation and the the raw ATTAINS AUs they are associated with in the TADA_with_ATTAINS dataframe.
3. Add a new argument, `return_nearest = TRUE`. This could filter the AUs grabbed in step 1 to the nearest AUs only (i.e., for WQP observations that fall within a catchment with multiple AUs, we select only the one closest to it). 
A completely different approach could be to get rid of the use of catchments altogether. In this workflow we could grab the nearest raw ATTAINS features to each WQP observation based on some search radius (maybe user-defined?) and/or all the AUs within that search radius, then create a column identifying their distances away like in the approach above...
Also happy to discuss this over a call if that would be easier!
@kathryn-willi I think your proposed solution of a return_nearest = TRUE argument sounds great, as it would both keep the benefits of using catchments to grab AUs and remove the issue of duplicate results when a WQP observation falls in a catchment with multiple AUs. Most of our assessment automating users will want only the nearest AU returned, so this sets things up nicely for them while still allowing other users to return the additional catchment/AU information if they would like to for another purpose.

@cristinamullin - what do you think?

I agree and think we should proceed with the proposed solutions! In addition, we should make sure to capture our logic/decisions made here either in the function documentation or the Module 2 vignette. Thank you all for the review & discussion!

kathryn-willi · 2024-11-01T16:01:50Z

Should we merge this PR and I can submit a new one with these updates, or should I try to incorporate them into this one?

kathryn-willi and others added 27 commits October 17, 2024 12:01

arcgislyaers-ification + watertype add

db17420

retest, look good now

f04b6fc

Update R/GeospatialFunctions.R

959321c

Co-authored-by: Matt Brousil <[email protected]>

Update R/GeospatialFunctions.R

fbbdbd6

Co-authored-by: Matt Brousil <[email protected]>

Update R/GeospatialFunctions.R

7b7f37a

Co-authored-by: Matt Brousil <[email protected]>

Update R/GeospatialFunctions.R

17f5508

Co-authored-by: Matt Brousil <[email protected]>

Update R/GeospatialFunctions.R

5cd1da3

Co-authored-by: Matt Brousil <[email protected]>

Update R/GeospatialFunctions.R

622991d

Co-authored-by: Matt Brousil <[email protected]>

Update R/GeospatialFunctions.R

3b8aa31

Co-authored-by: Matt Brousil <[email protected]>

Merge pull request #5 from kathryn-willi/develop

9a86744

WaterType update + {arcgislayers}-ification

removed split_vector from conditional statement

c15c4d8

highlight water type in vignette

d5750f4

Merge pull request #6 from kathryn-willi/develop

2053143

Minor mods based on MB suggestions

Merge branch 'develop' into pr/533

e57fa68

arcgislyaers-ification + watertype add

ec7e275

retest, look good now

bca5cbd

Update R/GeospatialFunctions.R

23f6e52

Co-authored-by: Matt Brousil <[email protected]>

Update R/GeospatialFunctions.R

718ad42

Co-authored-by: Matt Brousil <[email protected]>

Update R/GeospatialFunctions.R

48697ff

Co-authored-by: Matt Brousil <[email protected]>

Update R/GeospatialFunctions.R

a5a9df9

Co-authored-by: Matt Brousil <[email protected]>

Update R/GeospatialFunctions.R

8f2c03e

Co-authored-by: Matt Brousil <[email protected]>

Update R/GeospatialFunctions.R

d96fad0

Co-authored-by: Matt Brousil <[email protected]>

Update R/GeospatialFunctions.R

b76b22b

Co-authored-by: Matt Brousil <[email protected]>

removed split_vector from conditional statement

a48c419

highlight water type in vignette

bbd951d

Merge branch 'pr/533' into review-533

ff373c3

add dependencies and details to docs

0eefbe0

- adds missing dependencies - clarify TADA_GetATTAINS and TADA_MakeSpatial docs - small improvements to TADAModule2.Rmd, including removal of eval = F on certain chunks (this part can be reverted if needed)

minor edits to TADA_GetATTAINS docs

fa0e82d

cristinamullin requested a review from hillarymarler October 29, 2024 17:30

cristinamullin requested review from kathryn-willi and wokenny13 October 29, 2024 19:36

add ATTAINS cols

e53c98a

Merge branch 'develop' into review-533

4f41ba0

cristinamullin merged commit 69792d3 into develop Nov 5, 2024
7 checks passed

cristinamullin mentioned this pull request Nov 5, 2024

TADA_GetATTAINS index usability #539

Open

cristinamullin deleted the review-533 branch November 8, 2024 19:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add water type + speed up fetchATTAINS(); {arcgislayers}-ify fetchNHD #536

add water type + speed up fetchATTAINS(); {arcgislayers}-ify fetchNHD #536

cristinamullin commented Oct 29, 2024

cristinamullin commented Oct 29, 2024

cristinamullin commented Oct 29, 2024

wokenny13 commented Oct 30, 2024 •

edited

Loading

hillarymarler commented Oct 30, 2024

kathryn-willi commented Oct 30, 2024

kathryn-willi commented Oct 30, 2024 •

edited

Loading

cristinamullin commented Oct 30, 2024 •

edited

Loading

cristinamullin commented Oct 30, 2024

cristinamullin commented Oct 30, 2024

hillarymarler commented Oct 31, 2024

kathryn-willi commented Oct 31, 2024

hillarymarler commented Oct 31, 2024

cristinamullin commented Nov 1, 2024

kathryn-willi commented Nov 1, 2024

add water type + speed up fetchATTAINS(); {arcgislayers}-ify fetchNHD #536

add water type + speed up fetchATTAINS(); {arcgislayers}-ify fetchNHD #536

Conversation

cristinamullin commented Oct 29, 2024

cristinamullin commented Oct 29, 2024

cristinamullin commented Oct 29, 2024

wokenny13 commented Oct 30, 2024 • edited Loading

hillarymarler commented Oct 30, 2024

kathryn-willi commented Oct 30, 2024

kathryn-willi commented Oct 30, 2024 • edited Loading

cristinamullin commented Oct 30, 2024 • edited Loading

cristinamullin commented Oct 30, 2024

cristinamullin commented Oct 30, 2024

hillarymarler commented Oct 31, 2024

kathryn-willi commented Oct 31, 2024

hillarymarler commented Oct 31, 2024

cristinamullin commented Nov 1, 2024

kathryn-willi commented Nov 1, 2024

wokenny13 commented Oct 30, 2024 •

edited

Loading

kathryn-willi commented Oct 30, 2024 •

edited

Loading

cristinamullin commented Oct 30, 2024 •

edited

Loading