392 create function to find paired data t ph hardness dependent criteria #519

hillarymarler · 2024-08-29T20:14:34Z

I had been working on this in the demo_impairment branch that @wokenny13 and I have been using to test out some module 3 ideas, but these two draft functions are now developed well enough that a review would be really helpful.

TADA_CreatePairRef - creates a reference df of TADA.CharacteristicNames that should be searched for pairing with results

TADA_PairForCriteriaCalc - uses the previously created ref (or a user supplied one) to pair results with pH, temp, hardness, etc.

I think we should hard code default rankings for hardness, temp, pH, etc. that determine which candidate is selected to pair with a result. They can be edited by the user if desired. Currently, the ranking is assigned with dplyr::cur_group_id(). The rankings are part of what prevents this function from growing the data set, as there may be multiple potential pairs within a group (ex: "hardness") for a single result.
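To illustrate how a rank keeps the data set from growing, here is a minimal sketch, not the code in this PR. The column names `TADA.PairingGroup`, `TADA.PairingRank`, and `PairedValue` and the example data are assumptions made for illustration. `cur_group_id()` only supplies a unique integer per group, which is why editable default rankings may be preferable.

```r
library(dplyr)

# Hypothetical pairing reference: one row per candidate characteristic.
pair_ref <- data.frame(
  TADA.PairingGroup = c("hardness", "hardness", "pH"),
  TADA.CharacteristicName = c("HARDNESS, CA, MG", "HARDNESS, CARBONATE", "PH")
)

# Assign a unique integer per group; this is an id, not a methodological preference.
ranked_ref <- pair_ref %>%
  group_by(TADA.PairingGroup, TADA.CharacteristicName) %>%
  mutate(TADA.PairingRank = cur_group_id()) %>%
  ungroup()

# One result with two candidate hardness values (made-up data).
candidates <- data.frame(
  ResultIdentifier = c("R1", "R1"),
  TADA.PairingGroup = "hardness",
  TADA.CharacteristicName = c("HARDNESS, CARBONATE", "HARDNESS, CA, MG"),
  PairedValue = c(95, 110)
)

# Keeping only the best-ranked candidate per result and pairing group
# means the result is never duplicated.
paired <- candidates %>%
  left_join(ranked_ref, by = c("TADA.PairingGroup", "TADA.CharacteristicName")) %>%
  group_by(ResultIdentifier, TADA.PairingGroup) %>%
  slice_min(TADA.PairingRank, n = 1, with_ties = FALSE) %>%
  ungroup()
```

A user-edited reference would simply replace the `TADA.PairingRank` column before the join.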

Styler updates

Added rlang to imports

wokenny13 · 2024-08-30T19:29:54Z

I will continue to look through this more on Tuesday.

Some thoughts:

For Criteria Standards that are "equation based" and dependent on hardness, would the standards be based on each row of samples from a TADA WQP data pull? Or would it be an average of all samples taken at a monitoringLocation/AUID? For example, filtered by Zinc from the example of AL, will each row have its own standard, or would we determine if data is continuous, then aggregate by an average hardness, and use that hardness for calculating the standards?

Should hardness be included as a TADAPriorityCharConvertRef with units as mg/L? This seems to be a common unit (mg/L) to be used with hardness for calculating standards from what I've seen. Currently I see the units in UG/L.

In today's meeting (8/30), we discussed the case where .data does not contain the paired parameters because the user supplied a TADA dataframe that they filtered to only "Nutrients" or certain "Characteristics". We would then need to pull a complete dataframe that aligns with the user's dataframe (a "full data pull" similar to what the user submitted as their .data argument) and join the two.

hillarymarler · 2024-09-03T12:07:05Z

> For Criteria Standards that are "equation based" and dependent on hardness, would the standards be based on each row of samples from a TADA WQP data pull? Or would it be an average of all samples taken at a monitoringLocation/AUID?

@wokenny13 - I was imagining that each row would use its own hardness (or pH, temp, etc.) value for determining the numeric criteria. I am not aware of any methodologies that take an average from the whole data set and then apply that to all calculations - have you seen this in your review of state methods?
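As a sketch of the row-by-row approach, the example below applies a hardness-dependent equation of the commonly published form exp(m * ln(hardness) + b) to each result using its own paired hardness. The slope and intercept are placeholders, not real criteria values, and `PairedHardness_mgL`, `Criterion_ugL`, and `Exceeds` are hypothetical column names.

```r
library(dplyr)

# Placeholder coefficients for illustration only; real slopes and intercepts
# come from the applicable state or EPA criteria tables.
slope <- 0.85
intercept <- 0.9

results <- data.frame(
  ResultIdentifier = c("R1", "R2", "R3"),
  TADA.ResultMeasureValue = c(80, 120, 75), # zinc, ug/L (made-up values)
  PairedHardness_mgL = c(50, 100, 200)      # hardness paired to each row, mg/L
)

# Each row gets its own criterion from its own paired hardness value.
results_with_criteria <- results %>%
  mutate(
    Criterion_ugL = exp(slope * log(PairedHardness_mgL) + intercept),
    Exceeds = TADA.ResultMeasureValue > Criterion_ugL
  )
```

Because the criterion is computed inside `mutate()`, no averaging across the monitoring location happens unless the user aggregates beforehand.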

> For example, filtered by Zinc from the example of AL, will each row have its own standard, or would we determine if data is continuous, then aggregate by an average hardness, and use that hardness for calculating the standards?

Re: continuous data - my thought was users would likely filter out the continuous data earlier in their workflow if they did not want it to be included in certain standard comparisons.

> Should hardness be included as a TADAPriorityCharConvertRef with units as mg/L? This seems to be a common unit (mg/L) to be used with hardness for calculating standards from what I've seen. Currently I see the units in UG/L.

Yes - this is a great suggestion. I can modify the ref table to facilitate this. We may also want to think about updating the documentation for the pairing function to remind users that ideally they will have done all unit conversion, etc. prior to pairing the data.
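For illustration only, the sketch below shows the kind of conversion a user would want done before pairing: hardness reported in UG/L rescaled to MG/L. It is a hand-rolled stand-in, not the TADA unit conversion workflow, and it assumes the data use the `TADA.ResultMeasureValue` and `TADA.ResultMeasure.MeasureUnitCode` columns with made-up values.

```r
library(dplyr)

hardness <- data.frame(
  TADA.CharacteristicName = "HARDNESS, CA, MG",
  TADA.ResultMeasureValue = c(95000, 110, 150000),
  TADA.ResultMeasure.MeasureUnitCode = c("UG/L", "MG/L", "UG/L")
)

# Rescale only the UG/L rows; mutate() evaluates in order, so the value is
# converted using the original unit code before the code itself is updated.
hardness_mgL <- hardness %>%
  mutate(
    TADA.ResultMeasureValue = if_else(
      TADA.ResultMeasure.MeasureUnitCode == "UG/L",
      TADA.ResultMeasureValue / 1000,
      TADA.ResultMeasureValue
    ),
    TADA.ResultMeasure.MeasureUnitCode = if_else(
      TADA.ResultMeasure.MeasureUnitCode == "UG/L",
      "MG/L",
      TADA.ResultMeasure.MeasureUnitCode
    )
  )
```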

> In today's meeting (8/30), we discussed the case where .data does not contain the paired parameters because the user supplied a TADA dataframe that they filtered to only "Nutrients" or certain "Characteristics". We would then need to pull a complete dataframe that aligns with the user's dataframe (a "full data pull" similar to what the user submitted as their .data argument) and join the two.

What would you think about setting this up as its own function? I can see how this functionality would be very useful in certain scenarios, but I think it might be clearer to keep them separate. That way, if a user chooses or needs to pull in an additional TADA df, it is very clear this is happening. Additionally, they may need to do unit conversion, filter out various types of flagged data, etc., and this may be easier if they keep the new df separate until all of these steps have been accomplished.

Hardness characteristics convert to mg/L as TADA priority

hillarymarler · 2024-09-03T13:20:11Z

I updated to include hardness characteristics as a TADA priority characteristic with units of mg/L.

wokenny13 · 2024-09-03T13:31:12Z

R/CriteriaComparison.R

+        by = dplyr::join_by(MonitoringLocationIdentifier)
+      ) %>%
+      dplyr::group_by(ResultIdentifier) %>%
+      # Figure out fastest time comparison method - needs to be absolute time comparison


Is this part of the function looking for the closest ActivityStartDateTime among the paired parameters? What happens in cases of ties (for example, if the ActivityStartDateTime is 12/1/2010 16:57 and there are pH results with ActivityStartDateTime 12/1/2010 16:47 and 12/1/2010 17:07)?

hillarymarler · 2024-09-03

The "slice_sample(n=1)" at line 336 selects a random sample. This means that at the end of the pairing function, only one result will remain to pair, even if there was a "tie" in both pairing group rank and time difference that would otherwise have left the original TADA result with more than one temp/pH/hardness result to be paired with.

In most of the examples that I looked at, line 336 didn't come into play, as selecting the lowest ranked pairing group and the smallest time difference yielded only one result for pairing. However, including the last slice_sample does account for situations where more than one option remains.
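The selection order described here (lowest rank, then smallest absolute time difference, then a random pick among any remaining ties) can be sketched as follows. This is an illustration rather than the code in R/CriteriaComparison.R; `PairedResultIdentifier`, `PairedDateTime`, `PairingRank`, and `TimeDiffMins` are hypothetical names, and the data reproduce the 16:47 / 17:07 tie from the question.

```r
library(dplyr)

result <- data.frame(
  ResultIdentifier = "R1",
  MonitoringLocationIdentifier = "ML-1",
  ActivityStartDateTime = as.POSIXct("2010-12-01 16:57", tz = "UTC")
)

# Three pH candidates at the same site; two are exactly 10 minutes away.
ph <- data.frame(
  MonitoringLocationIdentifier = "ML-1",
  PairedResultIdentifier = c("P1", "P2", "P3"),
  PairedDateTime = as.POSIXct(
    c("2010-12-01 16:47", "2010-12-01 17:07", "2010-12-01 09:00"),
    tz = "UTC"
  ),
  PairingRank = c(1L, 1L, 1L)
)

paired <- result %>%
  left_join(ph, by = join_by(MonitoringLocationIdentifier)) %>%
  group_by(ResultIdentifier) %>%
  # Absolute difference, so earlier and later candidates are treated alike.
  mutate(TimeDiffMins = abs(as.numeric(
    difftime(PairedDateTime, ActivityStartDateTime, units = "mins")
  ))) %>%
  filter(PairingRank == min(PairingRank)) %>%
  filter(TimeDiffMins == min(TimeDiffMins)) %>%
  # Two candidates remain tied here; pick one at random.
  slice_sample(n = 1) %>%
  ungroup()
```

Calling `set.seed()` before the pipeline makes the random tie-break reproducible between runs.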

wokenny13 · 2024-09-03T14:55:35Z

> @wokenny13 - I was imagining that each row would use its own hardness (or pH, temp, etc.) value for determining the numeric criteria. I am not aware of any methodologies that take an average from the whole data set and then apply that to all calculations - have you seen this in your review of state methods?

I was under this same impression, so wanted to make sure that this was the case.

> What would you think about setting this up as its own function? I can see how this functionality would be very useful in certain scenarios, but I think it might be clearer to keep them separate. That way, if a user chooses or needs to pull in an additional TADA df, it is very clear this is happening. Additionally, they may need to do unit conversion, filter out various types of flagged data, etc., and this may be easier if they keep the new df separate until all of these steps have been accomplished.

I like how the function returns "Error in TADA_CreatePairRef(.data) : None of the specified pairing characteristics were found in the TADA data frame." I think a separate function would be nice; it would be used only if the user feels a parameter should be paired but did not originally pull it in their TADA_DataRetrieval. This function could have a long runtime, so it is probably best to keep it separate. We would likely need to consider how to mirror the way the user originally pulled and cleaned their TADA dataframe, such as whether they ran unit conversion, flagged data, autocleaned, etc.
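A check of this kind can be sketched in base R as below. `check_pairing_chars()` is a hypothetical helper written for illustration, not a TADA function; only the error message text is taken from the behavior quoted above.

```r
# Hypothetical helper: return the pairing characteristics present in .data,
# or stop with an informative error when none are found.
check_pairing_chars <- function(.data, pairing_chars) {
  found <- intersect(pairing_chars, unique(.data$TADA.CharacteristicName))
  if (length(found) == 0) {
    stop(
      "None of the specified pairing characteristics were found in the TADA data frame.",
      call. = FALSE
    )
  }
  found
}

pairing_chars <- c("PH", "HARDNESS, CA, MG", "TEMPERATURE, WATER")

# A data frame filtered to nutrients only has nothing to pair with.
nutrients_only <- data.frame(TADA.CharacteristicName = c("NITRATE", "PHOSPHORUS"))

# A fuller pull that still contains pH.
full_pull <- data.frame(TADA.CharacteristicName = c("NITRATE", "PH"))

found_in_full <- check_pairing_chars(full_pull, pairing_chars)
```

Calling `check_pairing_chars(nutrients_only, pairing_chars)` raises the error, which is the point at which a separate retrieval function could be suggested to the user.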

This could be a function that I would be interested in working on in the future.

wokenny13 · 2024-09-03T17:01:41Z

The functions look great and seem to be working as intended! Future updates that I would love to help with could address additional items, such as handling more special cases for other parameters, the ranking, and exporting the ref file if desired.

Some other comments to consider:

Would there ever be a case where a user would want any paired parameter(s), regardless of speciation, fractions or units? And in that case, the paired parameter would just be based on the closest ActivityStartDateTime for any of the identified TADA.CharacteristicName values within any of the argument groups (pH, hardness, temp, etc.)?

Is the current ranking based on alphabetical sort order by TADA.CharacteristicName? Is there a way to count the number of occurrences in the dataframe for each row to base the ranking on?

hillarymarler · 2024-09-04T17:50:03Z

> Would there ever be a case where a user would want any paired parameter(s), regardless of speciation, fractions or units? And in that case, the paired parameter would just be based on the closest ActivityStartDateTime for any of the identified TADA.CharacteristicName values within any of the argument groups (pH, hardness, temp, etc.)?

I hadn't considered a case like this. I guess it is possible. If the user set all the rankings per group in their ref as equal to each other, then the time difference (and then, if more than one time difference was the same, a random selection) would be the deciding factor for which was selected. I suspect that many (most?) methodologies will dictate a specific acceptable characteristic (or characteristics), but maybe that is not the case. It would be possible to add a param to TADA_CreatePairRef to set all ranks as equal. I don't think that would be too complicated. What do you think of that idea?
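The effect of equal ranks can be sketched as follows. `equalize_ranks()` and `pick()` are hypothetical helpers, and `PairingRank` and `TimeDiffMins` are assumed column names; the real function would expose this through whatever parameter is eventually added.

```r
library(dplyr)

pair_ref <- data.frame(
  TADA.PairingGroup = c("hardness", "hardness"),
  TADA.CharacteristicName = c("HARDNESS, CA, MG", "HARDNESS, CARBONATE"),
  PairingRank = c(1L, 2L)
)

# Hypothetical option: give every characteristic the same rank.
equalize_ranks <- function(ref) mutate(ref, PairingRank = 1L)

# Two candidates for one result; the lower-ranked one is closer in time.
candidates <- data.frame(
  ResultIdentifier = "R1",
  TADA.CharacteristicName = c("HARDNESS, CA, MG", "HARDNESS, CARBONATE"),
  TimeDiffMins = c(45, 5)
)

# Rank first, then time difference. with_ties = FALSE keeps the first row
# on a tie; the function discussed here uses a random pick instead.
pick <- function(ref) {
  candidates %>%
    left_join(ref, by = "TADA.CharacteristicName") %>%
    group_by(ResultIdentifier) %>%
    filter(PairingRank == min(PairingRank)) %>%
    slice_min(TimeDiffMins, n = 1, with_ties = FALSE) %>%
    ungroup()
}

ranked_pick <- pick(pair_ref)                # rank decides
equal_pick <- pick(equalize_ranks(pair_ref)) # time difference decides
```

With the original ranks the rank-1 characteristic wins despite being 45 minutes away; with equal ranks the measurement 5 minutes away is selected.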

> Is the current ranking based on alphabetical sort order by TADA.CharacteristicName? Is there a way to count the number of occurrences in the dataframe for each row to base the ranking on?

The current ranking is just the order in which they are pulled from the original TADA df. I like the idea of counting the number of occurrences in the df to generate the initial ranking. I will update the function to do this.
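One way to derive an initial ranking from result counts is sketched below. It is not the implementation committed in this PR; `TADA.PairingGroup`, `NResults`, and `PairingRank` are assumed names and the data are made up.

```r
library(dplyr)

tada_df <- data.frame(
  TADA.CharacteristicName = c(
    "PH", "HARDNESS, CA, MG", "HARDNESS, CA, MG",
    "HARDNESS, CA, MG", "HARDNESS, CARBONATE"
  )
)

pair_ref <- data.frame(
  TADA.PairingGroup = c("hardness", "hardness", "pH"),
  TADA.CharacteristicName = c("HARDNESS, CA, MG", "HARDNESS, CARBONATE", "PH")
)

# Count results per characteristic, then rank within each pairing group so
# the most frequently reported characteristic gets rank 1.
ranked_ref <- tada_df %>%
  count(TADA.CharacteristicName, name = "NResults") %>%
  inner_join(pair_ref, by = "TADA.CharacteristicName") %>%
  group_by(TADA.PairingGroup) %>%
  mutate(PairingRank = min_rank(desc(NResults))) %>%
  ungroup()
```

`min_rank()` gives equal ranks to characteristics with equal counts, so the time difference (and then a random pick) would still break those ties downstream.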

Rank characteristics in pair ref by # of results in data frame

Added comments in code

hillarymarler added 2 commits August 29, 2024 16:06

Add Pairing Functions

9ad1555

Update CriteriaComparison.R

c59e4bb

Styler updates

hillarymarler linked an issue Aug 29, 2024 that may be closed by this pull request

Create function to find paired data (T, pH, hardness dependent criteria) #392

Closed

6 tasks

hillarymarler assigned hillarymarler, wokenny13 and cristinamullin and unassigned hillarymarler Aug 29, 2024

hillarymarler added 8 commits August 29, 2024 17:39

Add test

98fdaf5

Update test-CriteriaComparison.R

d358be5

Update Utilities.R

2bb6264

Update Utilities.R

896a39b

Documentation updates

6dc03d1

Documentation and wordlist updates

9972be6

Documentation and wordlist updates

883cc57

Update DESCRIPTION

087fa06

Added rlang to imports

Update TADAPriorityCharUnitRef.csv

6656f98

Hardness characteristics convert to mg/L as TADA priority

wokenny13 reviewed Sep 3, 2024

wokenny13 approved these changes Sep 3, 2024

hillarymarler added 3 commits September 4, 2024 15:00

Added "UNDER ACTIVE DEVELOPMENT" label

ff0ee66

Update CriteriaComparison.R

922ded2

Rank characteristics in pair ref by # of results in data frame

Update CriteriaComparison.R

3517df7

Added comments in code

hillarymarler merged commit 025f304 into develop Sep 5, 2024
7 checks passed

hillarymarler deleted the 392-create-function-to-find-paired-data-t-ph-hardness-dependent-criteria branch September 5, 2024 13:31
