392 create function to find paired data t ph hardness dependent criteria #519

hillarymarler
Collaborator

I had been working on this in the demo_impairment branch that @wokenny13 and I have been using to test out some module 3 ideas, but these two draft functions are now developed enough that a review would be really helpful.

TADA_CreatePairRef - creates a reference df of TADA.CharacteristicNames that should be searched for pairing with results

TADA_PairForCriteriaCalc - uses the previously created ref (or a user supplied one) to pair results with pH, temp, hardness, etc.

I think we should hard code default rankings for hardness, temp, pH, etc. that determine which characteristic should be selected to pair with a result. They can be edited by the user if desired. Currently, the ranking is assigned with dplyr::cur_group_id(). The rankings are part of what prevents this function from growing the data set, as there may be multiple potential pairs within a group (e.g., "hardness") for a single result.

@wokenny13
Collaborator

I will continue to look through this more on Tuesday.

Some thoughts:

For Criteria Standards that are "equation based" and dependent on hardness, would the standards be based on each row of samples from a TADA WQP data pull? Or would it be an average of all samples taken at a monitoringLocation/AUID? For example, filtered by Zinc from the example of AL, will each row have its own standard, or would we determine if data is continuous, then aggregate by an average hardness, and use that hardness for calculating the standards?

image

Should hardness be included as a TADAPriorityCharConvertRef with units as mg/L? This seems to be a common unit (mg/L) to be used with hardness for calculating standards from what I've seen. Currently I see the units in UG/L.

In today's meeting (8/30) we discussed the case where .data does not contain the paired parameters because the user supplied a TADA dataframe that they had filtered to only "Nutrients" or certain "Characteristics". We would then need to pull a complete dataframe that aligns with the user's dataframe (a "full data pull" similar to what the user submitted as their .data argument) and join the two.

@hillarymarler
Collaborator Author

> For Criteria Standards that are "equation based" and dependent on hardness, would the standards be based on each row of samples from a TADA WQP data pull? Or would it be an average of all samples taken at a monitoringLocation/AUID?

@wokenny13 - I was imagining that each row would use its own hardness (or pH, temp, etc.) value for determining the numeric criteria. I am not aware of any methodologies that take an average from the whole data set and then apply that for all calculations - have you seen this in your review of state methods?
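
A minimal sketch of the per-row approach described above, assuming a criterion of the general form exp(m * ln(hardness) + b) * cf. The function name `hardness_criterion`, the column names, and the data values are hypothetical, and the coefficients are illustrative inputs only; the actual values would come from the applicable state standard.

```r
# Hypothetical helper: hardness-dependent criterion evaluated element-wise,
# so each result row gets a criterion from its own paired hardness value.
hardness_criterion <- function(hardness_mg_l, m, b, cf = 1) {
  exp(m * log(hardness_mg_l) + b) * cf
}

# Made-up zinc results that have already been paired with a hardness value.
paired <- data.frame(
  ResultIdentifier = c("r1", "r2", "r3"),
  zinc_ug_l = c(70, 95, 150),
  paired_hardness_mg_l = c(50, 100, 200),
  stringsAsFactors = FALSE
)

# Illustrative coefficients (assumption); substitute those from the standard.
paired$criterion_ug_l <- hardness_criterion(
  paired$paired_hardness_mg_l,
  m = 0.8473, b = 0.884, cf = 0.978
)

# Row-by-row comparison of each result against its own criterion.
paired$exceeds <- paired$zinc_ug_l > paired$criterion_ug_l
```

Because the calculation is vectorized, no averaging across the data set is involved; an averaged approach would instead pass a single hardness value for every row.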

> For example, filtered by Zinc from the example of AL, will each row have its own standard, or would we determine if data is continuous, then aggregate by an average hardness, and use that hardness for calculating the standards?

Re: continuous data - my thought was users would likely filter out the continuous data earlier in their workflow if they did not want it to be included in certain standard comparisons.

> Should hardness be included as a TADAPriorityCharConvertRef with units as mg/L? This seems to be a common unit (mg/L) to be used with hardness for calculating standards from what I've seen. Currently I see the units in UG/L.

Yes - this is a great suggestion. I can modify the ref table to facilitate this. We may also want to think about updating documentation for the pairing function to remind users that ideally they will have done all unit converting, etc. prior to pairing the data.

> In today's meeting (8/30) we discussed the case where .data does not contain the paired parameters because the user supplied a TADA dataframe that they had filtered to only "Nutrients" or certain "Characteristics". We would then need to pull a complete dataframe that aligns with the user's dataframe (a "full data pull" similar to what the user submitted as their .data argument) and join the two.

What would you think about setting this up as its own function? I can see how this functionality would be very useful in certain scenarios, but think it might be more clear to have them separate. That way if a user is choosing to/needing to pull in an additional TADA df it is very clear this is happening. Additionally, they may need to do unit conversion, filter out various types of flagged data, etc. and this may be easier if they keep the new df separate until all of these steps have been accomplished.

Hardness characteristics convert to mg/L as TADA priority
@hillarymarler
Collaborator Author

I updated to include hardness characteristics as a TADA priority characteristic with units of mg/L.
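
For readers converting by hand before pairing, the arithmetic is a factor of 1000. The sketch below uses base R only and is not the ref-table mechanism in the package; the column names are assumed to follow TADA naming and the values are made up.

```r
# Made-up hardness results reported in mixed units.
hardness <- data.frame(
  TADA.CharacteristicName = c("HARDNESS, CA, MG", "HARDNESS, CA, MG"),
  TADA.ResultMeasureValue = c(120000, 95),
  TADA.ResultMeasure.MeasureUnitCode = c("UG/L", "MG/L"),
  stringsAsFactors = FALSE
)

# Convert only the rows reported in ug/L: 1 mg/L = 1000 ug/L.
is_ug <- hardness$TADA.ResultMeasure.MeasureUnitCode == "UG/L"
hardness$TADA.ResultMeasureValue[is_ug] <-
  hardness$TADA.ResultMeasureValue[is_ug] / 1000
hardness$TADA.ResultMeasure.MeasureUnitCode[is_ug] <- "MG/L"
```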

by = dplyr::join_by(MonitoringLocationIdentifier)
) %>%
dplyr::group_by(ResultIdentifier) %>%
# Figure out fastest time comparison method - needs to be absolute time comparison
Collaborator

Is this part of the function looking at the closest ActivityStartDateTime for the paired parameters? What happens in cases of ties (for example, if the ActivityStartDateTime is 12/1/2010 16:57 and there are pH results with ActivityStartDateTime 12/1/2010 16:47 and 12/1/2010 17:07)?

image

Collaborator Author

The "slice_sample(n=1)" at line 336 selects a random sample. This means that at the end of the pairing function, only one result will remain to pair, even if there was a "tie" between pairing group rank and time difference that would have left the original TADA result with more than one temp/pH/hardness result to be paired with.

In most of the examples that I looked at, line 336 didn't come into play, as selecting the lowest ranked pairing group and the smallest time difference yielded only one result for pairing. However, including the last slice_sample does account for situations where more than one option remains.
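
The selection order described above (lowest rank, then smallest absolute time difference, then a random pick among any remaining ties) can be sketched with dplyr as follows. This is an illustration under assumptions, not the code in the pull request: the `Pair*` column names and the data are hypothetical, and `set.seed()` is only there to make the random pick repeatable.

```r
library(dplyr)

# Two results at one site; zn-1 reproduces the 16:47 / 17:07 tie from the question.
results <- data.frame(
  ResultIdentifier = c("zn-1", "zn-2"),
  MonitoringLocationIdentifier = "site-A",
  ActivityStartDateTime = as.POSIXct(
    c("2010-12-01 16:57", "2010-12-01 17:30"), tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Candidate pH results at the same site, all with the same (hypothetical) rank.
candidates <- data.frame(
  MonitoringLocationIdentifier = "site-A",
  PairRank = c(1L, 1L, 1L),
  PairValue = c(7.1, 7.4, 6.9),
  PairDateTime = as.POSIXct(
    c("2010-12-01 16:47", "2010-12-01 17:07", "2010-12-01 09:00"), tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

set.seed(1)
paired <- results %>%
  left_join(
    candidates,
    by = join_by(MonitoringLocationIdentifier),
    relationship = "many-to-many"
  ) %>%
  group_by(ResultIdentifier) %>%
  # Absolute time difference in minutes between the result and each candidate.
  mutate(timediff = abs(as.numeric(
    difftime(PairDateTime, ActivityStartDateTime, units = "mins")
  ))) %>%
  slice_min(PairRank, n = 1, with_ties = TRUE) %>%
  slice_min(timediff, n = 1, with_ties = TRUE) %>%
  # Any candidates still tied at this point are resolved by a random pick.
  slice_sample(n = 1) %>%
  ungroup()
```

The output has one row per ResultIdentifier, so the join does not grow the data set. For zn-1 both nearby pH results are 10 minutes away, so either may be chosen; a deterministic tie-break (for example, preferring the earlier measurement) would be a possible alternative to the random pick.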

@wokenny13
Collaborator

> @wokenny13 - I was imagining that each row would use its own hardness (or pH, temp, etc.) value for determining the numeric criteria. I am not aware of any methodologies that take an average from the whole data set and then apply that for all calculations - have you seen this in your review of state methods?

I was under this same impression, so wanted to make sure that this was the case.

> What would you think about setting this up as its own function? I can see how this functionality would be very useful in certain scenarios, but think it might be more clear to have them separate. That way if a user is choosing to/needing to pull in an additional TADA df it is very clear this is happening. Additionally, they may need to do unit conversion, filter out various types of flagged data, etc. and this may be easier if they keep the new df separate until all of these steps have been accomplished.

I like how the function returns "Error in TADA_CreatePairRef(.data) : None of the specified pairing characteristics were found in the TADA data frame." I think a separate function could be nice, to be used only when the user feels a parameter should be paired but did not originally pull it in their TADA_DataRetrieval. This function could have a long runtime, so it is probably best to keep it separate. We would also need to decide how to keep the new pull consistent with how the user originally pulled and cleaned their TADA dataframe, such as whether they ran unit conversion, flagged data, autocleaned, etc.

This could be a function that I would be interested in working on in the future.
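
As a rough illustration of the kind of check that produces the error quoted above, here is a base R sketch. The helper name `check_pairing_chars` and the example characteristic names are hypothetical; this is not the implementation inside TADA_CreatePairRef.

```r
# Hypothetical helper: stop early when none of the pairing characteristics
# are present, otherwise return the ones that were found.
check_pairing_chars <- function(.data, pair_chars) {
  found <- intersect(unique(.data$TADA.CharacteristicName), pair_chars)
  if (length(found) == 0) {
    stop("None of the specified pairing characteristics were found in the TADA data frame.")
  }
  found
}

# A data frame filtered to nutrients only has nothing to pair with.
nutrients_only <- data.frame(
  TADA.CharacteristicName = c("NITRATE", "PHOSPHORUS"),
  stringsAsFactors = FALSE
)

res <- tryCatch(
  check_pairing_chars(nutrients_only, c("PH", "TEMPERATURE, WATER")),
  error = function(e) conditionMessage(e)
)
```

A separate retrieval function could be offered to the user at exactly this point, when the check fails.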

@wokenny13
Collaborator

The functions look great and seem to be working as intended! Future development that I would love to help with could cover additional items, such as handling more special cases for other parameters, the ranking, and exporting the ref file if desired.

Some other comments to consider:

Would there ever be a case where a user would want any paired parameter(s), regardless of speciation, fractions or units? And in that case, the paired parameter would just be based on the closest ActivityStartDateTime of any of the identified TADA.CharacteristicName values within any of the argument groups (pH, hardness, temp, etc.)?

Is the current ranking based on alphabetical sort order by TADA.CharacteristicName? Is there a way to count the number of occurrences in the dataframe for each row to base the ranking on?

@hillarymarler
Collaborator Author

> Would there ever be a case where a user would want any paired parameter(s), regardless of speciation, fractions or units? And in that case, the paired parameter would just be based on the closest ActivityStartDateTime of any of the identified TADA.CharacteristicName values within any of the argument groups (pH, hardness, temp, etc.)?

I hadn't considered a case like this. I guess it is possible. If the user set all the rankings per group in their ref as equal to each other, then the time difference (and then, if more than one time difference was the same, a random selection) would be the deciding factor for which was selected. I suspect that many (most?) methodologies will dictate a specific acceptable characteristic (or characteristics), but maybe that is not the case. It would be possible to add a param to TADA_CreatePairRef to set all ranks as equal. I don't think that would be too complicated. What do you think of that idea?

> Is the current ranking based on alphabetical sort order by TADA.CharacteristicName? Is there a way to count the number of occurrences in the dataframe for each row to base the ranking on?

Current ranking is just the order in which they are pulled from the original TADA df. I like the idea of counting the number of occurrences in the df to generate the initial ranking. I will update the function to do this.
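
One way the count-based ranking could look, as a sketch only: the `PairingGroup`, `NResults`, and `Rank` column names and the example data are assumptions, and the merged function may differ.

```r
library(dplyr)

# Made-up TADA data frame reduced to the one column needed here.
tada_df <- data.frame(
  TADA.CharacteristicName = c(
    "HARDNESS, CA, MG", "TOTAL HARDNESS", "HARDNESS, CA, MG",
    "PH", "HARDNESS, CA, MG", "TOTAL HARDNESS"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical lookup of which characteristics belong to which pairing group.
ref_groups <- data.frame(
  TADA.CharacteristicName = c("HARDNESS, CA, MG", "TOTAL HARDNESS", "PH"),
  PairingGroup = c("hardness", "hardness", "pH"),
  stringsAsFactors = FALSE
)

pair_ref <- tada_df %>%
  # Number of results per characteristic in the user's data.
  count(TADA.CharacteristicName, name = "NResults") %>%
  inner_join(ref_groups, by = join_by(TADA.CharacteristicName)) %>%
  group_by(PairingGroup) %>%
  # Rank 1 goes to the most frequent characteristic within each group.
  mutate(Rank = min_rank(desc(NResults))) %>%
  ungroup() %>%
  arrange(PairingGroup, Rank)

# Variant for the "any characteristic in the group" case: make all ranks equal
# so that time difference alone decides the pairing.
equal_ref <- mutate(pair_ref, Rank = 1L)
```

Ranking by frequency favors the characteristic most likely to have a result near any given sample, while the equal-rank variant leaves the choice entirely to the time comparison.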

Rank characteristics in pair ref by # of results in data frame
Added comments in code
@hillarymarler hillarymarler merged commit 025f304 into develop Sep 5, 2024
7 checks passed
@hillarymarler hillarymarler deleted the 392-create-function-to-find-paired-data-t-ph-hardness-dependent-criteria branch September 5, 2024 13:31
Development

Successfully merging this pull request may close these issues.

Create function to find paired data (T, pH, hardness dependent criteria)
3 participants