392 create function to find paired data t ph hardness dependent criteria #519
Styler updates
@wokenny13 - I was imagining that each row would use its own hardness (or pH, temp, etc.) value for determining the numeric criteria. I am not aware of any methodologies that take an average from the whole data set and then apply that for all calculations - have you seen this in your review of state methods?
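To make the per-row idea concrete, here is a minimal sketch, assuming the log-linear form commonly used for hardness-dependent metals criteria. The column names (`result_ug_l`, `hardness_mg_l`) and the `slope` and `intercept` values are placeholders for illustration only; they are not regulatory coefficients and not part of TADA.

```r
library(dplyr)

# Hypothetical paired results: each row carries its own paired hardness (mg/L).
paired <- data.frame(
  ResultIdentifier = c("R1", "R2", "R3"),
  result_ug_l      = c(4.0, 12.0, 30.0),
  hardness_mg_l    = c(50, 100, 200)
)

# Placeholder coefficients (assumption): substitute the values from the
# applicable standard before using this for any real comparison.
slope     <- 0.85
intercept <- 0.5

row_wise <- paired %>%
  dplyr::mutate(
    # The criterion is computed from the hardness paired with that row,
    # not from a data-set-wide average.
    criterion_ug_l = exp(slope * log(hardness_mg_l) + intercept),
    exceeds        = result_ug_l > criterion_ug_l
  )
```

Because the criterion is a function of each row's own hardness, three different hardness values produce three different criteria.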
Re: continuous data - my thought was users would likely filter out the continuous data earlier in their workflow if they did not want it to be included in certain standard comparisons.
Yes - this is a great suggestion. I can modify the ref table to facilitate this. We may also want to think about updating documentation for the pairing function to remind users that ideally they will have done all unit conversion, etc. prior to pairing the data.
What would you think about setting this up as its own function? I can see how this functionality would be very useful in certain scenarios, but think it might be clearer to have them separate. That way, if a user is choosing to/needing to pull in an additional TADA df, it is very clear this is happening. Additionally, they may need to do unit conversion, filter out various types of flagged data, etc., and this may be easier if they keep the new df separate until all of these steps have been accomplished.
Hardness characteristics convert to mg/L as TADA priority
I updated to include hardness characteristics as a TADA priority characteristic with units of mg/L.
  by = dplyr::join_by(MonitoringLocationIdentifier)
) %>%
dplyr::group_by(ResultIdentifier) %>%
# Figure out fastest time comparison method - needs to be absolute time comparison
The "slice_sample(n=1)" at line 336 selects a random sample. This means that at the end of the pairing function, only one result will remain to pair, even if there was a "tie" between pairing group rank and time difference that would have left the original TADA result with more than one temp/pH/hardness result to be paired with.
In most of the examples that I looked at, line 336 didn't come into play, as selecting the lowest ranked pairing group and the smallest time difference yielded only one result for pairing. However, including the last slice_sample does account for situations where more than one option remains.
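The selection order described above (lowest rank, then smallest absolute time difference, then a random pick on ties) can be sketched as follows. This is an illustration under assumptions, not the code from the PR: the `candidates` data and the column names `pair_rank`, `result_time`, `pair_time`, and `pair_value` are hypothetical.

```r
library(dplyr)

# Hypothetical candidate pairs. R1 has two rank-1 hardness results that are
# each 30 minutes away (a tie) plus a rank-2 result at the same time.
candidates <- data.frame(
  ResultIdentifier = c("R1", "R1", "R1", "R2"),
  pair_rank        = c(1, 1, 2, 1),
  pair_value       = c(95, 105, 100, 80),
  result_time      = as.POSIXct(
    c("2023-06-01 10:00:00", "2023-06-01 10:00:00",
      "2023-06-01 10:00:00", "2023-06-02 08:00:00"), tz = "UTC"),
  pair_time        = as.POSIXct(
    c("2023-06-01 09:30:00", "2023-06-01 10:30:00",
      "2023-06-01 10:00:00", "2023-06-02 08:15:00"), tz = "UTC")
)

paired <- candidates %>%
  # Absolute difference, so "before" and "after" are treated the same.
  dplyr::mutate(
    time_diff = abs(as.numeric(difftime(pair_time, result_time, units = "secs")))
  ) %>%
  dplyr::group_by(ResultIdentifier) %>%
  dplyr::filter(pair_rank == min(pair_rank)) %>%   # best-ranked group first
  dplyr::filter(time_diff == min(time_diff)) %>%   # then closest in time
  dplyr::slice_sample(n = 1) %>%                   # random pick if still tied
  dplyr::ungroup()
```

In this sketch the rank-2 candidate for R1 is dropped even though it has a zero time difference, because rank is applied before time. The final `slice_sample()` only matters for the remaining 30-minute tie, and it keeps the output at one row per ResultIdentifier.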
I was under this same impression, so wanted to make sure that this was the case.
I like how the function returns "Error in TADA_CreatePairRef(.data) : This could be a function that I would be interested in working on in the future.
The functions look great and seem to be working as intended! Future updates that I would love to help with could consider additional items, such as handling more special cases for other parameters, the ranking, and exporting the ref file if desired. Some other comments to consider: Would there ever be a case where a user would want any paired parameter(s), regardless of speciation, fraction, or units? In that case, would the paired parameter just be based on the closest ActivityStartDateTime for any of the identified TADA.CharacteristicName values within any of the argument groups (pH, hardness, temp, etc.)? Is the current ranking based on alphabetical sort order by TADA.CharacteristicName? Is there a way to count the number of occurrences in the data frame for each row to base the ranking on?
I hadn't considered a case like this. I guess it is possible. If the user set all the rankings per group in their ref as equal to each other, then the time difference (and then, if more than one time difference was the same, a random selection) would be the deciding factor for which was selected. I suspected that many (most?) methodologies will dictate a specific acceptable characteristic (or characteristics), but maybe that is not the case. It would be possible to add a param to TADA_CreatePairRef to set all ranks as equal. I don't think that would be too complicated. What do you think of that idea?
Current ranking is just the order in which they are pulled from the original TADA df. I like the idea of counting the number of occurrences in the df to generate the initial ranking. I will update the function to do this.
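A rough sketch of what count-based ranking could look like, assuming a simple lookup that maps each characteristic to a pairing group. The `group_lookup` table and the column names `pair_group`, `n_results`, and `pair_rank` are hypothetical, and the characteristic names are only sample values.

```r
library(dplyr)

# Hypothetical TADA-style data frame: one row per result.
tada_df <- data.frame(
  TADA.CharacteristicName = c(
    "HARDNESS, CA, MG", "HARDNESS, CA, MG", "HARDNESS, CA, MG",
    "TOTAL HARDNESS", "PH", "PH"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical mapping of characteristics to pairing groups.
group_lookup <- data.frame(
  TADA.CharacteristicName = c("HARDNESS, CA, MG", "TOTAL HARDNESS", "PH"),
  pair_group              = c("hardness", "hardness", "pH"),
  stringsAsFactors = FALSE
)

pair_ref <- tada_df %>%
  # Number of results per characteristic in the data frame.
  dplyr::count(TADA.CharacteristicName, name = "n_results") %>%
  dplyr::inner_join(group_lookup, by = "TADA.CharacteristicName") %>%
  dplyr::group_by(pair_group) %>%
  # Most frequent characteristic first within each pairing group.
  dplyr::arrange(dplyr::desc(n_results), .by_group = TRUE) %>%
  dplyr::mutate(pair_rank = dplyr::row_number()) %>%
  dplyr::ungroup()
```

Here the most common characteristic in each group gets rank 1. Using `row_number()` keeps ranks unique within a group, so characteristics with equal counts still get distinct ranks; a user could then edit the ranks by hand if their methodology calls for a different order.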
Rank characteristics in pair ref by # of results in data frame
Added comments in code
I had been working on this in the demo_impairment branch that @wokenny13 and I have been using to test out some module 3 ideas, but these two draft functions are now developed well enough that a review would be really helpful.
TADA_CreatePairRef - creates a reference df of TADA.CharacteristicNames that should be searched for pairing with results
TADA_PairForCriteriaCalc - uses the previously created ref (or a user supplied one) to pair results with pH, temp, hardness, etc.
I think we should hard code default rankings for hardness, temp, pH, etc. in terms of which should be selected to pair with a result. They can be edited by the user if desired. Currently, the ranking is assigned with dplyr::cur_group_id(). The rankings are part of what prevents this function from growing the data set, as there may be multiple potential pairs for a group (ex: "hardness") for a single result.