Getting all combinations of characteristic-fraction-speciation-unit in harmonization table #319

ehinman · 2023-08-04T21:32:03Z

Describe the bug

We want the harmonization table to be as complete as possible for nutrients and priority parameters, so that users get the most support for synonyms and data harmonization. We used a char-frac-spec combination spreadsheet Kevin pulled from WQX to see the most common combinations of all three. I then wrote a test that checks random datasets for missing combos, and it found many new combinations that weren't in the WQX spreadsheet. At first, I thought this was because WQX does not account for NWIS char-frac-spec combinations, but the problem may extend beyond the NWIS data stream: the WQX combinations (and the WQX char validation table) do not consider blanks or NAs in any of the columns. Thus, for characteristics that do NOT require a fraction and speciation, we would need to separately add all of those combinations to the harmonization template.

To Reproduce

Code to reproduce the behavior:

remotes::install_github("USEPA/TADA", ref = "develop")
library(TADA)

# Pull a random dataset and run the key flag functions on it
test <- TADA_RandomTestingSet()
test1 <- TADA_RunKeyFlagFunctions(test)

# Columns that define a characteristic-fraction-speciation-unit combination
combo_cols <- c("TADA.CharacteristicName", "TADA.ResultSampleFractionText",
                "TADA.MethodSpecificationName", "TADA.ResultMeasure.MeasureUnitCode")

# Unique combinations in the test data for characteristics in the synonym reference
ref <- TADA_GetSynonymRef()
ref_chars <- unique(ref$TADA.CharacteristicName)
test_chars <- unique(test1[test1$TADA.CharacteristicName %in% ref_chars, combo_cols])

# Combinations without a match in the reference come back with HarmonizationGroup = NA
test_chars_ref <- merge(test_chars, ref, all.x = TRUE)
new_combos <- test_chars_ref[is.na(test_chars_ref$HarmonizationGroup), combo_cols]
if (nrow(new_combos) > 0) {
  print("New combinations found in random dataset test:")
  print(new_combos)
}
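
For anyone who wants to see the logic without installing TADA, here is a self-contained sketch of the same check on toy data. The two data frames and their values are made up for illustration; only the column names come from the code above. As far as I know, base R's merge() matches NA key values to each other under its default settings, so a combination with a blank fraction or speciation is flagged as new only when the reference has no row with the same blank.

```r
# Toy illustration only: the rows below are invented, not taken from WQX or NWIS.
combo_cols <- c("TADA.CharacteristicName", "TADA.ResultSampleFractionText",
                "TADA.MethodSpecificationName", "TADA.ResultMeasure.MeasureUnitCode")

# Stand-in for the synonym reference: two fully populated combinations
ref <- data.frame(
  TADA.CharacteristicName = c("NITRATE", "NITRATE"),
  TADA.ResultSampleFractionText = c("DISSOLVED", "TOTAL"),
  TADA.MethodSpecificationName = c("AS N", "AS N"),
  TADA.ResultMeasure.MeasureUnitCode = c("MG/L", "MG/L"),
  HarmonizationGroup = c("Nitrate", "Nitrate"),
  stringsAsFactors = FALSE
)

# Stand-in for observed data: one known combination and two with blanks
observed <- data.frame(
  TADA.CharacteristicName = c("NITRATE", "NITRATE", "NITRATE"),
  TADA.ResultSampleFractionText = c("DISSOLVED", NA, "DISSOLVED"),
  TADA.MethodSpecificationName = c("AS N", "AS N", NA),
  TADA.ResultMeasure.MeasureUnitCode = c("MG/L", "MG/L", "MG/L"),
  stringsAsFactors = FALSE
)

# Left join on the four combination columns; unmatched rows get NA in HarmonizationGroup
merged <- merge(unique(observed), ref, by = combo_cols, all.x = TRUE)
new_combos <- merged[is.na(merged$HarmonizationGroup), combo_cols]
print(new_combos)
```

Both rows containing a blank come back as new, which is the behavior described above: the reference would need its own rows for each blank variant.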

Expected behavior

Ideally, we could pull all combinations of these characteristics from the Water Quality Portal, where NWIS and WQX data mix and all allowable values appear.
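
Once results from both data streams sit in one data frame, tallying the combinations is straightforward. Below is a minimal sketch on made-up rows. No Portal query is performed, and count_combos is a hypothetical helper, not a TADA function. It keeps combinations with a blank fraction, speciation, or unit instead of dropping them.

```r
# Toy stand-in for results downloaded from the Water Quality Portal (values invented)
results <- data.frame(
  TADA.CharacteristicName = c("NITRATE", "NITRATE", "NITRATE", "PH", "PH"),
  TADA.ResultSampleFractionText = c("DISSOLVED", "DISSOLVED", NA, NA, "TOTAL"),
  TADA.MethodSpecificationName = c("AS N", "AS N", "AS N", NA, NA),
  TADA.ResultMeasure.MeasureUnitCode = c("MG/L", "MG/L", "MG/L", NA, "STD UNITS"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: unique combinations of `cols` with a row count `n`.
# NA is treated as its own value, so blank combinations are retained.
# Caveat: the key is built with paste(), so a literal "NA" string would be
# counted together with a true missing value.
count_combos <- function(df, cols) {
  key <- do.call(paste, c(df[cols], sep = "\r"))
  first <- !duplicated(key)
  combos <- df[first, cols, drop = FALSE]
  combos$n <- tabulate(match(key, key[first]))
  rownames(combos) <- NULL
  combos
}

combos <- count_combos(results, names(results))
print(combos)
```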

Reminders for TADA contributors addressing this issue

New features should include all of the following work:

Create the function/code.
Document all code using comments to describe what it does.
Create tests in tests folder.
Create help file using roxygen2 above code.
Create working examples in help file (via roxygen2).
Add to appropriate vignette (or create new one).

cristinamullin · 2023-08-07T13:18:07Z

This won't fix this issue, but it would help reduce total combinations. We could change all NONE to NA for speciation and fraction as part of autoclean, assuming NONE is equivalent to NA.

ehinman · 2023-08-07T13:21:11Z

I cringe a little but if we change all NA to NONE, it would be represented in the validation tables.

ehinman · 2023-08-07T13:40:05Z

We could also change all NONEs in the validation table to NA, too. That might be a solid solution.

cristinamullin · 2023-08-07T13:41:29Z

That would be my preference, to change NONE to NA in autoclean and in the reference table.

ehinman · 2023-08-07T13:42:17Z

Ok, got it. Sounds good. Thanks!

ehinman · 2023-08-15T19:54:03Z

TADA_Autoclean converts all NONE to NA in the fraction and speciation columns prior to validation, and the validation table has NONE set to INVALID. This still does not address additional combos coming in from NWIS (NONE vs NA aside).
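
For reference, the NONE-to-NA step can be sketched in base R as below. This is not the actual TADA_Autoclean code: none_to_na is a hypothetical helper, and matching case-insensitively after trimming whitespace is an assumption on my part.

```r
# Hypothetical helper, not part of TADA: replace "NONE" with NA in the
# fraction and speciation columns, leaving every other value untouched.
none_to_na <- function(df,
                       cols = c("TADA.ResultSampleFractionText",
                                "TADA.MethodSpecificationName")) {
  # Only touch the requested columns that actually exist in df
  for (col in intersect(cols, names(df))) {
    values <- df[[col]]
    # Assumption: "none", " NONE " etc. should be treated the same as "NONE"
    is_none <- !is.na(values) & toupper(trimws(values)) == "NONE"
    df[[col]][is_none] <- NA_character_
  }
  df
}

# Made-up rows for illustration
toy <- data.frame(
  TADA.CharacteristicName = c("PH", "NITRATE", "NITRATE"),
  TADA.ResultSampleFractionText = c("NONE", "DISSOLVED", NA),
  TADA.MethodSpecificationName = c("none", "AS N", "NONE"),
  stringsAsFactors = FALSE
)

cleaned <- none_to_na(toy)
print(cleaned)
```

After this step, NONE and NA collapse into a single blank value, which reduces the number of distinct combinations but does nothing about the extra combos arriving from NWIS.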

cristinamullin · 2024-02-12T21:10:32Z

Note: TADA priority characteristics are here: https://usepa.sharepoint.com/:x:/r/sites/WQPDataAssessmentTeam/_layouts/15/Doc.aspx?sourcedoc=%7B65B3DD61-5856-48CD-8A3C-25415DE94955%7D&file=TADA_Supported_Characteristics.xlsx&action=default&mobileredirect=true

cristinamullin assigned hillarymarler Feb 12, 2024

cristinamullin added Module 1 MVP Top Priority labels Feb 12, 2024

hillarymarler mentioned this issue Jul 29, 2024

pH harmonization issues #454

Open
