This repository has been archived by the owner on Jun 1, 2023. It is now read-only.

Fix wqp pulls #11

Merged
merged 24 commits into USGS-R:master from fix_wqp_pulls
Oct 9, 2019

Conversation

limnoliver
Member

This PR fixes old issues where a subset of calls to WQP fail. It implements the POST script from Laura/Jordan, which seems to resolve the failures. The POST solution is also faster: each pull from WQP takes 1-2 minutes (times 365 pulls), whereas it seemed more like 3-4 minutes per pull with dataRetrieval::readWQPdata.

Additionally, the workflow now uses a combiner and no longer puts intermediate files into the shared cache. The combined file is now stored here.
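For context, a POST-based pull along these lines might look like the following sketch. This is hypothetical: the function name, body format, and query parameters below are illustrative, and the actual wqp_POST in this PR may differ. The key idea is that POSTing the site list in the request body sidesteps the URL-length limit that long GET queries can hit.

```r
library(httr)
library(jsonlite)
library(readr)

# Hypothetical sketch of a POST-based WQP pull; not the merged implementation.
wqp_post_sketch <- function(site_ids, characteristic = 'Temperature, water') {
  zip_path <- tempfile(fileext = '.zip')
  # Send the (possibly very long) site list in the JSON body rather than the URL
  resp <- httr::POST(
    'https://www.waterqualitydata.us/data/Result/search?mimeType=tsv&zip=yes',
    body = jsonlite::toJSON(list(
      siteid = site_ids,
      characteristicName = list(characteristic))),
    httr::content_type('application/json'),
    httr::write_disk(zip_path, overwrite = TRUE))
  httr::stop_for_status(resp)
  # WQP returns a zipped TSV; unzip and read it
  tsv_path <- utils::unzip(zip_path, exdir = tempdir())
  readr::read_delim(tsv_path, delim = '\t')
}
```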

…umn for lat/long. Also generalized the "bad orgs" that are tripping up calls to WQP
…e limit. Now, it filters partition to those with n sites < 1000 and then finds the partition with the smallest record count. Partitions with >1000 sites were tripping the POST call up.
…rance and upping n obs allowed in single call.
…e out of date according to remake, but are essentially up to date and do not need a repull. Used sc_bless!
… done yet, particularly for WQP, but this is first stab. End of 5_munge step puts everything into single file, except ignores wqp with depth measures for now.
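The partition-selection tweak described in the commit messages above (drop partitions with 1000 or more sites, since those trip up the POST call, then pull the partition with the smallest record count first) could be sketched like this; the tibble columns and partition names are made up for illustration:

```r
library(dplyr)

# Made-up inventory summary: one row per candidate partition.
partitions <- tibble::tibble(
  partition = c('pull_001', 'pull_002', 'pull_003'),
  n_sites   = c(1200, 950, 800),
  n_records = c(500000, 40000, 30000)
)

# Partitions with >= 1000 sites were tripping up the POST call, so keep
# only the smaller ones, then start with the smallest record count.
next_pull <- partitions %>%
  filter(n_sites < 1000) %>%
  arrange(n_records) %>%
  slice(1)
```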
@limnoliver
Member Author

@aappling-usgs -- okay, ready to go! I've added a 5_munge step to QA (first rough cut), reduce to dailies (first rough cut), and combine all of the data together. The final combined file is daily_temperatures.rds.

@@ -3,5 +3,5 @@
 target_inv_size: 1000

 # Approximate maximum number of records to pull in each call to WQP.
-# Recommended number is 500000 unless that causes exceedance of max URL size allowed by the WAF
+# Recommended number is 25000 unless that causes exceedance of max URL size allowed by the WAF
Member

Wow, that's a lot smaller! Is this still required even after the improvements to WQP this year?

Member

Oh, and I note that below it's 250,000 whereas in this comment it's 25,000. 250,000 isn't that much smaller.

Member Author

Yeah, this was originally just an oversight on my part (copied over from NWIS), but I don't think we actually know what this number should be.

Member

@aappling-usgs aappling-usgs left a comment

Looks good! I have only one issue of real concern below (filtering temps before converting F to C), but there's enough stuff to talk about that I'll leave this PR open while we discuss.


# filter out site types that are not of interest

wqp_inventory <- wqp_inventory %>%
  filter(!(ResolvedMonitoringLocationTypeName %in% wqp_pull_params$DropLocationTypeName))

# filter out bad org names
Member

These are bad names because WQP queries can't handle them, right? Might want to add a comment on that if you still have documentation or memory of what errors you were getting.

Member Author

Yep - they cause WQP to fail. Have added a comment.
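A sketch of what that filter might look like ('ORG_A'/'ORG_B' are placeholders; the real bad-org list lives in the repo config):

```r
library(dplyr)

# Placeholder IDs; in the pipeline these are organizations whose
# identifiers cause WQP queries to fail outright.
bad_orgs <- c('ORG_A', 'ORG_B')

# Toy inventory standing in for the real WQP inventory table.
wqp_inventory <- tibble::tibble(
  MonitoringLocationIdentifier = c('site1', 'site2', 'site3'),
  OrganizationIdentifier       = c('USGS-WI', 'ORG_A', 'ORG_B')
)

# Drop sites belonging to organizations that break WQP queries.
wqp_inventory <- wqp_inventory %>%
  filter(!OrganizationIdentifier %in% bad_orgs)
```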

wqp_dat_time <- tryCatch(
  {
    time_start <- Sys.time()
    wqp_dat <- wqp_POST(wqp_args)
Member

Did you end up getting any errors with wqp_POST? And if so, did they also show up with readWQPdata?

Member Author

Yes, sometimes you do. It's mostly because the call times out (according to Jim), but it is successful once the call is made again.

It didn't look like any of the calls were made with readWQPdata.

file1 <- tempdir()
doc <- utils::unzip(temp, exdir=file1)
unlink(temp)
retval <- suppressWarnings(read_delim(doc,
Member

FWIW, my personal preference is to add the newline right after the parenthesis so that subsequent lines can start with a 2-character indent, which leaves them a lot more room to be complete lines themselves. I think both spacing/indent patterns are pretty common on our team, but just in case you've deep down always wanted to do it my way, here's my vote of support =)

Member Author

Yeah, I think I like your style. This was just copied and pasted :) so thanks for paying attention to style!

httr::write_disk(temp))

headerInfo <- httr::headers(x)
file1 <- tempdir()
Member

temp, file1, and doc are pretty darn hacky variable names

Member Author

Agreed. Modified to be more descriptive. (Also forced me to really understand what was going on here!)

activity_start_times = paste(`ActivityStartTime/Time`, collapse = ', '),
n_day = n())

dat_daily_meta <- filter(dat_reduced, !is.na(`ResultMeasure/MeasureUnitCode`)) %>%
Member

what case is handled by the !is.na() filter in this line? Note that

> grepl('deg C|deg F', NA)
[1] FALSE

so I'd expect that line 35 above would filter these out already

Member Author

Good catch - this is redundant.

dat_daily_meta <- filter(dat_reduced, !is.na(`ResultMeasure/MeasureUnitCode`)) %>%
  select(-ResultMeasureValue, -`ActivityStartTime/Time`) %>%
  group_by(MonitoringLocationIdentifier, ActivityStartDate, sample_depth) %>%
  summarize_all(first)
Member

wow, I didn't know about first - that's really tidy!

mutate(`ResultMeasure/MeasureUnitCode` = 'deg C') %>%
filter(ResultMeasureValue < max_value)

dat_daily <- group_by(dat_reduced, MonitoringLocationIdentifier, ActivityStartDate, sample_depth) %>%
Member

Hmm, I hadn't realized that WQP provided depth-specific values for water quality. Does this happen in rivers as well as lakes/reservoirs? If so, seems we'll want to accommodate this when collecting observations for stream temperature modeling.

Member Author

Yeah, for now, without much digging into metadata, I've just handled records with and without depths independently. My guess is that it's mostly lakes/reservoirs, with some streams mixed in. The final pushed daily temperature product does not include these observations (obs with depth) because I wasn't quite ready to tackle that problem.

temp_value < max_value) %>%
mutate(dateTime = as.Date(dateTime)) %>%
group_by(site_no, col_name, dateTime) %>%
summarize(temperature_mean_daily = round(mean(temp_value), 1), n_day = n())
Member

Are all NWIS temperatures only reported to tenth-degree precision? It probably doesn't matter much for these daily values, but I saw unnecessary choppiness in sub-daily timeseries when I was doing metabolism modeling. If we could avoid adding any superfluous rounding here that might be nice...could round to the minimum number of sig figs among the values in temp_value, for example.

Member Author

I'm going to round to 3 digits to give a bit of buffer.
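The alternative the reviewer suggested (matching the precision of the raw values rather than rounding to a fixed digit count) could be sketched roughly as follows; min_decimals is a hypothetical helper, not code from this PR:

```r
# Hypothetical helper: fewest decimal places among the raw values,
# used as the rounding precision for the daily mean.
min_decimals <- function(x) {
  n_dec <- vapply(x, function(v) {
    s <- format(v, scientific = FALSE)
    # Count characters after the decimal point (0 if there is none)
    if (grepl('\\.', s)) nchar(sub('^[^.]*\\.', '', s)) else 0L
  }, integer(1))
  min(n_dec)
}

temp_value <- c(12.53, 13.1, 12.88)
round(mean(temp_value), min_decimals(temp_value))  # 12.8
```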

ungroup() %>%
mutate(date = as.Date(dateTime),
source = 'nwis_uv') %>%
select(site_id = site_no,
Member

What do you think about making these NWIS site_ids be USGS-XXX here so they mesh with WQP conventions early on in the munging?

Member Author

Good idea.
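The convention change would be a one-liner along these lines (sketch; the example site numbers are illustrative):

```r
# Prefix NWIS site numbers with 'USGS-' so they match WQP's
# MonitoringLocationIdentifier convention early in munging.
site_no <- c('01646500', '05553700')
site_id <- paste0('USGS-', site_no)
```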

…conversion, remove redundant is.na filtering, avoid rounding to too few sig figs, add "USGS" prefix to NWIS sites
@limnoliver
Member Author

limnoliver commented Oct 9, 2019

All changes made, @aappling-usgs! Also rebuilt the munge step because of the fix that moves the filter after the units conversion. sc_bless was perfect for this use case -- I made changes to wqp_pull that weren't consequential but wanted to update downstream products.

@aappling-usgs
Member

Woot!

@aappling-usgs aappling-usgs merged commit 02a423e into USGS-R:master Oct 9, 2019
@limnoliver limnoliver deleted the fix_wqp_pulls branch October 13, 2020 14:00