Conversation
… if this is a conservative estimate.
…umn for lat/long. Also generalized the "bad orgs" that are tripping up calls to WQP
…e limit. Now, it filters partitions to those with n sites < 1000 and then finds the partition with the smallest record count (see the sketch below). Partitions with >1000 sites were tripping the POST call up.
…POST to retrieve data if that fails.
…n tries readWQPdata.
…rance and upping n obs allowed in single call.
…s. Now WQP data in one file.
…e out of date according to remake, but are essentially up to date and do not need a repull. Used sc_bless!
… done yet, particularly for WQP, but this is first stab. End of 5_munge step puts everything into single file, except ignores wqp with depth measures for now.
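For readers following along, a rough sketch of the partition-selection rule described in the commit messages above; `partitions`, `n_sites`, and `record_count` are hypothetical names, not the pipeline's actual objects:

library(dplyr)

# toy partition table; the real pipeline derives this from the WQP site inventory
partitions <- tibble(
  partition_id = c('A', 'B', 'C'),
  n_sites = c(1200, 800, 400),
  record_count = c(90000, 45000, 60000))

# keep partitions with fewer than 1000 sites (larger ones tripped up the POST
# call), then take the one with the smallest record count
smallest_ok <- partitions %>%
  filter(n_sites < 1000) %>%
  arrange(record_count) %>%
  slice(1)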
@aappling-usgs -- okay, ready to go! I've added a
@@ -3,5 +3,5 @@
target_inv_size: 1000

# Approximate maximum number of records to pull in each call to WQP.
-# Recommended number is 500000 unless that causes exceedance of max URL size allowed by the WAF
+# Recommended number is 25000 unless that causes exceedance of max URL size allowed by the WAF
Wow, that's a lot smaller! Is this still required even after the improvements to WQP this year?
Oh, and I note that below it's 250,000 whereas in this comment it's 25,000. 250,000 isn't that much smaller.
Yeah, this was originally just an oversight on my part (copied over from NWIS), but I don't think we actually know what this number should be.
Looks good! I have only one issue of real concern below (filtering temps before converting F to C), but there's enough stuff to talk about that I'll leave this PR open while we discuss.
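The filtering-before-conversion concern amounts to reordering two steps; a sketch, with column names taken from the diffs below and the cutoff handling purely illustrative:

# convert deg F to deg C first, then apply the max-temperature filter so the
# cutoff is compared against values in a single unit
dat_reduced <- dat_reduced %>%
  mutate(ResultMeasureValue = ifelse(`ResultMeasure/MeasureUnitCode` == 'deg F',
                                     (ResultMeasureValue - 32) * 5/9,
                                     ResultMeasureValue),
         `ResultMeasure/MeasureUnitCode` = 'deg C') %>%
  filter(ResultMeasureValue < max_value)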
# filter out site types that are not of interest
wqp_inventory <- wqp_inventory %>%
  filter(!(ResolvedMonitoringLocationTypeName %in% wqp_pull_params$DropLocationTypeName))

# filter out bad org names
these are bad names because WQP queries can't handle them, right? might want to add a comment on that if you still have documentation or memory of what errors you were getting.
Yep - they cause WQP queries to fail. Have added a comment.
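For context, the added filter might look something like this; the `bad_org_names` vector, its contents, and the exact column used are illustrative assumptions:

# org identifiers like these cause WQP queries to fail outright, so drop
# their inventory rows before partitioning the pull
bad_org_names <- c('ORG_WITH_SPACES IN NAME', 'ORG/WITH/SLASHES')
wqp_inventory <- wqp_inventory %>%
  filter(!(OrganizationIdentifier %in% bad_org_names))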
wqp_dat_time <- tryCatch(
  {
    time_start <- Sys.time()
    wqp_dat <- wqp_POST(wqp_args)
Did you end up getting any errors with wqp_POST? And if so, did they also show up with readWQPdata?
Yes - sometimes you do. It's mostly because the call times out (according to Jim) - but is successful once the call is made again.
It didn't look like any of the calls were made with readWQPdata.
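The fallback described in the PR (POST first, then readWQPdata if the POST errors) could be sketched like this; `wqp_POST()` and `wqp_args` are the PR's own names, but the surrounding structure here is an illustration, not the actual code:

# try the faster POST helper first; if it errors (e.g., a timeout),
# fall back to dataRetrieval::readWQPdata with the same query arguments
wqp_dat <- tryCatch(
  wqp_POST(wqp_args),
  error = function(e) {
    message('wqp_POST failed (', conditionMessage(e), '); retrying with readWQPdata')
    do.call(dataRetrieval::readWQPdata, wqp_args)
  })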
1_wqp_pull/src/wqp_pull.R (outdated)
file1 <- tempdir()
doc <- utils::unzip(temp, exdir = file1)
unlink(temp)
retval <- suppressWarnings(read_delim(doc,
FWIW, my personal preference is to add the newline right after the parenthesis so that subsequent lines can start with a 2-character indent, which leaves them a lot more room to be complete lines themselves. I think both spacing/indent patterns are pretty common on our team, but just in case you've deep down always wanted to do it my way, here's my vote of support =)
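Concretely, the suggestion is something like this; the `read_delim` arguments past `doc` are placeholders, since the diff above is truncated:

# newline right after the opening parenthesis, then a 2-space indent
retval <- suppressWarnings(read_delim(
  doc, delim = '\t', col_types = readr::cols(.default = 'c')))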
Yeah, I think I like your style. This was just copied and pasted :) so thanks for paying attention to style!
1_wqp_pull/src/wqp_pull.R (outdated)
httr::write_disk(temp))

headerInfo <- httr::headers(x)
file1 <- tempdir()
temp, file1, and doc are pretty darn hacky variable names
Agreed. Modified to be more descriptive. (Also forced me to really understand what was going on here!)
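For illustration, one possible set of more descriptive names for the snippet above (these particular names are guesses, not necessarily the ones committed):

zip_path <- tempfile(fileext = '.zip')   # was `temp`: the downloaded archive
# ... httr call writes the WQP response to disk via httr::write_disk(zip_path) ...
unzip_dir <- tempdir()                   # was `file1`: extraction directory
data_file <- utils::unzip(zip_path, exdir = unzip_dir)  # was `doc`: extracted file path
unlink(zip_path)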
5_data_munge/src/munge_data_files.R (outdated)
  activity_start_times = paste(`ActivityStartTime/Time`, collapse = ', '),
  n_day = n())

dat_daily_meta <- filter(dat_reduced, !is.na(`ResultMeasure/MeasureUnitCode`)) %>%
what case is handled by the !is.na() filter in this line? Note that

> grepl('deg C|deg F', NA)
[1] FALSE

so I'd expect that line 35 above would filter these out already
Good catch - this is redundant.
dat_daily_meta <- filter(dat_reduced, !is.na(`ResultMeasure/MeasureUnitCode`)) %>%
  select(-ResultMeasureValue, -`ActivityStartTime/Time`) %>%
  group_by(MonitoringLocationIdentifier, ActivityStartDate, sample_depth) %>%
  summarize_all(first)
wow, I didn't know about first - that's really tidy!
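For anyone else new to it, a toy example of the `dplyr::first` pattern used above (made-up data):

library(dplyr)

meta <- tibble(
  site = c('A', 'A', 'B'),
  unit = c('deg C', 'deg C', 'deg F'),
  method = c('probe', 'grab', 'probe'))

# within each group, keep the first value of every non-grouping column
meta %>%
  group_by(site) %>%
  summarize_all(first)
# site A keeps unit 'deg C' and method 'probe'; site B keeps its single row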
  mutate(`ResultMeasure/MeasureUnitCode` = 'deg C') %>%
  filter(ResultMeasureValue < max_value)

dat_daily <- group_by(dat_reduced, MonitoringLocationIdentifier, ActivityStartDate, sample_depth) %>%
Hmm, I hadn't realized that WQP provided depth-specific values for water quality. Does this happen in rivers as well as lakes/reservoirs? If so, seems we'll want to accommodate this when collecting observations for stream temperature modeling.
Yeah, for now, without much digging into metadata, I've just handled records with and without depths independently. My guess is that it's mostly lakes/reservoirs, with some streams mixed in. The final pushed daily temperature product does not include these observations (obs with depth) because I wasn't quite ready to tackle that problem.
5_data_munge/src/munge_data_files.R (outdated)
         temp_value < max_value) %>%
  mutate(dateTime = as.Date(dateTime)) %>%
  group_by(site_no, col_name, dateTime) %>%
  summarize(temperature_mean_daily = round(mean(temp_value), 1), n_day = n())
Are all NWIS temperatures only reported to tenth-degree precision? It probably doesn't matter much for these daily values, but I saw unnecessary choppiness in sub-daily timeseries when I was doing metabolism modeling. If we could avoid adding any superfluous rounding here, that might be nice...could round to the minimum number of sig figs among the values in temp_value, for example.
Am going to round to 3 digits to give a bit of buffer.
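A toy comparison of the two rounding choices (numbers invented):

temp_value <- c(12.3456, 13.91, 14.057)
round(mean(temp_value), 1)  # original: one decimal place -> 13.4
round(mean(temp_value), 3)  # adopted: three decimals as a buffer -> 13.438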
5_data_munge/src/munge_data_files.R (outdated)
  ungroup() %>%
  mutate(date = as.Date(dateTime),
         source = 'nwis_uv') %>%
  select(site_id = site_no,
What do you think about making these NWIS site_ids be USGS-XXX here so they mesh with WQP conventions early on in the munging?
Good idea.
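A minimal sketch of that change (toy data; the real code would modify the `select`/`mutate` shown above):

library(dplyr)

nwis_daily <- tibble(site_no = c('01491000', '05553700'),
                     temperature_mean_daily = c(14.2, 18.7))

# prefix with 'USGS-' so NWIS ids match WQP's MonitoringLocationIdentifier style
nwis_daily %>%
  mutate(site_id = paste0('USGS-', site_no))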
…conversion, remove redundant is.na filtering, avoid rounding to too few sig figs, add "USGS" prefix to NWIS sites
…d objects in POST script to be descriptive.
All changes made @aappling-usgs! Also rebuilt the munge step because of the filter coming before the units conversion.
Woot!
This PR fixes old issues where a subset of calls to WQP would fail. Implemented the POST script from Laura/Jordan, which seems to fix the issues. The POST solution is also faster - each pull to WQP takes 1-2 minutes (times 365 pulls), versus more like 3-4 minutes with dataRetrieval::readWQPdata.

Additionally, the workflow now uses a combiner, and does not put intermediate files into the shared cache. The combined file is now stored here.