reduce index size #35
Comments
Filesizes in
|
Biggest
tab <- table(fileIndex[,2:1])
tab[tab==0] <- NA
tableColVal(tab, digits=0, palette=seqPal(logbase=1.15), nameswidth=0.21)
table(fileIndex[fileIndex$var=="precipitation", c("per","res")])
|
Do I get this right that fewer stations provide 5-minute values than 1-minute values (e.g. for recent, 964 vs. 943)?
Making use of {xts}, to be precise
So, what exactly is the unique selling point here (from DWD's point of view)? |
Yes, you see that correctly. As the 5_min data is new, this may yet be expanded. I fear that just taking the 5-minute files out of the index is not a very future-proof method: other datasets might expand and still bring the package over the 5 MB limit. |
I played around a little bit, testing of
library(magrittr) # provides the %>% pipe used below
## get station ids, 1 min
html <- rvest::read_html("https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/1_minute/precipitation/recent/")
list <- html %>% rvest::html_elements("a") %>% rvest::html_text2() %>% stringr::str_split("_")
df <- list[2:length(list)] %>% unlist() %>% matrix(ncol=4, byrow = TRUE) %>% data.frame(stringsAsFactors = FALSE)
#> Warning in matrix(., ncol = 4, byrow = TRUE): data length [3857] is not a
#> sub-multiple or multiple of the number of rows [965]
stations_1min <- df[["X3"]][1:(length(df[["X3"]])-2)]
length(stations_1min)
#> [1] 963
## get station ids, 5 min
html <- rvest::read_html("https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/5_minutes/precipitation/recent/")
list <- html %>% rvest::html_elements("a") %>% rvest::html_text2() %>% stringr::str_split("_")
df <- list[2:length(list)] %>% unlist() %>% matrix(ncol=4, byrow = TRUE) %>% data.frame(stringsAsFactors = FALSE)
stations_5min <- df[["X3"]][2:length(df[["X3"]])]
length(stations_5min)
#> [1] 943
## get setdiffs
dplyr::setdiff(stations_1min, stations_5min)
#> [1] "00071" "00410" "00430" "01473" "01578" "02184" "02292" "02556" "03147"
#> [10] "03552" "04623" "05468" "05614" "05616" "05646" "05758" "06186" "06276"
#> [19] "06312" "07099"
dplyr::setdiff(stations_5min, stations_1min)
#> character(0)
Created on 2022-05-17 by the reprex package (v2.0.1)
So basically, it seems that at the moment the 5-min values have no advantage over the 1-min ones... I can imagine data being included here that was digitized in the course of MUNSTAR, but this would only explain additional historical observations, I assume. Hm, but who knows.
However, I also get your "not my business" point, and not addressing this issue in general won't help in the long term. I'm not that familiar with your internal structure, to be honest, so no idea if I'm even of help here, but I assume you basically indexed the content being downloadable via
If so, is the index being updated regularly? And would it also be possible to build it dynamically on the fly based on the query issued (combination of product/parameter/resolution/quality, ...) without having to store everything? |
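The station comparison in the reprex above relies on splitting file names at "_" and reshaping them into a four-column matrix, which triggers the warning whenever a listing entry does not have exactly four parts. A minimal sketch of a more forgiving variant follows. It assumes that the zip file names carry a five-digit station id between underscores; the file names, the helper `station_ids()`, and the vectors are made up for illustration and are not fetched from the DWD server.

```r
# Illustrative listing entries only (assumed naming pattern, not real server output)
files_1min <- c("1minutenwerte_nieder_00071_akt.zip",
                "1minutenwerte_nieder_00090_akt.zip",
                "1minutenwerte_nieder_00410_akt.zip",
                "BESCHREIBUNG_obsgermany_climate_1min_precipitation_recent_de.pdf")
files_5min <- c("5minutenwerte_nieder_00090_akt.zip")

# Hypothetical helper: keep only zip files, then pull out the
# five-digit id that sits between two underscores
station_ids <- function(files) {
  zips <- grep("\\.zip$", files, value = TRUE)
  ids <- sub("^.*_([0-9]{5})_.*$", "\\1", zips)
  unique(ids)
}

ids_1min <- station_ids(files_1min)
ids_5min <- station_ids(files_5min)

# Stations present in the 1-minute listing but missing from the 5-minute one
only_1min <- setdiff(ids_1min, ids_5min)
# Stations present in the 5-minute listing but missing from the 1-minute one
only_5min <- setdiff(ids_5min, ids_1min)
```

Because non-zip entries are dropped before the id is extracted, description files or parent-directory links in the listing no longer need to be trimmed by position.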
I for now try to get
The index is used to select data to get URLs, see the package structure diagram.
I would rather not generate it on the fly, as running
I update the indexes irregularly, and there is an option to generate a query-based current version if needed, see the fileindex page |
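To make the "index is used to select data to get URLs" step concrete, here is a rough sketch of such a lookup on a toy index. It is not the package's actual implementation: the data frame `toy_index`, its `id` and `path` columns, the file names, and the helper `select_urls()` are assumptions made up for illustration; only the base URL and the `res`/`var`/`per` column names appear in this thread.

```r
# Base URL as used in the reprex above
base_url <- "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate"

# Toy index with assumed columns; paths and file names are illustrative
toy_index <- data.frame(
  res  = c("1_minute", "5_minutes", "1_minute"),
  var  = "precipitation",
  per  = "recent",
  id   = c("00071", "00090", "00090"),
  path = c("1_minute/precipitation/recent/1minutenwerte_nieder_00071_akt.zip",
           "5_minutes/precipitation/recent/5minutenwerte_nieder_00090_akt.zip",
           "1_minute/precipitation/recent/1minutenwerte_nieder_00090_akt.zip"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: filter the index by the query and build full URLs
select_urls <- function(index, res, var, per, id) {
  hit <- index$res == res & index$var == var & index$per == per & index$id == id
  paste(base_url, index$path[hit], sep = "/")
}

urls <- select_urls(toy_index, res = "1_minute", var = "precipitation",
                    per = "recent", id = "00090")
```

A lookup like this needs the full list of paths to be available locally, which is why the size of the stored index matters.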
81% of the fileIndex is sub-hourly precipitation. I think a different index concept is needed only there.
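A share like this can be computed directly from the index columns. The sketch below uses a made-up five-row data frame in place of the real fileIndex, so the resulting number is only illustrative; the set of resolutions counted as sub-hourly is also an assumption.

```r
# Toy stand-in for the real index (values are illustrative)
toy_index <- data.frame(
  res = c("1_minute", "1_minute", "5_minutes", "5_minutes", "daily"),
  var = c("precipitation", "precipitation", "precipitation", "precipitation", "kl"),
  per = c("recent", "historical", "recent", "historical", "recent"),
  stringsAsFactors = FALSE
)

# Resolutions treated as sub-hourly here (assumption)
sub_hourly <- c("1_minute", "5_minutes", "10_minutes")

# Logical vector: TRUE for rows that are sub-hourly precipitation files
is_subhourly_precip <- toy_index$var == "precipitation" &
  toy_index$res %in% sub_hourly

# Fraction of index rows taken up by sub-hourly precipitation
share <- mean(is_subhourly_precip)
```

Run against the real index, the same expression would show how much a separate index concept for these rows could save.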
|
Hi Berry, on topic:
|
Thanks @valentingar for your ideas!
With the huge speedup in
Note: indexFTP needs RCurl to work, so a local version of paths (or query my github repo instead?) might be preferred |
…n of id and date range, should be helpful with #35.
With the new 5-minute data (April 2022), the fileIndex and derived indexes are getting very big.
I think rdwd cannot be published automatically on CRAN with the check NOTE
installed size is 5.3Mb (R 3.5, data 1.4)
Ideas on package size reduction are welcome!