reduce index size #35

brry · 2022-04-28T21:15:24Z

With the new 5-minute data (April 2022), the fileIndex and derived indexes are getting very big.
I think rdwd cannot be published automatically on CRAN because of the check NOTE "installed size is 5.3Mb" (R 3.5, data 1.4).

Ideas on package size reduction are welcome!


brry · 2022-05-13T07:20:29Z

Filesizes in rdwd/data/ folder:

fileIndex: 744 KB (2181 KB without resave. 635 KB only fileIndex$path. 109 KB without path column)
metaIndex: 430 KB
 geoIndex: 222 KB (reduced to 110 KB if display column removed)
gridIndex:  26 KB
formatIndex: 3 KB
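
For anyone who wants to reproduce this kind of size comparison, here is a minimal sketch. It assumes that "resave" refers to recompressing the .rda files (e.g. with tools::resaveRdaFiles); that is my reading, not something stated above. The data frame `toy` is an invented stand-in, not the real fileIndex.

```r
# Toy stand-in for an index: many long, repetitive paths (not the real fileIndex).
set.seed(1)
toy <- data.frame(
  id   = sample(1:999, 5000, replace = TRUE),
  path = sprintf("1_minute/precipitation/historical/%04d/file_%05d.zip",
                 sample(1990:2021, 5000, replace = TRUE), 1:5000),
  stringsAsFactors = FALSE
)

# Save the same object with each compression type and record the file size.
sizes <- sapply(c("gzip", "bzip2", "xz"), function(type) {
  f <- tempfile(fileext = ".rda")
  on.exit(unlink(f))          # clean up the temporary file afterwards
  save(toy, file = f, compress = type)
  file.size(f)
})
round(sizes / 1024, 1)        # size in KB per compression type
```

Dropping a column from `toy` before saving shows, in the same way, how much a single column contributes to the file size.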

brry · 2022-05-13T07:42:51Z

Biggest fileIndex size contributors:

tab <- table(fileIndex[,2:1])
tab[tab==0] <- NA
tableColVal(tab, digits=0, palette=seqPal(logbase=1.15), nameswidth=0.21)

table(fileIndex[fileIndex$var=="precipitation", c("per","res")])

            res
per          1_minute 10_minutes 5_minutes hourly
  historical   217274       3267    158216   1043
  meta_data      1145       1155      1145      0
  now             960        963       937      0
  recent          964        980       943    984

dimfalk · 2022-05-15T11:51:41Z

Do I get this right that fewer stations provide 5-minute values than 1-minute values (e.g. for recent, 964 with 1-minute vs. 943 with 5-minute data)?

With {xts}, or more precisely xts::period.apply(), aggregating 1-minute values to 5-minute sums is a matter of seconds in the end.
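
The aggregation itself can be sketched in base R, without {xts}, to show what is meant. This is not the xts::period.apply() call, only the same idea with cut() and tapply(); the values are synthetic and DWD's timestamp convention (interval start vs. end) is not considered here.

```r
# One hour of synthetic 1-minute precipitation sums (mm); values are made up.
time_1min <- seq(as.POSIXct("2022-05-01 00:00:00", tz = "UTC"),
                 by = "1 min", length.out = 60)
rain_1min <- rep(c(0, 0.2, 0, 0.1, 0), times = 12)

# Assign every timestamp to its 5-minute bin, then sum within each bin.
bin_5min  <- cut(time_1min, breaks = "5 min")
rain_5min <- tapply(rain_1min, bin_5min, sum)

length(rain_5min)  # one value per 5-minute bin
```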

So, what exactly is the unique selling point here (from DWD's point of view)?

brry · 2022-05-16T09:31:39Z

Yes, you see that correctly. As the 5_min data is new, this may yet be expanded.
I do not know why they have this data. I cannot find information in the changelog. I presume they may have different data sources or methods.
Currently, I feel rdwd should have the full index of all available files and leave the discussion of usefulness to the individual user.

I fear that just taking the 5-minute files out of the index is not a very future-proof method: other datasets might expand and still bring the package over the 5 MB limit.

dimfalk · 2022-05-17T20:49:55Z

I played around a little bit, testing of rvest and reprex included... 😏

## get station ids, 1 min

html <- rvest::read_html("https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/1_minute/precipitation/recent/")

list <- html %>% rvest::html_elements("a") %>% rvest::html_text2() %>% stringr::str_split("_")

df <- list[2:length(list)] %>% unlist() %>% matrix(ncol=4, byrow = TRUE) %>% data.frame(stringsAsFactors = FALSE)
#> Warning in matrix(., ncol = 4, byrow = TRUE): data length [3857] is not a
#> sub-multiple or multiple of the number of rows [965]

stations_1min <- df[["X3"]][1:(length(df[["X3"]])-2)]

length(stations_1min)
#> [1] 963

## get station ids, 5 min

html <- rvest::read_html("https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/5_minutes/precipitation/recent/")

list <- html %>% rvest::html_elements("a") %>% rvest::html_text2() %>% stringr::str_split("_")

df <- list[2:length(list)] %>% unlist() %>% matrix(ncol=4, byrow = TRUE) %>% data.frame(stringsAsFactors = FALSE)

stations_5min <- df[["X3"]][2:length(df[["X3"]])]

length(stations_5min)
#> [1] 943

## get setdiffs

dplyr::setdiff(stations_1min, stations_5min)
#>  [1] "00071" "00410" "00430" "01473" "01578" "02184" "02292" "02556" "03147"
#> [10] "03552" "04623" "05468" "05614" "05616" "05646" "05758" "06186" "06276"
#> [19] "06312" "07099"

dplyr::setdiff(stations_5min, stations_1min)
#> character(0)

Created on 2022-05-17 by the reprex package (v2.0.1)
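
A side note on the matrix() warning above: it presumably comes from link texts that do not split into exactly four parts. A regex on the link text avoids the manual trimming of first and last entries. The sketch below runs on invented link texts, not on the live directory listing, and assumes the station ID is the only five-digit block between underscores.

```r
# Link texts as they might appear in the directory listing (illustrative names).
links <- c("../",
           "1minutenwerte_nieder_00020_akt.zip",
           "1minutenwerte_nieder_00071_akt.zip",
           "some_description_file.pdf")

# Keep only the five-digit station ID between underscores.
# regmatches() silently drops entries without a match.
ids <- regmatches(links, regexpr("(?<=_)[0-9]{5}(?=_)", links, perl = TRUE))
ids
```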

So basically, it seems like at the moment there is no advantage of the 5-minute values over the 1-minute ones... I can imagine that data digitized in the course of MUNSTAR is included here, but this would only explain additional historical observations, I assume. Hm, but who knows.

However, I also get your "not my business" point - and not addressing this issue in general won't help in the long term.

I'm not that familiar with your internal structure to be honest - so no idea if I'm even of help here - but I assume you basically indexed the content being downloadable via rdwd in order to facilitate function calls or the like?

If so, is the index being updated regularly? And would it also be possible to build it dynamically on-the-fly based on the query issued (combination of product/parameter/resolution/quality, ...) without having to store everything?

brry · 2022-05-30T09:32:42Z

For now, I will try to get rdwd on CRAN despite the size of the package.

The index is used to select data and get the URLs, see the package structure diagram. I would rather not generate it on the fly, as running createIndex takes several minutes for the full process.

I update the indexes irregularly, and there is an option to generate a query-based current version if needed, see the fileindex page.

brry · 2022-05-30T09:38:22Z

81% of the fileIndex is sub-hourly precipitation. I think a different index concept is needed only there.

mean(fileIndex$var=="precipitation" & fileIndex$res != "hourly")
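
The same logical flag could also be used to split the index into a small regular part and a large sub-hourly precipitation part. A minimal sketch on an invented toy index (only the column names res and var are taken from fileIndex; the object names are hypothetical):

```r
# Toy index with the same column names as fileIndex (the rows are invented).
toy_index <- data.frame(
  res = c("1_minute", "5_minutes", "10_minutes", "hourly", "daily"),
  var = c("precipitation", "precipitation", "precipitation",
          "precipitation", "kl"),
  stringsAsFactors = FALSE
)

# Logical flag for sub-hourly precipitation; its mean is the share of rows.
sub_hourly <- toy_index$var == "precipitation" & toy_index$res != "hourly"
mean(sub_hourly)   # 3 of 5 rows in this toy data

# Splitting on the flag gives a small regular index and a separate special part.
index_regular <- toy_index[!sub_hourly, ]
index_subhour <- toy_index[ sub_hourly, ]
```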

valentingar · 2022-07-09T10:48:01Z

Hi Berry,
I had just written a quick script yesterday to download DWD climate data and thought "Hey that would make for a good package!", when I stumbled across your package which looks awesome! Looking forward to checking it out in detail!

on topic:
3 ideas:

1. Turn columns c("res", "var", "per") into integer indices and map them in separate data.frames. Then write an internal function that joins the tables together when fileIndex is needed. Potential size decrease 10 %.
2. Only store the paths and speed up your createIndex()-function (e.g. add fixed = TRUE to the calls to strsplit()). Then add a function to .onLoad() that prepares the dataset once on package loadup. Potential size decrease 20 %.
3. Consider hosting a separate package with only the index files. This will also be more kind to CRAN, as you don't have to resubmit the data every time you update the package without reindexing the database. This could give you more headspace for the future as well.
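
To make the first idea concrete, here is a rough sketch with plain vectors as lookup tables instead of separate data.frames. The toy rows and the helper `expand_index` are hypothetical, not part of rdwd; R's factors do essentially the same thing internally (integer codes plus levels).

```r
# Toy index; column names follow the thread, rows are invented.
toy_index <- data.frame(
  res = c("1_minute", "1_minute", "hourly", "1_minute"),
  var = c("precipitation", "precipitation", "air_temperature", "precipitation"),
  per = c("historical", "recent", "recent", "historical"),
  stringsAsFactors = FALSE
)

cols <- c("res", "var", "per")
# Lookup tables: the unique strings of each column.
lookup <- lapply(toy_index[cols], unique)
# Compact form: integer positions into the lookup tables.
compact <- toy_index
compact[cols] <- Map(match, toy_index[cols], lookup)

# Hypothetical internal helper that restores the full index on demand.
expand_index <- function(compact, lookup) {
  for (col in names(lookup)) compact[[col]] <- lookup[[col]][compact[[col]]]
  compact
}
all.equal(expand_index(compact, lookup), toy_index)
```

Whether this saves much on disk depends on how well the .rda compression already handles the repeated strings, so the gain would need to be measured on the real index.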

brry · 2023-04-14T09:40:54Z

Thanks @valentingar for your ideas!

The fileIndex etc. as full data.frames are very useful for queries, so I would like to keep that.
I would have never guessed it, but setting fixed=TRUE in strsplit and grepl, along with using endsWith, makes a huge difference in computation time. Thanks so much for the suggestion!
Whenever I update the package on CRAN, updated indexes are provided as well, so that doesn't really help much in this particular case.
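
For reference, the kind of change meant by fixed=TRUE and endsWith can be sketched as follows. The paths are invented and this is not the actual createIndex code; timings will differ by machine.

```r
# Synthetic paths (invented), repeated to make timing differences visible.
paths <- rep(c("1_minute/precipitation/recent/1minutenwerte_nieder_00020_akt.zip",
               "hourly/precipitation/recent/stundenwerte_RR_00020_akt.zip"),
             times = 50000)

# Regular-expression matching (the default) vs. literal matching.
t_regex <- system.time(parts_regex <- strsplit(paths, "/"))
t_fixed <- system.time(parts_fixed <- strsplit(paths, "/", fixed = TRUE))

# Both give the same result when the pattern has no regex metacharacters.
same_split <- identical(parts_regex, parts_fixed)

# endsWith() replaces an anchored regex such as grepl("\\.zip$", paths).
same_ends <- identical(grepl("\\.zip$", paths), endsWith(paths, ".zip"))

# Elapsed seconds for both variants.
c(regex = t_regex[["elapsed"]], fixed = t_fixed[["elapsed"]])
```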

With the huge speedup in createIndex, I think I will remove the 1/5/10 minute data from the regular indexes and run indexFTP + createIndex if they are queried.

Note: indexFTP needs RCurl to work, so a local version of the paths (or querying my GitHub repo instead?) might be preferable.

brry added the help wanted label May 2, 2022

brry added a commit that referenced this issue Apr 7, 2023

raw indexes removed from version control (too large by now), see #35

02967fc

brry added a commit that referenced this issue Apr 14, 2023

createIndex: significantly faster (and now also correct) determination of id and date range, should be helpful with #35.

b550c08
