pivot_wider doesn't seem to be lazy #598
Unfortunately, the data needs to be collected for the names_from columns: their distinct values must be known before the SQL query can be generated.
Ah, slightly disappointed to hear that.
@kmishra9 The pivot itself is calculated in the database, so depending on your kind of data and backend this might be faster. Note that with the next release of dbplyr …
Oh interesting. Sorry if I'm being dense here, just trying to understand this real-world scenario better: I've got a few massive ~100-million-row-long dbplyr tables that I want to pivot_wider before left joining them all together. Based on what you said, it sounds like the pivot occurs in the DB and then the wide table is the thing that is forcibly collected (optimal and much faster for my use case)? Or does it collect these massive long tables (suboptimal, and my current workflow right now) and then pivot_wider locally? If it's the first case, awesome -- I'll see if I can post some benchmarks from before and after I refactor to use dbplyr's pivot_wider! If it's the latter case, it's unfortunately not a big help/departure from my current workflow (though I appreciate the work done regardless!)
@kmishra9 The first step is collect(distinct(data, !!!syms(names_from))), which determines the new column names. This requires quite a bit of computation in the database, but very few values are actually collected, so very little computation occurs locally. The pivot itself then happens in the database. When …
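Here is a minimal sketch of that mechanism, using a local SQLite table via memdb_frame() (the table, data, and column names are invented for illustration):

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(rlang)

db <- memdb_frame(
  id       = c(1, 1, 2, 2),
  variable = c("income", "rent", "income", "rent"),
  value    = c(100, 10, 200, 20)
)

names_from <- "variable"

# Step 1 (eager): pull the distinct names_from values into R, because
# the output column names must be known before the SQL can be written.
collect(distinct(db, !!!syms(names_from)))

# Step 2 (lazy): the pivot compiles to a single grouped query with one
# conditional aggregate per collected value. Calling pivot_wider() on a
# lazy table performs step 1 internally, which is why it isn't fully lazy.
db %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  show_query()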
Oh excellent. That's extremely helpful -- I'll be refactoring within the next few weeks and I'll post any updates from my end around speedups. I'm working with fairly large data that involves collecting and then pivoting wider super long tables, but it seems like this is a much faster workflow and has the extremely important benefit, like all of dbplyr, of …
Realized the refactor was a) even simpler than I thought and b) somewhat necessary for the scale of the datasets I'm using. Here's what I observed on an EC2 instance with the same datasets before and after switching to the new pivot_wider implementation:
So obviously a situation where YMMV, but it certainly appears to be a noticeable improvement! A question @mgirlich: do you think DB resources (e.g. in Redshift, having more nodes/CPUs/RAM) would improve computation time, given that the pivot is calculated in the database? I would imagine so, but I'm not sure if there's anything under the hood that would prevent the natural SQL parallelization of grouping/aggregation commands that we can often take advantage of implicitly.
The documentation was updated to make it clear that pivot_wider() cannot be fully lazy: the distinct values of the names_from columns have to be collected in order to generate the query.
Interestingly, this is a similar approach to duckdb's: duckdb/duckdb#6387
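For anyone curious about the comparison: duckdb's PIVOT statement likewise has to determine the distinct values of the pivot column before the query can run. A sketch via the duckdb R package (requires duckdb >= 0.8; the table and data are invented):

library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "rent_income", data.frame(
  name     = c("Alabama", "Alabama"),
  variable = c("income", "rent"),
  estimate = c(24476, 747)
))

# Dynamic PIVOT: the distinct values of `variable` become the new
# columns, much like dbplyr's names_from handling.
dbGetQuery(con, "PIVOT rent_income ON variable USING max(estimate)")

dbDisconnect(con, shutdown = TRUE)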
This issue is making my very slow query run twice: once to do pivot_wider halfway through the chain, and once at the end when I collect the results. I've spent today trying alternatives, including pivot_wider_spec (I know the new column names), but I can't make it work (I'm getting this error even though I know and have checked the column names). It would be nice if a lazy version of pivot_wider_spec() could be in dbplyr.
@woodwards-dnz Since it doesn't look like pivot_wider_spec() will become a generic, you can use dbplyr_pivot_wider_spec() from the development version of dbplyr instead.
Thank you very much! I tried that and it didn't seem to make any difference. Maybe there's another reason why it's not lazy. I guess the lesson is to try to avoid using pivot_wider in dbplyr.
Can you post a reprex of your code that doesn't seem to be lazy?
As @mgirlich said, you need to use dbplyr_pivot_wider_spec():

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
spec <- us_rent_income %>%
build_wider_spec(names_from = variable, values_from = estimate)
us_rent_income2 <- memdb_frame(
GEOID = "01",
NAME = "Alabama",
variable = c("income", "rent"),
estimate = c(24476, 747),
moe = c(136, 3)
)
us_rent_income2 %>%
select(-moe) %>%
dbplyr_pivot_wider_spec(spec) |>
show_query()
#> <SQL>
#> SELECT
#> `GEOID`,
#> `NAME`,
#> MAX(CASE WHEN (`variable` = 'income') THEN `estimate` END) AS `income`,
#> MAX(CASE WHEN (`variable` = 'rent') THEN `estimate` END) AS `rent`
#> FROM (
#> SELECT `GEOID`, `NAME`, `variable`, `estimate`
#> FROM `dbplyr_001`
#> ) AS `q01`
#> GROUP BY `GEOID`, `NAME`

Created on 2023-07-17 with reprex v2.0.2
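Note that the spec doesn't have to come from build_wider_spec(): a spec is just a data frame with a .name column (the output column name), a .value column (the column supplying the values), and one column per names_from variable. When the output columns are already known, as in @woodwards-dnz's case, it can be written by hand. A sketch, equivalent to the spec built above:

library(tibble)

spec <- tribble(
  ~.name,   ~.value,    ~variable,
  "income", "estimate", "income",
  "rent",   "estimate", "rent"
)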
Oooh ... I have to load the development version of dbplyr! I wondered why I couldn't access that function. I think it works now. |
I'm connecting R to an Oracle database using DBI and ODBC. I do a lot of querying using dplyr syntax and avoid using collect() until the very last moment, which I really like and which works as it should. However, I have now used pivot_wider for the first time inside a long query, without collect() at the end, and it seems as if it actually performs the query, which I have never experienced before; I guess this must be a bug? Do I have to specify somewhere myself that I want it to be executed lazily?
I am including an example so that you can see what I mean. Even though I am saving it as an object, it executes, which I only expect if I use collect() or don't save it as an object. Using dbplyr 2.1.0 installed from CRAN.
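The example itself was not captured in this thread, but a minimal stand-in that reproduces the reported behaviour with an in-memory SQLite table might look like this (names invented):

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

db <- memdb_frame(g = c("a", "a", "b"), k = c("x", "y", "x"), v = 1:3)

# Stays lazy: only a query is built, nothing is sent to the database.
lazy_q <- filter(db, v > 1)

# Runs a query immediately, even though the result is assigned: the
# distinct values of k must be fetched to name the new columns.
wide <- pivot_wider(db, names_from = k, values_from = v)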