
pivot_wider doesn't seem to be lazy #598

Closed
Janlow opened this issue Feb 9, 2021 · 18 comments

Comments

@Janlow

Janlow commented Feb 9, 2021

I'm connecting R to an Oracle database using DBI and ODBC. I do a lot of querying with dplyr syntax and avoid calling collect() until the very last moment, which works as it should and which I really like. However, I just used pivot_wider() for the first time inside a long query, without collect() at the end, and it seems to actually execute the query, which I have never experienced before. I'm guessing this must be a bug? Do I have to specify somewhere that I want it executed lazily?

I am including an example so that you can see what I mean. Even though I am saving the result as an object, it executes, which I would only expect if I used collect() or didn't save it as an object. Using dbplyr 2.1.0 installed from CRAN.

library(dplyr, warn.conflicts = FALSE)
library(dbplyr)
library(tidyr)

test <- memdb_frame(
  id = rep(1:5000000, each = 2),
  key = rep(c("x", "y"), 5000000),
  value = 1:10000000
) %>%
  pivot_wider(id_cols = id,
              names_from = key,
              values_from = value)
@abalter

abalter commented Feb 9, 2021

@Janlow -- I'm guessing @hadley is going to ask you to create a reprex.

One method is to create a local database, but I doubt there's an Oracle version of that -- you might have to use RSQLite.

https://dbplyr.tidyverse.org/articles/reprex.html

@mgirlich
Collaborator

mgirlich commented Feb 10, 2021

Unfortunately, the data needs to be collected for pivot_wider(). The reason is that, at least in standard SQL, the columns resulting from pivoting have to be specified explicitly. It seems this slipped through the documentation; sorry about that.

In tidyr you can first create a specification and then pass it to pivot_wider_spec(). Unfortunately, tidyr::pivot_wider_spec() is not a generic, so we didn't export the internally used equivalent of it. But it might be an option for the future (see the corresponding tidyr issue tidyverse/tidyr#1071).
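For reference, here's how the spec-based workflow looks in plain tidyr on a local data frame (a minimal sketch with made-up data, not the thread's tables):

```r
library(tidyr)

df <- tibble::tibble(
  id    = c(1, 1, 2, 2),
  key   = c("x", "y", "x", "y"),
  value = c(10, 20, 30, 40)
)

# build_wider_spec() enumerates the output columns up front...
spec <- build_wider_spec(df, names_from = key, values_from = value)

# ...so pivot_wider_spec() already knows every column it will create,
# which is exactly the information a SQL translation needs in advance
pivot_wider_spec(df, spec)
# -> a 2-row data frame with columns id, x, y
```

The point is that all the data-dependent work (discovering the distinct `key` values) happens in build_wider_spec(); the pivot itself is then a purely mechanical transformation.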


@kmishra9

Ah, slightly disappointed to hear pivot_wider() collects first :/
Is there any performance benefit to using pivot_wider() in a dbplyr chain of verbs vs. collecting and then running pivot_wider() on the resulting tibble?

@mgirlich
Collaborator

@kmishra9 The pivot itself is computed in the database, so depending on your data and backend this might be faster. Note that with the next release of tidyr, pivot_wider_spec() will likely become a generic. Then (after another release of dbplyr) you will have two options:

  • pivot_wider(): no need to provide a specification but collects the data.
  • pivot_wider_spec(): need to provide a specification but doesn't collect the data.

@kmishra9

Oh interesting. Sorry if I'm being dense here, just trying to understand this real-world scenario better:

I've got a few massive, ~100-million-row long dbplyr tables that I want to pivot_wider() before left-joining them all together. Based on what you said, it sounds like the pivot occurs in the DB and then the wide table is the thing that is forcibly collected (optimal and much faster for my use case)? Or does it collect these massive long tables (suboptimal, and my current workflow) and then pivot_wider() locally?

If it's the first case, awesome -- I'll see if I can post some benchmarks from before and after I refactor to use dbplyr's pivot_wider()! If it's the latter, it's unfortunately not a big help/departure from my current workflow (though I appreciate the work regardless!)

@mgirlich
Collaborator

@kmishra9 pivot_wider.tbl_lazy() works as follows (I'll also add that to the documentation):

  1. Get the distinct values in the names_from column(s): collect(distinct(data, !!!syms(names_from))). This requires quite a bit of computation in the database, but very few values are actually collected.
  2. Build a pivot_wider() specification from those unique values. This step simply calls tidyr::build_wider_spec(), doesn't touch the database, and is very fast.
  3. Do the pivoting in the database and then collect the data.

So, very little computation occurs locally, but the first step requires quite a bit of calculation in your database.

When pivot_wider_spec() becomes a generic, steps 1 and 2 can be skipped. I guess for long tables you might then see a speedup of up to 50%.
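The steps above can be sketched roughly as follows (a sketch using a local RSQLite table via memdb_frame() and toy data, not dbplyr's actual internals; the dummy `value` column in step 2 is only there to satisfy build_wider_spec()'s tidyselect of a values_from column):

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr)
library(tidyr)

db <- memdb_frame(
  id    = c(1, 1, 2, 2),
  key   = c("x", "y", "x", "y"),
  value = c(10, 20, 30, 40)
)

# Step 1: only the distinct names_from values travel over the wire
keys <- db %>% distinct(key) %>% collect()

# Step 2: build the spec locally from those values -- cheap, no database work
spec <- keys %>%
  mutate(value = NA_real_) %>%
  build_wider_spec(names_from = key, values_from = value)

# Step 3: the pivot itself runs in the database as grouped
# MAX(CASE WHEN ...) aggregates
db %>%
  pivot_wider(id_cols = id, names_from = key, values_from = value) %>%
  show_query()
```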

@kmishra9

Oh excellent, that's extremely helpful -- I'll be refactoring within the next few weeks and will post any updates from my end around speedups. I'm working with fairly large data that currently involves collecting and then pivoting very long tables wider, but this seems like a much faster workflow, with the extremely important benefit, as with all of dbplyr, of reducing memory strain. Thank you for the implementation; looking forward to giving it a spin!

@kmishra9

kmishra9 commented Apr 27, 2021

Realized the refactor was (a) even simpler than I thought and (b) somewhat necessary for the scale of the datasets I'm using. Here's what I observed on an EC2 instance on the same datasets before and after using the new pivot_wider() implementation:

254 -> 176 seconds
5.0 -> 5.2 seconds
663 -> 400 seconds
3505 -> 1278 seconds
63 -> 15 seconds

So obviously a situation where YMMV, but certainly appears to be a noticeable improvement!

A question for @mgirlich: do you think more DB resources (i.e., in Redshift, more nodes/CPUs/RAM) would improve computation time, given that the pivot is calculated in the database? I would imagine so, but I'm not sure if there's anything under the hood that would prevent the natural SQL parallelization of grouping/aggregation commands that we can often take advantage of implicitly.

@mgirlich
Collaborator

The documentation was updated to make it clear that pivot_wider() is not lazy.

@hadley
Member

hadley commented Mar 8, 2023

Interestingly this is a similar approach to duckdb's: duckdb/duckdb#6387

The true magic (or hack - depending on your perspective :)) in this PR is in allowing PIVOT without specifying the column names up front. DuckDB, like most SQL engines, is completely dependent on knowing all the column names and types within a single SQL statement up front. The magic - and what happens here - is that running a PIVOT without specifying the names will result in running multiple SQL statements.
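For comparison, DuckDB's PIVOT syntax lets you omit the target columns entirely; behind the scenes it first queries the distinct key values and then generates the real pivot statement. A hedged sketch, assuming a hypothetical table `long` with columns `id`, `key`, and `value` whose keys turn out to be 'x' and 'y':

```sql
-- DuckDB shorthand: no output columns named up front
PIVOT long ON key USING max(value) GROUP BY id;

-- roughly equivalent to the explicit form once the keys are known:
SELECT id,
       max(CASE WHEN key = 'x' THEN value END) AS x,
       max(CASE WHEN key = 'y' THEN value END) AS y
FROM long
GROUP BY id;
```

This is the same two-phase trick dbplyr uses: one eager query to learn the column names, then a single ordinary GROUP BY query to do the pivot.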

@woodwards-dnz

woodwards-dnz commented Jul 13, 2023

This issue is making my very slow query run twice: once to do pivot_wider() halfway through the chain, and once at the end when I collect the results. I've spent today trying alternatives, including pivot_wider_spec() (I know the new column names), but I can't make it work (I'm getting the error below even though I know, and have checked, the column names). It would be nice if a lazy version of pivot_wider_spec() could be in dbplyr.

spec <- tibble(NAME = variables, .name = variables, .value = "VALUEI")
weathnew4 <- weathnew3 %>%
  pivot_wider_spec(spec)

Error in `data[vec_unique(c(id_cols, names_from_cols, values_from_cols, unused_cols))]`:
! Can't subset columns with `vec_unique(c(id_cols, names_from_cols, values_from_cols, unused_cols))`.
✖ Subscript `vec_unique(c(id_cols, names_from_cols, values_from_cols, unused_cols))` can't contain missing values.
✖ It has a missing value at location 1.
Run `rlang::last_trace()` to see where the error occurred.

@mgirlich
Collaborator

@woodwards-dnz Since it doesn't look like pivot_wider_spec() will become a generic soon, I exported dbplyr_pivot_wider_spec() (which already existed but wasn't exported). Note that this will be removed soon after pivot_wider_spec() becomes a generic.
Further, I opened an issue in tidyr to give a better error message if passing a lazy table to pivot_wider_spec() (tidyverse/tidyr#1510).

@woodwards-dnz

Thank you very much! I tried that, and it didn't seem to make any difference. Maybe there's another reason why it's not lazy. I guess the lesson is to try to avoid using pivot_wider() in dbplyr.

@mgirlich
Collaborator

Can you post a reprex of your code that doesn't seem to be lazy?

@woodwards-dnz


library(dplyr)
library(tidyr)
library(dbplyr)
library(RSQLite)

# not lazy
us_rent_income
spec1 <- us_rent_income %>%
  build_wider_spec(names_from = variable, values_from = estimate)
spec1
result <- us_rent_income %>%
  select(-moe) %>% 
  pivot_wider_spec(spec1)

# lazy?
us_rent_income2 <- memdb_frame(
  GEOID = "01",
  NAME = "Alabama", 
  variable = c("income", "rent"), 
  estimate = c(24476, 747),
  moe = c(136,3))
result2 <- us_rent_income2 %>%
  select(-moe) %>% 
  pivot_wider_spec(spec1)

@hadley
Member

hadley commented Jul 17, 2023

As @mgirlich said, you need to use dbplyr_pivot_wider_spec() to get laziness:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

spec <- us_rent_income %>%
  build_wider_spec(names_from = variable, values_from = estimate)

us_rent_income2 <- memdb_frame(
  GEOID = "01",
  NAME = "Alabama", 
  variable = c("income", "rent"), 
  estimate = c(24476, 747),
  moe = c(136,3)
)
us_rent_income2 %>%
  select(-moe) %>% 
  dbplyr_pivot_wider_spec(spec) |> 
  show_query()
#> <SQL>
#> SELECT
#>   `GEOID`,
#>   `NAME`,
#>   MAX(CASE WHEN (`variable` = 'income') THEN `estimate` END) AS `income`,
#>   MAX(CASE WHEN (`variable` = 'rent') THEN `estimate` END) AS `rent`
#> FROM (
#>   SELECT `GEOID`, `NAME`, `variable`, `estimate`
#>   FROM `dbplyr_001`
#> ) AS `q01`
#> GROUP BY `GEOID`, `NAME`

Created on 2023-07-17 with reprex v2.0.2

@woodwards-dnz

Oooh ... I have to load the development version of dbplyr! I wondered why I couldn't access that function. I think it works now.
