
pivot_wider doesn't seem to be lazy #598

Closed
Janlow opened this issue Feb 9, 2021 · 18 comments

Comments

@Janlow

Janlow commented Feb 9, 2021

I'm connecting R to an Oracle database using DBI and ODBC. I do a lot of querying with dplyr syntax and avoid calling collect() until the very last moment, which works as it should and which I really like. However, I just used pivot_wider() for the first time inside a long query, without collect() at the end, and it seems to actually execute the query, which I have never experienced before. I'm guessing this must be a bug? Do I have to specify somewhere that I want it executed lazily?

I am including an example so that you can see what I mean. Even though I am saving the result as an object, it executes, which I would only expect if I used collect() or didn't save it as an object. Using dbplyr 2.1.0 installed from CRAN.

library(dplyr, warn.conflicts = FALSE)
library(dbplyr)
library(tidyr)

test <- memdb_frame(
  id = rep(1:5000000, each = 2),
  key = rep(c("x", "y"), 5000000),
  value = 1:10000000
) %>%
  pivot_wider(id_cols = id,
              names_from = key,
              values_from = value)
@abalter

abalter commented Feb 9, 2021

@Janlow -- I'm guessing @hadley is going to ask you to create a reprex.

One method is to create a local database, but I doubt there's an Oracle version of that -- you might have to use RSQLite.

https://dbplyr.tidyverse.org/articles/reprex.html

@mgirlich
Collaborator

mgirlich commented Feb 10, 2021

Unfortunately, the data needs to be collected for pivot_wider(). The reason is that, at least in standard SQL, the columns resulting from pivoting have to be specified explicitly. It seems this slipped through the documentation; sorry about that.

In tidyr you can first create a specification and then pass it to pivot_wider_spec(). Unfortunately, tidyr::pivot_wider_spec() is not a generic, so we didn't export the internally used equivalent of it. But it might be an option for the future (see the corresponding tidyr issue tidyverse/tidyr#1071).
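For reference, here's how the spec-based workflow looks in plain tidyr on a local data frame (a minimal sketch with made-up data, not the thread's tables):

```r
library(tidyr)

df <- tibble::tibble(
  id    = c(1, 1, 2, 2),
  key   = c("x", "y", "x", "y"),
  value = c(10, 20, 30, 40)
)

# build_wider_spec() enumerates the output columns up front...
spec <- build_wider_spec(df, names_from = key, values_from = value)

# ...so pivot_wider_spec() already knows every column it will create,
# which is exactly the information a SQL translation needs in advance
pivot_wider_spec(df, spec)
# -> a 2-row data frame with columns id, x, y
```

The point is that all the data-dependent work (discovering the distinct `key` values) happens in build_wider_spec(); the pivot itself is then a purely mechanical transformation.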


@kmishra9

Ah, slightly disappointed to hear pivot_wider() collects first :/
Is there any performance benefit to using pivot_wider() in a dbplyr chain of verbs vs. collecting and then running pivot_wider() on the resulting tibble?

@mgirlich
Collaborator

@kmishra9 The pivot itself is computed in the database, so depending on your data and backend this might be faster. Note that with the next release of tidyr, pivot_wider_spec() will likely become a generic. Then (after another release of dbplyr) you will have two options:

  • pivot_wider(): no need to provide a specification but collects the data.
  • pivot_wider_spec(): need to provide a specification but doesn't collect the data.

@kmishra9

Oh interesting. Sorry if I'm being dense here, just trying to understand this real-world scenario better:

I've got a few massive, ~100-million-row long dbplyr tables that I want to pivot_wider() before left-joining them all together. Based on what you said, it sounds like the pivot occurs in the DB and then the wide table is the thing that is forcibly collected (optimal and much faster for my use case)? Or does it collect these massive long tables (suboptimal, and my current workflow) and then pivot_wider() locally?

If it's the first case, awesome -- I'll see if I can post some benchmarks from before and after I refactor to use dbplyr's pivot_wider()! If it's the latter, it's unfortunately not a big help/departure from my current workflow (though I appreciate the work regardless!)

@mgirlich
Collaborator

@kmishra9 pivot_wider.tbl_lazy() works as follows (I'll also add that to the documentation):

  1. Get the distinct values in the names_from column(s): collect(distinct(data, !!!syms(names_from))). This requires quite a bit of computation in the database, but very few values are actually collected.
  2. Build a pivot_wider() specification from those unique values. This step simply calls tidyr::build_wider_spec(), doesn't touch the database, and is very fast.
  3. Do the pivoting in the database and then collect the data.

So, very little computation occurs locally, but the first step requires quite a bit of calculation in your database.

When pivot_wider_spec() becomes a generic, steps 1 and 2 can be skipped. I guess for long tables you might then see a speedup of up to 50%.
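The steps above can be sketched roughly as follows (a sketch using a local RSQLite table via memdb_frame() and toy data, not dbplyr's actual internals; the dummy `value` column in step 2 is only there to satisfy build_wider_spec()'s tidyselect of a values_from column):

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr)
library(tidyr)

db <- memdb_frame(
  id    = c(1, 1, 2, 2),
  key   = c("x", "y", "x", "y"),
  value = c(10, 20, 30, 40)
)

# Step 1: only the distinct names_from values travel over the wire
keys <- db %>% distinct(key) %>% collect()

# Step 2: build the spec locally from those values -- cheap, no database work
spec <- keys %>%
  mutate(value = NA_real_) %>%
  build_wider_spec(names_from = key, values_from = value)

# Step 3: the pivot itself runs in the database as grouped
# MAX(CASE WHEN ...) aggregates
db %>%
  pivot_wider(id_cols = id, names_from = key, values_from = value) %>%
  show_query()
```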

@kmishra9

Oh excellent, that's extremely helpful -- I'll be refactoring within the next few weeks and will post any updates from my end around speedups. I'm working with fairly large data that currently involves collecting and then pivoting very long tables wider, but this seems like a much faster workflow, with the extremely important benefit, as with all of dbplyr, of reducing memory strain. Thank you for the implementation; looking forward to giving it a spin!

@kmishra9

kmishra9 commented Apr 27, 2021

Realized the refactor was (a) even simpler than I thought and (b) somewhat necessary for the scale of the datasets I'm using. Here's what I observed on an EC2 instance on the same datasets before and after using the new pivot_wider() implementation:

254 -> 176 seconds
5.0 -> 5.2 seconds
663 -> 400 seconds
3505 -> 1278 seconds
63 -> 15 seconds

So obviously a situation where YMMV, but certainly appears to be a noticeable improvement!

A question for @mgirlich: do you think more DB resources (i.e., in Redshift, more nodes/CPUs/RAM) would improve computation time, given that the pivot is calculated in the database? I would imagine so, but I'm not sure if there's anything under the hood that would prevent the natural SQL parallelization of grouping/aggregation commands that we can often take advantage of implicitly.

@mgirlich
Collaborator

The documentation was updated to make it clear that pivot_wider() is not lazy.

@hadley
Member

hadley commented Mar 8, 2023

Interestingly this is a similar approach to duckdb's: duckdb/duckdb#6387

The true magic (or hack - depending on your perspective :)) in this PR is in allowing PIVOT without specifying the column names up front. DuckDB, like most SQL engines, is completely dependent on knowing all the column names and types within a single SQL statement up front. The magic - and what happens here - is that running a PIVOT without specifying the names will result in running multiple SQL statements.
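For comparison, DuckDB's PIVOT syntax lets you omit the target columns entirely; behind the scenes it first queries the distinct key values and then generates the real pivot statement. A hedged sketch, assuming a hypothetical table `long` with columns `id`, `key`, and `value` whose keys turn out to be 'x' and 'y':

```sql
-- DuckDB shorthand: no output columns named up front
PIVOT long ON key USING max(value) GROUP BY id;

-- roughly equivalent to the explicit form once the keys are known:
SELECT id,
       max(CASE WHEN key = 'x' THEN value END) AS x,
       max(CASE WHEN key = 'y' THEN value END) AS y
FROM long
GROUP BY id;
```

This is the same two-phase trick dbplyr uses: one eager query to learn the column names, then a single ordinary GROUP BY query to do the pivot.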

@woodwards-dnz

woodwards-dnz commented Jul 13, 2023

This issue is making my very slow query run twice: once to do pivot_wider() halfway through the chain, and once at the end when I collect the results. I've spent today trying alternatives, including pivot_wider_spec() (I know the new column names), but I can't make it work (I'm getting the error below even though I know, and have checked, the column names). It would be nice if a lazy version of pivot_wider_spec() could be in dbplyr.

spec <- tibble(NAME = variables, .name = variables, .value = "VALUEI")
weathnew4 <- weathnew3 %>%
  pivot_wider_spec(spec)

Error in `data[vec_unique(c(id_cols, names_from_cols, values_from_cols, unused_cols))]`:
! Can't subset columns with `vec_unique(c(id_cols, names_from_cols, values_from_cols, unused_cols))`.
✖ Subscript `vec_unique(c(id_cols, names_from_cols, values_from_cols, unused_cols))` can't contain missing values.
✖ It has a missing value at location 1.
Run `rlang::last_trace()` to see where the error occurred.

@mgirlich
Collaborator

@woodwards-dnz Since it doesn't look like pivot_wider_spec() will become a generic soon, I exported dbplyr_pivot_wider_spec() (which already existed but wasn't exported). Note that this will be removed soon after pivot_wider_spec() becomes a generic.
Further, I opened an issue in tidyr to give a better error message if passing a lazy table to pivot_wider_spec() (tidyverse/tidyr#1510).

@woodwards-dnz

Thank you very much! I tried that, and it didn't seem to make any difference. Maybe there's another reason why it's not lazy. I guess the lesson is to try to avoid using pivot_wider() in dbplyr.

@mgirlich
Collaborator

Can you post a reprex of your code that doesn't seem to be lazy?

@woodwards-dnz


library(dplyr)
library(tidyr)
library(dbplyr)
library(RSQLite)

# not lazy
us_rent_income
spec1 <- us_rent_income %>%
  build_wider_spec(names_from = variable, values_from = estimate)
spec1
result <- us_rent_income %>%
  select(-moe) %>% 
  pivot_wider_spec(spec1)

# lazy?
us_rent_income2 <- memdb_frame(
  GEOID = "01",
  NAME = "Alabama", 
  variable = c("income", "rent"), 
  estimate = c(24476, 747),
  moe = c(136,3))
result2 <- us_rent_income2 %>%
  select(-moe) %>% 
  pivot_wider_spec(spec1)

@hadley
Member

hadley commented Jul 17, 2023

As @mgirlich said, you need to use dbplyr_pivot_wider_spec() to get laziness:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

spec <- us_rent_income %>%
  build_wider_spec(names_from = variable, values_from = estimate)

us_rent_income2 <- memdb_frame(
  GEOID = "01",
  NAME = "Alabama", 
  variable = c("income", "rent"), 
  estimate = c(24476, 747),
  moe = c(136,3)
)
us_rent_income2 %>%
  select(-moe) %>% 
  dbplyr_pivot_wider_spec(spec) |> 
  show_query()
#> <SQL>
#> SELECT
#>   `GEOID`,
#>   `NAME`,
#>   MAX(CASE WHEN (`variable` = 'income') THEN `estimate` END) AS `income`,
#>   MAX(CASE WHEN (`variable` = 'rent') THEN `estimate` END) AS `rent`
#> FROM (
#>   SELECT `GEOID`, `NAME`, `variable`, `estimate`
#>   FROM `dbplyr_001`
#> ) AS `q01`
#> GROUP BY `GEOID`, `NAME`

Created on 2023-07-17 with reprex v2.0.2

@woodwards-dnz

Oooh ... I have to load the development version of dbplyr! I wondered why I couldn't access that function. I think it works now.
