Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple warnings : "ORDER BY is ignored in subqueries without LIMIT" #1092

Closed
bobbymc0 opened this issue Jan 10, 2023 · 9 comments
Closed

Comments

@bobbymc0
Copy link

bobbymc0 commented Jan 10, 2023

ORDER BY is ignored in subqueries without LIMIT

Issue
Script output is filled with the same warning message (over and over again, +30 times) :

ORDER BY is ignored in subqueries without LIMIT
Do you need to move arrange() later in the pipeline or use window_order() instead?

I have the same issue everywhere (local, databricks, Hadoop, etc.)
I don't know exactly what is causing the warning but it appears generally when I use dplyr::group_by() with dplyr::arrange().


Reproducible example

library("dplyr"); library("sparklyr") 
spark_version <- "3.1.3"
sc <- spark_connect(master = "local", version = spark_version)

# --- coping mtcars data into spark for the example
sdf_mtcars = sdf_copy_to(sc = sc, x = mtcars, name = "sdf_mtcars")

# --- example generating the warning message (twice)
# note : I had to use mutate() twice to make the warning appear
# the warning does not appear in this simple example with if only one of the two columns cum_mpg or max_mpg is created
example_mtcars = sdf_mtcars %>%
  group_by(cyl) %>%
  arrange(mpg, .by_group = TRUE) %>%
  mutate(cum_mpg = cumsum(mpg)) %>%
  mutate(max_mpg = cummax(mpg)) %>%
  ungroup() %>% 
  sdf_collect()

Warning messages :
1: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?
2: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?


utils::sessionInfo()

  • R version 4.2.1 (2022-06-23)
  • other attached packages:
    [1] dbplyr_2.2.1 sparklyr_1.7.8 dplyr_1.0.10

(but the warning is the same with other older versions of those packages, R and spark)

Thank you.

@mgirlich
Copy link
Collaborator

I think the warning gives a pretty good hint what you should try to do. You learn the most if you try to understand what the warning is telling you and try to solve the problem yourself 😉
If you still have no success, please post what you tried out and then we can still help you 😄

@bobbymc0
Copy link
Author

bobbymc0 commented Jan 10, 2023

I don't think the warnings are supposed to be here. The code works fine, the order_by is not ignored.

I can not move "arrange() later in the pipeline" because I need to arrange() before creating the 2 columns.
I can not use window_order() because I need to arrange() by group.

Furthermore, if the 2 columns cum_mpg et max_mpg are created inside the same mutate(), then the warnings do not appear:

example_mtcars = sdf_mtcars %>%
  group_by(cyl) %>%
  arrange(mpg, .by_group = TRUE) %>%
  mutate(cum_mpg = cumsum(mpg),
  max_mpg = cummax(mpg)) %>%
  ungroup() %>% 
  sdf_collect()

@mgirlich
Copy link
Collaborator

If you create both columns in the same mutate, there is no subquery:

SELECT
  *,
  SUM(`mpg`) OVER `win1` AS `cum_mpg`,
  MAX(`mpg`) OVER `win1` AS `max_mpg`
FROM `dbplyr_001`
WINDOW `win1` AS (PARTITION BY `cyl` ORDER BY `cyl`, `mpg` ROWS UNBOUNDED PRECEDING)
ORDER BY `cyl`, `mpg`

and with the two separate mutates you get:

SELECT
  *,
  MAX(`mpg`) OVER (PARTITION BY `cyl` ORDER BY `cyl`, `mpg` ROWS UNBOUNDED PRECEDING) AS `max_mpg`
FROM (
  SELECT
    *,
    SUM(`mpg`) OVER (PARTITION BY `cyl` ORDER BY `cyl`, `mpg` ROWS UNBOUNDED PRECEDING) AS `cum_mpg`
  FROM `dbplyr_001`
  -- there would have to be an `ORDER BY` here to respect your dplyr query
)
-- not sure why `ORDER BY` is missing here, this might be a bug

I can not use window_order() because I need to arrange() by group.

You don't need it actually 😉 Neither in dplyr nor in dbplyr actually. The cumsum() is the same in both cases because in both cases per group the data is ordered the same way: by mpg. The .by_group only affects the overall order of the data frame.

library(dplyr)

mtcars |> 
  group_by(cyl) %>%
  arrange(mpg, .by_group = TRUE) %>%
  mutate(
    cum_mpg = cumsum(mpg),
    max_mpg = cummax(mpg)
  ) |> 
  ungroup()
#> # A tibble: 32 × 13
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb cum_mpg
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
#>  1  21.4     4 121     109  4.11  2.78  18.6     1     1     4     2    21.4
#>  2  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1    42.9
#>  3  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1    65.7
#>  4  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2    88.5
#>  5  24.4     4 147.     62  3.69  3.19  20       1     0     4     2   113. 
#>  6  26       4 120.     91  4.43  2.14  16.7     0     1     5     2   139. 
#>  7  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1   166. 
#>  8  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2   197. 
#>  9  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2   227  
#> 10  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1   259. 
#> # … with 22 more rows, and 1 more variable: max_mpg <dbl>

mtcars |> 
  group_by(cyl) %>%
  arrange(mpg) %>%
  mutate(
    cum_mpg = cumsum(mpg),
    max_mpg = cummax(mpg)
  ) |> 
  ungroup() |> 
  # need to `arrange()` here to have the same order as in the case above
  arrange(cyl, mpg)
#> # A tibble: 32 × 13
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb cum_mpg
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
#>  1  21.4     4 121     109  4.11  2.78  18.6     1     1     4     2    21.4
#>  2  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1    42.9
#>  3  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1    65.7
#>  4  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2    88.5
#>  5  24.4     4 147.     62  3.69  3.19  20       1     0     4     2   113. 
#>  6  26       4 120.     91  4.43  2.14  16.7     0     1     5     2   139. 
#>  7  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1   166. 
#>  8  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2   197. 
#>  9  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2   227  
#> 10  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1   259. 
#> # … with 22 more rows, and 1 more variable: max_mpg <dbl>

identical(is, max)
#> [1] FALSE

Created on 2023-01-10 with reprex v2.0.2

@hadley
Copy link
Member

hadley commented Jan 17, 2023

Isn't window_order() exaclty what you need here?

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

mtcars |>
  tbl_lazy() |> 
  window_order(mpg) |> 
  mutate(
    cum_mpg = cumsum(mpg),
    max_mpg = cummax(mpg),
    .by = cyl
  )
#> <SQL>
#> SELECT
#>   *,
#>   SUM(`mpg`) OVER (PARTITION BY `cyl` ORDER BY `mpg` ROWS UNBOUNDED PRECEDING) AS `cum_mpg`,
#>   MAX(`mpg`) OVER (PARTITION BY `cyl` ORDER BY `mpg` ROWS UNBOUNDED PRECEDING) AS `max_mpg`
#> FROM `df`

Created on 2023-01-17 with reprex v2.0.2

@fh-mthomson
Copy link
Contributor

I encountered this same puzzle and the error message was very informative -- window_order() was a perfect substitute. Accordingly I adapted my logic along the lines of:

if (inherits(., "tbl_sql")) {
  dbplyr::window_order(., x)
} else {
  dplyr::arrange(., c)
}

Would it be reasonable to embed similar behavior in the lazy tbl method for arrange()?

cc: @fh-dobeng

@hadley
Copy link
Member

hadley commented Jan 18, 2023

@fh-mthomson I don't it makes sense to change the behaviour of arrange() since using window_order() won't actually yield the same result (i.e. the rows won't be ordered). That's why we give a warning so that the user can figure out what to do.

@iangow
Copy link

iangow commented Apr 16, 2023

Isn't window_order() exaclty what you need here?

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

mtcars |>
  tbl_lazy() |> 
  window_order(mpg) |> 
  mutate(
    cum_mpg = cumsum(mpg),
    max_mpg = cummax(mpg),
    .by = cyl
  )
#> <SQL>
#> SELECT
#>   *,
#>   SUM(`mpg`) OVER (PARTITION BY `cyl` ORDER BY `mpg` ROWS UNBOUNDED PRECEDING) AS `cum_mpg`,
#>   MAX(`mpg`) OVER (PARTITION BY `cyl` ORDER BY `mpg` ROWS UNBOUNDED PRECEDING) AS `max_mpg`
#> FROM `df`

Created on 2023-01-17 with reprex v2.0.2

@hadley

Interestingly, R4DS (2e) says:

Here it’s important to arrange() the data, because SQL tables have no intrinsic order. In fact, if you don’t use arrange() you might get the rows back in a different order every time! Notice for window functions, the ordering information is repeated: the ORDER BY clause of the main query doesn’t automatically apply to window functions.

I was confused by this guidance, as I took the warnings and always use window_order() for these situations.

(But note I still get weird behaviour at times, see #1248.)

@jeffwitmer11
Copy link

jeffwitmer11 commented Jun 9, 2023

Regardless if window_order() works for the use case in this thread:

  1. The warning is unclear. The ORDER BY in the subquery is not ignored, per se, it just that the resultant data frame is not ordered. But the ordering is still happening in the groups.
  2. If dplyr::arrange() behaves differently on a Spark data frames, should this be an error instead of a warning? It feels half-baked that dplyr::arrange() can be used with Spark data frames in the same way it can be used with in-memory data frames, but only for some use cases.

@hadley
Copy link
Member

hadley commented Nov 2, 2023

Closing this thread because I believe dbplyr is already doing the best that it can here.

@hadley hadley closed this as completed Nov 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants