Multiple warnings : "ORDER BY is ignored in subqueries without LIMIT" #1092

bobbymc0 · 2023-01-10T10:43:30Z

ORDER BY is ignored in subqueries without LIMIT

Issue
Script output is filled with the same warning message (over and over again, +30 times) :

ORDER BY is ignored in subqueries without LIMIT
Do you need to move arrange() later in the pipeline or use window_order() instead?

I have the same issue everywhere (local, databricks, Hadoop, etc.)
I don't know exactly what is causing the warning but it appears generally when I use dplyr::group_by() with dplyr::arrange().

Reproducible example

library("dplyr"); library("sparklyr") 
spark_version <- "3.1.3"
sc <- spark_connect(master = "local", version = spark_version)

# --- coping mtcars data into spark for the example
sdf_mtcars = sdf_copy_to(sc = sc, x = mtcars, name = "sdf_mtcars")

# --- example generating the warning message (twice)
# note : I had to use mutate() twice to make the warning appear
# the warning does not appear in this simple example with if only one of the two columns cum_mpg or max_mpg is created
example_mtcars = sdf_mtcars %>%
  group_by(cyl) %>%
  arrange(mpg, .by_group = TRUE) %>%
  mutate(cum_mpg = cumsum(mpg)) %>%
  mutate(max_mpg = cummax(mpg)) %>%
  ungroup() %>% 
  sdf_collect()

Warning messages :
1: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?
2: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?

utils::sessionInfo()

R version 4.2.1 (2022-06-23)
other attached packages:
[1] dbplyr_2.2.1 sparklyr_1.7.8 dplyr_1.0.10

(but the warning is the same with other older versions of those packages, R and spark)

Thank you.

mgirlich · 2023-01-10T12:24:04Z

I think the warning gives a pretty good hint what you should try to do. You learn the most if you try to understand what the warning is telling you and try to solve the problem yourself 😉
If you still have no success, please post what you tried out and then we can still help you 😄

bobbymc0 · 2023-01-10T12:36:47Z

I don't think the warnings are supposed to be here. The code works fine, the order_by is not ignored.

I can not move "arrange() later in the pipeline" because I need to arrange() before creating the 2 columns.
I can not use window_order() because I need to arrange() by group.

Furthermore, if the 2 columns cum_mpg et max_mpg are created inside the same mutate(), then the warnings do not appear:

example_mtcars = sdf_mtcars %>%
  group_by(cyl) %>%
  arrange(mpg, .by_group = TRUE) %>%
  mutate(cum_mpg = cumsum(mpg),
  max_mpg = cummax(mpg)) %>%
  ungroup() %>% 
  sdf_collect()

mgirlich · 2023-01-10T12:50:16Z

If you create both columns in the same mutate, there is no subquery:

SELECT
  *,
  SUM(`mpg`) OVER `win1` AS `cum_mpg`,
  MAX(`mpg`) OVER `win1` AS `max_mpg`
FROM `dbplyr_001`
WINDOW `win1` AS (PARTITION BY `cyl` ORDER BY `cyl`, `mpg` ROWS UNBOUNDED PRECEDING)
ORDER BY `cyl`, `mpg`

and with the two separate mutates you get:

SELECT
  *,
  MAX(`mpg`) OVER (PARTITION BY `cyl` ORDER BY `cyl`, `mpg` ROWS UNBOUNDED PRECEDING) AS `max_mpg`
FROM (
  SELECT
    *,
    SUM(`mpg`) OVER (PARTITION BY `cyl` ORDER BY `cyl`, `mpg` ROWS UNBOUNDED PRECEDING) AS `cum_mpg`
  FROM `dbplyr_001`
  -- there would have to be an `ORDER BY` here to respect your dplyr query
)
-- not sure why `ORDER BY` is missing here, this might be a bug

I can not use window_order() because I need to arrange() by group.

You don't need it actually 😉 Neither in dplyr nor in dbplyr actually. The cumsum() is the same in both cases because in both cases per group the data is ordered the same way: by mpg. The .by_group only affects the overall order of the data frame.

library(dplyr)

mtcars |> 
  group_by(cyl) %>%
  arrange(mpg, .by_group = TRUE) %>%
  mutate(
    cum_mpg = cumsum(mpg),
    max_mpg = cummax(mpg)
  ) |> 
  ungroup()
#> # A tibble: 32 × 13
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb cum_mpg
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
#>  1  21.4     4 121     109  4.11  2.78  18.6     1     1     4     2    21.4
#>  2  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1    42.9
#>  3  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1    65.7
#>  4  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2    88.5
#>  5  24.4     4 147.     62  3.69  3.19  20       1     0     4     2   113. 
#>  6  26       4 120.     91  4.43  2.14  16.7     0     1     5     2   139. 
#>  7  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1   166. 
#>  8  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2   197. 
#>  9  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2   227  
#> 10  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1   259. 
#> # … with 22 more rows, and 1 more variable: max_mpg <dbl>

mtcars |> 
  group_by(cyl) %>%
  arrange(mpg) %>%
  mutate(
    cum_mpg = cumsum(mpg),
    max_mpg = cummax(mpg)
  ) |> 
  ungroup() |> 
  # need to `arrange()` here to have the same order as in the case above
  arrange(cyl, mpg)
#> # A tibble: 32 × 13
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb cum_mpg
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
#>  1  21.4     4 121     109  4.11  2.78  18.6     1     1     4     2    21.4
#>  2  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1    42.9
#>  3  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1    65.7
#>  4  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2    88.5
#>  5  24.4     4 147.     62  3.69  3.19  20       1     0     4     2   113. 
#>  6  26       4 120.     91  4.43  2.14  16.7     0     1     5     2   139. 
#>  7  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1   166. 
#>  8  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2   197. 
#>  9  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2   227  
#> 10  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1   259. 
#> # … with 22 more rows, and 1 more variable: max_mpg <dbl>

identical(is, max)
#> [1] FALSE

^{Created on 2023-01-10 with reprex v2.0.2}

hadley · 2023-01-17T15:29:45Z

Isn't window_order() exaclty what you need here?

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

mtcars |>
  tbl_lazy() |> 
  window_order(mpg) |> 
  mutate(
    cum_mpg = cumsum(mpg),
    max_mpg = cummax(mpg),
    .by = cyl
  )
#> <SQL>
#> SELECT
#>   *,
#>   SUM(`mpg`) OVER (PARTITION BY `cyl` ORDER BY `mpg` ROWS UNBOUNDED PRECEDING) AS `cum_mpg`,
#>   MAX(`mpg`) OVER (PARTITION BY `cyl` ORDER BY `mpg` ROWS UNBOUNDED PRECEDING) AS `max_mpg`
#> FROM `df`

^{Created on 2023-01-17 with reprex v2.0.2}

fh-mthomson · 2023-01-18T02:45:39Z

I encountered this same puzzle and the error message was very informative -- window_order() was a perfect substitute. Accordingly I adapted my logic along the lines of:

if (inherits(., "tbl_sql")) {
  dbplyr::window_order(., x)
} else {
  dplyr::arrange(., c)
}

Would it be reasonable to embed similar behavior in the lazy tbl method for arrange()?

cc: @fh-dobeng

hadley · 2023-01-18T13:42:35Z

@fh-mthomson I don't it makes sense to change the behaviour of arrange() since using window_order() won't actually yield the same result (i.e. the rows won't be ordered). That's why we give a warning so that the user can figure out what to do.

iangow · 2023-04-16T10:21:24Z

Isn't window_order() exaclty what you need here?

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

mtcars |>
  tbl_lazy() |> 
  window_order(mpg) |> 
  mutate(
    cum_mpg = cumsum(mpg),
    max_mpg = cummax(mpg),
    .by = cyl
  )
#> <SQL>
#> SELECT
#>   *,
#>   SUM(`mpg`) OVER (PARTITION BY `cyl` ORDER BY `mpg` ROWS UNBOUNDED PRECEDING) AS `cum_mpg`,
#>   MAX(`mpg`) OVER (PARTITION BY `cyl` ORDER BY `mpg` ROWS UNBOUNDED PRECEDING) AS `max_mpg`
#> FROM `df`

Created on 2023-01-17 with reprex v2.0.2

@hadley

Interestingly, R4DS (2e) says:

Here it’s important to arrange() the data, because SQL tables have no intrinsic order. In fact, if you don’t use arrange() you might get the rows back in a different order every time! Notice for window functions, the ordering information is repeated: the ORDER BY clause of the main query doesn’t automatically apply to window functions.

I was confused by this guidance, as I took the warnings and always use window_order() for these situations.

(But note I still get weird behaviour at times, see #1248.)

jeffwitmer11 · 2023-06-09T15:49:59Z

Regardless if window_order() works for the use case in this thread:

The warning is unclear. The ORDER BY in the subquery is not ignored, per se, it just that the resultant data frame is not ordered. But the ordering is still happening in the groups.
If dplyr::arrange() behaves differently on a Spark data frames, should this be an error instead of a warning? It feels half-baked that dplyr::arrange() can be used with Spark data frames in the same way it can be used with in-memory data frames, but only for some use cases.

hadley · 2023-11-02T21:41:22Z

Closing this thread because I believe dbplyr is already doing the best that it can here.

hadley closed this as completed Nov 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multiple warnings : "ORDER BY is ignored in subqueries without LIMIT" #1092

Multiple warnings : "ORDER BY is ignored in subqueries without LIMIT" #1092

bobbymc0 commented Jan 10, 2023 •

edited

Loading

mgirlich commented Jan 10, 2023

bobbymc0 commented Jan 10, 2023 •

edited

Loading

mgirlich commented Jan 10, 2023

hadley commented Jan 17, 2023

fh-mthomson commented Jan 18, 2023

hadley commented Jan 18, 2023

iangow commented Apr 16, 2023

jeffwitmer11 commented Jun 9, 2023 •

edited

Loading

hadley commented Nov 2, 2023

Multiple warnings : "ORDER BY is ignored in subqueries without LIMIT" #1092

Multiple warnings : "ORDER BY is ignored in subqueries without LIMIT" #1092

Comments

bobbymc0 commented Jan 10, 2023 • edited Loading

ORDER BY is ignored in subqueries without LIMIT

mgirlich commented Jan 10, 2023

bobbymc0 commented Jan 10, 2023 • edited Loading

mgirlich commented Jan 10, 2023

hadley commented Jan 17, 2023

fh-mthomson commented Jan 18, 2023

hadley commented Jan 18, 2023

iangow commented Apr 16, 2023

jeffwitmer11 commented Jun 9, 2023 • edited Loading

hadley commented Nov 2, 2023

bobbymc0 commented Jan 10, 2023 •

edited

Loading

bobbymc0 commented Jan 10, 2023 •

edited

Loading

jeffwitmer11 commented Jun 9, 2023 •

edited

Loading