-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multiple warnings : "ORDER BY is ignored in subqueries without LIMIT" #1092
Comments
I think the warning gives a pretty good hint what you should try to do. You learn the most if you try to understand what the warning is telling you and try to solve the problem yourself 😉 |
I don't think the warnings are supposed to be here. The code works fine, the order_by is not ignored. I can not move "arrange() later in the pipeline" because I need to arrange() before creating the 2 columns. Furthermore, if the 2 columns cum_mpg et max_mpg are created inside the same mutate(), then the warnings do not appear:
|
If you create both columns in the same mutate, there is no subquery: SELECT
*,
SUM(`mpg`) OVER `win1` AS `cum_mpg`,
MAX(`mpg`) OVER `win1` AS `max_mpg`
FROM `dbplyr_001`
WINDOW `win1` AS (PARTITION BY `cyl` ORDER BY `cyl`, `mpg` ROWS UNBOUNDED PRECEDING)
ORDER BY `cyl`, `mpg` and with the two separate mutates you get: SELECT
*,
MAX(`mpg`) OVER (PARTITION BY `cyl` ORDER BY `cyl`, `mpg` ROWS UNBOUNDED PRECEDING) AS `max_mpg`
FROM (
SELECT
*,
SUM(`mpg`) OVER (PARTITION BY `cyl` ORDER BY `cyl`, `mpg` ROWS UNBOUNDED PRECEDING) AS `cum_mpg`
FROM `dbplyr_001`
-- there would have to be an `ORDER BY` here to respect your dplyr query
)
-- not sure why `ORDER BY` is missing here, this might be a bug
You don't need it actually 😉 Neither in dplyr nor in dbplyr actually. The library(dplyr)
mtcars |>
group_by(cyl) %>%
arrange(mpg, .by_group = TRUE) %>%
mutate(
cum_mpg = cumsum(mpg),
max_mpg = cummax(mpg)
) |>
ungroup()
#> # A tibble: 32 × 13
#> mpg cyl disp hp drat wt qsec vs am gear carb cum_mpg
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 21.4
#> 2 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 42.9
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 65.7
#> 4 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 88.5
#> 5 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 113.
#> 6 26 4 120. 91 4.43 2.14 16.7 0 1 5 2 139.
#> 7 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 166.
#> 8 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 197.
#> 9 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2 227
#> 10 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 259.
#> # … with 22 more rows, and 1 more variable: max_mpg <dbl>
mtcars |>
group_by(cyl) %>%
arrange(mpg) %>%
mutate(
cum_mpg = cumsum(mpg),
max_mpg = cummax(mpg)
) |>
ungroup() |>
# need to `arrange()` here to have the same order as in the case above
arrange(cyl, mpg)
#> # A tibble: 32 × 13
#> mpg cyl disp hp drat wt qsec vs am gear carb cum_mpg
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 21.4
#> 2 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 42.9
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 65.7
#> 4 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 88.5
#> 5 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 113.
#> 6 26 4 120. 91 4.43 2.14 16.7 0 1 5 2 139.
#> 7 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 166.
#> 8 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 197.
#> 9 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2 227
#> 10 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 259.
#> # … with 22 more rows, and 1 more variable: max_mpg <dbl>
identical(is, max)
#> [1] FALSE Created on 2023-01-10 with reprex v2.0.2 |
Isn't library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
mtcars |>
tbl_lazy() |>
window_order(mpg) |>
mutate(
cum_mpg = cumsum(mpg),
max_mpg = cummax(mpg),
.by = cyl
)
#> <SQL>
#> SELECT
#> *,
#> SUM(`mpg`) OVER (PARTITION BY `cyl` ORDER BY `mpg` ROWS UNBOUNDED PRECEDING) AS `cum_mpg`,
#> MAX(`mpg`) OVER (PARTITION BY `cyl` ORDER BY `mpg` ROWS UNBOUNDED PRECEDING) AS `max_mpg`
#> FROM `df` Created on 2023-01-17 with reprex v2.0.2 |
I encountered this same puzzle and the error message was very informative --
Would it be reasonable to embed similar behavior in the lazy tbl method for cc: @fh-dobeng |
@fh-mthomson I don't it makes sense to change the behaviour of |
Interestingly, R4DS (2e) says:
I was confused by this guidance, as I took the warnings and always use (But note I still get weird behaviour at times, see #1248.) |
Regardless if
|
Closing this thread because I believe dbplyr is already doing the best that it can here. |
ORDER BY is ignored in subqueries without LIMIT
Issue
Script output is filled with the same warning message (over and over again, +30 times) :
ORDER BY is ignored in subqueries without LIMIT
Do you need to move arrange() later in the pipeline or use window_order() instead?
I have the same issue everywhere (local, databricks, Hadoop, etc.)
I don't know exactly what is causing the warning but it appears generally when I use dplyr::group_by() with dplyr::arrange().
Reproducible example
Warning messages :
1: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?
2: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?
utils::sessionInfo()
[1] dbplyr_2.2.1 sparklyr_1.7.8 dplyr_1.0.10
(but the warning is the same with other older versions of those packages, R and spark)
Thank you.
The text was updated successfully, but these errors were encountered: