Query statements scan one or more tables or expressions and return the computed result rows. This topic describes the syntax for SQL queries in ZetaSQL.
query_statement: query_expr query_expr: [ WITH with_query_name AS ( query_expr ) [, ...] ] { select | ( query_expr ) | query_expr set_op query_expr } [ ORDER BY expression [{ ASC | DESC }] [, ...] ] [ LIMIT count [ OFFSET skip_rows ] ] select: SELECT [ AS { typename | STRUCT | VALUE } ] [{ ALL | DISTINCT }] { [ expression. ]* [ EXCEPT ( column_name [, ...] ) ]
[ REPLACE ( expression [ AS ] column_name [, ...] ) ]
| expression [ [ AS ] alias ] } [, ...] [ WITH ANONYMIZATION OPTIONS( privacy_parameters ) ] [ FROM from_item [ tablesample_type ] [, ...] ] [ WHERE bool_expression ] [ GROUP BY { expression [, ...] | ROLLUP ( expression [, ...] ) } ] [ HAVING bool_expression ] [ WINDOW named_window_expression AS { named_window | ( [ window_definition ] ) } [, ...] ] set_op: UNION { ALL | DISTINCT } | INTERSECT { ALL | DISTINCT } | EXCEPT { ALL | DISTINCT } from_item: { table_name [ [ AS ] alias ] | join_operation | ( query_expr ) [ [ AS ] alias ] | field_path | { UNNEST( array_expression ) | UNNEST( array_path ) | array_path } [ [ AS ] alias ] [ WITH OFFSET [ [ AS ] alias ] ] | with_query_name [ [ AS ] alias ] } tablesample_type: TABLESAMPLE sample_method (sample_size percent_or_rows ) sample_method: { BERNOULLI | SYSTEM | RESERVOIR } sample_size: numeric_value_expression percent_or_rows: { PERCENT | ROWS }
Notation rules
- Square brackets "[ ]" indicate optional clauses.
- Parentheses "( )" indicate literal parentheses.
- The vertical bar "|" indicates a logical OR.
- Curly braces "{ }" enclose a set of options.
- A comma followed by an ellipsis within square brackets "[, ... ]" indicates that the preceding item can repeat in a comma-separated list.
The following tables are used to illustrate the behavior of different query clauses in this reference.
The Roster
table includes a list of player names (LastName
) and the
unique ID assigned to their school (SchoolID
). It looks like this:
+-----------------------+
| LastName | SchoolID |
+-----------------------+
| Adams | 50 |
| Buchanan | 52 |
| Coolidge | 52 |
| Davis | 51 |
| Eisenhower | 77 |
+-----------------------+
You can use this WITH
clause to emulate a temporary table name for the
examples in this reference:
WITH Roster AS
(SELECT 'Adams' as LastName, 50 as SchoolID UNION ALL
SELECT 'Buchanan', 52 UNION ALL
SELECT 'Coolidge', 52 UNION ALL
SELECT 'Davis', 51 UNION ALL
SELECT 'Eisenhower', 77)
SELECT * FROM Roster
The PlayerStats
table includes a list of player names (LastName
) and the
unique ID assigned to the opponent they played in a given game (OpponentID
)
and the number of points scored by the athlete in that game (PointsScored
).
+----------------------------------------+
| LastName | OpponentID | PointsScored |
+----------------------------------------+
| Adams | 51 | 3 |
| Buchanan | 77 | 0 |
| Coolidge | 77 | 1 |
| Adams | 52 | 4 |
| Buchanan | 50 | 13 |
+----------------------------------------+
You can use this WITH
clause to emulate a temporary table name for the
examples in this reference:
WITH PlayerStats AS
(SELECT 'Adams' as LastName, 51 as OpponentID, 3 as PointsScored UNION ALL
SELECT 'Buchanan', 77, 0 UNION ALL
SELECT 'Coolidge', 77, 1 UNION ALL
SELECT 'Adams', 52, 4 UNION ALL
SELECT 'Buchanan', 50, 13)
SELECT * FROM PlayerStats
The TeamMascot
table includes a list of unique school IDs (SchoolID
) and the
mascot for that school (Mascot
).
+---------------------+
| SchoolID | Mascot |
+---------------------+
| 50 | Jaguars |
| 51 | Knights |
| 52 | Lakers |
| 53 | Mustangs |
+---------------------+
You can use this WITH
clause to emulate a temporary table name for the
examples in this reference:
WITH TeamMascot AS
(SELECT 50 as SchoolID, 'Jaguars' as Mascot UNION ALL
SELECT 51, 'Knights' UNION ALL
SELECT 52, 'Lakers' UNION ALL
SELECT 53, 'Mustangs')
SELECT * FROM TeamMascot
SELECT [ AS { typename | STRUCT | VALUE } ] [{ ALL | DISTINCT }] { [ expression. ]* [ EXCEPT ( column_name [, ...] ) ]
[ REPLACE ( expression [ AS ] column_name [, ...] ) ]
| expression [ [ AS ] alias ] } [, ...]
The SELECT
list defines the columns that the query will return. Expressions in
the SELECT
list can refer to columns in any of the from_item
s in its
corresponding FROM
clause.
Each item in the SELECT
list is one of:
*
expression
expression.*
SELECT *
, often referred to as select star, produces one output column for
each column that is visible after executing the full query.
SELECT * FROM (SELECT "apple" AS fruit, "carrot" AS vegetable);
+-------+-----------+
| fruit | vegetable |
+-------+-----------+
| apple | carrot |
+-------+-----------+
Items in a SELECT
list can be expressions. These expressions evaluate to a
single value and produce one output column, with an optional explicit alias
.
If the expression does not have an explicit alias, it receives an implicit alias according to the rules for implicit aliases, if possible. Otherwise, the column is anonymous and you cannot refer to it by name elsewhere in the query.
An item in a SELECT
list can also take the form of expression.*
. This
produces one output column for each column or top-level field of expression
.
The expression must either be a table alias or evaluate to a single value of a
data type with fields, such as a STRUCT.
The following query produces one output column for each column in the table
groceries
, aliased as g
.
WITH groceries AS
(SELECT "milk" AS dairy,
"eggs" AS protein,
"bread" AS grain)
SELECT g.*
FROM groceries AS g;
+-------+---------+-------+
| dairy | protein | grain |
+-------+---------+-------+
| milk | eggs | bread |
+-------+---------+-------+
More examples:
WITH locations AS
(SELECT STRUCT("Seattle" AS city, "Washington" AS state) AS location
UNION ALL
SELECT STRUCT("Phoenix" AS city, "Arizona" AS state) AS location)
SELECT l.location.*
FROM locations l;
+---------+------------+
| city | state |
+---------+------------+
| Seattle | Washington |
| Phoenix | Arizona |
+---------+------------+
WITH locations AS
(SELECT ARRAY<STRUCT<city STRING, state STRING>>[("Seattle", "Washington"),
("Phoenix", "Arizona")] AS location)
SELECT l.LOCATION[offset(0)].*
FROM locations l;
+---------+------------+
| city | state |
+---------+------------+
| Seattle | Washington |
+---------+------------+
A SELECT * EXCEPT
statement specifies the names of one or more columns to
exclude from the result. All matching column names are omitted from the output.
WITH orders AS
(SELECT 5 as order_id,
"sprocket" as item_name,
200 as quantity)
SELECT * EXCEPT (order_id)
FROM orders;
+-----------+----------+
| item_name | quantity |
+-----------+----------+
| sprocket | 200 |
+-----------+----------+
Note: SELECT * EXCEPT
does not exclude columns that do not have names.
A SELECT * REPLACE
statement specifies one or more
expression AS identifier
clauses. Each identifier must match a column name
from the SELECT *
statement. In the output column list, the column that
matches the identifier in a REPLACE
clause is replaced by the expression in
that REPLACE
clause.
A SELECT * REPLACE
statement does not change the names or order of columns.
However, it can change the value and the value type.
WITH orders AS
(SELECT 5 as order_id,
"sprocket" as item_name,
200 as quantity)
SELECT * REPLACE ("widget" AS item_name)
FROM orders;
+----------+-----------+----------+
| order_id | item_name | quantity |
+----------+-----------+----------+
| 5 | widget | 200 |
+----------+-----------+----------+
WITH orders AS
(SELECT 5 as order_id,
"sprocket" as item_name,
200 as quantity)
SELECT * REPLACE (quantity/2 AS quantity)
FROM orders;
+----------+-----------+----------+
| order_id | item_name | quantity |
+----------+-----------+----------+
| 5 | sprocket | 100 |
+----------+-----------+----------+
Note: SELECT * REPLACE
does not replace columns that do not have names.
You can modify the results returned from a SELECT
query, as follows.
A SELECT DISTINCT
statement discards duplicate rows and returns only the
remaining rows. SELECT DISTINCT
cannot return columns of the following types:
PROTO
A SELECT ALL
statement returns all rows, including duplicate rows.
SELECT ALL
is the default behavior of SELECT
.
In ZetaSQL, a value table is a table where the row type is a single value. In a regular table, each row is made up of columns, each of which has a name and a type. In a value table, the row type is just a single value, and there are no column names.
Most commonly, value tables are used for protocol buffer value tables, where the table contains a stream of protocol buffer values. In this case, the top-level protocol buffer fields can be used in the same way that column names are used when querying a regular table.
In contexts where a query with exactly one column is expected, a value table query can be used instead. For example, scalar subqueries and array subqueries (see Subqueries) normally require a single-column query, but in ZetaSQL, they also allow using a value table query.
A query will produce a value table if it uses SELECT AS
, using one of the
syntaxes below:
SELECT AS STRUCT expr [[AS] struct_field_name1] [,...]
This produces a value table with a STRUCT row type,
where the STRUCT field names and types match the
column names and types produced in the SELECT
list.
Example:
SELECT ARRAY(SELECT AS STRUCT 1 a, 2 b)
SELECT AS STRUCT
can be used in a scalar or array subquery to produce a single
STRUCT type grouping multiple values together. Scalar
and array subqueries (see Subqueries) are normally not
allowed to return multiple columns, but can return a single column with
STRUCT type.
Anonymous columns are allowed.
Example:
SELECT AS STRUCT 1 x, 2, 3
The query above produces STRUCT values of type
STRUCT<int64 x, int64, int64>.
The first field has the name x
while the
second and third fields are anonymous.
The example above produces the same result as this SELECT AS VALUE
query using
a struct constructor:
SELECT AS VALUE STRUCT(1 AS x, 2, 3)
Duplicate columns are allowed.
Example:
SELECT AS STRUCT 1 x, 2 y, 3 x
The query above produces STRUCT values of type
STRUCT<int64 x, int64 y, int64 x>.
The first and third fields have the same
name x
while the second field has the name y
.
The example above produces the same result as this SELECT AS VALUE
query
using a struct constructor:
SELECT AS VALUE STRUCT(1 AS x, 2 AS y, 3 AS x)
SELECT AS VALUE
produces a value table from any SELECT
list that
produces exactly one column. Instead of producing an output table with one
column, possibly with a name, the output will be a value table where the row
type is just the value type that was produced in the one SELECT
column. Any
alias the column had will be discarded in the value table.
Example:
SELECT AS VALUE 1
The query above produces a table with row type INT64.
Example:
SELECT AS VALUE STRUCT(1 AS a, 2 AS b) xyz
The query above produces a table with row type STRUCT<a int64, b int64>
.
Example:
SELECT AS VALUE v FROM (SELECT AS STRUCT 1 a, true b) v WHERE v.b
Given a value table v
as input, the query above filters out certain values in
the WHERE
clause, and then produces a value table using the exact same value
that was in the input table. If the query above did not use SELECT AS VALUE
,
then the output table schema would differ from the input table schema because
the output table would be a regular table with a column named v
containing the
input value.
See Using Aliases for information on syntax and visibility for
SELECT
list aliases.
from_item: { table_name [ [ AS ] alias ] | join_operation | ( query_expr ) [ [ AS ] alias ] | field_path | { UNNEST( array_expression ) | UNNEST( array_path ) | array_path } [ [ AS ] alias ] [ WITH OFFSET [ [ AS ] alias ] ] | with_query_name [ [ AS ] alias ] }
The FROM
clause indicates the table or tables from which to retrieve rows, and
specifies how to join those rows together to produce a single stream of
rows for processing in the rest of the query.
The name (optionally qualified) of an existing table.
SELECT * FROM Roster; SELECT * FROM db.Roster;
See JOIN operation.
( select ) [ [ AS ] alias ]
is a table subquery.
In the FROM
clause, field_path
is any path that
resolves to a field within a data type. field_path
can go
arbitrarily deep into a nested data structure.
Some examples of valid field_path
values include:
SELECT * FROM T1 t1, t1.array_column;
SELECT * FROM T1 t1, t1.struct_column.array_field;
SELECT (SELECT ARRAY_AGG(c) FROM t1.array_column c) FROM T1 t1;
SELECT a.struct_field1 FROM T1 t1, t1.array_of_structs a;
SELECT (SELECT STRING_AGG(a.struct_field1) FROM t1.array_of_structs a) FROM T1 t1;
Field paths in the FROM
clause must end in an
array or a repeated field. In
addition, field paths cannot contain arrays
or repeated fields before the end of the path. For example, the path
array_column.some_array.some_array_field
is invalid because it
contains an array before the end of the path.
Note: If a path has only one name, it is interpreted as a table.
To work around this, wrap the path using UNNEST
, or use the
fully-qualified path.
Note: If a path has more than one name, and it matches a field
name, it is interpreted as a field name. To force the path to be interpreted as
a table name, wrap the path using `
.
The UNNEST
operator takes an ARRAY
and returns a
table, with one row for each element in the ARRAY
.
You can also use UNNEST
outside of the FROM
clause with the
IN
operator.
For input ARRAY
s of most element types, the output of UNNEST
generally has
one column. This single column has an optional alias
, which you can use to
refer to the column elsewhere in the query. ARRAYS
with these element types
return multiple columns:
- STRUCT
- PROTO
UNNEST
destroys the order of elements in the input
ARRAY
. Use the optional WITH OFFSET
clause to
return a second column with the array element indexes (see following).
For several ways to use UNNEST
, including construction, flattening, and
filtering, see Working with arrays
.
For an input ARRAY
of STRUCT
s, UNNEST
returns a row for each STRUCT
, with a separate column for each field in the
STRUCT
. The alias for each column is the name of the corresponding STRUCT
field.
Example:
SELECT *
FROM UNNEST(ARRAY<STRUCT<x INT64, y STRING>>[(1, 'foo'), (3, 'bar')]);
+---+-----+
| x | y |
+---+-----+
| 3 | bar |
| 1 | foo |
+---+-----+
Because the UNNEST
operator returns a
value table,
you can alias UNNEST
to define a range variable that you can reference
elsewhere in the query. If you reference the range variable in the SELECT
list, the query returns a STRUCT
containing all of the fields of the original
STRUCT
in the input table.
Example:
SELECT *, struct_value
FROM UNNEST(ARRAY<STRUCT<x INT64, y STRING>>[(1, 'foo'), (3, 'bar')])
AS struct_value;
+---+-----+--------------+
| x | y | struct_value |
+---+-----+--------------+
| 3 | bar | {3, bar} |
| 1 | foo | {1, foo} |
+---+-----+--------------+
For an input ARRAY
of PROTO
s, UNNEST
returns a row for each PROTO
, with a separate column for each field in the
PROTO
. The alias for each column is the name of the corresponding PROTO
field.
Example:
SELECT *
FROM UNNEST(
ARRAY<zetasql.examples.music.Album>[
NEW zetasql.examples.music.Album (
'The Goldberg Variations' AS album_name,
['Aria', 'Variation 1', 'Variation 2'] AS song
)
]
);
+-------------------------+--------+----------------------------------+
| album_name | singer | song |
+-------------------------+--------+----------------------------------+
| The Goldberg Variations | NULL | [Aria, Variation 1, Variation 2] |
+-------------------------+--------+----------------------------------+
As with STRUCT
s, you can alias UNNEST
here to define a range variable. You
can reference this alias in the SELECT
list to return a value table where each
row is a PROTO
element from the ARRAY
.
SELECT proto_value
FROM UNNEST(
ARRAY<zetasql.examples.music.Album>[
NEW zetasql.examples.music.Album (
'The Goldberg Variations' AS album_name,
['Aria', 'Var. 1'] AS song
)
]
) AS proto_value;
+---------------------------------------------------------------------+
| proto_value |
+---------------------------------------------------------------------+
| {album_name: "The Goldberg Variations" song: "Aria" song: "Var. 1"} |
+---------------------------------------------------------------------+
ARRAY
unnesting can be either explicit or implicit.
In explicit unnesting, array_expression
must return an
ARRAY
value but does not need to resolve to an ARRAY
, and the UNNEST
keyword is required.
Example:
SELECT * FROM UNNEST ([1, 2, 3]);
In implicit unnesting, array_path
must resolve to an ARRAY
and the
UNNEST
keyword is optional.
Example:
SELECT x
FROM mytable AS t,
t.struct_typed_column.array_typed_field1 AS x;
In this scenario, array_path
can go arbitrarily deep into a data
structure, but the last field must be ARRAY
-typed. No previous field in the
expression can be ARRAY
-typed because it is not possible to extract a named
field from an ARRAY
.
The UNNEST
operator accepts a flatten path
as its argument for array_expression
. When the argument is a flatten path, the
UNNEST
operator produces one row for each element in the array that results
from applying the FLATTEN
operator to the flatten path.
To learn more about the relationship between these operators and flattening,
see Flattening tree-structured data into arrays.
UNNEST
treats NULLs as follows:
- NULL and empty arrays produces zero rows.
- An array containing NULLs produces rows containing NULL values.
The optional WITH OFFSET
clause returns a separate
column containing the "offset" value (i.e. counting starts at zero) for each row
produced by the UNNEST
operation. This column has an optional
alias
; the default alias is offset.
Example:
SELECT * FROM UNNEST ( ) WITH OFFSET AS num;
The query names in a WITH
clause (see WITH Clause) act like
names of temporary tables that you can reference anywhere in the FROM
clause.
In the example below, subQ1
and subQ2
are with_query_names
.
Example:
WITH
subQ1 AS (SELECT * FROM Roster WHERE SchoolID = 52),
subQ2 AS (SELECT SchoolID FROM subQ1)
SELECT DISTINCT * FROM subQ2;
The WITH
clause hides any permanent tables with the same name
for the duration of the query, unless you qualify the table name, for example:
db.Roster
.
tablesample_type:
TABLESAMPLE sample_method (sample_size percent_or_rows [ partition_by ])
[ REPEATABLE(repeat_argument) ]
[ WITH WEIGHT [AS alias] ]
sample_method:
{ BERNOULLI | SYSTEM | RESERVOIR }
sample_size:
numeric_value_expression
percent_or_rows:
{ PERCENT | ROWS }
partition_by:
PARTITION BY partition_expression [, ...]
Description
You can use the TABLESAMPLE
operator to select a random sample of a data
set. This operator is useful when working with tables that have large amounts of
data and precise answers are not required.
sample_method
: When using theTABLESAMPLE
operator, you must specify the sampling algorithm to use:BERNOULLI
: Each row is independently selected with the probability given in thepercent
clause. As a result, you get approximatelyN * percent/100
rows.SYSTEM
: Produces a sample using an unspecified engine-dependent method, which may be more efficient but less probabilistically random than other methods. For example, it could choose random disk blocks and return data from those blocks.RESERVOIR
: Takes as parameter an actual sample size K (expressed as a number of rows). If the input is smaller than K, it outputs the entire input relation. If the input is larger than K, reservoir sampling outputs a sample of size exactly K, where any sample of size K is equally likely.
sample_size
: The size of the sample.percent_or_rows
: TheTABLESAMPLE
operator requires that you choose eitherROWS
orPERCENT
. If you choosePERCENT
, the value must be between 0 and 100. If you chooseROWS
, the value must be greater than or equal to 0.partition_by
: Optional. Perform stratefied sampling for each distinct group identified by thePARTITION BY
clause. That is, if the number of rows in a particular group is less than the specified row count, all rows in that group are assigned to the sample. Otherwise, it randomly selects the specified number of rows for each group, where for a particular group, every sample of that size is equally likely.REPEATABLE
: Optional. When it is used, repeated executions of the sampling operation return a result table with identical rows for a given repeat argument, as long as the underlying data does not change.repeat_argument
represents a sampling seed and must be a positive value of typeINT64
.WITH WEIGHT
: Optional. Retrieves scaling weight. If specified, theTableSample
operator outputs one extra column of typeDOUBLE
that is greater than or equal 1.0 to represent the actual scaling weight. If an alias is not provided, the default name weight is used.- In Bernoulli sampling, the weight is
1 / provided sampling probability
. For example,TABLESAMPLE BERNOULLI (1 percent)
will expose the weight of1 / 0.01
. - In System sampling, the weight is approximated or computed exactly in some engine-defined way, as long as its type and value range is specified.
- In non-stratified fixed row count sampling, (RESERVOIR without the PARTITION BY clause), the weight is equal to the total number of input rows divided by the count of sampled rows.
- In stratified sampling, (RESERVOIR with the PARTITION BY clause), the weight for rows from a particular group is equal to the group cardinality divided by the count of sampled rows for that group.
- In Bernoulli sampling, the weight is
Examples
The following examples illustrate the use of the TABLESAMPLE
operator.
Select from a table using the RESERVOIR
sampling method:
SELECT MessageId
FROM Messages TABLESAMPLE RESERVOIR (100 ROWS);
Select from a table using the BERNOULLI
sampling method:
SELECT MessageId
FROM Messages TABLESAMPLE BERNOULLI (0.1 PERCENT);
Use TABLESAMPLE
with a repeat argument:
SELECT MessageId
FROM Messages TABLESAMPLE RESERVOIR (100 ROWS) REPEATABLE(10);
Use TABLESAMPLE
with a subquery:
SELECT Subject FROM
(SELECT MessageId, Subject FROM Messages WHERE ServerId="test")
TABLESAMPLE BERNOULLI(50 PERCENT)
WHERE MessageId > 3;
Use a TABLESAMPLE
operation with a join to another table.
SELECT S.Subject
FROM
(SELECT MessageId, ThreadId FROM Messages WHERE ServerId="test") AS R
TABLESAMPLE RESERVOIR(5 ROWS),
Threads AS S
WHERE S.ServerId="test" AND R.ThreadId = S.ThreadId;
Group results by country, using stratefied sampling:
SELECT country, SUM(click_cost) FROM ClickEvents
TABLESAMPLE RESERVOIR (100 ROWS PARTITION BY country)
GROUP BY country;
Add scaling weight to stratefied sampling:
SELECT country, SUM(click_cost * sampling_weight) FROM ClickEvents
TABLESAMPLE RESERVOIR (100 ROWS PARTITION BY country)
WITH WEIGHT AS sampling_weight
GROUP BY country;
This is equivelant to the previous example. Note that you don't have to use
an alias after WITH WEIGHT
. If you don't, the default alias weight
is used.
SELECT country, SUM(click_cost * weight) FROM ClickEvents
TABLESAMPLE RESERVOIR (100 ROWS PARTITION BY country)
WITH WEIGHT
GROUP BY country;
If you want better quality generated samples for under-represented groups,
you can use stratefied sampling. Stratefied sampling helps you
avoid samples with missing groups. To allow stratified sampling per
distinct group, use PARTITION BY
with RESERVOIR
in the TABLESAMPLE
clause.
Stratified sampling performs RESERVOIR
sampling for each distinct group
identified by the PARTITION BY
clause. If the number of rows in a particular
group is less than the specified row count, all rows in that group are assigned
to the sample. Otherwise, it randomly selects the specified number of rows for
each group, where for a particular group, every sample of that size is equally
likely.
Example
Let’s consider a table named ClickEvents
representing a stream of
click events, each of which has two fields: country
and click_cost
.
country
represents the country from which the click was originated
and click_cost
represents how much the click costs. In this example,
100 rows are randomly selected for each country.
SELECT click_cost, country FROM ClickEvents
TABLESAMPLE RESERVOIR (100 ROWS PARTITION BY country)
With scaling weight, you can perform fast and reasonable population estimates
from generated samples or estimate the aggregate results from samples. You can
capture scaling weight for a tablesample with the WITH WEIGHT
clause.
Scaling weight represents the reciprocal of the actual, observed sampling
rate for a tablesample, making it easier to estimate aggregate results for
samples. The exposition of scaling weight generally applies to all variations
of TABLESAMPLE
, including stratified Reservoir, non-stratified Reservoir,
Bernoulli, and System.
Let’s consider a table named ClickEvents
representing a stream of
click events, each of which has two fields: country
and click_cost
.
country
represents the country from which the click was originated
and click_cost
represents how much the click costs. To calculate the
total click cost per country, you can use the following query:
SELECT country, SUM(click_cost)
FROM ClickEvents
GROUP BY country;
You can leverage the existing uniform sampling with fixed probability, using Bernoulli sampling and run this query to estimate the result of the previous query:
SELECT country, SUM(click_cost * weight)
FROM ClickEvents TABLESAMPLE BERNOULLI (1 PERCENT)
WITH WEIGHT
GROUP BY country;
You can break the second query into two steps:
- Materialize a sample for reuse.
- Perform aggregate estimates of the materialized sample.
Instead of aggregating the entire table, you use a 1% uniform sample to aggregate a fraction of the original table and to compute the total click cost. Because only 1% of the rows flow into the aggregation operator, you need to scale the aggregate with a certain weight. Specifically, we multiply the aggregate with 100, the reciprocal of the provided sampling probability, for each group. And because we use uniform sampling, the scaling weight for each group is effectively equal to the scaling weight for each row of the table, which is 100.
Even though this sample provides a statistically accurate representation
of the original table, it might miss an entire group of rows, such as countries
in the running example, with small cardinality. For example, suppose that
the ClickEvents
table contains 10000 rows, with 9990 rows of value US
and 10 rows of value VN
. The number of distinct countries in this example
is two. With 1% uniform sampling, it is statistically probable that all the
sampled rows are from the US
and none of them are from the VN
partition.
As a result, the output of the second query does not contain the SUM
estimate
for the group VN
. We refer to this as the missing-group problem, which
can be solved with stratefied sampling.
See Using Aliases for information on syntax and visibility for
FROM
clause aliases.
join_operation: { cross_join_operation | join_operation_with_condition } cross_join_operation: from_item CROSS JOIN from_item join_operation_with_condition: from_item [ join_type ] JOIN from_item [ { on_clause | using_clause } ] join_type: { [INNER] | CROSS | FULL [OUTER] | LEFT [OUTER] | RIGHT [OUTER] } on_clause: ON bool_expression using_clause: USING ( join_column [, ...] )
The JOIN
operation merges two from_item
s so that the SELECT
clause can
query them as one source. The join_type
and ON
or USING
clause (a
"join condition") specify how to combine and discard rows from the two
from_item
s to form a single source.
All JOIN
operations require a join_type
. If no join_type
is provided with
a JOIN
operation, an INNER JOIN
is performed.
A JOIN
operation requires a join condition unless one of the following conditions
is true:
join_type
isCROSS
.- One or both of the
from_item
s is not a table, for example, anarray_path
orfield_path
.
An INNER JOIN
, or simply JOIN
, effectively calculates the Cartesian product
of the two from_item
s and discards all rows that do not meet the join
condition. "Effectively" means that it is possible to implement an INNER JOIN
without actually calculating the Cartesian product.
FROM A INNER JOIN B ON A.w = B.y
Table A Table B Result
+-------+ +-------+ +---------------+
| w | x | * | y | z | = | w | x | y | z |
+-------+ +-------+ +---------------+
| 1 | a | | 2 | k | | 2 | b | 2 | k |
| 2 | b | | 3 | m | | 3 | c | 3 | m |
| 3 | c | | 3 | n | | 3 | c | 3 | n |
| 3 | d | | 4 | p | | 3 | d | 3 | m |
+-------+ +-------+ | 3 | d | 3 | n |
+---------------+
FROM A INNER JOIN B USING (x)
Table A Table B Result
+-------+ +-------+ +-----------+
| x | y | * | x | z | = | x | y | z |
+-------+ +-------+ +-----------+
| 1 | a | | 2 | k | | 2 | b | k |
| 2 | b | | 3 | m | | 3 | c | m |
| 3 | c | | 3 | n | | 3 | c | n |
| 3 | d | | 4 | p | | 3 | d | m |
+-------+ +-------+ | 3 | d | n |
+-----------+
Example
This query performs an INNER JOIN
on the Roster
and TeamMascot
tables.
SELECT Roster.LastName, TeamMascot.Mascot
FROM Roster JOIN TeamMascot ON Roster.SchoolID = TeamMascot.SchoolID;
+---------------------------+
| LastName | Mascot |
+---------------------------+
| Adams | Jaguars |
| Buchanan | Lakers |
| Coolidge | Lakers |
| Davis | Knights |
+---------------------------+
CROSS JOIN
returns the Cartesian product of the two from_item
s. In other
words, it combines each row from the first from_item
with each row from the
second from_item
.
If the rows of the two from_item
s are independent, then the result has M *
N rows, given M rows in one from_item
and N in the other. Note that this
still holds for the case when either from_item
has zero rows.
FROM A CROSS JOIN B
Table A Table B Result
+-------+ +-------+ +---------------+
| w | x | * | y | z | = | w | x | y | z |
+-------+ +-------+ +---------------+
| 1 | a | | 2 | c | | 1 | a | 2 | c |
| 2 | b | | 3 | d | | 1 | a | 3 | d |
+-------+ +-------+ | 2 | b | 2 | c |
| 2 | b | 3 | d |
+---------------+
You can use correlated CROSS JOIN
s to
flatten ARRAY
columns. In this case, the rows of the
second from_item
vary for each row of the first from_item
.
FROM A CROSS JOIN A.y
Table A Result
+-------------------+ +-----------+
| w | x | y | -> | w | x | y |
+-------------------+ +-----------+
| 1 | a | [P, Q] | | 1 | a | P |
| 2 | b | [R, S, T] | | 1 | a | Q |
+-------------------+ | 2 | b | R |
| 2 | b | S |
| 2 | b | T |
+-----------+
CROSS JOIN
s can be written explicitly like this:
FROM a CROSS JOIN b
Or implicitly as a comma cross join like this:
FROM a, b
You cannot write comma cross joins inside parentheses:
FROM a CROSS JOIN (b, c) // INVALID
See Sequences of JOINs for details on how a comma cross join behaves in a sequence of JOINs.
Examples
This query performs an explicit CROSS JOIN
on the Roster
and TeamMascot
tables.
SELECT Roster.LastName, TeamMascot.Mascot
FROM Roster CROSS JOIN TeamMascot;
+---------------------------+
| LastName | Mascot |
+---------------------------+
| Adams | Jaguars |
| Adams | Knights |
| Adams | Lakers |
| Adams | Mustangs |
| Buchanan | Jaguars |
| Buchanan | Knights |
| Buchanan | Lakers |
| Buchanan | Mustangs |
| ... |
+---------------------------+
This query performs a comma cross join that produces the same results as the
explicit CROSS JOIN
above:
SELECT Roster.LastName, TeamMascot.Mascot
FROM Roster, TeamMascot;
A FULL OUTER JOIN
(or simply FULL JOIN
) returns all fields for all rows in
both from_item
s that meet the join condition.
FULL
indicates that all rows from both from_item
s are
returned, even if they do not meet the join condition.
OUTER
indicates that if a given row from one from_item
does not
join to any row in the other from_item
, the row will return with NULLs
for all columns from the other from_item
.
FROM A FULL OUTER JOIN B ON A.w = B.y
Table A Table B Result
+-------+ +-------+ +---------------------------+
| w | x | * | y | z | = | w | x | y | z |
+-------+ +-------+ +---------------------------+
| 1 | a | | 2 | k | | 1 | a | NULL | NULL |
| 2 | b | | 3 | m | | 2 | b | 2 | k |
| 3 | c | | 3 | n | | 3 | c | 3 | m |
| 3 | d | | 4 | p | | 3 | c | 3 | n |
+-------+ +-------+ | 3 | d | 3 | m |
| 3 | d | 3 | n |
| NULL | NULL | 4 | p |
+---------------------------+
FROM A FULL OUTER JOIN B USING (x)
Table A Table B Result
+-------+ +-------+ +--------------------+
| x | y | * | x | z | = | x | y | z |
+-------+ +-------+ +--------------------+
| 1 | a | | 2 | k | | 1 | a | NULL |
| 2 | b | | 3 | m | | 2 | b | k |
| 3 | c | | 3 | n | | 3 | c | m |
| 3 | d | | 4 | p | | 3 | c | n |
+-------+ +-------+ | 3 | d | m |
| 3 | d | n |
| 4 | NULL | p |
+--------------------+
Example
This query performs a FULL JOIN
on the Roster
and TeamMascot
tables.
SELECT Roster.LastName, TeamMascot.Mascot
FROM Roster FULL JOIN TeamMascot ON Roster.SchoolID = TeamMascot.SchoolID;
+---------------------------+
| LastName | Mascot |
+---------------------------+
| Adams | Jaguars |
| Buchanan | Lakers |
| Coolidge | Lakers |
| Davis | Knights |
| Eisenhower | NULL |
| NULL | Mustangs |
+---------------------------+
The result of a LEFT OUTER JOIN
(or simply LEFT JOIN
) for two
from_item
s always retains all rows of the left from_item
in the
JOIN
operation, even if no rows in the right from_item
satisfy the join
predicate.
LEFT
indicates that all rows from the left from_item
are
returned; if a given row from the left from_item
does not join to any row
in the right from_item
, the row will return with NULLs for all
columns from the right from_item
. Rows from the right from_item
that
do not join to any row in the left from_item
are discarded.
FROM A LEFT OUTER JOIN B ON A.w = B.y
Table A Table B Result
+-------+ +-------+ +---------------------------+
| w | x | * | y | z | = | w | x | y | z |
+-------+ +-------+ +---------------------------+
| 1 | a | | 2 | k | | 1 | a | NULL | NULL |
| 2 | b | | 3 | m | | 2 | b | 2 | k |
| 3 | c | | 3 | n | | 3 | c | 3 | m |
| 3 | d | | 4 | p | | 3 | c | 3 | n |
+-------+ +-------+ | 3 | d | 3 | m |
| 3 | d | 3 | n |
+---------------------------+
FROM A LEFT OUTER JOIN B USING (x)
Table A Table B Result
+-------+ +-------+ +--------------------+
| x | y | * | x | z | = | x | y | z |
+-------+ +-------+ +--------------------+
| 1 | a | | 2 | k | | 1 | a | NULL |
| 2 | b | | 3 | m | | 2 | b | k |
| 3 | c | | 3 | n | | 3 | c | m |
| 3 | d | | 4 | p | | 3 | c | n |
+-------+ +-------+ | 3 | d | m |
| 3 | d | n |
+--------------------+
Example
This query performs a LEFT JOIN
on the Roster
and TeamMascot
tables.
SELECT Roster.LastName, TeamMascot.Mascot
FROM Roster LEFT JOIN TeamMascot ON Roster.SchoolID = TeamMascot.SchoolID;
+---------------------------+
| LastName | Mascot |
+---------------------------+
| Adams | Jaguars |
| Buchanan | Lakers |
| Coolidge | Lakers |
| Davis | Knights |
| Eisenhower | NULL |
+---------------------------+
The result of a RIGHT OUTER JOIN
(or simply RIGHT JOIN
) is similar and
symmetric to that of LEFT OUTER JOIN
.
FROM A RIGHT OUTER JOIN B ON A.w = B.y
Table A Table B Result
+-------+ +-------+ +---------------------------+
| w | x | * | y | z | = | w | x | y | z |
+-------+ +-------+ +---------------------------+
| 1 | a | | 2 | k | | 2 | b | 2 | k |
| 2 | b | | 3 | m | | 3 | c | 3 | m |
| 3 | c | | 3 | n | | 3 | c | 3 | n |
| 3 | d | | 4 | p | | 3 | d | 3 | m |
+-------+ +-------+ | 3 | d | 3 | n |
| NULL | NULL | 4 | p |
+---------------------------+
FROM A RIGHT OUTER JOIN B USING (x)
Table A Table B Result
+-------+ +-------+ +--------------------+
| x | y | * | x | z | = | x | y | z |
+-------+ +-------+ +--------------------+
| 1 | a | | 2 | k | | 2 | b | k |
| 2 | b | | 3 | m | | 3 | c | m |
| 3 | c | | 3 | n | | 3 | c | n |
| 3 | d | | 4 | p | | 3 | d | m |
+-------+ +-------+ | 3 | d | n |
| 4 | NULL | p |
+--------------------+
Example
This query performs a RIGHT JOIN
on the Roster
and TeamMascot
tables.
SELECT Roster.LastName, TeamMascot.Mascot
FROM Roster RIGHT JOIN TeamMascot ON Roster.SchoolID = TeamMascot.SchoolID;
+---------------------------+
| LastName | Mascot |
+---------------------------+
| Adams | Jaguars |
| Buchanan | Lakers |
| Coolidge | Lakers |
| Davis | Knights |
| NULL | Mustangs |
+---------------------------+
The ON
clause contains a bool_expression
. A combined row (the result of
joining two rows) meets the join condition if bool_expression
returns
TRUE.
FROM A JOIN B ON A.x = B.x
Table A Table B Result (A.x, B.x)
+---+ +---+ +-------+
| x | * | x | = | x | x |
+---+ +---+ +-------+
| 1 | | 2 | | 2 | 2 |
| 2 | | 3 | | 3 | 3 |
| 3 | | 4 | +-------+
+---+ +---+
Example
This query performs an INNER JOIN
on the
Roster
and TeamMascot
table.
SELECT Roster.LastName, TeamMascot.Mascot
FROM Roster JOIN TeamMascot ON Roster.SchoolID = TeamMascot.SchoolID;
+---------------------------+
| LastName | Mascot |
+---------------------------+
| Adams | Jaguars |
| Buchanan | Lakers |
| Coolidge | Lakers |
| Davis | Knights |
+---------------------------+
The USING
clause requires a column_list
of one or more columns which
occur in both input tables. It performs an equality comparison on that column,
and the rows meet the join condition if the equality comparison returns TRUE.
FROM A JOIN B USING (x)
Table A Table B Result
+---+ +---+ +---+
| x | * | x | = | x |
+---+ +---+ +---+
| 1 | | 2 | | 2 |
| 2 | | 3 | | 3 |
| 3 | | 4 | +---+
+---+ +---+
NOTE: The USING
keyword is not supported in
strict
mode.
Example
This query performs an INNER JOIN
on the
Roster
and TeamMascot
table.
This statement returns the rows from Roster
and TeamMascot
where
Roster.SchooldID
is the same as TeamMascot.SchooldID
. The results include
a single SchooldID
column.
SELECT * FROM Roster INNER JOIN TeamMascot USING (SchoolID);
+----------------------------------------+
| SchoolID | LastName | Mascot |
+----------------------------------------+
| 50 | Adams | Jaguars |
| 52 | Buchanan | Lakers |
| 52 | Coolidge | Lakers |
| 51 | Davis | Knights |
+----------------------------------------+
The ON
and USING
keywords are not equivalent, but they are similar.
ON
returns multiple columns, and USING
returns one.
FROM A JOIN B ON A.x = B.x
FROM A JOIN B USING (x)
Table A Table B Result ON Result USING
+---+ +---+ +-------+ +---+
| x | * | x | = | x | x | | x |
+---+ +---+ +-------+ +---+
| 1 | | 2 | | 2 | 2 | | 2 |
| 2 | | 3 | | 3 | 3 | | 3 |
| 3 | | 4 | +-------+ +---+
+---+ +---+
Although ON
and USING
are not equivalent, they can return the same results
if you specify the columns you want to return.
SELECT x FROM A JOIN B USING (x);
SELECT A.x FROM A JOIN B ON A.x = B.x;
Table A Table B Result
+---+ +---+ +---+
| x | * | x | = | x |
+---+ +---+ +---+
| 1 | | 2 | | 2 |
| 2 | | 3 | | 3 |
| 3 | | 4 | +---+
+---+ +---+
The FROM
clause can contain multiple JOIN
operations in a sequence.
JOIN
s are bound from left to right. For example:
FROM A JOIN B USING (x) JOIN C USING (x)
-- A JOIN B USING (x) = result_1
-- result_1 JOIN C USING (x) = result_2
-- result_2 = return value
You can also insert parentheses to group JOIN
s:
FROM ( (A JOIN B USING (x)) JOIN C USING (x) )
-- A JOIN B USING (x) = result_1
-- result_1 JOIN C USING (x) = result_2
-- result_2 = return value
With parentheses, you can group JOIN
s so that they are bound in a different
order:
FROM ( A JOIN (B JOIN C USING (x)) USING (x) )
-- B JOIN C USING (x) = result_1
-- A JOIN result_1 = result_2
-- result_2 = return value
A FROM
clause can have multiple joins. Provided there are no comma joins in
the FROM
clause, joins do not require parenthesis, though parenthesis can
help readability:
FROM A JOIN B JOIN C JOIN D USING (w) ON B.x = C.y ON A.z = B.x
If your clause contains comma joins, you must use parentheses:
FROM A, B JOIN C JOIN D ON C.x = D.y ON B.z = C.x // INVALID
FROM A, B JOIN (C JOIN D ON C.x = D.y) ON B.z = C.x // VALID
When comma cross joins are present in a query with a sequence of JOINs, they
group from left to right like other JOIN
types:
FROM A JOIN B USING (x) JOIN C USING (x), D
-- A JOIN B USING (x) = result_1
-- result_1 JOIN C USING (x) = result_2
-- result_2 CROSS JOIN D = return value
There cannot be a RIGHT JOIN
or FULL JOIN
after a comma join:
FROM A, B RIGHT JOIN C ON TRUE // INVALID
FROM A, B FULL JOIN C ON TRUE // INVALID
FROM A, B JOIN C ON TRUE // VALID
WHERE bool_expression
The WHERE
clause filters out rows by evaluating each row against
bool_expression
, and discards all rows that do not return TRUE (that is,
rows that return FALSE or NULL).
Example:
SELECT * FROM Roster
WHERE SchoolID = 52;
The bool_expression
can contain multiple sub-conditions.
Example:
SELECT * FROM Roster
WHERE STARTS_WITH(LastName, "Mc") OR STARTS_WITH(LastName, "Mac");
You cannot reference column aliases from the SELECT
list in the WHERE
clause.
Expressions in an INNER JOIN
have an equivalent expression in the
WHERE
clause. For example, a query using INNER
JOIN
and ON
has an
equivalent expression using CROSS JOIN
and WHERE
.
Example - this query:
SELECT Roster.LastName, TeamMascot.Mascot
FROM Roster INNER JOIN TeamMascot
ON Roster.SchoolID = TeamMascot.SchoolID;
is equivalent to:
SELECT Roster.LastName, TeamMascot.Mascot
FROM Roster CROSS JOIN TeamMascot
WHERE Roster.SchoolID = TeamMascot.SchoolID;
GROUP BY { expression [, ...] | ROLLUP ( expression [, ...] ) }
The GROUP BY
clause groups together rows in a table with non-distinct values
for the expression
in the GROUP BY
clause. For multiple rows in the
source table with non-distinct values for expression
, the
GROUP BY
clause produces a single combined row. GROUP BY
is commonly used
when aggregate functions are present in the SELECT
list, or to eliminate
redundancy in the output. The data type of expression
must be groupable.
Example:
SELECT SUM(PointsScored), LastName
FROM PlayerStats
GROUP BY LastName;
The GROUP BY
clause can refer to expression names in the SELECT
list. The
GROUP BY
clause also allows ordinal references to expressions in the SELECT
list using integer values. 1
refers to the first expression in the
SELECT
list, 2
the second, and so forth. The expression list can combine
ordinals and expression names.
Example:
SELECT SUM(PointsScored), LastName, FirstName
FROM PlayerStats
GROUP BY LastName, FirstName;
The query above is equivalent to:
SELECT SUM(PointsScored), LastName, FirstName
FROM PlayerStats
GROUP BY 2, FirstName;
GROUP BY
clauses may also refer to aliases. If a query contains aliases in
the SELECT
clause, those aliases override names in the corresponding FROM
clause.
Example:
SELECT SUM(PointsScored), LastName as last_name
FROM PlayerStats
GROUP BY last_name;
GROUP BY
can group rows by the value of an ARRAY
.
GROUP BY
will group two arrays if they have the same number of elements and
all corresponding elements are in the same groups, or if both arrays are null.
GROUP BY ROLLUP
returns the results of GROUP BY
for
prefixes of the expressions in the ROLLUP
list, each of which is known as a
grouping set. For the ROLLUP
list (a, b, c)
, the grouping sets are
(a, b, c)
, (a, b)
, (a)
, ()
. When evaluating the results of GROUP BY
for a particular grouping set, GROUP BY ROLLUP
treats expressions that are not
in the grouping set as having a NULL
value. A SELECT
statement like this
one:
SELECT a, b, SUM(c) FROM Input GROUP BY ROLLUP(a, b);
uses the rollup list (a, b)
. The result will include the
results of GROUP BY
for the grouping sets (a, b)
, (a)
, and ()
, which
includes all rows. This returns the same rows as:
SELECT NULL, NULL, SUM(c) FROM Input UNION ALL
SELECT a, NULL, SUM(c) FROM Input GROUP BY a UNION ALL
SELECT a, b, SUM(c) FROM Input GROUP BY a, b;
This allows the computation of aggregates for the grouping sets defined by the
expressions in the ROLLUP
list and the prefixes of that list.
Example:
WITH Sales AS (
SELECT 123 AS sku, 1 AS day, 9.99 AS price UNION ALL
SELECT 123, 1, 8.99 UNION ALL
SELECT 456, 1, 4.56 UNION ALL
SELECT 123, 2, 9.99 UNION ALL
SELECT 789, 3, 1.00 UNION ALL
SELECT 456, 3, 4.25 UNION ALL
SELECT 789, 3, 0.99
)
SELECT
day,
SUM(price) AS total
FROM Sales
GROUP BY ROLLUP(day);
The query above outputs a row for each day in addition to the rolled up total
across all days, as indicated by a NULL
day:
+------+-------+
| day | total |
+------+-------+
| NULL | 39.77 |
| 1 | 23.54 |
| 2 | 9.99 |
| 3 | 6.24 |
+------+-------+
Example:
WITH Sales AS (
SELECT 123 AS sku, 1 AS day, 9.99 AS price UNION ALL
SELECT 123, 1, 8.99 UNION ALL
SELECT 456, 1, 4.56 UNION ALL
SELECT 123, 2, 9.99 UNION ALL
SELECT 789, 3, 1.00 UNION ALL
SELECT 456, 3, 4.25 UNION ALL
SELECT 789, 3, 0.99
)
SELECT
sku,
day,
SUM(price) AS total
FROM Sales
GROUP BY ROLLUP(sku, day)
ORDER BY sku, day;
The query above returns rows grouped by the following grouping sets:
- sku and day
- sku (day is
NULL
) - The empty grouping set (day and sku are
NULL
)
The sums for these grouping sets correspond to the total for each distinct sku-day combination, the total for each sku across all days, and the grand total:
+------+------+-------+
| sku | day | total |
+------+------+-------+
| NULL | NULL | 39.77 |
| 123 | NULL | 28.97 |
| 123 | 1 | 18.98 |
| 123 | 2 | 9.99 |
| 456 | NULL | 8.81 |
| 456 | 1 | 4.56 |
| 456 | 3 | 4.25 |
| 789 | 3 | 1.99 |
| 789 | NULL | 1.99 |
+------+------+-------+
HAVING bool_expression
The HAVING
clause is similar to the WHERE
clause: it filters out rows that
do not return TRUE when they are evaluated against the bool_expression
.
As with the WHERE
clause, the bool_expression
can be any expression
that returns a boolean, and can contain multiple sub-conditions.
The HAVING
clause differs from the WHERE
clause in that:
- The
HAVING
clause requiresGROUP BY
or aggregation to be present in the query. - The
HAVING
clause occurs afterGROUP BY
and aggregation, and beforeORDER BY
. This means that theHAVING
clause is evaluated once for every aggregated row in the result set. This differs from theWHERE
clause, which is evaluated beforeGROUP BY
and aggregation.
The HAVING
clause can reference columns available via the FROM
clause, as
well as SELECT
list aliases. Expressions referenced in the HAVING
clause
must either appear in the GROUP BY
clause or they must be the result of an
aggregate function:
SELECT LastName
FROM Roster
GROUP BY LastName
HAVING SUM(PointsScored) > 15;
If a query contains aliases in the SELECT
clause, those aliases override names
in a FROM
clause.
SELECT LastName, SUM(PointsScored) AS ps
FROM Roster
GROUP BY LastName
HAVING ps > 0;
Aggregation does not have to be present in the HAVING
clause itself, but
aggregation must be present in at least one of the following forms:
SELECT LastName, SUM(PointsScored) AS total
FROM PlayerStats
GROUP BY LastName
HAVING total > 15;
SELECT LastName
FROM PlayerStats
GROUP BY LastName
HAVING SUM(PointsScored) > 15;
When aggregation functions are present in both the SELECT
list and HAVING
clause, the aggregation functions and the columns they reference do not need
to be the same. In the example below, the two aggregation functions,
COUNT()
and SUM()
, are different and also use different columns.
SELECT LastName, COUNT(*)
FROM PlayerStats
GROUP BY LastName
HAVING SUM(PointsScored) > 15;
ORDER BY expression [{ ASC | DESC }] [, ...]
The ORDER BY
clause specifies a column or expression as the sort criterion for
the result set. If an ORDER BY clause is not present, the order of the results
of a query is not defined. Column aliases from a FROM
clause or SELECT
list
are allowed. If a query contains aliases in the SELECT
clause, those aliases
override names in the corresponding FROM
clause.
Optional Clauses
ASC | DESC
: Sort the results in ascending or descending order ofexpression
values.ASC
is the default value.
Examples
Use the default sort order (ascending).
SELECT x, y
FROM (SELECT 1 AS x, true AS y UNION ALL
SELECT 9, true)
ORDER BY x;
+------+-------+
| x | y |
+------+-------+
| 1 | true |
| 9 | true |
+------+-------+
Use descending sort order.
SELECT x, y
FROM (SELECT 1 AS x, true AS y UNION ALL
SELECT 9, true)
ORDER BY x DESC;
+------+-------+
| x | y |
+------+-------+
| 9 | true |
| 1 | true |
+------+-------+
It is possible to order by multiple columns. In the example below, the result
set is ordered first by SchoolID
and then by LastName
:
SELECT LastName, PointsScored, OpponentID
FROM PlayerStats
ORDER BY SchoolID, LastName;
The following rules apply when ordering values:
- NULLs: In the context of the
ORDER BY
clause, NULLs are the minimum possible value; that is, NULLs appear first inASC
sorts and last inDESC
sorts. - Floating point data types: see Floating Point Semantics on ordering and grouping.
When used in conjunction with set operators, the ORDER BY
clause applies to the result set of the entire query; it does not
apply only to the closest SELECT
statement. For this reason, it can be helpful
(though it is not required) to use parentheses to show the scope of the ORDER BY
.
This query without parentheses:
SELECT * FROM Roster
UNION ALL
SELECT * FROM TeamMascot
ORDER BY SchoolID;
is equivalent to this query with parentheses:
( SELECT * FROM Roster
UNION ALL
SELECT * FROM TeamMascot )
ORDER BY SchoolID;
but is not equivalent to this query, where the ORDER BY
clause applies only to
the second SELECT
statement:
SELECT * FROM Roster
UNION ALL
( SELECT * FROM TeamMascot
ORDER BY SchoolID );
You can also use integer literals as column references in ORDER BY
clauses. An
integer literal becomes an ordinal (for example, counting starts at 1) into
the SELECT
list.
Example - the following two queries are equivalent:
SELECT SUM(PointsScored), LastName
FROM PlayerStats
ORDER BY LastName;
SELECT SUM(PointsScored), LastName
FROM PlayerStats
ORDER BY 2;
You can use the COLLATE
clause to refine how a data is ordered from an ORDER BY
clause. Collation refers to a set of rules that determine how
STRINGs are compared according to the conventions and
standards of a particular language, region or country. These rules might define
the correct character sequence, with options for specifying case-insensitivity.
Note: You can use COLLATE
only on columns of type
STRING.
You add collation to your statement as follows:
SELECT ...
FROM ...
ORDER BY value COLLATE collation_string
A collation_string
contains a collation_name
and can have an optional
collation_attribute
as a suffix, separated by a colon. The collation_string
is a literal or a parameter. Usually, this name is two letters that represent
the language optionally followed by an underscore and two letters that
represent the region—for example, en_US
. These names are defined by the
Common Locale Data Repository (CLDR).
A statement can also have a collation_name
of unicode
. This value means that
the statement should return data using the default unicode collation.
In addition to the collation_name
, a collation_string
can have an optional
collation_attribute
as a suffix, separated by a colon. This attribute
specifies if the data comparisons should be case sensitive. Allowed values are
cs
, for case sensitive, and ci
, for case insensitive. If a
collation_attribute
is not supplied, the
CLDR defaults
are used.
Examples
Collate results using English - Canada:
SELECT Place
FROM Locations
ORDER BY Place COLLATE "en_CA"
Collate results using a parameter:
#@collate_param = "arg_EG"
SELECT Place
FROM Locations
ORDER BY Place COLLATE @collate_param
Using multiple COLLATE
clauses in a statement:
SELECT APlace, BPlace, CPlace
FROM Locations
ORDER BY APlace COLLATE "en_US" ASC,
BPlace COLLATE "ar_EG" DESC,
CPlace COLLATE "en" DESC
Case insensitive collation:
SELECT Place
FROM Locations
ORDER BY Place COLLATE "en_US:ci"
Default Unicode case-insensitive collation:
SELECT Place
FROM Locations
ORDER BY Place COLLATE "unicode:ci"
WINDOW named_window_expression [, ...] named_window_expression: named_window AS { named_window | ( [ window_specification ] ) }
A WINDOW
clause defines a list of named windows.
A named window represents a group of rows in a table upon which to use an
analytic function. A named window can be defined with
a window specification or reference another
named window. If another named window is referenced, the definition of the
referenced window must precede the referencing window.
Examples
These examples reference a table called Produce
.
They all return the same result. Note the different
ways you can combine named windows and use them in an analytic function's
OVER
clause.
SELECT item, purchases, category, LAST_VALUE(item)
OVER (item_window) AS most_popular
FROM Produce
WINDOW item_window AS (
PARTITION BY category
ORDER BY purchases
ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING)
SELECT item, purchases, category, LAST_VALUE(item)
OVER (d) AS most_popular
FROM Produce
WINDOW
a AS (PARTITION BY category),
b AS (a ORDER BY purchases),
c AS (b ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING),
d AS (c)
SELECT item, purchases, category, LAST_VALUE(item)
OVER (c ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS most_popular
FROM Produce
WINDOW
a AS (PARTITION BY category),
b AS (a ORDER BY purchases),
c AS b
UNION { ALL | DISTINCT } | INTERSECT { ALL | DISTINCT } | EXCEPT { ALL | DISTINCT }
Set operators combine results from two or
more input queries into a single result set. You
must specify ALL
or DISTINCT
; if you specify ALL
, then all rows are
retained. If DISTINCT
is specified, duplicate rows are discarded.
If a given row R appears exactly m times in the first input query and n times in the second input query (m >= 0, n >= 0):
- For
UNION ALL
, R appears exactly m + n times in the result. - For
INTERSECT ALL
, R will appear exactlyMIN(m, n)
in the result. - For
EXCEPT ALL
, R appears exactlyMAX(m - n, 0)
in the result. - For
UNION DISTINCT
, theDISTINCT
is computed after theUNION
is computed, so R appears exactly one time. - For
INTERSECT DISTINCT
, theDISTINCT
is computed after the result above is computed. - For
EXCEPT DISTINCT
, row R appears once in the output if m > 0 and n = 0. - If there are more than two input queries, the above operations generalize and the output is the same as if the inputs were combined incrementally from left to right.
The following rules apply:
- For set operations other than
UNION ALL
, all column types must support equality comparison. - The input queries on each side of the operator must return the same number of columns.
- The operators pair the columns returned by each input query according to
the columns' positions in their respective
SELECT
lists. That is, the first column in the first input query is paired with the first column in the second input query. - The result set always uses the column names from the first input query.
- The result set always uses the supertypes of input types in corresponding columns, so paired columns must also have either the same data type or a common supertype.
- You must use parentheses to separate different set
operations; for this purpose, set operations such as
UNION ALL
andUNION DISTINCT
are different. If the statement only repeats the same set operation, parentheses are not necessary.
Examples:
query1 UNION ALL (query2 UNION DISTINCT query3)
query1 UNION ALL query2 UNION ALL query3
Invalid:
query1 UNION ALL query2 UNION DISTINCT query3
query1 UNION ALL query2 INTERSECT ALL query3; // INVALID.
The UNION
operator combines the result sets of two or more input queries by
pairing columns from the result set of each query and vertically concatenating
them.
The INTERSECT
operator returns rows that are found in the result sets of both
the left and right input queries. Unlike EXCEPT
, the positioning of the input
queries (to the left versus right of the INTERSECT
operator) does not matter.
The EXCEPT
operator returns rows from the left input query that are
not present in the right input query.
Example:
SELECT * FROM UNNEST(ARRAY<int64>[1, 2, 3]) AS number
EXCEPT DISTINCT SELECT 1;
+--------+
| number |
+--------+
| 2 |
| 3 |
+--------+
LIMIT count [ OFFSET skip_rows ]
LIMIT
specifies a non-negative count
of type INT64,
and no more than count
rows will be returned. LIMIT
0
returns 0 rows.
If there is a set operation, LIMIT
is applied after the set operation is
evaluated.
OFFSET
specifies a non-negative number of rows to skip before applying
LIMIT
. skip_rows
is of type INT64.
These clauses accept only literal or parameter values. The rows that are
returned by LIMIT
and OFFSET
are unspecified unless these
operators are used after ORDER BY
.
Examples:
SELECT *
FROM UNNEST(ARRAY<STRING>['a', 'b', 'c', 'd', 'e']) AS letter
ORDER BY letter ASC LIMIT 2
+---------+
| letter |
+---------+
| a |
| b |
+---------+
SELECT *
FROM UNNEST(ARRAY<STRING>['a', 'b', 'c', 'd', 'e']) AS letter
ORDER BY letter ASC LIMIT 3 OFFSET 1
+---------+
| letter |
+---------+
| b |
| c |
| d |
+---------+
The WITH
clause binds the results of one or more named
subqueries to temporary table names. Each introduced
table name is visible in subsequent SELECT
expressions within the same
query expression. This includes the following kinds of SELECT
expressions:
- Any
SELECT
expressions in subsequentWITH
bindings - Top level
SELECT
expressions in the query expression on both sides of a set operator such asUNION
SELECT
expressions inside subqueries within the same query expression
Example:
WITH subQ1 AS (SELECT SchoolID FROM Roster),
subQ2 AS (SELECT OpponentID FROM PlayerStats)
SELECT * FROM subQ1
UNION ALL
SELECT * FROM subQ2;
You can use WITH
to break up more complex queries into a WITH
SELECT
statement and WITH
clauses, where the less desirable alternative is writing
nested table subqueries. For example:
WITH q1 AS (my_query)
SELECT *
FROM
(WITH q2 AS (SELECT * FROM q1) SELECT * FROM q2)
Note: If a WITH
clause contains multiple subqueries, the subquery names cannot
repeat.
The following are scoping rules for WITH
clauses:
- Aliases are scoped so that the aliases introduced in a
WITH
clause are visible only in the later subqueries in the sameWITH
clause, and in the query under theWITH
clause. - Aliases introduced in the same
WITH
clause must be unique, but the same alias can be used in multipleWITH
clauses in the same query. The local alias overrides any outer aliases anywhere that the local alias is visible. - Aliased subqueries in a
WITH
clause can never be correlated. No columns from outside the query are visible. The only names from outside that are visible are otherWITH
aliases that were introduced earlier in the sameWITH
clause.
Here's an example of a statement that uses aliases in WITH
subqueries:
WITH q1 AS (my_query)
SELECT *
FROM
(WITH q2 AS (SELECT * FROM q1), # q1 resolves to my_query
q3 AS (SELECT * FROM q1), # q1 resolves to my_query
q1 AS (SELECT * FROM q1), # q1 (in the query) resolves to my_query
q4 AS (SELECT * FROM q1) # q1 resolves to the WITH subquery
# on the previous line.
SELECT * FROM q1) # q1 resolves to the third inner WITH subquery.
WITH RECURSIVE
is not supported.
This clause lets you anonymize the results of a query with differentially private aggregations. To learn more about this clause, see Anonymization and Differential Privacy.
Note: the WITH ANONYMIZATION
clause cannot be used with the WITH
clause.
Support for this clause in query patterns is limited.
An alias is a temporary name given to a table, column, or expression present in
a query. You can introduce explicit aliases in the SELECT
list or FROM
clause, or ZetaSQL will infer an implicit alias for some expressions.
Expressions with neither an explicit nor implicit alias are anonymous and the
query cannot reference them by name.
You can introduce explicit aliases in either the FROM
clause or the SELECT
list.
In a FROM
clause, you can introduce explicit aliases for any item, including
tables, arrays, subqueries, and UNNEST
clauses, using [AS] alias
. The AS
keyword is optional.
Example:
SELECT s.FirstName, s2.SongName
FROM Singers AS s, (SELECT * FROM Songs) AS s2;
You can introduce explicit aliases for any expression in the SELECT
list using
[AS] alias
. The AS
keyword is optional.
Example:
SELECT s.FirstName AS name, LOWER(s.FirstName) AS lname
FROM Singers s;
In the SELECT
list, if there is an expression that does not have an explicit
alias, ZetaSQL assigns an implicit alias according to the following
rules. There can be multiple columns with the same alias in the SELECT
list.
- For identifiers, the alias is the identifier. For example,
SELECT abc
impliesAS abc
. - For path expressions, the alias is the last identifier in the path. For
example,
SELECT abc.def.ghi
impliesAS ghi
. - For field access using the "dot" member field access operator, the alias is
the field name. For example,
SELECT (struct_function()).fname
impliesAS fname
.
In all other cases, there is no implicit alias, so the column is anonymous and cannot be referenced by name. The data from that column will still be returned and the displayed query results may have a generated label for that column, but the label cannot be used like an alias.
In a FROM
clause, from_item
s are not required to have an alias. The
following rules apply:
- If there is an expression that does not have an explicit alias, ZetaSQL assigns an implicit alias in these cases:
-
For identifiers, the alias is the identifier. For example,
FROM abc
impliesAS abc
. -
For path expressions, the alias is the last identifier in the path. For
example,
FROM abc.def.ghi
impliesAS ghi
-
The column produced using
WITH OFFSET
has the implicit aliasoffset
. - Table subqueries do not have implicit aliases.
-
FROM UNNEST(x)
does not have an implicit alias.
After you introduce an explicit alias in a query, there are restrictions on where else in the query you can reference that alias. These restrictions on alias visibility are the result of ZetaSQL's name scoping rules.
ZetaSQL processes aliases in a FROM
clause from left to right,
and aliases are visible only to subsequent path expressions in a FROM
clause.
Example:
Assume the Singers
table had a Concerts
column of ARRAY
type.
SELECT FirstName
FROM Singers AS s, s.Concerts;
Invalid:
SELECT FirstName
FROM s.Concerts, Singers AS s; // INVALID.
FROM
clause aliases are not visible to subqueries in the same FROM
clause. Subqueries in a FROM
clause cannot contain correlated references to
other tables in the same FROM
clause.
Invalid:
SELECT FirstName
FROM Singers AS s, (SELECT (2020 - ReleaseDate) FROM s) // INVALID.
You can use any column name from a table in the FROM
as an alias anywhere in
the query, with or without qualification with the table name.
Example:
SELECT FirstName, s.ReleaseDate
FROM Singers s WHERE ReleaseDate = 1975;
If the FROM
clause contains an explicit alias, you must use the explicit alias
instead of the implicit alias for the remainder of the query (see
Implicit Aliases). A table alias is useful for brevity or
to eliminate ambiguity in cases such as self-joins, where the same table is
scanned multiple times during query processing.
Example:
SELECT * FROM Singers as s, Songs as s2
ORDER BY s.LastName
Invalid — ORDER BY
does not use the table alias:
SELECT * FROM Singers as s, Songs as s2
ORDER BY Singers.LastName; // INVALID.
Aliases in the SELECT
list are visible only to the following clauses:
GROUP BY
clauseORDER BY
clauseHAVING
clause
Example:
SELECT LastName AS last, SingerID
FROM Singers
ORDER BY last;
These three clauses, GROUP BY
, ORDER BY
, and HAVING
, can refer to only the
following values:
- Tables in the
FROM
clause and any of their columns. - Aliases from the
SELECT
list.
GROUP BY
and ORDER BY
can also refer to a third group:
- Integer literals, which refer to items in the
SELECT
list. The integer1
refers to the first item in theSELECT
list,2
refers to the second item, etc.
Example:
SELECT SingerID AS sid, COUNT(Songid) AS s2id
FROM Songs
GROUP BY 1
ORDER BY 2 DESC;
The previous query is equivalent to:
SELECT SingerID AS sid, COUNT(Songid) AS s2id
FROM Songs
GROUP BY sid
ORDER BY s2id DESC;
A SELECT
list or subquery containing multiple explicit or implicit aliases
of the same name is allowed, as long as the alias name is not referenced
elsewhere in the query, since the reference would be
ambiguous.
Example:
SELECT 1 AS a, 2 AS a;
+---+---+
| a | a |
+---+---+
| 1 | 2 |
+---+---+
ZetaSQL provides an error if accessing a name is ambiguous, meaning it can resolve to more than one unique object in the query or in a table schema, including the schema of a destination table.
Examples:
This query contains column names that conflict between tables, since both
Singers
and Songs
have a column named SingerID
:
SELECT SingerID
FROM Singers, Songs;
This query contains aliases that are ambiguous in the GROUP BY
clause because
they are duplicated in the SELECT
list:
SELECT FirstName AS name, LastName AS name,
FROM Singers
GROUP BY name;
This query contains aliases that are ambiguous in the SELECT
list and FROM
clause because they share the same name. Assume table
has columns x
, y
,
and z
. z
is of type STRUCT and has fields
v
, w
, and x
.
Example:
SELECT x, z AS T
FROM table AS T
GROUP BY T.x;
The alias T
is ambiguous and will produce an error because T.x
in the GROUP BY
clause could refer to either table.x
or table.z.x
.
A name is not ambiguous in GROUP BY
, ORDER BY
or HAVING
if it is both
a column name and a SELECT
list alias, as long as the name resolves to the
same underlying object.
Example:
SELECT LastName, BirthYear AS BirthYear
FROM Singers
GROUP BY BirthYear;
The alias BirthYear
is not ambiguous because it resolves to the same
underlying column, Singers.BirthYear
.
In the SELECT
list, if there is an expression that does not have an explicit
alias, ZetaSQL assigns an implicit alias according to the following
rules. There can be multiple columns with the same alias in the SELECT
list.
- For identifiers, the alias is the identifier. For example,
SELECT abc
impliesAS abc
. - For path expressions, the alias is the last identifier in the path. For
example,
SELECT abc.def.ghi
impliesAS ghi
. - For field access using the "dot" member field access operator, the alias is
the field name. For example,
SELECT (struct_function()).fname
impliesAS fname
.
In all other cases, there is no implicit alias, so the column is anonymous and cannot be referenced by name. The data from that column will still be returned and the displayed query results may have a generated label for that column, but the label cannot be used like an alias.
In a FROM
clause, from_item
s are not required to have an alias. The
following rules apply:
- If there is an expression that does not have an explicit alias,
ZetaSQL assigns an implicit alias in these cases:
- For identifiers, the alias is the identifier. For example,
FROM abc
impliesAS abc
. - For path expressions, the alias is the last identifier in the path. For
example,
FROM abc.def.ghi
impliesAS ghi
- The column produced using
WITH OFFSET
has the implicit aliasoffset
.
- For identifiers, the alias is the identifier. For example,
- Table subqueries do not have implicit aliases.
FROM UNNEST(x)
does not have an implicit alias.
In ZetaSQL, a range variable is a table expression alias in the
FROM
clause. Sometimes a range variable is known as a table alias
. A
range variable lets you reference rows being scanned from a table expression.
A table expression represents an item in the FROM
clause that returns a table.
Common items that this expression can represent include
tables, value tables, subqueries,
table value functions (TVFs), joins, and parenthesized joins.
In general, a range variable provides a reference to the rows of a table
expression. A range variable can be used to qualify a column reference and
unambiguously identify the related table, for example range_variable.column_1
.
When referencing a range variable on its own without a specified column suffix,
the result of a table expression is the row type of the related table.
Value tables have explicit row types, so for range variables related
to value tables, the result type is the value table's row type. Other tables
do not have explicit row types, and for those tables, the range variable
type is a dynamically defined STRUCT
that includes all of the
columns in the table.
Examples
In these examples, the WITH
clause is used to emulate a temporary table
called Grid
. This table has columns x
and y
. A range variable called
Coordinate
refers to the current row as the table is scanned. Coordinate
can be used to access the entire row or columns in the row.
The following example selects column x
from range variable Coordinate
,
which in effect selects column x
from table Grid
.
WITH Grid AS (SELECT 1 x, 2 y)
SELECT Coordinate.x FROM Grid AS Coordinate;
+---+
| x |
+---+
| 1 |
+---+
The following example selects all columns from range variable Coordinate
,
which in effect selects all columns from table Grid
.
WITH Grid AS (SELECT 1 x, 2 y)
SELECT Coordinate.* FROM Grid AS Coordinate;
+---+---+
| x | y |
+---+---+
| 1 | 2 |
+---+---+
The following example selects the range variable Coordinate
, which is a
reference to rows in table Grid
. Since Grid
is not a value table,
the result type of Coordinate
is a STRUCT
that contains all the columns
from Grid
.
WITH Grid AS (SELECT 1 x, 2 y)
SELECT Coordinate FROM Grid AS Coordinate;
+--------------+
| Coordinate |
+--------------+
| {x: 1, y: 2} |
+--------------+
These examples include statements which perform queries on the
Roster
and TeamMascot
,
and PlayerStats
tables.
Example:
SELECT LastName, SUM(PointsScored)
FROM PlayerStats
GROUP BY LastName;
LastName | SUM |
---|---|
Adams | 7 |
Buchanan | 13 |
Coolidge | 1 |
The UNION
operator combines the result sets of two or more SELECT
statements
by pairing columns from the result set of each SELECT
statement and vertically
concatenating them.
Example:
SELECT Mascot AS X, SchoolID AS Y
FROM TeamMascot
UNION ALL
SELECT LastName, PointsScored
FROM PlayerStats;
Results:
X | Y |
---|---|
Jaguars | 50 |
Knights | 51 |
Lakers | 52 |
Mustangs | 53 |
Adams | 3 |
Buchanan | 0 |
Coolidge | 1 |
Adams | 4 |
Buchanan | 13 |
This query returns the last names that are present in both Roster and PlayerStats.
SELECT LastName
FROM Roster
INTERSECT ALL
SELECT LastName
FROM PlayerStats;
Results:
LastName |
---|
Adams |
Coolidge |
Buchanan |
The query below returns last names in Roster that are not present in PlayerStats.
SELECT LastName
FROM Roster
EXCEPT DISTINCT
SELECT LastName
FROM PlayerStats;
Results:
LastName |
---|
Eisenhower |
Davis |
Reversing the order of the SELECT
statements will return last names in
PlayerStats that are not present in Roster:
SELECT LastName
FROM PlayerStats
EXCEPT DISTINCT
SELECT LastName
FROM Roster;
Results:
(empty)