OpenSearch and Spark Integration P1 Demo #317

dai-chen · 2023-07-18T16:49:37Z

dai-chen
Jul 18, 2023
Maintainer

Here is a follow-up demo after https://github.com/opensearch-project/sql/discussions/1465. The demo covers how to use our Spark extension to create skipping index with OpenSearch which can accelerate query with applicable filtering condition. Please find more details in https://github.com/opensearch-project/sql/blob/feature/flint/flint/docs/index.md.

Demo Video

Flint.skipping.index.demo.P1.-.Part1.mov

Flint.skipping.index.demo.P1.-.Part2.mov

Flint.skipping.index.demo.P1.-.Part3.mov

Demo Setup

Demo Data

Use lineitem dataset in TPC-H:

CREATE EXTERNAL TABLE lineitem_tiny (
  l_orderkey BIGINT,
  l_partkey BIGINT,
  l_suppkey BIGINT,
  l_linenumber INTEGER,
  l_quantity FLOAT,
  l_extendedprice FLOAT,
  l_discount FLOAT,
  l_tax FLOAT,
  l_returnflag STRING,
  l_linestatus STRING,
  l_commitdate STRING,
  l_receiptdate STRING,
  l_shipinstruct STRING,
  l_shipmode STRING,
  l_comment STRING,
  l_shipdate DATE
)
USING JSON
OPTIONS (
  path 's3://.../tpch-lineitem-tiny',
  compression 'gzip'
);

Launch SparkSQL CLI

Login to EMR primary node and start spark-sql:

spark-sql  \
  --jars /home/hadoop/flint-spark-integration-assembly-0.1.0-SNAPSHOT.jar  \
  --conf spark.sql.extensions=org.opensearch.flint.spark.FlintSparkExtensions \
  --conf spark.datasource.flint.host=search-xxx.es.amazonaws.com \
  --conf spark.datasource.flint.port=-1 \
  --conf spark.datasource.flint.scheme=https \
  --conf spark.datasource.flint.auth=sigv4 \
  --conf spark.datasource.flint.region=us-xxx-x

Create Skipping Index

CREATE SKIPPING INDEX ON lineitem_tiny
(l_shipdate VALUE_SET);

DESC SKIPPING INDEX ON lineitem_tiny;

REFRESH SKIPPING INDEX ON lineitem_tiny;

DROP SKIPPING INDEX ON lineitem_tiny;

You can check what OpenSearch index looks like:

GET _cat/indices?v

GET flint_lineitem_tiny_skipping_index/_mapping

POST flint_lineitem_tiny_skipping_index/_refresh

POST flint_lineitem_tiny_skipping_index/_search

Query Test

Use TPC-H Q6:

#Disable Flint optimization  
SET spark.flint.optimizer.enabled=false;

#Explain query without index
EXPLAIN
SELECT
  SUM(l_extendedprice*l_discount) AS revenue
FROM
  lineitem_tiny
WHERE
  l_shipdate = '1994-06-12'
  AND l_discount between 0.02 - 0.01 and 0.02 + 0.01
  AND l_quantity < 24;


#Enable Flint optimization  
SET spark.flint.optimizer.enabled=true;

EXPLAIN
SELECT
  SUM(l_extendedprice*l_discount) AS revenue
FROM
  lineitem_tiny
WHERE
  l_shipdate = '1994-06-12'
  AND l_discount between 0.02 - 0.01 and 0.02 + 0.01
  AND l_quantity < 24;

anirudha · 2023-10-16T15:38:25Z

anirudha
Oct 16, 2023
Maintainer

demo.mov

0 replies

dai-chen · 2024-04-26T22:25:28Z

dai-chen
Apr 26, 2024
Maintainer Author

Flint Covering Index Acceleration Demo

VID_20240426173659784.mov

Test Table

Make use of the same lineitem table above and meanwhile create another TPC-H orders table for TPC-H Q3 JOIN query test:

CREATE TABLE ds_tables.orders_tiny (
  o_orderkey BIGINT,
  o_orderdate DATE,
  o_orderstatus STRING,
  o_shippriority INT
)
USING JSON
OPTIONS (
  'compression' = 'gzip')
LOCATION 's3://.../tpch-orders-tiny';

SHOW CREATE TABLE ds_tables.lineitem_tiny;
SHOW CREATE TABLE ds_tables.orders_tiny;

Creating Covering Index

Create covering index with all required columns indexed:

CREATE INDEX demo
ON ds_tables.lineitem_tiny (
  l_extendedprice,
  l_discount,
  l_shipdate,
  l_quantity,
  l_orderkey
)
WITH (
  auto_refresh = true
);

Query Test

# Same test query as above
EXPLAIN
SELECT
  SUM(l_extendedprice*l_discount) AS revenue
FROM
  ds_tables.lineitem_tiny
WHERE
  l_shipdate = '1997-02-06'
  AND l_discount between 0.02 - 0.01 and 0.02 + 0.01
  AND l_quantity < 24;

# Without covering index acceleration:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum((l_extendedprice#15 * l_discount#16))])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=213]
      +- HashAggregate(keys=[], functions=[partial_sum((l_extendedprice#15 * l_discount#16))])
         +- Project [l_extendedprice#15, l_discount#16]
            +- Filter ((((((isnotnull(l_shipdate#25) AND isnotnull(l_discount#16)) AND isnotnull(l_quantity#14))
AND (l_shipdate#25 = 1997-02-06)) AND (l_discount#16 > 0.01)) 
AND (l_discount#16 <= 0.03)) AND (l_quantity#14 < 24.0))
               +- FileScan json ds_tables.lineitem_tiny[l_quantity#14,l_extendedprice#15,l_discount#16,l_shipdate#25]
Batched: false, DataFilters: [isnotnull(l_shipdate#25), isnotnull(l_discount#16), isnotnull(l_quantity#14), (l_shipdate#25 = 1..., 
Format: JSON, Location: InMemoryFileIndex(1 paths)[s3://.../tpch-lineitem-tiny], PartitionFilters: [],
PushedFilters: [IsNotNull(l_shipdate), IsNotNull(l_discount), IsNotNull(l_quantity), EqualTo(l_shipdate,1997-02-...,
ReadSchema: struct

# With covering index acceleration:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum((l_extendedprice#15 * l_discount#16))])
   +- HashAggregate(keys=[], functions=[partial_sum((l_extendedprice#15 * l_discount#16))])
      +- Project [l_extendedprice#15, l_discount#16]
         +- BatchScan[l_quantity#14, l_shipdate#25, l_extendedprice#15, l_discount#16, l_orderkey#10L] class
org.apache.spark.sql.flint.FlintScan, PushedPredicates: [l_shipdate IS NOT NULL, l_discount IS NOT NULL,
l_quantity IS NOT NULL, l_shipdate = 9898, l_discount > 0.01, l_discount <= 0.03, l_quantity < 24.0]
RuntimeFilters: []

# TPC-H Q3 query test:
EXPLAIN
SELECT
  l_orderkey,
  SUM(l_extendedprice * (1 - l_discount)) AS revenue,
  o_orderdate,
  o_shippriority
FROM
  ds_tables.orders_tiny AS o
JOIN
  ds_tables.lineitem_tiny AS l ON o.o_orderkey = l.l_orderkey
WHERE
  l_shipdate = '1997-02-06'
GROUP BY
  l_orderkey, o_orderdate, o_shippriority
ORDER BY
  revenue DESC,
  o_orderdate;

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [revenue#157 DESC NULLS LAST, o_orderdate#70 ASC NULLS FIRST], true, 0
   +- HashAggregate(keys=[l_orderkey#10L, o_orderdate#70, o_shippriority#72], functions=[sum((l_extendedprice#15 * (1.0 - l_discount#16)))])
      +- HashAggregate(keys=[l_orderkey#10L, o_orderdate#70, o_shippriority#72], functions=[partial_sum((l_extendedprice#15 * (1.0 - l_discount#16)))])
         +- Project [o_orderdate#70, o_shippriority#72, l_orderkey#10L, l_extendedprice#15, l_discount#16]
            +- BroadcastHashJoin [o_orderkey#69L], [l_orderkey#10L], Inner, BuildLeft, false
               :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [plan_id=286]
               :  +- Filter isnotnull(o_orderkey#69L)
               :     +- FileScan json ds_tables.orders_tiny[o_orderkey#69L,o_orderdate#70,o_shippriority#72] Batched: false,
DataFilters: [isnotnull(o_orderkey#69L)], Format: JSON, Location: InMemoryFileIndex(1 paths)[s3://.../tpch-orders-tiny],
PartitionFilters: [], PushedFilters: [IsNotNull(o_orderkey)], ReadSchema: struct
               +- Project [l_orderkey#10L, l_extendedprice#15, l_discount#16]
                  +- BatchScan[l_quantity#14, l_shipdate#25, l_extendedprice#15, l_discount#16, l_orderkey#10L] class
org.apache.spark.sql.flint.FlintScan, PushedPredicates: [l_shipdate IS NOT NULL, l_shipdate = 9898,
l_orderkey IS NOT NULL] RuntimeFilters: []

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OpenSearch and Spark Integration P1 Demo #317

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

OpenSearch and Spark Integration P1 Demo #317

dai-chen Jul 18, 2023 Maintainer

Demo Video

Demo Setup

Demo Data

Launch SparkSQL CLI

Create Skipping Index

Query Test

Replies: 2 comments

anirudha Oct 16, 2023 Maintainer

dai-chen Apr 26, 2024 Maintainer Author

Flint Covering Index Acceleration Demo

Test Table

Creating Covering Index

Query Test

dai-chen
Jul 18, 2023
Maintainer

anirudha
Oct 16, 2023
Maintainer

dai-chen
Apr 26, 2024
Maintainer Author