PPL command implementation for `appendCol` #990

andy-k-improving · 2024-12-11T02:25:28Z

Description

Introduce the new PPL command appendCol which aim to aggregate result from multiple searches into a single comprehensive table for user to view.

This is accomplished by reading both main-search and sub-search in the form of node then transform it into the following of SQL with by adding _row_number_ for the dataset's natural order, then join both main and sub search with the _row_number_ column.

select t1.*, t2.* 

FROM (
     SELECT *, row_number() over (order by '1') as row_org 
     FROM employees) as t1

 FULL OUTER JOIN (
     SELECT *, row_number() over (order by '1') as row_app 
     FROM employees) as t2 

ON t1.row_org = t2.row_app;

Related Issues

Resolves: #956

Check List

Updated documentation (docs/ppl-lang/README.md)
Implemented unit tests
Implemented tests for combination with other commands
New added source code should include a copyright header
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Test plan:

# Produce the artifact
sbt clean sparkPPLCosmetic/publishM2

# Start Spark with the plugin
bin/spark-sql --jars "/ABSOLUTE_PATH_TO_ARTIFACT/opensearch-spark-ppl_2.12-0.6.0-SNAPSHOT.jar" \
--conf "spark.sql.extensions=org.opensearch.flint.spark.FlintPPLSparkExtensions"  \
--conf "spark.sql.catalog.dev=org.apache.spark.opensearch.catalog.OpenSearchCatalog" \
--conf "spark.hadoop.hive.cli.print.header=true"

# Insert test table and data
CREATE TABLE employees (name STRING, dept STRING, salary INT, age INT, con STRING);

INSERT INTO employees VALUES ("Lisa", "Sales", 10000, 35, 'test');
INSERT INTO employees VALUES ("Evan", "Sales", 32000, 38, 'test');
INSERT INTO employees VALUES ("Fred", "Engineering", 21000, 28, 'test');
INSERT INTO employees VALUES ("Alex", "Sales", 30000, 33, 'test');
INSERT INTO employees VALUES ("Tom", "Engineering", 23000, 33, 'test');
INSERT INTO employees VALUES ("Jane", "Marketing", 29000, 28, 'test');
INSERT INTO employees VALUES ("Jeff", "Marketing", 35000, 38, 'test');
INSERT INTO employees VALUES ("Paul", "Engineering", 29000, 23, 'test');
INSERT INTO employees VALUES ("Chloe", "Engineering", 23000, 25, 'test');

# Append one sub-search:

source=employees | FIELDS name, dept, salary | APPENDCOL  [ stats count() as event_count];

name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	NULL
Paul	Engineering	29000	NULL
Evan	Sales------	32000	NULL
Chloe	Engineering	23000	NULL
Tom	Engineering	23000	NULL
Alex	Sales	30000	NULL
Jane	Marketing	29000	NULL
Jeff	Marketing	35000	NULL


# Append multiple sub-searches:

source=employees | FIELDS name, dept, salary | APPENDCOL  [ stats count() as event_count] | APPENDCOL [stats avg(age) as avg_age];

name	dept	salary	event_count	avg_age
Lisa	Sales------	10000	9	31.22222222222222
Fred	Engineering	21000	NULL	NULL
Paul	Engineering	29000	NULL	NULL
Evan	Sales------	32000	NULL	NULL
Chloe	Engineering	23000	NULL	NULL
Tom	Engineering	23000	NULL	NULL
Alex	Sales	30000	NULL	NULL
Jane	Marketing	29000	NULL	NULL
Jeff	Marketing	35000	NULL	NULL



# With override option (`salary` column from the main-search is being dropped and replaced by the `salary` column over the sub-search)

source=employees | FIELDS name, dept, salary | APPENDCOL OVERRIDE=true [stats avg(salary) as salary];

name	dept	salary
Lisa	Sales------	25777.777777777777
Fred	Engineering	NULL
Paul	Engineering	NULL
Evan	Sales------	NULL
Chloe	Engineering	NULL
Tom	Engineering	NULL
Alex	Sales	NULL
Jane	Marketing	NULL
Jeff	Marketing	NULL

docs/ppl-lang/ppl-appendcol-command.md

ppl-spark-integration/src/main/java/org/opensearch/sql/ppl/utils/AppendColCatalystUtils.java

ppl-spark-integration/src/main/java/org/opensearch/sql/ppl/CatalystQueryPlanVisitor.java

ppl-spark-integration/src/main/java/org/opensearch/sql/ppl/utils/AppendColCatalystUtils.java

LantaoJin · 2024-12-18T02:55:39Z

Two high level questions:

appendCol command syntax is
APPENDCOL <override=?> [sub-search]...

And the sub-search syntax is

opensearch-spark/ppl-spark-integration/src/main/antlr4/OpenSearchPPLParser.g4

Lines 27 to 29 in 957de4e

    
           subSearch 
        
              : searchCommand (PIPE commands)* 
        
              ;

Seems this PR doesn't follow the sub-search syntax.
I prefer to follow the current sub-search syntax, in case we could combine columns from different tables. for examples:
source=employees | FIELDS name, dept, salary | APPENDCOL [ search source = company | stats count() as event_count ]
But if this is intentional (appendcol only works for one table), I am okey for current syntax.

why the result of query source=employees | FIELDS name, dept, salary | APPENDCOL [ stats count() as event_count]
is

name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	NULL
Paul	Engineering	29000	NULL
Evan	Sales------	32000	NULL
Chloe	Engineering	23000	NULL
Tom	Engineering	23000	NULL
Alex	Sales	30000	NULL
Jane	Marketing	29000	NULL
Jeff	Marketing	35000	NULL

instead of

name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	9
Paul	Engineering	29000	9
Evan	Sales------	32000	9
Chloe	Engineering	23000	9
Tom	Engineering	23000	9
Alex	Sales	30000	9
Jane	Marketing	29000	9
Jeff	Marketing	35000	9

PS, what is the expected result of query source=employees | stats sum(salary) as total_salary by dept | appendcol [ stats avg(age) as avg_age by dept ]?

ppl-spark-integration/src/main/java/org/opensearch/sql/ppl/CatalystQueryPlanVisitor.java

andy-k-improving · 2024-12-19T00:09:53Z

Two high level questions:

appendCol command syntax is
APPENDCOL <override=?> [sub-search]...

And the sub-search syntax is

opensearch-spark/ppl-spark-integration/src/main/antlr4/OpenSearchPPLParser.g4

Lines 27 to 29 in 957de4e

subSearch

: searchCommand (PIPE commands)*

;

Seems this PR doesn't follow the sub-search syntax.
I prefer to follow the current sub-search syntax, in case we could combine columns from different tables. for examples:
source=employees | FIELDS name, dept, salary | APPENDCOL [ search source = company | stats count() as event_count ]
But if this is intentional (appendcol only works for one table), I am okey for current syntax.
2. why the result of query source=employees | FIELDS name, dept, salary | APPENDCOL [ stats count() as event_count]
is
name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	NULL
Paul	Engineering	29000	NULL
Evan	Sales------	32000	NULL
Chloe	Engineering	23000	NULL
Tom	Engineering	23000	NULL
Alex	Sales	30000	NULL
Jane	Marketing	29000	NULL
Jeff	Marketing	35000	NULL 
instead of
name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	9
Paul	Engineering	29000	9
Evan	Sales------	32000	9
Chloe	Engineering	23000	9
Tom	Engineering	23000	9
Alex	Sales	30000	9
Jane	Marketing	29000	9
Jeff	Marketing	35000	9
PS, what is the expected result of query source=employees | stats sum(salary) as total_salary by dept | appendcol [ stats avg(age) as avg_age by dept ]?

Yep, the sub-search under appendcol is restricted to use the same dataSource as the main PPL command, for the expected result, we can probably discuss it on the other thread, in order to centralise the convo.

docs/ppl-lang/README.md

docs/ppl-lang/ppl-appendcol-command.md

ppl-spark-integration/src/main/antlr4/OpenSearchPPLParser.g4

ppl-spark-integration/src/main/java/org/opensearch/sql/ast/tree/AppendCol.java

ppl-spark-integration/src/main/java/org/opensearch/sql/ppl/CatalystQueryPlanVisitor.java

YANG-DB · 2024-12-31T17:55:23Z

Two high level questions:

appendCol command syntax is
APPENDCOL <override=?> [sub-search]...

And the sub-search syntax is

opensearch-spark/ppl-spark-integration/src/main/antlr4/OpenSearchPPLParser.g4

Lines 27 to 29 in 957de4e

subSearch

: searchCommand (PIPE commands)*

;

Seems this PR doesn't follow the sub-search syntax.
I prefer to follow the current sub-search syntax, in case we could combine columns from different tables. for examples:
source=employees | FIELDS name, dept, salary | APPENDCOL [ search source = company | stats count() as event_count ]
But if this is intentional (appendcol only works for one table), I am okey for current syntax.
2. why the result of query source=employees | FIELDS name, dept, salary | APPENDCOL [ stats count() as event_count]
is
name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	NULL
Paul	Engineering	29000	NULL
Evan	Sales------	32000	NULL
Chloe	Engineering	23000	NULL
Tom	Engineering	23000	NULL
Alex	Sales	30000	NULL
Jane	Marketing	29000	NULL
Jeff	Marketing	35000	NULL 
instead of
name	dept	salary	event_count
Lisa	Sales------	10000	9
Fred	Engineering	21000	9
Paul	Engineering	29000	9
Evan	Sales------	32000	9
Chloe	Engineering	23000	9
Tom	Engineering	23000	9
Alex	Sales	30000	9
Jane	Marketing	29000	9
Jeff	Marketing	35000	9
PS, what is the expected result of query source=employees | stats sum(salary) as total_salary by dept | appendcol [ stats avg(age) as avg_age by dept ]?
Yep, the sub-search under appendcol is restricted to use the same dataSource as the main PPL command, for the expected result, we can probably discuss it on the other thread, in order to centralise the convo.

@andy-k-improving I would go with the same behaviour to match with Splunk ...

YANG-DB · 2025-01-02T18:29:50Z

@andy-k-improving LGTM - please validate @LantaoJin question and respond here
thanks

Signed-off-by: Andy Kwok <[email protected]>

Co-authored-by: Taylor Curran <[email protected]> Signed-off-by: Andy Kwok <[email protected]>

Signed-off-by: Andy Kwok <[email protected]>

LantaoJin · 2025-01-03T02:21:29Z

docs/ppl-lang/ppl-appendcol-command.md

+* sub-search: Executes PPL commands as a secondary search. The sub-search uses the same data specified in the source clause of the main search results as its input.
+
+
+#### Example 1: To append the result of `stats avg(age) as AVG_AGE` into existing search result   


We need to add an additional example here and a test in IT, which is the most commonly used scenario for the AppenCol command:
source=employees | stats avg(age) as avg_age1 by dept | APPENDCOL [ stats avg(age) as avg_age2 by dept ]
The result should look like

dept avg_age1 avg_age2 Sales 35.3333 35.3333 Engineering 27.25 27.25 Marketing 33 33

LGTM if above comment addressed.

Besides, please update https://github.com/opensearch-project/opensearch-spark/blob/main/docs/ppl-lang/PPL-Example-Commands.md as well.

Signed-off-by: Andy Kwok <[email protected]>

LantaoJin · 2025-01-04T02:39:12Z

docs/ppl-lang/ppl-appendcol-command.md

+
+PPL query:
+
+    os> source=employees | stats avg(age) as avg_age1 by dept | fields dept, avg_age1 | APPENDCOL  [ stats avg(age) as avg_age2 by dept | fields avg_age2 ];


What happens without | fields avg_age2? @andy-k-improving

In Splunk, command source=employees | stats avg(age) as avg_age1 by dept | APPENDCOLS [ stats avg(age) as avg_age2 by dept ] could display the same outputs. Does current implementation have to add those fields to get the same outputs?

andy-k-improving requested review from dai-chen, mengweieric, penghuo, seankao-az, anirudha, kaituo, YANG-DB, noCharger, LantaoJin and ykmr1224 as code owners December 11, 2024 02:25

andy-k-improving marked this pull request as draft December 11, 2024 02:26

andy-k-improving changed the title ~~DRAFT: PPL command appendCol implementaion~~ PPL command implementation for appendCol Dec 16, 2024

andy-k-improving marked this pull request as ready for review December 16, 2024 22:45