Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add operators to support duplicate eliminated joins #695

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
49 changes: 49 additions & 0 deletions proto/substrait/algebra.proto
Original file line number Diff line number Diff line change
Expand Up @@ -480,6 +480,8 @@ message Rel {
HashJoinRel hash_join = 13;
MergeJoinRel merge_join = 14;
NestedLoopJoinRel nested_loop_join = 18;
DuplicateEliminatedGetRel duplicate_eliminated_get = 23;
DuplicateEliminatedJoinRel duplicate_eliminated_join = 24;
ConsistentPartitionWindowRel window = 17;
ExchangeRel exchange = 15;
ExpandRel expand = 16;
Expand Down Expand Up @@ -773,6 +775,53 @@ message NestedLoopJoinRel {
substrait.extensions.AdvancedExtension advanced_extension = 10;
}

message DuplicateEliminatedGetRel {
RelCommon common = 1;
ReferenceRel input = 2;
EpsilonPrime marked this conversation as resolved.
Show resolved Hide resolved
repeated Expression.FieldReference column_ids = 3;
}

message DuplicateEliminatedJoinRel {
RelCommon common = 1;
Rel left = 2;
Rel right = 3;

Expression expression = 4;
Expression post_join_filter = 5;

JoinType type = 6;

// The set of columns that will be duplicate eliminated from the LHS and pushed into the RHS
repeated Expression.FieldReference duplicate_eliminated_columns = 7;

DuplicateEliminatedSide duplicate_eliminated_side = 8;

// If this is a DuplicateEliminatedJoin, whether it has been flipped to de-duplicating the LHS or RHS
enum DuplicateEliminatedSide {
DUPLICATE_ELIMINATED_SIDE_UNSPECIFIED = 0;
DUPLICATE_ELIMINATED_SIDE_LEFT = 1;
DUPLICATE_ELIMINATED_SIDE_RIGHT = 2;
}

enum JoinType {
JOIN_TYPE_UNSPECIFIED = 0;
JOIN_TYPE_INNER = 1;
JOIN_TYPE_OUTER = 2;
JOIN_TYPE_LEFT = 3;
JOIN_TYPE_RIGHT = 4;
JOIN_TYPE_LEFT_SEMI = 5;
JOIN_TYPE_LEFT_ANTI = 6;
JOIN_TYPE_LEFT_SINGLE = 7;
JOIN_TYPE_RIGHT_SEMI = 8;
JOIN_TYPE_RIGHT_ANTI = 9;
JOIN_TYPE_RIGHT_SINGLE = 10;
JOIN_TYPE_LEFT_MARK = 11;
JOIN_TYPE_RIGHT_MARK = 12;
}

substrait.extensions.AdvancedExtension advanced_extension = 10;
}

// The argument of a function
message FunctionArgument {
oneof arg_type {
Expand Down
40 changes: 39 additions & 1 deletion site/docs/relations/physical_relations.md
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,45 @@ The nested loop join operator does a join by holding the entire right input and
| Join Expression | A boolean condition that describes whether each record from the left set "match" the record from the right set. | Optional. Defaults to true (a Cartesian join). |
| Join Type | One of the join types defined in the Join operator. | Required |


## Duplicate Eliminated Join Operator
The Duplicate Eliminated Join, along with the [Duplicate Eliminated Get Operator](physical_relations.md#duplicate-eliminated-get-operator) are the two necessary operators that enable general subquery unnesting. (See the [Unnesting Arbitrary Queries](https://cs.emis.de/LNI/Proceedings/Proceedings241/383.pdf) paper for more information.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are the two necessary operators that enable general subquery unnesting

I'm not entirely sure I understand (but this is not surprising as I am not an expert in relational algebra and had difficulty understanding the paper). From my read of the paper it seems that general unnesting can be used to convert a query with dependent joins into a query without them. Duplicate elimnated joins seem to be an optimization that is useful to simplify plans created by generate unnesting but not strictly needed to enable it.

Also, duplicate eliminated joins seems to be a general optimization and not specific to query unnesting. Though perhaps it is mostly useful in that context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The paper is indeed very difficult to understand. There is also a video from @Mytherin explaining the topic.

The duplicate eliminated join is not only an optimization, but rather a necessary technique to get rid of Dependent Joins. In some cases you don't need a duplicate eliminated join to de-correlate but on others they are necessary.

I'm not sure what you mean by:
"Also, duplicate eliminated joins seems to be a general optimization and not specific to query unnesting. Though perhaps it is mostly useful in that context."
I'm not aware of other scenarios other than correlated subqueries.


The Duplicate Eliminated Join is essentially a [Regular Join Operator](logical_relations.md#join-operator). It can have any regular join type, and its execution is the same. The main difference is that one of its children has, somewhere in its subtree, a dependency on the deduplicated result of the other. Therefore, this operator pushes the deduplicated result to its dependent child via the Duplicate Eliminated Get Operator. The side that will be deduplicated is specified in the Duplicate Eliminated Side property. The other side is the one that depends on the deduplication.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm following things correctly, the goal here is to take the hashtable from the build side and push that hash table into branch that calculates the probe input?

Also, we keep talking about "deduplicated result" but we are only talking about the key columns and not the entire column set correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, it's not the hastable but actually the deduplicated side. You can see that the Duplicate Eliminated Get has a repeated Expression.FieldReference column_ids = 3; representing the deduplicated columns.

pdet marked this conversation as resolved.
Show resolved Hide resolved

| Signature | Value |
| -------------------- |----------------------------------------------------------------------------------------------|
| Inputs | 2 |
| Outputs | 1 |
pdet marked this conversation as resolved.
Show resolved Hide resolved
| Property Maintenance | It is the same as the [Hash Equijoin Operator](physical_relations.md#hash-equijoin-operator) |
| Direct Output Order | Same as the [Join](logical_relations.md#join-operator) operator. |

### Duplicate Eliminated Join Properties

| Property | Description | Required |
|---------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------|
| Left Input | A relational input. | Required |
| Right Input | A relational input. | Required |
| Left Keys | References to the fields to join on in the left input. | Required |
| Right Keys | References to the fields to join on in the right input. | Required |
| Post Join Predicate | An additional expression that can be used to reduce the output of the join operation post the equality condition. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys. | Optional, defaults true. |
| Join Type | One of the join types defined in the Join operator. | Required |
| Duplicate Eliminated Side | The side that is deduplicated and pushed into the other side. | Required |
pdet marked this conversation as resolved.
Show resolved Hide resolved
pdet marked this conversation as resolved.
Show resolved Hide resolved

## Duplicate Eliminated Get Operator
An operator that takes as its input the result of the deduplicated side of the Duplicate Eliminated Join. It simply scans the input and outputs the deduplicated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the input to this relation? The deduplicated join relation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is the deduplicated side of the Duplicate Eliminated Join

| Signature | Value |
| -------------------- |-------------------------------------------------------------------------------------|
| Inputs | 1 |
| Outputs | 1 |
| Property Maintenance | Distribution is not maintained due to the deduplication. Orderedness is eliminated. |
| Direct Output Order | It will only project the deduplicated columns from it's input |

### Duplicate Eliminated Get Properties

| Property | Description | Required |
|------------|----------------------------------------------------|-----------------------------|
| Input | A relational input. | Required |
| Column IDs | The columns that were deduplicated from the input. | Required |
pdet marked this conversation as resolved.
Show resolved Hide resolved

## Merge Equijoin Operator

Expand Down
Loading