[SPARK-45918][PS] Optimize `MultiIndex.symmetric_difference`
### What changes were proposed in this pull request?

Optimize `MultiIndex.symmetric_difference`.

### Why are the changes needed?

Currently, the `XOR` operation `a.union(b).subtract(a.intersect(b))` is not optimal:

```
>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> midx1 = pd.MultiIndex([['lama', 'cow', 'falcon'],
...                        ['speed', 'weight', 'length']],
...                       [[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                        [0, 0, 0, 0, 1, 2, 0, 1, 2]])
>>> midx2 = pd.MultiIndex([['pandas-on-Spark', 'cow', 'falcon'],
...                        ['speed', 'weight', 'length']],
...                       [[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                        [0, 0, 0, 0, 1, 2, 0, 1, 2]])
>>> s1 = ps.Series([45, 200, 1.2, 30, 250, 1.5, 320, 1, 0.3],
...                index=midx1)
>>> s2 = ps.Series([45, 200, 1.2, 30, 250, 1.5, 320, 1, 0.3],
...                index=midx2)
>>> s1.index.symmetric_difference(s2.index)._internal.spark_frame.explain("extended")
```

Before this PR:

```
== Optimized Logical Plan ==
Aggregate [__index_level_0__#0, __index_level_1__#1], [__index_level_0__#0, __index_level_1__#1, monotonically_increasing_id() AS __natural_order__#161L]
+- Union false, false
   :- Join LeftAnti, ((__index_level_0__#0 <=> __index_level_0__#145) AND (__index_level_1__#1 <=> __index_level_1__#146))
   :  :- Project [__index_level_0__#0, __index_level_1__#1]
   :  :  +- LogicalRDD [__index_level_0__#0, __index_level_1__#1, 0#2], false
   :  +- Aggregate [__index_level_0__#145, __index_level_1__#146], [__index_level_0__#145, __index_level_1__#146]
   :     +- Join LeftSemi, ((__index_level_0__#145 <=> __index_level_0__#149) AND (__index_level_1__#146 <=> __index_level_1__#150))
   :        :- Project [__index_level_0__#145, __index_level_1__#146]
   :        :  +- LogicalRDD [__index_level_0__#145, __index_level_1__#146, 0#147], false
   :        +- Project [__index_level_0__#149, __index_level_1__#150]
   :           +- LogicalRDD [__index_level_0__#149, __index_level_1__#150, 0#151], false
   +- Join LeftAnti, ((__index_level_0__#11 <=> __index_level_0__#145) AND (__index_level_1__#12 <=> __index_level_1__#146))
      :- Project [__index_level_0__#11, __index_level_1__#12]
      :  +- LogicalRDD [__index_level_0__#11, __index_level_1__#12, 0#13], false
      +- Aggregate [__index_level_0__#145, __index_level_1__#146], [__index_level_0__#145, __index_level_1__#146]
         +- Join LeftSemi, ((__index_level_0__#145 <=> __index_level_0__#149) AND (__index_level_1__#146 <=> __index_level_1__#150))
            :- Project [__index_level_0__#145, __index_level_1__#146]
            :  +- LogicalRDD [__index_level_0__#145, __index_level_1__#146, 0#147], false
            +- Project [__index_level_0__#149, __index_level_1__#150]
               +- LogicalRDD [__index_level_0__#149, __index_level_1__#150, 0#151], false
```

After this PR:

```
== Optimized Logical Plan ==
Project [__index_level_0__#0, __index_level_1__#1, monotonically_increasing_id() AS __natural_order__#64L]
+- Filter ((isnotnull(__multi_index_min_tag__#46) AND isnotnull(__multi_index_max_tag__#47)) AND (__multi_index_min_tag__#46 = __multi_index_max_tag__#47))
   +- Aggregate [__index_level_0__#0, __index_level_1__#1], [__index_level_0__#0, __index_level_1__#1, min(__multi_index_tag__#30) AS __multi_index_min_tag__#46, max(__multi_index_tag__#30) AS __multi_index_max_tag__#47]
      +- Union false, false
         :- Project [__index_level_0__#0, __index_level_1__#1, 0 AS __multi_index_tag__#30]
         :  +- LogicalRDD [__index_level_0__#0, __index_level_1__#1, 0#2], false
         +- Project [__index_level_0__#11, __index_level_1__#12, 1 AS __multi_index_tag__#34]
            +- LogicalRDD [__index_level_0__#11, __index_level_1__#12, 0#13], false
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#43795 from zhengruifeng/ps_multi_index_opt.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
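
The optimized plan replaces the two anti-joins with a single tag-and-aggregate pass: each side is tagged (`0` or `1`), the sides are unioned, rows are grouped by the index columns, and a key is kept only when its minimum and maximum tags are equal, meaning it appeared on one side only. The following is a minimal pure-Python sketch of that idea, not the pandas-on-Spark implementation; the function name `symmetric_difference_by_tag` and the sample keys are hypothetical and chosen for illustration.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Key = Tuple[Hashable, ...]


def symmetric_difference_by_tag(left: Iterable[Key], right: Iterable[Key]) -> List[Key]:
    """Return keys present in exactly one input, via tagging and min/max aggregation."""
    # key -> (min_tag, max_tag); mirrors the Aggregate node's min()/max() columns.
    tags: Dict[Key, Tuple[int, int]] = {}
    # Tag the left side with 0 and the right side with 1, then "union" them
    # by feeding both into the same grouping dictionary.
    for tag, keys in ((0, left), (1, right)):
        for key in keys:
            if key in tags:
                lo, hi = tags[key]
                tags[key] = (min(lo, tag), max(hi, tag))
            else:
                tags[key] = (tag, tag)
    # Filter step: min_tag == max_tag means the key came from only one side.
    return [key for key, (lo, hi) in tags.items() if lo == hi]


# Hypothetical sample data, loosely modelled on the index values above.
left_keys = [("lama", "speed"), ("cow", "speed"), ("cow", "weight")]
right_keys = [("pandas-on-Spark", "speed"), ("cow", "speed"), ("cow", "weight")]
result = symmetric_difference_by_tag(left_keys, right_keys)
```

Because grouping collapses repeated keys, duplicates within one side yield a single output row, which matches the set semantics of the original union-minus-intersection formulation.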