[KYUUBI #6691] A new Spark SQL command to merge small files #6695

gabrywu · 2024-09-13T09:03:04Z

🔍 Description

Issue References 🔗

This pull request closes #6691.

Describe Your Solution 🔧

In many cases a SQL statement generates small files, and we must merge them into larger ones.
I created a new Spark SQL command to merge small files. It does not read and rewrite every record of a table; it merges files at the binary level. Take a CSV table for example: it only appends the byte array of one file to another, without reading or writing records.
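
To illustrate the binary-level idea for the CSV case, here is a minimal sketch in Python. It is not the Scala code in this PR: the function name `merge_plain_files` is hypothetical, and it assumes headerless CSV files that each end with a newline, so plain byte concatenation yields a valid file.

```python
import shutil
from pathlib import Path


def merge_plain_files(sources, target):
    """Append the raw bytes of each source file to target, in order.

    Hypothetical helper for illustration only; assumes headerless,
    newline-terminated plain-text files such as CSV.
    """
    with open(target, "wb") as out:
        for src in sources:
            with open(src, "rb") as part:
                # Stream the bytes across; no record is parsed or re-encoded.
                shutil.copyfileobj(part, out)
    # Return the size of the merged file in bytes.
    return Path(target).stat().st_size
```

Columnar formats such as Parquet or ORC carry footers and metadata, so they cannot be merged by plain concatenation like this and need format-aware handling.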

Syntax:

compact table table_name [INTO ${targetFileSize} ${targetFileSizeUnit} ] [ cleanup | retain | list ]
-- targetFileSizeUnit can be 'm' or 'mb'
-- cleanup (the default) deletes the compact staging folders, which contain the original small files
-- retain keeps the compact staging folders, for testing; the table can be recovered from the staging data
-- list only reports the planned merge result without actually running it

recover compact table table_name
-- recovers a table if the compact table command fails
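
As a rough picture of what a target file size implies, the sketch below groups small files so that each group stays at or under the target, which is the kind of plan the list option would report without running it. This is an assumption made for illustration: the grouping logic actually used by this PR is not shown in this description, and `plan_merge_groups` is a hypothetical name.

```python
def plan_merge_groups(file_sizes, target_size):
    """Greedily group (path, size) pairs into merge groups.

    Hypothetical planner for illustration only. A group is closed as soon
    as adding the next file would push it past target_size bytes; a single
    file larger than target_size ends up in a group of its own.
    """
    groups, current, current_size = [], [], 0
    # Sort by size so the plan is deterministic for a given input.
    for path, size in sorted(file_sizes, key=lambda item: item[1]):
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

For example, with a 100-byte target, files of 40, 30, 50 and 90 bytes would be planned as three output files: the 30- and 40-byte files together, and the other two on their own.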

Types of changes 🔖

Bugfix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Test Plan 🧪

Behavior Without This Pull Request ⚰️

Behavior With This Pull Request 🎉

Related Unit Tests

Checklist 📝

This patch was not authored or co-authored using Generative Tooling

pan3793 · 2024-09-13T09:07:46Z

cc @ulysses-you

...yuubi-extension-spark-3-5/src/main/scala/org/apache/kyuubi/sql/KyuubiSparkSQLExtension.scala

extensions/spark/kyuubi-extension-spark-3-5/pom.xml

...ion-spark-3-5/src/main/scala/org/apache/kyuubi/sql/compact/CachePerformanceViewCommand.scala

codecov-commenter · 2024-09-15T07:24:08Z

Codecov Report

Attention: Patch coverage is 0% with 582 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (353877b) to head (fd39f66).
Report is 26 commits behind head on master.

Files with missing lines	Patch %	Lines
...ache/kyuubi/sql/compact/SmallFileCollectExec.scala	0.00%	75 Missing ⚠️
.../kyuubi/sql/compact/merge/AbstractFileMerger.scala	0.00%	61 Missing ⚠️
...yuubi/sql/compact/RecoverCompactTableCommand.scala	0.00%	46 Missing ⚠️
...ache/kyuubi/sql/compact/CompactTableResolver.scala	0.00%	43 Missing ⚠️
...a/org/apache/kyuubi/sql/compact/CompactTable.scala	0.00%	38 Missing ⚠️
...e/kyuubi/sql/compact/merge/ParquetFileMerger.scala	0.00%	36 Missing ⚠️
...apache/kyuubi/sql/compact/SmallFileMergeExec.scala	0.00%	34 Missing ⚠️
...uubi/sql/compact/CachePerformanceViewCommand.scala	0.00%	33 Missing ⚠️
.../apache/kyuubi/sql/compact/CompactTableUtils.scala	0.00%	33 Missing ⚠️
...a/org/apache/spark/sql/SparkInternalExplorer.scala	0.00%	26 Missing ⚠️
... and 12 more

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #6695    +/-   ##
=======================================
  Coverage    0.00%   0.00%            
=======================================
  Files         682     706    +24     
  Lines       42192   42861   +669     
  Branches     5755    5851    +96     
=======================================
- Misses      42192   42861   +669

gabrywu · 2024-09-15T15:29:25Z

@cxzl25 @ulysses-you all tests pass

ulysses-you · 2024-09-18T01:26:25Z

Thank you @gabrywu. About the syntax: is there any reason to introduce compact table? Why not use a procedure like

Call kyuubi.compact(table => 'xx', targetSizeInBytes => 'xxx', mode => 'xxx')

gabrywu · 2024-09-18T02:07:35Z

> Thank you @gabrywu. About the syntax: is there any reason to introduce compact table? Why not use a procedure like
> Call kyuubi.compact(table => 'xx', targetSizeInBytes => 'xxx', mode => 'xxx')

My answer mirrors the question: why use a procedure?
I created this command on Spark 2.x, when the CALL procedure was not yet the mainstream approach.
By the way, is the CALL procedure becoming the mainstream approach?

gabrywu · 2024-09-18T10:31:21Z

@cxzl25 what do you think of the call kyuubi.compact procedure?

gabrywu · 2024-09-19T01:34:22Z

@AngersZhuuuu can you help to review this PR?

pan3793 · 2024-09-19T02:24:45Z

For the syntax part, given Delta and Iceberg's dominance in the lakehouse market, I suggest following either Delta's VACUUM or Iceberg's CALL syntax.

Additional information:

The Kyuubi Spark extension has already adopted Delta's ZORDER syntax.
Spark 4.0 is adopting the Iceberg CALL syntax; see SPARK-48781.

gabrywu · 2024-09-19T05:46:19Z

We'd better discuss the syntax and make a final decision on the dev mailing list ([email protected]); otherwise, upcoming PRs will still not use the CALL procedure.
@pan3793 , @ulysses-you , @cxzl25

gabrywu · 2024-09-19T11:10:07Z

I started an email thread to decide which one should be used, a command or a CALL procedure:
[VOTE][DISCUSS] A Spark SQL command or Call procedure

turboFei · 2024-09-24T07:24:12Z

Do you support compacting a single partition of a partitioned table?

gabrywu · 2024-09-25T09:16:56Z

> Do you support compacting a single partition of a partitioned table?

Hi @turboFei, I removed this feature from this PR. Partitioned tables are only supported internally.

apache#6695

gabrywu · 2024-09-26T01:11:10Z

Closing this PR; I will create a new one if apache/spark/pull/47190 is released in the next Spark version, v4.0.0.

gabrywu added 10 commits September 11, 2024 15:53

1. involve a compact table command to merge small files

532369d

2. parser tests pass

parser tests pass

d2cbcc2

reformat all codes

1a92be0

SparkPlan resolved successfully

906b47e

reformat

4e9d8ca

adding unit test to recover command

350326e

compact table execution tests pass

0d30887

remove unnecessary comments

a52d501

recover compact table command tests pass

7b10692

more unit tests

0ef55ff

github-actions bot added module:spark kind:build module:extensions labels Sep 13, 2024

fix scala style issue

f04170b

gabrywu changed the title ~~Compact table~~ A new Spark SQL command to merge small files Sep 13, 2024

gabrywu added 4 commits September 13, 2024 19:11

reduce message count

02f6303

involve createToScalaConverter

cc0ecce

remove unused import

32276b5

remove unused import

c48b167

cxzl25 reviewed Sep 14, 2024

gabrywu added 3 commits September 14, 2024 19:55

remove unnecessary comment & reformat

b4fd2ad

involve SPECULATION_ENABLED_SYNONYM

79e4e94

involve createRandomTable

257dfd6

gabrywu requested a review from cxzl25 September 15, 2024 06:11

gabrywu added 4 commits September 15, 2024 16:33

try to catch unknown Row

66faca4

use Seq instead of WrappedArray

fd5a3c4

compile on scala-2.13 successfully

3775c95

remove unused import

dccc23a

rename compact-table.md to docs

42d1fc0

more unit test

github-actions bot added the kind:documentation Documentation is a feature! label Sep 16, 2024

gabrywu added 7 commits September 16, 2024 13:28

remove unused comments

232407a

support orc

5a34b97

involve toJavaList to compile on scala 2.13

c042dfa

add bzip2 unit tests

d35f7d4

spotless:apply

8b637de

fix getCodecFromFilePath for orc

7fbc805

reformat

59f0b99

support more codec

ab9b674

cxzl25 reviewed Sep 18, 2024

...nsion-spark-3-5/src/main/scala/org/apache/kyuubi/sql/compact/merge/PlainFileLikeMerger.scala (outdated)

cxzl25 reviewed Sep 18, 2024

...extension-spark-3-5/src/main/scala/org/apache/kyuubi/sql/compact/CompressionCodecsUtil.scala (outdated)

cxzl25 reviewed Sep 18, 2024

...ion-spark-3-5/src/main/scala/org/apache/kyuubi/sql/compact/CachePerformanceViewCommand.scala (outdated)

gabrywu added 2 commits September 18, 2024 18:20

remove unused util class, close opened stream in finally block

22f0b79

rollback regardless of the success or failure of the command

8cc2390

gabrywu requested a review from cxzl25 September 18, 2024 10:25

reformat

fd39f66

gabrywu added a commit to gabrywu/kyuubi that referenced this pull request Sep 25, 2024

sync up with compact-table branch (#1)

a86f260

apache#6695

gabrywu closed this Sep 26, 2024
