[CORE] Gluten should honor the spark configs as much as possible #8043

FelixYBW · 2024-11-26T01:33:19Z

Description

During the analysis of spill in issue #8025, we noted that some issues are common to Gluten and vanilla Spark, such as the spill read/write buffer size. Some configurations are not even documented in Spark, such as spark.unsafe.sorter.spill.reader.buffer.size.

Furthermore, I noticed that Gluten doesn't have a list of the Spark configurations it honors. We should add such a list to the documentation. @zhouyuan

FelixYBW · 2024-11-26T04:26:29Z

@marin-ma which of the following shuffle configurations are used by Gluten? Can you help fill in the missing ones? Feel free to correct anything. "Gluten Config" is the config renamed by Gluten; where one exists, we should either remove the Gluten config or set its default value to the Spark config's value.

Spark Config	Respected by Gluten	Transparent to Gluten
spark.reducer.maxSizeInFlight	Y	Y
spark.reducer.maxReqsInFlight	Y	Y
spark.reducer.maxBlocksInFlightPerAddress	Y	Y
spark.shuffle.compress	Y
spark.io.compression.codec	Only if `spark.gluten.sql.columnar.shuffle.codec` is not set
spark.shuffle.file.buffer	N Gluten uses fixed 16k buffer
spark.shuffle.unsafe.file.output.buffer	N
spark.shuffle.spill.diskWriteBufferSize	N Gluten uses fixed 16k buffer
spark.shuffle.io.maxRetries	Y	Y
spark.shuffle.io.numConnectionsPerPeer	Y	Y
spark.shuffle.io.preferDirectBufs	Y	Y
spark.shuffle.io.retryWait	Y	Y
spark.shuffle.io.backLog	Y	Y
spark.shuffle.io.connectionTimeout	Y	Y
spark.shuffle.io.connectionCreationTimeout	Y	Y
spark.shuffle.service.enabled	Y	Y
spark.shuffle.service.port	Y	Y
spark.shuffle.service.name	Y	Y
spark.shuffle.service.index.cache.size	Y	Y
spark.shuffle.service.removeShuffle	Y	Y
spark.shuffle.maxChunksBeingTransferred	Y	Y
spark.shuffle.sort.bypassMergeThreshold	N
spark.shuffle.sort.io.plugin.class	N
spark.shuffle.spill.compress	N Gluten uses spark.shuffle.compress
spark.shuffle.accurateBlockThreshold	Y	Y
spark.shuffle.registration.timeout	Y	Y
spark.shuffle.registration.maxAttempts	Y	Y
spark.shuffle.reduceLocality.enabled	Y	Y
spark.shuffle.mapOutput.minSizeForBroadcast	Y	Y
spark.shuffle.mapOutput.dispatcher.numThreads	Y	Y
spark.shuffle.detectCorrupt	Y	Y
spark.shuffle.detectCorrupt.useExtraMemory	Y	Y
spark.shuffle.useOldFetchProtocol	Y	Y
spark.shuffle.readHostLocalDisk	Y	Y
spark.files.io.connectionTimeout	Y	Y
spark.files.io.connectionCreationTimeout	Y	Y
spark.shuffle.checksum.enabled	N Gluten currently doesn't support it
spark.shuffle.checksum.algorithm	N Gluten currently doesn't support it
spark.shuffle.service.fetch.rdd.enabled	Y	Y
spark.shuffle.service.db.enabled	Y	Y
spark.shuffle.service.db.backend	Y	Y
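
The codec row in the table above describes a fallback: the Spark key is used only when the Gluten key is not set. A minimal sketch of that resolution order, assuming the configuration is available as a plain dictionary; `resolve_shuffle_codec` is a hypothetical helper for illustration, not Gluten's actual code, and the `lz4` default is Spark's documented default for `spark.io.compression.codec`:

```python
# Hypothetical helper illustrating the fallback order from the table above.
# It is not taken from the Gluten code base.
GLUTEN_CODEC_KEY = "spark.gluten.sql.columnar.shuffle.codec"
SPARK_CODEC_KEY = "spark.io.compression.codec"
SPARK_DEFAULT_CODEC = "lz4"  # Spark's documented default for the Spark key


def resolve_shuffle_codec(conf):
    """Return the shuffle codec, preferring the Gluten key when it is set."""
    gluten_value = conf.get(GLUTEN_CODEC_KEY)
    if gluten_value:
        return gluten_value
    # The Gluten key is unset, so honor the Spark config (or its default).
    return conf.get(SPARK_CODEC_KEY, SPARK_DEFAULT_CODEC)
```

The same pattern would apply to any renamed config: the Gluten key acts as an override, and the Spark key supplies the default.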

yikf · 2024-11-26T07:06:44Z

@FelixYBW thank you for initiating this. I also noticed some configurations related to table write. Please correct me if I'm missing something.

Spark Config	Respected by Gluten	Gluten Config	Transparent to Gluten
spark.sql.parquet.compression.codec	Y	N/A	Y

The Parquet configuration options set in the table options are also respected by Gluten, as follows:

Options	Respected by Gluten	Gluten Config	Transparent to Gluten
compression	Y	N/A	Y
parquet.compression	Y	N/A	Y
parquet.block.size	Y	spark.gluten.sql.columnar.parquet.write.blockSize	Y
parquet.block.rows	Y	spark.gluten.sql.native.parquet.write.blockRows	Y
parquet.gzip.windowSize	Y	N/A	Y

Apart from the configurations above, no other Spark configuration related to table write is currently respected by Gluten.
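
To make the list above easier to check against a real job, here is a small sketch that splits a set of write options into those the table marks as respected and everything else. `split_write_options` is a hypothetical helper, and the exact-match lookup is a simplification (Spark generally treats option keys case-insensitively):

```python
# Option names copied from the table above; anything else is treated as
# not respected, per the statement in this comment.
RESPECTED_WRITE_OPTIONS = {
    "compression",
    "parquet.compression",
    "parquet.block.size",
    "parquet.block.rows",
    "parquet.gzip.windowSize",
}


def split_write_options(options):
    """Split write options into (respected, ignored) dictionaries."""
    respected = {k: v for k, v in options.items() if k in RESPECTED_WRITE_OPTIONS}
    ignored = {k: v for k, v in options.items() if k not in RESPECTED_WRITE_OPTIONS}
    return respected, ignored
```

For example, passing `{"compression": "zstd", "parquet.page.size": "1048576"}` would put `compression` in the first dictionary and `parquet.page.size` in the second.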

marin-ma · 2024-11-26T07:34:05Z

@FelixYBW Updated the table. PTAL. Thanks!

FelixYBW · 2024-11-26T17:54:44Z


Thank you! Can you submit a PR to support the configs:

spark.shuffle.file.buffer
spark.shuffle.spill.diskWriteBufferSize
spark.shuffle.spill.compress
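
As a rough illustration of what honoring `spark.shuffle.file.buffer` could involve, the sketch below parses a Spark-style size string and falls back to the fixed 16k buffer mentioned in the table when the key is unset. The helper names are hypothetical, the unit table is a simplified subset of what Spark accepts, and both the KiB default unit and the choice to keep 16k as the fallback (rather than Spark's own default) are assumptions:

```python
import re

# Simplified subset of the binary size suffixes Spark accepts.
_UNITS = {"b": 1, "k": 1024, "kb": 1024, "m": 1024**2, "mb": 1024**2,
          "g": 1024**3, "gb": 1024**3}

# Assumption: keep the current fixed 16k buffer when the Spark key is unset.
FIXED_BUFFER_BYTES = 16 * 1024


def parse_size(value, default_unit):
    """Parse strings such as '64k' or '32' into a byte count."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]*)\s*", value.lower())
    if match is None:
        raise ValueError(f"invalid size string: {value!r}")
    number, unit = match.groups()
    unit = unit or default_unit  # a bare number uses the config's default unit
    if unit not in _UNITS:
        raise ValueError(f"unknown size unit: {unit!r}")
    return int(number) * _UNITS[unit]


def shuffle_file_buffer_bytes(conf):
    """Return the shuffle file buffer size in bytes for the given config."""
    raw = conf.get("spark.shuffle.file.buffer")
    if raw is None:
        return FIXED_BUFFER_BYTES
    # Assumption: a bare number for this key is interpreted as KiB.
    return parse_size(raw, default_unit="k")
```

`spark.shuffle.spill.diskWriteBufferSize` could reuse `parse_size` with a different default unit.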

jinchengchenghh · 2024-11-27T08:34:32Z

We may also need to document the mapping between Spark and Velox configs.

zhouyuan · 2024-11-28T06:19:36Z

Looks like a long-tail task again. There are also some gaps in the data loading part: the configs for HDFS, Iceberg, Hudi and Delta Lake, as these are newly implemented in native C++.
For other operators (other than spill and table scan) and expressions (other than ANSI), it should be aligned with Spark.

FelixYBW · 2024-11-28T19:03:14Z

> looks like a long tail task again. There are also some gaps on the data loading part: the configs on HDFS, Iceberg, Hudi and DeltaLake. As these are newly implemented in native C++. For other operators (other than Spill and table scan) and expressions (other than ANSI) it should be aligned with Spark.

Let's start from Spark 3.5's configurations here and go through the categories one by one.

https://spark.apache.org/docs/3.5.1/configuration.html

FelixYBW added the enhancement New feature or request label Nov 26, 2024

zhztheplayer changed the title ~~[Core] Gluten should honor the spark configs as much as possible~~ [CORE] Gluten should honor the spark configs as much as possible Nov 26, 2024

FelixYBW mentioned this issue Nov 26, 2024

[GLUTEN-8039][VL] Native writer should respect table properties #8040

Merged
