[GLUTEN-7953][VL] Fetch and dump all inputs for micro benchmark on middle stage begin #7998
Conversation
Does this PR only work for the middle stage? Or does it dump the data that comes from shuffle?
@Yohahaha Yes. It only works for the middle stage. Updated the PR title.
@zhztheplayer Currently, the input files are dumped during the middle stage of execution. This PR fetches all the data and saves it into a file before the pipeline starts. The stage input is then read from this dumped file. This preserves a complete input file even if the task fails, so we have the full input data of the failed task and can reproduce the failure with the microbenchmark.
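A minimal sketch of that dump-then-replay idea, kept self-contained with plain Arrow C++ (the PR itself writes Parquet through Gluten's writer abstraction; `dumpAndReplay` and the IPC file format here are illustrative, not Gluten's actual API):

```cpp
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

#include <memory>
#include <string>
#include <vector>

// Dump every input batch to a file before execution starts, then replay
// the stage input from that file.
arrow::Status dumpAndReplay(
    const std::vector<std::shared_ptr<arrow::RecordBatch>>& input,
    const std::string& path) {
  if (input.empty()) {
    return arrow::Status::OK();
  }
  // 1) Drain the upstream input completely and persist each batch, so the
  //    full input survives even if the task later fails mid-pipeline.
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  ARROW_ASSIGN_OR_RAISE(
      auto writer, arrow::ipc::MakeFileWriter(sink, input.front()->schema()));
  for (const auto& rb : input) {
    ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*rb));
  }
  ARROW_RETURN_NOT_OK(writer->Close());

  // 2) The stage then reads its input back from the dumped file instead of
  //    from the live shuffle iterator.
  ARROW_ASSIGN_OR_RAISE(auto source, arrow::io::ReadableFile::Open(path));
  ARROW_ASSIGN_OR_RAISE(auto reader,
                        arrow::ipc::RecordBatchFileReader::Open(source));
  for (int i = 0; i < reader->num_record_batches(); ++i) {
    ARROW_ASSIGN_OR_RAISE(auto rb, reader->ReadRecordBatch(i));
    // ... feed rb into the pipeline ...
    (void)rb;
  }
  return arrow::Status::OK();
}
```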
cpp/core/jni/JniCommon.cc
Outdated
// Import the batch handed over through the Arrow C data interface.
auto rb = gluten::arrowGetOrThrow(arrow::ImportRecordBatch(array.get(), schema.get()));
GLUTEN_THROW_NOT_OK(writer_->initWriter(*(rb->schema().get())));
GLUTEN_THROW_NOT_OK(writer_->writeInBatches(rb));
// Keep pulling until the Java-side batch iterator is exhausted.
} while (env->CallBooleanMethod(jColumnarBatchItr_, serializedColumnarBatchIteratorHasNext_));
checkEnv after CallBooleanMethod
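The suggestion is to validate the JNI environment right after each CallBooleanMethod, since a pending Java exception would make the returned value meaningless. A minimal sketch, assuming a local helper (Gluten's real exception-check utility may differ):

```cpp
#include <jni.h>
#include <stdexcept>

// Check for a pending Java exception after a JNI call; "checkEnv" is a
// hypothetical helper for this sketch, not necessarily Gluten's own name.
static void checkEnv(JNIEnv* env) {
  if (env->ExceptionCheck()) {
    env->ExceptionDescribe(); // print the pending Java exception to stderr
    env->ExceptionClear();    // clear it so subsequent JNI calls stay legal
    throw std::runtime_error("Java exception raised during JNI call");
  }
}

// Usage, mirroring the loop condition above:
//   bool hasNext = env->CallBooleanMethod(
//       jColumnarBatchItr_, serializedColumnarBatchIteratorHasNext_);
//   checkEnv(env);
```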
We get all the batches, so the batch iterator cannot be reused. Will it affect the following stage?
@jinchengchenghh In what cases would the batch iterator be reused? Could you elaborate more on this? Thanks!
}
  return reader_->next();
}
} // namespace gluten
Add an empty line at the end of the file.
Run Gluten Clickhouse CI on x86
Works. Thank you, Rong.
Collect all input data and save it into a Parquet file. Then, read the data from the Parquet file to feed it into the pipeline.
Update spark.gluten.sql.benchmark_task.partitionId and spark.gluten.sql.benchmark_task.taskId to accept a comma-separated string of multiple partition ids/task ids.

Currently, the input files are dumped during the middle stage of execution. This PR fetches all the data and saves it into a file before the pipeline starts. The stage input is then read from this dumped file. This preserves a complete input file even if the task fails. In this way, we have the full input data of the failed task and can reproduce the failure with the microbenchmark.
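For example, to dump and replay several partitions/tasks of a stage in one run (the id values below are purely illustrative):

```
--conf spark.gluten.sql.benchmark_task.partitionId=0,3,7
--conf spark.gluten.sql.benchmark_task.taskId=1001,1002,1003
```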