orcScan reads missing data column #715

Open
ASiegeLion opened this issue Dec 23, 2024 · 2 comments

@ASiegeLion
Contributor

ASiegeLion commented Dec 23, 2024

Blaze drops columns when reading ORC-format data.
Error log:
java.lang.RuntimeException: poll record batch error: Execution error: native execution panics: Execution error: Execution error: output_with_sender[Shuffle] error: Execution error: output_with_sender[Limit] error: Execution error: output_with_sender[Limit]: output() returns error: Execution error: Execution error: output_with_sender[Project]: output() returns error: Execution error: Execution error: index out of bounds: the len is 31 but the index is 31
at org.apache.spark.sql.blaze.JniBridge.nextBatch(Native Method)
at org.apache.spark.sql.blaze.BlazeCallNativeWrapper$$anon$1.hasNext(BlazeCallNativeWrapper.scala:80)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at org.apache.spark.util.CompletionIterator.foreach(CompletionIterator.scala:25)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)

Looking at the corresponding code, I traced the problem to the `open` function of `FileOpener` in orc_exec.rs:
image

`ProjectionMask::roots(builder.file_metadata().root_data_type(), projection);` The generated `projection` does not match the indices that the ORC mask requires.

ORC organizes its data as follows:
image
For example, given this Hive schema:

```
biz_col_name_list: List<String>, column_index 0
dist_scene_list: List<String>, column_index 1
entry_name_1st: String, column_index 2
entry_name_2nd: String, column_index 3
```

the ORC metadata is:

```
RootDataType {
    children: [
        NamedColumn { name: "biz_col_name_list", data_type: List { column_index: 1, child: String { column_index: 2 } } },
        NamedColumn { name: "dist_scene_list", data_type: List { column_index: 3, child: String { column_index: 4 } } },
        NamedColumn { name: "entry_name_1st", data_type: String { column_index: 5 } },
        NamedColumn { name: "entry_name_2nd", data_type: String { column_index: 6 } }
    ]
}
```

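As this shows, the column indices in the Hive schema differ from those in the ORC metadata: ORC also assigns an index to each nested child type, so the top-level columns are no longer numbered consecutively.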
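
The two numbering schemes can be reproduced with a small sketch. This is only an illustration, not the Blaze or orc-rust code: `OrcType`, `id_width`, `assign_column_ids` and `example_schema` are hypothetical names introduced here. It assumes ORC numbers its type tree in pre-order with the root struct at id 0, which is consistent with the metadata dump above.

```rust
/// Simplified stand-in for an ORC type tree (hypothetical, not the orc-rust types).
#[derive(Debug, Clone)]
enum OrcType {
    String,
    List(Box<OrcType>),
}

/// Number of ORC column ids a type occupies: one for itself plus its nested children.
fn id_width(ty: &OrcType) -> usize {
    match ty {
        OrcType::String => 1,
        OrcType::List(child) => 1 + id_width(child),
    }
}

/// Walk the top-level fields in pre-order and return, for each one,
/// (field name, position among the root's children, ORC column id).
/// The root struct itself is assumed to take id 0, so the first field gets id 1.
fn assign_column_ids(fields: &[(&str, OrcType)]) -> Vec<(String, usize, usize)> {
    let mut next_id = 1;
    let mut out = Vec::with_capacity(fields.len());
    for (pos, (name, ty)) in fields.iter().enumerate() {
        out.push((name.to_string(), pos, next_id));
        // Skip over the ids consumed by this field and its nested types.
        next_id += id_width(ty);
    }
    out
}

/// The four columns from the example schema in this issue.
fn example_schema() -> Vec<(&'static str, OrcType)> {
    vec![
        ("biz_col_name_list", OrcType::List(Box::new(OrcType::String))),
        ("dist_scene_list", OrcType::List(Box::new(OrcType::String))),
        ("entry_name_1st", OrcType::String),
        ("entry_name_2nd", OrcType::String),
    ]
}
```

Calling `assign_column_ids(&example_schema())` yields positions 0, 1, 2, 3 alongside ORC column ids 1, 3, 5, 6. Whichever of the two numberings a projection API expects, passing the other one selects the wrong columns or runs past the end of the column list.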
 






@cxzl25
Contributor

cxzl25 commented Dec 26, 2024

#716 has been merged, is this issue ready to be closed?

@ASiegeLion
Contributor Author

> #716 has been merged, is this issue ready to be closed?

Yes.
