Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feat](iceberg)Supports using rest type catalog to read tables in unity catalog #43525

Merged
merged 4 commits into from
Nov 22, 2024

Conversation

wuwenchi
Copy link
Contributor

@wuwenchi wuwenchi commented Nov 8, 2024

What problem does this PR solve?

  1. We now support using the rest type catalog to read tables in the unity catalog (https://github.com/unitycatalog/unitycatalog).
  2. When reading the parquet file on the be side, we find the corresponding column name based on the column id, which naturally supports the column rename function.

example:

CREATE CATALOG `uc3`
PROPERTIES (
"type"  =  "iceberg",
"iceberg.catalog.type"  =  "rest",
"uri"  =  "http://127.0.0.1:8080/api/2.1/unity-catalog/iceberg",
"external_catalog.name" = "unity"  --- catalog name in unity catalog
);

Release Note

feat support iceberg unity catalog

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

@@ -259,6 +259,10 @@ std::vector<tparquet::KeyValue> ParquetReader::get_metadata_key_values() {
return _t_metadata->key_value_metadata;
}

const FieldDescriptor ParquetReader::get_file_metadata_schema() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: return type 'const std::doris::vectorized::FieldDescriptor' is 'const'-qualified at the top level, which may reduce code readability without improving const correctness [readability-const-return-type]

Suggested change
const FieldDescriptor ParquetReader::get_file_metadata_schema() {
FieldDescriptor ParquetReader::get_file_metadata_schema() {

be/src/vec/exec/format/parquet/vparquet_reader.h:151:

-     const FieldDescriptor get_file_metadata_schema();
+     FieldDescriptor get_file_metadata_schema();

@wuwenchi
Copy link
Contributor Author

wuwenchi commented Nov 8, 2024

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.91% (9867/26030)
Line Coverage: 29.09% (82270/282777)
Region Coverage: 28.26% (42368/149947)
Branch Coverage: 24.80% (21452/86500)
Coverage Report: http://coverage.selectdb-in.cc/coverage/0b472c1d54bb438937c943b2672ca3caa904e331_0b472c1d54bb438937c943b2672ca3caa904e331/report/index.html

@wuwenchi
Copy link
Contributor Author

wuwenchi commented Nov 8, 2024

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.91% (9868/26030)
Line Coverage: 29.10% (82278/282777)
Region Coverage: 28.25% (42367/149947)
Branch Coverage: 24.80% (21451/86500)
Coverage Report: http://coverage.selectdb-in.cc/coverage/b2105ca203b130292bf414c0203cbd4d5b5bf1bf_b2105ca203b130292bf414c0203cbd4d5b5bf1bf/report/index.html

icebergCatalog = ((HMSExternalCatalog) key.catalog).getIcebergHiveCatalog();
Catalog icebergCatalog = ((HMSExternalCatalog) key.catalog).getIcebergHiveCatalog();
icebergTable = HiveMetaStoreClientHelper.ugiDoAs(((ExternalCatalog) key.catalog).getConfiguration(),
() -> icebergCatalog.loadTable(TableIdentifier.of(key.dbName, key.tableName)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should unify the interface, both using catalog.loadTable() or using metadataOps.loadTable()

@@ -39,6 +39,7 @@ public abstract class IcebergExternalCatalog extends ExternalCatalog {
public static final String ICEBERG_HADOOP = "hadoop";
public static final String ICEBERG_GLUE = "glue";
public static final String ICEBERG_DLF = "dlf";
public static final String EXTERNAL_SERVER_CATALOG_NAME = "external_server_catalog_name";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public static final String EXTERNAL_SERVER_CATALOG_NAME = "external_server_catalog_name";
public static final String EXTERNAL_SERVER_CATALOG_NAME = "external_catalog.name";

_has_schema_change = true;
}
}
Status IcebergParquetReader::_gen_col_name_maps(FieldDescriptor field_desc) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Status IcebergParquetReader::_gen_col_name_maps(FieldDescriptor field_desc) {
Status IcebergParquetReader::_gen_col_name_maps(const FieldDescriptor& field_desc) {

@@ -149,6 +149,7 @@ class ParquetReader : public GenericReader {
const std::unordered_map<std::string, VExprContextSPtr>& missing_columns) override;

std::vector<tparquet::KeyValue> get_metadata_key_values();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method can be removed?

@@ -218,7 +218,7 @@ class IcebergParquetReader final : public IcebergTableReader {
parquet_reader->set_delete_rows(&_iceberg_delete_rows);
}

Status _gen_col_name_maps(std::vector<tparquet::KeyValue> parquet_meta_kv);
Status _gen_col_name_maps(FieldDescriptor field_desc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need also modify this method in IcebergOrcReader

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The orc format is originally associated according to the id, so there is no need to modify it.

@wuwenchi
Copy link
Contributor Author

run buildall

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

@@ -147,6 +150,14 @@ Status FieldDescriptor::parse_from_thrift(const std::vector<tparquet::SchemaElem
return Status::OK();
}

const doris::Slice FieldDescriptor::get_column_name_from_field_id(int32_t id) const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: return type 'const doris::Slice' is 'const'-qualified at the top level, which may reduce code readability without improving const correctness [readability-const-return-type]

Suggested change
const doris::Slice FieldDescriptor::get_column_name_from_field_id(int32_t id) const {
doris::Slice FieldDescriptor::get_column_name_from_field_id(int32_t id) const {

be/src/vec/exec/format/parquet/schema_desc.h:137:

-     const doris::Slice get_column_name_from_field_id(int32_t id) const;
+     doris::Slice get_column_name_from_field_id(int32_t id) const;

// This is for iceberg schema evolution.
std::vector<tparquet::KeyValue> ParquetReader::get_metadata_key_values() {
return _t_metadata->key_value_metadata;
const FieldDescriptor ParquetReader::get_file_metadata_schema() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: return type 'const std::doris::vectorized::FieldDescriptor' is 'const'-qualified at the top level, which may reduce code readability without improving const correctness [readability-const-return-type]

Suggested change
const FieldDescriptor ParquetReader::get_file_metadata_schema() {
FieldDescriptor ParquetReader::get_file_metadata_schema() {

be/src/vec/exec/format/parquet/vparquet_reader.h:150:

-     const FieldDescriptor get_file_metadata_schema();
+     FieldDescriptor get_file_metadata_schema();

@wuwenchi
Copy link
Contributor Author

run buildall

@morningman morningman changed the title [bugfix](iceberg)Supports using rest type catalog to read tables in unity catalog [feat](iceberg)Supports using rest type catalog to read tables in unity catalog Nov 12, 2024
@wuwenchi
Copy link
Contributor Author

run buildall

@wuwenchi
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.94% (9887/26058)
Line Coverage: 29.14% (82615/283527)
Region Coverage: 28.28% (42483/150238)
Branch Coverage: 24.85% (21546/86688)
Coverage Report: http://coverage.selectdb-in.cc/coverage/705354f1b6486cfd3a273d7ceb1886212af124f4_705354f1b6486cfd3a273d7ceb1886212af124f4/report/index.html

@wuwenchi
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.93% (9886/26063)
Line Coverage: 29.13% (82605/283594)
Region Coverage: 28.26% (42470/150264)
Branch Coverage: 24.85% (21546/86698)
Coverage Report: http://coverage.selectdb-in.cc/coverage/705354f1b6486cfd3a273d7ceb1886212af124f4_705354f1b6486cfd3a273d7ceb1886212af124f4/report/index.html

morningman
morningman previously approved these changes Nov 15, 2024
Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Nov 15, 2024
Copy link
Contributor

PR approved by at least one committer and no changes requested.

Copy link
Contributor

PR approved by anyone and no changes requested.

final_file = final_file.substr(5);
}
RETURN_IF_ERROR_RESULT(io::global_local_filesystem()->open_file(final_file, &file_reader,
&reader_options));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file:: prefix is duplicated with type? Maybe we should parse when get type and check if they are same.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Nov 21, 2024
@apache apache deleted a comment from github-actions bot Nov 21, 2024
}

// Parse URI authority, if any
if (path_str.compare(start, 2, "//") == 0 && path_str.length() - start > 2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add example here

@wuwenchi
Copy link
Contributor Author

run buildall

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Nov 21, 2024
@doris-robot
Copy link

TPC-H: Total hot run time: 40240 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 36be1a4a5812d4a491212d80ef831d13f2b98942, data reload: false

------ Round 1 ----------------------------------
q1	17611	7768	7395	7395
q2	2043	187	179	179
q3	10852	1147	1156	1147
q4	10370	740	722	722
q5	7657	2780	2715	2715
q6	246	150	146	146
q7	997	621	610	610
q8	9268	1874	1964	1874
q9	6619	6434	6503	6434
q10	7068	2300	2301	2300
q11	470	262	278	262
q12	432	222	219	219
q13	17767	3080	3097	3080
q14	237	231	220	220
q15	583	538	538	538
q16	649	601	584	584
q17	995	525	572	525
q18	7379	6845	6661	6661
q19	1345	1031	1004	1004
q20	484	179	184	179
q21	4058	3122	3197	3122
q22	377	324	328	324
Total cold run time: 107507 ms
Total hot run time: 40240 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7388	7299	7349	7299
q2	329	227	232	227
q3	2920	2864	2928	2864
q4	2170	1842	1994	1842
q5	5672	5723	5690	5690
q6	222	141	146	141
q7	2277	1844	1834	1834
q8	3429	3595	3566	3566
q9	8991	9002	8982	8982
q10	3602	3574	3531	3531
q11	613	501	508	501
q12	834	647	636	636
q13	11285	3332	3295	3295
q14	318	271	277	271
q15	577	534	526	526
q16	684	634	643	634
q17	1897	1652	1597	1597
q18	8454	7906	7683	7683
q19	1742	1618	1588	1588
q20	2130	1902	1882	1882
q21	5613	5590	5506	5506
q22	626	564	559	559
Total cold run time: 71773 ms
Total hot run time: 60654 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 197270 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 36be1a4a5812d4a491212d80ef831d13f2b98942, data reload: false

query1	1235	942	944	942
query2	6234	2106	2063	2063
query3	10828	3802	3931	3802
query4	67449	26687	23631	23631
query5	4972	461	452	452
query6	414	185	184	184
query7	5512	319	297	297
query8	324	233	223	223
query9	8773	2696	2685	2685
query10	437	271	263	263
query11	17353	15379	15960	15379
query12	158	109	108	108
query13	1510	441	466	441
query14	10842	7114	7601	7114
query15	222	191	179	179
query16	7123	460	483	460
query17	1161	573	555	555
query18	1773	339	295	295
query19	194	146	145	145
query20	117	110	110	110
query21	212	105	101	101
query22	4722	4549	4484	4484
query23	34788	34386	34392	34386
query24	5405	2527	2625	2527
query25	519	386	389	386
query26	652	153	154	153
query27	1828	288	303	288
query28	4263	2468	2487	2468
query29	692	433	420	420
query30	215	162	150	150
query31	1005	842	869	842
query32	67	55	53	53
query33	415	284	276	276
query34	962	519	526	519
query35	839	747	804	747
query36	1101	953	952	952
query37	123	71	80	71
query38	4495	4557	4405	4405
query39	1515	1461	1506	1461
query40	199	101	102	101
query41	45	48	45	45
query42	116	98	100	98
query43	547	524	515	515
query44	1214	850	859	850
query45	194	167	172	167
query46	1165	709	704	704
query47	2037	1932	1959	1932
query48	429	337	319	319
query49	746	412	401	401
query50	832	403	418	403
query51	7462	7389	7151	7151
query52	104	86	89	86
query53	247	180	188	180
query54	528	393	418	393
query55	81	75	74	74
query56	257	270	242	242
query57	1287	1163	1174	1163
query58	222	210	234	210
query59	3293	3176	2991	2991
query60	260	252	256	252
query61	115	109	113	109
query62	788	646	686	646
query63	215	202	190	190
query64	1345	644	630	630
query65	3289	3221	3265	3221
query66	700	297	302	297
query67	16206	15851	15738	15738
query68	3936	580	566	566
query69	424	255	256	255
query70	1230	1165	1041	1041
query71	365	258	259	258
query72	6446	4142	4002	4002
query73	779	368	364	364
query74	10267	9035	9046	9035
query75	3389	2690	2667	2667
query76	1838	1135	1078	1078
query77	494	340	274	274
query78	10571	9538	9421	9421
query79	2163	601	595	595
query80	1358	426	426	426
query81	535	239	237	237
query82	1324	125	118	118
query83	192	148	143	143
query84	283	75	75	75
query85	992	354	300	300
query86	417	306	302	302
query87	4830	4772	4613	4613
query88	3846	2256	2219	2219
query89	417	295	297	295
query90	2009	190	188	188
query91	150	103	101	101
query92	61	50	52	50
query93	2652	538	557	538
query94	814	292	298	292
query95	367	251	247	247
query96	635	284	278	278
query97	2887	2687	2703	2687
query98	213	199	192	192
query99	1933	1299	1288	1288
Total cold run time: 322189 ms
Total hot run time: 197270 ms

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 38.03% (9902/26036)
Line Coverage: 29.23% (82888/283563)
Region Coverage: 28.37% (42588/150137)
Branch Coverage: 24.91% (21580/86616)
Coverage Report: http://coverage.selectdb-in.cc/coverage/36be1a4a5812d4a491212d80ef831d13f2b98942_36be1a4a5812d4a491212d80ef831d13f2b98942/report/index.html

@doris-robot
Copy link

ClickBench: Total hot run time: 32.87 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 36be1a4a5812d4a491212d80ef831d13f2b98942, data reload: false

query1	0.03	0.03	0.03
query2	0.07	0.04	0.03
query3	0.24	0.08	0.07
query4	1.64	0.10	0.10
query5	0.42	0.42	0.43
query6	1.16	0.67	0.65
query7	0.01	0.01	0.02
query8	0.04	0.04	0.03
query9	0.58	0.52	0.52
query10	0.56	0.55	0.56
query11	0.14	0.10	0.11
query12	0.15	0.11	0.12
query13	0.60	0.61	0.60
query14	2.72	2.78	2.83
query15	0.90	0.82	0.81
query16	0.38	0.38	0.38
query17	1.02	0.99	1.05
query18	0.24	0.21	0.21
query19	1.92	1.87	1.98
query20	0.01	0.01	0.01
query21	15.37	0.62	0.56
query22	2.25	1.96	2.78
query23	17.18	0.82	0.84
query24	3.19	1.12	1.47
query25	0.23	0.08	0.11
query26	0.55	0.14	0.14
query27	0.04	0.05	0.04
query28	9.98	1.09	1.06
query29	12.59	3.22	3.19
query30	0.25	0.07	0.07
query31	2.86	0.39	0.37
query32	3.30	0.45	0.48
query33	3.03	3.02	3.05
query34	16.97	4.43	4.46
query35	4.54	4.49	4.48
query36	0.67	0.49	0.49
query37	0.09	0.06	0.06
query38	0.05	0.04	0.04
query39	0.03	0.03	0.02
query40	0.15	0.13	0.13
query41	0.09	0.03	0.02
query42	0.04	0.02	0.02
query43	0.03	0.03	0.03
Total cold run time: 106.31 s
Total hot run time: 32.87 s

@wuwenchi
Copy link
Contributor Author

run feut

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Nov 22, 2024
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@morningman morningman merged commit 82f33c2 into apache:master Nov 22, 2024
27 of 29 checks passed
wuwenchi added a commit to wuwenchi/doris_new that referenced this pull request Dec 8, 2024
…nity catalog (apache#43525)

### What problem does this PR solve?

1. We now support using the `rest` type catalog to read tables in the
unity catalog (https://github.com/unitycatalog/unitycatalog).
2. When reading the parquet file on the be side, we find the
corresponding column name based on the column id, which naturally
supports the column rename function.

example:
```
CREATE CATALOG `uc3`
PROPERTIES (
"type"  =  "iceberg",
"iceberg.catalog.type"  =  "rest",
"uri"  =  "http://127.0.0.1:8080/api/2.1/unity-catalog/iceberg",
"external_catalog.name" = "unity"  --- catalog name in unity catalog
);
```
morningman pushed a commit that referenced this pull request Dec 9, 2024
morningman pushed a commit that referenced this pull request Dec 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by one committer. dev/2.1.8-merged dev/3.0.4-merged reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants