
[#5472] improvement(docs): Add example to use cloud storage fileset and polish hadoop-catalog document. #6059

Open

wants to merge 29 commits into main

Conversation

yuqi1129
Contributor

@yuqi1129 yuqi1129 commented Jan 2, 2025

What changes were proposed in this pull request?

  1. Add a full example of how to use cloud storage filesets such as S3, GCS, OSS, and ADLS.
  2. Polish how-to-use-gvfs.md and hadoop-catalog.md.
  3. Document how filesets use credentials.

Why are the changes needed?

For better user experience.

Fix: #5472

Does this PR introduce any user-facing change?

N/A.

How was this patch tested?

N/A

@yuqi1129 yuqi1129 self-assigned this Jan 2, 2025
@yuqi1129 yuqi1129 requested a review from jerryshao January 2, 2025 11:42
@jerqi jerqi changed the title [#5472] improvement(docs): Add example to use cloud stroage fileset and polish hadoop-catalog document. [#5472] improvement(docs): Add example to use cloud storage fileset and polish hadoop-catalog document. Jan 3, 2025
@yuqi1129 yuqi1129 requested review from tengqm and FANNG1 and removed request for tengqm January 6, 2025 03:02
## Prerequisites

In order to create a Hadoop catalog with S3, you need to place [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) in Gravitino Hadoop classpath located
at `${HADOOP_HOME}/share/hadoop/common/lib/`. After that, start Gravitino server with the following command:
Contributor

Regarding "in Gravitino Hadoop classpath located at ${HADOOP_HOME}/share/hadoop/common/lib/": should this refer to the Hadoop catalog classpath rather than the Hadoop classpath?

Contributor Author

done

Contributor

The user should place the jar in ${GRAVITINO_HOME}/catalogs/hadoop/libs, not ${HADOOP_HOME}/share/hadoop/common/lib/, yes?

Contributor Author

fixed
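
To make the corrected placement concrete, here is a minimal sketch (paths and the `${gravitino-version}` placeholder are illustrative, not from the PR) of copying the bundle jar into the Hadoop catalog classpath and then starting the server, using only the Python standard library:

```python
import os
import shutil
import subprocess

# Placeholder paths -- adjust to your environment.
gravitino_home = os.environ.get("GRAVITINO_HOME", "/opt/gravitino")
bundle_jar = "/path/to/gravitino-aws-bundle-${gravitino-version}.jar"

# Per the review above, the jar belongs in the Hadoop *catalog* classpath,
# not in ${HADOOP_HOME}/share/hadoop/common/lib/.
catalog_libs = os.path.join(gravitino_home, "catalogs", "hadoop", "libs")
shutil.copy(bundle_jar, catalog_libs)

# Then start the Gravitino server.
subprocess.run(
    [os.path.join(gravitino_home, "bin", "gravitino-server.sh"), "start"],
    check=True,
)
```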

schema_name = "your_s3_schema"
fileset_name = "your_s3_fileset"

os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-${gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}-SNAPSHOT.jar,/path/to/hadoop-aws-3.2.0.jar,/path/to/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell"
Contributor

add more context about where to find hadoop-aws-3.2.0.jar, etc

Contributor Author

I can add a link here. Considering the other jars, such as aws-java-sdk-bundle-1.11.375.jar and gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar, I believe we also need to add hyperlinks for those.

Contributor Author

added
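
For readers following along, a sketch of the same `PYSPARK_SUBMIT_ARGS` setup with the Maven coordinates of each jar noted in comments; the local paths are placeholders and the coordinates reflect the jar names in the snippet, so they are worth double-checking against Maven Central:

```python
import os

# All jars can be downloaded from Maven Central (paths below are placeholders):
#   org.apache.gravitino:gravitino-aws                         -- condensed Gravitino AWS jar
#   org.apache.gravitino:gravitino-filesystem-hadoop3-runtime  -- GVFS Hadoop 3 runtime
#   org.apache.hadoop:hadoop-aws:3.2.0                         -- S3A filesystem connector
#   com.amazonaws:aws-java-sdk-bundle:1.11.375                 -- AWS SDK matching hadoop-aws 3.2.0
jars = [
    "/path/to/gravitino-aws-${gravitino-version}.jar",
    "/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar",
    "/path/to/hadoop-aws-3.2.0.jar",
    "/path/to/aws-java-sdk-bundle-1.11.375.jar",
]
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars " + ",".join(jars) + " --master local[1] pyspark-shell"
)
```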


```python
## Replace the following code snippet with the above code snippet with the same environment variables
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-${gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}-SNAPSHOT.jar,/path/to/hadoop-aws-3.2.0.jar,/path/to/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell"
```
Contributor

this should be gravitino-aws-bundle jar?

Contributor Author

Yes, this is a mistake.

Contributor

not fixed yet?

- [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) is the Gravitino AWS jar with Hadoop environment and `hadoop-aws` jar.
- [`gravitino-aws-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws) is a condensed version of the Gravitino AWS bundle jar without Hadoop environment and `hadoop-aws` jar.

Please choose the correct jar according to your environment.
Contributor

Add more information about how to choose the cloud jar.

Contributor Author

I have said "If your Spark without Hadoop environment, you can use the following code snippet to access the fileset:"
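
To illustrate the choice described above, a hedged sketch of the two `--jars` variants (all paths are placeholders): the bundle jar when the Spark distribution has no Hadoop/S3 classes of its own, and the condensed jar plus hadoop-aws and the matching AWS SDK when it does.

```python
import os

# Spark WITHOUT a Hadoop environment: the AWS *bundle* jar already ships the
# Hadoop environment and the hadoop-aws classes.
jars_without_hadoop_env = (
    "/path/to/gravitino-aws-bundle-${gravitino-version}.jar,"
    "/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar"
)

# Spark WITH a Hadoop environment: use the condensed gravitino-aws jar and
# supply hadoop-aws plus the matching AWS SDK yourself.
jars_with_hadoop_env = (
    "/path/to/gravitino-aws-${gravitino-version}.jar,"
    "/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar,"
    "/path/to/hadoop-aws-3.2.0.jar,"
    "/path/to/aws-java-sdk-bundle-1.11.375.jar"
)

# Pick whichever matches your Spark distribution.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    f"--jars {jars_without_hadoop_env} --master local[1] pyspark-shell"
)
```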

```
$ bin/gravitino-server.sh start
```

## Create a Hadoop Catalog with S3
Contributor

I prefer to add credential vending to the examples by default.

Contributor Author

Have you added any detailed examples of credential vending to hadoop-catalog.md, so I can add a link there?

Contributor

I haven't provided a detailed example yet; you could refer to the credential-vending document for the detailed properties.
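
For reference, a rough sketch of creating a Hadoop catalog backed by S3 with credential vending enabled. The REST path and the property keys ("filesystem-providers", the "s3-*" keys, "credential-providers") follow my reading of the Gravitino S3 and credential-vending documents and should be verified against them; all values are placeholders.

```python
import requests

catalog = {
    "name": "s3_catalog",
    "type": "FILESET",
    "provider": "hadoop",
    "comment": "Fileset catalog on S3 with credential vending",
    "properties": {
        "location": "s3a://your-bucket/path",
        "filesystem-providers": "s3",
        "s3-endpoint": "https://s3.ap-northeast-1.amazonaws.com",
        "s3-access-key-id": "your-access-key",
        "s3-secret-access-key": "your-secret-key",
        # Credential vending: hand out temporary STS tokens instead of static keys.
        "credential-providers": "s3-token",
        "s3-role-arn": "arn:aws:iam::123456789012:role/your-role",
    },
}

# Assumed REST layout: POST /api/metalakes/{metalake}/catalogs
resp = requests.post(
    "http://localhost:8090/api/metalakes/your_metalake/catalogs",
    json=catalog,
)
resp.raise_for_status()
print(resp.json())
```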

@@ -42,7 +42,10 @@ the path mapping and convert automatically.

### Prerequisites

+ A Hadoop environment with HDFS or other Hadoop Compatible File System (HCFS) implementations like S3, GCS, etc. GVFS has been tested against Hadoop 3.3.1. It is recommended to use Hadoop 3.3.1 or later, but it should work with Hadoop 2.x. Please create an [issue](https://www.github.com/apache/gravitino/issues) if you find any compatibility issues.
+ A Hadoop environment with HDFS running. GVFS has been tested against
Contributor

A Hadoop environment with HDFS running.

Is this a GVFS prerequisite?

Contributor Author

(screenshot attached) This line was added by @justinmclean. Indeed, the description is not correct.

Contributor Author

I have removed this sentence.

Contributor

It appears that the update for using bundle jars in the Spark example is missing

Contributor Author

The Spark example in this document does not cover cloud storage. I'm considering removing the cloud storage examples from the GVFS document, but it seems to be a big job; I have only done part of it.

+ A Hadoop environment with HDFS or other Hadoop Compatible File System (HCFS) implementations like S3, GCS, etc. GVFS has been tested against Hadoop 3.3.1. It is recommended to use Hadoop 3.3.1 or later, but it should work with Hadoop 2.x. Please create an [issue](https://www.github.com/apache/gravitino/issues) if you find any compatibility issues.
- GVFS has been tested against Hadoop 3.3.1. It is recommended to use Hadoop 3.3.1 or later, but it should work with Hadoop 2.x. Please create an [issue](https://www.github.com/apache/gravitino/issues) if you find any compatibility issues.

### Configuration
Contributor

The configurations listed below appear to be intended for Java GVFS exclusively.

Should the Python GVFS require a distinct form?

Contributor Author

The Python GVFS configuration is in the chapter just below; the overall structure was designed by jiebao.

Contributor

You may need to consider this comprehensively.
There may be something in the Python GVFS chapter that also needs to be updated

Contributor Author

I have removed the cloud storage details from how-to-use-gvfs.md and put them into their own documents.
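
Once the per-cloud documents are in place, the Python GVFS side could look roughly like the sketch below. The constructor mirrors the existing how-to-use-gvfs.md Python examples; the cloud option keys, names, and paths are placeholders to be checked against the new S3/GCS/OSS/ADLS documents rather than authoritative values.

```python
from gravitino import gvfs

# Option keys are placeholders -- take the exact names from the per-cloud documents
# (hadoop-catalog-with-s3.md, -gcs.md, -oss.md, -adls.md).
fs = gvfs.GravitinoVirtualFileSystem(
    server_uri="http://localhost:8090",
    metalake_name="your_metalake",
    options={
        "s3_access_key_id": "your-access-key",
        "s3_secret_access_key": "your-secret-key",
        "s3_endpoint": "https://s3.ap-northeast-1.amazonaws.com",
    },
)

# Virtual paths follow fileset/{catalog}/{schema}/{fileset}/...
print(fs.ls("fileset/s3_catalog/your_s3_schema/your_s3_fileset/"))
```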

@jerryshao jerryshao added the branch-0.8 Automatically cherry-pick commit to branch-0.8 label Jan 10, 2025
Labels
branch-0.8 Automatically cherry-pick commit to branch-0.8
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Improvement] Improve document about how to use S3, OSS, GCS and ADLS(ABS) filesets
5 participants