
[#5472] improvement(docs): Add example to use cloud storage fileset and polish hadoop-catalog document. #6059

Open

wants to merge 29 commits into main

Conversation

yuqi1129
Contributor

@yuqi1129 yuqi1129 commented Jan 2, 2025

What changes were proposed in this pull request?

  1. Add a full example of how to use cloud storage filesets such as S3, GCS, OSS, and ADLS.
  2. Polish how-to-use-gvfs.md and hadoop-catalog.md.
  3. Document how filesets use credentials.

Why are the changes needed?

For better user experience.

Fix: #5472

Does this PR introduce any user-facing change?

N/A.

How was this patch tested?

N/A

@yuqi1129 yuqi1129 self-assigned this Jan 2, 2025
@yuqi1129 yuqi1129 requested a review from jerryshao January 2, 2025 11:42
@jerqi jerqi changed the title [#5472] improvement(docs): Add example to use cloud stroage fileset and polish hadoop-catalog document. [#5472] improvement(docs): Add example to use cloud storage fileset and polish hadoop-catalog document. Jan 3, 2025
@yuqi1129 yuqi1129 requested review from tengqm and FANNG1 and removed request for tengqm January 6, 2025 03:02
## Prerequisites

In order to create a Hadoop catalog with S3, you need to place [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) in Gravitino Hadoop classpath located
at `${HADOOP_HOME}/share/hadoop/common/lib/`. After that, start Gravitino server with the following command:
Contributor

Regarding "in Gravitino Hadoop classpath located at ${HADOOP_HOME}/share/hadoop/common/lib/": should this refer to the Hadoop catalog classpath rather than the Hadoop classpath?

Contributor Author

done

Contributor

The user should place the jar in ${GRAVITINO_HOME}/catalogs/hadoop/libs, not ${HADOOP_HOME}/share/hadoop/common/lib/, yes?

Contributor Author

fixed
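
To make the corrected placement concrete, here is a minimal sketch (paths and the `${gravitino-version}` placeholder are illustrative, not from the PR) of copying the bundle jar into the Hadoop catalog classpath and then starting the server, using only the Python standard library:

```python
import os
import shutil
import subprocess

# Placeholder paths -- adjust to your environment.
gravitino_home = os.environ.get("GRAVITINO_HOME", "/opt/gravitino")
bundle_jar = "/path/to/gravitino-aws-bundle-${gravitino-version}.jar"

# Per the review above, the jar belongs in the Hadoop *catalog* classpath,
# not in ${HADOOP_HOME}/share/hadoop/common/lib/.
catalog_libs = os.path.join(gravitino_home, "catalogs", "hadoop", "libs")
shutil.copy(bundle_jar, catalog_libs)

# Then start the Gravitino server.
subprocess.run(
    [os.path.join(gravitino_home, "bin", "gravitino-server.sh"), "start"],
    check=True,
)
```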

schema_name = "your_s3_schema"
fileset_name = "your_s3_fileset"

os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-${gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}-SNAPSHOT.jar,/path/to/hadoop-aws-3.2.0.jar,/path/to/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell"
Contributor

add more context about where to find hadoop-aws-3.2.0.jar, etc

Contributor Author

I can add a link here. Considering the other jars, such as aws-java-sdk-bundle-1.11.375.jar and gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar, I believe we also need to add hyperlinks for those.

Contributor Author

added
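
For readers following along, a sketch of the same `PYSPARK_SUBMIT_ARGS` setup with the Maven coordinates of each jar noted in comments; the local paths are placeholders and the coordinates reflect the jar names in the snippet, so they are worth double-checking against Maven Central:

```python
import os

# All jars can be downloaded from Maven Central (paths below are placeholders):
#   org.apache.gravitino:gravitino-aws                         -- condensed Gravitino AWS jar
#   org.apache.gravitino:gravitino-filesystem-hadoop3-runtime  -- GVFS Hadoop 3 runtime
#   org.apache.hadoop:hadoop-aws:3.2.0                         -- S3A filesystem connector
#   com.amazonaws:aws-java-sdk-bundle:1.11.375                 -- AWS SDK matching hadoop-aws 3.2.0
jars = [
    "/path/to/gravitino-aws-${gravitino-version}.jar",
    "/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar",
    "/path/to/hadoop-aws-3.2.0.jar",
    "/path/to/aws-java-sdk-bundle-1.11.375.jar",
]
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars " + ",".join(jars) + " --master local[1] pyspark-shell"
)
```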


```python
## Replace the following code snippet with the above code snippet with the same environment variables
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-${gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}-SNAPSHOT.jar,/path/to/hadoop-aws-3.2.0.jar,/path/to/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell"
```
Contributor

this should be gravitino-aws-bundle jar?

Contributor Author

Yes, this is a mistake.

Contributor

not fixed yet?

- [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) is the Gravitino AWS jar with Hadoop environment and `hadoop-aws` jar.
- [`gravitino-aws-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws) is a condensed version of the Gravitino AWS bundle jar without Hadoop environment and `hadoop-aws` jar.

Please choose the correct jar according to your environment.
Contributor

Add more information about how to choose the cloud jar.

Contributor Author

I have said "If your Spark without Hadoop environment, you can use the following code snippet to access the fileset:"
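
To illustrate the choice described above, a hedged sketch of the two `--jars` variants (all paths are placeholders): the bundle jar when the Spark distribution has no Hadoop/S3 classes of its own, and the condensed jar plus hadoop-aws and the matching AWS SDK when it does.

```python
import os

# Spark WITHOUT a Hadoop environment: the AWS *bundle* jar already ships the
# Hadoop environment and the hadoop-aws classes.
jars_without_hadoop_env = (
    "/path/to/gravitino-aws-bundle-${gravitino-version}.jar,"
    "/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar"
)

# Spark WITH a Hadoop environment: use the condensed gravitino-aws jar and
# supply hadoop-aws plus the matching AWS SDK yourself.
jars_with_hadoop_env = (
    "/path/to/gravitino-aws-${gravitino-version}.jar,"
    "/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar,"
    "/path/to/hadoop-aws-3.2.0.jar,"
    "/path/to/aws-java-sdk-bundle-1.11.375.jar"
)

# Pick whichever matches your Spark distribution.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    f"--jars {jars_without_hadoop_env} --master local[1] pyspark-shell"
)
```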

```
$ bin/gravitino-server.sh start
```

## Create a Hadoop Catalog with S3
Contributor

I prefer to add credential vending to the examples by default.

Contributor Author

Have you added any detailed examples of credential vending to hadoop-catalog.md, so I can add a link there?

Contributor

I haven't provided a detailed example yet; you could refer to the credential-vending document for the detailed properties.
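
For reference, a rough sketch of creating a Hadoop catalog backed by S3 with credential vending enabled. The REST path and the property keys ("filesystem-providers", the "s3-*" keys, "credential-providers") follow my reading of the Gravitino S3 and credential-vending documents and should be verified against them; all values are placeholders.

```python
import requests

catalog = {
    "name": "s3_catalog",
    "type": "FILESET",
    "provider": "hadoop",
    "comment": "Fileset catalog on S3 with credential vending",
    "properties": {
        "location": "s3a://your-bucket/path",
        "filesystem-providers": "s3",
        "s3-endpoint": "https://s3.ap-northeast-1.amazonaws.com",
        "s3-access-key-id": "your-access-key",
        "s3-secret-access-key": "your-secret-key",
        # Credential vending: hand out temporary STS tokens instead of static keys.
        "credential-providers": "s3-token",
        "s3-role-arn": "arn:aws:iam::123456789012:role/your-role",
    },
}

# Assumed REST layout: POST /api/metalakes/{metalake}/catalogs
resp = requests.post(
    "http://localhost:8090/api/metalakes/your_metalake/catalogs",
    json=catalog,
)
resp.raise_for_status()
print(resp.json())
```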

@@ -42,7 +42,10 @@ the path mapping and convert automatically.

### Prerequisites

+ A Hadoop environment with HDFS or other Hadoop Compatible File System (HCFS) implementations like S3, GCS, etc. GVFS has been tested against Hadoop 3.3.1. It is recommended to use Hadoop 3.3.1 or later, but it should work with Hadoop 2.x. Please create an [issue](https://www.github.com/apache/gravitino/issues) if you find any compatibility issues.
+ A Hadoop environment with HDFS running. GVFS has been tested against
Contributor

A Hadoop environment with HDFS running.

Is this a GVFS prerequisite?

Contributor Author

(screenshot attached) This line was added by @justinmclean. Indeed, the description is not correct.

Contributor Author

I have removed this sentence.

Contributor

It appears that the update for using bundle jars in the Spark example is missing

Contributor Author

The Spark example in this document does not cover cloud storage. I'm considering removing the cloud storage examples from the GVFS document, but it seems to be a big job; I have only done part of it.

+ A Hadoop environment with HDFS or other Hadoop Compatible File System (HCFS) implementations like S3, GCS, etc. GVFS has been tested against Hadoop 3.3.1. It is recommended to use Hadoop 3.3.1 or later, but it should work with Hadoop 2.x. Please create an [issue](https://www.github.com/apache/gravitino/issues) if you find any compatibility issues.
- GVFS has been tested against Hadoop 3.3.1. It is recommended to use Hadoop 3.3.1 or later, but it should work with Hadoop 2.x. Please create an [issue](https://www.github.com/apache/gravitino/issues) if you find any compatibility issues.

### Configuration
Contributor

The configurations listed below appear to be intended for Java GVFS exclusively.

Should the Python GVFS require a distinct form?

Contributor Author

The Python GVFS configuration is in the chapter just below; the overall structure was designed by jiebao.

Contributor

You may need to consider this comprehensively.
There may be something in the Python GVFS chapter that also needs to be updated

Contributor Author

I have removed the cloud storage details from how-to-use-gvfs.md and put them into their own documents.
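
Once the per-cloud documents are in place, the Python GVFS side could look roughly like the sketch below. The constructor mirrors the existing how-to-use-gvfs.md Python examples; the cloud option keys, names, and paths are placeholders to be checked against the new S3/GCS/OSS/ADLS documents rather than authoritative values.

```python
from gravitino import gvfs

# Option keys are placeholders -- take the exact names from the per-cloud documents
# (hadoop-catalog-with-s3.md, -gcs.md, -oss.md, -adls.md).
fs = gvfs.GravitinoVirtualFileSystem(
    server_uri="http://localhost:8090",
    metalake_name="your_metalake",
    options={
        "s3_access_key_id": "your-access-key",
        "s3_secret_access_key": "your-secret-key",
        "s3_endpoint": "https://s3.ap-northeast-1.amazonaws.com",
    },
)

# Virtual paths follow fileset/{catalog}/{schema}/{fileset}/...
print(fs.ls("fileset/s3_catalog/your_s3_schema/your_s3_fileset/"))
```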

@jerryshao jerryshao added the branch-0.8 Automatically cherry-pick commit to branch-0.8 label Jan 10, 2025
Labels
branch-0.8 Automatically cherry-pick commit to branch-0.8
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Improvement] Improve document about how to use S3, OSS, GCS and ADLS(ABS) filesets
5 participants