[#5472] improvement(docs): Add example to use cloud storage fileset and polish hadoop-catalog document. #6059
base: main
Conversation
docs/hadoop-catalog-with-s3.md
Outdated
## Prerequisites

In order to create a Hadoop catalog with S3, you need to place [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) in the Gravitino Hadoop classpath located at `${HADOOP_HOME}/share/hadoop/common/lib/`. After that, start the Gravitino server with the following command:
"in Gravitino Hadoop classpath located at `${HADOOP_HOME}/share/hadoop/common/lib/`" — should this use the Hadoop catalog classpath rather than the Hadoop classpath?
done
The user should place the jar in `{GRAVITINO_HOME}/catalogs/hadoop/libs`, not `${HADOOP_HOME}/share/hadoop/common/lib/`, yes?
fixed
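The fix above amounts to one filesystem step. The sketch below illustrates it with a temporary directory standing in for a real `${GRAVITINO_HOME}` and a placeholder version number, since both are environment-specific:

```shell
# Illustrative only: a temp dir stands in for a real ${GRAVITINO_HOME},
# and the jar version is a placeholder.
GRAVITINO_HOME="$(mktemp -d)"
GRAVITINO_VERSION="0.8.0"  # placeholder version

# The Hadoop *catalog* classpath, per the review discussion above.
mkdir -p "${GRAVITINO_HOME}/catalogs/hadoop/libs"

# In a real setup this jar is downloaded from Maven Central
# (gravitino-aws-bundle); here an empty file stands in for it.
touch "${GRAVITINO_HOME}/gravitino-aws-bundle-${GRAVITINO_VERSION}.jar"
cp "${GRAVITINO_HOME}/gravitino-aws-bundle-${GRAVITINO_VERSION}.jar" \
   "${GRAVITINO_HOME}/catalogs/hadoop/libs/"

ls "${GRAVITINO_HOME}/catalogs/hadoop/libs"
```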
schema_name = "your_s3_schema"
fileset_name = "your_s3_fileset"

os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-${gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}-SNAPSHOT.jar,/path/to/hadoop-aws-3.2.0.jar,/path/to/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell"
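As a sketch of how that line could be assembled more readably, with the bundle jar (which, per this PR's description, already packages the Hadoop environment and `hadoop-aws`) replacing the separate AWS jars; all paths and version numbers are placeholders:

```python
import os

# Paths and versions are placeholders; adjust to your environment.
# With the gravitino-aws-bundle jar, the separate hadoop-aws and
# aws-java-sdk-bundle jars are not needed; they are required only
# when the non-bundle gravitino-aws jar is used instead.
jars = [
    "/path/to/gravitino-aws-bundle-0.8.0.jar",
    "/path/to/gravitino-filesystem-hadoop3-runtime-0.8.0.jar",
]
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars " + ",".join(jars) + " --master local[1] pyspark-shell"
)
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```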
Add more context about where to find `hadoop-aws-3.2.0.jar`, etc.
I can add a link here. Considering other jars like `aws-java-sdk-bundle-1.11.375.jar` and `gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar`, I believe we also need to add hyperlinks for those.
added
docs/hadoop-catalog-with-s3.md
Outdated
```python
## Replace the following code snippet with the above code snippet with the same environment variables
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-${gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}-SNAPSHOT.jar,/path/to/hadoop-aws-3.2.0.jar,/path/to/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell"
```
This should be the gravitino-aws-bundle jar?
Yes, this is a mistake.
not fixed yet?
- [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) is the Gravitino AWS jar with the Hadoop environment and `hadoop-aws` jar.
- [`gravitino-aws-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws) is a condensed version of the Gravitino AWS bundle jar without the Hadoop environment and `hadoop-aws` jar.

Please choose the correct jar according to your environment.
Add more information on how to choose the cloud jar.
I have said "If your Spark is without a Hadoop environment, you can use the following code snippet to access the fileset:"
docs/hadoop-catalog-with-s3.md
Outdated
```shell
$ bin/gravitino-server.sh start
```

## Create a Hadoop Catalog with S3
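As context for this section, a catalog-creation payload might be sketched as below. The property names (`filesystem-providers`, `s3-endpoint`, `s3-access-key-id`, `s3-secret-access-key`) are assumptions drawn from the Hadoop catalog documentation, and the catalog name and bucket are hypothetical; verify everything against your Gravitino version:

```python
import json

# Sketch only: property names are assumptions based on the Hadoop
# catalog docs; the name, bucket, and endpoint are placeholders.
catalog = {
    "name": "s3_catalog",        # hypothetical catalog name
    "type": "FILESET",
    "provider": "hadoop",
    "comment": "Fileset catalog backed by S3",
    "properties": {
        "location": "s3a://example-bucket/path",  # placeholder bucket
        "filesystem-providers": "s3",
        "s3-endpoint": "https://s3.amazonaws.com",
        "s3-access-key-id": "<access-key>",
        "s3-secret-access-key": "<secret-key>",
    },
}
payload = json.dumps(catalog, indent=2)
print(payload)
```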
I prefer to add credential vending to the examples by default.
Have you added any detailed examples of credential vending in hadoop-catalog.md, so I can link there?
I didn't provide a detailed example yet; you could refer to the credential-vending document for the detailed properties.
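For reference, the credential-vending properties mentioned here might be sketched as follows. Every property name (`credential-providers`, `s3-role-arn`) is an assumption based on the credential-vending document and must be double-checked against your Gravitino version; the role ARN is a placeholder:

```python
# Sketch only: property names are assumptions from the
# credential-vending document; values are placeholders.
credential_props = {
    "credential-providers": "s3-token",  # assumed provider name
    "s3-role-arn": "arn:aws:iam::123456789012:role/example-role",
    "s3-access-key-id": "<access-key>",
    "s3-secret-access-key": "<secret-key>",
}
for key, value in credential_props.items():
    print(f"{key} = {value}")
```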
docs/how-to-use-gvfs.md
Outdated
@@ -42,7 +42,10 @@ the path mapping and convert automatically.

### Prerequisites

+ A Hadoop environment with HDFS or other Hadoop Compatible File System (HCFS) implementations like S3, GCS, etc. GVFS has been tested against Hadoop 3.3.1. It is recommended to use Hadoop 3.3.1 or later, but it should work with Hadoop 2.x. Please create an [issue](https://www.github.com/apache/gravitino/issues) if you find any compatibility issues.
+ A Hadoop environment with HDFS running. GVFS has been tested against
A Hadoop environment with HDFS running.
Is this a GVFS prerequisite?
I have removed this sentence.
It appears that the update for using bundle jars in the Spark example is missing
The Spark example in this document does not contain usage for cloud storage, and I'm considering removing the cloud storage example from GVFS, but it seems to be a big job. I have only done part of it.
+ A Hadoop environment with HDFS or other Hadoop Compatible File System (HCFS) implementations like S3, GCS, etc. GVFS has been tested against Hadoop 3.3.1. It is recommended to use Hadoop 3.3.1 or later, but it should work with Hadoop 2.x. Please create an [issue](https://www.github.com/apache/gravitino/issues) if you find any compatibility issues.
- GVFS has been tested against Hadoop 3.3.1. It is recommended to use Hadoop 3.3.1 or later, but it should work with Hadoop 2.x. Please create an [issue](https://www.github.com/apache/gravitino/issues) if you find any compatibility issues.

### Configuration
The configurations listed below appear to be intended for Java GVFS exclusively.
Should the Python GVFS require a distinct form?
Configuration for Python GVFS is in the chapter just below; the whole structure was designed by jiebao.
You may need to consider this comprehensively; there may be something in the Python GVFS chapter that also needs to be updated.
I have removed cloud storage details from how-to-use-gvfs.md and put them into their own documents.
What changes were proposed in this pull request?
Why are the changes needed?
For better user experience.
Fix: #5472
Does this PR introduce any user-facing change?
N/A.
How was this patch tested?
N/A