Optimize structure.
yuqi1129 committed Jan 10, 2025
1 parent 4d644f1 commit cfb054c
Showing 4 changed files with 271 additions and 271 deletions.
138 changes: 69 additions & 69 deletions docs/hadoop-catalog-with-adls.md
@@ -231,75 +231,6 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema

## Accessing a fileset with ADLS

### Using Spark to access the fileset

The following code snippet shows how to use **PySpark 3.1.3 with a Hadoop environment (Hadoop 3.2.0)** to access the fileset:

Before running the following code, you need to install the required packages:

```bash
pip install pyspark==3.1.3
pip install gravitino==0.8.0-incubating
```
Then you can run the following code:

```python
import os

from pyspark.sql import SparkSession

gravitino_url = "http://localhost:8090"
metalake_name = "test"

catalog_name = "your_adls_catalog"
schema_name = "your_adls_schema"
fileset_name = "your_adls_fileset"

os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/hadoop-azure-3.2.0.jar,/path/to/azure-storage-7.0.0.jar,/path/to/wildfly-openssl-1.0.4.Final.jar --master local[1] pyspark-shell"
spark = SparkSession.builder \
    .appName("adls_fileset_test") \
    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
    .config("spark.hadoop.fs.gravitino.server.uri", gravitino_url) \
    .config("spark.hadoop.fs.gravitino.client.metalake", metalake_name) \
    .config("spark.hadoop.azure-storage-account-name", "azure_account_name") \
    .config("spark.hadoop.azure-storage-account-key", "azure_account_key") \
    .config("spark.hadoop.fs.azure.skipUserGroupMetadataDuringInitialization", "true") \
    .config("spark.driver.memory", "2g") \
    .config("spark.driver.port", "2048") \
    .getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)]
columns = ["Name", "Age"]
spark_df = spark.createDataFrame(data, schema=columns)
gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people"

spark_df.coalesce(1).write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv(gvfs_path)
```

If your Spark runs **without a Hadoop environment**, you can use the following code snippet to access the fileset:

```python
# Replace the `PYSPARK_SUBMIT_ARGS` line in the snippet above with the following; the rest of the code is unchanged.

os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell"
```

- [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) is the Gravitino ADLS bundle jar with the Hadoop environment (3.3.1) and the `hadoop-azure` jar.
- [`gravitino-azure-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure) is a condensed version of the Gravitino ADLS bundle jar without the Hadoop environment and the `hadoop-azure` jar.
- `hadoop-azure-3.2.0.jar` and `azure-storage-7.0.0.jar` can be found in the Hadoop distribution under the `${HADOOP_HOME}/share/hadoop/tools/lib` directory.

Choose the jar that matches your environment.

:::note
In some Spark versions, a Hadoop environment is necessary for the driver; adding the bundle jars with `--jars` may not work. In that case, add the jars to the Spark CLASSPATH directly.
:::

### Using the GVFS Java client to access the fileset

To access a fileset on Azure Blob Storage (ADLS) using the GVFS Java client, based on the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations:
@@ -373,6 +304,75 @@ Or use the bundle jar with Hadoop environment:
</dependency>
```

### Using Spark to access the fileset

The following code snippet shows how to use **PySpark 3.1.3 with a Hadoop environment (Hadoop 3.2.0)** to access the fileset:

Before running the following code, you need to install the required packages:

```bash
pip install pyspark==3.1.3
pip install gravitino==0.8.0-incubating
```
Then you can run the following code:

```python
import os

from pyspark.sql import SparkSession

gravitino_url = "http://localhost:8090"
metalake_name = "test"

catalog_name = "your_adls_catalog"
schema_name = "your_adls_schema"
fileset_name = "your_adls_fileset"

os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/hadoop-azure-3.2.0.jar,/path/to/azure-storage-7.0.0.jar,/path/to/wildfly-openssl-1.0.4.Final.jar --master local[1] pyspark-shell"
spark = SparkSession.builder \
    .appName("adls_fileset_test") \
    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
    .config("spark.hadoop.fs.gravitino.server.uri", gravitino_url) \
    .config("spark.hadoop.fs.gravitino.client.metalake", metalake_name) \
    .config("spark.hadoop.azure-storage-account-name", "azure_account_name") \
    .config("spark.hadoop.azure-storage-account-key", "azure_account_key") \
    .config("spark.hadoop.fs.azure.skipUserGroupMetadataDuringInitialization", "true") \
    .config("spark.driver.memory", "2g") \
    .config("spark.driver.port", "2048") \
    .getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)]
columns = ["Name", "Age"]
spark_df = spark.createDataFrame(data, schema=columns)
gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people"

spark_df.coalesce(1).write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv(gvfs_path)
```
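
To verify the write, you can read the CSV back through the same `gvfs://` path. A minimal sketch, reusing the `spark` session and `gvfs_path` defined in the snippet above:

```python
# Read the CSV back through GVFS to confirm the fileset is accessible
# (reuses `spark` and `gvfs_path` from the snippet above).
df = spark.read.option("header", "true").csv(gvfs_path)
df.show()
```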

If your Spark runs **without a Hadoop environment**, you can use the following code snippet to access the fileset:

```python
# Replace the `PYSPARK_SUBMIT_ARGS` line in the snippet above with the following; the rest of the code is unchanged.

os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell"
```

- [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) is the Gravitino ADLS bundle jar with the Hadoop environment (3.3.1) and the `hadoop-azure` jar.
- [`gravitino-azure-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure) is a condensed version of the Gravitino ADLS bundle jar without the Hadoop environment and the `hadoop-azure` jar.
- `hadoop-azure-3.2.0.jar` and `azure-storage-7.0.0.jar` can be found in the Hadoop distribution under the `${HADOOP_HOME}/share/hadoop/tools/lib` directory.

Choose the jar that matches your environment.

:::note
In some Spark versions, a Hadoop environment is necessary for the driver; adding the bundle jars with `--jars` may not work. In that case, add the jars to the Spark CLASSPATH directly (see the sketch below).
:::
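
If `--jars` does not work, one option is to set the classpath when building the session. A minimal sketch, assuming a local-mode session whose driver JVM has not started yet; the jar paths are placeholders:

```python
# A sketch with placeholder paths: put the bundle jars on the Spark
# classpath directly instead of passing them via --jars.
from pyspark.sql import SparkSession

jars = ":".join([
    "/path/to/gravitino-azure-bundle-{gravitino-version}.jar",
    "/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar",
])

spark = SparkSession.builder \
    .appName("adls_fileset_test") \
    .config("spark.driver.extraClassPath", jars) \
    .config("spark.executor.extraClassPath", jars) \
    .getOrCreate()
```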

### Accessing a fileset using the Hadoop fs command

The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3.
130 changes: 65 additions & 65 deletions docs/hadoop-catalog-with-gcs.md
@@ -227,71 +227,6 @@ catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema

## Accessing a fileset with GCS

### Using Spark to access the fileset

The following code snippet shows how to use **PySpark 3.1.3 with a Hadoop environment (Hadoop 3.2.0)** to access the fileset:

Before running the following code, you need to install the required packages:

```bash
pip install pyspark==3.1.3
pip install gravitino==0.8.0-incubating
```
Then you can run the following code:

```python
import os

from pyspark.sql import SparkSession

gravitino_url = "http://localhost:8090"
metalake_name = "test"

catalog_name = "your_gcs_catalog"
schema_name = "your_gcs_schema"
fileset_name = "your_gcs_fileset"

os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/gcs-connector-hadoop3-2.2.22-shaded.jar --master local[1] pyspark-shell"
spark = SparkSession.builder \
    .appName("gcs_fileset_test") \
    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
    .config("spark.hadoop.fs.gravitino.server.uri", gravitino_url) \
    .config("spark.hadoop.fs.gravitino.client.metalake", metalake_name) \
    .config("spark.hadoop.gcs-service-account-file", "/path/to/gcs-service-account-file.json") \
    .config("spark.driver.memory", "2g") \
    .config("spark.driver.port", "2048") \
    .getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)]
columns = ["Name", "Age"]
spark_df = spark.createDataFrame(data, schema=columns)
gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people"

spark_df.coalesce(1).write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv(gvfs_path)
```

If your Spark runs **without a Hadoop environment**, you can use the following code snippet to access the fileset:

```python
# Replace the `PYSPARK_SUBMIT_ARGS` line in the snippet above with the following; the rest of the code is unchanged.

os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell"
```

- [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) is the Gravitino GCP bundle jar with the Hadoop environment (3.3.1) and the `gcs-connector` jar.
- [`gravitino-gcp-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp) is a condensed version of the Gravitino GCP bundle jar without the Hadoop environment and the [`gcs-connector`](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/download/v2.2.22/gcs-connector-hadoop3-2.2.22-shaded.jar) jar.

Choose the jar that matches your environment.

:::note
In some Spark versions, a Hadoop environment is needed by the driver; adding the bundle jars with `--jars` may not work. In that case, add the jars to the Spark CLASSPATH directly.
:::

### Using the GVFS Java client to access the fileset

To access a fileset on GCS using the GVFS Java client, based on the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations:
@@ -360,6 +295,71 @@ Or use the bundle jar with Hadoop environment:
</dependency>
```

### Using Spark to access the fileset

The following code snippet shows how to use **PySpark 3.1.3 with a Hadoop environment (Hadoop 3.2.0)** to access the fileset:

Before running the following code, you need to install the required packages:

```bash
pip install pyspark==3.1.3
pip install gravitino==0.8.0-incubating
```
Then you can run the following code:

```python
import os

from pyspark.sql import SparkSession

gravitino_url = "http://localhost:8090"
metalake_name = "test"

catalog_name = "your_gcs_catalog"
schema_name = "your_gcs_schema"
fileset_name = "your_gcs_fileset"

os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/gcs-connector-hadoop3-2.2.22-shaded.jar --master local[1] pyspark-shell"
spark = SparkSession.builder \
    .appName("gcs_fileset_test") \
    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
    .config("spark.hadoop.fs.gravitino.server.uri", gravitino_url) \
    .config("spark.hadoop.fs.gravitino.client.metalake", metalake_name) \
    .config("spark.hadoop.gcs-service-account-file", "/path/to/gcs-service-account-file.json") \
    .config("spark.driver.memory", "2g") \
    .config("spark.driver.port", "2048") \
    .getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)]
columns = ["Name", "Age"]
spark_df = spark.createDataFrame(data, schema=columns)
gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people"

spark_df.coalesce(1).write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv(gvfs_path)
```
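
To verify the write, you can read the data back through the same `gvfs://` path. A minimal sketch, reusing `spark` and `gvfs_path` from the snippet above:

```python
# Read the CSV back through GVFS; inferSchema restores Age as an integer.
df = spark.read.option("header", "true").option("inferSchema", "true").csv(gvfs_path)
df.show()
```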

If your Spark runs **without a Hadoop environment**, you can use the following code snippet to access the fileset:

```python
# Replace the `PYSPARK_SUBMIT_ARGS` line in the snippet above with the following; the rest of the code is unchanged.

os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell"
```

- [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) is the Gravitino GCP bundle jar with the Hadoop environment (3.3.1) and the `gcs-connector` jar.
- [`gravitino-gcp-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp) is a condensed version of the Gravitino GCP bundle jar without the Hadoop environment and the [`gcs-connector`](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/download/v2.2.22/gcs-connector-hadoop3-2.2.22-shaded.jar) jar.

Choose the jar that matches your environment.

:::note
In some Spark versions, a Hadoop environment is needed by the driver; adding the bundle jars with `--jars` may not work. In that case, add the jars to the Spark CLASSPATH directly (see the sketch below).
:::
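
If `--jars` does not work, one option is to set the classpath when building the session. A minimal sketch, assuming a local-mode session whose driver JVM has not started yet; the jar paths are placeholders:

```python
# A sketch with placeholder paths: put the GCP bundle jars on the Spark
# classpath directly instead of passing them via --jars.
from pyspark.sql import SparkSession

jars = "/path/to/gravitino-gcp-bundle-{gravitino-version}.jar:/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar"

spark = SparkSession.builder \
    .appName("gcs_fileset_test") \
    .config("spark.driver.extraClassPath", jars) \
    .config("spark.executor.extraClassPath", jars) \
    .getOrCreate()
```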

### Accessing a fileset using the Hadoop fs command

The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3.
