
[RFC] CDK to deploy EMR GLUE dev environment #1000

Open

kenrickyap opened this issue Dec 20, 2024 · 1 comment

Labels
enhancement New feature or request Lang:PPL Pipe Processing Language support testing test related feature

Comments

kenrickyap (Contributor) commented Dec 20, 2024

Is your feature request related to a problem?

Currently there is no easy way to spin up an environment to test Spark SQL/PPL commands against an EMR Spark instance that leverages the Glue Data Catalog.

What solution would you like?

Provide a CDK stack, placed under docs in the spark repo along with deployment instructions, to deploy the AWS resources needed for manual testing.

The CDK will deploy the following:

  • S3 bucket - the CDK will add the following to the bucket:
    • opensearch-spark-ppl_2.12-0.7.0-SNAPSHOT.jar, so that the Spark instance can leverage opensearch-spark
    • test.csv, containing data for the test tables that Glue will integrate
  • Glue database + table - creates a Glue table on top of test.csv in the S3 bucket
  • EMR instance - hosts Spark and will use opensearch-spark-ppl_2.12-0.7.0-SNAPSHOT.jar
  • EMR IAM role - enables EMR to read from the S3 bucket and Glue
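
As a concrete sketch of the test fixture, the test.csv could be generated like this; the columns and rows are illustrative assumptions, not a schema defined by this RFC:

```python
import csv

# Hypothetical schema for test.csv; adjust the columns and rows to
# whatever the Glue table definition ends up using.
rows = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob", "age": 25},
]

with open("test.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "age"])
    writer.writeheader()
    writer.writerows(rows)
```

The CDK stack would then upload this file under data/test.csv in the S3 bucket, and the Glue table's schema would mirror these columns.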

Implementation details

  • As described in this doc, the EMR cluster must be defined with the following configuration to use the Glue Data Catalog as its Hive metastore:
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
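
The same spark-hive-site classification can also be written out as a JSON file for non-CDK cluster creation, e.g. `aws emr create-cluster --configurations file://configurations.json`; the file name here is an arbitrary choice:

```python
import json

# spark-hive-site classification pointing Hive's metastore client at Glue,
# in the list shape expected by `aws emr create-cluster --configurations`.
configurations = [
    {
        "Classification": "spark-hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    }
]

with open("configurations.json", "w") as f:
    json.dump(configurations, f, indent=2)
```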
  • For the user to run spark.sql() commands with PPL support there are 2 options:

    • Launch an interactive spark-shell session:
      spark-shell \
      --jars s3://<bucket_name>/opensearch-spark-ppl_2.12-0.7.0-SNAPSHOT.jar \
      --<additional conf as required>

    • Submit a Spark job using the following command:
      spark-submit \
      --jars s3://<bucket_name>/opensearch-spark-ppl_2.12-0.7.0-SNAPSHOT.jar \
      --<additional conf as required> \
      <s3 path to query script>
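
The spark-submit option can be sketched as a small launcher that assembles the command line; the bucket name, jar key, and script path below are hypothetical placeholders, and the command is only constructed here, not executed:

```python
import shlex

# Hypothetical values; substitute the real bucket/key/script for your stack.
bucket = "my-dev-bucket"
jar_key = "jars/opensearch-spark-ppl_2.12-0.7.0-SNAPSHOT.jar"
script = "s3://my-dev-bucket/scripts/query.py"

# Build (but do not run) the spark-submit invocation from this RFC;
# pass the list to subprocess.run(cmd) on a machine with Spark installed.
cmd = [
    "spark-submit",
    "--jars", f"s3://{bucket}/{jar_key}",    # extra --conf flags go here as required
    script,
]
print(shlex.join(cmd))
```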
    

What alternatives have you considered?

  • An existing RFC describes a solution for running integration tests in a Docker image of an EMR + Glue environment. This could be an alternative for testing this case; however, it would not test directly against AWS.

Do you have any additional context?

Possible CDK stack (the stack has not been tested, as I currently do not have access to an AWS account):

from aws_cdk import (core, aws_s3 as s3, aws_s3_deployment as s3deploy, aws_iam as iam, aws_emr as emr, aws_glue as glue)

class OpenSearchSparkEMRDevStack(core.Stack):

    def __init__(self, scope: core.Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        # S3 Bucket
        bucket = s3.Bucket(self, "OpenSearchSparkBucket", versioned=True)

        # Local asset directories to upload (paths are placeholders).
        # Note: s3.Bucket has no upload_local_file method; aws_s3_deployment's
        # BucketDeployment uploads the contents of a local directory instead.
        jar_dir = "<path_to_dir_containing_jar>"
        csv_dir = "<path_to_dir_containing_csv>"

        # Upload files to S3
        s3deploy.BucketDeployment(
            self, "DeployJar",
            sources=[s3deploy.Source.asset(jar_dir)],
            destination_bucket=bucket,
            destination_key_prefix="jars"
        )
        s3deploy.BucketDeployment(
            self, "DeployCsv",
            sources=[s3deploy.Source.asset(csv_dir)],
            destination_bucket=bucket,
            destination_key_prefix="data"
        )

        # Glue Database
        glue_database = glue.CfnDatabase(
            self, "OpenSearchSparkDevEnvGlueDatabase",
            catalog_id="<catalog_id>",
            database_input=glue.CfnDatabase.DatabaseInputProperty(
                # input relevant DatabaseInputProperty
            )
        )

        # Glue Table
        glue_table = glue.CfnTable(
            self, "OpenSearchSparkDevEnvGlueTable",
            catalog_id="<catalog_id>",
            database_name="<database_name>",
            table_input=glue.CfnTable.TableInputProperty(
                # input relevant TableInputProperty
            )
        )

        # IAM roles for EMR. The service role is assumed by the EMR service
        # itself; the job-flow role is assumed by the cluster's EC2 instances
        # and grants the S3/Glue read access described above.
        emr_service_role = iam.Role(
            self, "EMRServiceRole",
            assumed_by=iam.ServicePrincipal("elasticmapreduce.amazonaws.com")
        )
        emr_service_role.add_managed_policy(
            iam.ManagedPolicy.from_aws_managed_policy_name("service-role/AmazonElasticMapReduceRole")
        )

        emr_job_flow_role = iam.Role(
            self, "EMRJobFlowRole",
            assumed_by=iam.ServicePrincipal("ec2.amazonaws.com")
        )
        emr_job_flow_role.add_managed_policy(
            iam.ManagedPolicy.from_aws_managed_policy_name("AmazonS3FullAccess")
        )
        emr_job_flow_role.add_managed_policy(
            iam.ManagedPolicy.from_aws_managed_policy_name("AWSGlueConsoleFullAccess")
        )

        # jobFlowRole expects an instance profile (not a bare role), so wrap it
        emr_instance_profile = iam.CfnInstanceProfile(
            self, "EMRInstanceProfile",
            roles=[emr_job_flow_role.role_name]
        )

        # EMR Cluster
        emr_cluster = emr.CfnCluster(
            self, "OpenSearchSparkEMRCluster",
            instances={
                "masterInstanceGroup": {
                    "instanceType": "m5.xlarge",
                    "instanceCount": 1
                },
                "coreInstanceGroup": {
                    "instanceType": "m5.xlarge",
                    "instanceCount": 2
                }
            },
            job_flow_role=emr_instance_profile.ref,
            service_role=emr_service_role.role_name,
            name="OpenSearchSparkCluster",
            release_label="emr-6.15.0",  # example EMR release that ships Spark 3.x
            applications=[{"name": "Hadoop"}, {"name": "Spark"}],
            configurations=[
                {
                    "classification": "spark-hive-site",
                    # CloudFormation's EMR Configuration calls this key
                    # "configurationProperties" (the EMR API calls it "Properties")
                    "configurationProperties": {
                        "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
                    }
                }
            ]
        )

app = core.App()
OpenSearchSparkEMRDevStack(app, "OpenSearchSparkEMRDevStack")
app.synth()
@kenrickyap kenrickyap added enhancement New feature or request untriaged labels Dec 20, 2024
@YANG-DB YANG-DB added Lang:PPL Pipe Processing Language support testing test related feature labels Dec 21, 2024
YANG-DB (Member) commented Dec 27, 2024

@kenrickyap I have some additional context here, which has a working sample of an AWS CDK app for deploying a similar stack.
