
[RFC] CDK to deploy EMR GLUE dev environment #1000

Open

kenrickyap opened this issue Dec 20, 2024 · 1 comment

Labels
enhancement New feature or request Lang:PPL Pipe Processing Language support testing test related feature

Comments

kenrickyap (Contributor) commented Dec 20, 2024

Is your feature request related to a problem?

Currently there is no easy way to spin up an environment to test Spark SQL/PPL commands against an EMR Spark instance that leverages the Glue Data Catalog.

What solution would you like?

Provide a CDK stack, placed under docs in the spark repo along with deployment instructions, to deploy the AWS resources needed for manual testing.

The CDK will deploy the following:

  • S3 bucket - the CDK will add the following to the bucket:
    • opensearch-spark-ppl_2.12-0.7.0-SNAPSHOT.jar, so that the Spark instance can leverage opensearch-spark
    • test.csv, containing data for the test tables that Glue will integrate
  • Glue database + table - creates a Glue table on top of test.csv in the S3 bucket
  • EMR instance - hosts Spark and will use opensearch-spark-ppl_2.12-0.7.0-SNAPSHOT.jar
  • EMR IAM role - enables EMR to read from the S3 bucket and Glue
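
As a concrete sketch of the test fixture, the test.csv could be generated like this; the columns and rows are illustrative assumptions, not a schema defined by this RFC:

```python
import csv

# Hypothetical schema for test.csv; adjust the columns and rows to
# whatever the Glue table definition ends up using.
rows = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob", "age": 25},
]

with open("test.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "age"])
    writer.writeheader()
    writer.writerows(rows)
```

The CDK stack would then upload this file under data/test.csv in the S3 bucket, and the Glue table's schema would mirror these columns.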

Implementation details

  • As described in this doc, the EMR cluster must be defined with the following configuration to use the Glue Data Catalog as its Hive metastore:
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
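
The same spark-hive-site classification can also be written out as a JSON file for non-CDK cluster creation, e.g. `aws emr create-cluster --configurations file://configurations.json`; the file name here is an arbitrary choice:

```python
import json

# spark-hive-site classification pointing Hive's metastore client at Glue,
# in the list shape expected by `aws emr create-cluster --configurations`.
configurations = [
    {
        "Classification": "spark-hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    }
]

with open("configurations.json", "w") as f:
    json.dump(configurations, f, indent=2)
```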
  • For the user to run spark.sql() commands with PPL support there are 2 options:

    • Launch an interactive spark-shell session:
      spark-shell \
      --jars s3://<bucket_name>/opensearch-spark-ppl_2.12-0.7.0-SNAPSHOT.jar \
      --<additional conf as required>

    • Submit a Spark job using the following command:
      spark-submit \
      --jars s3://<bucket_name>/opensearch-spark-ppl_2.12-0.7.0-SNAPSHOT.jar \
      --<additional conf as required> \
      <s3 path to query script>
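
The spark-submit option can be sketched as a small launcher that assembles the command line; the bucket name, jar key, and script path below are hypothetical placeholders, and the command is only constructed here, not executed:

```python
import shlex

# Hypothetical values; substitute the real bucket/key/script for your stack.
bucket = "my-dev-bucket"
jar_key = "jars/opensearch-spark-ppl_2.12-0.7.0-SNAPSHOT.jar"
script = "s3://my-dev-bucket/scripts/query.py"

# Build (but do not run) the spark-submit invocation from this RFC;
# pass the list to subprocess.run(cmd) on a machine with Spark installed.
cmd = [
    "spark-submit",
    "--jars", f"s3://{bucket}/{jar_key}",    # extra --conf flags go here as required
    script,
]
print(shlex.join(cmd))
```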
    

What alternatives have you considered?

  • An existing RFC describes a solution for running integration tests in a Docker image of an EMR + Glue environment. This could be an alternative for testing this case; however, it would not test directly against AWS.

Do you have any additional context?

Possible CDK stack (the stack has not been tested, as I currently do not have access to an AWS account):

from aws_cdk import (core, aws_s3 as s3, aws_s3_deployment as s3deploy, aws_iam as iam, aws_emr as emr, aws_glue as glue)

class OpenSearchSparkEMRDevStack(core.Stack):

    def __init__(self, scope: core.Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        # S3 Bucket
        bucket = s3.Bucket(self, "OpenSearchSparkBucket", versioned=True)

        # Local asset directories to upload (paths are placeholders).
        # Note: s3.Bucket has no upload_local_file method; aws_s3_deployment's
        # BucketDeployment uploads the contents of a local directory instead.
        jar_dir = "<path_to_dir_containing_jar>"
        csv_dir = "<path_to_dir_containing_csv>"

        # Upload files to S3
        s3deploy.BucketDeployment(
            self, "DeployJar",
            sources=[s3deploy.Source.asset(jar_dir)],
            destination_bucket=bucket,
            destination_key_prefix="jars"
        )
        s3deploy.BucketDeployment(
            self, "DeployCsv",
            sources=[s3deploy.Source.asset(csv_dir)],
            destination_bucket=bucket,
            destination_key_prefix="data"
        )

        # Glue Database
        glue_database = glue.CfnDatabase(
            self, "OpenSearchSparkDevEnvGlueDatabase",
            catalog_id="<catalog_id>",
            database_input=glue.CfnDatabase.DatabaseInputProperty(
                # input relevant DatabaseInputProperty
            )
        )

        # Glue Table
        glue_table = glue.CfnTable(
            self, "OpenSearchSparkDevEnvGlueTable",
            catalog_id="<catalog_id>",
            database_name="<database_name>",
            table_input=glue.CfnTable.TableInputProperty(
                # input relevant TableInputProperty
            )
        )

        # IAM roles for EMR. The service role is assumed by the EMR service
        # itself; the job-flow role is assumed by the cluster's EC2 instances
        # and grants the S3/Glue read access described above.
        emr_service_role = iam.Role(
            self, "EMRServiceRole",
            assumed_by=iam.ServicePrincipal("elasticmapreduce.amazonaws.com")
        )
        emr_service_role.add_managed_policy(
            iam.ManagedPolicy.from_aws_managed_policy_name("service-role/AmazonElasticMapReduceRole")
        )

        emr_job_flow_role = iam.Role(
            self, "EMRJobFlowRole",
            assumed_by=iam.ServicePrincipal("ec2.amazonaws.com")
        )
        emr_job_flow_role.add_managed_policy(
            iam.ManagedPolicy.from_aws_managed_policy_name("AmazonS3FullAccess")
        )
        emr_job_flow_role.add_managed_policy(
            iam.ManagedPolicy.from_aws_managed_policy_name("AWSGlueConsoleFullAccess")
        )

        # jobFlowRole expects an instance profile (not a bare role), so wrap it
        emr_instance_profile = iam.CfnInstanceProfile(
            self, "EMRInstanceProfile",
            roles=[emr_job_flow_role.role_name]
        )

        # EMR Cluster
        emr_cluster = emr.CfnCluster(
            self, "OpenSearchSparkEMRCluster",
            instances={
                "masterInstanceGroup": {
                    "instanceType": "m5.xlarge",
                    "instanceCount": 1
                },
                "coreInstanceGroup": {
                    "instanceType": "m5.xlarge",
                    "instanceCount": 2
                }
            },
            job_flow_role=emr_instance_profile.ref,
            service_role=emr_service_role.role_name,
            name="OpenSearchSparkCluster",
            release_label="emr-6.15.0",  # example EMR release that ships Spark 3.x
            applications=[{"name": "Hadoop"}, {"name": "Spark"}],
            configurations=[
                {
                    "classification": "spark-hive-site",
                    # CloudFormation's EMR Configuration calls this key
                    # "configurationProperties" (the EMR API calls it "Properties")
                    "configurationProperties": {
                        "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
                    }
                }
            ]
        )

app = core.App()
OpenSearchSparkEMRDevStack(app, "OpenSearchSparkEMRDevStack")
app.synth()
@kenrickyap kenrickyap added enhancement New feature or request untriaged labels Dec 20, 2024
@YANG-DB YANG-DB added Lang:PPL Pipe Processing Language support testing test related feature labels Dec 21, 2024
YANG-DB (Member) commented Dec 27, 2024

@kenrickyap I have some additional context here, which has a working sample of an AWS CDK app for deploying a similar stack.
