OpenSearch and Spark Integration P0 Demo #316

dai-chen · 2023-03-23T17:27:03Z

dai-chen
Mar 23, 2023
Maintainer

Background

Please find more context about this feature in https://github.com/opensearch-project/sql/issues/1116.

Demo Use Case

Architecture: The customer already has an ingestion pipeline that pushes ALB logs to an S3 bucket. The dataset is very large and keeps growing. A monitoring system raises an alarm on suspicious client IPs.

Workflow: On receiving a notification, the customer wants to quickly load only the corresponding data into a predefined OpenSearch index and dashboard, so they can diagnose and troubleshoot fast using the full-text analytics and visualization offered by OpenSearch.

Solution for Demo: We propose a new Maximus table format on which secondary indexes and materialized views are built. For this use case:

Create a BloomFilter skipping index (coarse-grained, per-file) on client_ip as the first acceleration
As the on-demand second acceleration, create a materialized view (automatically accelerated by the skipping index) that loads data into an OpenSearch index
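
To make the per-file skipping idea concrete, here is a minimal Python sketch. It is not the Maximus implementation: the BloomFilter class, its sizing parameters, and the file names are all hypothetical, chosen only to illustrate how a coarse-grained index can rule out files before a scan.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; bit count and hash count are illustrative assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_skipping_index(files):
    """Build one Bloom filter per file over that file's client_ip values."""
    index = {}
    for path, client_ips in files.items():
        bloom = BloomFilter()
        for ip in client_ips:
            bloom.add(ip)
        index[path] = bloom
    return index


def files_to_scan(index, client_ip):
    """Return only the files whose filter cannot rule out client_ip."""
    return [path for path, bloom in index.items() if bloom.might_contain(client_ip)]


# Hypothetical layout: two data files holding the client IPs used in this demo.
files = {
    "part-0000.parquet": ["10.212.10.100"],
    "part-0001.parquet": ["10.212.10.101"],
}
index = build_skipping_index(files)
candidates = files_to_scan(index, "10.212.10.101")
```

A Bloom filter never reports a false negative, so the file that contains the IP is always among the candidates. It may occasionally report a false positive, which costs an unnecessary file scan but never a wrong result. That trade-off is why it suits a coarse-grained first acceleration.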

Prerequisites

Docker: https://github.com/penghuo/os-sql/tree/feature/spark-integration/docker#setup-dev-env
Change the S3 location in CREATE TABLE to your own and make sure credentials are configured in your local $HOME/.aws
Create a temporary table alb_logs_temp to simulate customer ingestion
Create the OpenSearch indexes deltalog, alb_logs_raw and alb_logs_metrics for Maximus metadata and MV data
Create a dashboard with any chart you want on the alb_logs_raw and alb_logs_metrics indexes created previously

Please run the following requests with Dev Tools in OpenSearch Dashboards.

# Create temporary table to simulate customer ingestion
POST _plugins/_ppl
{
  "query": """source = myspark.jdbc("
    CREATE TABLE IF NOT EXISTS alb_logs_temp
    (
      type string,
      time timestamp,
      elb string,
      client_ip string,
      client_port int,
      target_ip string,
      target_port int,
      request_processing_time double,
      target_processing_time double,
      response_processing_time double,
      elb_status_code int,
      target_status_code string,
      received_bytes bigint,
      sent_bytes bigint,
      request_verb string,
      request_url string,
      request_proto string,
      user_agent string,
      ssl_cipher string,
      ssl_protocol string,
      target_group_arn string,
      trace_id string,
      domain_name string,
      chosen_cert_arn string,
      matched_rule_priority string,
      request_creation_time string,
      actions_executed string,
      redirect_url string,
      lambda_error_reason string,
      target_port_list string,
      target_status_code_list string,
      classification string,
      classification_reason string
    )
    USING PARQUET
    LOCATION 's3a://xxx/'
  ")"""
}

# Insert a few records as existing S3 data
POST _plugins/_ppl
{
  "query": """source = myspark.jdbc("
    INSERT INTO alb_logs_temp
    VALUES
    (
      'https', --type
      CAST('2023-03-15 16:30:00.000000' AS TIMESTAMP), --time
      'app/elb1',      --elb
      '10.212.10.100', --client_ip
      41950,           --client_port
      '10.212.20.1',   --target_ip
      443,   --target_port
      0.002, --request_processing_time
      0.046, --target_processing_time
      0.0,   --response_processing_time
      503,   --elb_status_code
      '503', --target_status_code
      211,   --received_bytes
      364,   --sent_bytes
      'GET', --request_verb
      'https://192.168.1.100:443/solr/', --request_url
      NULL,  --request_proto
      NULL,  --user_agent
      NULL,  --ssl_cipher
      NULL,  --ssl_protocol
      NULL,  --target_group_arn
      NULL,  --trace_id
      NULL,  --domain_name
      NULL,  --chosen_cert_arn
      NULL,  --matched_rule_priority
      NULL,  --request_creation_time
      NULL,  --actions_executed
      NULL,  --redirect_url
      NULL,  --lambda_error_reason
      NULL,  --target_port_list
      NULL,  --target_status_code_list
      NULL,  --classification
      NULL   --classification_reason
    );
  ")"""
}

# Create Maximus table metadata index
PUT deltalog
{
   "settings":{
      "index":{
         "number_of_shards":1,
         "number_of_replicas":0
      }
   },
   "mappings":{
      "properties":{
         "path":{
            "type":"keyword"
         },
         "version":{
            "type":"keyword"
         }
      }
   }
}

# Create OpenSearch index for MV data
PUT alb_logs_raw
{
  "mappings" : {
    "properties" : {
      "receivedBytes" : {
        "type" : "long"
      },
      "requestUrl" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        }
      },
      "requestVerb" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        }
      },
      "sentBytes" : {
        "type" : "long"
      },
      "statusCode" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        }
      },
      "timestamp" : {
        "type" : "date"
      }
    }
  }
}

PUT alb_logs_metrics
{
  "mappings" : {
    "properties" : {
      "count2xx" : {
        "type" : "long"
      },
      "count4xx" : {
        "type" : "long"
      },
      "count5xx" : {
        "type" : "long"
      },
      "latencyInSec" : {
        "type" : "float"
      },
      "timestamp" : {
        "type" : "date"
      },
      "totalCount" : {
        "type" : "long"
      },
      "totalReceivedBytes" : {
        "type" : "long"
      },
      "totalSentBytes" : {
        "type" : "long"
      }
    }
  }
}

Demo

Workflow

Steps

Please run the following requests with Dev Tools in OpenSearch Dashboards.

######################################################################
### Step 1: Configure Source Table and Skipping Index Acceleration ###
######################################################################

# Create Maximus table with existing S3 data
POST _plugins/_ppl
{
  "query": """source = myspark.jdbc("
    CREATE EXTERNAL TABLE maximus_alb_logs
    (
        type string,
        time timestamp,
        elb string,
        client_ip string,
        client_port int,
        target_ip string,
        target_port int,
        request_processing_time double,
        target_processing_time double,
        response_processing_time double,
        elb_status_code int,
        target_status_code string,
        received_bytes bigint,
        sent_bytes bigint,
        request_verb string,
        request_url string,
        request_proto string,
        user_agent string,
        ssl_cipher string,
        ssl_protocol string,
        target_group_arn string,
        trace_id string,
        domain_name string,
        chosen_cert_arn string,
        matched_rule_priority string,
        request_creation_time string,
        actions_executed string,
        redirect_url string,
        lambda_error_reason string,
        target_port_list string,
        target_status_code_list string,
        classification string,
        classification_reason string
    )
    USING MAXIMUS
    LOCATION 's3a://xxx/'
    TBLPROPERTIES ('auto_refresh'='true');
  ")"""
}

# Check Maximus table metadata and data
POST _plugins/_sql
{
  "query": "SELECT * FROM deltalog"
}

POST _plugins/_ppl
{
  "query": """source = myspark.jdbc("
    SELECT * FROM maximus_alb_logs
  ")"""
}


# Add more records to verify auto refresh on Maximus table
POST _plugins/_ppl
{
  "query": """source = myspark.jdbc("
    INSERT INTO alb_logs_temp
    VALUES
    (
      'https', --type
      CAST('2023-03-15 16:31:00.000000' AS TIMESTAMP), --time
      'app/elb1',      --elb
      '10.212.10.101', --client_ip
      41950,           --client_port
      '10.212.20.1',   --target_ip
      443,   --target_port
      0.002, --request_processing_time
      0.046, --target_processing_time
      0.0,   --response_processing_time
      503,   --elb_status_code
      '503', --target_status_code
      211,   --received_bytes
      364,   --sent_bytes
      'GET', --request_verb
      'https://192.168.1.100:443/solr/', --request_url
      NULL,  --request_proto
      NULL,  --user_agent
      NULL,  --ssl_cipher
      NULL,  --ssl_protocol
      NULL,  --target_group_arn
      NULL,  --trace_id
      NULL,  --domain_name
      NULL,  --chosen_cert_arn
      NULL,  --matched_rule_priority
      NULL,  --request_creation_time
      NULL,  --actions_executed
      NULL,  --redirect_url
      NULL,  --lambda_error_reason
      NULL,  --target_port_list
      NULL,  --target_status_code_list
      NULL,  --classification
      NULL   --classification_reason
    );
  ")"""
}


# Create skipping index as first and permanent acceleration strategy
POST _plugins/_ppl
{
  "query": """source = myspark.jdbc("
    CREATE INDEX alb_logs_client_ip_index
    ON maximus_alb_logs (client_ip)
    AS 'bloomfilter';
  ")"""
}

# Check skipping index data
POST _plugins/_sql
{
  "query": "SELECT * FROM alb_logs_client_ip_index"
}

# Check skipping index works with applicable query
POST _plugins/_ppl
{
  "query": """source = myspark.jdbc("
    EXPLAIN
    SELECT *
    FROM maximus_alb_logs
    WHERE client_ip = '10.212.10.101'
  ")"""
}


################################################################
### Step 2: Trigger On-Demand Materialized View Acceleration ###
################################################################

# Create non-aggregate MV to load raw data for full text analytics
POST _plugins/_ppl
{
  "query": """source = myspark.jdbc("
    CREATE MATERIALIZED VIEW alb_logs_raw
    AS
    SELECT
      UNIX_MILLIS(time) AS timestamp,
      request_verb AS requestVerb,
      request_url AS requestUrl,
      target_status_code AS statusCode,
      received_bytes AS receivedBytes,
      sent_bytes AS sentBytes
    FROM maximus_alb_logs
    WHERE client_ip = '10.212.10.101'
  ")"""
}

# Check MV data
POST _plugins/_sql
{
  "query": "SELECT * FROM alb_logs_raw"
}


# Create aggregate MV for log-to-metrics transformation and rollup
POST _plugins/_ppl
{
  "query": """source = myspark.jdbc("
    CREATE MATERIALIZED VIEW alb_logs_metrics
    AS
    SELECT
      UNIX_MILLIS(window.start) AS timestamp,
      COUNT(*) AS totalCount,
      AVG(target_processing_time) FILTER(WHERE target_processing_time != -1) AS latencyInSec,
      COUNT(*) FILTER(WHERE target_status_code LIKE '2__') AS count2xx,
      COUNT(*) FILTER(WHERE target_status_code LIKE '4__') AS count4xx,
      COUNT(*) FILTER(WHERE target_status_code LIKE '5__') AS count5xx,
      SUM(received_bytes) AS totalReceivedBytes,
      SUM(sent_bytes) AS totalSentBytes
    FROM maximus_alb_logs
    WHERE client_ip = '10.212.10.101'
    GROUP BY TUMBLE(time, '1 Minute');
  ")"""
}

# Data will appear as time moves past each 1-minute window in the streaming job above
POST _plugins/_sql
{
  "query": "SELECT * FROM alb_logs_metrics"
}


# Add more records and verify MV refresh
POST _plugins/_ppl
{
  "query": """source = myspark.jdbc("
    INSERT INTO alb_logs_temp
    VALUES
    (
      'https', --type
      CAST('2023-03-15 16:35:00.000000' AS TIMESTAMP), --time
      'app/elb1',      --elb
      '10.212.10.101', --client_ip
      41950,           --client_port
      '10.212.20.1',   --target_ip
      443,   --target_port
      0.002, --request_processing_time
      0.046, --target_processing_time
      0.0,   --response_processing_time
      503,   --elb_status_code
      '503', --target_status_code
      211,   --received_bytes
      364,   --sent_bytes
      'GET', --request_verb
      'https://192.168.1.100:443/solr/', --request_url
      NULL,  --request_proto
      NULL,  --user_agent
      NULL,  --ssl_cipher
      NULL,  --ssl_protocol
      NULL,  --target_group_arn
      NULL,  --trace_id
      NULL,  --domain_name
      NULL,  --chosen_cert_arn
      NULL,  --matched_rule_priority
      NULL,  --request_creation_time
      NULL,  --actions_executed
      NULL,  --redirect_url
      NULL,  --lambda_error_reason
      NULL,  --target_port_list
      NULL,  --target_status_code_list
      NULL,  --classification
      NULL   --classification_reason
    );
  ")"""
}
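
The aggregate materialized view above groups rows into 1-minute tumbling windows. As a rough illustration of what that query computes (not of how the streaming job implements it), the following Python sketch applies the same aggregations to in-memory rows. The tumble_metrics and demo_row helpers, the dict-based row format, and the UTC time zone are assumptions made for this example only.

```python
from collections import defaultdict
from datetime import datetime, timezone


def tumble_metrics(rows, client_ip, window_seconds=60):
    """Aggregate rows for one client_ip into fixed, non-overlapping windows."""
    windows = defaultdict(list)
    for row in rows:
        if row["client_ip"] != client_ip:
            continue
        epoch = int(row["time"].timestamp())
        # Tumbling window: each row falls into exactly one window,
        # identified here by the window's start time.
        windows[epoch - epoch % window_seconds].append(row)

    def count_status(group, first_digit):
        # Mirrors LIKE '5__': exactly three characters, fixed first digit.
        return sum(
            1 for r in group
            if len(r["target_status_code"]) == 3
            and r["target_status_code"][0] == first_digit
        )

    results = []
    for start in sorted(windows):
        group = windows[start]
        # Mirrors FILTER(WHERE target_processing_time != -1).
        latencies = [r["target_processing_time"] for r in group
                     if r["target_processing_time"] != -1]
        results.append({
            "timestamp": start * 1000,  # epoch milliseconds, like UNIX_MILLIS
            "totalCount": len(group),
            "latencyInSec": sum(latencies) / len(latencies) if latencies else None,
            "count2xx": count_status(group, "2"),
            "count4xx": count_status(group, "4"),
            "count5xx": count_status(group, "5"),
            "totalReceivedBytes": sum(r["received_bytes"] for r in group),
            "totalSentBytes": sum(r["sent_bytes"] for r in group),
        })
    return results


def demo_row(minute, client_ip):
    """Build a row with the non-NULL values used by the demo's INSERT statements."""
    return {
        "time": datetime(2023, 3, 15, 16, minute, tzinfo=timezone.utc),  # UTC is assumed
        "client_ip": client_ip,
        "target_processing_time": 0.046,
        "target_status_code": "503",
        "received_bytes": 211,
        "sent_bytes": 364,
    }


rows = [
    demo_row(30, "10.212.10.100"),
    demo_row(31, "10.212.10.101"),
    demo_row(35, "10.212.10.101"),
]
metrics = tumble_metrics(rows, "10.212.10.101")
```

For the three demo records, this yields two windows (16:31 and 16:35), each with totalCount 1 and count5xx 1. The 16:30 record is excluded because its client_ip does not match the filter.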

Cleanup

Remove the docker_os_data and docker_spark_data Docker volumes.

Alternatively, if you don't want to lose everything you created in OpenSearch, start Docker, run the following commands from the CLI, and then re-create only alb_logs_temp.

curl 'http://localhost:9200/deltalog/_delete_by_query' \
--header 'Content-Type: application/json' \
--data '{
  "query": { 
        "match_all" : {}
    }
}
'

curl 'http://localhost:9200/alb_logs_raw/_delete_by_query' \
--header 'Content-Type: application/json' \
--data '{
  "query": { 
        "match_all" : {}
    }
}
'

curl 'http://localhost:9200/alb_logs_metrics/_delete_by_query' \
--header 'Content-Type: application/json' \
--data '{
  "query": { 
        "match_all" : {}
    }
}
'

curl -X DELETE 'http://localhost:9200/alb_logs_client_ip_index'
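
The same cleanup can also be scripted. The sketch below uses only Python's standard library and mirrors the curl commands above. The helper names are hypothetical, and it assumes the unsecured http://localhost:9200 endpoint used in this demo.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # demo endpoint taken from the curl commands


def build_cleanup_requests(base_url=BASE_URL):
    """Build the HTTP requests equivalent to the curl cleanup commands."""
    body = json.dumps({"query": {"match_all": {}}}).encode("utf-8")
    requests = [
        urllib.request.Request(
            f"{base_url}/{index}/_delete_by_query",
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        for index in ("deltalog", "alb_logs_raw", "alb_logs_metrics")
    ]
    # The skipping index is deleted entirely rather than emptied.
    requests.append(
        urllib.request.Request(f"{base_url}/alb_logs_client_ip_index", method="DELETE")
    )
    return requests


def run_cleanup(base_url=BASE_URL):
    """Send each cleanup request and return the HTTP status codes."""
    statuses = []
    for request in build_cleanup_requests(base_url):
        with urllib.request.urlopen(request) as response:
            statuses.append(response.status)
    return statuses
```

Calling run_cleanup() with the Docker environment running sends the four requests in order. urlopen raises urllib.error.HTTPError on a non-2xx response, for example if an index does not exist, so a caller may want to catch that when re-running the cleanup.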

Video

OpenSearch.Spark.demo.part.1.mp4

OpenSearch.Spark.demo.part.2.mp4
