Added Spanner Migration Tool alerting and monitoring code for sharded migrations #2017

Open · wants to merge 3 commits into base: main
# Live Migration Monitoring Dashboard - Terraform Module
This Terraform module creates a Google Cloud Monitoring dashboard to visualize key metrics related to the Live migration template(s), including Cloud Storage, Pub/Sub, Dataflow, Spanner, and Datastream statistics. It includes log-based metrics and alert policies to monitor error thresholds, conversion errors, DLQ object counts, and throttling in various GCP services, ensuring a comprehensive view of the migration process.

## Overview
The Terraform configuration is organized into two main folders: `main/` and `modules/`. The `main/` folder contains the core configuration files, including `main.tf`, where the infrastructure resources are defined and module calls are made. `terraform_simple.tfvars` and `terraform.tfvars` hold values for the variables defined in `variables.tf`, which are used throughout the configuration. The `provider.tf` file sets up the Google Cloud provider with the necessary credentials and project details.

The `modules/` folder houses reusable Terraform modules. The `modules/dashboard` folder defines the Monitoring Dashboard resources, specifying how to visualize key metrics from various services used by the Live migration template(s) like Spanner, Dataflow, Pub/Sub, and Cloud Storage. The `modules/alerting` folder contains alert policies for various Google Cloud resources, with separate files for Google Cloud Storage, Pub/Sub, and Dataflow alerting. Finally, the `modules/notification_channels` folder configures the notification channels (email, SMS, etc.) that will be used to alert users when a condition is met.
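Based on the description above, the layout can be sketched as follows (the exact placement of files such as `variables.tf` is an assumption; check the repository for the authoritative tree):

```
monitoring-alerting/
├── main/
│   ├── main.tf
│   ├── variables.tf
│   ├── provider.tf
│   ├── terraform.tfvars
│   └── terraform_simple.tfvars
└── modules/
    ├── dashboard/
    ├── alerting/
    │   ├── gcs/
    │   ├── pubsub/
    │   └── dataflow/
    └── notification_channels/
```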

## Requirements
* **Terraform:** Install Terraform version 0.13 or later on your local machine.
* **Google Cloud Provider:** Make sure you have the Google Cloud provider configured in your Terraform environment.
* **Google Cloud Project:** Create a Google Cloud project and enable the necessary APIs (e.g., Cloud Monitoring, Cloud Storage, Spanner).
* **Service Account:** Create a service account with appropriate permissions to create Monitoring dashboards and access the relevant metrics.
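The API and service-account requirements above can also be bootstrapped in Terraform itself. The snippet below is a hypothetical sketch (the account ID, display name, and the exact role set are illustrative assumptions; creating log-based metrics additionally requires `roles/logging.configWriter`), not part of this module:

```hcl
# Hypothetical bootstrap (assumption): enable required APIs and create a
# service account that Terraform can use for monitoring resources.
resource "google_project_service" "required" {
  for_each = toset([
    "monitoring.googleapis.com",
    "logging.googleapis.com",
    "storage.googleapis.com",
    "spanner.googleapis.com",
  ])
  project = var.project_id
  service = each.key
}

resource "google_service_account" "terraform_monitoring" {
  project      = var.project_id
  account_id   = "terraform-monitoring" # illustrative name
  display_name = "Terraform monitoring/alerting"
}

# Allows the service account to create dashboards and alert policies.
resource "google_project_iam_member" "monitoring_editor" {
  project = var.project_id
  role    = "roles/monitoring.editor"
  member  = "serviceAccount:${google_service_account.terraform_monitoring.email}"
}
```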

## Usage
**1. Clone the Repository**
```shell
git clone <repository-url>
cd <repository-directory>
```
**2. Set Up Authentication**
* Set up your Google Cloud credentials using one of the following methods:
* ***Application Default Credentials:*** Run `gcloud auth application-default login`.
* ***Service Account Key:*** Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of your service account key file.

**3. Provide Variable Values**
* Update the `input.tfvars` file in the main directory and provide the following values:
```hcl
project_id = "<your-project-id>"
prefix = "<your-prefix>"
email_address = "<notification-email-address>"
# Optional:
# region = "<your-desired-region>"
# Customize thresholds as needed for alerting module
gcs_object_count_dlq_threshold = 100
gcs_read_write_throttles_threshold = 5000
pubsub_age_of_oldest_message_threshold = 3600
dataflow_conversion_errors_threshold = 10
dataflow_other_errors_threshold = 50
dataflow_total_errors_threshold = 100
```
**4. Update the `main.tf` to add more modules if necessary**
```hcl
module "my_new_module" {
  source = "./modules/my_new_module"
  # ... (Provide any necessary variables for this module)
}
```


**5. Initialize and Apply:**
```shell
terraform init
terraform plan -var-file="terraform_simple.tfvars" -var-file="terraform.tfvars"
terraform apply -var-file="terraform_simple.tfvars" -var-file="terraform.tfvars"
```

**6. Clean up and destroy:**
```shell
terraform destroy -var-file="terraform_simple.tfvars" -var-file="terraform.tfvars"
```
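The repository's CI (`verify-terraform-samples`) checks that files are in canonical format, so it may help to format the configuration before committing:

```
terraform fmt -recursive
```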

> GitHub Actions check `verify-terraform-samples`: `v2/datastream-to-spanner/terraform/samples/monitoring-alerting/main.tf` is not in canonical format (`terraform fmt`).

# Module for Dashboard Deployment
# This module sets up a Cloud Monitoring dashboard for visualizing metrics.
module "dashboard" {
source = "./modules/dashboard" # Path to the dashboard module source.
project_id = var.project_id # ID of the project where the dashboard will be deployed.
prefix = var.prefix # prefix associated with the monitored resources.
spanner_project_id = var.spanner_project_id # ID of the project containing Spanner metrics to display on the dashboard.
}

# Module for Notification Channels
# This module sets up the notification channels for alerting, e.g., email.
module "notification_channels" {
source = "./modules/notification_channels" # Path to the notification channels module.
email_address = var.email_address # Email address to receive alert notifications.
}

# Module for Google Cloud Storage (GCS) Alerts
# This module defines alerting rules for Google Cloud Storage metrics, like DLQ object count and read/write throttling.
module "gcs_alerts" {
source = "./modules/alerting/gcs" # Path to the GCS alerting module.

project_id = var.project_id # Project ID for general resource monitoring.
alerting_project_id = var.alerting_project_id # ID for the alerting-specific project (if different from project_id).
prefix = var.prefix # Prefix for the resources being monitored.

# Thresholds for triggering GCS alerts
gcs_object_count_dlq_threshold = var.gcs_object_count_dlq_threshold # DLQ object count threshold.
gcs_read_write_throttles_threshold = var.gcs_read_write_throttles_threshold # Threshold for read/write throttles.

notification_channels = module.notification_channels.notification_channels # Links notification channels from the notification_channels module.
}

# Module for Pub/Sub Alerts
# This module defines alerting rules for Pub/Sub metrics, such as the age of the oldest message.
module "pubsub_alerts" {
source = "./modules/alerting/pubsub" # Path to the Pub/Sub alerting module.

project_id = var.project_id # Project ID for general resource monitoring.
alerting_project_id = var.alerting_project_id # ID for the alerting-specific project (if different from project_id).
prefix = var.prefix # prefix for the resources being monitored.

# Threshold for triggering Pub/Sub alerts
pubsub_age_of_oldest_message_threshold = var.pubsub_age_of_oldest_message_threshold # Threshold for the age of the oldest message in Pub/Sub.

notification_channels = module.notification_channels.notification_channels # Links notification channels from the notification_channels module.
}

# Module for Dataflow Alerts
# This module defines alerting rules for Dataflow metrics, including conversion errors, other errors, and total errors.
module "dataflow_alerts" {
source = "./modules/alerting/dataflow" # Path to the Dataflow alerting module.

project_id = var.project_id # Project ID for general resource monitoring.
alerting_project_id = var.alerting_project_id # ID for the alerting-specific project (if different from project_id).
prefix = var.prefix # prefix for the resources being monitored.

# Thresholds for triggering Dataflow alerts
dataflow_conversion_errors_threshold = var.dataflow_conversion_errors_threshold # Threshold for conversion errors.
dataflow_other_errors_threshold = var.dataflow_other_errors_threshold # Threshold for other (non-conversion) errors.
dataflow_total_errors_threshold = var.dataflow_total_errors_threshold # Total errors threshold for Dataflow jobs.

notification_channels = module.notification_channels.notification_channels # Links notification channels from the notification_channels module.
}
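For reference, a minimal email channel inside `modules/notification_channels` might look like the sketch below (the resource name and output shape are assumptions, not the module's actual contents; the output name matches the `module.notification_channels.notification_channels` reference used above):

```hcl
# Hypothetical sketch of modules/notification_channels: one email channel.
resource "google_monitoring_notification_channel" "email" {
  display_name = "Migration alerts email"
  type         = "email"
  labels = {
    email_address = var.email_address
  }
}

# Exposed as a list so alerting modules can attach it directly.
output "notification_channels" {
  value = [google_monitoring_notification_channel.email.id]
}
```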
> GitHub Actions check `verify-terraform-samples`: `v2/datastream-to-spanner/terraform/samples/monitoring-alerting/modules/alerting/dataflow/dataflow.tf` is not in canonical format (`terraform fmt`).

# Log-based metric for tracking Dataflow conversion errors separately
resource "google_logging_metric" "dataflow_conversion_errors_metric" {
name        = "dataflow_conversion_errors_metric"
description = "Metric to track Dataflow conversion errors"

# Filter to capture only conversion errors with severity of ERROR or higher for Dataflow jobs
filter = "resource.type=\"dataflow_job\" AND severity>=ERROR AND textPayload:\"conversion error\""
# Review comment (Member): @shreyakhajanchi Please validate with the recent updates to the metrics if the filter is still accurate.

# Metric descriptor settings to define the metric type and value type
metric_descriptor {
metric_kind = "DELTA" # Tracks change over time
value_type = "INT64" # Counts integer values
}

}

# Log-based metric for other (non-conversion) Dataflow errors
resource "google_logging_metric" "dataflow_other_errors_metric" {
name = "dataflow_other_errors_metric"
description = "Metric to track other Dataflow errors (excluding conversion errors)"

# Filter to capture all Dataflow errors except conversion errors
filter = "resource.type=\"dataflow_job\" AND severity>=ERROR AND NOT textPayload:\"conversion error\""

metric_descriptor {
metric_kind = "DELTA"
value_type = "INT64"
}

}

# Log-based metric for total Dataflow errors (both conversion and other errors)
resource "google_logging_metric" "dataflow_total_errors_metric" {
name = "dataflow_total_errors_metric"
description = "Metric to track the total number of Dataflow errors"

# Filter to capture all Dataflow errors with severity of ERROR or higher
filter = "resource.type=\"dataflow_job\" AND severity>=ERROR"

metric_descriptor {
metric_kind = "DELTA"
value_type = "INT64"
}

}
> Review comment on lines +2 to +44 (Member): Why do we need log-based metrics? Aren't these metrics directly published to GCP in a counter?
>
> An older (but similar) example of querying Retryable errors from Monitoring:
>
>     fetch dataflow_job
>     | metric 'dataflow.googleapis.com/job/user_counter'
>     | filter (resource.job_name == 'hb-dataflow-polltest-100tables-ef7e-3589')
>     | filter (metric.metric_name == 'Retryable errors')
>     | group_by 1m, [value_user_counter_mean: mean(value.user_counter)]
>     | every 1m

# Alert policy for conversion errors in Dataflow
resource "google_monitoring_alert_policy" "dataflow_errors" {
# Display name for the alert policy, with an optional prefix for custom naming
display_name = "${var.prefix} - Dataflow Conversion Errors Alert Policy"

conditions {
display_name = "Dataflow Conversion Errors Condition"

# Condition threshold settings, including metric filter and threshold value
condition_threshold {
# Filter to include only conversion errors with specified project ID
filter = "resource.type=\"dataflow_job\" AND metric.type=\"logging.googleapis.com/user/dataflow_conversion_errors_metric\" AND resource.labels.project_id=\"${var.alerting_project_id}\""
comparison = "COMPARISON_GT" # Triggers alert when threshold exceeded
threshold_value = var.dataflow_conversion_errors_threshold
duration = "60s" # Alert triggered if error rate persists for 60 seconds
}
}

# Notification channels to receive the alert notifications
notification_channels = var.notification_channels

# Ensure alert creation after the metric is defined
depends_on = [google_logging_metric.dataflow_conversion_errors_metric]
combiner = "OR" # Trigger alert if any condition in the policy is met
}

# Alert policy for other Dataflow errors
resource "google_monitoring_alert_policy" "dataflow_other_errors_alert" {
display_name = "${var.prefix} - Dataflow Other Errors Alert"

conditions {
display_name = "Dataflow Other Errors Condition"

condition_threshold {
# Filter for non-conversion errors with specified project ID
filter = "resource.type=\"dataflow_job\" AND metric.type=\"logging.googleapis.com/user/dataflow_other_errors_metric\" AND resource.labels.project_id=\"${var.alerting_project_id}\""
comparison = "COMPARISON_GT"
threshold_value = var.dataflow_other_errors_threshold
duration = "60s"
}
}

notification_channels = var.notification_channels

depends_on = [google_logging_metric.dataflow_other_errors_metric]
combiner = "OR"
}

# Alert policy for total Dataflow errors
resource "google_monitoring_alert_policy" "dataflow_total_errors_alert" {
display_name = "${var.prefix} - Dataflow Total Errors Alert"

conditions {
display_name = "Dataflow Total Errors Condition"

condition_threshold {
# Filter for all types of Dataflow errors with specified project ID
filter = "resource.type=\"dataflow_job\" AND metric.type=\"logging.googleapis.com/user/dataflow_total_errors_metric\" AND resource.labels.project_id=\"${var.alerting_project_id}\""
comparison = "COMPARISON_GT"
threshold_value = var.dataflow_total_errors_threshold
duration = "60s"
}
}

notification_channels = var.notification_channels

depends_on = [google_logging_metric.dataflow_total_errors_metric]
combiner = "OR"
}
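Following the review comment above, an alternative to log-based metrics is to drive an alert policy directly from a Monitoring Query Language (MQL) query via `condition_monitoring_query_language`. The sketch below is an assumption-laden illustration (the `Retryable errors` metric label and the threshold of 10 are taken from the review example and are not verified against current metric names):

```hcl
# Hypothetical MQL-based alternative to the log-based-metric policies above.
resource "google_monitoring_alert_policy" "dataflow_retryable_errors_mql" {
  display_name = "${var.prefix} - Dataflow Retryable Errors (MQL)"
  combiner     = "OR"

  conditions {
    display_name = "Retryable errors above threshold"

    condition_monitoring_query_language {
      query    = <<-EOT
        fetch dataflow_job
        | metric 'dataflow.googleapis.com/job/user_counter'
        | filter (metric.metric_name == 'Retryable errors')
        | group_by 1m, [value_user_counter_mean: mean(value.user_counter)]
        | every 1m
        | condition value_user_counter_mean > 10
      EOT
      duration = "60s"
    }
  }

  notification_channels = var.notification_channels
}
```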
> GitHub Actions check `verify-terraform-samples`: `v2/datastream-to-spanner/terraform/samples/monitoring-alerting/modules/alerting/dataflow/variables.tf` is not in canonical format (`terraform fmt`).

variable "project_id" {
type = string
description = "The ID of the Google Cloud project"
}

variable "alerting_project_id" {
type = string
description = "The ID of the Google Cloud project where alerting resources are created"
}

variable "prefix" {
type = string
description = "Prefix used to identify the resources being monitored"
}

variable "notification_channels" {
description = "List of notification channels"
type = list(string)
}

variable "dataflow_conversion_errors_threshold" {
description = "Threshold for conversion errors"
type = number
}

variable "dataflow_other_errors_threshold" {
description = "Threshold for other errors"
type = number
}

variable "dataflow_total_errors_threshold" {
description = "Threshold for total errors"
type = number
}
> GitHub Actions check `verify-terraform-samples`: `v2/datastream-to-spanner/terraform/samples/monitoring-alerting/modules/alerting/gcs/gcs.tf` is not in canonical format (`terraform fmt`).

# Alert policy for monitoring the object count in Dead Letter Queue (DLQ) directories within GCS buckets
resource "google_monitoring_alert_policy" "object_count_dlq" {
# Display name for the alert policy, includes prefix for easy identification
display_name = "${var.prefix} - GCS Object Count in DLQ Directories Alert"

conditions {
display_name = "Object Count in DLQ Exceeds Threshold"

# Condition to trigger alert when object count in DLQ directory surpasses a specified threshold
condition_threshold {
# Filter to target GCS buckets with names matching a specific prefix and containing "dlq"
# Monitors the "storage/object_count" metric in those buckets
filter = "resource.type = \"gcs_bucket\" AND (resource.labels.project_id = \"${var.alerting_project_id}\" AND resource.labels.bucket_name = starts_with(\"${var.prefix}\") AND resource.labels.bucket_name = monitoring.regex.full_match(\".*dlq.*\")) AND metric.type = \"storage.googleapis.com/storage/object_count\""

# Aggregation settings to define data processing intervals and alignment method
aggregations {
alignment_period = "300s" # Align data points into 5-minute intervals
cross_series_reducer = "REDUCE_NONE" # No cross-series aggregation
per_series_aligner = "ALIGN_MEAN" # Use the mean value within each interval
}

# Trigger alert if object count exceeds threshold immediately (no duration buffer)
comparison = "COMPARISON_GT"
duration = "0s"
threshold_value = var.gcs_object_count_dlq_threshold

# Trigger alert based on the threshold being exceeded in at least 1 data point
trigger {
count = 1
}
}
}

# Combiner set to "OR" for single condition
combiner = "OR"
enabled = true # Enable the alert policy

# Notification channels for sending alert notifications
notification_channels = var.notification_channels
}

# Alert policy for monitoring read and write throttles in GCS buckets
resource "google_monitoring_alert_policy" "read_write_throttles" {
# Display name for the alert policy, includes prefix for easy identification
display_name = "${var.prefix} - GCS Read/Write Throttles Alert"

conditions {
display_name = "Read/Write Throttles Exceed Threshold"

# Condition to trigger alert when read/write throttling in GCS exceeds threshold
condition_threshold {
# Filter to target GCS buckets with names matching a specific prefix
# Monitors the "network/received_bytes_count" metric for throttling
filter = "resource.type = \"gcs_bucket\" AND (resource.labels.project_id = \"${var.alerting_project_id}\" AND resource.labels.bucket_name = starts_with(\"${var.prefix}\")) AND metric.type = \"storage.googleapis.com/network/received_bytes_count\""

# Aggregation settings to define data processing intervals and alignment method
aggregations {
alignment_period = "300s" # Align data points into 5-minute intervals
cross_series_reducer = "REDUCE_NONE" # No cross-series aggregation
per_series_aligner = "ALIGN_MEAN" # Use the mean value within each interval
}

# Trigger alert if read/write throttling exceeds threshold immediately (no duration buffer)
comparison = "COMPARISON_GT"
duration = "0s"
threshold_value = var.gcs_read_write_throttles_threshold

# Trigger alert based on the threshold being exceeded in at least 1 data point
trigger {
count = 1
}
}
}

# Combiner set to "OR" for single condition
combiner = "OR"

# Notification channels for sending alert notifications
notification_channels = var.notification_channels
enabled = true # Enable the alert policy
}
> GitHub Actions check `verify-terraform-samples`: `v2/datastream-to-spanner/terraform/samples/monitoring-alerting/modules/alerting/gcs/variables.tf` is not in canonical format (`terraform fmt`).

variable "gcs_object_count_dlq_threshold" {
description = "Threshold for GCS object count in DLQ directory."
type = number
}

variable "gcs_read_write_throttles_threshold" {
description = "Threshold for GCS read/write throttles."
type = number
}

variable "project_id" {
type = string
description = "The ID of the Google Cloud project"
}

variable "alerting_project_id" {
type = string
description = "The ID of the Google Cloud project where alerting resources are created"
}

variable "prefix" {
type = string
description = "Prefix used to identify the GCS buckets being monitored"
}

variable "notification_channels" {
description = "List of notification channels"
type = list(string)
}