# Added Spanner Migration Tool alerting and monitoring code for sharded migrations #2017
# Live Migration Monitoring Dashboard - Terraform Module

This Terraform module creates a Google Cloud Monitoring dashboard to visualize key metrics related to the Live migration template(s), including Cloud Storage, Pub/Sub, Dataflow, Spanner, and Datastream statistics. It includes log-based metrics and alert policies to monitor error thresholds, conversion errors, DLQ object counts, and throttling in various GCP services, ensuring a comprehensive view of the migration process.
## Overview
The Terraform configuration is organized into two main folders: `main/` and `modules/`. The `main/` folder contains the core configuration files, including `main.tf`, where the infrastructure resources are defined and module calls are made. `terraform_simple.tfvars` and `terraform.tfvars` hold values for the variables defined in `variables.tf`, which are used throughout the configuration. The `provider.tf` file sets up the Google Cloud provider with the necessary credentials and project details.

The `modules/` folder houses reusable Terraform modules. The `modules/dashboard` folder defines the Monitoring dashboard resources, specifying how to visualize key metrics from the services used by the Live migration template(s), such as Spanner, Dataflow, Pub/Sub, and Cloud Storage. The `modules/alerting` folder contains alert policies for various Google Cloud resources, with separate files for Cloud Storage, Pub/Sub, and Dataflow alerting. Finally, the `modules/notification_channels` folder configures the notification channels (email, SMS, etc.) used to alert users when a condition is met.
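The layout described above can be sketched as follows (directory names taken from this overview and from the module `source` paths; the exact file placement in the repository may differ slightly):

```text
.
├── main/
│   ├── main.tf
│   ├── variables.tf
│   ├── provider.tf
│   ├── terraform.tfvars
│   └── terraform_simple.tfvars
└── modules/
    ├── dashboard/
    ├── alerting/
    │   ├── gcs/
    │   ├── pubsub/
    │   └── dataflow/
    └── notification_channels/
```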
## Requirements
* **Terraform:** Install Terraform 0.13 or later on your local machine.
* **Google Cloud Provider:** Make sure the Google Cloud provider is configured in your Terraform environment.
* **Google Cloud Project:** Create a Google Cloud project and enable the necessary APIs (e.g., Cloud Monitoring, Cloud Storage, Spanner).
* **Service Account:** Create a service account with appropriate permissions to create Monitoring dashboards and access the relevant metrics.
## Usage
**1. Clone the Repository**
```sh
git clone <repository-url>
cd <repository-directory>
```
**2. Set Up Authentication**
* Set up your Google Cloud credentials using one of the following methods:
  * ***Application Default Credentials:*** Run `gcloud auth application-default login`.
  * ***Service Account Key:*** Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of your service account key file.
**3. Provide Variable Values**
* Update and modify the `terraform_simple.tfvars` (or `terraform.tfvars`) file in the main directory and provide the following values:
```hcl
project_id = "<your-project-id>"
prefix = "<your-prefix>"
email_address = "<notification-email-address>"
# Optional:
# region = "<your-desired-region>"
# Customize thresholds as needed for the alerting module
gcs_object_count_dlq_threshold = 100
gcs_read_write_throttles_threshold = 5000
pubsub_age_of_oldest_message_threshold = 3600
dataflow_conversion_errors_threshold = 10
dataflow_other_errors_threshold = 50
dataflow_total_errors_threshold = 100
```
**4. Update `main.tf` to add more modules if necessary**
```hcl
module "my_new_module" {
  source = "./modules/my_new_module"
  # ... (Provide any necessary variables for this module)
}
```
**5. Initialize and Apply**
```sh
terraform init
terraform plan -var-file="terraform_simple.tfvars" -var-file="terraform.tfvars"
terraform apply -var-file="terraform_simple.tfvars" -var-file="terraform.tfvars"
```
**6. Clean Up and Destroy**
```sh
terraform destroy -var-file="terraform_simple.tfvars" -var-file="terraform.tfvars"
```
**`main.tf`**
# Module for Dashboard Deployment
# This module sets up a Cloud Monitoring dashboard for visualizing metrics.
module "dashboard" {
  source             = "./modules/dashboard"   # Path to the dashboard module source.
  project_id         = var.project_id          # ID of the project where the dashboard will be deployed.
  prefix             = var.prefix              # Prefix associated with the monitored resources.
  spanner_project_id = var.spanner_project_id  # ID of the project containing Spanner metrics to display on the dashboard.
}
# Module for Notification Channels
# This module sets up the notification channels for alerting, e.g., email.
module "notification_channels" {
  source        = "./modules/notification_channels"  # Path to the notification channels module.
  email_address = var.email_address                  # Email address to receive alert notifications.
}
# Module for Google Cloud Storage (GCS) Alerts
# This module defines alerting rules for GCS metrics, like DLQ object count and read/write throttling.
module "gcs_alerts" {
  source = "./modules/alerting/gcs"  # Path to the GCS alerting module.

  project_id          = var.project_id           # Project ID for general resource monitoring.
  alerting_project_id = var.alerting_project_id  # ID for the alerting-specific project (if different from project_id).
  prefix              = var.prefix               # Prefix for the resources being monitored.

  # Thresholds for triggering GCS alerts
  gcs_object_count_dlq_threshold     = var.gcs_object_count_dlq_threshold      # DLQ object count threshold.
  gcs_read_write_throttles_threshold = var.gcs_read_write_throttles_threshold  # Threshold for read/write throttles.

  notification_channels = module.notification_channels.notification_channels  # Links notification channels from the notification_channels module.
}
# Module for Pub/Sub Alerts
# This module defines alerting rules for Pub/Sub metrics, such as the age of the oldest message.
module "pubsub_alerts" {
  source = "./modules/alerting/pubsub"  # Path to the Pub/Sub alerting module.

  project_id          = var.project_id           # Project ID for general resource monitoring.
  alerting_project_id = var.alerting_project_id  # ID for the alerting-specific project (if different from project_id).
  prefix              = var.prefix               # Prefix for the resources being monitored.

  # Threshold for triggering Pub/Sub alerts
  pubsub_age_of_oldest_message_threshold = var.pubsub_age_of_oldest_message_threshold  # Threshold for the age of the oldest message in Pub/Sub.

  notification_channels = module.notification_channels.notification_channels  # Links notification channels from the notification_channels module.
}
# Module for Dataflow Alerts
# This module defines alerting rules for Dataflow metrics, including conversion errors, other errors, and total errors.
module "dataflow_alerts" {
  source = "./modules/alerting/dataflow"  # Path to the Dataflow alerting module.

  project_id          = var.project_id           # Project ID for general resource monitoring.
  alerting_project_id = var.alerting_project_id  # ID for the alerting-specific project (if different from project_id).
  prefix              = var.prefix               # Prefix for the resources being monitored.

  # Thresholds for triggering Dataflow alerts
  dataflow_conversion_errors_threshold = var.dataflow_conversion_errors_threshold  # Threshold for conversion errors.
  dataflow_other_errors_threshold      = var.dataflow_other_errors_threshold       # Threshold for other (non-conversion) errors.
  dataflow_total_errors_threshold      = var.dataflow_total_errors_threshold       # Total errors threshold for Dataflow jobs.

  notification_channels = module.notification_channels.notification_channels  # Links notification channels from the notification_channels module.
}
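The internals of the `dashboard` and `notification_channels` modules are not shown in this diff. A minimal sketch of what they might contain is below; the resource names, the dashboard JSON layout, and the single-email channel are illustrative assumptions, not the actual module code in this PR:

```hcl
# Hypothetical sketch only; the real modules may differ.

# modules/dashboard: a dashboard resource driven by the module's inputs.
resource "google_monitoring_dashboard" "migration" {
  project = var.project_id
  dashboard_json = jsonencode({
    displayName  = "${var.prefix} - Live Migration"
    mosaicLayout = { tiles = [] } # widgets for Spanner, Dataflow, Pub/Sub, GCS would go here
  })
}

# modules/notification_channels: an email channel plus the output consumed
# above as module.notification_channels.notification_channels.
resource "google_monitoring_notification_channel" "email" {
  display_name = "Migration alerts email"
  type         = "email"
  labels = {
    email_address = var.email_address
  }
}

output "notification_channels" {
  value = [google_monitoring_notification_channel.email.id]
}
```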
**`modules/alerting/dataflow/dataflow.tf`**
# Log-based metric for tracking Dataflow conversion errors separately
resource "google_logging_metric" "dataflow_conversion_errors_metric" {
  name = "dataflow_conversion_errors_metric"

  # Filter to capture only conversion errors with severity of ERROR or higher for Dataflow jobs
  filter = "resource.type=\"dataflow_job\" AND severity>=ERROR AND textPayload:\"conversion error\""

  # Metric descriptor settings to define the metric type and value type
  metric_descriptor {
    metric_kind = "DELTA" # Tracks change over time
    value_type  = "INT64" # Counts integer values
  }
}

> **Review comment:** @shreyakhajanchi Please validate with the recent updates to the metrics if the …
# Log-based metric for other (non-conversion) Dataflow errors
resource "google_logging_metric" "dataflow_other_errors_metric" {
  name        = "dataflow_other_errors_metric"
  description = "Metric to track other Dataflow errors (excluding conversion errors)"

  # Filter to capture all Dataflow errors except conversion errors
  filter = "resource.type=\"dataflow_job\" AND severity>=ERROR AND NOT textPayload:\"conversion error\""

  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
  }
}
|
||
# Log-based metric for total Dataflow errors (both conversion and other errors) | ||
resource "google_logging_metric" "dataflow_total_errors_metric" { | ||
name = "dataflow_total_errors_metric" | ||
description = "Metric to track the total number of Dataflow errors" | ||
|
||
# Filter to capture all Dataflow errors with severity of ERROR or higher | ||
filter = "resource.type=\"dataflow_job\" AND severity>=ERROR" | ||
|
||
metric_descriptor { | ||
metric_kind = "DELTA" | ||
value_type = "INT64" | ||
} | ||
|
||
} | ||
Comment on lines
+2
to
+44
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do we need log-based metrics? Aren't these metrics directly published to GCP in a counter? An older (but similar) example of querying
|
||
|
||
# Alert policy for conversion errors in Dataflow
resource "google_monitoring_alert_policy" "dataflow_errors" {
  # Display name for the alert policy, with an optional prefix for custom naming
  display_name = "${var.prefix} - Dataflow Conversion Errors Alert Policy"

  conditions {
    display_name = "Dataflow Conversion Errors Condition"

    # Condition threshold settings, including metric filter and threshold value
    condition_threshold {
      # Filter to include only conversion errors in the specified project
      filter          = "resource.type=\"dataflow_job\" AND metric.type=\"logging.googleapis.com/user/dataflow_conversion_errors_metric\" AND resource.labels.project_id = \"${var.alerting_project_id}\""
      comparison      = "COMPARISON_GT" # Triggers alert when threshold exceeded
      threshold_value = var.dataflow_conversion_errors_threshold
      duration        = "60s" # Alert triggered if error rate persists for 60 seconds
    }
  }

  # Notification channels to receive the alert notifications
  notification_channels = var.notification_channels

  # Ensure alert creation after the metric is defined
  depends_on = [google_logging_metric.dataflow_conversion_errors_metric]
  combiner   = "OR" # Trigger alert if any condition in the policy is met
}
# Alert policy for other Dataflow errors
resource "google_monitoring_alert_policy" "dataflow_other_errors_alert" {
  display_name = "${var.prefix} - Dataflow Other Errors Alert"

  conditions {
    display_name = "Dataflow Other Errors Condition"

    condition_threshold {
      # Filter for non-conversion errors in the specified project
      filter          = "resource.type=\"dataflow_job\" AND metric.type=\"logging.googleapis.com/user/dataflow_other_errors_metric\" AND resource.labels.project_id = \"${var.alerting_project_id}\""
      comparison      = "COMPARISON_GT"
      threshold_value = var.dataflow_other_errors_threshold
      duration        = "60s"
    }
  }

  notification_channels = var.notification_channels

  depends_on = [google_logging_metric.dataflow_other_errors_metric]
  combiner   = "OR"
}
# Alert policy for total Dataflow errors
resource "google_monitoring_alert_policy" "dataflow_total_errors_alert" {
  display_name = "${var.prefix} - Dataflow Total Errors Alert"

  conditions {
    display_name = "Dataflow Total Errors Condition"

    condition_threshold {
      # Filter for all types of Dataflow errors in the specified project
      filter          = "resource.type=\"dataflow_job\" AND metric.type=\"logging.googleapis.com/user/dataflow_total_errors_metric\" AND resource.labels.project_id = \"${var.alerting_project_id}\""
      comparison      = "COMPARISON_GT"
      threshold_value = var.dataflow_total_errors_threshold
      duration        = "60s"
    }
  }

  notification_channels = var.notification_channels

  depends_on = [google_logging_metric.dataflow_total_errors_metric]
  combiner   = "OR"
}
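The Dataflow conditions above rely on the default alignment of the underlying DELTA metrics, while the GCS policies in this change aggregate data points over explicit 5-minute windows. If the Dataflow alerts prove noisy, an explicit aggregation could be added to each condition; a sketch (the alignment values here are assumptions, not part of this PR):

```hcl
condition_threshold {
  filter          = "resource.type=\"dataflow_job\" AND metric.type=\"logging.googleapis.com/user/dataflow_total_errors_metric\""
  comparison      = "COMPARISON_GT"
  threshold_value = var.dataflow_total_errors_threshold
  duration        = "60s"

  aggregations {
    alignment_period   = "60s"
    per_series_aligner = "ALIGN_SUM" # sum matching log entries per minute
  }
}
```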
**`modules/alerting/dataflow/variables.tf`**
variable "project_id" {
  type        = string
  description = "The ID of the Google Cloud project"
}

variable "alerting_project_id" {
  type        = string
  description = "The ID of the Google Cloud project where alerting is created"
}

variable "prefix" {
  type        = string
  description = "Prefix used to identify the monitored resources"
}

variable "notification_channels" {
  description = "List of notification channels"
  type        = list(string)
}

variable "dataflow_conversion_errors_threshold" {
  description = "Threshold for conversion errors"
  type        = number
}

variable "dataflow_other_errors_threshold" {
  description = "Threshold for other errors"
  type        = number
}

variable "dataflow_total_errors_threshold" {
  description = "Threshold for total errors"
  type        = number
}
**GCS alert policies (`modules/alerting/gcs`)**
# Alert policy for monitoring the object count in Dead Letter Queue (DLQ) directories within GCS buckets
resource "google_monitoring_alert_policy" "object_count_dlq" {
  # Display name for the alert policy, includes prefix for easy identification
  display_name = "${var.prefix} - GCS Object Count in DLQ Directories Alert"

  conditions {
    display_name = "Object Count in DLQ Exceeds Threshold"

    # Condition to trigger alert when object count in DLQ directory surpasses a specified threshold
    condition_threshold {
      # Filter to target GCS buckets whose names start with the prefix and contain "dlq"
      # Monitors the "storage/object_count" metric in those buckets
      filter = "resource.type = \"gcs_bucket\" AND (resource.labels.project_id = \"${var.alerting_project_id}\" AND resource.labels.bucket_name = starts_with(\"${var.prefix}\") AND resource.labels.bucket_name = monitoring.regex.full_match(\".*dlq.*\")) AND metric.type = \"storage.googleapis.com/storage/object_count\""

      # Aggregation settings to define data processing intervals and alignment method
      aggregations {
        alignment_period     = "300s"        # Align data points into 5-minute intervals
        cross_series_reducer = "REDUCE_NONE" # No cross-series aggregation
        per_series_aligner   = "ALIGN_MEAN"  # Use the mean value within each interval
      }

      # Trigger alert if object count exceeds threshold immediately (no duration buffer)
      comparison      = "COMPARISON_GT"
      duration        = "0s"
      threshold_value = var.gcs_object_count_dlq_threshold

      # Trigger alert based on the threshold being exceeded in at least 1 data point
      trigger {
        count = 1
      }
    }
  }

  # Combiner set to "OR" for single condition
  combiner = "OR"
  enabled  = true # Enable the alert policy

  # Notification channels for sending alert notifications
  notification_channels = var.notification_channels
}
|
||
# Alert policy for monitoring read and write throttles in GCS buckets | ||
resource "google_monitoring_alert_policy" "read_write_throttles" { | ||
# Display name for the alert policy, includes prefix for easy identification | ||
display_name = format("${var.prefix} - GCS Read/Write Throttles Alert") | ||
|
||
conditions { | ||
display_name = "Read/Write Throttles Exceed Threshold" | ||
|
||
# Condition to trigger alert when read/write throttling in GCS exceeds threshold | ||
condition_threshold { | ||
# Filter to target GCS buckets with names matching a specific prefix | ||
# Monitors the "network/received_bytes_count" metric for throttling | ||
filter = "resource.type = \"gcs_bucket\" AND (resource.labels.project_id = ${var.alerting_project_id} AND resource.labels.bucket_name = starts_with(${var.prefix})) AND metric.type = \"storage.googleapis.com/network/received_bytes_count\"" | ||
|
||
# Aggregation settings to define data processing intervals and alignment method | ||
aggregations { | ||
alignment_period = "300s" # Align data points into 5-minute intervals | ||
cross_series_reducer = "REDUCE_NONE" # No cross-series aggregation | ||
per_series_aligner = "ALIGN_MEAN" # Use the mean value within each interval | ||
} | ||
|
||
# Trigger alert if read/write throttling exceeds threshold immediately (no duration buffer) | ||
comparison = "COMPARISON_GT" | ||
duration = "0s" | ||
threshold_value = var.gcs_read_write_throttles_threshold | ||
|
||
# Trigger alert based on the threshold being exceeded in at least 1 data point | ||
trigger { | ||
count = 1 | ||
} | ||
} | ||
} | ||
|
||
# Combiner set to "OR" for single condition | ||
combiner = "OR" | ||
|
||
# Notification channels for sending alert notifications | ||
notification_channels = var.notification_channels | ||
enabled = true # Enable the alert policy | ||
} |
**GCS alerting variables (`modules/alerting/gcs`)**
variable "gcs_object_count_dlq_threshold" {
  description = "Threshold for GCS object count in DLQ directory."
  type        = number
}

variable "gcs_read_write_throttles_threshold" {
  description = "Threshold for GCS read/write throttles."
  type        = number
}

variable "project_id" {
  type        = string
  description = "The ID of the Google Cloud project"
}

variable "alerting_project_id" {
  type        = string
  description = "The ID of the Google Cloud project where alerting is created"
}

variable "prefix" {
  type        = string
  description = "Prefix used to match the GCS bucket names"
}

variable "notification_channels" {
  description = "List of notification channels"
  type        = list(string)
}
> **Review comment:** Please add this template to the list of examples here:
> https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/v2/datastream-to-spanner/terraform/samples#list-of-examples