-
Notifications
You must be signed in to change notification settings - Fork 972
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added Spanner Migration Tool alerting and monitoring code for sharded migrations #2017
base: main
Are you sure you want to change the base?
Added Spanner Migration Tool alerting and monitoring code for sharded migrations #2017
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. This is great work. Added some comments, please take a look.
Overall -
Please look at the reference of an existing template and follow a similar structure wherever possible. We also have a status check which validates terraform templates and should ideally validate a bunch of stuff as well, please see it's output.
@@ -0,0 +1,48 @@ | |||
# Terraform state files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this? I think there is a global gitignore that we can use instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, we do not need this. I have removed this file in the new commit.
@@ -0,0 +1,63 @@ | |||
# Spanner Migration Tool Monitoring Dashboard - Terraform Module |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where possible, can you mirror the sample
structure we have for other templates?
Example: We include a terraform.tfvars
and terraform_simple.tfvars
for each sample. Definitions are here - https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/v2/datastream-to-spanner/terraform/samples#sample-structure
As much as possible, please try to keep the naming convention consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using "Spanner Migration Tool" as a way to refer, could you please replace it with "Live migration template(s)".
For example -
- Live Migration Monitoring Dashboard (Title)
visualize key metrics related to the Live migration template(s)
(first line)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the README.md
and the dashboard title to use Live Migration Monitoring Dashboard
name = "dataflow_conversion_errors_metric" | ||
|
||
# Filter to capture only conversion errors with severity of ERROR or higher for Dataflow jobs | ||
filter = "resource.type=\"dataflow_job\" AND severity>=ERROR AND textPayload:\"conversion error\"" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shreyakhajanchi Please validate with the recent updates to the metrics if the filter
is still accurate.
resource "google_logging_metric" "dataflow_conversion_errors_metric" { | ||
name = "dataflow_conversion_errors_metric" | ||
|
||
# Filter to capture only conversion errors with severity of ERROR or higher for Dataflow jobs | ||
filter = "resource.type=\"dataflow_job\" AND severity>=ERROR AND textPayload:\"conversion error\"" | ||
|
||
# Metric descriptor settings to define the metric type and value type | ||
metric_descriptor { | ||
metric_kind = "DELTA" # Tracks change over time | ||
value_type = "INT64" # Counts integer values | ||
} | ||
|
||
} | ||
|
||
# Log-based metric for other (non-conversion) Dataflow errors | ||
resource "google_logging_metric" "dataflow_other_errors_metric" { | ||
name = "dataflow_other_errors_metric" | ||
description = "Metric to track other Dataflow errors (excluding conversion errors)" | ||
|
||
# Filter to capture all Dataflow errors except conversion errors | ||
filter = "resource.type=\"dataflow_job\" AND severity>=ERROR AND NOT textPayload:\"conversion error\"" | ||
|
||
metric_descriptor { | ||
metric_kind = "DELTA" | ||
value_type = "INT64" | ||
} | ||
|
||
} | ||
|
||
# Log-based metric for total Dataflow errors (both conversion and other errors) | ||
resource "google_logging_metric" "dataflow_total_errors_metric" { | ||
name = "dataflow_total_errors_metric" | ||
description = "Metric to track the total number of Dataflow errors" | ||
|
||
# Filter to capture all Dataflow errors with severity of ERROR or higher | ||
filter = "resource.type=\"dataflow_job\" AND severity>=ERROR" | ||
|
||
metric_descriptor { | ||
metric_kind = "DELTA" | ||
value_type = "INT64" | ||
} | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need log-based metrics? Aren't these metrics directly published to GCP in a counter?
An older (but similar) example of querying Retryable Errors
from Monitoring -
fetch dataflow_job
| metric 'dataflow.googleapis.com/job/user_counter'
| filter (resource.job_name == 'hb-dataflow-polltest-100tables-ef7e-3589')
| filter (metric.metric_name == 'Retryable errors')
| group_by 1m, [value_user_counter_mean: mean(value.user_counter)]
| every 1m
Sample Status check run - This shows that the file is not formatted with |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2017 +/- ##
============================================
+ Coverage 45.41% 52.93% +7.51%
+ Complexity 3675 1367 -2308
============================================
Files 842 378 -464
Lines 49970 20661 -29309
Branches 5261 2090 -3171
============================================
- Hits 22692 10936 -11756
+ Misses 25608 9045 -16563
+ Partials 1670 680 -990
|
pubsub_age_of_oldest_message_threshold = 120 # Maximum age of the oldest message in Pub/Sub in seconds. | ||
|
||
# Notification Configuration | ||
email_address = "[email protected]" # Email address to receive alert notifications. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change!
@@ -0,0 +1,63 @@ | |||
# Live Migration Monitoring Dashboard - Terraform Module |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add this template to the list of examples here -
https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/v2/datastream-to-spanner/terraform/samples#list-of-examples
Looks like there are some formatting issues - Please run |
This Terraform module delivers the following features tailored for the Spanner Migration Tool used in sharded migrations: