Third Party Monitoring Evaluation
This document has one goal:
- To define the desired state of the Forest infrastructure monitoring set-up, which combines a managed monitoring platform with additional services where needed.
Managed Monitoring Services considered:
- New Relic
- Datadog
- Grafana Cloud
The accepted criteria for the Snapshot Service are as follows:
- External monitoring of the snapshot service, independent of the service's self-monitoring.
- Checking whether the last snapshot is older than 30 hours and sending an alert via Slack.
- Checking if any snapshot is smaller than a given size (GB) and sending an alert via Slack.
- Checking if full snapshots don't have a corresponding SHA sum file, or if there is a stray SHA sum file, and sending an alert via Slack.
- Checking if any snapshot doesn't match a name pattern and sending an alert via Slack.
Furthermore, the proposed solution is as follows:
- Managed Monitoring Service (Initial Consideration): Ideally, a managed monitoring platform would provide comprehensive oversight of our Snapshot Service. New Relic, Datadog, and Grafana Cloud are under consideration. These platforms generally lack native support for DigitalOcean Spaces and are typically designed around AWS S3; however, DigitalOcean Spaces is S3-compatible, which suggests they may still work. We therefore aim to explore workarounds, integrations, or compatibility options that enable these platforms to monitor our DigitalOcean Space effectively. The feasibility and effectiveness of this approach can only be fully assessed during implementation, so our monitoring strategy needs flexibility and a contingency plan.
- DigitalOcean (DO) Function: If our chosen managed service cannot meet all requirements or is not cost-effective, we will consider developing a custom function in DO to handle these checks.
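As a rough illustration of what such a custom function might check, the sketch below applies the four snapshot criteria to a bucket listing that has already been fetched (for example with any S3-compatible client, which is not shown). The size threshold, name pattern, `.sha256sum` suffix, and all function and class names are assumptions made for this example; only the 30-hour limit comes from the criteria above.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Only the 30-hour limit comes from the criteria; the rest is assumed.
MAX_AGE = timedelta(hours=30)
MIN_SIZE_BYTES = 1 * 1024**3                              # assumed threshold: 1 GB
NAME_PATTERN = re.compile(r"^[a-z0-9_]+\.car(\.zst)?$")   # assumed name pattern
SHA_SUFFIX = ".sha256sum"                                 # assumed checksum suffix


@dataclass
class BucketObject:
    """One entry of a bucket listing (hypothetical shape)."""
    key: str
    size: int
    last_modified: datetime


def check_snapshots(objects, now):
    """Return a list of human-readable problems found in the listing."""
    problems = []
    keys = {o.key for o in objects}
    snapshots = [o for o in objects if not o.key.endswith(SHA_SUFFIX)]
    checksums = [o for o in objects if o.key.endswith(SHA_SUFFIX)]

    # Criterion 1: the newest snapshot must not be older than 30 hours.
    if not snapshots or now - max(s.last_modified for s in snapshots) > MAX_AGE:
        problems.append("latest snapshot is older than 30 hours")

    for s in snapshots:
        # Criterion 2: no snapshot below the minimum size.
        if s.size < MIN_SIZE_BYTES:
            problems.append(f"{s.key}: smaller than expected")
        # Criterion 3a: every snapshot needs a checksum file.
        if s.key + SHA_SUFFIX not in keys:
            problems.append(f"{s.key}: missing checksum file")
        # Criterion 4: names must match the expected pattern.
        if not NAME_PATTERN.match(s.key):
            problems.append(f"{s.key}: unexpected name")

    # Criterion 3b: no checksum file without a snapshot.
    for c in checksums:
        if c.key[: -len(SHA_SUFFIX)] not in keys:
            problems.append(f"{c.key}: stray checksum file")
    return problems
```

Each returned string could then be forwarded to Slack; an empty list means all checks passed.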
The accepted criteria for the Sync-Check Service are as follows:
- The service should always be running.
- Send an alert via Slack when the service is down.
Furthermore, the proposed solution is as follows:
- Managed Monitoring Service: Most managed services can monitor whether a service is running and send alerts based on those metrics. Depending on the specifics of the Sync-Check Service, solutions like New Relic, Datadog, or Grafana Cloud are suitable.
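The managed platforms ship their own Slack integrations, so the following is only a minimal sketch of what a hand-rolled alert could look like if one were needed as a fallback. It assumes a Slack incoming webhook, which accepts a JSON body with a `text` field; the function names, message wording, and webhook URL are hypothetical.

```python
import json
import urllib.request


def build_slack_payload(service, detail):
    """Build the JSON body for a Slack incoming-webhook message."""
    message = {"text": f":rotating_light: {service} is down: {detail}"}
    return json.dumps(message).encode("utf-8")


def send_slack_alert(webhook_url, service, detail, timeout=10):
    """POST the alert to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=build_slack_payload(service, detail),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping the payload builder separate from the network call makes the message format easy to test without contacting Slack.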
The accepted criteria for the Forest and Lotus nodes are as follows:
- The nodes should always be running, and their epochs should be consistently advancing. If these conditions aren't met, an alert should be sent via Slack.
- The storage volume of the nodes should be manageable. If it is nearing capacity, an alert should be triggered via Slack.
- Log aggregation should be set up to analyze, query, and store logs for long-term retention.
Furthermore, the proposed solution is as follows:
- Managed Monitoring Service: Managed services such as New Relic, Datadog, or Grafana Cloud can often monitor server nodes, including checking if the nodes are running, tracking specific metrics (like epoch advancement), and monitoring storage usage.
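To make the two node conditions concrete, here is a small sketch of the logic an alert rule would encode: the epoch must advance between readings, and disk usage must stay below a threshold. The 85% threshold, the sampling scheme, and the function names are assumptions for illustration; in practice these conditions would more likely be expressed as alert rules in the chosen platform.

```python
import shutil


def epoch_is_advancing(samples, min_advance=1):
    """samples: chronologically ordered epoch readings from one node."""
    if len(samples) < 2:
        return False  # not enough data to claim progress
    return samples[-1] - samples[0] >= min_advance


def disk_usage_ratio(path="/"):
    """Fraction of the volume holding `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def node_alerts(samples, usage_ratio, usage_threshold=0.85):
    """Return alert messages for one node; an empty list means healthy."""
    alerts = []
    if not epoch_is_advancing(samples):
        alerts.append("epoch is not advancing")
    if usage_ratio >= usage_threshold:
        alerts.append(f"disk usage at {usage_ratio:.0%}")
    return alerts
```

For example, `node_alerts([100, 100, 100], disk_usage_ratio("/"))` would report a stalled epoch, plus a storage alert if the root volume is at or above the assumed threshold.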
In this section, we will calculate the cost estimations for various services under consideration, based on the projected usage across all nodes.
- Forest and Lotus nodes (4 total)
- Forest snapshot node
- Forest sync-check node
- Snapshot Spaces bucket
In total, 6 nodes need to be monitored across all services. We anticipate that the cumulative log output of all nodes will be about 100GB/month, and that the data from monitoring the infrastructure of all nodes will be about 20GB/month.
New Relic Standard Plan (On demand)
- Estimated Cost: $56.1/month
The components include:
- 1 free full platform user ($99/month/user for up to 5 additional users if needed).
- 120GB of estimated data ingestion/month. 100GB free and then $0.30 per GB after that.
- Basic users: unlimited
- 1 Core user: $49/month per user
- Data Retention: 8 days by default for free with an additional cost of $0.05/month for extended retention.
Integrations and Features:
- The New Relic Prometheus OpenMetrics integration for Docker allows us to scrape Prometheus endpoints and send the data to New Relic, so we can store and visualize crucial metrics.
- Custom instrumentation and telemetry: this could be used to monitor how fast an operation such as tipset validation executes, with historical data. This would let us spot performance regressions easily.
- The New Relic agent is easy to set up on Droplets to monitor disk space, CPU, memory, running status, and logs.
- Custom alerts setup: New Relic allows the creation of custom alerts using its Alerts UI or through the REST API. (Datadog likewise allows the creation of custom alerts based on metrics, traces, and logs.)
- New Relic offers user-friendly, interactive dashboards that can be tailored to display a wide range of metrics. These dashboards support real-time monitoring, giving near-instant visibility into critical parameters such as CPU usage, storage levels, and memory consumption.
- No native integration for Grafana Loki.
More details are available on the New Relic pricing page.
Datadog Pro Plan Pricing (On demand):
- Estimated Cost: $118/month
The components include:
- Infrastructure Monitoring: $18 per host/month, for 6 nodes
- Log Management: 120GB of estimated data ingestion/month. $0.10 per ingested GB/month.
- Log Retention: $2.50 per million log events/month, with 30 days of retention.
- Log Forwarding to Custom Destinations: $0.25 per GB outbound/destination/month. (If needed)
- User Pricing: Not explicitly mentioned on their pricing page, and from further research, it appears they do not have per-user pricing.
Integration and Features:
- Datadog Prometheus/OpenMetrics integration: the Datadog Agent can collect the Prometheus and OpenMetrics metrics exposed by our Docker containers and send them to Datadog for visualization and alerting. It is designed to be simple to set up.
- The Datadog Agent is easy to set up on Droplets to monitor disk space, CPU, memory, running status, and logs, with dashboards to visualize them.
- Datadog provides a selection of pre-built dashboards that readily display critical parameters like CPU usage, storage levels, and memory consumption. However, the number of these pre-built dashboards is limited. To get the most out of this solution, we would need to invest time in understanding the metrics and building custom dashboards tailored to our needs.
- No native support for Grafana Loki
More information is available on the Datadog pricing page.
Grafana Cloud Pro Plan:
- Estimated Cost: $39/month
The components include:
- Subscription Fee: $29/month.
- Log Management: 100GB included with the subscription ($0.50/GB/month beyond that).
- Retention: 30 days included with the subscription.
- Users: 5 active users are included with the subscription ($8 per user/month for additional users if needed).
- Metrics: Includes 15k metrics with 13 months of retention.
Integration and Features:
- Native integration with Grafana, Prometheus, and Grafana Loki
- Easy installation of node exporter to monitor disk space, CPU, memory, running status, and logs, with visualization and alerting
- Grafana Cloud offers a comprehensive, customizable dashboard for data tracking and visualization (we already have a Forest Dashboard)
- Easy to set up and configure
- Large community support
Further details can be found on the Grafana Cloud pricing page.
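The estimates above can be roughly reproduced from the listed components. The sketch below is only a back-of-the-envelope check under stated assumptions: it works in cents to avoid floating-point rounding, assumes one New Relic core user, assumes Datadog bills log ingestion on the 100GB of logs only (which matches the $118 figure), and ignores retention and per-event charges. Under these assumptions New Relic comes to $55.00 rather than the $56.1 quoted; the source does not itemize the difference.

```python
LOG_GB = 100      # projected log output per month
METRICS_GB = 20   # projected infrastructure monitoring data per month
HOSTS = 6         # nodes to be monitored


def overage_cents(used_gb, included_gb, cents_per_gb):
    """Cost in cents of the data that exceeds the included allowance."""
    return max(0, used_gb - included_gb) * cents_per_gb


def estimate_cents():
    """Monthly estimate per vendor, in cents, from the listed components."""
    total_gb = LOG_GB + METRICS_GB
    return {
        # 1 core user at $49 + $0.30/GB beyond the 100GB free tier
        "New Relic": 4900 + overage_cents(total_gb, 100, 30),
        # $18 per host + $0.10 per ingested GB of logs (assumption: logs only)
        "Datadog": HOSTS * 1800 + LOG_GB * 10,
        # $29 subscription + $0.50/GB beyond the 100GB included
        "Grafana Cloud": 2900 + overage_cents(total_gb, 100, 50),
    }


for vendor, cents in estimate_cents().items():
    print(f"{vendor}: ${cents / 100:.2f}/month")
```

Changing `LOG_GB`, `METRICS_GB`, or `HOSTS` shows how each pricing model responds to growth: Datadog scales mainly with host count, while New Relic and Grafana Cloud scale with ingested data beyond their allowances.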
Following a comprehensive analysis of the three managed monitoring services, we have decided to adopt New Relic for our Forest Infrastructure monitoring needs. The reasons behind our choice are multifold:
New Relic's capabilities are well suited to monitoring the various components of our infrastructure, including the Snapshot Service, the Sync-Check Service, and our Forest and Lotus nodes. This aligns with our need for a service that can provide comprehensive external monitoring.
Cost-wise, New Relic's pricing model is advantageous to us. Our estimated monitoring data and log output fit well within the limits of their standard plan. This pricing model also provides us the flexibility to scale our monitoring activities in the future without significantly impacting the cost.
The integration of New Relic with Prometheus OpenMetrics is a beneficial feature as it allows us to scrape Prometheus endpoints and visualize crucial metrics. New Relic's custom instrumentation and telemetry offer added benefits by enabling us to monitor specific operations such as tipset validation and alert us to performance regressions.
New Relic's user-friendly interface simplifies the process of setting up custom alerts and provides an interactive dashboard for real-time tracking of metrics.
In summary, considering our specific needs, New Relic seems to be the best fit: its features, cost-effective pricing model, and integration capabilities align with our use case.