forked from thanos-io/thanos
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Signed-off-by: Ian Billett <[email protected]>
Showing 10 changed files with 546 additions and 0 deletions.
Binary file added
BIN
+94.1 KB
tutorials/katacoda/thanos/4-receiver-advanced/assets/healthy-stores.png
Binary file added
BIN
+82.6 KB
tutorials/katacoda/thanos/4-receiver-advanced/assets/healthy-targets.png
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
#!/usr/bin/env bash | ||
|
||
docker pull quay.io/prometheus/prometheus:v2.27.0 | ||
docker pull quay.io/thanos/thanos:main-2021-06-11-7c6c5051 | ||
|
||
mkdir /root/editor |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
# Summary | ||
|
||
Congratulations! 🎉🎉🎉 | ||
You completed this Thanos Receive tutorial. Let's summarize what we learned: | ||
|
||
* Thanos Receive is a component that implements the `Prometheus Remote Write` protocol. | ||
* Prometheus can be configured to remote write its metric data in real-time to another server that implements the Remote Write protocol. | ||
|
||
See the next courses for tutorials on different deployment models and more advanced features of Thanos! | ||
|
||
## Further Reading | ||
|
||
To understand more about `Thanos Receive` - check out the following resources: | ||
* [Thanos Receive Documentation](https://thanos.io/tip/components/receive.md/) | ||
* [Thanos Receive Design Document](https://thanos.io/tip/proposals/201812_thanos-remote-receive.md/) | ||
* [Pros/Cons of allowing remote write in Prometheus](https://docs.google.com/document/d/1H47v7WfyKkSLMrR8_iku6u9VB73WrVzBHb2SB6dL9_g/edit#heading=h.2v27snv0lsur) | ||
|
||
### Feedback | ||
|
||
Did you spot a bug or typo in the tutorial, or do you have some feedback for us? | ||
|
||
Let us know on https://github.com/thanos-io/thanos or in the #thanos Slack channel linked on https://thanos.io |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,51 @@ | ||
{ | ||
"title": "Advanced: Scaling Data Ingest & High Availability in Thanos Receive", | ||
"description": "Learn how to architect Thanos Receive to achieve increasing data ingest volumes and high-availabilty", | ||
"difficulty": "Advanced", | ||
"details": { | ||
"steps": [ | ||
{ | ||
"title": "Problem Statement", | ||
"text": "step1.md" | ||
}, | ||
{ | ||
"title": "Scaling Thanos Receive", | ||
"text": "step2.md" | ||
}, | ||
{ | ||
"title": "Running Infrastructure", | ||
"text": "step3.md" | ||
} | ||
], | ||
"intro": { | ||
"text": "intro.md", | ||
"courseData": "courseBase.sh", | ||
"credits": "https://thanos.io" | ||
}, | ||
"finish": { | ||
"text": "finish.md", | ||
"credits": "test" | ||
} | ||
}, | ||
"files": [ | ||
"prometheus-batcave.yaml", | ||
"prometheus-batcomputer.yaml", | ||
"prometheus-wayne-enterprises.yaml", | ||
"hashring.json" | ||
], | ||
"environment": { | ||
"uilayout": "editor-terminal", | ||
"uisettings": "yaml", | ||
"uieditorpath": "/root/editor", | ||
"showdashboard": true, | ||
"dashboards": [ | ||
{"name": "Prometheus Batcave", "port": 39090}, | ||
{"name": "Prometheus Batcomputer", "port": 39091}, | ||
{"name": "Prometheus Wayne Enterprises", "port": 39092}, | ||
{"name": "Thanos Query", "port": 59090} | ||
] | ||
}, | ||
"backend": { | ||
"imageid": "docker-direct" | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
# Advanced: Scaling Data Ingest & High Availability in Thanos Receive | ||
|
||
This tutorial builds on what we learnt in tutorial #3 [Ingesting metrics data from unreachable sources with Thanos Receive](https://www.katacoda.com/thanos/courses/thanos/3-receive), and dives into more complex topics aimed at preparing your Thanos metrics infrastructure for running in production. | ||
|
||
In this tutorial, you will learn: | ||
|
||
* How to achieve high-availability in Thanos Receive by replicating data. | ||
* How to efficiently scale Thanos Receive using 'router' and 'ingester' modes. | ||
|
||
> NOTE: This course uses docker containers with pre-built Thanos, Prometheus, and Minio Docker images available publicly. | ||
### Prerequisites | ||
|
||
This tutorial directly follows tutorial #3 [Ingesting metrics data from unreachable sources with Thanos Receive](https://www.katacoda.com/thanos/courses/thanos/3-receive) - so please make sure you have completed that first 🤗 | ||
|
||
### Feedback | ||
|
||
Did you spot a bug or typo in the tutorial, or do you have some feedback for us? | ||
Let us know on https://github.com/thanos-io/thanos or in the #thanos Slack channel linked on https://thanos.io | ||
|
||
### Contributed by: | ||
|
||
* Ian Billett [@ianbillett](http://github.com/ianbillett) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
# Problem Statement | ||
|
||
This tutorial will extend the `Thanos Receive` setup we built in the previous tutorial, by imposing extra requirements that make this infrastructure more suitable for a production environment. | ||
|
||
<details> | ||
<summary>Click here for a brief recap of the previous tutorial</summary> | ||
<br> | ||
|
||
You are responsible for monitoring at `Wayne Enterprises`. You are required to monitor, in real time, two sites (`batcave` & `batcomputer`) that are sensitive and cannot receive external requests. | ||
<br> | ||
|
||
The solution that satisfied our requirements was to configure each of the Prometheus instances to `remote_write` their metrics data to an instance of `Thanos Receive` in our infrastructure. | ||
<br> | ||
|
||
</details> | ||
|
||
## Requirements | ||
|
||
`Wayne Enterprises` is becoming increasingly successful and is starting to ingest metrics data from an increasing number of sensitive sites. To ensure that we provide our customers with a good experience, we are seeking to satisfy the following requirements: | ||
|
||
* **Scalability**. As the number of sites we monitor increases, our infrastructure must be resilient to changes and operationally stable under increasingly large workloads. | ||
* **High-Availability**. The data we ingest from downstream sites must be replicated, and must not reside on one machine only. | ||
|
||
Before moving on, can you think of how you would meet these requirements with Thanos Receive? | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
# Increasing Data Ingest Volumes | ||
|
||
Helpfully, `Thanos Receive` includes features to scale data ingest volumes beyond a single instance. | ||
|
||
It does this by enabling users to configure `Thanos Receive` instances to participate in a `hashring`. | ||
|
||
<details> | ||
<summary> What is a hashring? </summary> | ||
A `hashring` is a way of allocating `N` things between `M` slots. | ||
|
||
By using [consistent hashing](https://en.wikipedia.org/wiki/Consistent_hashing), it ensures that when the number of slots `M` changes, as few of the `N` things as possible are re-allocated between slots. | ||
|
||
Crucially, this avoids the situation where _everything_ is re-allocated when the number of underlying workers changes. | ||
|
||
This is a common technique in distributed data storage and load-balancing, and is an interesting topic in Computer Science. There are lots of good resources out there you can read up on. | ||
</details> | ||
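To make the hashring idea concrete, here is a minimal Python sketch of consistent hashing. It is an illustration only and is not how `Thanos Receive` implements its hashring: the slot names (`receive-0` and so on), the series names, the `HashRing` class, and the choice of 64 virtual nodes per slot are all assumptions made for the example. It compares how many items change owner when a fourth slot is added, first with the ring and then with a naive `hash % number_of_slots` scheme.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Map a string to a deterministic 64-bit integer."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Toy consistent-hash ring (hypothetical, for illustration only)."""

    def __init__(self, slots, vnodes=64):
        # Each slot is placed on the ring at several pseudo-random points.
        points = sorted(
            (stable_hash(f"{slot}#{i}"), slot)
            for slot in slots
            for i in range(vnodes)
        )
        self._keys = [point for point, _ in points]
        self._slots = [slot for _, slot in points]

    def lookup(self, item: str) -> str:
        """Return the slot owning the first ring point at or after the item's hash."""
        idx = bisect.bisect_left(self._keys, stable_hash(item)) % len(self._keys)
        return self._slots[idx]


def modulo_lookup(item: str, slots) -> str:
    """Naive allocation: hash the item and take it modulo the slot count."""
    return slots[stable_hash(item) % len(slots)]


# Hypothetical workload: 10,000 series spread over three, then four, slots.
series = [f"series-{i}" for i in range(10_000)]
three = ["receive-0", "receive-1", "receive-2"]
four = three + ["receive-3"]

ring_before, ring_after = HashRing(three), HashRing(four)
ring_moved = sum(
    1 for s in series if ring_before.lookup(s) != ring_after.lookup(s)
) / len(series)
naive_moved = sum(
    1 for s in series if modulo_lookup(s, three) != modulo_lookup(s, four)
) / len(series)

print(f"consistent hashing moved {ring_moved:.0%} of series")
print(f"modulo hashing moved     {naive_moved:.0%} of series")
```

With the ring, the only items that change owner are the ones taken over by the new slot, whereas the modulo scheme reshuffles items between the existing slots as well.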
|
||
|
||
The following two functions must be performed by participants to form a `hashring`: | ||
* `routing` - Decide which member(s) of the `hashring` should process the request and forward it to them. | ||
* `ingesting` - Receive a request containing metrics data, store it in our local TSDB instance, and provide a Store API for querying data. | ||
|
||
A single `Thanos Receive` instance can perform **one or both** of the above functions. | ||
|
||
Before diving into running these instances, let's think about the implications of running in these different modes. | ||
|
||
## Architecture | ||
|
||
How should we architect our `hashring` to best satisfy our scalability requirement? | ||
|
||
Broadly, there are two ways of approaching this decision: | ||
|
||
1. Combined - participants perform **both** `routing` and `ingesting`. | ||
1. Separate - participants perform **either** `routing` or `ingesting`. | ||
|
||
Let's consider how each approach responds in the following two scenarios... | ||
|
||
### #1 Configuration Reloading | ||
|
||
`hashring` participants know about each other via a configuration file (we'll see this in the next page). | ||
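As a rough preview, the sketch below uses Python to render what such a configuration file might look like. It assumes the JSON layout described in the Thanos Receive documentation (a list of objects, each with a `hashring` name and a list of `endpoints`); the three addresses are placeholders invented for this example and are not the values used later in the tutorial.

```python
import json

# Assumed layout: a list of hashrings, each naming its member endpoints.
# The addresses below are hypothetical placeholders.
hashring_config = [
    {
        "hashring": "default",
        "endpoints": [
            "127.0.0.1:10907",
            "127.0.0.1:10908",
            "127.0.0.1:10909",
        ],
    }
]

# Render the structure as the JSON text that would be saved to hashring.json.
rendered = json.dumps(hashring_config, indent=2)
print(rendered)
```

Every participant that reads this file learns about the same set of members, which is what lets them agree on where a given request should be routed.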
|
||
When this file changes, `Thanos Receive` flushes its TSDB head block to disk. While the flush is in progress, the component refuses to process any requests. | ||
|
||
<details> | ||
<summary>What do you think happens under a combined and separate architecture?</summary> | ||
<br> | ||
|
||
With **combined** routing & ingesting, every participant has a local TSDB instance. When the head block (held in memory) contains a lot of data and the configuration file changes, the whole hashring can become unresponsive for a prolonged period while the TSDB is flushed. | ||
<br> | ||
<br> | ||
|
||
With **separate** routing & ingesting, only the 'routing' components are watching the configuration file for changes. Since these do not store data locally, there is no TSDB to flush. When a configuration change is made, the 'ingesters' are unaffected, and the 'routers' are only unavailable for a very short period. | ||
<br> | ||
<br> | ||
|
||
</details> | ||
|
||
### #2 Network Overhead | ||
|
||
`Routing` participants forward data to `ingesting` via `gRPC` network connections. | ||
|
||
<details> | ||
<summary>What happens when the number of participants gets large?</summary> | ||
<br> | ||
|
||
With **combined** routing & ingesting, every participant can route to every other participant. Therefore, if we have `n` participants in the `hashring` we will have `n²` open network connections, which can saturate networks. | ||
<br> | ||
|
||
With **separate** routing & ingesting, each `routing` component maintains a connection to each of the `ingesting` components. If we have `n` routers and `m` ingesters, we will have at most `nm` connections. Routing is generally a low-overhead activity, so `n` tends to be small compared to `m`. | ||
<br> | ||
|
||
</details> | ||
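The difference between the two connection counts is easy to see with a little arithmetic. The Python sketch below applies the `n²` and `nm` formulas from the text to a few cluster sizes; the choice of two routers and the cluster sizes themselves are assumptions made purely for illustration.

```python
def combined_connections(participants: int) -> int:
    """Combined mode: every participant can route to every participant (n squared)."""
    return participants * participants


def separate_connections(routers: int, ingesters: int) -> int:
    """Separate mode: each router connects to each ingester (n times m)."""
    return routers * ingesters


# Hypothetical cluster sizes, always with two dedicated routers in separate mode.
ROUTERS = 2
for total in (6, 12, 24):
    ingesters = total - ROUTERS
    print(
        f"{total:>2} instances: "
        f"combined={combined_connections(total):>3} "
        f"separate={separate_connections(ROUTERS, ingesters):>3}"
    )
```

Under these assumptions the combined count grows quadratically with the number of instances, while the separate count grows only linearly as ingesters are added.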
|
||
## Conclusion | ||
|
||
Start the next chapter and we will build out this infrastructure! |