Implement advanced receive tutorial
Signed-off-by: Ian Billett <[email protected]>
bill3tt committed Jun 17, 2021
1 parent f75a233 commit 9db1ee2
Showing 10 changed files with 546 additions and 0 deletions.
5 changes: 5 additions & 0 deletions tutorials/katacoda/thanos-pathway.json
@@ -18,6 +18,11 @@
"title": "Intermediate: Ingesting metrics data from unreachable sources with Thanos Receive",
"description": "Learn how to ingest and query metrics data from unreachable sources with Thanos Receive."
},
{
"course_id": "4-receiver-advanced",
"title": "Advanced: Scaling Data Ingest & High Availability in Thanos Receive",
"description": "Learn how to architect Thanos Receive to achieve increasing data ingest volumes and high-availabilty."
},
{
"course_id": "6-query-caching",
"title": "Advanced: Querying with low tail-latency and low cost - Query caching with Thanos",
6 changes: 6 additions & 0 deletions tutorials/katacoda/thanos/4-receiver-advanced/courseBase.sh
@@ -0,0 +1,6 @@
#!/usr/bin/env bash

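# Pre-pull the container images used throughout this course.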
docker pull quay.io/prometheus/prometheus:v2.27.0
docker pull quay.io/thanos/thanos:main-2021-06-11-7c6c5051

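# Create the directory backing the Katacoda in-browser editor (uieditorpath in index.json).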
mkdir /root/editor
22 changes: 22 additions & 0 deletions tutorials/katacoda/thanos/4-receiver-advanced/finish.md
@@ -0,0 +1,22 @@
# Summary

Congratulations! 🎉🎉🎉
You completed this Thanos Receive tutorial. Let's summarize what we learned:

* Thanos Receive is a component that implements the `Prometheus Remote Write` protocol.
* Prometheus can be configured to remote write its metric data in real-time to another server that implements the Remote Write protocol.

See the next courses for tutorials about other deployment models and more advanced features of Thanos!

## Further Reading

To learn more about `Thanos Receive`, check out the following resources:
* [Thanos Receive Documentation](https://thanos.io/tip/components/receive.md/)
* [Thanos Receive Design Document](https://thanos.io/tip/proposals/201812_thanos-remote-receive.md/)
* [Pros/Cons of allowing remote write in Prometheus](https://docs.google.com/document/d/1H47v7WfyKkSLMrR8_iku6u9VB73WrVzBHb2SB6dL9_g/edit#heading=h.2v27snv0lsur)

### Feedback

Did you spot a bug or typo in the tutorial, or do you have some feedback for us?

Let us know on https://github.com/thanos-io/thanos or the #thanos Slack channel linked on https://thanos.io
51 changes: 51 additions & 0 deletions tutorials/katacoda/thanos/4-receiver-advanced/index.json
@@ -0,0 +1,51 @@
{
"title": "Advanced: Scaling Data Ingest & High Availability in Thanos Receive",
"description": "Learn how to architect Thanos Receive to achieve increasing data ingest volumes and high-availabilty",
"difficulty": "Advanced",
"details": {
"steps": [
{
"title": "Problem Statement",
"text": "step1.md"
},
{
"title": "Scaling Thanos Receive",
"text": "step2.md"
},
{
"title": "Running Infrastructure",
"text": "step3.md"
}
],
"intro": {
"text": "intro.md",
"courseData": "courseBase.sh",
"credits": "https://thanos.io"
},
"finish": {
"text": "finish.md",
"credits": "test"
}
},
"files": [
"prometheus-batcave.yaml",
"prometheus-batcomputer.yaml",
"prometheus-wayne-enterprises.yaml",
"hashring.json"
],
"environment": {
"uilayout": "editor-terminal",
"uisettings": "yaml",
"uieditorpath": "/root/editor",
"showdashboard": true,
"dashboards": [
{"name": "Prometheus Batcave", "port": 39090},
{"name": "Prometheus Batcomputer", "port": 39091},
{"name": "Prometheus Wayne Enterprises", "port": 39092},
{"name": "Thanos Query", "port": 59090}
]
},
"backend": {
"imageid": "docker-direct"
}
}
23 changes: 23 additions & 0 deletions tutorials/katacoda/thanos/4-receiver-advanced/intro.md
@@ -0,0 +1,23 @@
# Advanced: Scaling Data Ingest & High Availability in Thanos Receive

This tutorial builds on what we learned in tutorial #3, [Ingesting metrics data from unreachable sources with Thanos Receive](https://www.katacoda.com/thanos/courses/thanos/3-receive), and dives into more complex topics aimed at preparing your Thanos metrics infrastructure for production.

In this tutorial, you will learn:

* How to achieve high-availability in Thanos Receive by replicating data.
* How to efficiently scale Thanos Receive using 'router' and 'ingester' modes.

> NOTE: This course uses Docker containers with publicly available, pre-built Thanos, Prometheus, and Minio images.

### Prerequisites

This tutorial directly follows tutorial #3 [Ingesting metrics data from unreachable sources with Thanos Receive](https://www.katacoda.com/thanos/courses/thanos/3-receive) - so please make sure you have completed that first 🤗

### Feedback

Did you spot a bug or typo in the tutorial, or do you have some feedback for us?
Let us know on https://github.com/thanos-io/thanos or the #thanos Slack channel linked on https://thanos.io

### Contributed by:

* Ian Billett [@ianbillett](http://github.com/ianbillett)
25 changes: 25 additions & 0 deletions tutorials/katacoda/thanos/4-receiver-advanced/step1.md
@@ -0,0 +1,25 @@
# Problem Statement

This tutorial extends the `Thanos Receive` setup we built in the previous tutorial by imposing extra requirements that make the infrastructure more suitable for a production environment.

<details>
<summary>Click here for a brief recap of the previous tutorial</summary>
<br>

You are responsible for monitoring at `Wayne Enterprises`. You are required to monitor, in real time, two sites (`batcave` & `batcomputer`) that are sensitive and cannot receive external requests.
<br>

The solution that satisfied our requirements was to configure each of the Prometheus instances to `remote_write` their metrics data to an instance of `Thanos Receive` in our infrastructure (a configuration sketch follows this recap).
<br>

</details>
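
For reference, the heart of that solution is a Prometheus `remote_write` block. The sketch below is illustrative only - the address, port, and labels are assumptions, and the actual files used in this course are `prometheus-batcave.yaml` and friends:

```
# Illustrative sketch, not the course's actual configuration:
# point a Prometheus instance at a Thanos Receive remote-write endpoint.
cat > /root/editor/prometheus-batcave.yaml <<EOF
global:
  scrape_interval: 15s
  external_labels:
    cluster: batcave
remote_write:
  - url: http://127.0.0.1:10908/api/v1/receive
EOF
```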

## Requirements

`Wayne Enterprises` is becoming increasingly successful and is starting to ingest metrics data from an increasing number of sensitive sites. To ensure that we provide our customers with a good experience, we are seeking to satisfy the following requirements:

* **Scalability**. As the number of sites we monitor increases, our infrastructure must be resilient to changes and operationally stable under increasingly large workloads.
* **High-Availability**. The data we ingest from downstream sites must be replicated, and must not reside on one machine only.

Before moving on, can you think of how you would achieve these requirements with Thanos Receive?

76 changes: 76 additions & 0 deletions tutorials/katacoda/thanos/4-receiver-advanced/step2.md
@@ -0,0 +1,76 @@
# Increasing Data Ingest Volumes

Helpfully, `Thanos Receive` includes features to scale data ingest volumes beyond a single instance.

It does this by enabling users to configure `Thanos Receive` instances to participate in a `hashring`.

<details>
<summary> What is a hashring? </summary>
A `hashring` is a way of allocating `N` items between `M` slots.

By using [consistent hashing](https://en.wikipedia.org/wiki/Consistent_hashing), it ensures that when the number of slots `M` changes, the minimum possible number of items is re-allocated between slots.

Crucially, this avoids the situation where _everything_ is re-allocated when the number of underlying workers changes.

This is a common technique in distributed data storage and load balancing, and an interesting topic in computer science; there are lots of good resources out there if you want to dig deeper (a quick shell experiment follows this aside).
</details>
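
To see why naive modulo placement (`hash % M`) would be painful, here is a quick shell experiment - not part of the course materials - that counts how many of 1000 synthetic series names change slots when the slot count grows from 3 to 4. Consistent hashing exists precisely to avoid this wholesale re-allocation:

```
# Quick experiment (illustrative): how many series change slots when
# naive modulo hashing goes from 3 slots to 4?
moved=0
for i in $(seq 1 1000); do
  hash=$(printf 'series-%d' "$i" | cksum | cut -d' ' -f1)
  if [ $((hash % 3)) -ne $((hash % 4)) ]; then
    moved=$((moved + 1))
  fi
done
echo "$moved of 1000 series re-allocated" # typically ~75%
```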


Participants in a `hashring` must perform the following two functions:
* `routing` - Decide which member(s) of the `hashring` should process a request and forward it to them.
* `ingesting` - Receive a request containing metrics data, store it in a local TSDB instance, and provide a Store API for querying the data.

A single `Thanos Receive` instance can perform **one or both** of the above functions.
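
To make the two roles concrete, here is a rough sketch of how a dedicated router and a dedicated ingester might be launched. The flag names come from the `Thanos Receive` documentation, but every address, path, and label value below is an illustrative assumption, not this course's actual setup:

```
# Illustrative sketch only; addresses, paths, and label values are assumptions.

# Ingester: owns a local TSDB and serves the Store API; no hashring file.
thanos receive \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902 \
  --remote-write.address=0.0.0.0:10903 \
  --tsdb.path=/tmp/receive-ingester \
  --label='receive_replica="0"' &

# Router: watches the hashring file and forwards remote-write requests
# to the ingesters over gRPC; it keeps no TSDB of its own.
thanos receive \
  --grpc-address=0.0.0.0:11901 \
  --http-address=0.0.0.0:11902 \
  --remote-write.address=0.0.0.0:11903 \
  --receive.hashrings-file=/root/editor/hashring.json \
  --receive.replication-factor=1 &
```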

Before diving into running these instances, let's think about the implications of running in these different modes.

## Architecture

How should we architect our `hashring` to best satisfy our scalability requirement?

Broadly, there are two ways of approaching this decision:

1. Combined - participants perform **both** `routing` and `ingesting`.
1. Separate - participants perform **either** `routing` or `ingesting`.

Let's consider how each approach responds in the following two scenarios...

### #1 Configuration Reloading

`hashring` participants know about each other via a configuration file (we'll see this on the next page).
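
As a preview - and purely as a sketch, since the course ships its own `hashring.json` - the file is a JSON list of hashrings, each naming the gRPC endpoints of its members. The addresses below are illustrative assumptions:

```
# Illustrative sketch of a hashring file; endpoint addresses are assumptions.
cat > /root/editor/hashring.json <<EOF
[
  {
    "hashring": "default",
    "endpoints": ["127.0.0.1:10901", "127.0.0.1:10911"]
  }
]
EOF
```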

When this file changes, `Thanos Receive` flushes its TSDB head block to disk, and while this happens the component refuses to process any requests.

<details>
<summary>What do you think happens under a combined versus a separate architecture?</summary>
<br>

With **combined** routing & ingesting, every participant has a local TSDB instance. When the head block (held in RAM) contains a lot of data and the configuration file changes, the whole hashring can become unresponsive for a prolonged period while the TSDBs are flushed.
<br>
<br>

With **separate** routing & ingesting, only the 'routing' components are watching the configuration file for changes. Since these do not store data locally, there is no TSDB to flush. When a configuration change is made, the 'ingesters' are unaffected, and the 'routers' are only unavailable for a very short period.
<br>
<br>

</details>

### #2 Network Overhead

`Routing` participants forward data to `ingesting` participants via `gRPC` network connections.

<details>
<summary>What happens when the number of participants gets large?</summary>
<br>

With **combined** routing & ingesting, every participant can route to every other participant. Therefore, if we have `n` participants in the `hashring`, we will have on the order of `n^2` open network connections, which can saturate networks.
<br>

With **separate** routing & ingesting, each `routing` component maintains a connection to each of the `ingesting` components. If we have `n` routers and `m` ingesters, we will have at most `n * m` connections. Routing is generally a low-overhead activity, so `n` tends to be small compared to `m` (see the quick arithmetic after this quiz).
<br>

</details>
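
A quick back-of-the-envelope comparison makes the difference vivid (the participant counts are made up for illustration):

```
# Worked arithmetic with made-up participant counts.
n=50; echo "combined: $((n * (n - 1))) connections" # 2450
n=5; m=45; echo "separate: $((n * m)) connections"  # 225
```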

## Conclusion

Start the next chapter and we will build out this infrastructure!