Implement advanced receive tutorial
Signed-off-by: Ian Billett <[email protected]>
bill3tt committed Jun 17, 2021
1 parent f75a233 commit 9db1ee2
Showing 10 changed files with 546 additions and 0 deletions.
5 changes: 5 additions & 0 deletions tutorials/katacoda/thanos-pathway.json
@@ -18,6 +18,11 @@
"title": "Intermediate: Ingesting metrics data from unreachable sources with Thanos Receive",
"description": "Learn how to ingest and query metrics data from unreachable sources with Thanos Receive."
},
{
"course_id": "4-receiver-advanced",
"title": "Advanced: Scaling Data Ingest & High Availability in Thanos Receive",
"description": "Learn how to architect Thanos Receive to achieve increasing data ingest volumes and high-availabilty."
},
{
"course_id": "6-query-caching",
"title": "Advanced: Querying with low tail-latency and low cost - Query caching with Thanos",
6 changes: 6 additions & 0 deletions tutorials/katacoda/thanos/4-receiver-advanced/courseBase.sh
@@ -0,0 +1,6 @@
#!/usr/bin/env bash

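# Pre-pull the container images used throughout this course.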
docker pull quay.io/prometheus/prometheus:v2.27.0
docker pull quay.io/thanos/thanos:main-2021-06-11-7c6c5051

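# Create the directory backing the Katacoda in-browser editor (uieditorpath in index.json).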
mkdir /root/editor
22 changes: 22 additions & 0 deletions tutorials/katacoda/thanos/4-receiver-advanced/finish.md
@@ -0,0 +1,22 @@
# Summary

Congratulations! 🎉🎉🎉
You completed this Thanos Receive tutorial. Let's summarize what we learned:

* Thanos Receive is a component that implements the `Prometheus Remote Write` protocol.
* Prometheus can be configured to remote write its metric data in real-time to another server that implements the Remote Write protocol.

See the next courses for tutorials about other deployment models and more advanced features of Thanos!

## Further Reading

To learn more about `Thanos Receive`, check out the following resources:
* [Thanos Receive Documentation](https://thanos.io/tip/components/receive.md/)
* [Thanos Receive Design Document](https://thanos.io/tip/proposals/201812_thanos-remote-receive.md/)
* [Pros/Cons of allowing remote write in Prometheus](https://docs.google.com/document/d/1H47v7WfyKkSLMrR8_iku6u9VB73WrVzBHb2SB6dL9_g/edit#heading=h.2v27snv0lsur)

### Feedback

Did you spot a bug or typo in the tutorial, or do you have some feedback for us?

Let us know on https://github.com/thanos-io/thanos or the #thanos Slack channel linked on https://thanos.io
51 changes: 51 additions & 0 deletions tutorials/katacoda/thanos/4-receiver-advanced/index.json
@@ -0,0 +1,51 @@
{
"title": "Advanced: Scaling Data Ingest & High Availability in Thanos Receive",
"description": "Learn how to architect Thanos Receive to achieve increasing data ingest volumes and high-availabilty",
"difficulty": "Advanced",
"details": {
"steps": [
{
"title": "Problem Statement",
"text": "step1.md"
},
{
"title": "Scaling Thanos Receive",
"text": "step2.md"
},
{
"title": "Running Infrastructure",
"text": "step3.md"
}
],
"intro": {
"text": "intro.md",
"courseData": "courseBase.sh",
"credits": "https://thanos.io"
},
"finish": {
"text": "finish.md",
"credits": "test"
}
},
"files": [
"prometheus-batcave.yaml",
"prometheus-batcomputer.yaml",
"prometheus-wayne-enterprises.yaml",
"hashring.json"
],
"environment": {
"uilayout": "editor-terminal",
"uisettings": "yaml",
"uieditorpath": "/root/editor",
"showdashboard": true,
"dashboards": [
{"name": "Prometheus Batcave", "port": 39090},
{"name": "Prometheus Batcomputer", "port": 39091},
{"name": "Prometheus Wayne Enterprises", "port": 39092},
{"name": "Thanos Query", "port": 59090}
]
},
"backend": {
"imageid": "docker-direct"
}
}
23 changes: 23 additions & 0 deletions tutorials/katacoda/thanos/4-receiver-advanced/intro.md
@@ -0,0 +1,23 @@
# Advanced: Scaling Data Ingest & High Availability in Thanos Receive

This tutorial builds on what we learned in tutorial #3, [Ingesting metrics data from unreachable sources with Thanos Receive](https://www.katacoda.com/thanos/courses/thanos/3-receive), and dives into more complex topics aimed at preparing your Thanos metrics infrastructure for production.

In this tutorial, you will learn:

* How to achieve high-availability in Thanos Receive by replicating data.
* How to efficiently scale Thanos Receive using 'router' and 'ingester' modes.

> NOTE: This course uses Docker containers with publicly available, pre-built Thanos, Prometheus, and Minio images.

### Prerequisites

This tutorial directly follows tutorial #3 [Ingesting metrics data from unreachable sources with Thanos Receive](https://www.katacoda.com/thanos/courses/thanos/3-receive) - so please make sure you have completed that first 🤗

### Feedback

Did you spot a bug or typo in the tutorial, or do you have some feedback for us?
Let us know on https://github.com/thanos-io/thanos or the #thanos Slack channel linked on https://thanos.io

### Contributed by:

* Ian Billett [@ianbillett](http://github.com/ianbillett)
25 changes: 25 additions & 0 deletions tutorials/katacoda/thanos/4-receiver-advanced/step1.md
@@ -0,0 +1,25 @@
# Problem Statement

This tutorial extends the `Thanos Receive` setup we built in the previous tutorial by imposing extra requirements that make the infrastructure more suitable for a production environment.

<details>
<summary>Click here for a brief recap of the previous tutorial</summary>
<br>

You are responsible for monitoring at `Wayne Enterprises`. You are required to monitor, in real time, two sites (`batcave` & `batcomputer`) that are sensitive and cannot receive external requests.
<br>

The solution that satisfied our requirements was to configure each of the Prometheus instances to `remote_write` their metrics data to an instance of `Thanos Receive` in our infrastructure (a configuration sketch follows this recap).
<br>

</details>
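
For reference, the heart of that solution is a Prometheus `remote_write` block. The sketch below is illustrative only - the address, port, and labels are assumptions, and the actual files used in this course are `prometheus-batcave.yaml` and friends:

```
# Illustrative sketch, not the course's actual configuration:
# point a Prometheus instance at a Thanos Receive remote-write endpoint.
cat > /root/editor/prometheus-batcave.yaml <<EOF
global:
  scrape_interval: 15s
  external_labels:
    cluster: batcave
remote_write:
  - url: http://127.0.0.1:10908/api/v1/receive
EOF
```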

## Requirements

`Wayne Enterprises` is becoming increasingly successful and is starting to ingest metrics data from an increasing number of sensitive sites. To ensure that we provide our customers with a good experience, we are seeking to satisfy the following requirements:

* **Scalability**. As the number of sites we monitor increases, our infrastructure must be resilient to changes and operationally stable under increasingly large workloads.
* **High-Availability**. The data we ingest from downstream sites must be replicated, and must not reside on one machine only.

Before moving on, can you think of how you would achieve these requirements with Thanos Receive?

76 changes: 76 additions & 0 deletions tutorials/katacoda/thanos/4-receiver-advanced/step2.md
@@ -0,0 +1,76 @@
# Increasing Data Ingest Volumes

Helpfully, `Thanos Receive` includes features to scale data ingest volumes beyond a single instance.

It does this by enabling users to configure `Thanos Receive` instances to participate in a `hashring`.

<details>
<summary> What is a hashring? </summary>
A `hashring` is a way of allocating `N` items between `M` slots.

By using [consistent hashing](https://en.wikipedia.org/wiki/Consistent_hashing), it ensures that when the number of slots `M` changes, the minimum possible number of items is re-allocated between slots.

Crucially, this avoids the situation where _everything_ is re-allocated when the number of underlying workers changes.

This is a common technique in distributed data storage and load balancing, and an interesting topic in computer science; there are lots of good resources out there if you want to dig deeper (a quick shell experiment follows this aside).
</details>
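
To see why naive modulo placement (`hash % M`) would be painful, here is a quick shell experiment - not part of the course materials - that counts how many of 1000 synthetic series names change slots when the slot count grows from 3 to 4. Consistent hashing exists precisely to avoid this wholesale re-allocation:

```
# Quick experiment (illustrative): how many series change slots when
# naive modulo hashing goes from 3 slots to 4?
moved=0
for i in $(seq 1 1000); do
  hash=$(printf 'series-%d' "$i" | cksum | cut -d' ' -f1)
  if [ $((hash % 3)) -ne $((hash % 4)) ]; then
    moved=$((moved + 1))
  fi
done
echo "$moved of 1000 series re-allocated" # typically ~75%
```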


Participants in a `hashring` must perform the following two functions:
* `routing` - Decide which member(s) of the `hashring` should process a request and forward it to them.
* `ingesting` - Receive a request containing metrics data, store it in a local TSDB instance, and provide a Store API for querying the data.

A single `Thanos Receive` instance can perform **one or both** of the above functions.
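
To make the two roles concrete, here is a rough sketch of how a dedicated router and a dedicated ingester might be launched. The flag names come from the `Thanos Receive` documentation, but every address, path, and label value below is an illustrative assumption, not this course's actual setup:

```
# Illustrative sketch only; addresses, paths, and label values are assumptions.

# Ingester: owns a local TSDB and serves the Store API; no hashring file.
thanos receive \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902 \
  --remote-write.address=0.0.0.0:10903 \
  --tsdb.path=/tmp/receive-ingester \
  --label='receive_replica="0"' &

# Router: watches the hashring file and forwards remote-write requests
# to the ingesters over gRPC; it keeps no TSDB of its own.
thanos receive \
  --grpc-address=0.0.0.0:11901 \
  --http-address=0.0.0.0:11902 \
  --remote-write.address=0.0.0.0:11903 \
  --receive.hashrings-file=/root/editor/hashring.json \
  --receive.replication-factor=1 &
```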

Before diving into running these instances, let's think about the implications of running in these different modes.

## Architecture

How should we architect our `hashring` to best satisfy our scalability requirement?

Broadly, there are two ways of approaching this decision:

1. Combined - participants perform **both** `routing` and `ingesting`.
1. Separate - participants perform **either** `routing` or `ingesting`.

Let's consider how each approach responds in the following two scenarios...

### #1 Configuration Reloading

`hashring` participants know about each other via a configuration file (we'll see this on the next page).
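
As a preview - and purely as a sketch, since the course ships its own `hashring.json` - the file is a JSON list of hashrings, each naming the gRPC endpoints of its members. The addresses below are illustrative assumptions:

```
# Illustrative sketch of a hashring file; endpoint addresses are assumptions.
cat > /root/editor/hashring.json <<EOF
[
  {
    "hashring": "default",
    "endpoints": ["127.0.0.1:10901", "127.0.0.1:10911"]
  }
]
EOF
```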

When this file changes, `Thanos Receive` flushes its TSDB head block to disk, and while this happens the component refuses to process any requests.

<details>
<summary>What do you think happens under a combined versus a separate architecture?</summary>
<br>

With **combined** routing & ingesting, every participant has a local TSDB instance. When the head block (held in RAM) contains a lot of data and the configuration file changes, the whole hashring can become unresponsive for a prolonged period while the TSDBs are flushed.
<br>
<br>

With **separate** routing & ingesting, only the 'routing' components are watching the configuration file for changes. Since these do not store data locally, there is no TSDB to flush. When a configuration change is made, the 'ingesters' are unaffected, and the 'routers' are only unavailable for a very short period.
<br>
<br>

</details>

### #2 Network Overhead

`Routing` participants forward data to `ingesting` participants via `gRPC` network connections.

<details>
<summary>What happens when the number of participants gets large?</summary>
<br>

With **combined** routing & ingesting, every participant can route to every other participant. Therefore, if we have `n` participants in the `hashring`, we will have on the order of `n^2` open network connections, which can saturate networks.
<br>

With **separate** routing & ingesting, each `routing` component maintains a connection to each of the `ingesting` components. If we have `n` routers and `m` ingesters, we will have at most `n * m` connections. Routing is generally a low-overhead activity, so `n` tends to be small compared to `m` (see the quick arithmetic after this quiz).
<br>

</details>
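
A quick back-of-the-envelope comparison makes the difference vivid (the participant counts are made up for illustration):

```
# Worked arithmetic with made-up participant counts.
n=50; echo "combined: $((n * (n - 1))) connections" # 2450
n=5; m=45; echo "separate: $((n * m)) connections"  # 225
```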

## Conclusion

Start the next chapter and we will build out this infrastructure!