docs: document reverse proxy config for job streams

camunda · Jul 4, 2024 · 51a2f7b
1 parent 3bd768e

Showing 4 changed files with 38 additions and 0 deletions.
diff --git a/docs/apis-tools/go-client/job-worker.md b/docs/apis-tools/go-client/job-worker.md
@@ -106,6 +106,12 @@ To avoid your workers being overloaded with too many jobs, e.g. running out of m
 
 **If streaming is enabled, back pressure applies to both pushing and polling**. You can then use `MaxJobsActive` and `Concurrency` as a way to soft-bound the memory usage of your worker. For example, given a maximum variable payload for a job of 1MB, `MaxJobsActive = 32`, and `Concurrency = 10`, then a single worker could use up to 42MB of memory. You can estimate a worst case scenario using the configured maximum message size, as no job payload will ever exceed this.
 
+#### Proxying
+
+If you're using a reverse proxy or a load balancer between your worker and your gateway, you may need to configure additional parameters to ensure the worker's job stream is not closed unexpectedly. If you observe regular 504 timeouts, see the [job streaming guide](../../../self-managed/zeebe-deployment/zeebe-gateway/job-streaming).
+
+Note that by default, the Go job workers have a stream timeout of 1 hour.
+
 ## Additional resources
 
 - [Job worker reference](/components/concepts/job-workers.md)
diff --git a/docs/apis-tools/java-client/job-worker.md b/docs/apis-tools/java-client/job-worker.md
@@ -185,6 +185,12 @@ To avoid your workers being overloaded with too many jobs, e.g. running out of m
 If the worker blocks longer than the job's deadline, the job will **not** be passed to the worker, but will be dropped. As it will time out on the broker side, it will be pushed again.
 :::
 
+#### Proxying
+
+If you're using a reverse proxy or a load balancer between your worker and your gateway, you may need to configure additional parameters to ensure the worker's job stream is not closed unexpectedly. If you observe regular 504 timeouts, see the [job streaming guide](../../../self-managed/zeebe-deployment/zeebe-gateway/job-streaming).
+
+Note that by default, the Java job workers have a stream timeout of 1 hour.
+
 ## Multi-tenancy
 
 You can configure a job worker to pick up jobs belonging to one or more tenants. When using the builder, you can configure

diff --git a/docs/components/concepts/job-workers.md b/docs/components/concepts/job-workers.md
@@ -169,6 +169,10 @@ If you're using Prometheus, you can use the following query to estimate the queu
 
 On the server side (e.g. if you're running a self-managed cluster), you can measure the rate of jobs which are not pushed due to clients which are not ready via the metric `zeebe_broker_jobs_push_fail_try_count_total{code="BLOCKED"}`. If the rate of this metric is high for a sustained amount of time, it may be a good indicator that you need to scale your workers. Unfortunately, on the server side we don't differentiate between clients, so this metric doesn't tell you which worker deployment needs to be scaled. We thus recommend using client metrics whenever possible.
 
+#### Proxying
+
+If you're using a reverse proxy or a load balancer between your worker and your gateway, you may need to configure additional parameters to ensure the worker's job stream is not closed unexpectedly. If you observe regular 504 timeouts, see the [job streaming guide](../../../self-managed/zeebe-deployment/zeebe-gateway/job-streaming).
+
 ### Troubleshooting
 
 Since this feature requires a good amount of coordination between various components over the network, we've built in some tools to help monitor the health of the job streams.

diff --git a/docs/self-managed/zeebe-deployment/zeebe-gateway/job-streaming.md b/docs/self-managed/zeebe-deployment/zeebe-gateway/job-streaming.md
@@ -0,0 +1,22 @@
+---
+id: job-streaming
+title: "Job Streaming"
+sidebar_label: "Job Streaming"
+---
+
+Streaming job workers are expected to be long-lived, which cuts down on the latency overhead of (re)creating a stream and propagating it throughout the cluster. This may require special configuration, especially if you're using a reverse proxy in front of your gateway. The typical symptom is an HTTP 504 (Gateway Timeout) returned to your job streaming worker at regular intervals.
+
+:::note
+This configuration is _only_ required for reverse proxies which do not support forwarding HTTP/2 keepalive (on either side). See, for example, [this nginx ticket](https://trac.nginx.org/nginx/ticket/1887).
+
+If your proxy supports it, then you don't need to do anything.
+:::
+
+The general recommendation is to apply the following configuration:
+
+- On your client, set an explicit stream timeout, e.g. 1h.
+- On your reverse proxy, ensure the read response timeout is set slightly higher than your client's stream timeout, e.g. 1h10m.
+
+## NGINX
+
+As mentioned above, nginx is a known proxy which does not support forwarding HTTP/2 pings from either side as a form of keepalive. Since the timeout that matters is the read response timeout, you should configure an appropriate `grpc_read_timeout` such that it is _higher_ than your job worker stream timeout configuration.
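
For illustration only (this is not part of the commit), the general recommendation above might translate to an NGINX server block along the following lines. This is a minimal sketch under assumptions: the host name, upstream address, and certificate paths are placeholders, and it assumes the "read response timeout" from the recommendation maps to NGINX's `grpc_read_timeout` directive, with the worker keeping its default 1 hour stream timeout.

```nginx
# Hypothetical reverse proxy in front of a Zeebe gateway.
# Host names, upstream address, and certificate paths are placeholders.
server {
    listen 443 ssl http2;
    server_name zeebe.example.com;

    ssl_certificate     /etc/nginx/tls/tls.crt;
    ssl_certificate_key /etc/nginx/tls/tls.key;

    location / {
        # Forward gRPC traffic to the gateway (26500 is the default gateway port).
        grpc_pass grpc://zeebe-gateway:26500;

        # Let an idle job stream stay open slightly longer than the worker's
        # stream timeout (assumed 1h here), so the client closes the stream
        # before the proxy does.
        grpc_read_timeout 70m;
    }
}
```

With this arrangement the worker should recreate its stream on its own schedule rather than being cut off by the proxy with a 504. If you change the stream timeout on the client, adjust the proxy value so it stays higher.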