Add example with tracing + opentelemetry exporting #61

Open · wants to merge 1 commit into master
5 changes: 5 additions & 0 deletions kubernetes-with-otelcol-tracing/Chart.yaml
@@ -0,0 +1,5 @@
apiVersion: v2
name: buildbarn
description: remote building tool
type: application
version: 0.1.0
27 changes: 27 additions & 0 deletions kubernetes-with-otelcol-tracing/README.md
@@ -0,0 +1,27 @@
# Buildbarn cache with traces and metrics collection

This deployment is Datadog-centric, but it ought to be fairly straightforward to swap out the Datadog-specific bits for any other upstream that supports OpenTelemetry.

## Deployment Attributes

1. Sharded local storage, with each pod having 250GB for the CAS, 250GB for the AC, and 5GB for persistent state.
2. Traces are sent to Datadog via the OpenTelemetry Collector running in daemon mode. The trace sampler is set to "always" because the preferred approach is to send all traces to Datadog and then sample down on the Datadog side according to your cost needs; analytics data comes from all ingested spans, not just the retained ones.
3. Prometheus metrics are scraped into Datadog under the metrics namespace "buildbarn". This includes the metrics for the OpenTelemetry Collector itself, so you have some visibility into ingestion rate, errors, etc. (a sketch of the scrape annotation is shown below).
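
As an illustration of item 3, the Datadog agent's autodiscovery annotations can be attached to each pod template to scrape its Prometheus endpoint. This is only a hedged sketch: the container name `storage` and the use of `%%port%%` are assumptions for illustration, not part of this change (the collector template in this diff shows the same pattern for its own `otelcol` container).

```yaml
# Sketch: Datadog OpenMetrics autodiscovery annotations on a buildbarn pod template.
# Container name "storage" and the port placeholder are illustrative assumptions.
ad.datadoghq.com/storage.check_names: |
  ["openmetrics"]
ad.datadoghq.com/storage.init_configs: |
  [{}]
ad.datadoghq.com/storage.instances: |
  [
    {
      "prometheus_url": "http://%%host%%:%%port%%/metrics",
      "namespace": "buildbarn",
      "metrics": ["*"]
    }
  ]
```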

## Further reading

- Datadog prometheus scraping configuration: https://docs.datadoghq.com/agent/kubernetes/prometheus/
- Datadog span retention: https://docs.datadoghq.com/tracing/trace_retention_and_ingestion

## Usage Notes
Review comment (Member):
Could you document real quick how to set this up? See, for instance, how the existing Kubernetes deployment is documented.


1. Make sure to generate hashInitialization values for common.yaml (one way to do this is noted in the values sketch below).
2. Change serviceDnsName to reflect the actual DNS names assigned to services in your deployment.
3. Set the number of storage replicas to suit your storage and throughput needs.
4. Configure the volumeClaimTemplates to suit your cluster's implementation. We're on AWS using basic gp2 EBS volumes and a large disk cache, with no I/O issues on the volumes so far.

Changing the number of replicas for the storage nodes will cause a redeployment of all frontend nodes, because the config will change. It will also rearrange the keyspace and cause the bulk of your cache to go cold.
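
To make the setup steps above concrete, here is a hedged sketch of the values this chart's templates appear to expect. The value names are taken from the templates in this change; the actual values.yaml is not shown in this diff, so every default below is an illustrative assumption.

```yaml
# Illustrative values only -- the real values.yaml is not part of this diff excerpt.
baseName: buildbarn
namespace: buildbarn
team: build-infra                 # referenced by the collector Deployment labels
serviceDnsName: svc.cluster.local
allInstanceNames: [""]            # instance name prefixes; assumption

# hashInitialization should be a random 64-bit integer, e.g. generated with:
#   python3 -c 'import random; print(random.getrandbits(64))'
# and substituted into common.yaml.

storage:
  replicas: 4
  port: 8981
  prometheusPort: 9980

frontend:
  port: 8980
  prometheusPort: 9980

otelcol:
  replicas: 1
  env: production
  port: 4317            # OTLP gRPC
  prometheusPort: 8888  # collector's own metrics endpoint
  image:
    name: otel/opentelemetry-collector-contrib
    tag: "0.40.0"       # illustrative tag
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
```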

## Requirements:

1. Datadog agent already deployed in the Kubernetes cluster: https://docs.datadoghq.com/agent/kubernetes/?tab=helm#installation

Review comment (Member):
Could this be set as a dependency of this chart?
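
If it were wired up that way, one possible sketch (not part of this change) would be a dependencies stanza in Chart.yaml pointing at the upstream Datadog chart; the version constraint and condition flag below are illustrative assumptions.

```yaml
# Hypothetical Chart.yaml dependencies stanza -- not part of this change.
dependencies:
  - name: datadog
    version: "^2.0.0"                      # illustrative version constraint
    repository: https://helm.datadoghq.com
    condition: datadog.enabled             # hypothetical toggle
```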

100 changes: 100 additions & 0 deletions kubernetes-with-otelcol-tracing/templates/collector.yaml
@@ -0,0 +1,100 @@
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app.kubernetes.io/name: {{ $.Values.baseName }}
app.kubernetes.io/component: opentelemetry-collector
service: buildbarn
team: {{ $.Values.team }}
Review comment (Member):
It appears this value doesn't exist:

Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: error validating "": error validating data: unknown object type "nil" in Deployment.metadata.labels.team

name: {{ $.Values.baseName }}-opentelemetry-collector
namespace: {{ $.Release.Namespace }}
spec:
replicas: {{ $.Values.otelcol.replicas }}
selector:
matchLabels:
app.kubernetes.io/name: {{ $.Values.baseName }}
app.kubernetes.io/component: opentelemetry-collector
template:
metadata:
annotations:
checksum/common.config: {{ include (print $.Template.BasePath "/config/collector.yaml.tpl") . | sha256sum }}
Review comment (Member), suggested change:
- checksum/common.config: {{ include (print $.Template.BasePath "/config/collector.yaml.tpl") . | sha256sum }}
+ checksum/common.config: {{ include (print $.Template.BasePath "/config/collector.yaml") . | sha256sum }}

ad.datadoghq.com/otelcol.check_names: |
["openmetrics"]
ad.datadoghq.com/otelcol.init_configs: |
[{}]
ad.datadoghq.com/otelcol.instances: |
[
{
"prometheus_url": "http://%%host%%:%%port_http%%/metrics",
"namespace": "buildbarn.otelcol",
"metrics": ["*"]
}
]
labels:
Review comment (Member), suggested change:
  labels:
+   app.kubernetes.io/name: {{ $.Values.baseName }}

This label is missing, while being present in the selector:

Error: INSTALLATION FAILED: Deployment.apps "buildbarn-opentelemetry-collector" is invalid: spec.template.metadata.labels: Invalid value: map[string]string{"app.kubernetes.io/component":"opentelemetry-collector"}: selector does not match template labels

app.kubernetes.io/component: opentelemetry-collector
spec:
serviceAccountName: {{ $.Values.baseName }}-opentelemetry-collector
containers:
- image: {{ $.Values.otelcol.image.name }}:{{ $.Values.otelcol.image.tag }}
ports:
- containerPort: {{ .Values.otelcol.port }}
protocol: TCP
- containerPort: {{ .Values.otelcol.prometheusPort }}
protocol: TCP
imagePullPolicy: IfNotPresent
name: otelcol
resources:
{{ $.Values.otelcol.resources | toYaml | nindent 12 }}
volumeMounts:
- name: config
mountPath: /etc/otel
volumes:
- name: config
configMap:
name: {{ $.Values.baseName }}-opentelemetry-collector
restartPolicy: Always
terminationGracePeriodSeconds: 30
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: {{ $.Values.baseName }}-opentelemetry-collector
namespace: {{ $.Release.Namespace }}
annotations:
helm.sh/resource-policy: keep
labels:
app: {{ $.Values.baseName }}-opentelemetry-collector
service: buildbarn
---
apiVersion: v1
kind: Service
metadata:
name: {{ $.Values.baseName }}-opentelemetry-collector
namespace: {{ $.Release.Namespace }}
Review comment (Member):
For consistency, shouldn't this be:
- namespace: {{ $.Release.Namespace }}
+ namespace: {{ $.Values.namespace }}

spec:
ports:
- port: {{ $.Values.otelcol.port }}
protocol: TCP
selector:
app.kubernetes.io/component: opentelemetry-collector
---
apiVersion: v1
kind: Service
metadata:
name: {{ $.Values.baseName }}-opentelemetry-collector-headless
namespace: {{ $.Values.namespace }}
labels:
app.kubernetes.io/component: opentelemetry-collector
spec:
type: ClusterIP
clusterIP: None
ports:
- name: grpc
port: {{ .Values.otelcol.port }}
protocol: TCP
- name: http
port: {{ .Values.otelcol.prometheusPort }}
protocol: TCP
selector:
app.kubernetes.io/name: {{ $.Values.baseName }}
app.kubernetes.io/component: opentelemetry-collector
39 changes: 39 additions & 0 deletions kubernetes-with-otelcol-tracing/templates/config/collector.yaml
@@ -0,0 +1,39 @@
---
apiVersion: v1
kind: ConfigMap
metadata:
name: {{ $.Values.baseName }}-opentelemetry-collector
namespace: {{ $.Values.namespace }}
labels:
app: {{ $.Release.Name }}-opentelemetry-collector
service: buildbarn
data:
config.yaml: |
receivers:
otlp:
protocols:
grpc:
processors:
batch:
timeout: 10s
k8sattributes:
passthrough: true
exporters:
datadog:
env: {{ $.Values.otelcol.env }}
service: {{ $.Values.namespace }}
tags:
- kube_namespace:{{ $.Values.namespace }}
- deployment:{{ $.Values.baseName }}
api:
key: example-key
site: datadoghq.com
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch, k8sattributes]
exporters: [datadog]
Review comment (Member):
This seems to be invalid?

Error: cannot load configuration: unknown exporters type "datadog" for datadog
2021/12/28 01:08:34 collector server run finished with error: cannot load configuration: unknown exporters type "datadog" for datadog

Review comment (Author):
Huh, interesting. I didn't even modify the collector config; I just copied and pasted it directly from my deployment, which is definitely working.
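
For reference, the datadog exporter ships with the opentelemetry-collector-contrib distribution rather than the core collector, so this error usually indicates the core otel/opentelemetry-collector image is in use. A hedged sketch of the corresponding values override (value names from the Deployment template above; the tag is illustrative):

```yaml
otelcol:
  image:
    name: otel/opentelemetry-collector-contrib  # contrib build includes the datadog exporter
    tag: "0.40.0"                               # illustrative tag
```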

telemetry:
logs:
level: "debug"
62 changes: 62 additions & 0 deletions kubernetes-with-otelcol-tracing/templates/config/common.yaml
@@ -0,0 +1,62 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: {{ $.Values.baseName }}-common
namespace: {{ $.Values.namespace }}
data:
common.libsonnet: |
{
blobstore: {
contentAddressableStorage: {
sharding: {
hashInitialization: 1,
shards: [
{{- $replicas := until (int .Values.storage.replicas) -}}
{{- $bbName := .Values.baseName -}}
{{- $port := .Values.storage.port -}}
{{- $ns := .Values.namespace -}}
{{- $dnsName := .Values.serviceDnsName -}}
{{- range $_, $replicaNumber := $replicas }}
{
backend: {
grpc: { address: '{{ $bbName }}-storage-{{ $replicaNumber }}.{{ $bbName }}-storage-headless.{{ $ns }}.{{ $dnsName }}:{{ $port }}' },
},
weight: 1,
},
{{- end}}
],
},
},
actionCache: {
completenessChecking: {
sharding: {
hashInitialization: 1,
shards: [
{{- $replicas := until (int .Values.storage.replicas) -}}
{{- $bbName := .Values.baseName -}}
{{- $port := .Values.storage.port -}}
{{- $ns := .Values.namespace -}}
{{- range $_, $replicaNumber := $replicas }}
{
backend: {
grpc: { address: '{{ $bbName }}-storage-{{ $replicaNumber }}.{{ $bbName }}-storage-headless.{{ $ns }}.{{ $dnsName }}:{{ $port }}' },
},
weight: 1,
},
{{- end}}
],
},
},
},
},
maximumMessageSizeBytes: 16 * 1024 * 1024,
openTelemetryBackend: {
batchSpanProcessor: {},
otlpSpanExporter: {
address: '{{ $.Values.baseName }}-opentelemetry-collector-headless.{{ $.Release.Namespace }}.{{ $.Values.serviceDnsName }}:{{ $.Values.otelcol.port }}'
},
},
traceSampler: {
always: {},
},
}
43 changes: 43 additions & 0 deletions kubernetes-with-otelcol-tracing/templates/config/frontend.yaml
@@ -0,0 +1,43 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: {{ $.Values.baseName }}-frontend
namespace: {{ $.Values.namespace }}
data:
frontend.jsonnet: |
local common = import 'common.libsonnet';
{
blobstore: common.blobstore,
global: {
tracing: {
sampler: common.traceSampler,
resourceAttributes: {
"service.name": {string: 'buildbarn-frontend'},
"service.namespace": {string: '{{ $.Values.namespace }}'},
},
backends: [
common.openTelemetryBackend,
],
},
diagnosticsHttpServer: {
listenAddress: ':{{ .Values.frontend.prometheusPort }}',
enablePrometheus: true,
enablePprof: true,
},
},
grpcServers: [{
listenAddresses: [':{{ .Values.frontend.port }}'],
authenticationPolicy: { allow: {} },
}],
actionCacheAuthorizers: {
get: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
put: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
},
contentAddressableStorageAuthorizers: {
get: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
put: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
findMissing: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
},
executeAuthorizer:{ instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
maximumMessageSizeBytes: common.maximumMessageSizeBytes,
}
100 changes: 100 additions & 0 deletions kubernetes-with-otelcol-tracing/templates/config/storage.yaml
@@ -0,0 +1,100 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: {{ $.Values.baseName }}-storage
namespace: {{ $.Values.namespace }}
data:
storage.jsonnet: |
local common = import 'common.libsonnet';
{
global: {
tracing: {
sampler: common.traceSampler,
resourceAttributes: {
"service.name": {string: 'buildbarn-storage'},
"service.namespace": {string: 'buildbarn'},
},
backends: [
common.openTelemetryBackend,
],
},
diagnosticsHttpServer: {
listenAddress: ':{{ .Values.storage.prometheusPort }}',
enablePrometheus: true,
enablePprof: true,
},
},
blobstore: {
actionCache: {
"local": {
persistent: {
stateDirectoryPath: "/persist/ac",
minimumEpochInterval: "300s",
},
keyLocationMapOnBlockDevice: {
file: {
path: "/ac-0/ac.keys",
size_bytes: 1*1024*1024*1024,
},
},
blocksOnBlockDevice: {
source: {
file: {
path: "/ac-0/ac.blocks",
size_bytes: 246*1024*1024*1024,
},
},
spareBlocks: 3,
},
keyLocationMapMaximumGetAttempts: 8,
keyLocationMapMaximumPutAttempts: 32,
oldBlocks: 8,
currentBlocks: 24,
newBlocks: 1,
},
},
contentAddressableStorage: {
"local": {
persistent: {
stateDirectoryPath: "/persist/cas",
minimumEpochInterval: "300s",
},
keyLocationMapOnBlockDevice: {
file: {
path: "/cas-0/cas.keys",
size_bytes: 1*1024*1024*1024,
},
},
blocksOnBlockDevice: {
source: {
file: {
path: "/cas-0/cas.blocks",
size_bytes: 246*1024*1024*1024,
},
},
spareBlocks: 3,
},
keyLocationMapMaximumGetAttempts: 8,
keyLocationMapMaximumPutAttempts: 32,
oldBlocks: 8,
currentBlocks: 24,
newBlocks: 3,
},
},
},
grpcServers: [{
listenAddresses: [':{{ .Values.storage.port }}'],
authenticationPolicy: { allow: {} },
}],
actionCacheAuthorizers: {
get: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
put: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
},
contentAddressableStorageAuthorizers: {
get: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
put: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
findMissing: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
},
executeAuthorizer: { deny: {}},
maximumMessageSizeBytes: 16 * 1024 * 1024,
}