Add example with tracing + OpenTelemetry exporting #61
base: master
@@ -0,0 +1,5 @@
apiVersion: v2
name: buildbarn
description: remote building tool
type: application
version: 0.1.0
@@ -0,0 +1,27 @@
# Buildbarn cache with traces and metrics collection

This deployment is Datadog-centric, but it should be fairly straightforward to swap out the Datadog-specific bits for any other upstream that supports OpenTelemetry.
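For example, to point the collector at a generic OTLP backend instead of Datadog, the `datadog` exporter in the collector config could be replaced with the standard `otlp` exporter. A minimal sketch; the endpoint is a placeholder, and the TLS settings will depend on your backend:

```yaml
# Hypothetical replacement for the `datadog` exporter in the collector's config.yaml
exporters:
  otlp:
    endpoint: my-otlp-backend.example.com:4317  # placeholder endpoint
    tls:
      insecure: true  # only for backends that do not terminate TLS
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, k8sattributes]
      exporters: [otlp]  # was: [datadog]
```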
## Deployment Attributes

1. Sharded local storage, with each pod having 250GB for CAS, 250GB for AC, and 5GB for persistence.
2. Traces are sent to Datadog via the OpenTelemetry Collector running in daemon mode. The trace sampler is set to "always": the preferred approach is to send all traces to Datadog and then sample down on the Datadog side according to your cost needs, because analytics data comes from all ingested spans, not just the retained ones.
3. Prometheus metrics are scraped into Datadog under the metrics namespace "buildbarn". This includes the metrics for the OpenTelemetry Collector itself, so you have some visibility into ingestion rate, errors, and so on.

## Further reading

- Datadog Prometheus scraping configuration: https://docs.datadoghq.com/agent/kubernetes/prometheus/
- Datadog span retention: https://docs.datadoghq.com/tracing/trace_retention_and_ingestion
## Usage Notes

1. Make sure to generate random `hashInitialization` values for common.yaml (they seed the sharding hash; one way to generate one is `python3 -c 'import secrets; print(secrets.randbits(64))'`).
2. Change `serviceDnsName` to reflect the actual DNS names assigned to services in your deployment; see the values sketch below.
3. Set the number of storage replicas to suit your storage and throughput needs.
4. Configure the `volumeClaimTemplates` to suit your cluster's implementation. We're on AWS using basic gp2 EBS volumes and a large disk cache, with no I/O issues on the volumes so far.
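A hypothetical `values.yaml` excerpt illustrating notes 2 and 3; the keys follow the templates in this chart, but the values are examples only:

```yaml
serviceDnsName: svc.cluster.local  # cluster DNS suffix appended to service names (example)
storage:
  replicas: 3  # number of sharded storage pods (example)
```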
Note that changing the number of replicas for the storage nodes will cause a redeployment of all frontend nodes, because the config changes. It will also rearrange the keyspace and cause the bulk of your cache to go cold.

## Requirements
1. Datadog agent already deployed in the Kubernetes cluster: https://docs.datadoghq.com/agent/kubernetes/?tab=helm#installation

Review comment: Could this be set as a dependency of this chart?
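A sketch of what the reviewer is suggesting: the Datadog agent chart could be declared as a chart dependency in Chart.yaml. The repository URL is Datadog's public Helm repo; the version constraint and condition key are placeholders to be adjusted:

```yaml
# Hypothetical addition to Chart.yaml
dependencies:
  - name: datadog
    version: "~3.0.0"            # placeholder version constraint
    repository: https://helm.datadoghq.com
    condition: datadog.enabled   # placeholder toggle in values.yaml
```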
@@ -0,0 +1,100 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: {{ $.Values.baseName }}
    app.kubernetes.io/component: opentelemetry-collector
    service: buildbarn
    team: {{ $.Values.team }}
Review comment: It appears this value doesn't exist: `$.Values.team`
  name: {{ $.Values.baseName }}-opentelemetry-collector
  namespace: {{ $.Release.Namespace }}
spec:
  replicas: {{ $.Values.otelcol.replicas }}
  selector:
    matchLabels:
      app.kubernetes.io/name: {{ $.Values.baseName }}
      app.kubernetes.io/component: opentelemetry-collector
  template:
    metadata:
      annotations:
        checksum/common.config: {{ include (print $.Template.BasePath "/config/collector.yaml.tpl") . | sha256sum }}
        ad.datadoghq.com/otelcol.check_names: |
          ["openmetrics"]
        ad.datadoghq.com/otelcol.init_configs: |
          [{}]
        ad.datadoghq.com/otelcol.instances: |
          [
            {
              "prometheus_url": "http://%%host%%:%%port_http%%/metrics",
              "namespace": "buildbarn.otelcol",
              "metrics": ["*"]
            }
          ]
      labels:
Review comment: This label is missing, while being present in the selector: `app.kubernetes.io/name: {{ $.Values.baseName }}`
        app.kubernetes.io/component: opentelemetry-collector
    spec:
      serviceAccountName: {{ $.Values.baseName }}-opentelemetry-collector
      containers:
      - image: {{ $.Values.otelcol.image.name }}:{{ $.Values.otelcol.image.tag }}
        ports:
        - containerPort: {{ .Values.otelcol.port }}
          protocol: TCP
        - containerPort: {{ .Values.otelcol.prometheusPort }}
          protocol: TCP
        imagePullPolicy: IfNotPresent
        name: otelcol
        resources:
          {{ $.Values.otelcol.resources | toYaml | nindent 12 }}
        volumeMounts:
        - name: config
          mountPath: /etc/otel
      volumes:
      - name: config
        configMap:
          name: {{ $.Values.baseName }}-opentelemetry-collector
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ $.Values.baseName }}-opentelemetry-collector
  namespace: {{ $.Release.Namespace }}
  annotations:
    helm.sh/resource-policy: keep
  labels:
    app: {{ $.Values.baseName }}-opentelemetry-collector
    service: buildbarn
---
apiVersion: v1
kind: Service
metadata:
  name: {{ $.Values.baseName }}-opentelemetry-collector
  namespace: {{ $.Release.Namespace }}
Review comment: For consistency, shouldn't this be `namespace: {{ $.Values.namespace }}`?
spec:
  ports:
  - port: {{ $.Values.otelcol.port }}
    protocol: TCP
  selector:
    app.kubernetes.io/component: opentelemetry-collector
---
apiVersion: v1
kind: Service
metadata:
  name: {{ $.Values.baseName }}-opentelemetry-collector-headless
  namespace: {{ $.Values.namespace }}
  labels:
    app.kubernetes.io/component: opentelemetry-collector
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: grpc
    port: {{ .Values.otelcol.port }}
    protocol: TCP
  - name: http
    port: {{ .Values.otelcol.prometheusPort }}
    protocol: TCP
  selector:
    app.kubernetes.io/name: {{ $.Values.baseName }}
    app.kubernetes.io/component: opentelemetry-collector
@@ -0,0 +1,39 @@
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ $.Values.baseName }}-opentelemetry-collector
  namespace: {{ $.Values.namespace }}
  labels:
    app: {{ $.Release.Name }}-opentelemetry-collector
    service: buildbarn
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
    processors:
      batch:
        timeout: 10s
      k8sattributes:
        passthrough: true
    exporters:
      datadog:
        env: {{ $.Values.otelcol.env }}
        service: {{ $.Values.namespace }}
        tags:
        - kube_namespace:{{ $.Values.namespace }}
        - deployment:{{ $.Values.baseName }}
        api:
          key: example-key
          site: datadoghq.com
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch, k8sattributes]
          exporters: [datadog]
Review comment: This seems to be invalid?

Author reply: huh, interesting. I didn't even modify the collector config, I just copy and pasted it directly from my deployment, which is definitely working.
      telemetry:
        logs:
          level: "debug"
@@ -0,0 +1,62 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ $.Values.baseName }}-common
  namespace: {{ $.Values.namespace }}
data:
  common.libsonnet: |
    {
      blobstore: {
        contentAddressableStorage: {
          sharding: {
            hashInitialization: 1,
            shards: [
              {{- $replicas := until (int .Values.storage.replicas) -}}
              {{- $bbName := .Values.baseName -}}
              {{- $port := .Values.storage.port -}}
              {{- $ns := .Values.namespace -}}
              {{- $dnsName := .Values.serviceDnsName -}}
              {{- range $_, $replicaNumber := $replicas }}
              {
                backend: {
                  grpc: { address: '{{ $bbName }}-storage-{{ $replicaNumber }}.{{ $bbName }}-storage-headless.{{ $ns }}.{{ $dnsName }}:{{ $port }}' },
                },
                weight: 1,
              },
              {{- end}}
            ],
          },
        },
        actionCache: {
          completenessChecking: {
            sharding: {
              hashInitialization: 1,
              shards: [
                {{- $replicas := until (int .Values.storage.replicas) -}}
                {{- $bbName := .Values.baseName -}}
                {{- $port := .Values.storage.port -}}
                {{- $ns := .Values.namespace -}}
                {{- range $_, $replicaNumber := $replicas }}
                {
                  backend: {
                    grpc: { address: '{{ $bbName }}-storage-{{ $replicaNumber }}.{{ $bbName }}-storage-headless.{{ $ns }}.{{ $dnsName }}:{{ $port }}' },
                  },
                  weight: 1,
                },
                {{- end}}
              ],
            },
          },
        },
      },
      maximumMessageSizeBytes: 16 * 1024 * 1024,
      openTelemetryBackend: {
        batchSpanProcessor: {},
        otlpSpanExporter: {
          address: '{{ $.Values.baseName }}-opentelemetry-collector-headless.{{ $.Release.Namespace }}.{{ $.Values.serviceDnsName }}:{{ $.Values.otelcol.port }}'
        },
      },
      traceSampler: {
        always: {},
      },
    }
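For illustration (not part of the chart): with hypothetical values `baseName: buildbarn`, `namespace: buildbarn`, `serviceDnsName: svc.cluster.local`, `storage.port: 8981`, and `storage.replicas: 2`, the CAS sharding block above renders to roughly:

```jsonnet
// Hypothetical rendered output of the shard loop (values are examples only)
sharding: {
  hashInitialization: 1,
  shards: [
    {
      backend: {
        grpc: { address: 'buildbarn-storage-0.buildbarn-storage-headless.buildbarn.svc.cluster.local:8981' },
      },
      weight: 1,
    },
    {
      backend: {
        grpc: { address: 'buildbarn-storage-1.buildbarn-storage-headless.buildbarn.svc.cluster.local:8981' },
      },
      weight: 1,
    },
  ],
}
```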
@@ -0,0 +1,43 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ $.Values.baseName }}-frontend
  namespace: {{ $.Values.namespace }}
data:
  frontend.jsonnet: |
    local common = import 'common.libsonnet';
    {
      blobstore: common.blobstore,
      global: {
        tracing: {
          sampler: common.traceSampler,
          resourceAttributes: {
            "service.name": {string: 'buildbarn-frontend'},
            "service.namespace": {string: '{{ $.Values.namespace }}'},
          },
          backends: [
            common.openTelemetryBackend,
          ],
        },
        diagnosticsHttpServer: {
          listenAddress: ':{{ .Values.frontend.prometheusPort }}',
          enablePrometheus: true,
          enablePprof: true,
        },
      },
      grpcServers: [{
        listenAddresses: [':{{ .Values.frontend.port }}'],
        authenticationPolicy: { allow: {} },
      }],
      actionCacheAuthorizers: {
        get: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
        put: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
      },
      contentAddressableStorageAuthorizers: {
        get: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
        put: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
        findMissing: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
      },
      executeAuthorizer: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
      maximumMessageSizeBytes: common.maximumMessageSizeBytes,
    }
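As a sketch of how the `mustToJson` calls render (not part of the chart): assuming a hypothetical `allInstanceNames: ["", "dev"]` in values.yaml, each authorizer entry above becomes:

```jsonnet
// Hypothetical rendered authorizer (allInstanceNames is an assumed value)
get: { instanceNamePrefix: {allowedInstanceNamePrefixes: ["","dev"] }},
```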
@@ -0,0 +1,100 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ $.Values.baseName }}-storage
  namespace: {{ $.Values.namespace }}
data:
  storage.jsonnet: |
    local common = import 'common.libsonnet';
    {
      global: {
        tracing: {
          sampler: common.traceSampler,
          resourceAttributes: {
            "service.name": {string: 'buildbarn-storage'},
            "service.namespace": {string: 'buildbarn'},
          },
          backends: [
            common.openTelemetryBackend,
          ],
        },
        diagnosticsHttpServer: {
          listenAddress: ':{{ .Values.storage.prometheusPort }}',
          enablePrometheus: true,
          enablePprof: true,
        },
      },
      blobstore: {
        actionCache: {
          "local": {
            persistent: {
              stateDirectoryPath: "/persist/ac",
              minimumEpochInterval: "300s",
            },
            keyLocationMapOnBlockDevice: {
              file: {
                path: "/ac-0/ac.keys",
                size_bytes: 1*1024*1024*1024,
              },
            },
            blocksOnBlockDevice: {
              source: {
                file: {
                  path: "/ac-0/ac.blocks",
                  size_bytes: 246*1024*1024*1024,
                },
              },
              spareBlocks: 3,
            },
            keyLocationMapMaximumGetAttempts: 8,
            keyLocationMapMaximumPutAttempts: 32,
            oldBlocks: 8,
            currentBlocks: 24,
            newBlocks: 1,
          },
        },
        contentAddressableStorage: {
          "local": {
            persistent: {
              stateDirectoryPath: "/persist/cas",
              minimumEpochInterval: "300s",
            },
            keyLocationMapOnBlockDevice: {
              file: {
                path: "/cas-0/cas.keys",
                size_bytes: 1*1024*1024*1024,
              },
            },
            blocksOnBlockDevice: {
              source: {
                file: {
                  path: "/cas-0/cas.blocks",
                  size_bytes: 246*1024*1024*1024,
                },
              },
              spareBlocks: 3,
            },
            keyLocationMapMaximumGetAttempts: 8,
            keyLocationMapMaximumPutAttempts: 32,
            oldBlocks: 8,
            currentBlocks: 24,
            newBlocks: 3,
          },
        },
      },
      grpcServers: [{
        listenAddresses: [':{{ .Values.storage.port }}'],
        authenticationPolicy: { allow: {} },
      }],
      actionCacheAuthorizers: {
        get: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
        put: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
      },
      contentAddressableStorageAuthorizers: {
        get: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
        put: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
        findMissing: { instanceNamePrefix: {allowedInstanceNamePrefixes: {{ mustToJson .Values.allInstanceNames }} }},
      },
      executeAuthorizer: { deny: {} },
      maximumMessageSizeBytes: 16 * 1024 * 1024,
    }
Review comment: Could you document real quick how to set this up? See, for instance, how the existing Kubernetes deployment is documented.