
[OTel] Additional components for the Elastic Distro #5092

Closed
5 tasks
ChrsMark opened this issue Jul 9, 2024 · 16 comments · Fixed by #5242
Assignees
Labels
enhancement New feature or request

Comments

@ChrsMark
Member

ChrsMark commented Jul 9, 2024

Describe the enhancement:

Kubernetes is one of the most used environments according to https://opentelemetry.io/blog/2024/otel-collector-survey/#deployment-scale-and-environment, so it makes sense to consult what the community distro for k8s includes.
We can evaluate the list of the k8s-distro components and consider including them as well in the Elastic distro: https://github.com/open-telemetry/opentelemetry-collector-releases/blob/main/distributions/otelcol-k8s/manifest.yaml
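For reference, a collector distro manifest lists the components compiled into the binary. The sketch below is illustrative only; the component names and versions are examples, not the contents of the actual upstream otelcol-k8s manifest:

```yaml
# Illustrative sketch of a collector distro manifest (builder format);
# component sets and versions are assumptions, not the upstream file.
dist:
  name: otelcol-k8s
  description: OpenTelemetry Collector for Kubernetes
  output_path: ./_build
receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.104.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/kubeletstatsreceiver v0.104.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/k8sclusterreceiver v0.104.0
processors:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor v0.104.0
exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.104.0
```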

/cc @elastic/otel-devs

Describe a specific use case for the enhancement or feature:

What is the definition of done?

Taken from #5092 (comment):

@ycombinator ycombinator added the enhancement New feature or request label Jul 9, 2024
@ycombinator
Contributor

@strawgate @AlexanderWert What is the priority for this issue? Which release do we want to target it to?

@AlexanderWert
Member

We can evaluate the list of the k8s-distro components and consider including them as well in the Elastic distro

Looking at the list of components it seems like there are still many components that sound less relevant for our distro (like spanmetricsconnector, servicegraphconnector, etc.). As you wrote, I think we need to evaluate and come up with our own list inspired by that list.

@strawgate @AlexanderWert What is the priority for this issue? Which release do we want to target it to?

I think 8.15.2 would be reasonable for that, since that's when we plan to have a broader k8s integration for OTel.

@ycombinator
Contributor

Looking at the list of components it seems like there are still many components that sound less relevant for our distro (like spanmetricsconnector, servicegraphconnector, etc.). As you wrote, I think we need to evaluate and come up with our own list inspired by that list.

What would the criteria be for deciding whether to include a component from the list in our distro vs. not include it?

@ChrsMark
Member Author

ChrsMark commented Jul 12, 2024

What would the criteria be for deciding whether to include a component from the list in our distro vs. not include it?

I would consider as required those that are defined by default in the upstream's Helm Charts: https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-collector/values.yaml

With this, users that already use the upstream charts could switch to the Elastic distro just by setting the image properly:

helm install my-opentelemetry-collector open-telemetry/opentelemetry-collector \
  --set mode=daemonset \
  --set image.repository="docker.elastic.co/beats/elastic-agent" \
  --set image.tag="8.15.0-SNAPSHOT" \
  --set command.name="/usr/share/elastic-agent/elastic-agent" \
  --set command.extraArgs="{otel, -c, --config=etc/elastic-agent/otel.yaml}"

The above gives me the following in the failed Collector Pod's logs:

Starting in otel mode
1 error occurred:
	* failed to get config: cannot unmarshal the configuration: decoding failed due to the following error(s):

error decoding 'receivers': unknown type: "prometheus" for id: "prometheus" (valid values: [otlp filelog kubeletstats k8s_cluster hostmetrics httpcheck])
error decoding 'processors': unknown type: "memory_limiter" for id: "memory_limiter" (valid values: [transform filter k8sattributes elasticinframetrics resourcedetection batch resource attributes])
error decoding 'extensions': unknown type: "health_check" for id: "health_check" (valid values: [file_storage memory_limiter])
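The three unknown types line up with components that the upstream chart's default config wires in. Roughly, the chart's rendered config includes something like the following (a sketch inferred from the error messages, not the chart's actual values.yaml):

```yaml
# Sketch of the chart-default components the Elastic distro was missing;
# the prometheus scrape target below is an assumption for illustration.
extensions:
  health_check: {}
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: opentelemetry-collector
          static_configs:
            - targets: ["${env:MY_POD_IP}:8888"]
service:
  extensions: [health_check]
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter, batch]
```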

A working PoC: https://github.com/elastic/opentelemetry-dev/pull/331

@rogercoll
Contributor

rogercoll commented Jul 23, 2024

As defined in the collector best practices docs, the memory_limiter processor should also be included to efficiently manage the resource pipelines: https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-around-resource-utilization
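For reference, the memory_limiter processor is configured with a hard and a soft limit, and the best-practices docs recommend placing it first in each pipeline. A minimal sketch (the limit values are placeholders, to be tuned per deployment):

```yaml
# Minimal memory_limiter sketch; limit values are illustrative.
processors:
  memory_limiter:
    check_interval: 1s   # how often memory usage is measured
    limit_mib: 400       # hard limit: data is refused above this
    spike_limit_mib: 100 # soft limit is limit_mib - spike_limit_mib
  batch: {}
service:
  pipelines:
    logs:
      processors: [memory_limiter, batch]  # memory_limiter goes first
```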

@ycombinator
Contributor

ycombinator commented Jul 23, 2024

As defined in the collector best practices docs, the memory_limiter processor should also be included to efficiently manage the resource pipelines: https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-around-resource-utilization

In #4689, we explicitly chose the memory_limiter extension over the processor because the extension doesn't drop events but instead puts backpressure on the receivers. Do you think we should revert that decision and go with the processor again? cc: @michalpristas

[UPDATE] Chatting with @cmacknz off-issue, there's no reason not to include both the extension and the processor, if we need the processor for our collector distro to work on k8s, so will keep the memory_limiter processor in the list of additional components.

@rogercoll
Contributor

rogercoll commented Jul 24, 2024

@ycombinator I was checking the extension, and from the docs it seems that in the future it is meant to replace the processor:

The memory limiter extension is used to prevent out of memory situations on the collector. The extension will potentially replace the Memory Limiter Processor.

But the extension is currently a no-op:

The extension is under development and does nothing.

I suggested adding the processor because the upstream Helm Chart requires it. Opening an issue to remove the dependency in Helm open-telemetry/opentelemetry-helm-charts#1272

@michalpristas
Contributor

Let's also monitor the size of the binary; the difference between the core collector and the full contrib collector is a bit over 300 MB.
@cmacknz do we find this concerning?

@cmacknz
Member

cmacknz commented Jul 24, 2024

Including the collector has to make the agent larger, and we need more functionality than what the core collector has, so this is really unavoidable. Be mindful of what components we include to make sure they have a purpose (e.g. we probably don't need everything in contrib), but there's not much else we can do in the short term.

@rogercoll
Contributor

From the Helm deployment perspective, only the health_check extension is required in the values configuration. Other components (jaeger, zipkin, etc.) can be removed from the sample configuration by setting their value to null: https://github.com/open-telemetry/opentelemetry-helm-charts/tree/main/charts/opentelemetry-collector#basic-top-level-configuration

Meanwhile, the health_check requirement can also be skipped in the Helm deployment by using a custom configmap (demo example)
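Per the chart documentation linked above, a default component can be dropped by overriding it to null in the values. An illustrative values.yaml sketch (the exact set of defaults to null out depends on the chart version):

```yaml
# values.yaml (sketch): remove chart defaults not shipped in the Elastic distro
config:
  receivers:
    jaeger: null
    zipkin: null
  service:
    pipelines:
      traces:
        receivers: [otlp]  # drop jaeger/zipkin from the pipeline too
```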

@ChrsMark
Member Author

ChrsMark commented Aug 1, 2024

FYI a working example can be found at https://github.com/elastic/opentelemetry-dev/pull/331, which uses #5126.

@ycombinator
Contributor

I'm trying to verify that the changes made in #5242 are sufficient to resolve this issue here.

If I try the helm install ... command mentioned by @ChrsMark in #5092 (comment) with the elastic-agent-8.16.0-SNAPSHOT image, I'm getting an error in the pod logs:

$ helm install my-opentelemetry-collector open-telemetry/opentelemetry-collector \
  --set mode=daemonset \
  --set image.repository="docker.elastic.co/beats/elastic-agent" \
  --set image.tag="8.16.0-SNAPSHOT" \
  --set command.name="/usr/share/elastic-agent/elastic-agent" \
  --set command.extraArgs="{otel, -c, --config=etc/elastic-agent/otel.yaml}"
$ kubectl logs -p my-opentelemetry-collector-agent-5nf44

Starting in otel mode
failed to get config: cannot unmarshal the configuration: decoding failed due to the following error(s):

error decoding 'exporters': unknown type: "logging" for id: "logging" (valid values: [otlp debug file elasticsearch otlphttp])

@ChrsMark could you help me understand where the logging exporter in the error message is coming from? Also, are you sure the config path in the --set command.extraArgs should be etc/elastic-agent/otel.yaml and not /etc/elastic-agent/otel.yml?
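For context, the logging exporter was deprecated upstream in favor of the debug exporter, so newer distros no longer include it. A config that references it can be migrated along these lines (illustrative sketch):

```yaml
# Sketch: replace the deprecated 'logging' exporter with 'debug'
exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    logs:
      exporters: [debug]
```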

@andrzej-stencel
Contributor

andrzej-stencel commented Aug 26, 2024

@ycombinator I believe it's a combination of a couple of issues. To fix this, do the following two things:

  1. Make sure you have an up-to-date OpenTelemetry Collector chart on your machine. You can check the current version with helm search repo and update with helm repo update open-telemetry.

Here's the output I had on my machine:

$ helm search repo
NAME                                   	CHART VERSION	APP VERSION	DESCRIPTION                                       
open-telemetry/opentelemetry-collector 	0.92.0       	0.101.0    	OpenTelemetry Collector Helm chart for Kubernetes 
open-telemetry/opentelemetry-demo      	0.30.6       	1.9.0      	opentelemetry demo helm chart                     
open-telemetry/opentelemetry-ebpf      	0.1.0        	v0.10.0    	OpenTelemetry eBPF Helm chart for Kubernetes      
open-telemetry/opentelemetry-kube-stack	0.0.5        	0.98.0     	OpenTelemetry Quickstart chart for Kubernetes. ...
open-telemetry/opentelemetry-operator  	0.60.0       	0.100.1    	OpenTelemetry Operator Helm chart for Kubernetes  
$ helm repo update open-telemetry                        
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "open-telemetry" chart repository
Update Complete. ⎈Happy Helming!⎈
$ helm search repo                                                                      
NAME                                   	CHART VERSION	APP VERSION	DESCRIPTION                                       
open-telemetry/opentelemetry-collector 	0.102.1      	0.107.0    	OpenTelemetry Collector Helm chart for Kubernetes 
open-telemetry/opentelemetry-demo      	0.32.6       	1.11.1     	opentelemetry demo helm chart                     
open-telemetry/opentelemetry-ebpf      	0.1.1        	v0.10.2    	OpenTelemetry eBPF Helm chart for Kubernetes      
open-telemetry/opentelemetry-kube-stack	0.2.0        	0.107.0    	OpenTelemetry Quickstart chart for Kubernetes. ...
open-telemetry/opentelemetry-operator  	0.68.1       	0.107.0    	OpenTelemetry Operator Helm chart for Kubernetes
  2. Use the corrected helm install command. Here's the command that worked for me to install the OpenTelemetry Collector Helm chart with the latest version of the Elastic Agent 8.16.0 snapshot image:
helm install elastic-otelcol open-telemetry/opentelemetry-collector --version=0.102.1 \
  --set mode=daemonset \
  --set image.repository="docker.elastic.co/beats/elastic-agent" \
  --set image.tag="8.16.0-SNAPSHOT" \
  --set image.pullPolicy=Always \
  --set command.name="/usr/share/elastic-agent/elastic-agent" \
  --set command.extraArgs="{otel, --config=otel.yml}"

I have introduced the following changes compared to the command from your comment:

a. Added the --version option to the helm install command to make sure I'm using the version of the chart I think I'm using. This way, the command will fail for a user who does not have the latest version of the chart downloaded.

b. Added --set image.pullPolicy=Always to make sure I'm using the latest version of the docker.elastic.co/beats/elastic-agent:8.16.0-SNAPSHOT image. If I'm not mistaken, the tag 8.16.0-SNAPSHOT is not stable, i.e. it points to a different digest of the image over time. An alternative to setting the pullPolicy to Always is to specify the image digest instead of the tag with --set image.digest.

c. Corrected the path to the otel.yml config file bundled with the image.

@andrzej-stencel
Contributor

One problem I'm seeing when running the chart with this configuration is that the Health Check extension is not used in the otel.yml file, which results in Kubernetes restarting the collector container and eventually resulting in a CrashLoopBackOff. We should add the health_check extension to service::extensions in the otel.yml file that's bundled with the image.
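The fix amounts to wiring the extension into the service section of the bundled config; a minimal sketch (port 13133 is the extension's default, which the chart's probes expect):

```yaml
# Sketch: enable the health_check extension in the bundled otel.yml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133  # probed by the chart's liveness/readiness checks
service:
  extensions: [health_check]
```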

@ycombinator
Contributor

Thanks for your guidance, @andrzej-stencel! Indeed, I had a (way) outdated version of the Helm chart!

$ helm search repo open-telemetry
open-telemetry/opentelemetry-collector	0.78.0       	0.92.0     	OpenTelemetry Collector Helm chart for Kubernetes
open-telemetry/opentelemetry-demo     	0.27.0       	1.7.0      	opentelemetry demo helm chart
open-telemetry/opentelemetry-ebpf     	0.1.0        	v0.10.0    	OpenTelemetry eBPF Helm chart for Kubernetes
open-telemetry/opentelemetry-operator 	0.45.2       	0.91.0     	OpenTelemetry Operator Helm chart for Kubernetes

And thanks for the fixes to the helm install command as well. ❤️

@ycombinator
Contributor

One problem I'm seeing when running the chart with this configuration is that the Health Check extension is not used in the otel.yml file, which results in Kubernetes restarting the collector container and eventually resulting in a CrashLoopBackOff. We should add the health_check extension to service::extensions in the otel.yml file that's bundled with the image.

Thanks. Created #5369 to try and address this problem.

7 participants