DogStatsd.Configure does not catch SocketException "No such host is known" when the agent host cannot be found in DNS #138

Open
tylerohlsen opened this issue Oct 1, 2020 · 7 comments


@tylerohlsen

I call DogStatsd.Configure once on application start. I ran into an issue where I had configured the agent wrong, so the agent pod in my cluster was not starting and the DNS entry for the agent had not yet been added. This cascaded into my application failing to start: the DNS lookup failed, and the resulting exception bubbled up the stack unhandled and crashed the process.

Now, I could write my own logic to catch this, retry on a background thread, and queue the pending metrics. But I believe the internals of this library already do that for transient network issues once it has connected, so it feels most appropriate for the library to also handle the DNS lookup failure.
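
For illustration, here is a minimal sketch of the retry part of that workaround (without the metric queuing). The ConfigureWithRetry helper name and the 30-second retry interval are hypothetical choices, not anything provided by the library:

```csharp
// Hypothetical helper (not part of StatsdClient): retry DogStatsd.Configure on a
// background task so a failed DNS lookup cannot crash application startup.
using System;
using System.Threading.Tasks;
using StatsdClient;

public static class DogStatsdStartup
{
    public static void ConfigureWithRetry(StatsdConfig config)
    {
        _ = Task.Run(async () =>
        {
            while (true)
            {
                try
                {
                    DogStatsd.Configure(config);
                    return; // configured successfully
                }
                catch (Exception ex)
                {
                    // The SocketException from the failed DNS lookup ends up here
                    // (wrapped by Task<T>.Result, as in the stack trace below).
                    Console.Error.WriteLine($"DogStatsd configuration failed: {ex.Message}");
                    await Task.Delay(TimeSpan.FromSeconds(30));
                }
            }
        });
    }
}
```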

Here's the stack trace:

 ---> System.Net.Sockets.SocketException (11001): No such host is known.
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw(Exception source)
   at System.Net.Dns.HostResolutionEndHelper(IAsyncResult asyncResult)
   at System.Net.Dns.EndGetHostEntry(IAsyncResult asyncResult)
   at System.Net.Dns.<>c.<GetHostEntryAsync>b__27_1(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
   at System.Threading.Tasks.Task`1.get_Result()
   at StatsdClient.StatsdUDP.GetIpv4Address(String name)
   at StatsdClient.StatsdBuilder.CreateUDPStatsSender(StatsdConfig config, String statsdServerName)
   at StatsdClient.StatsdBuilder.CreateStatsSender(StatsdConfig config, String statsdServerName)
   at StatsdClient.StatsdBuilder.BuildStatsData(StatsdConfig config)
   at StatsdClient.DogStatsdService.Configure(StatsdConfig config)
   at StatsdClient.DogStatsd.Configure(StatsdConfig config)
@ogaca-dd (Contributor) commented Nov 4, 2020

Hello @tylerohlsen,

Thank you for reporting this feature request. I have created a card in our backlog.
What kind of DNS failure do you have? Is it brief, random DNS failures, or can DNS be unavailable for a few minutes or more?

@tylerohlsen (Author)

Hi @ogaca-dd,

Thanks for adding the card to your backlog! I've run into two causes of DNS failures that have caused this.

In the first case, I misconfigured the address of the Datadog agent. This caused an immediate and irrecoverable error where the services in our Kubernetes cluster could not start and went into a continuous restart loop. I would have liked the services to start and an error to be logged so I could fix the issue on my own time. In this case, the issue could not self-recover, so it would have been present for a long period of time.

In the second case, we had a version of the Kubernetes CNI with a bug that intermittently caused pods to start up without any network access. This prevented cron jobs and sidecar containers from starting up even though they otherwise did not need network access. In this case, the issue lasted only a few minutes at a time.

Tyler

@pblachut commented Jan 8, 2021

Is there any update regarding this issue? I'm using version 6.0.0 and still observe it.

DD agent availability or any other issue with metrics should not cause the whole service to crash.

@yoliva commented Oct 18, 2022

Any update on this one? We are still experiencing the same issue. I had to create a wrapper around the service to prevent the whole system from crashing because of a misconfiguration or because Datadog is down.
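
Roughly, that wrapper looks like the sketch below; the SafeDogStatsd name, the exposed methods, and the choice to silently skip metrics when configuration failed are all illustrative:

```csharp
// Sketch of a defensive wrapper (hypothetical class name): configuration
// failures are logged instead of crashing the host service, and metric
// calls become no-ops when the client was never configured.
using System;
using StatsdClient;

public sealed class SafeDogStatsd
{
    private readonly DogStatsdService _service = new DogStatsdService();
    private bool _configured;

    public void TryConfigure(StatsdConfig config)
    {
        try
        {
            _service.Configure(config);
            _configured = true;
        }
        catch (Exception ex)
        {
            // Log and carry on instead of letting the host service crash.
            Console.Error.WriteLine($"DogStatsd configuration failed: {ex.Message}");
        }
    }

    public void Increment(string statName)
    {
        if (_configured)
        {
            _service.Increment(statName);
        }
    }

    public void Gauge(string statName, double value)
    {
        if (_configured)
        {
            _service.Gauge(statName, value);
        }
    }
}
```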

@ogaca-dd (Contributor)

Hello @yoliva, @pblachut,

I have opened this PR to keep Configure from throwing an exception. It will be part of the next release.

@yoliva commented Oct 18, 2022

Thanks for your reply, @ogaca-dd.

I have one more question for you related to this topic: if for some reason the agent is down, will the calls to .Gauge(), .Increment(), .Histogram(), etc. fail, or does the library have any sort of retry or ignore policy for scenarios like this one?
Thanks!

@ogaca-dd (Contributor)

@yoliva,

This is a good question. Any error during Configure, such as a DNS failure, is fatal and all metrics will be ignored.
If the DNS resolution succeeds but the Agent is down and recovers 5 minutes later, then the new metrics will be sent.
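
To illustrate (metric names are made up): assuming the behaviour described above, application code can keep calling the metric methods unconditionally.

```csharp
// Assuming the behaviour described above: if Configure hit a fatal error
// (e.g. the DNS lookup failed), these calls are ignored; if the Agent is
// only temporarily down, metrics resume once it is reachable again.
DogStatsd.Increment("orders.processed");
DogStatsd.Gauge("queue.depth", 42);
```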
