Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datadog application monitoring #100

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

Charl1996
Copy link
Contributor

@Charl1996 Charl1996 commented Jan 14, 2025

Ticket

This PR

  1. adds datadog to the codebase
  2. Sends a timed metric to datadog

The way it works is, when you're invoking statd from datadog it send the metric to the datadog agent running on the machine (assuming it's running), which in turn handles the authentication etc. Since the datadog agent is already present on the staging and production machines this should just work, but I'll update here what my findings are.

I've considered simply adapting HQ's implementation of the metrics, but decided against that because we really don't need all that code at this stage and can get away with a clearly much simpler approach.

The corresponding CCA Ansible change can be found here.

@@ -191,3 +191,5 @@ class CeleryConfig:

USER_DOMAIN_ROLE_EXPIRY = 60 # minutes
SKIP_DATASET_CHANGE_FOR_DOMAINS = []

SERVER_ENVIRONMENT = 'changeme' # staging, production, etc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming this would be set up by ansible ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@ajeety4 ajeety4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -34,6 +35,10 @@ class DataSetChange:
data: list[dict[str, Any]]

def update_dataset(self):
with statsd.timed('cca.dataset_change.timer', tags=[self.data_source_id]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and 💰 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm currently busy figuring that out, since this would potentially result in massive volumes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chat to Graham and he mentioned that the volumes should not be a problem, since DataDog is good at handling that. All metrics we gather here is sent to the datadog agent on the server which talks to our datadog account once every (I think) 15 seconds.

All in all I think we're good here.

@@ -35,7 +35,8 @@ class DataSetChange:
data: list[dict[str, Any]]

def update_dataset(self):
with statsd.timed('cca.dataset_change.timer', tags=[SERVER_ENVIRONMENT, self.data_source_id]):
env_tag = f"env:{SERVER_ENVIRONMENT}"
with statsd.timed('cca.dataset_change.timer', tags=[env_tag, f"datasource:{self.data_source_id}"]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add domain tag ? I guess that would be useful?

Will datasource show on the metric itself and env would be on the top in the dropdown? and then if we add domain it would also be on the top of the dashboard?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the domain could be useful I suppose.

Will datasource show on the metric itself and env would be on the top in the dropdown?

What do you mean here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how tags are used/available on datadog once they are added here in metrics.
For now, I have seen "env" or "domain" being available on top of a datadog dashboard (like here) and then additionally there are metrics available for different items within a graph (like here)
So, I was asking if the domain and data source ids will be available like so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. I created a chart here to demonstrate that we can, yes.

(I'm not 100% sure if the timeseries is what we want long term, maybe the Top List chart works better?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it, thank you!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeseries seems okay from the little I know, it does give me the data I think I want.

@mkangia
Copy link
Contributor

mkangia commented Jan 15, 2025

Just love it how straight forward this was and +1 on keeping things simple and to what we need.

@mkangia
Copy link
Contributor

mkangia commented Jan 15, 2025

Will wait for your findings for dropping an approval.

@Charl1996
Copy link
Contributor Author

Charl1996 commented Jan 17, 2025

I just want to touch up one or two things before I'll open this for review later today.

@Charl1996 Charl1996 marked this pull request as ready for review January 17, 2025 10:49
@Charl1996
Copy link
Contributor Author

Charl1996 commented Jan 20, 2025

Datadog has been tested on staging and the metrics pulls through (see cca.dataset_change.timer in metrics explorer).

Regarding costs:
There's not a high cardinality with this particular metric (currently 33 DataSource repeaters on production, including test ones, which would spawn 33 custom metrics for the production env). This puts us well below the "$5-per-100-custom-metrics" margin.

I've also spoken to Graham and his sentiment is that we don't have to worry too much about the volume of this metric (ie. the amount of potential data points per custom metric), because the datadog agent handles that well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants