Add heartbeat/monitoring dashboard for inference system #88

Open
micya opened this issue Aug 12, 2022 · 12 comments
Assignees
Labels
  • 2023-hackathon: Goals or topics for the 2023 annual Microsoft hackathon
  • Azure: Issues relating to Azure infrastructure or deployment to Azure
  • inference system: Code to perform inference with the trained model(s)
  • notification system: Issues relating to the notification system
  • UX-design: Needs UX/design team attention

Comments

@micya
Member

micya commented Aug 12, 2022

Historically, troubleshooting inference system/notification system failures has involved manual steps to identify the failure. A past hackathon focused on using Azure Dashboards to surface some metrics from Log Analytics. However, Azure Dashboards is difficult for non-technical observers to use.

I'd like to look into setting up something separate from Azure for monitoring purposes. It could be either a self-developed application or an existing monitoring solution (Prometheus?). At minimum, it should show:

  • Heartbeats from inference system instances
  • Line chart for Cosmos DB read/write metrics
  • Line chart for Azure function executions
  • Line chart for SendGrid emails sent
@micya added the notification system, Azure, and inference system labels Aug 12, 2022
@micya
Member Author

micya commented Sep 21, 2022

Since we ultimately need to monitor across a range of different platforms, we will need a push-based system (as opposed to a pull/scrape-based system like raw Prometheus).
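
To make the push-based idea concrete, here is a minimal sketch of the receiving side: each instance pushes a heartbeat, and the monitor flags any instance that has gone quiet. Everything named here is an assumption for illustration (the `HeartbeatRegistry` class, the 5-minute staleness threshold, the instance IDs); none of it exists in the repo, and a real version would sit behind an HTTP endpoint or an Azure Function.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


@dataclass
class HeartbeatRegistry:
    """Hypothetical helper: remembers the last heartbeat pushed by each instance."""

    # Assumed threshold: an instance counts as stale after 5 minutes of silence.
    stale_after: timedelta = timedelta(minutes=5)
    last_seen: Dict[str, datetime] = field(default_factory=dict)

    def record(self, instance_id: str, at: Optional[datetime] = None) -> None:
        """Store the time of a heartbeat pushed by an instance."""
        self.last_seen[instance_id] = at or datetime.now(timezone.utc)

    def stale_instances(self, now: Optional[datetime] = None) -> List[str]:
        """Return the sorted IDs of instances whose last heartbeat is too old."""
        now = now or datetime.now(timezone.utc)
        return sorted(
            instance_id
            for instance_id, seen in self.last_seen.items()
            if now - seen > self.stale_after
        )
```

The design point is that the monitor never needs network access to the instances; it only needs them to be able to reach it, which is what makes this workable across several platforms.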

@scottveirs
Member

Hey @micya, noticed the Canadian Integrated Ocean Observing System has an uptime monitor that is based on open source code https://github.com/upptime/upptime. It might not be able to help with the instances, but could help ensure we know when any of these sites are not available:

@scottveirs scottveirs assigned scottveirs and micya and unassigned scottveirs Sep 28, 2022
@scottveirs
Member

Hey @micya -- Just noting a couple recent thoughts on possible tools, integrations, and/or data sources for an over-arching dashboard (i.e. maybe for not only the Azure-based realtime inference system, but the whole emerging ecosystem of Orcasound apps, APIs, and data layers):

  • Within each hydrophone location's data acquisition and streaming computer we have been using Dataplicity for remote monitoring and access. We have been thinking about transitioning to Balena.io...
  • Within the orcanode code we used to use LogDNA (now Mezmo) for monitoring processes and errors on each streaming computer.
  • In 2022, a volunteer used Orcasound's Google Analytics data to create this UX Dashboard within the orcasound.net WordPress site.
  • As we move raw Orcasound data into sponsored S3 buckets this year, I think we may be getting some more advanced AWS analytics on the raw data buckets than we've had thus far via the Quilt.com platform, e.g. this Quilt view of the streaming bucket. Of possible utility for the ML team/folks, this will include the acoustic-sandbox bucket where we will store labeled data, so maybe a dashboard feature could be quantifying the current, growing size of our labeled data sets (e.g. 13,435 SRKW call labels, with 13% validated to call type)?

@scottveirs
Member

  • Line chart for Cosmos DB read/write metrics

A sub-feature of a Cosmos DB read line chart that I would find interesting:

Number of API requests from "outsiders" -- a possible metric for measuring the value of our open labeled data to external collaborators, e.g. ML developers or bioacousticians.
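
A rough sketch of how that count could be computed, assuming we had request records that identify the caller. That is a big assumption: the `(day, caller)` record shape, the `INTERNAL_CALLERS` set, and the function name are all hypothetical, and nothing in this thread confirms that such caller information is currently logged.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple

# Hypothetical identifiers for our own services; anything else is an "outsider".
INTERNAL_CALLERS = {"moderator-portal", "notification-system"}


def external_requests_per_day(
    records: Iterable[Tuple[date, str]],
) -> Dict[date, int]:
    """Count requests per day whose caller is not in INTERNAL_CALLERS.

    Each record is an assumed (day, caller) pair taken from a request log.
    """
    counts: Counter = Counter()
    for day, caller in records:
        if caller not in INTERNAL_CALLERS:
            counts[day] += 1
    return dict(counts)
```

The per-day totals could then feed the same line chart as the other read metrics.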

@scottveirs added the 2023-hackathon label Mar 2, 2023
@Rachel-Frazier Rachel-Frazier self-assigned this Sep 12, 2023
@xilin22 xilin22 self-assigned this Sep 12, 2023
@Rachel-Frazier
Collaborator

We (@xilin22 and I) looked into setting up Prometheus and Grafana for a health dashboard, but found that Grafana doesn't let individuals with personal accounts access the Grafana dashboard; a work or school account is required. (See the following error:)
image
There's a feedback request for this feature, but it doesn't seem as though the Grafana team is looking to implement this any time soon.

We are now looking into using Azure Workbooks for data visualization instead, which is newer and may solve some of the pain points that were called out in 2022.

@xilin22
Collaborator

xilin22 commented Sep 12, 2023

As for alerting, we can add more Azure Functions to monitor service and resource health, since Azure Managed Grafana does not allow personal accounts to log in to an Azure Managed Grafana instance.

@xilin22
Collaborator

xilin22 commented Sep 12, 2023

@micya @scottveirs We may be able to get Azure Managed Grafana to work if we create our own organizational domain. It might be worth a shot if there is little to no cost in creating one. Maybe then Azure won't view it as a personal account.
image

@micya
Member Author

micya commented Sep 12, 2023

> @micya @scottveirs We may be able to get Azure Managed Grafana to work if we create our own organizational domain. It might be worth a shot if there is little to no cost in creating one. Maybe then Azure won't view it as personal account.

We already have an organization. If you create a user in our AAD tenant, that should work. Though we would then need to track the username/password for the new user.

@xilin22
Collaborator

xilin22 commented Sep 13, 2023

That makes sense. I don't have permissions to create one. Maybe either you, @micya, or @scottveirs could create one and send me the credentials?
image

@micya
Member Author

micya commented Sep 13, 2023

@xilin22 - granted "User Administrator" on AAD tenant. Let me know if that doesn't work.

@scottveirs
Member

There are a few thoughts on computed latency KPIs that could be valuable in a high-level heartbeat dashboard here -- #157

@scottveirs added the UX-design label Sep 5, 2024
@micya
Member Author

micya commented Sep 17, 2024

Additional thoughts in orcasound/orcanode-monitor#19 (comment)

4 participants