platform stability #1426

Closed
sanderegg opened this issue Apr 2, 2020 · 14 comments · Fixed by #1349, #1360, #1365, #1394 or #1407
Labels: bug (buggy, it does not work as expected), High Priority (a totally crucial bug/feature to be fixed asap)

Comments

@sanderegg
Member

sanderegg commented Apr 2, 2020

Reduce issues with the platform on master/staging/production. This includes cases concerning:

  • deployment malfunctions (when something goes wrong in the deployed environments)
  • design upgrades of key components aimed at greater stability
  • monitoring & diagnostics tooling (configuring or implementing tools to detect issues or help investigate them)
  • maintenance services: ops services that keep the running framework stable, e.g. cron jobs to clean hosts/databases/storage, auto-restarting unhealthy services, etc.
  • testing services/tooling: to ensure stability during development, e.g. new e2e tests, mock fixtures, ...
@sanderegg
Member Author

sanderegg commented Apr 6, 2020

Update on Dim Sum sprint for review

Current status

  • improved logging from oSparc micro-services to report errors in async tasks and reduce noise
  • sidecar refactored with async libraries to overcome stability issues and improve responsiveness to the RabbitMQ heartbeat
  • new monitoring and diagnostic modules that measure service health (currently only in the webserver service)
    • set up a diagnostic tool in the webserver to detect async exceptions and slow tasks and to better analyse memory consumption, linked to prometheus/grafana
    • auto-restart the webserver service when handling of tasks slows down too much
  • increased test coverage
    • added unit/integration testing of the sidecar
    • new pytest plugin, reusable by any service, that aims to increase test coverage
    • improved e2e testing (parallel / series streams)
  • new Python Docker base image: moved from alpine to debian-slim to gain stability and build speed (based on a published study).
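
To make the "detect async exceptions and slow tasks" and "auto-restart when task handling slows down" items above more concrete, here is a minimal sketch in plain `asyncio`. It is not the webserver's actual diagnostics module, which this issue does not describe. `TaskMonitor`, `spawn`, `is_healthy`, `demo`, the logger name and the thresholds are all hypothetical names and values chosen for illustration.

```python
import asyncio
import logging
import time

logger = logging.getLogger("diagnostics_sketch")  # hypothetical logger name


class TaskMonitor:
    """Hypothetical helper that records failed and slow asyncio tasks."""

    def __init__(self, slow_after, max_slow):
        self.slow_after = slow_after  # seconds after which a task counts as slow
        self.max_slow = max_slow      # slow tasks tolerated before reporting unhealthy
        self.failed = []              # names of tasks that raised an exception
        self.slow = []                # names of tasks that exceeded slow_after

    def spawn(self, coro, name):
        """Schedule `coro` as a task and attach a bookkeeping done-callback."""
        started = time.monotonic()
        task = asyncio.create_task(coro, name=name)

        def _on_done(finished):
            elapsed = time.monotonic() - started
            if elapsed > self.slow_after:
                self.slow.append(name)
                logger.warning("task %s took %.3fs", name, elapsed)
            # Retrieving the exception here also keeps asyncio from emitting
            # its own "exception was never retrieved" message later.
            if not finished.cancelled() and finished.exception() is not None:
                self.failed.append(name)
                logger.error("task %s failed: %r", name, finished.exception())

        task.add_done_callback(_on_done)
        return task

    def is_healthy(self):
        """Report unhealthy once more than `max_slow` slow tasks were seen."""
        return len(self.slow) <= self.max_slow


async def demo():
    """Run one fast, one failing and one slow task through the monitor."""
    monitor = TaskMonitor(slow_after=0.05, max_slow=0)

    async def ok():
        await asyncio.sleep(0)

    async def boom():
        raise RuntimeError("simulated failure")

    async def sluggish():
        await asyncio.sleep(0.2)

    tasks = [
        monitor.spawn(ok(), "ok"),
        monitor.spawn(boom(), "boom"),
        monitor.spawn(sluggish(), "sluggish"),
    ]
    await asyncio.gather(*tasks, return_exceptions=True)
    return monitor
```

Running `asyncio.run(demo())` returns a monitor whose `failed` and `slow` lists name the offending tasks. One possible design, again an assumption, is to expose `is_healthy()` through a health-check endpoint so that the container orchestrator restarts the service when it reports unhealthy.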

Ongoing development

  • add a separate traefik.io instance to handle internal reverse proxying (removing that task from the webserver), which reduces traffic load on the webserver
  • increase unit/integration testing coverage (using the new pytest plugin)
  • fix the code coverage metric
  • use the new monitoring to explore the cause of the slowdown (still not fully resolved)
  • test webserver scaling
  • extend the new Python Docker base image to the webserver

@pcrespov pcrespov modified the milestones: Dim Sum, Zhong Zi Apr 23, 2020
@sanderegg
Member Author

sanderegg commented Apr 27, 2020

Update on mid-sprint Zhongzi

Current status

Separated internal reverse proxy from webserver microservice:

Ongoing development

@sanderegg
Member Author

Update on sprint Chrigel Maurer

Current status

Ongoing development

Open points

  • new sidecar (for dynamic services: improves platform stability when a third-party service misbehaves, security, access rights, simpler integration of services)
  • test webserver scaling
  • fix code coverage metric (unit testing)
  • use new monitoring to explore the cause of slowdown (still not fully resolved)

@KZzizzle KZzizzle modified the milestones: Zhong Zi, Huo Guo Jun 14, 2020
@sanderegg sanderegg removed this from the Huo Guo milestone Jun 15, 2020
@sanderegg
Member Author

sanderegg commented Jul 3, 2020

Update on sprint Huo Guo

Current status

Ongoing development

Open points

  • test webserver scaling
  • use new monitoring to explore the cause of slowdown (still not fully resolved)
  • filter e2e testing from metrics (Filter metrics from e2e testing #1561)
  • metrics monitoring issue (data size to generate some metrics causes issue) #1599

@GitHK
Contributor

GitHK commented Aug 17, 2020

Update on sprint Da Jia

Current status

Ongoing development

Open points

@pcrespov
Member

pcrespov commented Sep 29, 2020

Update on sprint Nephelai

[image]

A snapshot of the stability cases follows:
[image: board snapshot of the stability cases]

The board above is accessible via ZenHub (requires a GitHub account).

@sanderegg
Member Author

sanderegg commented Nov 20, 2020

Update on sprint Wankel

Current status

Ongoing development

Open points

@sanderegg
Member Author

Update on sprint Alfred Büchi

Current status

Ongoing development

Open points

@sanderegg
Member Author

sanderegg commented Jan 29, 2021

Update on sprint Chronos

Done

Ongoing

Open points

@sanderegg
Member Author

sanderegg commented Feb 25, 2021

Update on sprint The White Rabbit

Done

Ongoing

Open points

@sanderegg
Member Author

sanderegg commented Mar 24, 2021

Update on sprint Red Panda

Done

Ongoing

Open points

@sanderegg
Member Author

The list of changes for stability/maintenance tasks will be moved to osparc-issues#428

@KZzizzle
Contributor

Should we close this issue, since everything is now under ITISFoundation/osparc-issues#428?

@sanderegg
Member Author

yes
