Reduce the infra cost for Knative #3104

Closed
13 of 14 tasks
chizhg opened this issue Feb 17, 2022 · 3 comments

@chizhg
Member

chizhg commented Feb 17, 2022

Currently the infra cost for running Prow, E2E tests, and release pipelines is around $50K per month. There have already been some efforts to reduce the cost. As part of the CNCF infra migration effort, we want to reduce the cost as much as possible before fully handing off the infra to the Knative community.

The current infra cost breakdown is available in #3003.

Below are some tracking issues that could help:

/kind cncf-infra

@mattmoor
Member

From the meeting: we should be very very very careful about any sort of image cleanup.

There was cleanup logic added to Knative very early on which ended up doing an rm -rf on our public release bucket and burned a day or so of my and others' time getting the GCR/GCS teams to restore it from backups.

If release images aren't a significant cost, then please just don't touch them.

If they are, the cost is almost certainly egress, and whatever solution we implement should shadow what Kubernetes is doing to solve this (still WIP), which does NOT involve deleting things, but mirroring them to avoid egress charges.


On GCS object lifecycle management, I said "don't turn it on", but this was really aimed at using it to GC old blobs.

The more nuanced answer is: OLM should absolutely be turned on, but only for the feature that retains older generations of objects, so that we don't have to page the GCS on-call to restore things from backup. The features that allow object TTLs should absolutely never be used on a GCR bucket.

Turning on OLM to avoid paging GCS was one of the requests from the GCS team when we had to restore things from backup (above).
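
To make the distinction between retaining older generations and applying object TTLs concrete, here is a minimal sketch of a check that flags lifecycle rules able to delete live objects. It assumes a dict shaped like the `rule` list of a GCS lifecycle configuration (`action.type`, `condition.isLive`, `condition.numNewerVersions`, `condition.age`). The function name and both sample policies are hypothetical and are not taken from the Knative infra repo. The check is deliberately conservative: any Delete rule that is not explicitly limited with `isLive: false` is reported. It does not verify that object versioning is enabled on the bucket, which retaining older generations also depends on.

```python
from typing import Any


def risky_delete_rules(lifecycle: dict[str, Any]) -> list[dict[str, Any]]:
    """Return Delete rules that are not restricted to noncurrent generations."""
    risky = []
    for rule in lifecycle.get("rule", []):
        action = rule.get("action", {})
        condition = rule.get("condition", {})
        if action.get("type") != "Delete":
            continue
        # isLive == False limits the rule to noncurrent (older) generations;
        # anything else could remove a live object, so report it.
        if condition.get("isLive") is not False:
            risky.append(rule)
    return risky


# Hypothetical policy: prune old generations only, never touch live objects.
keep_generations = {
    "rule": [
        {
            "action": {"type": "Delete"},
            "condition": {"isLive": False, "numNewerVersions": 5},
        }
    ]
}

# Hypothetical TTL policy of the kind warned against above.
ttl_policy = {
    "rule": [{"action": {"type": "Delete"}, "condition": {"age": 90}}]
}

print(len(risky_delete_rules(keep_generations)))  # 0
print(len(risky_delete_rules(ttl_policy)))        # 1
```

A check like this could run before any lifecycle change is applied to a release bucket, so that a TTL-style rule is rejected rather than discovered after blobs are gone.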

@chizhg chizhg moved this to In Progress in Infra (Productivity) Feb 24, 2022
@krsna-m krsna-m moved this from In Progress to Ready To Work in Infra (Productivity) Feb 24, 2022
@chizhg
Member Author

chizhg commented Mar 2, 2022

/cc @upodroid @kvmware

This issue tracks all the work we are doing or planning to do to reduce the infra cost. If you have other ideas, please create a new issue and link it here. If you have concerns about the existing issues, please leave comments under the corresponding issue. Thanks!

@chizhg chizhg self-assigned this Mar 2, 2022
@chizhg chizhg moved this from Ready To Work to In Progress in Infra (Productivity) Mar 3, 2022
@krsna-m krsna-m moved this from In Progress to Done in Infra (Productivity) Apr 7, 2022
@chizhg
Member Author

chizhg commented May 2, 2022

As of now, we have met our goal of reducing the infra cost to below $250k/year (the cost for 2022/04 was ~$13k, so the annualized cost will be ~$156k). The remaining tasks initially added to this issue are nice-to-have and are already tracked in separate issues. Closing this issue to mark it as resolved.
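
As a rough check of the arithmetic, here is a minimal sketch that annualizes the monthly figures quoted in this thread. It assumes the April 2022 cost holds flat for all twelve months, which is a simplification; the function and variable names are illustrative only.

```python
MONTHS_PER_YEAR = 12


def annualize(monthly_cost_usd: int) -> int:
    """Project a yearly cost from one month's cost, assuming it stays flat."""
    return monthly_cost_usd * MONTHS_PER_YEAR


april_2022_cost = 13_000   # approximate 2022/04 cost quoted in this thread
starting_monthly = 50_000  # approximate monthly cost when the issue was opened
annual_target = 250_000    # yearly goal quoted in this thread

projected = annualize(april_2022_cost)
print(projected)                    # 156000
print(projected < annual_target)    # True
print(annualize(starting_monthly))  # 600000
```

On these numbers the projected yearly cost is well under the target, and roughly a quarter of the annualized starting cost.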
