Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move closer to production configuration #342

Open
drewwells opened this issue May 1, 2024 · 8 comments
Open

Move closer to production configuration #342

drewwells opened this issue May 1, 2024 · 8 comments

Comments

@drewwells
Copy link
Contributor

I'm getting stumped by settings I'd expect to be on by default but are not. I see some references to documentation on production ready configuration, but honestly this is pretty limited. Can we start moving to defaults that are production ready? A few things that come to mind that should be on by default but are not.

  • Get rid of recommendations.{}.enabled, these can just be the default. If recommendations need to be off for some reason, turn them off in a unit test override with kind.yaml or dev.yaml
  • Telemetry, turn these on! No reason for them to ever be off
  • Maybe turn off all TTL and point people to documentation to give them recommendation for these settings. We have moved through several iterations of these values, caused outages in production and development trying to find the right values. A document explaining this near the values.yaml would be terrific: upstreamCA TTL > spire CA TTL (perhaps a new name is warranted here) > federation bundle_refresh TTL

Federation

    • caTTL is set to 24hrs but refresh bundle is 365 days (still a bug since chart 0.14.0). This is a bad default, caTTL should be 3-5x what refresh_bundle is set to per the spec
  • federation.ingress.enabled. It's pretty pointless to turn on federation but ingress is disabled. I think a better default is enabled: true. Have the ingress only render if if (AND federation.enabled federation.ingress.enabled). If you turn on federation it also turns on ingress

HA

  • I can't put to words how much pain this statefulset has caused me. Please move to deployment as the default and only kind available and abandon the statefulset idea. If you really want a local database subchart postgres and contact it over a local port. I have to field questions daily about how much downtime our production spire deployments will have and it's entirely due to this statefulset.
@kfox1111
Copy link
Collaborator

kfox1111 commented May 1, 2024

I'm getting stumped by settings I'd expect to be on by default but are not. I see some references to documentation on production ready configuration, but honestly this is pretty limited. Can we start moving to defaults that are production ready? A few things that come to mind that should be on by default but are not.

good defaults often take time to find. Happy to discuss.

  • Get rid of recommendations.{}.enabled, these can just be the default. If recommendations need to be off for some reason, turn them off in a unit test override with kind.yaml or dev.yaml

This was discussed multiple times in the syncs. The general consensus is to make it as easy as possible to kick the tires. For production, you must specify some settings to get it to work at all, so specifying one extra "recommendations.enabled=true" to get all the production defaults was deemed a reasonable usability tradeoff to make it easy for new folks to do simple evaluations of spire.

  • Telemetry, turn these on! No reason for them to ever be off

It is turned on by default when using recommendations. When just kicking the tires, it takes up some more resources that may be fairly constrained in environments like minikube.

  • Maybe turn off all TTL and point people to documentation to give them recommendation for these settings. We have moved through several iterations of these values, caused outages in production and development trying to find the right values. A document explaining this near the values.yaml would be terrific: upstreamCA TTL > spire CA TTL (perhaps a new name is warranted here) > federation bundle_refresh TTL

Without specifying a TTL, there will always be a default TTL. So we can't just turn it off.

Would be happy to hear what values ended up working for you. We can then document them and/or set better defaults. I'm thinking you probably have a lot more experience with finding the proper values then we do at the moment. Also, any ideas where we could put in some sanity checks too would help. like, a configure fail if ustreamCA TTL < spire CA TTL sounds like a good check to have.

Federation

    • caTTL is set to 24hrs but refresh bundle is 365 days (still a bug since chart 0.14.0). This is a bad default, caTTL should be 3-5x what refresh_bundle is set to per the spec

How is that set? Thinking there should be an option here but not finding one other then refresh_hint, which looks to be a default of 5 min.
https://spiffe.io/docs/latest/deploying/spire_server/#configuration-options-for-federationbundle_endpoint

  • federation.ingress.enabled. It's pretty pointless to turn on federation but ingress is disabled. I think a better default is enabled: true. Have the ingress only render if if (AND federation.enabled federation.ingress.enabled). If you turn on federation it also turns on ingress

There's multiple ways of exposing services. Ingress is just one of them. With the gateway api finally hitting 1.0 and multiple implementations finally coming about, it wont be long before this chart gets gateway api support too, and enabling ingress by default may not be the right choice then.

HA

  • I can't put to words how much pain this statefulset has caused me. Please move to deployment as the default and only kind available and abandon the statefulset idea. If you really want a local database subchart postgres and contact it over a local port. I have to field questions daily about how much downtime our production spire deployments will have and it's entirely due to this statefulset.

Sorry it has hurt. In the future, it may be a better default. At the moment, the feature is experimental so shouldn't be a default. As it stabalizes, I do think we should re-evaluate making it a new default.

Dropping support for statefulset entirely is a show stopper. sqlite support is needed in some cases for some folks.

I'd be in favor of subcharting postgresql as an option. I had a pr implementation of that before and it got blocked.

I'm curious about the concern about downtimes. From what I gather of the design of spire, it should be functional with short downtimes. All the tools should block the calls until spire comes back. There are caches involved too so not many calls need to happen. Is there a functional gap here somewhere that I missed?

Please do kick the tires with deployments and let us know any issues you have with them. The more feedback we receive on its usability, the quicker we all will become marking the deployment settings as non experimental.

@drewwells
Copy link
Contributor Author

The statefulset can't be upgraded, you have to recreate it. Try installing 0.11.0 then upgrading to 0.14.0. I don't see how this is considered a stable solution, upgrades are not supported.

Can you explain how sqlite is a show stopper? Can you linked the postgres PR?

This chart doesn't support gateway API, so I don't see that as an argument against enabling ingress by default when federation is turned on. The future may not be helm, that's not a reason to delete this repo.

@kfox1111
Copy link
Collaborator

kfox1111 commented May 1, 2024

Upgrading the statefulset does work. There was one change a while ago where we had to recreate it, and had an upgrade job in that version to help, along with an upgrade note. Its not specific to statefulsets. Your trying to skip versions which is unsupported, and will likely run into issues like this as its not tested.

Its a show stopper for some users. I have deployments I don't want to ever switch to postgresql/mysql. For the pr, its long and complicated. but you can see some of it here: https://github.com/spiffe/helm-charts/pull/233/files#diff-af518a1128fd5263dc90f01e610b5dc85d618789b0f74576755a27d4bfafe147

gateway api support in the chart is on my todo list in the near future.

@drewwells
Copy link
Contributor Author

These are the jobs that are made to help me? It's blocking the upgrade again. These incremental upgrade jobs are doing more harm than good. If you want to make perfection during upgrades, then I would suggest adding some guards to check they're actually needed. Trying to upgrade from 0.14.0 to 0.20.0

❯ k -n spire-server logs -f spire-server-pre-upgrade-rdwrk                                         
I0502 16:10:33.784196       1 request.go:690] Waited for 1.016280636s due to client-side throttling, not priority and fairness, request: GET:https://10.100.0.1:443/apis/autoscaling.aws.crossplane.io/v1beta1?timeout=32s
Error from server (NotFound): validatingwebhookconfigurations.admissionregistration.k8s.io "spire-server-spire-controller-manager-webhook" not found

@faisal-memon
Copy link
Collaborator

@drewwells we don't support major version skipping. As we are current in the Development phase alot is changing with each release. I know its a drag but you have to step through all the intermediate versions.

@drewwells
Copy link
Contributor Author

0.14.0 -> 0.20.0 is a minor version change. Can we start using major versions? 0.15.0 is a major version change, it renamed objects then did a short term renaming fix in the pre-upgrade job that was later removed.

@kfox1111
Copy link
Collaborator

kfox1111 commented May 2, 2024

pre 1.0's are always a bit difficult/different versioning wise. until 1.0.0, 0.14.x -> 0.15.x is a major change.

https://spiffe.io/docs/latest/spire-helm-charts-hardened-about/upgrading/

@kfox1111
Copy link
Collaborator

kfox1111 commented May 2, 2024

I guess its not as clear here: https://artifacthub.io/packages/helm/spiffe/spire#upgrade-notes

how should we clarify that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants