Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KEP-267] Rotate Kubelet Server Certificate Retrospective KEP #4411

Conversation

kannon92
Copy link
Contributor

  • One-line PR description: Open a retrospective KEP to document existing design and mark what work will be needed to eventually GA this feature.
  • Other comments: Taking over @SergeyKanzhelev KEP as I will be leading this effort.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/auth Categorizes an issue or PR as relevant to SIG Auth. labels Jan 17, 2024
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 17, 2024
@kannon92
Copy link
Contributor Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 17, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kannon92
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kannon92 kannon92 force-pushed the sergei-rotate-kubelet-server-certificate branch from 8ae93de to 31e9394 Compare January 18, 2024 17:19
@kannon92
Copy link
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 18, 2024
@kannon92
Copy link
Contributor Author

/cc @tallclair

Copy link
Contributor

@sftim sftim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Here's some feedback. I think we need more detail, even if we already have shipped code.


## Alternatives

N/A
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why “N/A”? Maybe:

Suggested change
N/A
_None identified_

### Goals

- ability to configure kubelet to get a server certificate for the kubelet from
the Certificate Signing Request API instead of generating one self signed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit)

Suggested change
the Certificate Signing Request API instead of generating one self signed
the CertificateSigningRequest API instead of generating one self signed

expiration approaches.

## Motivation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Cluster operators want TLS between their nodes and the control plane. However, managing the
certificates at scale is problematic. Kubernetes provides APIs to define X.509 certificates, signing
requests for those certificates, and even trust anchors.
As a cluster operator, automating the rotation of kubelet X.509 certificates has the potential to replace
significant toil with an easy and secure ongoing process.
### Background
Kubernetes v1.7 introduced kubelet server certificate rotation as alpha.

Comment on lines +69 to +74
This is retrospective KEP description for the feature `RotateKubeletServerCertificate`.

The feature gate `RotateKubeletServerCertificate` allows to configure kubelet to
get a server certificate for the kubelet from the Certificate Signing Request
API instead of generating one self signed and auto rotates the certificate as
expiration approaches.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This is retrospective KEP description for the feature `RotateKubeletServerCertificate`.
The feature gate `RotateKubeletServerCertificate` allows to configure kubelet to
get a server certificate for the kubelet from the Certificate Signing Request
API instead of generating one self signed and auto rotates the certificate as
expiration approaches.
This KEP proposes a mechanism to allow the kubelet to request and then use a certificate
using the Kubernetes CertificateSigningRequest (CSR) API. A kubelet that can authenticate to the
cluster's control plane can use that mechanism to support automatic certificate rotation
before the existing certificate expires. Certificates are only issued to nodes if their CSR is
explicitly approved.

I'd add a Background heading later in the KEP that explains why it's retrospective.


### Non-Goals

- built-in support for approving CSRs nodes request (not currently done because
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- built-in support for approving CSRs nodes request (not currently done because
- built-in support for approving CSRs that nodes have requested (omitted because

Comment on lines +365 to +368
<!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section needs answering.


###### Will enabling / using this feature result in increasing size or count of the existing API objects?

No
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
No
Yes, CertificateSigningRequests.


### Troubleshooting

Failure to get certificate can be troubleshooted using logs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need any more detail here?

Comment on lines +158 to +160
- "server_expiration_renew_errors"
- "certificate_manager_server_rotation_seconds"
- "certificate_manager_server_ttl_seconds"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this right?

Suggested change
- "server_expiration_renew_errors"
- "certificate_manager_server_rotation_seconds"
- "certificate_manager_server_ttl_seconds"
- `server_expiration_renew_errors` (counter)
- `certificate_manager_server_rotation_seconds` (histogram)
- `certificate_manager_server_ttl_seconds` (histogram)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you add the help text (metric descriptions) here?

Comment on lines +240 to +242
Discussing this feature in sig-auth it was suggested that integration tests should be sufficient

NA
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Discussing this feature in sig-auth it was suggested that integration tests should be sufficient
NA
We won't do any end to end tests; this is because SIG Auth feel that integration tests provide sufficient
coverage.

?

I think an optional end to end test would be nice to have, even so.

@tallclair tallclair self-assigned this Jan 31, 2024
ServerTLSBootstrap bool `json:"serverTLSBootstrap,omitempty"`
```

The corresponding CLI (which CLI for Kubelet are deprecated) is `--rotate-server-certificates`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation also states that the command line flag is deprecated, but we don't actually mark it as deprecated in the code. I don't think it is deprecated?

// the 'certificates.k8s.io' API. This requires an approver to approve the
// certificate signing requests (CSR). The RotateKubeletServerCertificate feature
// must be enabled when setting this field.
// Default: false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any path to enabling this by default? It wouldn't make sense for standalone kubelets, but could we enable it by default for Kubelets with a kubeconfig?


The corresponding CLI (which CLI for Kubelet are deprecated) is `--rotate-server-certificates`.

When the feature gate `RotateKubeletServerCertificate` is true, then `ServerTLSBootstrap` is also required to be true.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
When the feature gate `RotateKubeletServerCertificate` is true, then `ServerTLSBootstrap` is also required to be true.
To enable this feature, both the feature gate `RotateKubeletServerCertificate` and configuration option `ServerTLSBootstrap` must be enabled.

- "certificate_manager_server_rotation_seconds"
- "certificate_manager_server_ttl_seconds"

```golang
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just link to the implementation, rather than inlining it here?

Comment on lines +213 to +214
These metrics should be promoted to [beta](https://github.com/kubernetes/kubernetes/pull/122834
) before GA.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not Stable? This line seems to conflict with the previous.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly because they were alpha so I would think they should go alpha -> beta -> ga?

Comment on lines +158 to +160
- "server_expiration_renew_errors"
- "certificate_manager_server_rotation_seconds"
- "certificate_manager_server_ttl_seconds"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you add the help text (metric descriptions) here?

Comment on lines +238 to +240
##### e2e tests

Discussing this feature in sig-auth it was suggested that integration tests should be sufficient
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as the apiserver is configured to validate Kubelet serving certificates, and this feature is enabled, then it's functionality will be covered by a number existing E2E tests (anything that touches the kubelet serving path, e.g. exec/attach/portforward/logs).

Can you confirm whether any E2E test suites match this configuration?

Comment on lines +268 to +269
- [ ] There are conformance tests for the certificates API endpoint, functionality
itself do not require covering with conformance tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: don't make this a check box. It sounds like this is rationalizing the fact that conformance tests aren't part of the GA criteria?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the template for PRR.


###### Does this feature depend on any specific services running in the cluster?

No
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this depend on a CSR approver?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What usually approves CSRs? I'm not very well versed in this area so I'm interested to learn more.

Comment on lines +417 to +418
- If it is a rotation, kubelet will become non-functional unless will be able to
renew the certificate.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume it will keep using the old cert until expiration? And I also assume kubelet would continue to function, but it's serving cert would be invalid? API server can be configured to ignore kubelet serving cert verification.

@kannon92
Copy link
Contributor Author

Sorry I didn't address these comments. I realize that I don't have the domain knowledge to really answer these questions.

I brought this up to sig-node hoping that someone else could spearhead this feature. Our hope was to get this KEP out of the permanent beta state but the requirement is that we need to document the design of this feature.

I will be on leave for 1.31 so this won't make much progress unless someone else takes it.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 23, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 22, 2024
@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 22, 2024
@kannon92
Copy link
Contributor Author

/close

@k8s-ci-robot
Copy link
Contributor

@kannon92: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kannon92
Copy link
Contributor Author

@aojea is taking this over. Closing my attempt.

@k8s-ci-robot k8s-ci-robot added the kind/support Categorizes issue or PR as a support question. label Sep 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory kind/support Categorizes issue or PR as a support question. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/auth Categorizes an issue or PR as relevant to SIG Auth. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants