Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add/nodeipamcontroller #679

Closed

Conversation

DockToFuture
Copy link

@DockToFuture DockToFuture commented Oct 17, 2023

What type of PR is this?
/kind feature

What this PR does / why we need it:
This change adds a nodeipam controller to the aws cloud-controller-manager for IPv6 prefix delegation. The node ipam controller assigns POD addresses to kubernetes nodes based on prefix delegations. The controller supports plain IPv6 (--dualstack=false) and dual stack (--dualstack=true). In dual stack mode the IPAM for IPv4 is done in the same way it was done in the kube-controller-manager.

Which issue(s) this PR fixes:
Fixes #608

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

This change adds a `nodeipam controller` to the aws cloud-controller-manager for IPv6 prefix delegation. The node ipam controller assigns POD addresses to kubernetes nodes based on prefix delegations. The controller supports plain IPv6 (`--dualstack=false`) mode and dual stack (`--dualstack=true`) mode. 

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Oct 17, 2023
@k8s-ci-robot
Copy link
Contributor

Welcome @DockToFuture!

It looks like this is your first PR to kubernetes/cloud-provider-aws 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/cloud-provider-aws has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Oct 17, 2023
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 17, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @DockToFuture. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 17, 2023
@cartermckinnon
Copy link
Contributor

cartermckinnon commented Oct 20, 2023

/assign @cartermckinnon

I'll give this a look 👍

@cartermckinnon
Copy link
Contributor

/assign @M00nF1sh

@hakman
Copy link
Member

hakman commented Oct 23, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 23, 2023
@DockToFuture DockToFuture force-pushed the add/nodeipamcontroller branch from 5c1d5e5 to 93424bc Compare October 24, 2023 08:22
@dims
Copy link
Member

dims commented Nov 2, 2023

/test all

@dims
Copy link
Member

dims commented Nov 2, 2023

/kind feature
/triage accepted

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 2, 2023
@dims
Copy link
Member

dims commented Nov 2, 2023

@DockToFuture how did you test this? I can't spot any tests in this PR, are parts of this controller here borrowed from other places? could we borrow the tests that exist there as well?

@DockToFuture
Copy link
Author

DockToFuture commented Nov 3, 2023

Hello @dims I've deployed the controller into our IPv6 Kubernetes clusters and tested both modes (--dualstack=true,--dualstack=false) with the following controllers enabled (--controllers=cloud-node,cloud-node-lifecycle,nodeipam,service,route).
The code is similar to https://github.com/kubernetes/cloud-provider-gcp/blob/master/cmd/cloud-controller-manager/nodeipamcontroller.go . What for tests do you expect? Can you elaborate on it a little bit?

@dims
Copy link
Member

dims commented Nov 3, 2023

@DockToFuture so what i was thinking was if we were to enhance one of the existing harness(es) to enable this control, for example make test-e2e uses kops to setup a cluster and run a bunch of tests

if we take this harness (switch on ipv6!), add a way to ensure that this new controller is activated/enabled that would be the first step. then we could add some basic golang based tests in test/e2e folder to ensure we don't break this controller going forward. You called out The node ipam controller assigns POD addresses to kubernetes nodes based on prefix delegations in your PR body as the feature, so that's what we would test, launch pods and test if the ip addresses look right.

Does that make sense?

@dims
Copy link
Member

dims commented Nov 3, 2023

@DockToFuture also there are some unit tests there in that other repo you pointed to, some of it may be ported over here as well?
https://cs.k8s.io/?q=func%20Test&i=nope&files=.*nodeipam.*_test%5C.go&excludeFiles=&repos=kubernetes/cloud-provider-gcp

@olemarkus
Copy link
Member

It should be possible to use this as a drop-in replacement to the ipam controller we implemented in kOps.
I expect kOps would still handle assigning the actual prefixes to instances.

kOps would not support dualstack though, so it may be an idea to mark that option experimental unless there is another way of getting that e2e-tested.

@DockToFuture DockToFuture force-pushed the add/nodeipamcontroller branch from 93424bc to d198bf0 Compare November 20, 2023 14:34
@DockToFuture DockToFuture force-pushed the add/nodeipamcontroller branch from 0e5cf91 to ba8554d Compare April 2, 2024 12:47
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 2, 2024
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@DockToFuture
Copy link
Author

@dims I've rebased the PR. Can you take a look?

@dims
Copy link
Member

dims commented Apr 2, 2024

/retest

@dims dims removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 2, 2024
@dims
Copy link
Member

dims commented Apr 2, 2024

@DockToFuture kicked off the tests again hoping these are flakes, if not, then we need to add some option to enable the controller on demand and leave it disabled otherwise i think. (then we can work on a follow up PR to enable them by default)

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Apr 2, 2024

@DockToFuture: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cloud-provider-aws-e2e 9869f39 link true /test pull-cloud-provider-aws-e2e

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@DockToFuture
Copy link
Author

@dims a valid dns-zone needs to be set in the environment variable "DNS_ZONE" in order to create the kops cluster successfully, see my earlier comment: #679 (comment) for more information.

Error: cannot find DNS Zone "example.com".  Please pre-create the zone and set up NS records so that it resolves
Error: exit status 1
make: *** [Makefile:182: test-e2e] Error 1

@dims
Copy link
Member

dims commented Apr 3, 2024

@M00nF1sh any suggestions on how to make this work?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 16, 2024
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

// cidrSet are mapped to clusterCIDR by index
cidrSets := make([]*cidrset.CidrSet, len(allocatorParams.ClusterCIDRs))
for idx, cidr := range allocatorParams.ClusterCIDRs {
cidrSet, err := cidrset.NewCIDRSet(cidr, allocatorParams.NodeCIDRMaskSizes[idx])
Copy link
Contributor

@M00nF1sh M00nF1sh May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can cause crash if the len(allocatorParams.NodeCIDRMaskSizes) < len(allocatorParams.ClusterCIDRs).
It's a nit comment given this shall be supplied by cluster admins so lt's low risk, but good to have validations with clear error message when len(allocatorParams.NodeCIDRMaskSizes) != len(allocatorParams.ClusterCIDRs)

// If node has a pre allocate cidr that does not exist in our cidrs.
// This will happen if cluster went from dualstack(multi cidrs) to non-dualstack
// then we have now way of locking it
if idx >= len(r.cidrSets) {
Copy link
Contributor

@M00nF1sh M00nF1sh May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, we can optimize the performance and readability by doing the check once before the loop
e.g.

if len(node.Spec.PodCIDRs) < len(r.cidrSets) {
   return "some error"
}
for idx, cidr := range node.Spec.PodCIDRs {
 .....original logic without the idx >= len(r.cidrSets) check
}

// If node has a pre allocate cidr that does not exist in our cidrs.
// This will happen if cluster went from dualstack(multi cidrs) to non-dualstack
// then we have now way of locking it
if idx >= len(r.cidrSets) {
Copy link
Contributor

@M00nF1sh M00nF1sh May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as previous comment, i'd like to move this check before the loop for performance and readability.
the len(node.spec.PodCIDRs)(when it's not 0) seems must always >= len(r.cidrSets)

// function you have to make sure to update nodesInProcessing properly with the
// disposition of the node when the work is done.
func (r *rangeAllocator) AllocateOrOccupyCIDR(node *v1.Node) error {
if node == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious, when will node been nil? or is this sort of Defensive programming(if so, let's add a comment like this should never happen for readability) or remove this check if we can ensure it's non-nil.


for idx := range r.cidrSets {
podCIDR, err := r.cidrSets[idx].AllocateNext()
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this happens, seems the allocations for other index(< current idx) will be leaked

}

klog.V(4).Infof("release CIDR %s for node:%v", cidr, node.Name)
if err = r.cidrSets[idx].Release(podCIDR); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we do a best effort by try release for each cidrSet and return aggregated error?

}

// aws:///eu-central-1a/i-07577a7bcf3e576f2
instanceID, _ := awsv1.KubernetesInstanceID(node.Spec.ProviderID).MapToAWSInstanceID()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the error is ignored..what id the instance id is a non-valid instanceid? like a fargate task?

}

// aws:///eu-central-1a/i-07577a7bcf3e576f2
instanceID, _ := awsv1.KubernetesInstanceID(node.Spec.ProviderID).MapToAWSInstanceID()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the error is ignored.

recorder: recorder,
nodesInProcessing: sets.NewString(),
}
nodeList, err := listNodes(kubeClient)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we need to use listNodes? shouldn't the informer handles all existing nodes?

Copy link
Contributor

@M00nF1sh M00nF1sh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a more in-depth review of the code.

Background

  1. Kubernetes KCM have built-in node-iam controller that is capable of either:
  2. The Kubernetes CloudProvider interface don't have a interface defined for "NodeIPAM".
    • If with a strict definition of "CloudProvider" to be "every thing in CloudProvider interface", this shouldn't be part of "AWSCloudProvider".
    • If with a general definition of "CloudProvider" to be "every thing AWS related", then this controller can be part of "AWSCloudProvider", however personally i don't favor it given a few reasons:
      • Personally I prefer to keep the AWSCloudProvider as thin as possible, thus it only contains common & critical parts to get Kubernetes on AWS working.
      • Similar to why we move the "Service/Ingress Implementation" into a dedicated AWSLoadBalancerController(which also falls into this general definition "CloudProvider"), dedicated controllers allows us to iterate faster and more flexible(new features/ bug fixes), and don't need to worry our changes would break other k8s on AWS users, since the versioning of those optional features can be upgraded independently from core AWS cloudProvider functionality.

High level comments on the code itself:

  1. the use case of the "cidr_allocator" seems confusing to me. From the code seems it performs both "allocate from cidr" and "allocate from IPv6"(it's not "dualstack" but more like cidr_and_ipv6prefix_allocator :D). I'm wondering whether there are real use case for this mode.

  2. the implementation pattern of "cidr_allocator" and "ipv6_allocator" seems differs a lot, while technically they are doing same work, and i feel a bit hard to maintain tbh.

    • with the "cidr_allocator", the major work is handle directly during informer's event handler, while uses a work queue(channel) to defer the expensive ipv6 prefix lookup and cidr updates.
      • it's an anti-pattern to do expensive computes in event handlers.
      • there is no retries if there are failures in event handlers
      • i understand that the "nodesInProcessing" tries to aim only allow one worker thread to processing a node at any given time, but it a bit hard to follow tbh given it's removeNodeFromProcessing is called from multiple places/goroutines)..

@DockToFuture
Copy link
Author

DockToFuture commented May 15, 2024

I will close this PR as we have decided to implement an own IPAM controller as suggested by @M00nF1sh to keep the CloudProvider AWS as slim as possible (analogous to the AWS loadbalancer controller). Furthermore this approach gives us more degrees of freedom how the tests for the IPAM controller are run and executed.

@DockToFuture
Copy link
Author

I want to share the freshly created gardener aws-ipam-controller which originated out of this PR and might be beneficial to everyone who might be considering the use of IPv6 prefix delegation for single-stack or dual-stack scenarios.

@dims
Copy link
Member

dims commented Jul 1, 2024

thanks @DockToFuture !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prefix delegation based IPAM support in cc provider extension for IPv6
7 participants