
MGDSTRM-10301 switching routes to use the loadbalancer hostname #847

Open
wants to merge 3 commits into base: main
Conversation

@shawkins (Contributor) commented Jan 5, 2023

This is to resolve the issue we saw with a private ROSA cluster lacking a public entry in the DNS cluster config. In that case the hostname of the ingress controller is only known within the private hosted zone, so it cannot be used as the target for a Route 53 CNAME entry. It is valid to instead use the loadbalancer hostname associated with the ingresscontroller directly. This version of the change assumes we can do this all of the time, but there are some concerns: we're effectively assuming that the loadbalancer hostname remains stable and that there's only one entry in the service status.

Alternatives include:

  • merging MGDSTRM-10232 initial configuration to control the ic scope #840 first and only doing this when the managedkafkaagent is designated as public and we introspect the DNS config and see that it lacks a public entry.
  • instead mandating that cluster owners create an appropriate public entry in their DNS config if they want public kafka.
  • alleviating the concern about loadbalancer hostname stability by having the control plane update DNS records - it looks like it only validates them currently.

This change is also blocked on the fleetmanager side by their DNS logic: https://github.com/bf2fc6cc711aee1a0c2a/kas-fleet-manager/blob/db4c66b6510a4935f793ddad6dc60952fddb0ae6/internal/kafka/internal/services/data_plane_kafka.go#L515 - there is an expectation there that the router hostname has the same root as the cluster.
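
For orientation, a minimal sketch of the core idea using the fabric8 kubernetes-client model; the class and method names here are illustrative, not the actual code in this PR:

import io.fabric8.kubernetes.api.model.LoadBalancerIngress;
import io.fabric8.kubernetes.api.model.Service;

import java.util.List;
import java.util.Optional;

// Hypothetical helper: read the hostname published on the ingress controller's
// LoadBalancer Service status and use it directly as the CNAME target, instead of
// the ingresscontroller route hostname that only resolves in the private hosted zone.
public final class LoadBalancerHostname {

    private LoadBalancerHostname() {
    }

    public static Optional<String> fromRouterService(Service routerService) {
        List<LoadBalancerIngress> ingress = Optional.ofNullable(routerService)
                .map(Service::getStatus)
                .map(status -> status.getLoadBalancer())
                .map(lb -> lb.getIngress())
                .orElse(List.of());

        // As noted above, this assumes there is only one entry in the service status
        // and that the loadbalancer hostname remains stable.
        return ingress.stream()
                .map(LoadBalancerIngress::getHostname)
                .filter(host -> host != null && !host.isBlank())
                .findFirst();
    }
}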

@github-actions bot added the operator (changes related to operator) label Jan 5, 2023
@MikeEdgar (Contributor) left a comment

This looks good to me. Regarding your comments on alternatives, I do think this approach is the cleanest/simplest in any case, provided the domain name is stable.

@rareddy (Contributor) left a comment

lgtm

/**
 * Domain part prefixed to domain reported on IngressController status. The CNAME DNS records
 * need to point to a sub-domain on the IngressController domain, so we just add this.
 */
private static final String ROUTER_SUBDOMAIN = "ingresscontroller.";
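
For context, a rough sketch of how a constant like this gets combined with the IngressController's reported domain to form the CNAME target; the domain value below is a placeholder, not taken from this PR:

// Placeholder domain for illustration only.
String ingressControllerDomain = "kas-zone.apps.example-cluster.example.com";
// The broker CNAME records point at a sub-domain of the IngressController domain.
String cnameTarget = ROUTER_SUBDOMAIN + ingressControllerDomain;
// -> "ingresscontroller.kas-zone.apps.example-cluster.example.com"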


Will this change affect existing Kafkas somehow, and will a migration be needed?


@shawkins (Contributor, Author) replied

As coded, yes: we would need both the check the control plane is doing to be relaxed and the control plane logic to update CNAME records to be implemented. We'll leave the DO NOT MERGE label on this until we get both of these points straightened out.


Thanks, that makes sense.

/cc @pb82

@rareddy (Contributor) commented

What is meant by "the check the control plane is doing to be relaxed"? Also, since the URL is based on the customer's load balancer, why should a CNAME be registered in RHOSAK?

@machi1990 replied

> What is meant by "the check the control plane is doing to be relaxed"?

The control plane currently validates that the route URL returned by the fleet sync corresponds to the OSD cluster DNS: https://github.com/bf2fc6cc711aee1a0c2a/kas-fleet-manager/blob/db4c66b6510a4935f793ddad6dc60952fddb0ae6/internal/kafka/internal/services/data_plane_kafka.go#L515

The OSD cluster DNS is retrieved from the cluster ingresses. I understand this won't be the case for a private cluster, and the check will have to be relaxed somehow.

> Also, since the URL is based on the customer's load balancer, why should a CNAME be registered in RHOSAK?

If the route URL changes, KFM needs to update the CNAME record. From what I see, the existing routers all had an ingresscontroller. prefix. If this URL changes somehow, the CNAME records will need to be updated to point to the new routes.
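
A rough sketch of the kind of check being referred to; the real validation lives in kas-fleet-manager (in Go), so this is only an illustration of the rule that would need to be relaxed:

// Illustrative only: accept a reported router host only if it falls under the OSD cluster DNS.
// A loadbalancer hostname such as "abc123.elb.amazonaws.com" would fail this suffix check,
// which is why the check would need to be relaxed for the private-cluster case.
boolean isUnderClusterDomain(String routerHost, String clusterDns) {
    return routerHost != null && routerHost.endsWith("." + clusterDns);
}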

@rareddy (Contributor) replied

Thanks @machi1990.

On CNAME registration, what I mean is: since the URL is already backed by the user's AWS Route 53 (I am assuming so, as it is their load balancer), why also register it in RHOSAK's AWS Route 53?

@shawkins (Contributor, Author) commented Jan 10, 2023

@rareddy the scenario is a public kafka on a cluster that lacks a public DNS entry in its config - we have already seen that this is the default state of private ROSA clusters. In that circumstance, creating CNAME entries that point from the application domain endpoints broker1.something to ingresscontroller.kas-zone.clusterdomain will not work, as nothing is creating a public DNS entry for kas-zone.clusterdomain. The necessary entry would point from kas-zone.clusterdomain to the AWS loadbalancer hostname. So here we are skipping that and pointing broker1.something directly to the AWS loadbalancer hostname.

A way to do this that would address the migration concern would be to leave the top-level DNS entries as-is and create additional route entries that point from ingresscontroller.kas-zone.clusterdomain to the AWS loadbalancer hostname. That would still require a change on the control plane side, however, as the target still won't be based upon the cluster domain. It also may not work for enterprise, because the customer's clusterdomain isn't necessarily something that we would own and create DNS entries for.
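
As a rough illustration of that extra-hop alternative (all hostnames below are placeholders, not actual kas-fleetshard values):

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the DNS entries involved.
Map<String, String> cnameRecords = new LinkedHashMap<>();
// The existing top-level entries stay untouched...
cnameRecords.put("broker1.example-kafka.example.org",
        "ingresscontroller.kas-zone.cluster.example.org");
// ...and one additional record bridges the ingresscontroller name to the AWS loadbalancer hostname.
cnameRecords.put("ingresscontroller.kas-zone.cluster.example.org",
        "abc123.elb.amazonaws.com");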

@rareddy (Contributor) replied

Thanks @shawkins for the explanation. I recall some of this conversation from before. So in this scenario we do not use/need the RHOSAK-deployed ingress controllers anymore, but we leave them as-is for now to support other scenarios.

@shawkins (Contributor, Author) replied Jan 10, 2023

No, the ingresscontroller is still used. We are still routing traffic to the AWS loadbalancer, and the AWS loadbalancer has already been configured by the ingress controller operator to point to the ingress controller service / HAProxy pods.

We can only get away from using the ingress controllers if we use LoadBalancer services or switch to using Ingress and the https://github.com/kubernetes-sigs/aws-load-balancer-controller - I didn't see that the operator for this is running by default, so let me do some more testing on a local instance. Updated https://issues.redhat.com/browse/MGDSTRM-10055?focusedCommentId=21523841&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-21523841 with the findings related to Ingress. Long story short, we're not going to use Ingress any time soon.

LoadBalancer services, per my current understanding, would still require us to handle DNS on our own, for both the public and private cases.

@shawkins (Contributor, Author) commented
The next changes make the switch to the loadbalancer hostname conditional on it being a public kafka on a cluster that lacks a public DNS config. This removes the immediate need to worry about migrating/modifying existing DNS entries - my understanding in that case was that we'd still continue to function appropriately, as the existing entries would still be in place, but we'd fail the validation that checks whether the CNAME records need to change.
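
A minimal sketch of that conditional behaviour; the names here are assumptions standing in for the actual kas-fleetshard configuration and introspection logic:

// Hypothetical method: dnsHasPublicZone stands in for whatever introspection of the
// cluster DNS config is used to detect a public entry.
String routeCnameTarget(boolean kafkaIsPublic, boolean dnsHasPublicZone,
        String ingressControllerDomain, String loadBalancerHostname) {
    if (kafkaIsPublic && !dnsHasPublicZone) {
        // Public kafka on a cluster without a public DNS zone (e.g. a private ROSA cluster):
        // point the CNAMEs straight at the AWS loadbalancer hostname.
        return loadBalancerHostname;
    }
    // Otherwise keep the existing behaviour, so no migration of current DNS entries is needed.
    return ROUTER_SUBDOMAIN + ingressControllerDomain;
}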

On the other changes in the operator config - @rareddy, was it intentional to leave in https://github.com/bf2fc6cc711aee1a0c2a/kas-fleetshard/pull/847/files#diff-0a4c59e14630b9f89b740bf0d391beda94261fcf66093ddba7d9226761e90960L227? I didn't think we actually wanted it hard-coded in this manner, so it's been removed.

I'm also overriding the fabric8 max concurrent values - we're working to address this in 6.3+, but right now if you create 20 websockets then OkHttp won't submit any additional requests. With the new DNS informer, I count 18 informers that we're creating, and the operator SDK should be creating another, which is dangerously close to the default limit. It would also force anything that we have as concurrent logic over a limited number of connections.
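
A sketch of what that override could look like when building the client; the values are placeholders and the exact wiring in kas-fleetshard is not shown in this excerpt:

import io.fabric8.kubernetes.client.Config;
import io.fabric8.kubernetes.client.ConfigBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

// Raise the request-concurrency limits so ~18-20 long-lived watch/informer websockets
// do not starve other requests to the API server. Values are illustrative.
Config config = new ConfigBuilder()
        .withMaxConcurrentRequests(64)
        .withMaxConcurrentRequestsPerHost(64)
        .build();
KubernetesClient client = new KubernetesClientBuilder().withConfig(config).build();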

@machi1990 left a comment

lgtm, thanks for adding the code change so that no migration of already existing DNS entries is required.

@MikeEdgar (Contributor) left a comment

LGTM, but a couple comments.

@sonarqubecloud commented
Kudos, SonarCloud Quality Gate passed!

Bugs: A (0 Bugs)
Vulnerabilities: A (0 Vulnerabilities)
Security Hotspots: A (0 Security Hotspots)
Code Smells: A (0 Code Smells)

Coverage: 81.8%
Duplication: 0.0%

@rareddy (Contributor) left a comment

lgtm

@grdryn removed their request for review November 21, 2023 14:54
Labels: DO NOT MERGE, operator (changes related to operator)
Projects: None yet
5 participants