
Supporting stretch Kafka cluster with Strimzi #129

Open · wants to merge 9 commits into base: main

Conversation

aswinayyolath

This proposal describes the design details of a stretch Kafka cluster.

@aswinayyolath aswinayyolath changed the title Enabling Stretch Kafka Deployments with Strimzi Supporting stretch Kafka cluster with Strimzi Sep 5, 2024
@aswinayyolath aswinayyolath force-pushed the stretch-cluster branch 3 times, most recently from 80f778e to 8396b40 Compare September 6, 2024 10:18
Contributor

@fvaleri fvaleri left a comment

Hi, thanks for the proposal. Left some initial comments.

Can you please put one sentence per line to make the review easier? You can look at one of the other proposals for an example.

The word "cluster" is overloaded in this context, so we should always pay attention and clarify if we are talking about Kubernetes or Kafka.


### Prerequisites

- **Multiple Kubernetes Clusters**: Stretch Kafka clusters will require multiple Kubernetes clusters. Ideally, an odd number of clusters (at least three) is needed to maintain quorum in the event of a cluster outage.
Contributor

Should we add a LoadBalancer or dedicated Ingress controller as a prerequisite to avoid the potential bottleneck caused by a shared Ingress controller? If the Kubernetes cluster hosts other services, the actual latency could become unpredictable even if the network latency is good.


That seems like a good idea; anything that can make latency predictable will help with the stability of communication. I'll add that as a prerequisite for now. We can relax that requirement in the future if needed.

Member

But those are not prerequisites. We should not rely only on the primitives for outside access. We need to consider / support a wide range of technologies designed for multicluster networking.

Contributor

Ok, but we should at least have some recommendations in the documentation.


- **Multiple Kubernetes Clusters**: Stretch Kafka clusters will require multiple Kubernetes clusters. Ideally, an odd number of clusters (at least three) is needed to maintain quorum in the event of a cluster outage.

- **Low Latency**: Kafka clusters should be deployed in environments that allow low-latency communication between Kafka brokers and controllers. Stretch Kafka clusters should be deployed in environments such as data centers or availability zones within a single region, and not across distant regions where high latency could impair performance.
Contributor

The network between data centers could have significant levels of jitter and/or packet loss, so I think we should rather talk about predictable and stable low-latency (p99s TCP round-trip?) and high-bandwidth connections between relatively close data centers.

Should we clearly define regions (e.g. separate geographic areas) and availability zones (e.g. geographically close data centers) to avoid any confusion?

Should we provide some numbers such as optimal and maximum latency values? I guess that would be a common question, so it may be better to have it documented somewhere. Wdyt?


"Relatively close data centers" is exactly what I had in mind. Terms like data center, region, and availability zone can sometimes be used to mean different things, so I tried to avoid them, but we can call out upfront what we mean by those terms in this proposal, and that should help in creating a common understanding.


### Design

The cluster operator will be deployed in all Kubernetes clusters and will manage the Kafka brokers/controllers running on that cluster. One Kubernetes cluster will act as the control point for defining the custom resources (Kafka, KafkaNodePool) required for the stretch Kafka cluster. The KafkaNodePool custom resource will be extended to include information about the Kubernetes cluster where the pool should be deployed. The cluster operator will create the necessary resources (StrimziPodSets, services, etc.) on the target clusters specified within the KafkaNodePool resource.
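For illustration, here is a minimal sketch of the extended KafkaNodePool shape being discussed; the `target` field names follow the proposal excerpt quoted later in this thread, and all values are placeholders rather than an agreed API.

```yaml
# Sketch only: a node pool pinned to one target Kubernetes cluster.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: broker-a
  labels:
    strimzi.io/cluster: my-cluster
spec:
  replicas: 3
  roles:
    - broker
  storage:
    type: ephemeral
  target:                          # proposed extension discussed below
    clusterUrl: <K8S Cluster URL>
    secret: <SecretName>
```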
Contributor

@fvaleri fvaleri Sep 6, 2024

What happens if the single control point fails and cannot be restored? Should we also deploy the Kafka and NodePool CRs to the other Kubernetes clusters, and make the operators running there act as standby control points?

Author

@aswinayyolath aswinayyolath Sep 26, 2024

You are absolutely correct that if the control point fails and cannot be restored, this model doesn't allow any modifications, as the control point is where the Kafka CR is defined. While the existing setup would continue to operate, no further changes could be made once the central cluster goes down. In such a case, restoring the central cluster would be the only option to modify the existing deployment.

We did initially consider the idea of standby control points during the design phase but ultimately moved it to the rejected alternatives due to the complexity involved in coordinating between Cluster Operators (COs). The original idea was to have a standby Kafka CR in all participating Kubernetes clusters, and if a central cluster outage is detected, one of the standby Kafka CRs in another Kubernetes cluster would assume leadership and begin the reconciliation process.

This approach is similar to how Strimzi currently supports multiple COs, where one operator is in standby mode and can acquire the lease to take over if the primary operator crashes. However, implementing this for Kafka CRs across clusters requires complex coordination mechanisms, which led us to move away from this design.
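(For context, Strimzi's existing standby support relies on Kubernetes leader election: the operator replicas compete for a `coordination.k8s.io` Lease roughly like the sketch below. The rejected idea would have needed a comparable lease spanning Kubernetes clusters. The lease name matches Strimzi's default; the holder identity and timestamps are illustrative.)

```yaml
# Rough sketch of the Lease object Kubernetes leader election uses.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: strimzi-cluster-operator
  namespace: myproject
spec:
  holderIdentity: strimzi-cluster-operator-7d96cbff8-abcde   # current leader pod
  leaseDurationSeconds: 15
  acquireTime: "2024-09-26T10:00:00.000000Z"
  renewTime: "2024-09-26T10:05:00.000000Z"
```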

Author

The proposed model is similar to how cross-cluster technologies like Submariner work. For example, in Submariner, the unavailability of the broker cluster (Submariner uses a central Broker component to facilitate the exchange of metadata between the Gateway Engines deployed in participating clusters) does not impact the operation of the data plane in the participating clusters. The data plane continues to route traffic using the last known information while the broker is offline. However, during this time, control plane components won’t be able to share or receive new updates between clusters. Once the connection to the broker is restored, all components automatically re-synchronize with the broker and update the data plane if needed.


## Motivation

By distributing Kafka nodes across multiple clusters, a stretch Kafka cluster can tolerate outages of individual Kubernetes clusters and will continue to serve clients seamlessly even if one of the clusters goes down.
Contributor

I would add that another benefit of a stretch Kafka cluster over using MM2 is strong data durability thanks to synchronous replication, and fast disaster recovery with automated client failover.

annotations:
strimzi.io/node-pools: enabled
strimzi.io/kraft: enabled
strimzi.io/stretch-mode: enabled
Contributor

Should we move this annotation to NodePools?

Author

Thank you for the suggestion! May I ask why moving the stretch-mode annotation to the NodePools would be a good idea?

We think that adding the stretch-mode annotation in the Kafka CR makes sense because it clearly represents a global configuration that applies to the entire Kafka cluster. By placing it in the Kafka CR, it signals that the entire cluster is operating in stretch mode, affecting how brokers, controllers, and listeners are handled across multiple Kubernetes clusters.

Having this configuration at the Kafka level also makes it easier to manage and audit, as it is immediately visible from the main Kafka resource. This avoids scattering critical configurations across multiple NodePool resources, which could lead to complexity when maintaining or troubleshooting the cluster. Additionally, stretch mode is fundamentally a cluster-wide behavior rather than something that is specific to individual node pools, so we believe the Kafka CR is the most appropriate place to define it.

Member

A pool cannot be stretched AFAIU from the proposal, so I think the annotation belongs to the Kafka custom resource. Having it on the node pool would suggest that the pods for that specific pool are stretched, which should not be the case.

Author

I completely agree with your point. By keeping the annotation in the Kafka CR, we ensure that the stretch configuration remains cluster-wide, clearly indicating that it applies across all nodes and resources. This also simplifies the management and understanding of the cluster’s operational mode.

Comment on lines +54 to +83
listenerConfig:
- configuration:
Contributor

@fvaleri fvaleri Sep 6, 2024

Is this only for inter-broker communication discovery?
Why do we need multiple configurations per NodePool?

@neeraj-laad neeraj-laad Sep 6, 2024

My understanding of this area is not as good, so I might have misunderstood this, but I was thinking of scenarios where there might be a need for these to be different:

  • Kubernetes clusters might use different ingress controllers.
  • One Kubernetes cluster wants to use Ingress but the other wants to use LoadBalancer.
  • Host configuration for the bootstrap address on each Kubernetes cluster.

Is there a neater way to do this from the Kafka resource itself? If so, we can remove this.

Member

We do not support exposing the first broker with a load balancer, the second with Ingress, and the third with node ports. So why is it needed here? I think that expecting the Kubernetes clusters to all have a comparable setup and infrastructure is a reasonable prerequisite which might simplify things for you.

This whole thing is IMHO also a super niche use-case. So I think we need to be careful about what kind of testing and maintenance surface this creates.


```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
Contributor

It would help to show how the reconciled status would look for a stretch cluster.

Member

@scholzj scholzj left a comment

Thanks for the proposal. I left some comments.

But TBH, I do not think the level of depth it has is anywhere near where it would need to be to approve or not approve anything. It is just a super high-level idea that, without the implementation details, cannot be correct or wrong. We cannot approve some API changes and then try to figure out how to implement the code around it. It needs to go hand in hand.

It also almost completely ignores the networking part, which is the most complicated part. It needs to cover how the different mechanisms will be supported and handled, as we should be able to integrate into the cloud native landscape and fit in with the tools already being used in this area. Relying purely on something like Ingress is not enough. So the proposal needs to cover how this will be handled and how we ensure the extensibility of this.

It would be also nice to cover topics such as:

  • How will the installation be handled both on the side clusters as well as on the main Kubernetes cluster
  • Testing strategy (how and where will we test this given our resources)


At present, the availability of Strimzi-managed Kafka clusters is directly tied to the availability of the underlying Kubernetes cluster. If a Kubernetes cluster experiences an outage, the entire Kafka cluster becomes unavailable, disrupting all connected Kafka clients.

## Motivation
Member

I think this section would deserve a bit more attention. There are also some other use-cases worth mentioning such as moving the Kafka cluster between Kubernetes clusters etc.

You should also describe the limitations and issues it brings:
* Increased network unreliability and costs
* Requirement for a limited distance between the clusters (e.g. what is the minimal expected latency between the Kubernetes clusters required for this?)

Author

I have made some modifications to this section. Could you please take a look?


### Design

The cluster operator will be deployed in all Kubernetes clusters and will manage the Kafka brokers/controllers running on that cluster. One Kubernetes cluster will act as the control point for defining the custom resources (Kafka, KafkaNodePool) required for the stretch Kafka cluster. The KafkaNodePool custom resource will be extended to include information about the Kubernetes cluster where the pool should be deployed. The cluster operator will create the necessary resources (StrimziPodSets, services, etc.) on the target clusters specified within the KafkaNodePool resource.
Member

I think this is the wrong model. The operator on the individual clusters should run only as the PodSet controller. Only the main operator in the "central" cluster with the custom resources will do the actual management of the Kafka nodes and will be responsible for managing things such as services, secrets, rolling pods etc. This is the assumption built into the design of things such as StrimziPodSets or node pools.


Only running the StrimziPodSet reconcilers on other clusters is what we had in mind; all other aspects are created by the central cluster operator that manages the Kafka and KafkaNodePool resources and has a complete view of the entire stretch cluster.

Perhaps we didn't articulate/explain it clearly enough. Will try to clarify that.

Comment on lines +51 to +81
target:
clusterUrl: <K8S Cluster URL>
secret: <SecretName>
Member

The related Kubernetes cluster should be IMHO defined in the CO Deployment (via Env Vars, Secrets etc.). The KafkaNodePool should only include the name / alias for the Kubernetes cluster as defined in the CO.

Author

Here’s what I understand from your suggestion. Is my understanding correct?

Using env vars (Option 1)

Add the cluster information as env vars in the CO Deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: strimzi-cluster-operator
  namespace: myproject
spec:
  template:
    spec:
      containers:
        - name: strimzi-cluster-operator
          env:
            - name: CLUSTER_A_URL
              value: "<K8S_CLUSTER_A_URL>"
            - name: CLUSTER_A_SECRET_NAME
              value: "<SecretNameA>"
            - name: CLUSTER_B_URL
              value: "<K8S_CLUSTER_B_URL>"
            - name: CLUSTER_B_SECRET_NAME
              value: "<SecretNameB>"
         ................................
         ...............................
```

(Option 2) Alternatively, we can define a CM for cluster URLs and a Secret for sensitive information, like:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-info
  namespace: myproject
data:
  clusters.yaml: |
    clusters:
      - name: cluster-a
        url: <K8S_CLUSTER_A_URL>
      - name: cluster-b
        url: <K8S_CLUSTER_B_URL>
---
apiVersion: v1
kind: Secret
metadata:
  name: cluster-secrets
  namespace: myproject
type: Opaque
data:
  cluster-a-secret: <base64-encoded-secret-for-cluster-a>
  cluster-b-secret: <base64-encoded-secret-for-cluster-b>
```

Then mount the CM and Secret into the CO Deployment:

```yaml
spec:
  template:
    spec:
      containers:
        - name: strimzi-cluster-operator
          envFrom:
            - configMapRef:
                name: cluster-info
            - secretRef:
                name: cluster-secrets
```

Now reference clusters by alias in the KNP CR, like:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: controller
  labels:
    strimzi.io/cluster: my-cluster
spec:
  replicas: 3
  target:
    clusterAlias: "cluster-a"  # Referencing the alias defined in CO
```

Member

Well, when using the environment variable, you should IMHO consider using a single environment variable with some map similar to what we use for images.

But the main problem is that we cannot decide on some API change before knowing the implementation details. The best way to configure it depends on how it will work:

  • How will users create these accounts on various Kube distributions? How will it work on OpenShift? How about AKS, EKS, GKE, Rancher?
  • What are these credentials? Are they kubeconfig files? Should they be an API server URL + token?
  • Are these long-term credentials? Short-term credentials that will be changing?
  • How will these credentials be used in the code to create and share the clients for the different clusters?
  • What will the RBACs of these clients look like on the remote clusters?

That needs to be clarified and designed. And that should drive the optimal outcome of how the API will look like.
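For reference only, here is a sketch of what a single map-style environment variable could look like, loosely modelled on the version-to-image maps Strimzi already uses; the variable name and value format are assumptions rather than an agreed API, and Deployment boilerplate is trimmed.

```yaml
# Hypothetical: one env var carrying an alias -> endpoint/credentials map,
# instead of a pair of env vars per remote cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: strimzi-cluster-operator
  namespace: myproject
spec:
  template:
    spec:
      containers:
        - name: strimzi-cluster-operator
          env:
            - name: STRIMZI_STRETCH_REMOTE_CLUSTERS   # assumed name
              value: |
                cluster-a=https://cluster-a.example.com:6443;secret=cluster-a-credentials
                cluster-b=https://cluster-b.example.com:6443;secret=cluster-b-credentials
```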

target:
clusterUrl: <K8S Cluster URL>
secret: <SecretName>
listenerConfig:
Member

I don't understand what the function of this is. The listeners should continue to be configured centrally.


My understanding of this area is not as good, so I might have misunderstood this, but I was thinking of scenarios where there might be a need for these to be different:

  • Kubernetes clusters might use different ingress controllers.
  • One Kubernetes cluster wants to use Ingress but the other wants to use LoadBalancer.
  • Host configuration for the bootstrap address on each Kubernetes cluster.

Is there a neater way to do this from the Kafka resource itself? If so, we can remove this.

type: ingress
```

A new annotation (`stretch-mode: enabled`) will be introduced in the Kafka custom resource to indicate when it represents a stretch Kafka cluster. This approach is similar to how Strimzi currently enables features like KafkaNodePool (KNP) and KRaft mode.
Member

  • Should there be a feature gate instead?
  • Why is the annotation needed? Shouldn't the nature of the cluster be clear from the target configurations in the node pools?

Author

The idea behind introducing the stretch-mode: enabled annotation in the Kafka custom resource was to explicitly signal when a Kafka cluster is operating in stretch mode. This would serve as a clear, simple indicator for users and tools that the Kafka cluster is spanning multiple Kube clusters, similar to how other Strimzi features like KNP and KRaft mode are enabled.

However, I understand the point about whether a feature gate might be more appropriate and whether the nature of the cluster can be inferred from the configurations in the KNP CR. The reasoning for the annotation was to provide a straightforward and unambiguous flag to identify stretch clusters. It could be beneficial when managing clusters in complex environments where users might want a quick way to distinguish between regular and stretch setups.

That said, I agree that the stretch configuration could potentially be inferred directly from the target configurations in the KNP resources. If we remove the annotation, the reconciler could look at the KNP definitions to determine whether the cluster is stretched, without the need for an explicit flag.

In a stretch Kafka cluster, we'll need bootstrap and broker services to be present on each Kubernetes cluster and be accessible from other clusters. The Kafka reconciler will identify all target clusters from KafkaNodePool resources and create these services in target Kubernetes clusters. This will ensure that even if the central cluster experiences an outage, external clients can still connect to the stretch cluster and continue their operations without interruption.

#### Cross-cluster communication
Kafka controllers/brokers are distributed across multiple Kubernetes environments and will need to communicate with each other. Currently, the Strimzi Kafka operator defines Kafka listeners for internal communication (controlplane and replication) between brokers/controllers (Kubernetes services using ports 9090 and 9091). The user is not able to influence how these services are set up and exposed outside the cluster. We would remove this limitation and allow users to define how these internal listeners are configured in the Kafka resource, just like they do for Kafka client listeners.
Member

How would you allow it?


The thinking was that the Kafka reconciler will detect this is a stretch cluster and, if so, relax the restriction we currently have in place so that the minimum listener port becomes 9090 (instead of 9092). Users can then define listeners for 9090 and 9091 too.
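To make that concrete, here is a rough sketch of what user-defined internal listeners could look like if that restriction were relaxed. This is not valid configuration today (Strimzi reserves ports 9090/9091), and the listener names, types, and the omission of host/ingress details are assumptions.

```yaml
# Hypothetical only: assumes a stretch-mode reconciler relaxes the current
# minimum-port check so the control plane and replication listeners can be
# declared (and exposed) like client listeners.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    listeners:
      - name: ctrlplane       # assumed name for the control plane listener
        port: 9090
        type: ingress
        tls: true
      - name: replication     # assumed name for the replication listener
        port: 9091
        type: ingress
        tls: true
```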

Member

I think we still need to be in control of those. The things such as security etc. should remain under our control. We will of course need some additional configs to define the networking used for the stretch cluster etc. But I do not think it is as simple as freeing the port numbers.

#### Cross-cluster communication
Kafka controllers/brokers are distributed across multiple Kubernetes environments and will need to communicate with each other. Currently, the Strimzi Kafka operator defines Kafka listeners for internal communication (controlplane and replication) between brokers/controllers (Kubernetes services using ports 9090 and 9091). The user is not able to influence how these services are set up and exposed outside the cluster. We would remove this limitation and allow users to define how these internal listeners are configured in the Kafka resource, just like they do for Kafka client listeners.

Users will also be able to override listener configurations in each KafkaNodePool resource, if the listeners need to be exposed in different ways (ingress host names, Ingress annotations, etc.) for each Kubernetes cluster. This will be similar to how KafkaNodePools are used to override other configuration such as storage. To override a listener, the KafkaNodePool will define configuration with the same listener name as in the Kafka resource.
Member

This adds crazy complexity. You should have one mechanism shared for the whole cluster. Not a different mechanism per-node-pool.


My understanding of this area is not as good, so I might have misunderstood this, but I was thinking of scenarios where there might be a need for these to be different: host configuration for Ingress, DNS, Ingress-related annotations, etc.

Is there a way to define such variations from within a single Kafka resource? Or can we put prerequisites in place that mean the same configuration can be used on each cluster? That would be much simpler.

Member

Well, I suggested it in the other threads -> I think having clear prerequisites that the clusters support the same mechanisms makes sense to me. But as I also said, you need to think outside the box here. It is not about load balancers or Ingresses. It is (also) about things such as Submariner, Skupper, Istio Federation etc. that would overlay the Kubernetes clusters and abstract from us what exactly they use underneath.


We did consider some of these and tried technologies like Skupper, and they do make networking much simpler, as services on one cluster are visible on another cluster, but they also add a new dependency to the project and will need upfront setup from customers. So there are trade-offs.

I have briefly touched upon Skupper in the alternatives towards the bottom of the proposal, but I will elaborate more and we can reconsider if that is a cleaner approach than what is laid out here currently.

Member

But I think these things are what users are asking for - at least some of them. Because they already use them for other services. And you do not really want each piece of software to use its own way of doing things. And you likely also don't want to have ingresses or load balancers for every single project. So they need to be considered, at least to the extent of making things extensible and being ready for them.

Author

@aswinayyolath aswinayyolath Sep 26, 2024

I’ve been looking into technologies like Skupper and Submariner for cross-cluster communication. The idea is that Submariner/Skupper can help simplify communication between Kafka brokers and controllers across clusters by exposing services across Kubernetes clusters, avoiding the need for complex external listeners. IMO, here’s how it could work:

  • Submariner/Skupper connects Kubernetes clusters, making it easier for Kafka brokers and controllers to communicate directly using their native IPs and DNS names across clusters. This is especially important for Kafka’s internal coordination and replication traffic.
  • Submariner/Skupper can extend Kafka’s internal listeners (used for broker coordination and replication) to work across clusters, making the communication between brokers seamless without needing to configure external services like load balancers or Ingress.
  • Submariner/Skupper uses encrypted tunnels for communication between clusters. Since Kafka already requires encrypted communication for its internal traffic, Submariner can enhance security without needing extra setup for mTLS between clusters.

Set Up Submariner

  • Install Submariner across the clusters where Kafka brokers are running.
  • Ensure cross-cluster connectivity is established for the namespaces running Strimzi Kafka.

Modify Kafka Listeners

  • Adjust the Kafka internal listeners to support cross-cluster communication through Submariner.

Secure Communication

  • Submariner’s encrypted tunnels will automatically secure communication, minimizing the need for additional mTLS configuration.

Author

Is the expectation to support multiple technologies for enabling cross-cluster communication?
For example:

  • Kubernetes-native solutions like Ingresses and Load Balancers
  • Skupper
  • Submariner
  • Istio Federation
  • Linkerd Service Mesh
  • Consul Connect
  • Cilium
  • ..............
  • ..............

Will there be any preference for one solution over the others?

Author

@aswinayyolath aswinayyolath Nov 12, 2024

As per the discussion here, we explored cross-cluster technologies such as Submariner and Skupper as potential solutions for achieving communication between Kafka brokers and controllers that are distributed across multiple Kubernetes environments. These tools facilitate cross-cluster communication by overlaying Kubernetes clusters without relying solely on traditional methods like LoadBalancers or Ingresses.

As part of our investigation, we manually deployed Kafka pods and necessary configuration, similar to what the Strimzi Kafka operator would normally do. The goal of these experiments was to evaluate the suitability, reliability, and overall feasibility of each technology, assessing their strengths and limitations for our use case.

To enable cross-cluster communication for a stretch Kafka cluster, the advertised.listeners configuration needed to be adapted. In its default form, Strimzi creates headless services that support communication within a single Kubernetes cluster, using addresses such as my-cluster-broker-0.my-cluster-kafka-brokers.svc.cluster.local. However, these default service addresses are not accessible across cluster boundaries.

When multiple Kubernetes clusters are connected using Submariner, the broker/controller services of type ClusterIP can be exported using Submariner to make them accessible to other clusters in the network.

By running the command subctl export service --kubeconfig <CONFIG> --namespace <NAMESPACE> my-cluster-kafka-brokers, we create a ServiceExport resource in the specified namespace. This resource signals Submariner to register the service with the Submariner Broker. The Broker acts as the coordinator for cross-cluster service discovery, leveraging the Lighthouse component to allow services in different clusters to find and communicate with each other. This process results in secure IP routing being configured between clusters. Submariner sets up tunnels and routing tables that enable direct traffic flow, overcoming the limitations of isolated cluster networks.
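For reference, that subctl command creates a ServiceExport object (Multi-Cluster Services API) for the named service, roughly equivalent to applying the following; the namespace shown is illustrative.

```yaml
# Applying this directly has the same effect as `subctl export service`.
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: my-cluster-kafka-brokers
  namespace: strimzi
```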

Once the service is exported, its fully qualified domain name becomes accessible as <service-name>.<namespace>.svc.clusterset.local. This global DNS name ensures that any cluster participating in the Submariner deployment can reach the service, facilitating the cross-cluster communication needed for Kafka brokers and controllers. For example, the advertised.listeners configuration was updated from my-cluster-broker-0.my-cluster-kafka-brokers.svc.cluster.local to my-cluster-broker-0.cluster1.my-cluster-kafka-brokers.svc.clusterset.local, where cluster1 represents the Submariner cluster ID. This update ensures that when a Kafka broker sends its advertised listeners to clients or other brokers, they will receive a service address that is reachable from any cluster involved in the setup.

Similar changes are required for the controller.quorum.voters property as well.

SSL hostname verification between pods relies on SAN (Subject Alternative Name) entries in the certificates provided to the pods. For this verification to function in a stretched Kafka cluster using Submariner, the FQDNs (Fully Qualified Domain Names) of the Submariner-exported services need to be included in the pod certificates. This can be accomplished in two main ways:

The first method allows users to define the SANs for the brokers directly through the Kafka CR's listener configuration property. Users can input the Submariner-exported FQDNs in this field, which ensures the brokers inject these SANs into their certificates. For example, if there are two k8s clusters with four brokers (broker-0, broker-1, broker-100, broker-101), the listener configuration might look like this:

```yaml
listeners:
  - name: tls
    port: 9093
    type: internal
    tls: true
    configuration:
      bootstrap:
        alternativeNames:
          - my-cluster-broker-0.cluster2.my-cluster-kafka-brokers.strimzi.svc.clusterset.local
          - my-cluster-broker-1.cluster2.my-cluster-kafka-brokers.strimzi.svc.clusterset.local
          - my-cluster-broker-100.cluster2.my-cluster-kafka-brokers.strimzi.svc.clusterset.local
          - my-cluster-broker-101.cluster2.my-cluster-kafka-brokers.strimzi.svc.clusterset.local
```

Although this approach works, it injects the FQDN of every broker into every broker's certificate, which is not ideal.

Controller pods do not follow this approach because they do not inherit listener configurations from the Kafka CR. Instead, they use a single control plane listener (TCP 9090). To make this work for controller pods, users would need to configure the control plane listener in the CR to include the necessary SANs; Strimzi currently doesn't support this, and it is not considered optimal.

A better approach is for the operator to automatically read the Submariner cluster ID from the CR (the CR should be extended so that the user can provide the Submariner cluster ID) and create the SAN entries in these formats:

<KAFKA-CLUSTER-NAME>-kafka-brokers.<NAMESPACE>.svc.clusterset.local
<SUBMARINER-CLUSTER-ID>.<KAFKA-CLUSTER-NAME>-kafka-brokers.<NAMESPACE>.svc.clusterset.local
<POD-NAME>.<SUBMARINER-CLUSTER-ID>.<KAFKA-CLUSTER-NAME>-kafka-brokers.<NAMESPACE>.svc.clusterset.local

This second approach is preferred as it simplifies the process for users, sparing them from manually adding SANs to the CR and reducing configuration complexity.
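One possible shape for that CR extension is sketched below; the field name and its placement under `target` are assumptions for illustration only, not part of the proposal text.

```yaml
# Hypothetical extension: the operator would read the Submariner cluster ID
# from here and derive the clusterset.local SAN entries itself.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: broker-cluster1
  labels:
    strimzi.io/cluster: my-cluster
spec:
  replicas: 2
  roles:
    - broker
  target:
    clusterAlias: cluster-a           # as discussed in the thread above
    submarinerClusterId: cluster1     # assumed field name
```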

In summary, the following changes will be needed:

Advertised Listener Configuration

  • The advertised.listeners property should reference the Submariner-exported service. For example, the configuration should be updated as follows:

  • From: my-cluster-broker-0.my-cluster-kafka-brokers.svc.cluster.local

  • To: my-cluster-broker-0.cluster1.my-cluster-kafka-brokers.svc.clusterset.local

Controller Quorum Voters

  • Similar updates need to be made for the controller.quorum.voters setting to ensure it points to the Submariner-exposed service.

SANs (Subject Alternative Names)

  • All broker and controller pods must include SAN entries for the Submariner-exported service.

I will update the proposal to include detailed explanations of these changes and potential implementation details.

Comment on lines 114 to 127
#### Secrets
We need to create Kubernetes Secrets in the central cluster that will store the credentials required for creating resources on the target clusters. These secrets will be referenced in the KafkaNodePool custom resource.
Member

I'm not sure I follow this.

Author

The idea is to manage the credentials needed for creating / managing resources (like svc, netpol, SPS) on the target clusters. These Secrets would store the necessary authentication data (e.g., API tokens/certificates, Kubeconfig etc ) required to communicate securely between the central cluster and the target clusters.

By referencing these Secrets in the KNP CR, users can ensure that the appropriate credentials are automatically used for any cross-cluster operations (mainly deploying SPS in the target cluster). This helps centralize credential management, providing a consistent way to securely authenticate with target clusters.

The reason for referencing these credentials in the KNP is that, just as KNPs allow for different configurations per pool (like storage), they could also handle the specific credentials for cross-cluster resource creation.
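For illustration, here is one possible shape of such a Secret, assuming a kubeconfig is stored; whether it should be a kubeconfig or an API server URL plus token is one of the open questions raised above, and all names and values are placeholders.

```yaml
# Illustrative only: credentials for one target cluster, held in the central
# cluster and referenced from the KafkaNodePool.
apiVersion: v1
kind: Secret
metadata:
  name: cluster-a-credentials
  namespace: myproject
type: Opaque
stringData:
  kubeconfig: |
    apiVersion: v1
    kind: Config
    clusters:
      - name: cluster-a
        cluster:
          server: https://cluster-a.example.com:6443
    users:
      - name: strimzi-cluster-operator
        user:
          token: <service-account-token>
    contexts:
      - name: cluster-a
        context:
          cluster: cluster-a
          user: strimzi-cluster-operator
    current-context: cluster-a
```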

Comment on lines 117 to 130
#### Entity operator
We would recommend that all KafkaTopic and KafkaUser resources are managed from the cluster that holds the Kafka and KafkaNodePool resources, and that should be the cluster where the entity operator is enabled. This will allow all resource management/configuration from a central place. The entity operator should not be impacted by changes in this proposal.
Member

The UO and TO might not be impacted directly in their source code. But the way they are deployed will for sure be impacted, as you need to clarify how they will connect to the Kafka cluster.


In addition to improving fault tolerance, this approach also facilitates other valuable use cases, such as:

- **Migration Flexibility**: The ability to move Kafka clusters between Kubernetes environments without downtime, supporting maintenance or migrations.
Member

I would consider not just moving an entire Kafka cluster between Kubernetes envs, but also moving only some nodes of the Kafka cluster itself, or?

### Prerequisites

- **Multiple Kubernetes Clusters**: Stretch Kafka clusters will require multiple Kubernetes clusters.
Ideally, an odd number of clusters (at least three) is needed to maintain quorum in the event of a cluster outage.
Member

Which "quorum" are you referring to here?


### Design

The cluster operator will be deployed in all Kubernetes clusters and will manage Kafka brokers/controllers running on that cluster.
Member

But reconciling-kafka-knp.png shows just one running.

annotations:
strimzi.io/node-pools: enabled
strimzi.io/kraft: enabled
strimzi.io/stretch-mode: enabled
Member

A pool cannot be stretched AFAIU from the proposal, so I think the annotation belongs to the Kafka custom resource. Having it on the node pool would suggest that the pods for that specific pool are stretched, which should not be the case.

The operators will then create necessary resources in target Kubernetes clusters, which can then be reconciled/managed by operators on those clusters.

### Reconciling Kafka and KafkaNodePool resources
![Reconciling Kafka and KafkaNodePool resources](./images/083-reconciling-kafka-knp.png)
Member

From this picture, my understanding is that there is just one cluster operator running in the one Kube cluster where you deploy the custom resources ... while the other one has more operators (one for each cluster). I think just one picture would be enough ... AFAIU you envisage the other operators just handling SPS but not other custom resources? Or can they be used to handle local clusters as well?
I was wondering if just one operator is enough and whether it can reconcile SPS on other Kube clusters just by talking to the right remote Kube API.

aswinayyolath added a commit to aswinayyolath/proposals that referenced this pull request Nov 15, 2024
…tion

Added details about how to use Submariner for cross cluster communication

Contributes to: strimzi#129

Signed-off-by: Aswin A <[email protected]>
katheris and others added 7 commits November 18, 2024 19:42
* MirrorMaker Connector Offsets Support

Signed-off-by: Katherine Stanley <[email protected]>

* Address scholzj review comments

* Clarify motivation wording.
* Move ConfigMap configuration to connector
level property.

Signed-off-by: Katherine Stanley <[email protected]>

* Scale back validation and allow multiple connectors

* Scale back the validation of input for
altering offsets to only validate the JSON
is syntactically correct, rather than validating
fields.

* Allow multiple connectors to be selected using
the mirrormaker-connector annotation.

Signed-off-by: Katherine Stanley <[email protected]>

* Address scholzj review comments

* Replace -> with -- in ConfigMap entry name.
* Add rejected alternatives section about
preventing users altering/reseting offsets for
MirrorCheckpointConnector and MirrorHeartbeatConnector.

Signed-off-by: Katherine Stanley <[email protected]>

* Update proposal to require single connector specified

Signed-off-by: Katherine Stanley <[email protected]>

* Address scholzj review comment

* Remove comma separated connector name example

Co-authored-by: Jakub Scholz <[email protected]>
Signed-off-by: Kate Stanley <[email protected]>

* Address mimaison review comments

Signed-off-by: Katherine Stanley <[email protected]>

* Address PaulRMellor review comments

Signed-off-by: Katherine Stanley <[email protected]>

* Prepare for merging

Signed-off-by: Jakub Scholz <[email protected]>

---------

Signed-off-by: Katherine Stanley <[email protected]>
Signed-off-by: Kate Stanley <[email protected]>
Signed-off-by: Jakub Scholz <[email protected]>
Co-authored-by: Jakub Scholz <[email protected]>
Signed-off-by: Aswin A <[email protected]>
Signed-off-by: Aswin A <[email protected]>
Moved sentences to separate lines to help with reviews

Signed-off-by: neeraj-laad <[email protected]>
Signed-off-by: Aswin A <[email protected]>
…tion

Added details about how to use Submariner for cross cluster communication

Contributes to: strimzi#129

Signed-off-by: Aswin A <[email protected]>
Editorial fixes

Signed-off-by: neeraj-laad <[email protected]>
Signed-off-by: Aswin A <[email protected]>