Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-3328: SimpleAclAuthorizer can lose ACLs with frequent add/remov… #1006

Closed
wants to merge 18 commits into from

Conversation

granthenke
Copy link
Member

…e calls

Changes the SimpleAclAuthorizer to:

  • Track and utilize the zookeeper version when updating zookeeper to prevent data loss in the case of stale reads and race conditions
  • Update local cache when modifying ACLs
  • Add debug logging

…e calls

Changes the SimpleAclAuthorizer to:
- Always read state from Zookeeper before updating acls
- Update local cache when modifying acls
@@ -46,14 +48,18 @@ class SimpleAclAuthorizerTest extends ZooKeeperTestHarness {
val props = TestUtils.createBrokerConfig(0, zkConnect)
props.put(SimpleAclAuthorizer.SuperUsersProp, superUsers)

Logger.getLogger(classOf[SimpleAclAuthorizer]).setLevel(Level.DEBUG)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidental?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, I do that sometimes to view various logging statements for testing. I removed the debug statements, so I should remove this.

@granthenke
Copy link
Member Author

This change will now make a call to zookeeper for every add/remove call. I am assuming that the volume of ACL modification requests to the authorizer would be quite low, therefore this is okay. If thats not the case, let me know.

I think this issue can still occur with multiple Zookeeper nodes in the case where the client gets a stale zookeeper read. I am not sure the best way to handle this with Zookeeper and our packaged client. @fpj do you have any recommendations on making sure the "read, update, write" pattern is synchronized/safe? Do we need a completely different approach? (links or research material are welcome)

TestUtils.waitAndVerifyAcls(Set(acl1, acl2), simpleAclAuthorizer2, commonResource)
}


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a test for adding acl from one authorizer and removing from other and making sure it gets reflected.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add something like that, but will need to think through it.

I used 2 additions, because it clearly shows "whats missing". I can add the test for the add & delete but in the case where the add was "dropped" checking for no acl would still be true.

@granthenke
Copy link
Member Author

@ijuma @SinghAsDev @fpj I updated the patch to be "safe". So that stale reads and race conditions are handled. It does this by tracking its expected zookeeper version and handling write failures. Its also more optimistic about its cached values instead of reading from zookeeper every time. I also added a test that concurrently updates the same node with 110 mixed requests.

This definitely makes the SimpleAclAuthorizer more safe, but perhaps less "simple".

@granthenke
Copy link
Member Author

@ijuma @Parth-Brahmbhatt @SinghAsDev @fpj @ewencp @gwenshap
Could any of you review this today?

@fpj
Copy link
Contributor

fpj commented Mar 7, 2016

@granthenke I'm not sure I'll have time to get to it today, but I can review it by wed. To answer your question above, it doesn't matter if the read is stale if you're using a conditional set, then the write won't go through if the version isn't the one expected. In the worst case, the broker will have to try it again.

One thing you can goto reduce the probability of a stale read is to issue a sync request right before the read. This way you'll be forcing all pending updates from the leader to the follower to go through. You can't completely avoid the issue though because you can still have concurrent writes even when syncing.

@granthenke
Copy link
Member Author

@fpj Thanks for the follow up and confirmation. Using the versioned requests is the approach I took in the "rewrite" after my initial question where I mentioned you.

I had considered sending a sync request, and would still like to add that, but unfortunately the ZkClient we use does not support it. It also doesn't support a versioned delete request, which is unfortunate. Perhaps I should open a PR for that.

@fpj
Copy link
Contributor

fpj commented Mar 7, 2016

@granthenke I'll see if I can suggest something concrete, but in the past I've bypassed zkclient and used the zk handle directly, it might be doable in your case as well.

@@ -71,6 +69,8 @@ class SimpleAclAuthorizer extends Authorizer with Logging {
private var zkUtils: ZkUtils = null
private var aclChangeListener: ZkNodeChangeNotificationListener = null

private val NoVersion = -2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, why -2?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should clean, that up. I just needed some way to indicated there is no version yet, and -1 in the Zookeeper api means "ignore the version". I can rework the code to eliminate this, or use Option/None.

getAclsFromZk(resource)
var newVersionedAcls = currentVersionedAcls
var writeComplete = false
while (!writeComplete) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why you want to spin it until the write is complete. If there is an error and the operation does not complete, then perhaps it should be up to the caller to call it again. Could you elaborate on the case you have in mind and that makes this while loop necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This goal of this while loop is to retry the update to zookeeper if we fail based on the zookeeper version. The zookeeper version could be incorrect if the cached version in this instance is not up to date. That is most likely to occur if there are concurrent updates on a separate instance of the authorizer.

If the failure is based on anything other than version, or the handled exceptions. Then an exception will be thrown and propagated back to the user.

Ideally I would use the sync call to prevent any "spinning" but the ZkClient we use does not have that method (sgroschupf/zkclient#44).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with sync is that it is an asynchronous call and I believe ZkClient only supports the synchronous API. I don't think it is worth using the sync call here because it won't stop concurrent accesses anyway. When we move to the async API for the controller, then perhaps we can make this change here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes sense. Thanks for clarifying.

@granthenke
Copy link
Member Author

@fpj Thanks for the review. I updated the patch to use the versioned delete and handle the empty data edge cases.

@granthenke
Copy link
Member Author

@fpj Any further comments? @ewencp @gwenshap @junrao Could one of you review?

}

private def getAclsFromCache(resource: Resource): VersionedAcls = {
aclCache.get(resource).getOrElse(throw new IllegalArgumentException(s"Acls do not exist in the cache for resource $resource"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Map also has a getOrElse so you can replace get followed by getOrElse with a single getOrElse.

@@ -71,7 +70,8 @@ class SimpleAclAuthorizer extends Authorizer with Logging {
private var zkUtils: ZkUtils = null
private var aclChangeListener: ZkNodeChangeNotificationListener = null

private val aclCache = new scala.collection.mutable.HashMap[Resource, Set[Acl]]
private case class VersionedAcls(acls: Set[Acl], zkVersion: Int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be in the SimpleAclAuthorizer object since it does not need access to the instance.

@fpj
Copy link
Contributor

fpj commented Mar 15, 2016

@granthenke I was proposing two patches to separate concerns, but you're right that we could just do it all here.

@granthenke
Copy link
Member Author

@fpj I can do it in 2 steps. I will open a jira to track the upgrade change and make this depend on it.

@granthenke
Copy link
Member Author

@fpj I created KAFKA-3403 to track the upgrade. I tested zkclient locally and everything passes.

@fpj
Copy link
Contributor

fpj commented Mar 15, 2016

@granthenke ok, sounds good. do you think you can test a version of your patch with the conditional delete call? It'd be good to make sure that we don't have anything unexpected due to the conditional delete changes.

@granthenke
Copy link
Member Author

@fpj Yeah, that is what I tested.

@granthenke
Copy link
Member Author

@fpj I have merged with trunk and updated this patch to use the ZkClient versioned delete.

@granthenke
Copy link
Member Author

Tests pass locally. Jenkins failure looks unrelated.

@fpj @ijuma @junrao Would you guys be able to give this a final review today?


private def deletePath(path: String, expectedVersion: Int): Boolean = {
try {
zkUtils.zkClient.delete(path, expectedVersion)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor point, but I'm wondering if it is best to have a call in ZkUtils so that we consistently access zkclient through ZkUtils.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fpj I chose not to because it's not used anywhere else yet and I didn't want to think through and test "generic logic" for all uses. This code is contained and tested with respect to the authorizer. I thought was was an okay choice because ZkClient is used in many other places throughout the code base as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change isn't strictly necessary, and in fact it makes it less efficient because we have an additional call. I was thinking about the case that we want to replace zkclient with direct calls to the raw zookeeper client (KAFKA-3210) and as I was thinking about doing it, we'd just rewire ZkUtils. It's not a big deal if you don't want to change, though. Just let me know so that we can wrap this one up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fpj I can move it.

@fpj
Copy link
Contributor

fpj commented Mar 17, 2016

@granthenke it looks great, thank you for updating. I left a few comments in the case you have a chance to go over them.

@granthenke
Copy link
Member Author

@fpj thanks for looking over it again. I responded to some of your requests.

@granthenke
Copy link
Member Author

@fpj I addressed your comments and updated the code.

@fpj
Copy link
Contributor

fpj commented Mar 18, 2016

@granthenke looks like the concurrency test fails because of too many attempts (one of the latest changes) :

java.lang.IllegalStateException: Failed to update ACLs for Topic:test after trying a maximum of 15 times)

I suppose that for any number you choose, there is a chance that the test case will need that number or more, which is a bit concerning because it can keep running for some time. I think we will need to back off in the case of collision, like with a random sleep if you want something simple. Perhaps you have a better idea?

@granthenke
Copy link
Member Author

@fpj Causing collisions is the point of the test. It verifies that the code always maintains the correct state no matter how many retries it takes. I really don't want to over engineer this and I don't think this is a real world issue. The test just goes to extremes to prove ACLs are never lost. Do you feel strongly about the arbitrary retry limit code?

@fpj
Copy link
Contributor

fpj commented Mar 18, 2016

@granthenke Even if this loops forever, this is just AclCommand running, right? In the very worst case of running for too long due to collisions, you'd have to kill it and try again. There is little harm in not having the retries if that's the case, unless it is problem that you're not sure whether the ACL change has gone through or not, but I suppose you can check if needed.

I still think that it'd be better to back off in the case of collisions rather than running in a tight loop, but I don't want to block this issue on it. Perhaps we can create a jira and work on it later.

@granthenke
Copy link
Member Author

@fpj I can add a backoff over the next few days. Its only quick in the test because we are on an EmbeddedZookeeper. In the real world the read of the data happens before the next update request which slows things down.

@granthenke
Copy link
Member Author

@fpj I added a simple backoff with a random jitter

@gwenshap
Copy link
Contributor

LGTM.
Test failure unrelated (and looks like a Jenkins issue) - validated tests on my machine.

I love the new concurrency tests util - should be very handy in the future.

@asfgit asfgit closed this in bfac36a Mar 20, 2016
@fpj
Copy link
Contributor

fpj commented Mar 22, 2016

Thanks for the work on the patch @granthenke.

@granthenke
Copy link
Member Author

Thanks for all the review @fpj!

efeg added a commit to efeg/kafka that referenced this pull request Jan 29, 2020
adixitconfluent pushed a commit to adixitconfluent/kafka that referenced this pull request Feb 7, 2024
…che#1006)

* AKCORE-1: ShareGroupHeartbeat support in group coordinator (1/N)

* Merged upstream changes and fixed core adapter
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants