KAFKA-3328: SimpleAclAuthorizer can lose ACLs with frequent add/remov… #1006

granthenke · 2016-03-03T21:11:34Z

…e calls

Changes the SimpleAclAuthorizer to:

Track and utilize the zookeeper version when updating zookeeper to prevent data loss in the case of stale reads and race conditions
Update local cache when modifying ACLs
Add debug logging

…e calls Changes the SimpleAclAuthorizer to: - Always read state from Zookeeper before updating acls - Update local cache when modifying acls

ijuma · 2016-03-03T21:16:06Z

core/src/test/scala/unit/kafka/security/auth/SimpleAclAuthorizerTest.scala

@@ -46,14 +48,18 @@ class SimpleAclAuthorizerTest extends ZooKeeperTestHarness {
    val props = TestUtils.createBrokerConfig(0, zkConnect)
    props.put(SimpleAclAuthorizer.SuperUsersProp, superUsers)

+    Logger.getLogger(classOf[SimpleAclAuthorizer]).setLevel(Level.DEBUG)


Accidental?

Ah yeah, I do that sometimes to view various logging statements for testing. I removed the debug statements, so I should remove this.

granthenke · 2016-03-03T21:16:35Z

This change will now make a call to zookeeper for every add/remove call. I am assuming that the volume of ACL modification requests to the authorizer would be quite low, therefore this is okay. If thats not the case, let me know.

I think this issue can still occur with multiple Zookeeper nodes in the case where the client gets a stale zookeeper read. I am not sure the best way to handle this with Zookeeper and our packaged client. @fpj do you have any recommendations on making sure the "read, update, write" pattern is synchronized/safe? Do we need a completely different approach? (links or research material are welcome)

SinghAsDev · 2016-03-03T21:24:07Z

core/src/test/scala/unit/kafka/security/auth/SimpleAclAuthorizerTest.scala

+    TestUtils.waitAndVerifyAcls(Set(acl1, acl2), simpleAclAuthorizer2, commonResource)
+  }
+
+


Maybe add a test for adding acl from one authorizer and removing from other and making sure it gets reflected.

I can add something like that, but will need to think through it.

I used 2 additions, because it clearly shows "whats missing". I can add the test for the add & delete but in the case where the add was "dropped" checking for no acl would still be true.

granthenke · 2016-03-04T20:53:37Z

@ijuma @SinghAsDev @fpj I updated the patch to be "safe". So that stale reads and race conditions are handled. It does this by tracking its expected zookeeper version and handling write failures. Its also more optimistic about its cached values instead of reading from zookeeper every time. I also added a test that concurrently updates the same node with 110 mixed requests.

This definitely makes the SimpleAclAuthorizer more safe, but perhaps less "simple".

granthenke · 2016-03-07T15:58:08Z

@ijuma @Parth-Brahmbhatt @SinghAsDev @fpj @ewencp @gwenshap
Could any of you review this today?

fpj · 2016-03-07T17:44:43Z

@granthenke I'm not sure I'll have time to get to it today, but I can review it by wed. To answer your question above, it doesn't matter if the read is stale if you're using a conditional set, then the write won't go through if the version isn't the one expected. In the worst case, the broker will have to try it again.

One thing you can goto reduce the probability of a stale read is to issue a sync request right before the read. This way you'll be forcing all pending updates from the leader to the follower to go through. You can't completely avoid the issue though because you can still have concurrent writes even when syncing.

granthenke · 2016-03-07T17:50:32Z

@fpj Thanks for the follow up and confirmation. Using the versioned requests is the approach I took in the "rewrite" after my initial question where I mentioned you.

I had considered sending a sync request, and would still like to add that, but unfortunately the ZkClient we use does not support it. It also doesn't support a versioned delete request, which is unfortunate. Perhaps I should open a PR for that.

fpj · 2016-03-07T17:52:41Z

@granthenke I'll see if I can suggest something concrete, but in the past I've bypassed zkclient and used the zk handle directly, it might be doable in your case as well.

SinghAsDev · 2016-03-07T18:23:16Z

core/src/main/scala/kafka/security/auth/SimpleAclAuthorizer.scala

@@ -71,6 +69,8 @@ class SimpleAclAuthorizer extends Authorizer with Logging {
  private var zkUtils: ZkUtils = null
  private var aclChangeListener: ZkNodeChangeNotificationListener = null

+  private val NoVersion = -2


Just curious, why -2?

I should clean, that up. I just needed some way to indicated there is no version yet, and -1 in the Zookeeper api means "ignore the version". I can rework the code to eliminate this, or use Option/None.

fpj · 2016-03-08T14:37:41Z

core/src/main/scala/kafka/security/auth/SimpleAclAuthorizer.scala

+      getAclsFromZk(resource)
+    var newVersionedAcls = currentVersionedAcls
+    var writeComplete = false
+    while (!writeComplete) {


I'm not sure why you want to spin it until the write is complete. If there is an error and the operation does not complete, then perhaps it should be up to the caller to call it again. Could you elaborate on the case you have in mind and that makes this while loop necessary?

This goal of this while loop is to retry the update to zookeeper if we fail based on the zookeeper version. The zookeeper version could be incorrect if the cached version in this instance is not up to date. That is most likely to occur if there are concurrent updates on a separate instance of the authorizer.

If the failure is based on anything other than version, or the handled exceptions. Then an exception will be thrown and propagated back to the user.

Ideally I would use the sync call to prevent any "spinning" but the ZkClient we use does not have that method (sgroschupf/zkclient#44).

The problem with sync is that it is an asynchronous call and I believe ZkClient only supports the synchronous API. I don't think it is worth using the sync call here because it won't stop concurrent accesses anyway. When we move to the async API for the controller, then perhaps we can make this change here.

Yeah, that makes sense. Thanks for clarifying.

granthenke · 2016-03-08T18:40:38Z

@fpj Thanks for the review. I updated the patch to use the versioned delete and handle the empty data edge cases.

granthenke · 2016-03-09T20:18:51Z

@fpj Any further comments? @ewencp @gwenshap @junrao Could one of you review?

ijuma · 2016-03-10T00:37:34Z

core/src/main/scala/kafka/security/auth/SimpleAclAuthorizer.scala

+  }
+
+  private def getAclsFromCache(resource: Resource): VersionedAcls = {
+    aclCache.get(resource).getOrElse(throw new IllegalArgumentException(s"Acls do not exist in the cache for resource $resource"))


Map also has a getOrElse so you can replace get followed by getOrElse with a single getOrElse.

ijuma · 2016-03-10T12:13:47Z

core/src/main/scala/kafka/security/auth/SimpleAclAuthorizer.scala

@@ -71,7 +70,8 @@ class SimpleAclAuthorizer extends Authorizer with Logging {
  private var zkUtils: ZkUtils = null
  private var aclChangeListener: ZkNodeChangeNotificationListener = null

-  private val aclCache = new scala.collection.mutable.HashMap[Resource, Set[Acl]]
+  private case class VersionedAcls(acls: Set[Acl], zkVersion: Int)


This should probably be in the SimpleAclAuthorizer object since it does not need access to the instance.

fpj · 2016-03-15T14:13:35Z

@granthenke I was proposing two patches to separate concerns, but you're right that we could just do it all here.

granthenke · 2016-03-15T14:14:22Z

@fpj I can do it in 2 steps. I will open a jira to track the upgrade change and make this depend on it.

granthenke · 2016-03-15T14:51:23Z

@fpj I created KAFKA-3403 to track the upgrade. I tested zkclient locally and everything passes.

fpj · 2016-03-15T14:54:38Z

@granthenke ok, sounds good. do you think you can test a version of your patch with the conditional delete call? It'd be good to make sure that we don't have anything unexpected due to the conditional delete changes.

granthenke · 2016-03-15T14:55:47Z

@fpj Yeah, that is what I tested.

granthenke · 2016-03-17T03:40:31Z

@fpj I have merged with trunk and updated this patch to use the ZkClient versioned delete.

granthenke · 2016-03-17T14:51:27Z

Tests pass locally. Jenkins failure looks unrelated.

@fpj @ijuma @junrao Would you guys be able to give this a final review today?

fpj · 2016-03-17T15:33:33Z

core/src/main/scala/kafka/security/auth/SimpleAclAuthorizer.scala

+
+  private def deletePath(path: String, expectedVersion: Int): Boolean = {
+    try {
+      zkUtils.zkClient.delete(path, expectedVersion)


Minor point, but I'm wondering if it is best to have a call in ZkUtils so that we consistently access zkclient through ZkUtils.

@fpj I chose not to because it's not used anywhere else yet and I didn't want to think through and test "generic logic" for all uses. This code is contained and tested with respect to the authorizer. I thought was was an okay choice because ZkClient is used in many other places throughout the code base as well.

The change isn't strictly necessary, and in fact it makes it less efficient because we have an additional call. I was thinking about the case that we want to replace zkclient with direct calls to the raw zookeeper client (KAFKA-3210) and as I was thinking about doing it, we'd just rewire ZkUtils. It's not a big deal if you don't want to change, though. Just let me know so that we can wrap this one up.

@fpj I can move it.

fpj · 2016-03-17T17:58:46Z

@granthenke it looks great, thank you for updating. I left a few comments in the case you have a chance to go over them.

granthenke · 2016-03-17T19:54:02Z

@fpj thanks for looking over it again. I responded to some of your requests.

granthenke · 2016-03-18T15:16:06Z

@fpj I addressed your comments and updated the code.

fpj · 2016-03-18T21:36:20Z

@granthenke looks like the concurrency test fails because of too many attempts (one of the latest changes) :

java.lang.IllegalStateException: Failed to update ACLs for Topic:test after trying a maximum of 15 times)

I suppose that for any number you choose, there is a chance that the test case will need that number or more, which is a bit concerning because it can keep running for some time. I think we will need to back off in the case of collision, like with a random sleep if you want something simple. Perhaps you have a better idea?

granthenke · 2016-03-18T21:53:02Z

@fpj Causing collisions is the point of the test. It verifies that the code always maintains the correct state no matter how many retries it takes. I really don't want to over engineer this and I don't think this is a real world issue. The test just goes to extremes to prove ACLs are never lost. Do you feel strongly about the arbitrary retry limit code?

fpj · 2016-03-18T22:11:36Z

@granthenke Even if this loops forever, this is just AclCommand running, right? In the very worst case of running for too long due to collisions, you'd have to kill it and try again. There is little harm in not having the retries if that's the case, unless it is problem that you're not sure whether the ACL change has gone through or not, but I suppose you can check if needed.

I still think that it'd be better to back off in the case of collisions rather than running in a tight loop, but I don't want to block this issue on it. Perhaps we can create a jira and work on it later.

granthenke · 2016-03-18T22:13:22Z

@fpj I can add a backoff over the next few days. Its only quick in the test because we are on an EmbeddedZookeeper. In the real world the read of the data happens before the next update request which slows things down.

granthenke · 2016-03-19T04:39:52Z

@fpj I added a simple backoff with a random jitter

gwenshap · 2016-03-20T07:45:04Z

LGTM.
Test failure unrelated (and looks like a Jenkins issue) - validated tests on my machine.

I love the new concurrency tests util - should be very handy in the future.

fpj · 2016-03-22T10:02:44Z

Thanks for the work on the patch @granthenke.

granthenke · 2016-03-22T15:08:22Z

Thanks for all the review @fpj!

…che#1006) * AKCORE-1: ShareGroupHeartbeat support in group coordinator (1/N) * Merged upstream changes and fixed core adapter

KAFKA-3328: SimpleAclAuthorizer can lose ACLs with frequent add/remov…

3483b75

…e calls Changes the SimpleAclAuthorizer to: - Always read state from Zookeeper before updating acls - Update local cache when modifying acls

ijuma reviewed Mar 3, 2016
View reviewed changes

SinghAsDev reviewed Mar 3, 2016
View reviewed changes

granthenke added 2 commits March 3, 2016 16:37

Add test for delete

5b7d2b1

Refactor to be safe in the case of stale zookeeper reads

7b615d8

Reduce concurrent calls to 100

86cca1f

SinghAsDev reviewed Mar 7, 2016
View reviewed changes

Consolidate cache/versions

776320e

fpj reviewed Mar 8, 2016
View reviewed changes

Add versioned delete

0a3ecc0

ijuma reviewed Mar 10, 2016
View reviewed changes

Update based on reviews

d317dbb

ijuma reviewed Mar 10, 2016
View reviewed changes

granthenke added 2 commits March 16, 2016 22:13

Merge branch 'trunk' into simple-authorizer-fix

2cb5c5f

Use ZkClient versioned delete call

fb01532

fpj reviewed Mar 17, 2016
View reviewed changes

Move delete to zkutils, set max retries, move lock

756a308

Add a basic backoff to retries

e8eb25c

Merge branch 'trunk' into simple-authorizer-fix

fdf9af0

asfgit closed this in bfac36a Mar 20, 2016

efeg added a commit to efeg/kafka that referenced this pull request Jan 29, 2020

Add checkStyle regarding the JavadocMethod (apache#1006)

4721732

		TestUtils.waitAndVerifyAcls(Set(acl1, acl2), simpleAclAuthorizer2, commonResource)
		}

KAFKA-3328: SimpleAclAuthorizer can lose ACLs with frequent add/remov… #1006

KAFKA-3328: SimpleAclAuthorizer can lose ACLs with frequent add/remov… #1006

Conversation

granthenke commented Mar 3, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

granthenke commented Mar 3, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

granthenke commented Mar 4, 2016

granthenke commented Mar 7, 2016

fpj commented Mar 7, 2016

granthenke commented Mar 7, 2016

fpj commented Mar 7, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

granthenke commented Mar 8, 2016

granthenke commented Mar 9, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fpj commented Mar 15, 2016

granthenke commented Mar 15, 2016

granthenke commented Mar 15, 2016

fpj commented Mar 15, 2016

granthenke commented Mar 15, 2016

granthenke commented Mar 17, 2016

granthenke commented Mar 17, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fpj commented Mar 17, 2016

granthenke commented Mar 17, 2016

granthenke commented Mar 18, 2016

fpj commented Mar 18, 2016

granthenke commented Mar 18, 2016

fpj commented Mar 18, 2016

granthenke commented Mar 18, 2016

granthenke commented Mar 19, 2016

gwenshap commented Mar 20, 2016

fpj commented Mar 22, 2016

granthenke commented Mar 22, 2016