[BUG] IO Freeze on Leader cause cluster publication to get stuck #1165

Open
Bukhtawar opened this issue Aug 27, 2021 · 6 comments

@Bukhtawar
Collaborator

Bukhtawar commented Aug 27, 2021

Describe the bug

The publication of a cluster state is time-bound to 30s by the cluster.publish.timeout setting. If this timeout is reached before the new cluster state is committed, the cluster state change is rejected and the leader considers itself to have failed. It stands down and starts trying to elect a new master.

There is a bug in the leader: when it tries to publish the new cluster state, it first acquires a lock (0x0000000097a2f970) under which it flushes the new state to disk. The same lock is used to cancel the publication on timeout; below is the state of the timeout scheduler meant to cancel the publication. So if the flushing of the cluster state is stuck on IO, the cancellation of the publication is stuck as well, since both share the same mutex. As a result the leader will not step down, effectively blocking the cluster from making progress.
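
To illustrate the deadlock, here is a minimal Java sketch (hypothetical names, not the actual Coordinator/CoordinatorPublication code) of how sharing one monitor between the state flush and the timeout cancellation wedges the leader:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class PublicationSketch {
    // Plays the role of the coordination mutex <0x0000000097a2f970> in the thread dumps below.
    private final Object mutex = new Object();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void publish(byte[] newClusterState) {
        // Cancellation is scheduled at cluster.publish.timeout (30s)...
        scheduler.schedule(this::cancelPublication, 30, TimeUnit.SECONDS);
        synchronized (mutex) {
            // ...but if this flush blocks on a frozen disk, the monitor is never released.
            fsyncStateToDisk(newClusterState);
        }
    }

    void cancelPublication() {
        synchronized (mutex) {
            // The timeout handler blocks here, so the publication is never failed
            // and the leader never steps down.
        }
    }

    private void fsyncStateToDisk(byte[] state) {
        // Stand-in for writing and fsyncing the persisted cluster state.
    }
}
```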

"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27968]" #122622 daemon prio=5 os_prio=0 cpu=305.52ms elapsed=1415.72s tid=0x00007f12d842f000 nid=0x5ac5 waiting for monitor entry  [0x00007f12812d1000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$2.run(Coordinator.java:1335)
	- waiting to lock <0x0000000097a2f970> (a java.lang.Object)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:642)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)

FS health checks at this point:

"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27998]" #122744 daemon prio=5 os_prio=0 cpu=4.93ms elapsed=83.04s tid=0x00007f12d8751000 nid=0x7737 runnable  [0x00007f128937c000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.FileDispatcherImpl.force0([email protected]/Native Method)
	at sun.nio.ch.FileDispatcherImpl.force([email protected]/FileDispatcherImpl.java:82)
	at sun.nio.ch.FileChannelImpl.force([email protected]/FileChannelImpl.java:461)
	at org.elasticsearch.core.internal.io.IOUtils.fsync(IOUtils.java:278)
	at org.elasticsearch.monitor.fs.FsHealthService$FsHealthMonitor.monitorFSHealth(FsHealthService.java:173)
	at org.elasticsearch.monitor.fs.FsHealthService$FsHealthMonitor.run(FsHealthService.java:146)
	at org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:223)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:701)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)

[2021-08-27T01:00:55,775][WARN ][o.e.m.f.FsHealthService  ] [6e913fcd73eacb190aa137e551d85dc6] health check of [var/data/nodes/0] took [10295ms] which is above the warn threshold of [5s]
[2021-08-27T01:04:24,251][WARN ][o.e.m.f.FsHealthService  ] [6e913fcd73eacb190aa137e551d85dc6] health check of [var/data/nodes/0] took [86111ms] which is above the warn threshold of [5s]
[2021-08-27T01:09:12,123][WARN ][o.e.m.f.FsHealthService  ] [6e913fcd73eacb190aa137e551d85dc6] health check of [var/data/nodes/0] took [100075ms] which is above the warn threshold of [5s]
[2021-08-27T01:13:42,773][WARN ][o.e.m.f.FsHealthService  ] [6e913fcd73eacb190aa137e551d85dc6] health check of [var/data/nodes/0] took [145478ms] which is above the warn threshold of [5s]
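
For context, the following is a rough sketch (hypothetical names, assuming only the behaviour implied by the thread dump and log lines above) of what such a filesystem health probe does: fsync a small file on each data path and warn when the write takes longer than the threshold.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class FsHealthProbeSketch {
    // Matches the 5s warn threshold visible in the log lines above.
    private static final long WARN_THRESHOLD_MILLIS = 5_000;

    void checkPath(Path dataPath) throws IOException {
        Path tempFile = Files.createTempFile(dataPath, ".fs-health", ".tmp");
        long start = System.currentTimeMillis();
        try (FileChannel channel = FileChannel.open(tempFile, StandardOpenOption.WRITE)) {
            channel.write(ByteBuffer.wrap(new byte[] { 1 }));
            channel.force(true); // the fsync that is slow/stuck in the dump above
        } finally {
            Files.deleteIfExists(tempFile);
        }
        long elapsedMillis = System.currentTimeMillis() - start;
        if (elapsedMillis > WARN_THRESHOLD_MILLIS) {
            System.out.printf("health check of [%s] took [%dms] which is above the warn threshold of [5s]%n",
                    dataPath, elapsedMillis);
        }
    }
}
```

Note that in the scenario above the slow probe only produced warnings; it did not cause the node to be failed, which is what proposal 2 below addresses.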

The leader trying to commit the new cluster state to disk, causing other operations to stall on the same mutex:

"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27981]" #122676 daemon prio=5 os_prio=0 cpu=156.25ms elapsed=879.32s tid=0x00007f12d86b8800 nid=0x6595 runnable  [0x00007f1283af8000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.fs.UnixNativeDispatcher.stat0([email protected]/Native Method)
	at sun.nio.fs.UnixNativeDispatcher.stat([email protected]/UnixNativeDispatcher.java:301)
	at sun.nio.fs.UnixFileAttributes.get([email protected]/UnixFileAttributes.java:70)
	at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes([email protected]/UnixFileAttributeViews.java:52)
	at sun.nio.fs.UnixFileSystemProvider.readAttributes([email protected]/UnixFileSystemProvider.java:149)
	at sun.nio.fs.LinuxFileSystemProvider.readAttributes([email protected]/LinuxFileSystemProvider.java:99)
	at java.nio.file.Files.readAttributes([email protected]/Files.java:1763)
	at java.nio.file.Files.size([email protected]/Files.java:2380)
	at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:242)
	at org.apache.lucene.index.SegmentCommitInfo.sizeInBytes(SegmentCommitInfo.java:217)
	at org.apache.lucene.index.MergePolicy.size(MergePolicy.java:558)
	at org.apache.lucene.index.MergePolicy.useCompoundFile(MergePolicy.java:540)
	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4518)
	- locked <0x0000000097051bd8> (a org.apache.lucene.index.IndexWriter)
	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4057)
	at org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:40)
	- locked <0x0000000097a2fc28> (a org.apache.lucene.index.SerialMergeScheduler)
	at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2158)
	at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3567)
	at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3545)
	at org.elasticsearch.gateway.PersistedClusterStateService$MetaDataIndexWriter.flush(PersistedClusterStateService.java:516)
	at org.elasticsearch.gateway.PersistedClusterStateService$Writer.updateMetaData(PersistedClusterStateService.java:689)
	at org.elasticsearch.gateway.PersistedClusterStateService$Writer.writeIncrementalStateAndCommit(PersistedClusterStateService.java:615)
	at org.elasticsearch.gateway.GatewayMetaState$LucenePersistedState.setLastAcceptedState(GatewayMetaState.java:543)
	at org.elasticsearch.cluster.coordination.CoordinationState.handlePublishRequest(CoordinationState.java:357)
	at org.elasticsearch.cluster.coordination.Coordinator.handlePublishRequest(Coordinator.java:353)
	- locked <0x0000000097a2f970> (a java.lang.Object)
	at org.elasticsearch.cluster.coordination.Coordinator$$Lambda$2310/0x0000000100c78040.apply(Unknown Source)
	at org.elasticsearch.cluster.coordination.PublicationTransportHandler.acceptState(PublicationTransportHandler.java:449)
	at org.elasticsearch.cluster.coordination.PublicationTransportHandler.handleIncomingPublishRequest(PublicationTransportHandler.java:432)
	at org.elasticsearch.cluster.coordination.PublicationTransportHandler.lambda$new$0(PublicationTransportHandler.java:100)
	at org.elasticsearch.cluster.coordination.PublicationTransportHandler$$Lambda$2313/0x0000000100c78c40.messageReceived(Unknown Source)
	......
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
	at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:767)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:701)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)

Note that other processing, such as the follower checker's node removal, is stuck on the same mutex:

 
"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27983]" #122679 daemon prio=5 os_prio=0 cpu=85.98ms elapsed=876.81s tid=0x00007f128411d800 nid=0x6599 waiting for monitor entry  [0x00007f12946cb000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$3.run(Coordinator.java:1348)
	- waiting to lock <0x0000000097a2f970> (a java.lang.Object)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:642)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)

 
"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27987]" #122698 daemon prio=5 os_prio=0 cpu=6.49ms elapsed=426.07s tid=0x00007f12d8428000 nid=0x705a waiting for monitor entry  [0x00007f1283dfc000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.elasticsearch.cluster.coordination.Coordinator.removeNode(Coordinator.java:248)
	- waiting to lock <0x0000000097a2f970> (a java.lang.Object)
	at org.elasticsearch.cluster.coordination.Coordinator$$Lambda$2329/0x0000000100c7e440.accept(Unknown Source)
	at org.elasticsearch.cluster.coordination.FollowersChecker$FollowerChecker$2.run(FollowersChecker.java:454)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:642)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)
 
   Locked ownable synchronizers:
	- <0x00000000804fb130> (a java.util.concurrent.ThreadPoolExecutor$Worker)
 
"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27988]" #122699 daemon prio=5 os_prio=0 cpu=0.25ms elapsed=423.34s tid=0x00007f127e36d800 nid=0x705b waiting for monitor entry  [0x00007f1282ae9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.elasticsearch.cluster.coordination.Coordinator$5.onResponse(Coordinator.java:1134)
	- waiting to lock <0x0000000097a2f970> (a java.lang.Object)
	at org.elasticsearch.cluster.coordination.PublicationTransportHandler$3.handleResponse(PublicationTransportHandler.java:286)
	at org.elasticsearch.cluster.coordination.PublicationTransportHandler$3.handleResponse(PublicationTransportHandler.java:277)
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1129)
	at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:222)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:701)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)
 
   Locked ownable synchronizers:
	- <0x000000008012f990> (a java.util.concurrent.ThreadPoolExecutor$Worker)
 
"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27989]" #122700 daemon prio=5 os_prio=0 cpu=6.40ms elapsed=423.33s tid=0x00007f12d842d000 nid=0x705c waiting for monitor entry  [0x00007f1289079000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.elasticsearch.cluster.coordination.Coordinator.removeNode(Coordinator.java:248)
	- waiting to lock <0x0000000097a2f970> (a java.lang.Object)
	at org.elasticsearch.cluster.coordination.Coordinator$$Lambda$2329/0x0000000100c7e440.accept(Unknown Source)
	at org.elasticsearch.cluster.coordination.FollowersChecker$FollowerChecker$2.run(FollowersChecker.java:454)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:642)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)
 
   Locked ownable synchronizers:
	- <0x00000000804fb590> (a java.util.concurrent.ThreadPoolExecutor$Worker)
 
"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27990]" #122701 daemon prio=5 os_prio=0 cpu=0.92ms elapsed=402.18s tid=0x00007f12d874e800 nid=0x715b waiting for monitor entry  [0x00007f128d58e000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.elasticsearch.cluster.coordination.Coordinator$5.onResponse(Coordinator.java:1134)
	- waiting to lock <0x0000000097a2f970> (a java.lang.Object)
	at org.elasticsearch.cluster.coordination.PublicationTransportHandler$3.handleResponse(PublicationTransportHandler.java:286)
	at org.elasticsearch.cluster.coordination.PublicationTransportHandler$3.handleResponse(PublicationTransportHandler.java:277)
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1129)
	at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:222)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:701)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)

"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27993]" #122705 daemon prio=5 os_prio=0 cpu=27.30ms elapsed=373.33s tid=0x00007f12d8750800 nid=0x71a7 waiting for monitor entry  [0x00007f12822e0000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.elasticsearch.cluster.coordination.Coordinator.getStateForMasterService(Coordinator.java:1040)
	- waiting to lock <0x0000000097a2f970> (a java.lang.Object)
	at org.elasticsearch.cluster.coordination.Coordinator.lambda$handleJoinRequest$7(Coordinator.java:500)
	at org.elasticsearch.cluster.coordination.Coordinator$$Lambda$3288/0x0000000101881c40.accept(Unknown Source)
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)
	at org.elasticsearch.transport.ClusterConnectionManager.connectToNode(ClusterConnectionManager.java:120)
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:375)
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:359)
	at org.elasticsearch.cluster.coordination.Coordinator.handleJoinRequest(Coordinator.java:499)
	at org.elasticsearch.cluster.coordination.Coordinator$$Lambda$2285/0x0000000100c71c40.accept(Unknown Source)
	at org.elasticsearch.cluster.coordination.JoinHelper.lambda$new$0(JoinHelper.java:136)
	at org.elasticsearch.cluster.coordination.JoinHelper$$Lambda$2288/0x0000000100c72840.messageReceived(Unknown Source)
.......
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
	at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:701)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)
 
   Locked ownable synchronizers:
	- <0x0000000080be46b8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
 
"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27994]" #122706 daemon prio=5 os_prio=0 cpu=29.90ms elapsed=342.92s tid=0x00007f12d8781800 nid=0x722a waiting for monitor entry  [0x00007f12819d8000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.elasticsearch.cluster.coordination.Coordinator.removeNode(Coordinator.java:248)
	- waiting to lock <0x0000000097a2f970> (a java.lang.Object)
	at org.elasticsearch.cluster.coordination.Coordinator$$Lambda$2329/0x0000000100c7e440.accept(Unknown Source)
	at org.elasticsearch.cluster.coordination.FollowersChecker$FollowerChecker$2.run(FollowersChecker.java:454)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:642)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)

Proposal

  1. Enforce a timeout on the leader for committing a cluster state, including the time it takes to fsync the state to disk (see the sketch after this list).
  2. Improve health checks to fail the leader, or any node for that matter, if the FS health check takes longer than 60s.
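
A minimal sketch of what proposal 1 could look like, assuming the write-and-fsync can be handed to a separate thread so the commit can be abandoned once a deadline passes (TimedStateWriter and its API are hypothetical, not existing OpenSearch code):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedStateWriter {
    private final ExecutorService fsyncExecutor = Executors.newSingleThreadExecutor();

    // Commits the state, surfacing a failure if the commit (including fsync) exceeds the timeout,
    // so the leader can step down instead of blocking under the coordination mutex indefinitely.
    void commitWithTimeout(Runnable writeAndFsync, long timeoutMillis)
            throws InterruptedException, TimeoutException {
        Future<?> commit = fsyncExecutor.submit(writeAndFsync);
        try {
            commit.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            commit.cancel(true); // abandon the stuck fsync; the caller fails the publication
            throw e;
        } catch (ExecutionException e) {
            throw new IllegalStateException("cluster state commit failed", e.getCause());
        }
    }
}
```

One open question with this approach (part of why it is tricky, as noted in the comments below) is what the persisted state looks like if the abandoned fsync eventually completes after the leader has already stepped down.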


@Bukhtawar Bukhtawar added the bug (Something isn't working) and untriaged labels on Aug 27, 2021
@Bukhtawar Bukhtawar changed the title from "[BUG] IO Freeze on Leader cause cluster processing to get stuck" to "[BUG] IO Freeze on Leader cause cluster publication to get stuck" on Aug 27, 2021
@itiyamas
Contributor

Can you add the steps to reproduce this issue, or write a test case for it?

@itiyamas
Copy link
Contributor

itiyamas commented Aug 30, 2021

I don't see failing the FS health check fixing this problem, as once the health check fails, onLeaderFailure will block again on the same mutex. We need to somehow add timeouts to the cluster state persistence calls. I believe once that is fixed, this issue will go away, irrespective of the FS health check on stuck IO (essentially your point 1 in the proposal above). This change may be tricky to achieve. The FS health check on stuck IO is a good change in itself, but not a fix for this issue.

@Bukhtawar
Collaborator Author

The FS health check does proactive checks to identify a bad node and evict it from the cluster, rather than waiting for a cluster state update to remove the stuck leader. You rightly pointed out that this fix by itself is insufficient, since the mutex for onLeaderFailed will still get BLOCKED, which is what the first proposal tries to fix. Moreover, FS health checks would help with such situations across data nodes in general.
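
As a rough illustration of how a proactive check can act without touching the coordination mutex, here is a sketch (hypothetical names, assuming a separate watchdog thread) in which a stuck fsync flips a volatile health status that health-check responses and peer checks can read lock-free:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class FsHealthStatusSketch {
    enum Status { HEALTHY, UNHEALTHY }

    // Corresponds to the 60s threshold suggested in proposal 2.
    private static final long UNHEALTHY_THRESHOLD_MILLIS = TimeUnit.SECONDS.toMillis(60);

    private final AtomicLong fsyncStartedAtMillis = new AtomicLong(-1);
    private volatile Status status = Status.HEALTHY;

    void beforeFsync() {
        fsyncStartedAtMillis.set(System.currentTimeMillis());
    }

    void afterFsync() {
        fsyncStartedAtMillis.set(-1);
        status = Status.HEALTHY;
    }

    // Called from a watchdog thread that never takes the coordination mutex: if an fsync has been
    // in flight for more than the threshold, mark the node unhealthy so peers can evict it.
    void watchdogTick() {
        long startedAt = fsyncStartedAtMillis.get();
        if (startedAt != -1 && System.currentTimeMillis() - startedAt > UNHEALTHY_THRESHOLD_MILLIS) {
            status = Status.UNHEALTHY;
        }
    }

    Status status() {
        return status;
    }
}
```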

@anasalkouz
Member

Hi Bukhtawar, are you actively working on this issue? I see a pull request #1167 has already been submitted. Is anything else pending before we can close the issue?

@andrross
Member

andrross commented Nov 1, 2021

We don't seem to have any metrics on how frequently this issue manifests, but it appears to happen when the filesystem in use is not a local disk (e.g. EBS).

@anasalkouz
Member

@Bukhtawar is there anything else pending on this issue before we close it?
