[BUG] IO Freeze on Leader cause cluster publication to get stuck #1165

Open
Bukhtawar opened this issue Aug 27, 2021 · 6 comments

@Bukhtawar
Collaborator

Bukhtawar commented Aug 27, 2021

Describe the bug

The publication of a cluster state is time-bound to 30s by the cluster.publish.timeout setting. If this timeout is reached before the new cluster state is committed, the cluster state change is rejected and the leader considers itself to have failed. It stands down and starts trying to elect a new master.

There is a bug in the leader: when it tries to publish the new cluster state, it first acquires a lock (0x0000000097a2f970) under which it flushes the new state to disk. The same lock is used to cancel the publication on timeout; below is the state of the timeout scheduler meant to cancel the publication. So if the flushing of the cluster state is stuck on IO, the cancellation of the publication is stuck as well, since both share the same mutex. As a result the leader will not step down, effectively blocking the cluster from making progress.
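
To illustrate the deadlock, here is a minimal Java sketch (hypothetical names, not the actual Coordinator/CoordinatorPublication code) of how sharing one monitor between the state flush and the timeout cancellation wedges the leader:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class PublicationSketch {
    // Plays the role of the coordination mutex <0x0000000097a2f970> in the thread dumps below.
    private final Object mutex = new Object();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void publish(byte[] newClusterState) {
        // Cancellation is scheduled at cluster.publish.timeout (30s)...
        scheduler.schedule(this::cancelPublication, 30, TimeUnit.SECONDS);
        synchronized (mutex) {
            // ...but if this flush blocks on a frozen disk, the monitor is never released.
            fsyncStateToDisk(newClusterState);
        }
    }

    void cancelPublication() {
        synchronized (mutex) {
            // The timeout handler blocks here, so the publication is never failed
            // and the leader never steps down.
        }
    }

    private void fsyncStateToDisk(byte[] state) {
        // Stand-in for writing and fsyncing the persisted cluster state.
    }
}
```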

"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27968]" #122622 daemon prio=5 os_prio=0 cpu=305.52ms elapsed=1415.72s tid=0x00007f12d842f000 nid=0x5ac5 waiting for monitor entry  [0x00007f12812d1000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$2.run(Coordinator.java:1335)
	- waiting to lock <0x0000000097a2f970> (a java.lang.Object)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:642)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)

FS health checks at this point:

"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27998]" #122744 daemon prio=5 os_prio=0 cpu=4.93ms elapsed=83.04s tid=0x00007f12d8751000 nid=0x7737 runnable  [0x00007f128937c000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.FileDispatcherImpl.force0([email protected]/Native Method)
	at sun.nio.ch.FileDispatcherImpl.force([email protected]/FileDispatcherImpl.java:82)
	at sun.nio.ch.FileChannelImpl.force([email protected]/FileChannelImpl.java:461)
	at org.elasticsearch.core.internal.io.IOUtils.fsync(IOUtils.java:278)
	at org.elasticsearch.monitor.fs.FsHealthService$FsHealthMonitor.monitorFSHealth(FsHealthService.java:173)
	at org.elasticsearch.monitor.fs.FsHealthService$FsHealthMonitor.run(FsHealthService.java:146)
	at org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:223)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:701)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)

[2021-08-27T01:00:55,775][WARN ][o.e.m.f.FsHealthService  ] [6e913fcd73eacb190aa137e551d85dc6] health check of [var/data/nodes/0] took [10295ms] which is above the warn threshold of [5s]
[2021-08-27T01:04:24,251][WARN ][o.e.m.f.FsHealthService  ] [6e913fcd73eacb190aa137e551d85dc6] health check of [var/data/nodes/0] took [86111ms] which is above the warn threshold of [5s]
[2021-08-27T01:09:12,123][WARN ][o.e.m.f.FsHealthService  ] [6e913fcd73eacb190aa137e551d85dc6] health check of [var/data/nodes/0] took [100075ms] which is above the warn threshold of [5s]
[2021-08-27T01:13:42,773][WARN ][o.e.m.f.FsHealthService  ] [6e913fcd73eacb190aa137e551d85dc6] health check of [var/data/nodes/0] took [145478ms] which is above the warn threshold of [5s]
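
For context, the following is a rough sketch (hypothetical names, assuming only the behaviour implied by the thread dump and log lines above) of what such a filesystem health probe does: fsync a small file on each data path and warn when the write takes longer than the threshold.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class FsHealthProbeSketch {
    // Matches the 5s warn threshold visible in the log lines above.
    private static final long WARN_THRESHOLD_MILLIS = 5_000;

    void checkPath(Path dataPath) throws IOException {
        Path tempFile = Files.createTempFile(dataPath, ".fs-health", ".tmp");
        long start = System.currentTimeMillis();
        try (FileChannel channel = FileChannel.open(tempFile, StandardOpenOption.WRITE)) {
            channel.write(ByteBuffer.wrap(new byte[] { 1 }));
            channel.force(true); // the fsync that is slow/stuck in the dump above
        } finally {
            Files.deleteIfExists(tempFile);
        }
        long elapsedMillis = System.currentTimeMillis() - start;
        if (elapsedMillis > WARN_THRESHOLD_MILLIS) {
            System.out.printf("health check of [%s] took [%dms] which is above the warn threshold of [5s]%n",
                    dataPath, elapsedMillis);
        }
    }
}
```

Note that in the scenario above the slow probe only produced warnings; it did not cause the node to be failed, which is what proposal 2 below addresses.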

The leader trying to commit the new cluster state to disk, causing other operations to stall on the same mutex:

"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27981]" #122676 daemon prio=5 os_prio=0 cpu=156.25ms elapsed=879.32s tid=0x00007f12d86b8800 nid=0x6595 runnable  [0x00007f1283af8000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.fs.UnixNativeDispatcher.stat0([email protected]/Native Method)
	at sun.nio.fs.UnixNativeDispatcher.stat([email protected]/UnixNativeDispatcher.java:301)
	at sun.nio.fs.UnixFileAttributes.get([email protected]/UnixFileAttributes.java:70)
	at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes([email protected]/UnixFileAttributeViews.java:52)
	at sun.nio.fs.UnixFileSystemProvider.readAttributes([email protected]/UnixFileSystemProvider.java:149)
	at sun.nio.fs.LinuxFileSystemProvider.readAttributes([email protected]/LinuxFileSystemProvider.java:99)
	at java.nio.file.Files.readAttributes([email protected]/Files.java:1763)
	at java.nio.file.Files.size([email protected]/Files.java:2380)
	at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:242)
	at org.apache.lucene.index.SegmentCommitInfo.sizeInBytes(SegmentCommitInfo.java:217)
	at org.apache.lucene.index.MergePolicy.size(MergePolicy.java:558)
	at org.apache.lucene.index.MergePolicy.useCompoundFile(MergePolicy.java:540)
	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4518)
	- locked <0x0000000097051bd8> (a org.apache.lucene.index.IndexWriter)
	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4057)
	at org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:40)
	- locked <0x0000000097a2fc28> (a org.apache.lucene.index.SerialMergeScheduler)
	at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2158)
	at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3567)
	at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3545)
	at org.elasticsearch.gateway.PersistedClusterStateService$MetaDataIndexWriter.flush(PersistedClusterStateService.java:516)
	at org.elasticsearch.gateway.PersistedClusterStateService$Writer.updateMetaData(PersistedClusterStateService.java:689)
	at org.elasticsearch.gateway.PersistedClusterStateService$Writer.writeIncrementalStateAndCommit(PersistedClusterStateService.java:615)
	at org.elasticsearch.gateway.GatewayMetaState$LucenePersistedState.setLastAcceptedState(GatewayMetaState.java:543)
	at org.elasticsearch.cluster.coordination.CoordinationState.handlePublishRequest(CoordinationState.java:357)
	at org.elasticsearch.cluster.coordination.Coordinator.handlePublishRequest(Coordinator.java:353)
	- locked <0x0000000097a2f970> (a java.lang.Object)
	at org.elasticsearch.cluster.coordination.Coordinator$$Lambda$2310/0x0000000100c78040.apply(Unknown Source)
	at org.elasticsearch.cluster.coordination.PublicationTransportHandler.acceptState(PublicationTransportHandler.java:449)
	at org.elasticsearch.cluster.coordination.PublicationTransportHandler.handleIncomingPublishRequest(PublicationTransportHandler.java:432)
	at org.elasticsearch.cluster.coordination.PublicationTransportHandler.lambda$new$0(PublicationTransportHandler.java:100)
	at org.elasticsearch.cluster.coordination.PublicationTransportHandler$$Lambda$2313/0x0000000100c78c40.messageReceived(Unknown Source)
	......
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
	at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:767)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:701)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)

Note that other processing, such as the follower checker's node removal, is stuck on the same mutex:

 
"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27983]" #122679 daemon prio=5 os_prio=0 cpu=85.98ms elapsed=876.81s tid=0x00007f128411d800 nid=0x6599 waiting for monitor entry  [0x00007f12946cb000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$3.run(Coordinator.java:1348)
	- waiting to lock <0x0000000097a2f970> (a java.lang.Object)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:642)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)

 
"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27987]" #122698 daemon prio=5 os_prio=0 cpu=6.49ms elapsed=426.07s tid=0x00007f12d8428000 nid=0x705a waiting for monitor entry  [0x00007f1283dfc000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.elasticsearch.cluster.coordination.Coordinator.removeNode(Coordinator.java:248)
	- waiting to lock <0x0000000097a2f970> (a java.lang.Object)
	at org.elasticsearch.cluster.coordination.Coordinator$$Lambda$2329/0x0000000100c7e440.accept(Unknown Source)
	at org.elasticsearch.cluster.coordination.FollowersChecker$FollowerChecker$2.run(FollowersChecker.java:454)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:642)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)
 
   Locked ownable synchronizers:
	- <0x00000000804fb130> (a java.util.concurrent.ThreadPoolExecutor$Worker)
 
"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27988]" #122699 daemon prio=5 os_prio=0 cpu=0.25ms elapsed=423.34s tid=0x00007f127e36d800 nid=0x705b waiting for monitor entry  [0x00007f1282ae9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.elasticsearch.cluster.coordination.Coordinator$5.onResponse(Coordinator.java:1134)
	- waiting to lock <0x0000000097a2f970> (a java.lang.Object)
	at org.elasticsearch.cluster.coordination.PublicationTransportHandler$3.handleResponse(PublicationTransportHandler.java:286)
	at org.elasticsearch.cluster.coordination.PublicationTransportHandler$3.handleResponse(PublicationTransportHandler.java:277)
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1129)
	at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:222)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:701)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)
 
   Locked ownable synchronizers:
	- <0x000000008012f990> (a java.util.concurrent.ThreadPoolExecutor$Worker)
 
"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27989]" #122700 daemon prio=5 os_prio=0 cpu=6.40ms elapsed=423.33s tid=0x00007f12d842d000 nid=0x705c waiting for monitor entry  [0x00007f1289079000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.elasticsearch.cluster.coordination.Coordinator.removeNode(Coordinator.java:248)
	- waiting to lock <0x0000000097a2f970> (a java.lang.Object)
	at org.elasticsearch.cluster.coordination.Coordinator$$Lambda$2329/0x0000000100c7e440.accept(Unknown Source)
	at org.elasticsearch.cluster.coordination.FollowersChecker$FollowerChecker$2.run(FollowersChecker.java:454)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:642)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)
 
   Locked ownable synchronizers:
	- <0x00000000804fb590> (a java.util.concurrent.ThreadPoolExecutor$Worker)
 
"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27990]" #122701 daemon prio=5 os_prio=0 cpu=0.92ms elapsed=402.18s tid=0x00007f12d874e800 nid=0x715b waiting for monitor entry  [0x00007f128d58e000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.elasticsearch.cluster.coordination.Coordinator$5.onResponse(Coordinator.java:1134)
	- waiting to lock <0x0000000097a2f970> (a java.lang.Object)
	at org.elasticsearch.cluster.coordination.PublicationTransportHandler$3.handleResponse(PublicationTransportHandler.java:286)
	at org.elasticsearch.cluster.coordination.PublicationTransportHandler$3.handleResponse(PublicationTransportHandler.java:277)
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1129)
	at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:222)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:701)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)

"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27993]" #122705 daemon prio=5 os_prio=0 cpu=27.30ms elapsed=373.33s tid=0x00007f12d8750800 nid=0x71a7 waiting for monitor entry  [0x00007f12822e0000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.elasticsearch.cluster.coordination.Coordinator.getStateForMasterService(Coordinator.java:1040)
	- waiting to lock <0x0000000097a2f970> (a java.lang.Object)
	at org.elasticsearch.cluster.coordination.Coordinator.lambda$handleJoinRequest$7(Coordinator.java:500)
	at org.elasticsearch.cluster.coordination.Coordinator$$Lambda$3288/0x0000000101881c40.accept(Unknown Source)
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)
	at org.elasticsearch.transport.ClusterConnectionManager.connectToNode(ClusterConnectionManager.java:120)
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:375)
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:359)
	at org.elasticsearch.cluster.coordination.Coordinator.handleJoinRequest(Coordinator.java:499)
	at org.elasticsearch.cluster.coordination.Coordinator$$Lambda$2285/0x0000000100c71c40.accept(Unknown Source)
	at org.elasticsearch.cluster.coordination.JoinHelper.lambda$new$0(JoinHelper.java:136)
	at org.elasticsearch.cluster.coordination.JoinHelper$$Lambda$2288/0x0000000100c72840.messageReceived(Unknown Source)
.......
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
	at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:701)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)
 
   Locked ownable synchronizers:
	- <0x0000000080be46b8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
 
"elasticsearch[6e913fcd73eacb190aa137e551d85dc6][generic][T#27994]" #122706 daemon prio=5 os_prio=0 cpu=29.90ms elapsed=342.92s tid=0x00007f12d8781800 nid=0x722a waiting for monitor entry  [0x00007f12819d8000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.elasticsearch.cluster.coordination.Coordinator.removeNode(Coordinator.java:248)
	- waiting to lock <0x0000000097a2f970> (a java.lang.Object)
	at org.elasticsearch.cluster.coordination.Coordinator$$Lambda$2329/0x0000000100c7e440.accept(Unknown Source)
	at org.elasticsearch.cluster.coordination.FollowersChecker$FollowerChecker$2.run(FollowersChecker.java:454)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:642)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)

Proposal

  1. Enforce a timeout on the leader for committing a cluster state, including the time it takes to fsync the state to disk (see the sketch after this list).
  2. Improve health checks to fail the leader, or any node for that matter, if the FS health check takes longer than 60s.
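
A minimal sketch of what proposal 1 could look like, assuming the write-and-fsync can be handed to a separate thread so the commit can be abandoned once a deadline passes (TimedStateWriter and its API are hypothetical, not existing OpenSearch code):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedStateWriter {
    private final ExecutorService fsyncExecutor = Executors.newSingleThreadExecutor();

    // Commits the state, surfacing a failure if the commit (including fsync) exceeds the timeout,
    // so the leader can step down instead of blocking under the coordination mutex indefinitely.
    void commitWithTimeout(Runnable writeAndFsync, long timeoutMillis)
            throws InterruptedException, TimeoutException {
        Future<?> commit = fsyncExecutor.submit(writeAndFsync);
        try {
            commit.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            commit.cancel(true); // abandon the stuck fsync; the caller fails the publication
            throw e;
        } catch (ExecutionException e) {
            throw new IllegalStateException("cluster state commit failed", e.getCause());
        }
    }
}
```

One open question with this approach (part of why it is tricky, as noted in the comments below) is what the persisted state looks like if the abandoned fsync eventually completes after the leader has already stepped down.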


@Bukhtawar Bukhtawar added the bug (Something isn't working) and untriaged labels on Aug 27, 2021
@Bukhtawar Bukhtawar changed the title from "[BUG] IO Freeze on Leader cause cluster processing to get stuck" to "[BUG] IO Freeze on Leader cause cluster publication to get stuck" on Aug 27, 2021
@itiyamas
Contributor

Can you add the steps to reproduce this issue, or write a test case for it?

@itiyamas
Copy link
Contributor

itiyamas commented Aug 30, 2021

I don't see failing the FS health check fixing this problem, as once the health check fails, onLeaderFailure will block again on the same mutex. We need to somehow add timeouts to the cluster state persistence calls. I believe once that is fixed, this issue will go away, irrespective of the FS health check on stuck IO (essentially your point 1 in the proposal above). This change may be tricky to achieve. The FS health check on stuck IO is a good change in itself, but not a fix for this issue.

@Bukhtawar
Collaborator Author

The FS health check does proactive checks to identify a bad node and evict it from the cluster, rather than waiting for a cluster state update to remove the stuck leader. You rightly pointed out that this fix by itself is insufficient, since the mutex for onLeaderFailed will still get BLOCKED, which is what the first proposal tries to fix. Moreover, FS health checks would help with such situations across data nodes in general.
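
As a rough illustration of how a proactive check can act without touching the coordination mutex, here is a sketch (hypothetical names, assuming a separate watchdog thread) in which a stuck fsync flips a volatile health status that health-check responses and peer checks can read lock-free:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class FsHealthStatusSketch {
    enum Status { HEALTHY, UNHEALTHY }

    // Corresponds to the 60s threshold suggested in proposal 2.
    private static final long UNHEALTHY_THRESHOLD_MILLIS = TimeUnit.SECONDS.toMillis(60);

    private final AtomicLong fsyncStartedAtMillis = new AtomicLong(-1);
    private volatile Status status = Status.HEALTHY;

    void beforeFsync() {
        fsyncStartedAtMillis.set(System.currentTimeMillis());
    }

    void afterFsync() {
        fsyncStartedAtMillis.set(-1);
        status = Status.HEALTHY;
    }

    // Called from a watchdog thread that never takes the coordination mutex: if an fsync has been
    // in flight for more than the threshold, mark the node unhealthy so peers can evict it.
    void watchdogTick() {
        long startedAt = fsyncStartedAtMillis.get();
        if (startedAt != -1 && System.currentTimeMillis() - startedAt > UNHEALTHY_THRESHOLD_MILLIS) {
            status = Status.UNHEALTHY;
        }
    }

    Status status() {
        return status;
    }
}
```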

@anasalkouz
Member

Hi Bukhtawar, are you actively working on this issue? I see a pull request #1167 has already been submitted. Is anything else pending before we can close the issue?

@andrross
Member

andrross commented Nov 1, 2021

We don't seem to have any metrics on how frequently this issue manifests, but it appears to happen when the filesystem in use is not a local disk (e.g. EBS).

@anasalkouz
Member

@Bukhtawar is there anything else pending on this issue before we close it?
