
[BUG] Follower index goes to "failed" state when the retention lease period expires. #1466

Open
skumarp7 opened this issue Nov 26, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@skumarp7
Contributor

skumarp7 commented Nov 26, 2024

What is the bug?

When the retention lease period expires, some of the follower indices go to the "AutoPaused" state while a few go to the "Failed" state without any exception/reason.

Per the documentation, the indices should have gone to "AutoPaused". Is this an expected scenario?

bash:~$ curl -XGET "https://localhost:9200/_plugins/_replication/test-1/_status?pretty" -k
{
  "status" : "PAUSED",
  "reason" : "AutoPaused:  + [[test-1][0] - org.opensearch.index.seqno.RetentionLeaseNotFoundException - \"retention lease with ID [replication:default:vNY-ECIHQr-PcfjBYTJzDA:[test-1][0]] not found\"], ",
  "leader_alias" : "site-1",
  "leader_index" : "test-1",
  "follower_index" : "test-1"
}
bash:~$ curl -XGET "https://localhost:9200/_plugins/_replication/test-2/_status?pretty" -k
{
  "status" : "FAILED",
  "reason" : "",
  "leader_alias" : "site-1",
  "leader_index" : "test-2",
  "follower_index" : "test-2"
}
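For context, these leases are the soft-delete retention leases that the replication plugin places on the leader shards; my assumption (based on the core engine setting, please correct me if the plugin overrides it) is that their expiry window is controlled by index.soft_deletes.retention_lease.period on the leader index, which we checked with (leader host is a placeholder):

bash:~$ curl -XGET "https://<leader-host>:9200/test-1/_settings?include_defaults=true&pretty" -k | grep -A2 retention_lease

If nothing overrides it, this setting defaults to 12h.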
skumarp7 added the bug and untriaged labels Nov 26, 2024
@skumarp7
Contributor Author

Hi @ankitkala, @monusingh-1,

Can you confirm this behaviour? Early feedback is much appreciated.

@ankitkala
Member

All indices are expected to move to auto-pause. For the indices which moved to failed, can you check the application logs?

@skumarp7
Contributor Author

Hi @ankitkala,

For the failed indices, we see the following logs:

{"type":"log","host":"opensearch-data-2.default","container":"is-data","level":"ERROR","time": "2024-11-21T07:28:13.320Z","logger":"o.o.r.t.i.IndexReplicationTask","timezone":"UTC","marker":"[opensearch-data-2] [test-1] ","log":{"message":"Encountered exception while auto-pausing test-1"}}
org.opensearch.ResourceAlreadyExistsException: Index test-1 is already paused
        at org.opensearch.replication.action.pause.TransportPauseIndexReplicationAction.validatePauseReplicationRequest(TransportPauseIndexReplicationAction.kt:96) ~[?:?]
        at org.opensearch.replication.action.pause.TransportPauseIndexReplicationAction.access$validatePauseReplicationRequest(TransportPauseIndexReplicationAction.kt:45) ~[?:?]
        at org.opensearch.replication.action.pause.TransportPauseIndexReplicationAction$masterOperation$1.invokeSuspend(TransportPauseIndexReplicationAction.kt:71) ~[?:?]
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) ~[?:?]
        at kotlinx.coroutines.internal.DispatchedContinuationKt.resumeCancellableWith(DispatchedContinuation.kt:367) ~[?:?]
        at kotlinx.coroutines.intrinsics.CancellableKt.startCoroutineCancellable(Cancellable.kt:30) ~[?:?]
        at kotlinx.coroutines.intrinsics.CancellableKt.startCoroutineCancellable$default(Cancellable.kt:25) ~[?:?]
        at kotlinx.coroutines.CoroutineStart.invoke(CoroutineStart.kt:110) ~[?:?]
        at kotlinx.coroutines.AbstractCoroutine.start(AbstractCoroutine.kt:126) ~[?:?]
        at kotlinx.coroutines.BuildersKt__Builders_commonKt.launch(Builders.common.kt:56) ~[?:?]
        at kotlinx.coroutines.BuildersKt.launch(Unknown Source) ~[?:?]
        at kotlinx.coroutines.BuildersKt__Builders_commonKt.launch$default(Builders.common.kt:47) ~[?:?]
        at kotlinx.coroutines.BuildersKt.launch$default(Unknown Source) ~[?:?]
        at org.opensearch.replication.action.pause.TransportPauseIndexReplicationAction.masterOperation(TransportPauseIndexReplicationAction.kt:68) ~[?:?]
        at org.opensearch.replication.action.pause.TransportPauseIndexReplicationAction.masterOperation(TransportPauseIndexReplicationAction.kt:45) ~[?:?]
        at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction.clusterManagerOperation(TransportClusterManagerNodeAction.java:135) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction.masterOperation(TransportClusterManagerNodeAction.java:144) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction.clusterManagerOperation(TransportClusterManagerNodeAction.java:153) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction.lambda$doStart$3(TransportClusterManagerNodeAction.java:271) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction.doStart(TransportClusterManagerNodeAction.java:271) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction.tryAction(TransportClusterManagerNodeAction.java:206) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.support.RetryableAction$1.doRun(RetryableAction.java:139) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.support.RetryableAction.run(RetryableAction.java:117) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction.doExecute(TransportClusterManagerNodeAction.java:167) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction.doExecute(TransportClusterManagerNodeAction.java:79) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:218) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.indexmanagement.controlcenter.notification.filter.IndexOperationActionFilter.apply(IndexOperationActionFilter.kt:39) ~[?:?]
        at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:216) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.indexmanagement.rollup.actionfilter.FieldCapsFilter.apply(FieldCapsFilter.kt:118) ~[?:?]
        at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:216) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.security.filter.SecurityFilter.apply0(SecurityFilter.java:324) ~[?:?]
        at org.opensearch.security.filter.SecurityFilter.apply(SecurityFilter.java:165) ~[?:?]
        at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:216) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.support.TransportAction.execute(TransportAction.java:188) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:102) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:98) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptor$interceptHandler$1.messageReceived(RollupInterceptor.kt:114) ~[?:?]
        at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceivedDecorate(SecuritySSLRequestHandler.java:206) ~[?:?]
        at org.opensearch.security.transport.SecurityRequestHandler.messageReceivedDecorate(SecurityRequestHandler.java:317) ~[?:?]
        at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceived(SecuritySSLRequestHandler.java:154) ~[?:?]
        at org.opensearch.security.OpenSearchSecurityPlugin$6$1.messageReceived(OpenSearchSecurityPlugin.java:795) ~[?:?]
        at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.transport.InboundHandler.handleRequest(InboundHandler.java:271) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:144) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:127) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:770) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:175) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:150) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:115) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:95) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[?:?]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[?:?]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[?:?]
        at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475) ~[?:?]
        at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338) ~[?:?]
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[?:?]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[?:?]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[?:?]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
        at java.lang.Thread.run(Unknown Source) [?:?]

I'm not sure when this index first got paused. Before this, I could see that a pause of the ShardReplicationTask had been triggered:

{"type":"log","host":"opensearch-data-2.default","container":"is-data","level":"INFO","time": "2024-11-21T07:28:13.243Z","logger":"o.o.r.t.s.ShardReplicationTask","timezone":"UTC","marker":"[opensearch-data-2] [test-1][3] ","log":{"message":"opensearch[opensearch-data-2][replication_follower][T#9]: Received cancellation of ShardReplicationTask java.util.concurrent.CancellationException: Shard replication task received pause.
        at kotlinx.coroutines.ExceptionsKt.CancellationException(Exceptions.kt:22)
        at kotlinx.coroutines.CoroutineScopeKt.cancel(CoroutineScope.kt:295)
        at kotlinx.coroutines.CoroutineScopeKt.cancel$default(CoroutineScope.kt:295)
        at org.opensearch.replication.task.CrossClusterReplicationTask.cancelTask(CrossClusterReplicationTask.kt:88)
        at org.opensearch.replication.task.shard.ShardReplicationTask.access$cancelTask(ShardReplicationTask.kt:60)
        at org.opensearch.replication.task.shard.ShardReplicationTask$ClusterStateListenerForTaskInterruption.clusterChanged(ShardReplicationTask.kt:187)
        at org.opensearch.cluster.service.ClusterApplierService.callClusterStateListener(ClusterApplierService.java:627)
        at org.opensearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:614)
        at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:579)
        at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:486)
        at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:188)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854)
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283)
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
"}}

Is the index being marked as failed because of these subsequent pause attempts from different tasks?

Also, can you confirm the following behaviour:

Does any index that doesn't undergo any change within the retention lease period (i.e. the retention lease expires without any change to documents on the leader, and hence no change on the follower) go to auto-pause?
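For example (that the shard-level index stats expose these leases is my assumption from the core engine, and the host name is a placeholder), the lease the follower holds on each leader shard can be listed like this, so for an idle index we could watch whether it actually disappears once the period elapses:

bash:~$ curl -XGET "https://<leader-host>:9200/test-1/_stats?level=shards&filter_path=indices.*.shards.*.retention_leases&pretty" -k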

@skumarp7
Contributor Author

Hi @ankitkala,

Can you help me understand retention lease expiry? Is there any documentation I can refer to?
