
[BUG] Segment Replication stats throwing NPE when shards are unassigned or are in delayed allocation phase #11945

Open
shourya035 opened this issue Jan 19, 2024 · 4 comments · May be fixed by #14580
Assignees
Labels
bug Something isn't working good first issue Good for newcomers low hanging fruit Storage Issues and PRs relating to data and metadata storage

Comments

@shourya035
Member

Describe the bug

We are seeing NPEs from the NodesStats API when data nodes drop out of the cluster because of resource constraints. A NodesStats call fired while shards are being unassigned from the departing data nodes fails with this error:

java.lang.NullPointerException: Cannot invoke "org.opensearch.cluster.routing.AllocationId.getId()" because the return value of "org.opensearch.cluster.routing.ShardRouting.allocationId()" is null
    at org.opensearch.index.seqno.ReplicationTracker.lambda$isPrimaryRelocation$18(ReplicationTracker.java:1246)
    at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:178)
    at java.base/java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1602)
    at java.base/java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:129)
    at java.base/java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:527)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:513)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
    at java.base/java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:150)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.findAny(ReferencePipeline.java:652)
    at org.opensearch.index.seqno.ReplicationTracker.isPrimaryRelocation(ReplicationTracker.java:1247)
    at org.opensearch.index.seqno.ReplicationTracker.lambda$getSegmentReplicationStats$23(ReplicationTracker.java:1318)
    at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:178)
    at java.base/java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1850)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
    at org.opensearch.index.seqno.ReplicationTracker.getSegmentReplicationStats(ReplicationTracker.java:1321)
    at org.opensearch.index.shard.IndexShard.getReplicationStatsForTrackedReplicas(IndexShard.java:3120)
    at org.opensearch.index.shard.IndexShard.getReplicationStats(IndexShard.java:3125)
    at org.opensearch.index.shard.IndexShard.segmentStats(IndexShard.java:1500)
    at org.opensearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:228)
    at org.opensearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:146)
    at org.opensearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:66)
    at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:495)
    at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:469)
    at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:456)
    at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceivedDecorate(SecuritySSLRequestHandler.java:224)
    at org.opensearch.security.transport.SecurityRequestHandler.messageReceivedDecorate(SecurityRequestHandler.java:323)
    at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceived(SecuritySSLRequestHandler.java:172)
    at org.opensearch.security.OpenSearchSecurityPlugin$6$1.messageReceived(OpenSearchSecurityPlugin.java:797)
    at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptor$interceptHandler$1.messageReceived(RollupInterceptor.kt:113)
    at org.opensearch.performanceanalyzer.transport.PerformanceAnalyzerTransportRequestHandler.messageReceived(PerformanceAnalyzerTransportRequestHandler.java:43)
    at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106)
    at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:471)
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:917)
    at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)

This seems to originate in the SegmentReplicationStats code, specifically in the code block below, which tries to detect whether a primary shard is being relocated by cross-checking the current allocationId against all the allocationIds in the shard routing table:

private boolean isPrimaryRelocation(String allocationId) {
    Optional<ShardRouting> shardRouting = routingTable.shards()
        .stream()
        // NPE here: routing.allocationId() returns null for shards that are
        // unassigned or in delayed allocation
        .filter(routing -> routing.allocationId().getId().equals(allocationId))
        .findAny();
    return shardRouting.isPresent() && shardRouting.get().primary();
}
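A minimal sketch of a null-safe variant of the check, using hypothetical stand-in records for `ShardRouting`/`AllocationId` (not the real OpenSearch classes) so the idea is runnable in isolation. The fix is simply to filter out routings whose `allocationId()` is still null before dereferencing it:

```java
import java.util.List;

public class NullSafeRelocationCheck {
    // Hypothetical stand-in for org.opensearch.cluster.routing.AllocationId
    record AllocationId(String id) {}

    // Hypothetical stand-in for org.opensearch.cluster.routing.ShardRouting;
    // allocationId() is null while the shard is unassigned.
    record ShardRouting(AllocationId allocationId, boolean primary) {}

    static boolean isPrimaryRelocation(List<ShardRouting> shards, String allocationId) {
        return shards.stream()
            // Skip unassigned shards whose allocationId is still null,
            // instead of dereferencing it and throwing an NPE.
            .filter(routing -> routing.allocationId() != null
                && routing.allocationId().id().equals(allocationId))
            .findAny()
            .map(ShardRouting::primary)
            .orElse(false);
    }

    public static void main(String[] args) {
        List<ShardRouting> shards = List.of(
            new ShardRouting(null, true),                    // unassigned shard
            new ShardRouting(new AllocationId("a1"), true)); // assigned primary
        System.out.println(isPrimaryRelocation(shards, "a1")); // true, no NPE
        System.out.println(isPrimaryRelocation(shards, "zz")); // false
    }
}
```

In the real `ReplicationTracker` the same guard would go into the existing stream filter; the actual change is whatever the linked PR (#14580) implements.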

Related component

Storage

To Reproduce

N/A

Expected behavior

NodesStats API should not fail even during transient data node drops


@shourya035 shourya035 added bug Something isn't working untriaged labels Jan 19, 2024
@github-actions github-actions bot added the Storage Issues and PRs relating to data and metadata storage label Jan 19, 2024
@peternied
Member

[Triage - attendees 1 2 3]
@shourya035 Thanks for filing this bug with great detail - could you create a pull request to resolve it?

@sachinpkale sachinpkale moved this from 🆕 New to Ready To Be Picked in Storage Project Board May 30, 2024
@rampreeth

Hi, I'd be happy to pick this up if it's not already addressed.

@peternied
Member

@rampreeth Thanks - assigned this issue to you!

@rampreeth

rampreeth commented Jun 27, 2024

Hi @peternied , thanks.
I have created a PR here.
I tried adding a unit test, but it looks like allocationId can never be null there. I'm a bit unsure how to test this change. Would you be able to help with that?

Labels
bug Something isn't working good first issue Good for newcomers low hanging fruit Storage Issues and PRs relating to data and metadata storage
Projects
Status: Ready To Be Picked
4 participants