Partition map error / Replica partition nodes not defined #329
This is the built-in validation code that runs when a partition map is rebuilt after a cluster event (node up/down, partitioning, etc.). The message means that the partition table was not fully populated after a tend, i.e. the client did not catch up with the full change during that tend. |
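To illustrate the kind of check described above, here is a minimal sketch of a partition map validation. The `Node` and `partitionMap` types and the map layout are hypothetical stand-ins chosen for this example; they are not the client's actual data structures.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for a cluster node handle.
type Node struct{ Name string }

// partitionMap is a hypothetical layout:
// namespace -> replica index -> partition id -> owning node.
type partitionMap map[string][][]*Node

// validate returns an error if any partition slot has no node assigned.
func (pm partitionMap) validate() error {
	missing := 0
	for _, replicas := range pm {
		for _, nodes := range replicas {
			for _, n := range nodes {
				if n == nil {
					missing++
				}
			}
		}
	}
	if missing > 0 {
		return fmt.Errorf("%d partition slots have no node defined", missing)
	}
	return nil
}

func main() {
	a := &Node{Name: "A"}
	// One namespace, one replica, four partitions; partition 2 is unassigned.
	pm := partitionMap{"test": {{a, a, nil, a}}}
	if err := pm.validate(); err != nil {
		fmt.Println("partition map incomplete:", err)
	}
}
```

A map in this state after a single tend would produce the kind of message reported in this issue; it only becomes a concern if the gaps persist across tends.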
Thanks for your reply @khaf; I can reliably reproduce this issue every time the Go application starts. I cannot find a corresponding metric in https://www.aerospike.com/docs/reference/metrics/, and the error message seems to indicate there might be a problem on the server side. Do you know how to proceed to identify the issue on the server side? |
This is unlikely to be a server issue, rather a client issue. Could you share some information regarding your server config, number of nodes, client policy and how you connect to the server? How many log lines correspond to this message? Is it a one off every time you start the client, or is it recurrent? |
I see it every time the service is started; let me share more details about the client/server configuration later. |
Thanks, looking forward. |
The Aerospike in use is the one provided by current Aerospike AMI on AWS, asinfo and configuration follow:
|
Hi @khaf
By itself, this should not be a problem, unless it happens over a continuous period of time.
We have encountered partition map issues (a partition with a nil node) at runtime more than once, which led to Get/BatchGet failures. Client.IsConnected() does not reflect whether a partition has a nil node. Correct me if my understanding is wrong. Would it be possible to add a flag that indicates health status?
Also, setPartitions() triggers partition map validation if updatePartitionMap is true and writes an Error-level log if there is any issue. But the later clstr.getPartitions().validate() only writes a Debug-level log if err is not nil. This may cause the host process to miss a partition map error, for example when a node is down or not actually connected (not sure if that is the right case). Could we change it to an Error-level log too?
if updatePartitionMap {
	clstr.setPartitions(partitionMap)
}
if err := clstr.getPartitions().validate(); err != nil {
	Logger.Debug("Error validating the cluster partition map after tend: %s", err.Error())
}
BTW, I have a technical question: during the tending in waitTillStabilized() at client initialization, most of the seed nodes are removed and the client tries to add them again in the next tend round. What is the purpose of that? Is there a design wiki/doc about the tend strategy? Thanks in advance. |
Isn't this a different issue than the one reported here? |
@gmazzotta |
Is there a possibility here that the client is operating inconsistently for an initial time window? If yes, should the client withhold operations until such multiple tend rounds are complete? |
@xqzhang2015 Thanks for the report, I'll take care of the log change before the weekend. For the other issue during the first tend, I'll have to look. |
Ah, no problem, and glad you have enough information to reproduce the issue @khaf! Many thanks for your work. I forgot to mention some details:
c, err := aero.NewClientWithPolicyAndHost(aero.NewClientPolicy(), nodeIPs...)
[...]
// set default policy for all reads without a specific policy
c.DefaultPolicy = aero.NewPolicy()
// allow reading from replica nodes instead of master only
c.DefaultPolicy.ReplicaPolicy = aero.MASTER_PROLES
// set default policy for all batch reads/writes without a specific policy
c.DefaultBatchPolicy = aero.NewBatchPolicy()
c.DefaultBatchPolicy.ReplicaPolicy = aero.MASTER_PROLES |
By the way, this is still happening (I see it when a service starts up):
|
This issue should have been mitigated significantly by the last release. Is it still happening? |
@khaf I am going to try and report back; question about the release: I get this error with
Would you consider merging a PR that adds |
The project does have a
|
Shall this be mirrored in README.md of both |
It is, right on the top. |
@khaf have not yet upgraded to use latest v5, but can confirm that with
(was using v3.1.0 before) Will report back about v5.3.0 in a couple days. |
I will not be able to provide any result here because I just discovered that in order to use the |
Hi there,
today I noticed this error in the logs (must be logged via Printf or Infof), which looks quite worrisome:
The mentioned URL does not provide help to troubleshoot this specific issue, and I could not find similar reports online.