Silo won't recover #9160
Do you see anything in the logs mentioning …?
We have some messages like this: … The only non-default value in ClusterMembershipOptions is DefunctSiloExpiration = 1 day.
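For reference, here is a minimal sketch of where such a value is typically set on the silo host builder (the hosting code below is assumed for illustration, not taken from the reporter's actual setup):

```csharp
// Minimal sketch (assumed host setup) showing where DefunctSiloExpiration
// is typically configured on the silo builder.
using System;
using Microsoft.Extensions.Hosting;
using Orleans.Configuration;
using Orleans.Hosting;

var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(siloBuilder =>
    {
        siloBuilder.Configure<ClusterMembershipOptions>(options =>
        {
            // Defunct (dead) silo entries stay in the membership table for
            // one day before they become eligible for cleanup.
            options.DefunctSiloExpiration = TimeSpan.FromDays(1);
        });
    })
    .Build();

await host.RunAsync();
```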
Ok, that Thread Pool delay is an indication that application performance is severely degraded for some reason. Profiling or a memory dump might help to indicate what. Can you tell me more about your scenario? Does your app log at a very high rate? Where are you running your app? If Kubernetes, what resources are provided to the pods, and do you have CPU limits set?
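As a rough way to observe this symptom outside of Orleans, the sketch below times how long a queued work item waits before the thread pool actually runs it; a consistently large delay points to thread pool starvation or CPU throttling. This is only an illustrative probe, not Orleans' own watchdog implementation:

```csharp
// Illustrative probe: measures the approximate delay between queueing a work
// item and the thread pool running it. Large, sustained values suggest thread
// pool starvation (e.g. blocked threads or CPU limits under Kubernetes).
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class ThreadPoolDelayProbe
{
    static async Task Main()
    {
        while (true)
        {
            var sw = Stopwatch.StartNew();
            // Task.Run queues the delegate to the thread pool; the elapsed
            // time until it starts approximates the scheduling delay.
            await Task.Run(() => sw.Stop());
            Console.WriteLine($"Thread pool delay: ~{sw.ElapsedMilliseconds} ms");
            await Task.Delay(TimeSpan.FromSeconds(5));
        }
    }
}
```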
We did a rolling restart of the silos with debug logging enabled. Multiple app login attempts were made to reproduce the timeout issue previously encountered. The following observations were made: …
We were able to get the environment working once we cleared out the membership table completely and rolled the nodes out again. Let us know if there is any related information that could help find the root cause.
I've been working with @pablo-salad on this issue. Whenever the silos get gummed up, this process seems to be the surefire way to get it working again:
Although this once occurred during a period of heavy load, over the past couple of weeks this problem has persisted after every new rolling deployment, while the servers have stayed at roughly 20% resource utilization. As noted in the last message from @pablo-salad, there is almost nothing interesting we can find in the debug-level logs from the Orleans system components. Would this issue be better served in the Orleans.Clustering.Kubernetes repo?
@seniorquico @pablo-salad thanks for the information. A little more information may be helpful:
This particular cluster is used for QA. We have a K8s autoscaler attached, and we allow it to scale down to 4 silos. It rarely scales up. When automated tests are running, we may be talking 1,000-2,000 grain activations over several minutes. When QA and developers sign in for something exploratory, it looks like the screenshot above: maybe 100-200 activations over several minutes. We're way overprovisioned, but we want to keep it as similar as we can to our production environment, and we have had issues in the past when we scale too low.

This "membership" issue (just what it appears to be to us) has been occurring on every rolling deployment to this QA cluster for the past ~4 weeks. We were finding silly workarounds, but ultimately have landed on the above procedure being the only reliable way to get it working every time. This problem has never occurred in the production cluster (knock on wood!).

Hardware-wise, it runs on the same GCP class of VMs as our production cluster, but on a fully separate GKE cluster/nodes. We use Terraform and Argo CD templates for everything; they're as close to each other as we think we can get.
Thanks for the additional info. When silos restart during the rolling restart, are they able to connect to the cluster? Do you have any startup tasks? Which version of .NET and Orleans are you running? Are you able to provide the contents of the membership table during an outage? |
I'm not quite sure what you mean (technically) by "connect to the cluster". The new silos that come up successfully add themselves to the membership table, and I haven't noticed any network errors in the logs. We have debug-level logging enabled for the Orleans.* namespaces. Is there an example message I could look for to confirm they've "connected"?

It may not be relevant, but we have used this gist as the basis of our K8s health checks (a rough sketch of that kind of check is below), and the health checks continue to pass during the outage: https://gist.github.com/ReubenBond/2881c9f519910ca4a378d148346e9114 Apologies, I don't know how the local grain call is meant to behave before the silo successfully connects to the cluster.
No.
Orleans 8.2.0. Compiled using .NET SDK 8.0.401 and using the .NET 8.0.8 runtime.
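For clarity, here is a minimal sketch of a local-grain-call health check in the spirit of the linked gist. The names ILocalHealthGrain and LocalGrainHealthCheck are hypothetical and the actual gist may differ; this is only an approximation of the idea, not the code in use:

```csharp
// Hedged sketch of a liveness probe that calls a grain from inside the silo.
// ILocalHealthGrain is a hypothetical grain interface used for illustration.
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Orleans;

public interface ILocalHealthGrain : IGrainWithIntegerKey
{
    Task Ping();
}

public sealed class LocalGrainHealthCheck : IHealthCheck
{
    private readonly IGrainFactory _grainFactory;

    public LocalGrainHealthCheck(IGrainFactory grainFactory) => _grainFactory = grainFactory;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // If the runtime and scheduler are responsive, this call returns
            // quickly; in practice the grain would use local placement so the
            // probe exercises the silo it runs on.
            await _grainFactory.GetGrain<ILocalHealthGrain>(0).Ping();
            return HealthCheckResult.Healthy();
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("Local grain call failed.", ex);
        }
    }
}
```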
It may be a few days until our next rolling deployment. I'll dump the K8s …
Apologies for the delayed reply. Here's a ZIP file of kubectl dumps of the membership table. In summary:
- before the rolling deployment (…)
- during the rolling deployment (…)
- after the rolling deployment (…)
EDIT: Just to confirm, I let it run for a couple of hours after the rolling deployment finished. The silos in the "active" state continued to show up as such in the membership table, but no messages were flowing. I scaled all of the deployments to zero, deleted all of the …
Context:
… DefunctSiloExpiration period).

Orleans.Runtime.OrleansException: Current directory at S10.216.2.133:11111:86745024 is not stable to perform the lookup for grainId user/05f621896f744c099f6136809969d981 (it maps to S10.216.3.135:11111:86731531, which is not a valid silo). Retry later.

Response did not arrive on time in 00:00:30 for message: Request [S10.216.3.136:11111:86745026 sys.svc.dir.cache-validator/10.216.3.136:11111@86745026]->[S10.216.2.133:11111:86745024 sys.svc.dir.cache-validator/10.216.2.133:11111@86745024] Orleans.Runtime.IRemoteGrainDirectory.LookUpMany(System.Collections.Generic.List`1[System.ValueTuple`2[Orleans.Runtime.GrainId,System.Int32]]) #1865509. About to break its promise.
Steps Taken:
Questions/Concerns: