Issue on windows when closing several subscriptions #1249
What version are you using? Is this the Java client?
Java AeronMediaDriver, v 1.31.1 (3 threads spinning),
As of 1.28.0 the Java client does an async close of all resources. I'll check if this has been ported to the C# client. Are the log buffers set to … What type of drives are in those Windows machines, and how often do you optimize them?
We're currently using sparse=true and pre.touch=false. They are SSDs in an HPE RAID controller. I don't think it's possible to TRIM them. We think that the problem is actually the media driver blocking rather than the client. We can see that it stops sending status messages for unrelated subscriptions when this happens. Or is it possible that a blocking client could cause this somehow? I think status messages are still sent when clients aren't making progress.
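For anyone trying to reproduce this, here is a minimal sketch of how the settings mentioned above could be applied as system properties before the driver starts. The full property names (`aeron.term.buffer.sparse.file`, `aeron.pre.touch.mapped.memory`) are my reading of the shorthand "sparse" and "pre.touch" and should be checked against the Aeron version in use; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DriverLogBufferSettings {
    // Assumed full property names for the shorthand used in the comment above;
    // verify them against the Configuration class of your Aeron version.
    static final String SPARSE_FILE_PROP = "aeron.term.buffer.sparse.file";
    static final String PRE_TOUCH_PROP = "aeron.pre.touch.mapped.memory";

    /** The reported settings: sparse term buffer files on, pre-touch off. */
    static Map<String, String> reportedSettings() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put(SPARSE_FILE_PROP, "true");
        settings.put(PRE_TOUCH_PROP, "false");
        return settings;
    }

    /** Applies the settings as system properties; call before launching the driver. */
    static void apply(Map<String, String> settings) {
        settings.forEach(System::setProperty);
    }

    public static void main(String[] args) {
        apply(reportedSettings());
        System.out.println(SPARSE_FILE_PROP + "=" + System.getProperty(SPARSE_FILE_PROP));
        System.out.println(PRE_TOUCH_PROP + "=" + System.getProperty(PRE_TOUCH_PROP));
    }
}
```

The same values can be passed as `-D` flags on the JVM command line instead.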
I believe you are right that this is the media driver taking too long to service the subscription changes. Can you provide more details about how many active streams, term lengths, and how many subscriptions are closed and re-opened at the same time? This will help with creating a test so we can investigate.
Also can you provide details of what Windows version and Java version this is running on?
The Windows version is Windows Server 2016 and the Java version 1.8.0_121. This is the full list of subscriptions and publications from one of the affected machines. We've seen an issue in different situations. Closing and re-opening the subscriptions with the stream id 70004 and roughly the same number with the stream id 800XX at the same time can cause the issue.
SUBSCRIPTIONS
PUBLICATIONS
Thanks @chrisejones. Are the publications concurrent or exclusive?
I've updated the table above
Did you mean 1800XX? Also are they exclusive? We have been simulating this and found a few things but nothing to the extent of the pauses you are describing.
To help avoid such issues we have made some changes. The number of commands processed per work cycle has been reduced from 10 to 2. We have also moved the sending of status messages on an interval from the conductor to the receiver thread, which means this can continue if the conductor pauses due to file IO. While doing this, it was discovered that the cached clocks could become stale when the conductor takes a long pause. The sender and receiver now have their own local cached clocks so that heartbeats and status messages continue. It would be good if you could test the head of the main repository to see if this fixes the issue for you. We would also recommend you upgrade your Java version from 1.8.0_121 to 1.8.0_282 as many bugs have since been fixed.
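To illustrate why a local cached clock helps, here is a small simulation, not the driver's actual code. It assumes a receiver that sends a status message whenever its cached time passes a deadline, and a conductor that stops updating its clock after a simulated stall. All names and the tick values are hypothetical.

```java
public class LocalCachedClockSketch {
    /** A time value cached by whichever thread calls update(). */
    static final class CachedClock {
        private long nowNs;

        void update(long nowNs) {
            this.nowNs = nowNs;
        }

        long nanoTime() {
            return nowNs;
        }
    }

    /**
     * Simulates duty cycles at 100-tick steps from 0 to 1000. The conductor stops
     * updating its clock from tick 300 onward (a stand-in for a long file IO pause).
     * Returns how many status messages the receiver manages to send.
     */
    static int countStatusMessages(boolean receiverOwnsClock) {
        final long intervalNs = 200;
        final long conductorStallsAtNs = 300;
        CachedClock conductorClock = new CachedClock();
        // Either the receiver has its own clock or it reads the conductor's.
        CachedClock receiverClock = receiverOwnsClock ? new CachedClock() : conductorClock;
        long deadlineNs = intervalNs;
        int sent = 0;

        for (long realNs = 0; realNs <= 1000; realNs += 100) {
            if (realNs < conductorStallsAtNs) {
                conductorClock.update(realNs); // conductor duty cycle
            }
            if (receiverOwnsClock) {
                receiverClock.update(realNs); // receiver duty cycle refreshes its own clock
            }
            long nowNs = receiverClock.nanoTime();
            if (nowNs - deadlineNs >= 0) {
                sent++; // stand-in for sending a status message
                deadlineNs = nowNs + intervalNs;
            }
        }
        return sent;
    }

    public static void main(String[] args) {
        System.out.println("shared clock: " + countStatusMessages(false)); // prints "shared clock: 1"
        System.out.println("local clock: " + countStatusMessages(true));   // prints "local clock: 5"
    }
}
```

With the shared clock, the receiver's view of time freezes when the conductor stalls, so the deadline never appears to pass again. With its own clock, the interval keeps firing regardless of the conductor.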
I wonder if this could possibly be related to name resolution for DNS. Are you using names rather than IP addresses? Can you use …
* [Java] Go immediately from LEADER_LOG_REPLICATION to LEADER_REPLAY and don't update the leader's commit position. * [Java] Move replication deadline forward if newLeadershipTerms are being received. * [Java] Change LEADER_LOG_REPLICATION back to waiting for followers to replication, but don't update leader commitPosition until LEADER_REPLAY has completed.
Hello,
We're seeing an issue only on Windows where closing several subscriptions at once seems to cause an unrelated subscription to become unavailable. We believe this is because the media driver may be spending too long flushing the log buffers for those subscriptions to disk, and this is causing the aeron.image.liveness.timeout to be hit.
We've tried using a ramdisk as recommended in the Aeron documentation. That made closing subscriptions about twice as fast, which was not enough to solve this issue.
Is there a solution to this other than reducing the number of streams? Would it be possible to make closing subscriptions asynchronous?
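As a rough picture of what "asynchronous" could mean here, the sketch below hands slow close operations to a background executor so the calling thread is not blocked while they run. This is plain JDK code under my own assumptions, not how Aeron implements closing; `AsyncCloseSketch`, `closeAllAsync`, and the 50 ms sleep standing in for a slow flush are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncCloseSketch {
    /** Submits every close to the given executor and returns immediately. */
    static CompletableFuture<Void> closeAllAsync(
        List<? extends AutoCloseable> resources, ExecutorService closer) {
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (AutoCloseable resource : resources) {
            pending.add(CompletableFuture.runAsync(() -> {
                try {
                    resource.close();
                } catch (Exception e) {
                    throw new IllegalStateException(e);
                }
            }, closer));
        }
        return CompletableFuture.allOf(pending.toArray(new CompletableFuture[0]));
    }

    /** Closes three deliberately slow resources and returns how many were closed. */
    static int closeThreeSlowResources() {
        ExecutorService closer = Executors.newSingleThreadExecutor();
        AtomicInteger closed = new AtomicInteger();
        List<AutoCloseable> resources = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            resources.add(() -> {
                Thread.sleep(50); // stand-in for a slow flush or unmap
                closed.incrementAndGet();
            });
        }
        try {
            CompletableFuture<Void> done = closeAllAsync(resources, closer);
            // The caller could keep servicing other work here instead of waiting.
            done.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            closer.shutdown();
        }
        return closed.get();
    }

    public static void main(String[] args) {
        System.out.println("closed=" + closeThreeSlowResources());
    }
}
```

The trade-off is that the caller only learns about close failures through the returned future, so something still has to observe it.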