initiating connection migration with tmb? #98
Comments
We have continued to see this message semi-frequently for several days, so I do not think we have an underlying network problem (we run on an enterprise-class DC network where any disturbances are resolved quickly), and I can't see any excessive GC pauses in our logs. Is this a message we need to act on, or is it not critical?
It seems to be triggered in the method checkHealth(long) of class com.oracle.coherence.common.internal.net.socketbus.BufferedSocketBus, but it is still not clear to me exactly which ack is timing out or what the code is trying to "do about it", i.e. the "migration".
Hi, these messages can be quite common depending on system size, load, and context (application busy, GC, ...), and as you can see they are harmless. Cluster members regularly exchange a heartbeat (health check) to ensure the cluster is whole. Every member watches the others, which generates a fair amount of ancillary activity, and once in a while that activity detects an issue. At that point we decide to "migrate" the connection, which merely means opening a new connection and ditching the old one. In some cases a "fresh" connection may resolve OS or JVM congestion issues. In most cases it is just a precaution that has no consequences. These heartbeat messages are purely for housekeeping, and they are timed relatively aggressively to weed out potential issues. Application/data traffic is not impacted. If you do see them frequently, however, they indicate that some tuning may be necessary: heap and GC pressure, network health, or native memory (OS/hardware), for example. Running network performance tests (https://docs.oracle.com/en/middleware/standalone/coherence/14.1.1.0/administer/performing-network-performance-test.html) can show you whether your network settings are OK; the results can be eye-opening. Let us know how it goes.
Thanks a lot for the info - will continue investigating!
On Fri, Mar 17, 2023, 17:49 Maurice Gamanho wrote:
Hi, these messages can be quite common depending on system size, load,
context (application busy, GC, ...) and as you can see they are harmless.
The heartbeat (health check) is exchanged regularly by cluster members to
ensure the cluster is whole. Everyone looks after each other, so that
generates a fair amount of ancillary activity for which once in a while we
detect issues. At that point we decide to "migrate" the connection, which
is merely opening a new connection and ditching the old one. In some cases
a "fresh" connection may resolve OS or JVM congestion issues. In most cases
it is just a precaution that has no consequences.
These heartbeat messages are purely for housekeeping and they are timed
relatively aggressively to weed out potential issues. Application/data
traffic is not impacted. If you do see them frequently, however, they are
an indication that some tuning may be necessary: heap and GC pressure,
network health, native memory (OS/hardware) for example. Running network
performance tests
<https://docs.oracle.com/en/middleware/standalone/coherence/14.1.1.0/administer/performing-network-performance-test.html>
can show you if network settings are ok, they can be eye opening.
Let us know how it goes.
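To make the idea of an "ack timeout" followed by a "migration" more concrete, here is a toy model in Python. It is only an illustration of the behaviour described in the reply above, not Coherence's implementation: the class `ToyConnection` and its methods are hypothetical names, and the 15-second default is taken from the "after 15s ack timeout" text in the log message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConnection:
    """Hypothetical model of one peer connection; not Coherence code."""

    ack_timeout_s: float = 15.0               # assumed from "15s ack timeout" in the log
    oldest_unacked_at: Optional[float] = None  # send time of the oldest unacknowledged message
    migrations: int = 0                        # how many times the connection was replaced

    def on_send(self, now: float) -> None:
        # Start the clock only if nothing is already waiting for an ack.
        if self.oldest_unacked_at is None:
            self.oldest_unacked_at = now

    def on_ack(self, now: float) -> None:
        # Simplification: one ack clears everything outstanding.
        self.oldest_unacked_at = None

    def check_health(self, now: float) -> bool:
        """Return True if this check decided to migrate the connection."""
        waiting = self.oldest_unacked_at is not None
        if waiting and now - self.oldest_unacked_at >= self.ack_timeout_s:
            # "Migration" here just means: count it and restart the clock,
            # standing in for opening a new connection and dropping the old one.
            self.migrations += 1
            self.oldest_unacked_at = now
            return True
        return False
```

In this sketch a single slow acknowledgement is enough to trigger a migration, which is consistent with the warning appearing occasionally on an otherwise healthy cluster.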
We are getting quite frequent warnings in our Coherence log on storage-enabled nodes mentioning "initiating connection migration with tmb". I would like some information about what this means on a more technical level, and perhaps suggestions about the most common causes.
I am guessing that a problem on the physical network, as well as long GC pauses, could result in more or less any network-related warning in Coherence, but are there other possible causes, and are there any tips on how to debug the problem further?
2023-03-13 20:17:56.310/95085.948 Oracle Coherence CE 14.1.1.0.12 (thread=SelectionService(channels=112, selector=MultiplexedSelector(sun.nio.ch.EPollSelectorImpl@3676ac27), id=1660451908), member=17): tmb://138.106.96.41:9001.49982 initiating connection migration with tmb://138.106.96.25:33316.40573 after 15s ack timeout health(read=false, write=true), receiptWait=Message "PartialValueResponse"
{
FromMember=Member(Id=17, Timestamp=2023-03-12 17:53:14.974, Address=138.106.96.41:9001, MachineId=46694, Location=site:sss.se.xxxxyyyy.com,machine:l4041p.sss.se.xxxxyyyy.com,process:391,member:l4041p-2, Role=storagenode)
FromMessageId=0
MessagePartCount=0
PendingCount=0
BufferCounter=1
MessageType=70
ToPollId=19827300
Poll=null
Packets
{
}
Service=PartitionedCache{Name=DistributedCache, State=(SERVICE_STARTED), Id=3, OldestMemberId=1, LocalStorage=enabled, PartitionCount=601, BackupCount=1, AssignedPartitions=18, BackupPartitions=19, CoordinatorId=1}
ToMemberSet=MemberSet(Size=1
Member(Id=179, Timestamp=2023-03-12 18:43:36.847, Address=138.106.96.25:33316, MachineId=48549, Location=site:sss.se.xxxxyyyy.com,machine:l4025p,process:3473,member:l4025p_11990, Role=scex)
)
NotifyDelivery=false
}: peer=tmb://138.106.96.25:33316.40573, state=ACTIVE, socket=MultiplexedSocket{Socket[addr=/138.106.96.25,port=38114,localport=9001]}, migrations=17, bytes(in=104371345, out=101784244), flushlock false, bufferedOut=0B, unflushed=0B, delivered(in=203177, out=197772), timeout(ack=0ns), interestOps=1, unflushed receipt=0, receiptReturn 0, isReceiptFlushRequired false, bufferedIn(), msgs(in=95922, out=99203/99206)
2023-03-13 20:17:56.310/95085.948 Oracle Coherence CE 14.1.1.0.12 (thread=SelectionService(channels=112, selector=MultiplexedSelector(sun.nio.ch.EPollSelectorImpl@3676ac27), id=1660451908), member=17): tmb://138.106.96.41:9001.49982 initiating connection migration with tmb://138.106.96.32:41070.40752 after 15s ack timeout health(read=true, write=false), receiptWait=null: peer=tmb://138.106.96.32:41070.40752, state=ACTIVE, socket=MultiplexedSocket{Socket[addr=/138.106.96.32,port=41070,localport=36388]}, migrations=5, bytes(in=95752773, out=99458811), flushlock false, bufferedOut=1.54KB, unflushed=0B, delivered(in=192506, out=187239), timeout(ack=0ns), interestOps=1, unflushed receipt=0, receiptReturn 0, isReceiptFlushRequired false, bufferedIn(), msgs(in=90667, out=93950/93953)
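One way to investigate further is to tally these warnings per peer, to see whether they cluster around particular hosts. The Python sketch below assumes the warning always has the shape shown in the two log entries above; the function names (`parse_migrations`, `peer_host`, `count_by_peer_host`) are hypothetical, and the regular expression is derived only from those two samples, so other Coherence versions may format the line differently.

```python
import re
from collections import Counter

# Matches the first line of the warning as it appears in the samples above.
MIGRATION_RE = re.compile(
    r"(?P<local>tmb://\S+) initiating connection migration with "
    r"(?P<peer>tmb://\S+) after (?P<timeout>\d+)s ack timeout "
    r"health\(read=(?P<read>true|false), write=(?P<write>true|false)\)"
)


def parse_migrations(lines):
    """Return one dict per migration warning found in the given log lines."""
    events = []
    for line in lines:
        match = MIGRATION_RE.search(line)
        if match is None:
            continue  # continuation lines of a multi-line message are skipped
        events.append({
            "local": match["local"],
            "peer": match["peer"],
            "timeout_s": int(match["timeout"]),
            # The flags are recorded exactly as logged, without interpretation.
            "health_read": match["read"] == "true",
            "health_write": match["write"] == "true",
        })
    return events


def peer_host(endpoint):
    """Extract the host part from an endpoint such as tmb://host:port.id."""
    return endpoint.split("//", 1)[1].split(":", 1)[0]


def count_by_peer_host(events):
    """Count migration warnings per peer host."""
    return Counter(peer_host(event["peer"]) for event in events)
```

For example, `count_by_peer_host(parse_migrations(open("coherence.log")))` (with a hypothetical file name) gives a per-host count. If a handful of hosts dominate, that points towards those machines or their network path rather than the cluster as a whole.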