Replies: 20 comments 4 replies
-
This is a serious problem for our production environment with its 37 streams and their 610 consumers. Are there any workarounds we could implement until the bug is fixed?
-
@scottf could you respond here? Thanks.
-
@scottf @derekcollison friendly reminder.
-
@ewirch Sorry for the delay getting on this. I have some questions.
Try something like this:
Another thing to look into is this set of example code that runs against 2.16.10. It demonstrates how to use heartbeats and the error listener, and what to expect in a few different situations, including when a server comes down, especially the ConnectionLostWhilePullActive.java class.
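As a concrete illustration (my own sketch, not the code from the linked examples), registering a connection listener and an error listener on the jnats Options builder looks roughly like this. Depending on the jnats version, some ErrorListener methods may already have default implementations, so you may not need to override them all; the heartbeatAlarm callback is the one that matters for a pull that stops receiving idle heartbeats.

```java
import io.nats.client.Connection;
import io.nats.client.Consumer;
import io.nats.client.ErrorListener;
import io.nats.client.JetStreamSubscription;
import io.nats.client.Nats;
import io.nats.client.Options;

public class ListenerSetup {
    public static void main(String[] args) throws Exception {
        Options options = new Options.Builder()
                .server("nats://localhost:4222")
                // connect / disconnect / reconnect events arrive here
                .connectionListener((conn, event) ->
                        System.out.println("connection event: " + event))
                .errorListener(new ErrorListener() {
                    @Override
                    public void errorOccurred(Connection conn, String error) {
                        System.out.println("error: " + error);
                    }

                    @Override
                    public void exceptionOccurred(Connection conn, Exception exp) {
                        System.out.println("exception: " + exp);
                    }

                    @Override
                    public void slowConsumerDetected(Connection conn, Consumer consumer) {
                        System.out.println("slow consumer detected");
                    }

                    @Override
                    public void heartbeatAlarm(Connection conn, JetStreamSubscription sub,
                                               long lastStreamSeq, long lastConsumerSeq) {
                        // fires when idle heartbeats stop arriving for an active pull
                        System.out.println("heartbeat alarm on " + sub.getSubject());
                    }
                })
                .build();

        Connection nc = Nats.connect(options);
        // ... create the JetStream context and the pull subscription as usual ...
        nc.close();
    }
}
```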
-
Hi @scottf. Thanks for the hints. Sadly they did not lead to a workaround we could use.
There is no "not restarting". The NATS server restarts gracefully. I never claimed the server would fail to restart; I'm restarting the server to test the reconnection behavior of the library. Here are my findings following your tips:
Trying … without code modifications shows the same behavior. Using pull with heartbeats and an expire timeout shows interesting new behavior.
NATS server output:
We see the first 10 heartbeats, and then nothing. Probably as expected as well? I initiated a pull, and the pull expired. Do I need to pull again? But if this is the case, I'd expect … I see no way to detect when the pull expired, so I simply moved the pull into the loop:

```java
var pro = PullRequestOptions.builder(10)
        .expiresIn(10_000)
        .idleHeartbeat(1_000)
        .build();
while (true) {
    try {
        System.out.println("listening...");
        subscription.pull(pro);
        var message = subscription.nextMessage(Duration.ZERO);
        if (message != null) {
            System.out.println("received msg: " + new String(message.getData()));
            message.ack();
        }
    } catch (JetStreamStatusException e) {
        // ...
    }
}
```

This time, the pull is re-initiated when it expires. We see a list of … It gets even more interesting when we restart the NATS server. The library gets into an infinite loop, printing the same line over and over again:
According to
-
OK, I saw this and was just asking for clarification about whether this was meant to show a problem or not.
-
@ewirch So next thing is this:
Zero (0) means block forever until there is a message. So even if the …

As far as re-issuing the pulls, you are going to need to track the time somehow. I'll work on a way to surface information about raw pulls when they complete; we just added that for the simplification API, but there is currently no way to get that info from the old API. The fetch is a good example of how to loop over a pull with expiration: basically, I track the time used and keep going until it exceeds the pull expiration or the full batch has been received, whichever comes first.

As for heartbeat alarms: when you get one via the error listener, that is how you know the server is down or there is a likely non-recoverable situation with the specific subscription. So there needs to be some logic in the code to recognize this state. At the moment I'm not sure if a pull will recover after a heartbeat warning; I'll have to put an example together.

Can I suggest you look into using the fetch or iterate API, or even the new simplification API? There is a lot of work done for you.
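To make the time-tracking idea concrete, here is a rough sketch (mine, not code from the library) that drains each pull until it expires or its batch is filled and then issues a fresh one, instead of blocking forever in nextMessage(Duration.ZERO). It reuses the subscription and options from the snippet earlier in the thread; exception handling, such as the JetStreamStatusException catch, is omitted.

```java
// Sketch: re-issue the pull once the previous one has expired or is drained.
// Batch size and durations mirror the earlier snippet in this thread.
PullRequestOptions pro = PullRequestOptions.builder(10)
        .expiresIn(10_000)
        .idleHeartbeat(1_000)
        .build();

while (true) {
    subscription.pull(pro);
    long deadline = System.currentTimeMillis() + 10_000;
    int received = 0;
    while (received < 10 && System.currentTimeMillis() < deadline) {
        // wait at most until this pull's expiration, never forever
        Message msg = subscription.nextMessage(
                Duration.ofMillis(Math.max(1, deadline - System.currentTimeMillis())));
        if (msg != null) {
            System.out.println("received msg: " + new String(msg.getData()));
            msg.ack();
            received++;
        }
    }
    // Falling out of the inner loop means this pull is spent, so loop and pull again.
    // A heartbeat alarm from the error listener is the hint that the subscription
    // itself is probably dead and needs to be re-created.
}
```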
-
So here are some examples using our new simplification API. (They are currently experimental, so signatures / the API can change.) These will make the happy path much easier. The error listener is still required to know when there are connection / disconnection / reconnection events as well as heartbeat alarms. These indicators can be used to help recognize that a subscription is probably dead and needs to be discarded and re-made, possibly on an entirely different connection.
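For illustration only, a consume-with-handler sketch using the simplification API might look roughly like this. This is my own sketch; the method names reflect the experimental API as I understand it and may change, and the stream/consumer names are placeholders.

```java
// Sketch of the (experimental) simplification API: look up an existing
// stream/consumer context and consume with a handler. Names are placeholders.
StreamContext streamContext = nc.getStreamContext("my-stream");
ConsumerContext consumerContext = streamContext.getConsumerContext("my-durable-consumer");

MessageConsumer consumer = consumerContext.consume(msg -> {
    System.out.println("received msg: " + new String(msg.getData()));
    msg.ack();
});

// The error listener on the Connection is still where connection events and
// heartbeat alarms show up; use them to decide when to stop this consumer
// and re-create it, possibly on a different connection.
```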
-
Also this piece of code: https://github.com/nats-io/java-nats-examples/blob/main/error-and-heartbeat-experiments/src/main/java/io/nats/ConnectionLostWhilePullActive.java
-
I am going to move this to the java repo since it's more about the Java client, not the server.
-
Is it? Please review the original post. After a server restart, the Java client correctly reconnects. It even sends subscription messages:
but the server still considers the consumer to have 0 waiting pulls. So is this really as expected for the server?
-
@scottf, do you think the NATS server behavior described in the original post (ignoring subscribes) is by design, or is it a bug?
-
@ewirch I think I understand. So if a server goes down and the client is disconnected, the pull is closed, which means that even if the client reconnects, the pull is probably expired and you don't get messages.
-
Thanks for the explanation, @scottf. In that case, I think the Java library's pull API is flawed. The library detects a network problem and reconnects. It should check whether the pull is still active, and either renew the pull or fail, so the caller can create a new pull.
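Until the library does this itself, one possible app-level workaround (purely my sketch, not an existing library feature) is to watch for RECONNECTED events and re-issue the pull afterwards, since the pull that was in flight does not survive the restart:

```java
// Hypothetical workaround: flag reconnects and renew the pull in the consuming loop.
AtomicBoolean reconnected = new AtomicBoolean(false);

Options options = new Options.Builder()
        .server("nats://localhost:4222")
        .connectionListener((conn, event) -> {
            if (event == ConnectionListener.Events.RECONNECTED) {
                reconnected.set(true);
            }
        })
        .build();

// ... later, inside the consuming loop ...
if (reconnected.getAndSet(false)) {
    subscription.pull(pro); // the pull issued before the restart is gone; issue a new one
}
```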
-
Hi @scottf, I moved to the new API.
-
I'm working on this whole problem starting today and into next week. Please be patient.
-
Any solution?
-
The heartbeat is the key. Could go as far as tracking connection events. There are a lot of variables to track; I'm debugging. I will have something by Monday.
-
@ewirch @kchrusciel @rdkirony Can you see if the latest snapshot (2.17.2-SNAPSHOT) solves your problems?
There is a similar issue to follow here: #997. I'm closing this one in favor of it. But here is the synopsis. The work done in these PRs... This affects the following consumers.
-
Hey, we had the same problem. As far as I can see, what happens is:
-
Defect
We use the Java client library to create a pull subscription and listen on it. When the NATS server is restarted, the library reconnects and even re-sends the subscriptions, but NATS won't send any messages to the client anymore.
nats-server -DV output:
I'm running docker.io/library/nats@sha256:58e483681983f87fdc738e980d582d4cc7218914c4f0056d36bab0c5acfc5a8b locally. Executing nats-server -DV in the running container gives me:

Versions of nats-server and affected client libraries used:
nats-server: 2.9.15
client: io.nats:jnats:2.16.9
OS/Container environment:
Podman 4.4.1 on Arch Linux.
Steps or code to reproduce the issue:
Starting with this basic listener:
(full project: test-project.zip)
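The listing itself isn't included in this copy of the post, but a pull listener along these lines matches the behavior described below (my reconstruction against jnats 2.16.x; the server URL, subject, and durable name are placeholders):

```java
import io.nats.client.Connection;
import io.nats.client.JetStream;
import io.nats.client.JetStreamSubscription;
import io.nats.client.Message;
import io.nats.client.Nats;
import io.nats.client.PullSubscribeOptions;

import java.time.Duration;

public class Listener {
    public static void main(String[] args) throws Exception {
        Connection nc = Nats.connect("nats://localhost:4222");
        JetStream js = nc.jetStream();

        PullSubscribeOptions options = PullSubscribeOptions.builder()
                .durable("test-consumer")
                .build();
        JetStreamSubscription subscription = js.subscribe("test.subject", options);

        subscription.pull(10);                      // register a pull with the server
        while (true) {
            // Duration.ZERO blocks until a message arrives
            Message message = subscription.nextMessage(Duration.ZERO);
            if (message != null) {
                System.out.println("received msg: " + new String(message.getData()));
                message.ack();
            }
        }
    }
}
```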
Start nats-server:
(you can also simply replace podman by docker)
Follow the logs in the background:
Start the sample project above. The app waits for messages after subscribing to the stream:
Observe in the nats-server output that NATS received the subscribe:
Check out the consumer info:
(I removed irrelevant lines)
So there is a pull registered.
Publish a message:
Observe in the app output that the message was received:
Now restart nats-server:
Again, observe in the nats-server output that NATS receives the subscriptions after reconnect:
Check out the consumer info:
NATS has no waiting pull anymore!
Publish a message:
Observe that the app does not receive the message (NATS does not deliver it). When you restart the app, the message will be received.
Expected result:
Messages are received after reconnect.
Actual result:
Messages are not received after reconnect.