
Query/Reply Timeout issue #1409

Open
mschuckmann opened this issue Sep 12, 2024 · 2 comments
Assignees: Charles-Schleich
Labels: bug (Something isn't working)

Comments

@mschuckmann

Describe the bug

I've observed strange timeout behavior while waiting for the response to a get operation: sometimes the callback is invoked with a timeout error and then, shortly afterwards, invoked again with the expected data. This behavior has been observed with all processes running on the same system and no known connectivity interruptions.

The process hosting the queryable also publishes data to other keys in response to being queried, and the frequency of the problem seems to be related to the number of subscribers of those other keys.

The typical test setup consists of 3 processes, all running locally on the same system:

  1. A zenohd process with the file system storage backend running to save anything published to system/**
  2. A controller process that hosts a queryable for system/thing/@new. Whenever this queryable is queried,
    a new thing is created with some data and published to system/thing/<id>, and then a reply containing the
    new thing's data is sent back to the requester (a sketch of this queryable is shown after this list).
  3. A client process that issues a get for system/thing/@new, waits for the reply, prints the
    new thing to the console, and then exits.

With just these three processes running, everything seems to run fine over many thousands of invocations of the client; however ...

If I start up a 4th (5th, 6th, ...) z_sub process that subscribes to system/**, strange things start happening to the client process. After some unknown number of iterations of running the client, it becomes blocked waiting for the reply until it times out after 10 seconds. At that point the callback is called with a Timeout error, which is quickly followed by a callback with a valid Sample containing the expected data for the newly created thing.
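To illustrate where this is observed, here is a minimal sketch of the client side (not the attached z_get.rs itself), again against a recent zenoh 1.x-style Rust API; the selector, the 10 s timeout, and the print formatting are placeholders:

use std::time::Duration;
use zenoh::Config;

#[tokio::main]
async fn main() {
    let session = zenoh::open(Config::default()).await.unwrap();
    // Issue the query and handle replies in a callback; when the bug triggers,
    // this callback first sees a Timeout error and shortly afterwards a valid Sample.
    session
        .get("system/thing/@new")
        .timeout(Duration::from_secs(10))
        .callback(|reply| match reply.result() {
            Ok(sample) => println!(">> Received ('{}')", sample.key_expr()),
            Err(err) => println!(">> Received (ERROR: '{err:?}')"),
        })
        .await
        .unwrap();
    // Keep the session alive long enough to observe both callbacks.
    tokio::time::sleep(Duration::from_secs(12)).await;
}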

When the failure occurs I can see the following in the client process log output:

2024-09-11T20:20:03.470704Z  WARN net-0 ThreadId(03) zenoh::net::routing::dispatcher::queries: Didn't receive final reply Face{0, 8cea9a494591375d0c2008365e5565da}:1 from Face{3, b35e038aac11eab155ae6e8509a1c2a7}: Timeout(10s)!

I have not yet identified anything abnormal in any of the other processes logs.

Additional Notes:

  • The zenohd process can be replaced with a z_sub process subscribed to system/**. The problem seems to be exacerbated by the number of subscriptions to data published by the queryable.
  • The number of iterations required to exhibit the problem seems random: sometimes it fails after a few hundred iterations, sometimes after a few thousand.
  • The problem was originally observed with C++ code based on zenoh-cpp, but has been reproduced with standard zenoh Rust code.

To reproduce

  1. Build zenoh examples using the modified z_get.rs and z_queryable.rs in the attached example.tar.gz file (or build zenoh-cpp examples with the z_get.cxx and z_queryable.cxx files). Note: example.tar.gz includes a custom-scouting-port.json5 file that can be optionally used with any of the commands below.
  2. In one shell launch z_queryable -k "system/thing/@new"
  3. In two or more other shells launch z_sub -k "system/**"
  4. In a final shell, start a loop that repeatedly calls z_get -s "system/thing/@new" until the process reports an error, e.g. in bash:
    while ./z_get -s "system/thing/@new"; do : ; done
  5. Allow the z_get loop to continue running until a query reports a timeout and exits with a non-zero value, stopping the loop. It may take a few thousand iterations to fail. Adding more subscriber processes may cause a failure to occur sooner.

When a failure occurs the output should look like this:

Opening session...
Sending Query 'system/thing/@new'...
>> Received (ERROR: 'Timeout')
>> Received ('system/thing/@new': 'Queryable from Rust!9696')

Note that the queryable response is received after the timeout message (which is very odd), and the count at the end of the response payload text matches the last response sent, as indicated in the output of the queryable process:

>> [Queryable ] Received Query 'system/thing/@new' with payload ''
>> [Queryable ] Responding ('system/thing/@new': 'Queryable from Rust!9695')
>> [Queryable ] Received Query 'system/thing/@new' with payload ''
>> [Queryable ] Responding ('system/thing/@new': 'Queryable from Rust!9696')

example.tar.gz

System info

  • Docker container based on Ubuntu 22.04 image (see Dockerfile in example.tar.gz), container is running in host network mode.
  • Host System OS Ubuntu 22.04.4
  • CPU 13th Gen Intel(R) Core(TM) i9-13950HX
  • Zenoh commit: a2eaf2e
  • Zenoh-c commit: eed8c27519924c79f27c74a68f4d5e4793e2fd30
  • Zenoh-cpp commit: 4ca772730b1362b0588db82352038d1723291bc6
@mschuckmann mschuckmann added the bug Something isn't working label Sep 12, 2024
@mschuckmann
Author

@Charles-Schleich Charles-Schleich self-assigned this Sep 27, 2024
@Charles-Schleich
Member

Hello @mschuckmann,
Thank you for your detailed issue.

Two things I've noted: the z_queryable.rs provided does not build against a2eaf2e, as the ability to clone a session was only implemented in a later commit, 77250fb.
After replacing the cloned session, I can confirm that it is possible to reproduce this behavior.
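For context, a rough sketch of the construct in question (this is not the attached z_queryable.rs, just an illustration of why Session: Clone matters here; the key and payload are made up):

use zenoh::Config;

#[tokio::main]
async fn main() {
    let session = zenoh::open(Config::default()).await.unwrap();
    // Cloning the session is only possible after commit 77250fb; against a2eaf2e
    // this is presumably the line that keeps the attached z_queryable.rs from building.
    let cloned = session.clone();
    // The clone can be moved into another task (or callback) that publishes
    // while the original session keeps serving the queryable.
    let handle = tokio::spawn(async move {
        let _ = cloned.put("system/thing/1", "thing data").await;
    });
    let _ = handle.await;
}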

Running the provided z_queryable.rs and z_get.rs on the latest commit on main (941f699) with the above setup does not result in the get reaching a timeout.

Could you please try your setup on this commit and confirm whether the timeout still occurs?
