You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I've observed strange timeout behavior while waiting for the response of a get operation, sometimes the callback will be called with a timeout error and then quickly followed with another callback with the expected data. This behavior has been observed with all processes running on the same system and no known connectivity interruptions.
The process hosting the queryable does publish data to other keys in response to being queried and the frequency of the problem does seem to be related to number of subscribers of those other keys.
The typical test setup consists of 3 processes, all running locally on the same system:
A zenohd process with the file system storage backend running to save anything published to system/**
A controller process that has a queryable for system/thing/@new, whenever this queryable is queried
a new thing is created with some data and published to system/thing/<id> and then the reply is sent back to the
requester with the new thingy data.
A client process that issues a get for system/thing/@new then waits for the reply and prints the
new thing to the console and then exits.
With just these 3 processes running everything seems to run fine over many thousands of invocations of the client, however ...
If I start up a 4th (5th, 6th, ...) z_sub process that subscribes to system/** then strange things start happening to the client process. After some unknown number of iterations of running the client it will become blocked
waiting for the reply until it times out after 10 seconds, at which point the callback is called with a Timeout error which is then quickly followed by a callback with a valid Sample with the expected data for the newly created thing.
When the failure occurs I can see the following in the client process log output:
2024-09-11T20:20:03.470704Z WARN net-0 ThreadId(03) zenoh::net::routing::dispatcher::queries: Didn't receive final reply Face{0, 8cea9a494591375d0c2008365e5565da}:1 from Face{3, b35e038aac11eab155ae6e8509a1c2a7}: Timeout(10s)!
I have not yet identified anything abnormal in any of the other processes logs.
Additional Notes:
The zenohd process can be replaced with a z_sub process subscribed to system/**. The problem seems to be exacerbated by the number of subscriptions to data published by the queryable.
The number of iterations required to exhibit the problem seems random, sometimes it fails after a few hundred iterations sometimes after a few thousand iterations.
The problem was originally observed with C++ code based on zenoh-cpp, but has been reproduced with standard zenoh Rust code.
To reproduce
Build zenoh examples using the modified z_get.rs and z_queryable.rs in the attached example.tar.gz file (or build zenoh-cpp examples with the z_get.cxx and z_queryable.cxx files). Note: example.tar.gz includes a custom-scouting-port.json5 file that can be optionally used with any of the commands below.
In one shell launch z_queryable -k "system/thing/@new"
In two or more other shells launch z_sub -k "system/**"
In a final shell start a loop to repeatedly call z_get -s "system/thing/@new" until the process reports an error, eg. in bash: while ./z_get -s "system/thing/@new"; do : ; done
Allow the z_get loop to continue to run until a query reports a timeout and exits with a non-zero value stopping the loop. It may take a few thousand iterations to fail. Adding more subscriber processes may cause a failure to occur sooner.
When a failure occurs the output should look like this:
Opening session...
Sending Query 'system/thing/@new'...
>> Received (ERROR: 'Timeout')
>> Received ('system/thing/@new': 'Queryable from Rust!9696')
Note that queryable response is received after the timeout message (which is very odd) and the count at the end of the response payload text will match the last response sent as indicated in the output of the queryable process:
>> [Queryable ] Received Query 'system/thing/@new' with payload ''
>> [Queryable ] Responding ('system/thing/@new': 'Queryable from Rust!9695')
>> [Queryable ] Received Query 'system/thing/@new' with payload ''
>> [Queryable ] Responding ('system/thing/@new': 'Queryable from Rust!9696')
Hello @mschuckmann,
Thank you for your detailed issue.
Two things i've noted: the z_queryable.rs provided does not build against a2eaf2e
As the ability to clone a session was only implemented in a later commit. 77250fb
replacing the cloned session, i can confirm that it is possible to reproduce this behavior.
Running the provided z_queryable.rs and z_get.rs on the latest commit on main 941f699 with the above setup does not result in get reaching a timeout.
Please can you try your setup on this commit and confirm if the timeout still occurs.
Describe the bug
I've observed strange timeout behavior while waiting for the response of a
get
operation, sometimes the callback will be called with a timeout error and then quickly followed with another callback with the expected data. This behavior has been observed with all processes running on the same system and no known connectivity interruptions.The process hosting the queryable does publish data to other keys in response to being queried and the frequency of the problem does seem to be related to number of subscribers of those other keys.
The typical test setup consists of 3 processes, all running locally on the same system:
zenohd
process with the file system storage backend running to save anything published tosystem/**
controller
process that has a queryable forsystem/thing/@new
, whenever this queryable is querieda new thing is created with some data and published to
system/thing/<id>
and then the reply is sent back to therequester with the new thingy data.
get
forsystem/thing/@new
then waits for the reply and prints thenew thing to the console and then exits.
With just these 3 processes running everything seems to run fine over many thousands of invocations of the
client
, however ...If I start up a 4th (5th, 6th, ...)
z_sub
process that subscribes tosystem/**
then strange things start happening to the client process. After some unknown number of iterations of running the client it will become blockedwaiting for the reply until it times out after 10 seconds, at which point the callback is called with a
Timeout
error which is then quickly followed by a callback with a valid Sample with the expected data for the newly created thing.When the failure occurs I can see the following in the client process log output:
I have not yet identified anything abnormal in any of the other processes logs.
Additional Notes:
zenohd
process can be replaced with az_sub
process subscribed tosystem/**
. The problem seems to be exacerbated by the number of subscriptions to data published by the queryable.To reproduce
z_get.rs
andz_queryable.rs
in the attachedexample.tar.gz
file (or build zenoh-cpp examples with thez_get.cxx
andz_queryable.cxx
files). Note:example.tar.gz
includes acustom-scouting-port.json5
file that can be optionally used with any of the commands below.z_queryable -k "system/thing/@new"
z_sub -k "system/**"
z_get -s "system/thing/@new"
until the process reports an error, eg. in bash:while ./z_get -s "system/thing/@new"; do : ; done
z_get
loop to continue to run until a query reports a timeout and exits with a non-zero value stopping the loop. It may take a few thousand iterations to fail. Adding more subscriber processes may cause a failure to occur sooner.When a failure occurs the output should look like this:
Note that queryable response is received after the timeout message (which is very odd) and the count at the end of the response payload text will match the last response sent as indicated in the output of the queryable process:
example.tar.gz
System info
Dockerfile
in example.tar.gz), container is running in host network mode.The text was updated successfully, but these errors were encountered: