
Query/Reply Timeout issue #1409

Open
mschuckmann opened this issue Sep 12, 2024 · 2 comments
Assignees: Charles-Schleich
Labels: bug (Something isn't working)

Comments

@mschuckmann

Describe the bug

I've observed strange timeout behavior while waiting for the response to a get operation: sometimes the callback is invoked with a timeout error and then, shortly afterwards, invoked again with the expected data. This behavior has been observed with all processes running on the same system and no known connectivity interruptions.

The process hosting the queryable also publishes data to other keys in response to being queried, and the frequency of the problem seems to be related to the number of subscribers of those other keys.

The typical test setup consists of 3 processes, all running locally on the same system:

  1. A zenohd process with the file system storage backend running to save anything published to system/**
  2. A controller process that hosts a queryable for system/thing/@new. Whenever this queryable is queried,
    a new thing is created with some data and published to system/thing/<id>, and then a reply containing the
    new thing's data is sent back to the requester (a sketch of this queryable is shown after this list).
  3. A client process that issues a get for system/thing/@new, waits for the reply, prints the
    new thing to the console, and then exits.

With just these three processes running, everything seems to run fine over many thousands of invocations of the client; however ...

If I start up a 4th (5th, 6th, ...) z_sub process that subscribes to system/**, strange things start happening to the client process. After some unknown number of iterations of running the client, it becomes blocked waiting for the reply until it times out after 10 seconds. At that point the callback is called with a Timeout error, which is quickly followed by a callback with a valid Sample containing the expected data for the newly created thing.
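To illustrate where this is observed, here is a minimal sketch of the client side (not the attached z_get.rs itself), again against a recent zenoh 1.x-style Rust API; the selector, the 10 s timeout, and the print formatting are placeholders:

use std::time::Duration;
use zenoh::Config;

#[tokio::main]
async fn main() {
    let session = zenoh::open(Config::default()).await.unwrap();
    // Issue the query and handle replies in a callback; when the bug triggers,
    // this callback first sees a Timeout error and shortly afterwards a valid Sample.
    session
        .get("system/thing/@new")
        .timeout(Duration::from_secs(10))
        .callback(|reply| match reply.result() {
            Ok(sample) => println!(">> Received ('{}')", sample.key_expr()),
            Err(err) => println!(">> Received (ERROR: '{err:?}')"),
        })
        .await
        .unwrap();
    // Keep the session alive long enough to observe both callbacks.
    tokio::time::sleep(Duration::from_secs(12)).await;
}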

When the failure occurs I can see the following in the client process log output:

2024-09-11T20:20:03.470704Z  WARN net-0 ThreadId(03) zenoh::net::routing::dispatcher::queries: Didn't receive final reply Face{0, 8cea9a494591375d0c2008365e5565da}:1 from Face{3, b35e038aac11eab155ae6e8509a1c2a7}: Timeout(10s)!

I have not yet identified anything abnormal in any of the other processes logs.

Additional Notes:

  • The zenohd process can be replaced with a z_sub process subscribed to system/**. The problem seems to be exacerbated by the number of subscriptions to data published by the queryable.
  • The number of iterations required to exhibit the problem seems random: sometimes it fails after a few hundred iterations, sometimes after a few thousand.
  • The problem was originally observed with C++ code based on zenoh-cpp, but has been reproduced with standard zenoh Rust code.

To reproduce

  1. Build zenoh examples using the modified z_get.rs and z_queryable.rs in the attached example.tar.gz file (or build zenoh-cpp examples with the z_get.cxx and z_queryable.cxx files). Note: example.tar.gz includes a custom-scouting-port.json5 file that can be optionally used with any of the commands below.
  2. In one shell launch z_queryable -k "system/thing/@new"
  3. In two or more other shells launch z_sub -k "system/**"
  4. In a final shell, start a loop that repeatedly calls z_get -s "system/thing/@new" until the process reports an error, e.g. in bash:
    while ./z_get -s "system/thing/@new"; do : ; done
  5. Allow the z_get loop to continue running until a query reports a timeout and exits with a non-zero value, stopping the loop. It may take a few thousand iterations to fail. Adding more subscriber processes may cause a failure to occur sooner.

When a failure occurs the output should look like this:

Opening session...
Sending Query 'system/thing/@new'...
>> Received (ERROR: 'Timeout')
>> Received ('system/thing/@new': 'Queryable from Rust!9696')

Note that the queryable response is received after the timeout message (which is very odd), and the count at the end of the response payload text matches the last response sent, as indicated in the output of the queryable process:

>> [Queryable ] Received Query 'system/thing/@new' with payload ''
>> [Queryable ] Responding ('system/thing/@new': 'Queryable from Rust!9695')
>> [Queryable ] Received Query 'system/thing/@new' with payload ''
>> [Queryable ] Responding ('system/thing/@new': 'Queryable from Rust!9696')

example.tar.gz

System info

  • Docker container based on Ubuntu 22.04 image (see Dockerfile in example.tar.gz), container is running in host network mode.
  • Host System OS Ubuntu 22.04.4
  • CPU 13th Gen Intel(R) Core(TM) i9-13950HX
  • Zenoh commit: a2eaf2e
  • Zenoh-c commit: eed8c27519924c79f27c74a68f4d5e4793e2fd30
  • Zenoh-cpp commit: 4ca772730b1362b0588db82352038d1723291bc6
@mschuckmann mschuckmann added the bug Something isn't working label Sep 12, 2024
@mschuckmann
Author

@Charles-Schleich Charles-Schleich self-assigned this Sep 27, 2024
@Charles-Schleich
Member

Hello @mschuckmann,
Thank you for your detailed issue.

Two things I've noted: the z_queryable.rs provided does not build against a2eaf2e, as the ability to clone a session was only implemented in a later commit, 77250fb.
After replacing the cloned session, I can confirm that it is possible to reproduce this behavior.
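For context, a rough sketch of the construct in question (this is not the attached z_queryable.rs, just an illustration of why Session: Clone matters here; the key and payload are made up):

use zenoh::Config;

#[tokio::main]
async fn main() {
    let session = zenoh::open(Config::default()).await.unwrap();
    // Cloning the session is only possible after commit 77250fb; against a2eaf2e
    // this is presumably the line that keeps the attached z_queryable.rs from building.
    let cloned = session.clone();
    // The clone can be moved into another task (or callback) that publishes
    // while the original session keeps serving the queryable.
    let handle = tokio::spawn(async move {
        let _ = cloned.put("system/thing/1", "thing data").await;
    });
    let _ = handle.await;
}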

Running the provided z_queryable.rs and z_get.rs on the latest commit on main (941f699) with the above setup does not result in the get reaching a timeout.

Could you please try your setup on this commit and confirm whether the timeout still occurs?
