[Bug] Zephyr pub/sub example not communicating #156

Closed
dcorbeil opened this issue Jan 27, 2023 · 12 comments
Labels
bug Something isn't working

Comments

@dcorbeil
Contributor

Describe the bug

Hi all,

I am trying to get the pub/sub example to work on the Nucleo-F429zi. I already have a Zephyr CMake project that I integrated zenoh-pico into and got it to build without major issues. The problem comes when I try to get my two F429zi boards to communicate through the pub and sub examples. I use the prj.conf example from docs/zephyr/nucleo_f767zi since the two boards are pretty similar to each other. There are some configs that Zephyr turns off because their dependencies are not met (which might or might not be related to my issue); that applies to the nucleo_f767zi as well, so I commented out ETH_NATIVE_POSIX, ETH_NATIVE_POSIX_RANDOM_MAC and NET_SOCKETS_POSIX_NAMES. I then set my two devices as peers by setting #define CLIENT_OR_PEER 1. I tried connecting my two boards either directly to each other or both to a network switch. In both cases they can ping each other, so I know that at least some communication is working. Then when I start the publisher and subscriber, both declare their publisher and subscriber correctly, and the publisher is publishing its messages, but the subscriber doesn't receive anything.

This is my first time playing with zenoh so I might be missing some important detail. Are there tips and tricks to help debug what's happening? Or rather not happening.

I also tried one F429zi in client mode connected directly to my host computer with a router running, but had a different problem. The router was built from zenoh's latest commit as of the day of writing this. The client F429 wasn't able to create a zenoh session.

Here is my prj.conf

# Important for thread debugging info by doing:
#       info threads
# To switch threads:
#       thread <thread ID>
CONFIG_DEBUG_THREAD_INFO=y

# Kernel options
CONFIG_MAIN_STACK_SIZE=8192
CONFIG_HEAP_MEM_POOL_SIZE=8192
CONFIG_ENTROPY_GENERATOR=y
CONFIG_TEST_RANDOM_GENERATOR=y
CONFIG_INIT_STACKS=y

# Generic library options
CONFIG_NEWLIB_LIBC=y
CONFIG_NEWLIB_LIBC_NANO=n
CONFIG_POSIX_API=y

# Generic networking options
CONFIG_SERIAL=y
CONFIG_NETWORKING=y

CONFIG_NET_L2_ETHERNET=y
CONFIG_ETH_NATIVE_POSIX=y
CONFIG_ETH_NATIVE_POSIX_RANDOM_MAC=y

CONFIG_NET_IPV4=y
CONFIG_NET_IPV6=y
CONFIG_NET_TCP=y
CONFIG_NET_ARP=y
CONFIG_NET_UDP=y
CONFIG_NET_DHCPV4=n
CONFIG_NET_SHELL=y
CONFIG_NET_MGMT=y
CONFIG_NET_MGMT_EVENT=y
CONFIG_DNS_RESOLVER=y

# Sockets
CONFIG_NET_SOCKETS=y
CONFIG_NET_SOCKETS_POSIX_NAMES=y
CONFIG_NET_SOCKETS_POLL_MAX=4

# Network driver config
CONFIG_TEST_RANDOM_GENERATOR=y
CONFIG_INIT_STACKS=y

# Network buffers
CONFIG_NET_PKT_RX_COUNT=16
CONFIG_NET_PKT_TX_COUNT=16
CONFIG_NET_BUF_RX_COUNT=80
CONFIG_NET_BUF_TX_COUNT=80
CONFIG_NET_CONTEXT_NET_PKT_POOL=y

# Network address config
CONFIG_NET_CONFIG_SETTINGS=y
CONFIG_NET_CONFIG_NEED_IPV4=y
CONFIG_NET_IPV4_IGMP=y
CONFIG_NET_CONFIG_NEED_IPV6=y
CONFIG_NET_IPV6_MLD=y
CONFIG_NET_CONFIG_MY_IPV4_ADDR="192.0.2.4"
# CONFIG_NET_CONFIG_MY_IPV4_NETMASK="255.240.0.0"
# CONFIG_NET_CONFIG_MY_IPV4_GW="192.0.2.10"
#CONFIG_NET_CONFIG_PEER_IPV4_ADDR="192.168.10.10"
CONFIG_NET_CONFIG_MY_IPV6_ADDR="2001:db8::2"
CONFIG_NET_CONFIG_PEER_IPV6_ADDR="2001:db8::1"

# IP address options
CONFIG_NET_IF_UNICAST_IPV6_ADDR_COUNT=3
CONFIG_NET_IF_MCAST_IPV6_ADDR_COUNT=4
CONFIG_NET_MAX_CONTEXTS=10

# Logging
CONFIG_NET_LOG=y
CONFIG_LOG=y
CONFIG_NET_STATISTICS=y
CONFIG_PRINTK=y

The only difference in the prj.conf between my two boards is the IPv4 address.
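For example (the concrete address here is just a placeholder, not my actual second board's address), the second board's prj.conf only changes the static address line:

# Second board: identical prj.conf except for the static IPv4 address
CONFIG_NET_CONFIG_MY_IPV4_ADDR="192.0.2.5"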

To reproduce

  1. Check out the specified commits for Zephyr and zenoh-pico
  2. Start subscriber using source code from examples/zephyr/z_sub.c
  3. Start publisher using source code from examples/zephyr/z_pub.c
  4. Observe that both can open a zenoh session but the subscriber doesn't receive data.

System info

  • Host computer: Ubuntu 20.04
  • Target micro controllers: 2x Nucleo-F429zi
    • Connected either directly to each other or both in an ethernet switch
  • zenoh-pico commit: 8c3bc79
  • zephyr commit: abe50cc
dcorbeil added the bug label on Jan 27, 2023
@cguimaraes
Member

TL;DR: your device is running out of memory, as the value you have configured for CONFIG_HEAP_MEM_POOL_SIZE is too small.
I am further checking why the session is not being closed properly on the out-of-memory error.

The current version of the Zenoh protocol needs to be extended with additional capability negotiations during session establishment, so that both sides can adapt the communication to each other's capabilities. Several improvements in this respect will come with an improved version of the protocol (expected in Q2 2023 according to the public roadmap). This will be especially important for addressing the resource constraints of microcontrollers.

Until then, there are a couple of things you can do as workarounds for your issue.

  1. Increase the value of CONFIG_HEAP_MEM_POOL_SIZE in zephyr/prj.conf. This is the memory pool used by Zenoh for its dynamic allocations.

  2. If you cannot increase CONFIG_HEAP_MEM_POOL_SIZE to a large value, decrease the default size of the buffers, shown in the configuration excerpt below.

/**
 * Default maximum batch size that can be received.
 */
#ifndef Z_BATCH_SIZE_RX
#define Z_BATCH_SIZE_RX \
    65535  // Warning: changing this value can break the communication
           //          with zenohd in the current protocol version.
           //          In the future, it will be possible to negotiate such value.
           // Change it at your own risk.
#endif

/**
 * Default maximum batch size that can be sent.
 */
#ifndef Z_BATCH_SIZE_TX
#define Z_BATCH_SIZE_TX 65535
#endif

/**
 * Default maximum size for fragmented messages.
 */
#ifndef Z_FRAG_MAX_SIZE
#define Z_FRAG_MAX_SIZE 300000
#endif
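
For illustration, assuming zenoh-pico is built from source inside the application's Zephyr CMake project, the two workarounds could be applied along these lines (the target name app and the concrete values are only examples, not tested on the F429zi):

# prj.conf: give the kernel heap more room for Zenoh's dynamic allocations
CONFIG_HEAP_MEM_POOL_SIZE=32768

# CMakeLists.txt: the macros above are guarded by #ifndef, so they can be
# overridden at compile time without patching the zenoh-pico sources
target_compile_definitions(app PRIVATE
    Z_BATCH_SIZE_RX=1024
    Z_BATCH_SIZE_TX=1024
    Z_FRAG_MAX_SIZE=4096
)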

Let us know if these workarounds were able to solve your problem.

Related to #151.

@eeas-joas

Hello, we are also running Zenoh via Zephyr and have reduced the values Z_BATCH_SIZE_RX and Z_BATCH_SIZE_TX to 1024 bytes.

@dcorbeil
Contributor Author

That doesn't seem to fix the problem. I changed CONFIG_HEAP_MEM_POOL_SIZE to 50000, Z_BATCH_SIZE_RX and Z_BATCH_SIZE_TX to 1024, and Z_FRAG_MAX_SIZE to 30000. I tried the scout example and noticed that zenoh was blocking at scout.c:59 due to an infinite timeout deeper in the zephyr code. As a temporary measure, I set the timeout to K_NO_WAIT and found that both my F429zi boards can see the router running on my host computer, but they can't find each other when they are both scouting (I suspect that's to be expected).

I then tried to keep one F429zi in scouting mode and make the other one a publisher in peer mode. My expectation was that the scout would see the publisher as a peer, but it didn't. The scout still only sees the router.

@cguimaraes
Member

Hello, we are also running Zenoh via Zephyr and have reduced the values Z_BATCH_SIZE_RX and Z_BATCH_SIZE_TX to 1024 bytes.

@eeas-joas did it work for you?


@dcorbeil see replies inline below:

That doesn't seem to fix the problem. I changed CONFIG_HEAP_MEM_POOL_SIZE to 50000 and Z_BATCH_SIZE_RX and Z_BATCH_SIZE_TX to 1024 and Z_FRAG_MAX_SIZE to 30000.

From a first look (without testing), it seems that the values you configured should not cause any issues.

I tried the scout example and noticed that zenoh was blocking at scout.c:59 due to a forever timeout deeper in the zephyr code.

I do not remember having this issue. I need to investigate it a bit with a similar board that we have.
I am wondering if it might be related to some additional zephyr/prj.conf configuration that you might be missing.

As a temporary measure, I set the timeout to K_NO_WAIT and found that both my F429zi can see the router running on my host computer but they can't find each other when they both are scouting (I suspect that's to be expected).

Zenoh-Pico, as of today, fully supports client mode, but peer mode is limited to Zenoh multicast transports.
This means you cannot have the same peer mode as in the Rust implementation of Zenoh, which operates over unicast transports.
We are looking into how to support it efficiently in Zenoh-Pico as well, but there are other challenges we need to take into consideration while devising the solution.

I then tried to keep one F429zi in scouting mode and make the other one a publisher in peer mode. My expectation what that the scout would see the publisher as a peer but it didn't. The scout still only sees the router.

This is as expected. Zenoh-Pico aims to work mainly in client mode, with some support for peer mode over multicast transports. As I said before, we are trying to evolve on this, but we need to carefully design the solution to keep it friendly to resource-constrained devices.
That being said, the scouting mechanism will only find routers or Zenoh Rust peers. It also allows you to define whether you want to scout for routers, peers, or both.
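
For reference, a minimal sketch of what the peer-mode selection in the Zephyr examples boils down to (the exact macro names and the multicast locator are assumptions here, so check the example sources at your commit; note that a unicast locator will not let two picos talk to each other in peer mode):

#define CLIENT_OR_PEER 1  // 0: client mode, 1: peer mode
#if CLIENT_OR_PEER == 0
#define MODE "client"
#define LOCATOR ""  // empty: rely on scouting to find a router
#else
#define MODE "peer"
// Hypothetical multicast UDP locator; peer mode in Zenoh-Pico only works
// over multicast transports.
#define LOCATOR "udp/224.0.0.225:7447"
#endif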

@eeas-joas

@cguimaraes
Sorry, I was not precise enough.

Hello, we are also running Zenoh via Zephyr and have reduced the values Z_BATCH_SIZE_RX and Z_BATCH_SIZE_TX to 1024 bytes and it works with this config.

@dcorbeil
Contributor Author

@cguimaraes

I do not remember having this issue. I need to investigate it a bit with a similar board that we have.
I am wondering if it might be related to some additional zephyr/prj.conf configuration that you might be missing.

Have you had time to give this a try? I tried to keep the prj.conf as close to the example as possible, so I don't think that would be the issue, but who knows.

Zenoh-Pico, as of today, fully supports client mode, but peer mode is limited to Zenoh multicast transports.
This means you cannot have the same peer mode as in the Rust implementation of Zenoh, which operates over unicast transports.

Thank you for those clarifications.

Ok I got something that works, kinda. In my current subnet setup I have:

  • zenohd running on Ubuntu 20.04 from the docker container:
    • docker run --init -p 7447:7447/tcp -p 8000:8000/tcp eclipse/zenoh
  • zenoh-pico's z_sub built and run on the same Ubuntu 20.04 host
  • One F429zi
    • Running as a publisher using the same code as examples/zephyr/z_pub.c
    • Using the same prj.conf as specified in the first post

Note: zenoh-pico used for the publisher on the F429 and the subscriber on the host is the exact same code.

I ended up turning on debug prints (ZENOH_DEBUG=3) and I saw something interesting. Here are the first few logs until the publisher starts publishing periodically:

uart:~$ Opening Zenoh Session...Sending Z_INIT(Syn)
Encoding _Z_MID_INIT
Decoding _Z_MID_INIT
Received Z_INIT(Ack)
Sending Z_OPEN(Syn)
Encoding _Z_MID_OPEN
Decoding _Z_MID_OPEN
Received Z_OPEN(Ack)
Checking Zenoh Session...OK
Declaring publisher for 'demo/example/zenoh-pico-pub'...>>> Allocating res decl for (1,0:demo/example/zenoh-pico-pub)
>> send zenoh message
>> send zenoh message
Encoding _Z_MID_FRAME
Encoding _Z_MID_DECLARE
Encoding _Z_DECL_RESOURCE
Encoding _RESKEY
>> send zenoh message
>> send zenoh message
Encoding _Z_MID_FRAME
Encoding _Z_MID_DECLARE
Encoding _Z_DECL_PUBLISHER
Encoding _RESKEY
OK
Decoding _Z_MID_FRAME
Decoding _Z_MID_DECLARE
Decoding _Z_DECL_RESOURCE
Decoding _RESKEY
Decoding _Z_MID_DECLARE
Decoding _Z_DECL_SUBSCRIBER
Decoding _RESKEY
Decoding _Z_MID_DECLARE
Decoding _Z_DECL_RESOURCE
Decoding _RESKEY
Decoding _Z_MID_DECLARE
Decoding _Z_DECL_SUBSCRIBER
Decoding _RESKEY
Decoding _Z_MID_DECLARE
Decoding _Z_DECL_RESOURCE
Decoding _RESKEY
Decoding _Z_MID_DECLARE
Decoding _Z_DECL_QUERYABLE
Decoding _RESKEY
Received Z_FRAME message
Received _Z_DECLARE message
Received declare-resource message
>>> Allocating res decl for (1,0:@/router/E0B5A42681C5454A89C13CBD77ADD503/config)
Received _Z_DECLARE message
Received declare-subscriber message
>>> Allocating sub decl for (0:@/router/E0B5A42681C5454A89C13CBD77ADD503/config/**)
Received _Z_DECLARE message
Received declare-resource message
>>> Allocating res decl for (2,0:demo/example)
Received _Z_DECLARE message
Received declare-subscriber message
>>> Allocating sub decl for (0:demo/example/**)
Received _Z_DECLARE message
Received declare-resource message
>>> Allocating res decl for (3,0:@/router/E0B5A42681C5454A89C13CBD77ADD503)
Received _Z_DECLARE message
Received declare-queryable message
Decoding _Z_MID_FRAME
Decoding _Z_MID_DECLARE
Decoding _Z_DECL_RESOURCE
Decoding _RESKEY
Received Z_FRAME message
Received _Z_DECLARE message
Received declare-resource message
>>> Allocating res decl for (4,0:demo/example/zenoh-pico-pub)
>> send session message
Encoding _Z_MID_KEEP_ALIVE
>> send session message
Encoding _Z_MID_KEEP_ALIVE
>> send session message
Encoding _Z_MID_KEEP_ALIVE
Closing session because it has expired after 10000ms
>> send session message
Encoding _Z_MID_CLOSE
Putting Data ('demo/example/zenoh-pico-pub': '[   0] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
>> send zenoh message
>> send zenoh message
Encoding _Z_MID_FRAME
Encoding _Z_MID_DATA
Encoding _RESKEY
Encoding _Z_DATA_INFO
Encoding _PAYLOAD
.... then proceeds to send periodic messages

As you can see, the publisher ends up terminating the session before it even starts sending data. Why is that happening?

Also, just for fun, I commented out the else statement in src/transport/unicast/link/task/lease.c:_zp_unicast_lease_task() that handles closing the session, and it started working: the subscriber on my machine started receiving data. That's why I said earlier that it kinda works. I'm pretty sure this is not the right way to fix the problem.

Thank you for your help.

@dcorbeil
Contributor Author

I found the problem. It turns out that z_sleep_ms was failing because of this zephyr commit. In an effort to be more compliant with the POSIX spec, it broke the ability to sleep for more than 1 second via usleep, which z_sleep_ms relies on.

I still need to do further testing, but it seems like my previous hack of fiddling with the socket timeout in zephyr is not necessary. I suspect that since z_sleep_ms wasn't actually sleeping, it caused a bunch of out-of-sync problems in the zenoh session handling.
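
For reference, here is a minimal sketch of the shape of the fix, i.e. a z_sleep_ms built on nanosleep instead of usleep so it is not limited to sub-second sleeps (this is not the actual zenoh-pico code, just an illustration):

#include <time.h>

// Sleep for `ms` milliseconds. nanosleep() takes a full timespec, so unlike
// POSIX usleep() it is not restricted to delays shorter than one second.
int z_sleep_ms(unsigned int ms)
{
    struct timespec ts;
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (long)(ms % 1000) * 1000000L;
    return nanosleep(&ts, NULL);
}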

@cguimaraes
Member

zenohd running on Ubuntu 20.04 from the docker container:
- docker run --init -p 7447:7447/tcp -p 8000:8000/tcp eclipse/zenoh

In the past, we had issues with docker when handling multicast traffic.
In any case, you also need to expose port 7446 for the Zenoh multicast scouting mechanism.
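For example, something along these lines (7446 is the default scouting port and uses UDP; whether multicast actually traverses the Docker bridge network is a separate question):

docker run --init -p 7447:7447/tcp -p 7446:7446/udp -p 8000:8000/tcp eclipse/zenoh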

I suspect that since z_sleep_ms wasn't actually sleeping it caused a bunch of out-of-sync problems in the zenoh session handing.

Indeed, it seems to be the issue. Thanks @dcorbeil for also trying to understand the root cause.
Also, it makes sense why my tests were passing and yours were not: I was testing with a Zephyr version prior to that commit.

This can be easily fixed in Zenoh-Pico. It also seems clear that we need to:

  1. Fix the sleep in the Zephyr port.
  2. Include timestamps in the logging messages (it would have been easier to identify that the sleep was not taking effect).
  3. Increase the test coverage of the system ports.

You can expect these in the coming days.

@cguimaraes
Member

@dcorbeil can you help us validate this fix: #158
It tackles point 1: fix the sleep in the Zephyr port.

@dcorbeil
Contributor Author

dcorbeil commented Feb 1, 2023

@cguimaraes

@dcorbeil can you help us validate this fix: #158
It tackles point 1: fix the sleep in the Zephyr port.

I will gladly help you. As a matter of fact, I had a branch ready for review with that fix, but you beat me to it.

@cguimaraes
Member

As I wrote in the corresponding PR, we would be happy to receive your contribution in any case.

@cguimaraes
Member

#159 tackles point 2: include timestamps in the logging messages.

cguimaraes added a commit that referenced this issue Feb 8, 2023

* z_sleep_ms use zephyr's nanosleep because the previously used usleep doesn't support sleeps for more than 1 second

* Improve sleep resilience considering only their granularity

* Add native zephyr sleep api instead of its posix compatible

* Minimize oversleeping

* Missing the fix on esp32

* Where the start gone?

---------

Co-authored-by: David Corbeil <[email protected]>