Communication problem on arm64 Ubuntu #161

Open
leonfg opened this issue Nov 24, 2020 · 2 comments

Comments

leonfg commented Nov 24, 2020

Environment

OpenSplice Community 6.9.181127OSS in DDSI peer mode. Config, QoS
Node1 (192.168.13.11): Ubuntu 18.04, Python 3.6, Intel x64
Node2 (192.168.13.101): Ubuntu 18.04 (JetPack 4.4), Python 3.6, NVIDIA Xavier (ARM64)
Node3 (192.168.13.201): Ubuntu 16.04 (JetPack 3.3), Python 3.5, NVIDIA TX2 (ARM64)

Problem Description

Node1 and Node2 can communicate with each other, but Node3 cannot communicate with the others. Ping works between all nodes.
When Node1 and Node3 are running, ospl-info.log contains warnings such as "thread tev failed to make progress" and "thread dq.builtins failed to make progress". Are these the cause of the problem? How can I solve it?
Node1 log
Node3 log

@leonfg leonfg closed this as completed Nov 25, 2020
@leonfg leonfg reopened this Nov 25, 2020
@vivekpandey02

It's a bit difficult to judge the logs without knowing exactly what actions triggered it.

It looks like one connection (socket) is being used for both reading and writing. I think a write is started on the socket, after which a read fails and the reader closes the socket. The writer then tries to use the closed socket and gets an error. The simplest way to eliminate the error is not to generate it when the connection has already been closed, since that is expected behaviour. If connections are not closed cleanly, you may always get tev warnings: when TCP reads/writes block, we try to hold on to the connection as long as possible before cleaning it up. You can always reduce the configurable read/write connection timeouts.
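To make that sequence concrete, here is a minimal Python sketch of the idea. It is not OpenSplice's actual code: `SharedConnection` and its methods are hypothetical names made up for illustration. The reader closes the shared socket after a failed read, and the writer treats an error on an already-closed connection as expected instead of reporting it.

```python
import socket
import threading


class SharedConnection:
    """Hypothetical wrapper: one socket shared by a reader and a writer."""

    def __init__(self, sock):
        self._sock = sock
        self._closed = threading.Event()

    def close(self):
        # Called by the reader after a failed read. The flag is set before
        # the socket is closed, so a writer that fails because of this close
        # will always see the flag.
        self._closed.set()
        self._sock.close()

    def write(self, data):
        """Return True if sent, False if the connection was already closed."""
        try:
            self._sock.sendall(data)
            return True
        except OSError:
            if self._closed.is_set():
                # Expected: the reader closed the connection, stay quiet.
                return False
            # Anything else is a genuine error and should still surface.
            raise
```

With this shape, a writer racing against the reader's close gets `False` back rather than an error, while failures on a connection nobody closed are still raised.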

I don't think it's really a 'known issue' in DDSI; it's more like regular TCP behaviour. You can change the operating-system defaults (i.e. the settings in /proc/sys/net/ipv4/tcp_* on Linux), but these defaults are usually quite sane and messing with them risks causing all kinds of weird symptoms. You can also increase the DDSI lease time so that it outlives the TCP timeouts. But you can imagine multiple hosts timing out at roughly the same time, the lease-renewal thread getting scheduled at random moments, etc., so a good number is difficult to pick (and the higher the lease time, the less responsive the system becomes).
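Before changing any of those kernel settings, it may help to see what each node currently uses. The following is a small read-only sketch, assuming a Linux host; the function name `read_tcp_settings` is made up for this example, and it avoids f-strings so that it also runs on the Python 3.5 node.

```python
import glob
import os


def read_tcp_settings(base="/proc/sys/net/ipv4"):
    """Return {name: value} for every readable tcp_* entry under base."""
    settings = {}
    for path in sorted(glob.glob(os.path.join(base, "tcp_*"))):
        try:
            with open(path) as f:
                settings[os.path.basename(path)] = f.read().strip()
        except OSError:
            # Some entries may not be readable without extra privileges.
            continue
    return settings


if __name__ == "__main__":
    for name, value in read_tcp_settings().items():
        print("{} = {}".format(name, value))
```

Running it on each node and comparing the output shows whether the nodes differ in entries such as `tcp_retries2` or `tcp_keepalive_time`. It only reads the values and changes nothing.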

leonfg commented Nov 25, 2020

Thanks for replying. Do you mean the warnings are a result rather than the cause, and the real problem is the communication failure? Does that mean modifying the system TCP configuration or the DDSI lease time may silence the warnings while communication still fails?
Anyway, could you please be more specific about how to modify the DDSI lease time?
