Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

issue: 4260620 infinite loop latency spec #1104

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tomerdbz
Copy link
Collaborator

Description

When there are no offloaded sockets,
we mistakenly poll for offloaded sockets in case VMA_SELECT_POLL_OS_FORCE=1.

This behavior makes us getting stuck in an infinite loop.

This patch fixes the issue by polling OS sockets
in the case there are no offloaded sockets
regardless to VMA_SELECT_POLL_OS_FORCE value.

What

Fix latency spec bug of getting stuck in an infinite loop when there are no offloaded sockets.

Why ?

Customer bug.
Solves https://redmine.mellanox.com/issues/4260620

Change type

What kind of change does this PR introduce?

  • Bugfix
  • Feature
  • Code style update
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • CI related changes
  • Documentation content changes
  • Tests
  • Other

Check list

  • Code follows the style de facto guidelines of this project
  • Comments have been inserted in hard to understand places
  • Documentation has been updated (if necessary)
  • Test has been added (if possible)

@@ -469,8 +469,7 @@ int io_mux_call::call()

__log_funcall("");

if (!m_b_sysvar_select_poll_os_force // TODO: evaluate/consider this logic
&& (*m_p_num_all_offloaded_fds == 0))
if (*m_p_num_all_offloaded_fds == 0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if no offlaoded sockets are mapped then we poll the OS fd anyway, in this case
this PARAM documentation should also be changed, so if this PARAM is enabled
it will modify POLL_OS_RATIO and SKIP_OS to 1 to POLL the OS everytime we poll Offloaded soockets.
also seeing same issue in XLIO, need to fix there too, what you think?
"VMA_SELECT_POLL_OS_FORCE
This flag forces to poll the OS file descriptors while user thread calls
select(), poll() or epoll_wait() even when no offloaded sockets are mapped.
Enabling this flag causes VMA to set VMA_SELECT_POLL_OS_RATIO and
VMA_SELECT_SKIP_OS to 1. This will result in VMA_SELECT_POLL number of
times VMA will poll the OS file descriptors, along side with offloaded
sockets, if such sockets exists.
Note that setting VMA_SELECT_SKIP_OS and VMA_SELECT_POLL_OS_RATIO
directly will override the values these parameters gets while
VMA_SELECT_POLL_OS_FORCE is enabled."

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is single place in the code for VMA/XLIO that uses m_b_sysvar_select_poll_os_force.
In case it is removed here so what is a meaning of VMA_SELECT_POLL_OS_FORCE.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you both - I'll remove it and open a Task to check the same bug in XLIO (probably the same bug reoccurs there as well if libxlio indeed inherited this logic)

Copy link
Collaborator Author

@tomerdbz tomerdbz Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on a second thought - this flag has the meaning of setting
"VMA_SELECT_POLL_OS_RATIO and
VMA_SELECT_SKIP_OS to 1."

I think the correct way is to keep it and just update the documentation. updated the commit. wdyt @igor-ivanov @BasharRadya ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BasharRadya - FYI
[XLIO] Task #4262473: [VMA inheritance] infinite loop when no offloaded sockets | Assignee: Gal N. | Status: New

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,
igor please notice that poll_os_force meaning is to set
VMA_SELECT_POLL_OS_RATIO and VMA_SELECT_SKIP_OS to 1 as i mentioned in my first review (so it will override their default values).
you can claim that we can simply set this adjusted params to 1 instead of using different ENV PARAM to set them, but removing this OLD param is a bit problematic in POV of compatibility (if our many clients already use this param and expect it to exists in newer xlio versions).

Copy link
Collaborator

@igor-ivanov igor-ivanov Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, igor please notice that poll_os_force meaning is to set VMA_SELECT_POLL_OS_RATIO and VMA_SELECT_SKIP_OS to 1 as i mentioned in my first review (so it will override their default values). you can claim that we can simply set this adjusted params to 1 instead of using different ENV PARAM to set them, but removing this OLD param is a bit problematic in POV of compatibility (if our many clients already use this param and expect it to exists in newer xlio versions).

Actually I think the same if VMA_SELECT_POLL_OS_FORCE behavior can be controlled one to one by VMA_SELECT_POLL_OS_RATIO and VMA_SELECT_SKIP_OS , so it is better to set VMA_SELECT_POLL_OS_RATIO and VMA_SELECT_SKIP_OS to the right values internally during initialization if user explicitly passes VMA_SELECT_POLL_OS_FORCE and use identical logic (probably handle_os_countdown())
In addition we might consider following

  • remove VMA_SELECT_POLL_OS_FORCE from USER MANUAL and README keeping support just for compatibility only
  • describe behavior VMA_SELECT_POLL_OS_FORCE basing on VMA_SELECT_POLL_OS_RATIO and VMA_SELECT_SKIP_OS in README.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - wdyt @igor-ivanov ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

" but removing this OLD param is a bit problematic in POV of compatibility (if our many clients already use this param and expect it to exists in newer xlio versions)."?
WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please notice to update related code if select_poll_os_force is removed
"if ((env_ptr = getenv(SYS_VAR_SELECT_POLL_OS_FORCE)) != NULL)
select_poll_os_force = (uint32_t)atoi(env_ptr);

if (select_poll_os_force) {
	select_poll_os_ratio = 1;
	select_skip_os_fd_check = 1;
}

"

@tomerdbz tomerdbz force-pushed the 4260620_vma_latency_spec_infinite_loop branch 2 times, most recently from dd7d5d3 to 6d89095 Compare January 21, 2025 08:02
When there are no offloaded sockets, we mistakingly poll for
offloaded sockets in case VMA_SELECT_POLL_OS_FORCE=1.

This behavior makes us getting stuck in an infinite loop.

This patch fixes the issue by polling OS sockets
in the case there are no offloaded sockets
regardless to VMA_SELECT_POLL_OS_FORCE value.

Signed-off-by: Tomer Cabouly <[email protected]>
@tomerdbz tomerdbz force-pushed the 4260620_vma_latency_spec_infinite_loop branch from 6d89095 to e221d58 Compare January 21, 2025 12:52
@igor-ivanov
Copy link
Collaborator

@tomerdbz xlio has the same fix Mellanox/libxlio@0377f77

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants