Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestPingMonitor crashes #921

Open
ansiwen opened this issue Sep 25, 2023 · 6 comments
Open

TestPingMonitor crashes #921

ansiwen opened this issue Sep 25, 2023 · 6 comments

Comments

@ansiwen
Copy link
Collaborator

ansiwen commented Sep 25, 2023

TestPingMonitor has a race condition that often causes a segmentation fault.

@ansiwen
Copy link
Collaborator Author

ansiwen commented Sep 25, 2023

First I thought this is again because of our connection finalizer, but disabling the finalizer does not make that issue disappear.

ansiwen added a commit that referenced this issue Sep 25, 2023
Related to: #921

Signed-off-by: Sven Anderson <[email protected]>
mergify bot pushed a commit that referenced this issue Sep 25, 2023
Related to: #921

Signed-off-by: Sven Anderson <[email protected]>
@ansiwen
Copy link
Collaborator Author

ansiwen commented Sep 25, 2023

(gdb) bt
#0  runtime.raise () at /opt/go/src/runtime/sys_linux_amd64.s:154
#1  0x00000000004585db in runtime.raisebadsignal (sig=11, c=0x7f5c0e5e5f90)
    at /opt/go/src/runtime/signal_unix.go:967
#2  0x0000000000458a27 in runtime.badsignal (sig=11, c=0x7f5c0e5e5f90)
    at /opt/go/src/runtime/signal_unix.go:1076
#3  0x0000000000457368 in runtime.sigtrampgo (sig=11, info=0x7f5c0e5e6130, 
    ctx=0x7f5c0e5e6000) at /opt/go/src/runtime/signal_unix.go:468
#4  0x0000000000478a66 in runtime.sigtramp ()
    at /opt/go/src/runtime/sys_linux_amd64.s:352
#5  <signal handler called>
#6  MonConnection::get_auth_request (this=0x0, method=0x7f5c04000ed0, 
    preferred_modes=0x7f5c0e5e6680, bl=0x7f5c0e5e66a0, entity_name=..., 
    want_keys=0, keyring=0x7ffe25ab3a30)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/mon/MonClient.cc:1702
#7  0x00007f5c721375eb in non-virtual thunk to MonClientPinger::get_auth_request(Connection*, AuthConnectionMeta*, unsigned int*, std::vector<unsigned int, std::allocator<unsigned int> >*, ceph::buffer::v15_2_0::list*) ()
   from /usr/lib64/ceph/libceph-common.so.2
#8  0x00007f5c720aa672 in ProtocolV2::send_auth_request (this=0x1cd2670, 
    allowed_methods=...)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:1692
#9  0x00007f5c720aaddc in ProtocolV2::send_auth_request (this=0x1cd2670)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.h:217
#10 ProtocolV2::post_client_banner_exchange (this=0x1cd2670)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:1680
#11 0x00007f5c720a462c in ProtocolV2::run_continuation (this=0x1cd2670, 
    continuation=...)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:47
#12 0x00007f5c7206cae9 in std::function<void (char*, long)>::operator()(char*, long) const (__args#1=<optimized out>, __args#0=<optimized out>, 
    this=0x7f5bfc031590) at /usr/include/c++/8/bits/std_function.h:682
#13 AsyncConnection::process (this=0x7f5bfc0311f0)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/AsyncConnection.cc:454
#14 0x00007f5c720c6c87 in EventCenter::process_events (
    this=this@entry=0x125e370, timeout_microseconds=<optimized out>, 
    timeout_microseconds@entry=30000000, 
    working_dur=working_dur@entry=0x7f5c0e5e6ee8)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/build/boost/include/boost/container/new_allocator.hpp:165
#15 0x00007f5c720cd19c in NetworkStack::<lambda()>::operator() (
    __closure=0x125b158, __closure=0x125b158)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/Stack.cc:53
#16 std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /usr/include/c++/8/bits/std_function.h:297
#17 0x00007f5c703fab23 in execute_native_thread_routine ()
   from /lib64/libstdc++.so.6
#18 0x00007f5c7afdc1ca in start_thread () from /lib64/libpthread.so.0
#19 0x00007f5c7aa30e73 in clone () from /lib64/libc.so.6

@ansiwen
Copy link
Collaborator Author

ansiwen commented Sep 26, 2023

from CI runner, seems identical:

Core was generated by `./rados.test'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  runtime.raise () at /opt/go/src/runtime/sys_linux_amd64.s:154
154		RET
[Current thread is 1 (Thread 0x7ff5aeffe700 (LWP 6184))]
To enable execution of this file add
	add-auto-load-safe-path /opt/go/src/runtime/runtime-gdb.py
line to your configuration file "/root/.gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/root/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
#0  runtime.raise () at /opt/go/src/runtime/sys_linux_amd64.s:154
#1  0x00000000004585db in runtime.raisebadsignal (sig=11, c=0x7ff5aeffa850)
    at /opt/go/src/runtime/signal_unix.go:967
#2  0x0000000000458a27 in runtime.badsignal (sig=11, c=0x7ff5aeffa850)
    at /opt/go/src/runtime/signal_unix.go:1076
#3  0x0000000000457368 in runtime.sigtrampgo (sig=11, info=0x7ff5aeffa9f0, 
    ctx=0x7ff5aeffa8c0) at /opt/go/src/runtime/signal_unix.go:468
#4  0x0000000000478a66 in runtime.sigtramp ()
    at /opt/go/src/runtime/sys_linux_amd64.s:352
#5  <signal handler called>
#6  MonConnection::get_auth_request (this=0x0, method=0x17257a0, 
    preferred_modes=0x7ff5aeffb680, bl=0x7ff5aeffb6a0, entity_name=..., 
    want_keys=0, keyring=0x7ffd722b3130)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/mon/MonClient.cc:1702
#7  0x00007ff6147305eb in non-virtual thunk to MonClientPinger::get_auth_request(Connection*, AuthConnectionMeta*, unsigned int*, std::vector<unsigned int, std::allocator<unsigned int> >*, ceph::buffer::v15_2_0::list*) ()
   from /usr/lib64/ceph/libceph-common.so.2
#8  0x00007ff6146a3672 in ProtocolV2::send_auth_request (this=0x2264020, 
    allowed_methods=...)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:1692
#9  0x00007ff6146a3ddc in ProtocolV2::send_auth_request (this=0x2264020)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.h:217
#10 ProtocolV2::post_client_banner_exchange (this=0x2264020)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:1680
#11 0x00007ff61469d62c in ProtocolV2::run_continuation (this=0x2264020, 
    continuation=...)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:47
#12 0x00007ff614665ae9 in std::function<void (char*, long)>::operator()(char*, long) const (__args#1=<optimized out>, __args#0=<optimized out>, 
    this=0x2261fd0) at /usr/include/c++/8/bits/std_function.h:682
#13 AsyncConnection::process (this=0x2261c30)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/AsyncConnection.cc:454
#14 0x00007ff6146bfc87 in EventCenter::process_events (
    this=this@entry=0x17f15e0, timeout_microseconds=<optimized out>, 
    timeout_microseconds@entry=30000000, 
    working_dur=working_dur@entry=0x7ff5aeffbee8)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/build/boost/include/boost/container/new_allocator.hpp:165
#15 0x00007ff6146c619c in NetworkStack::<lambda()>::operator() (
    __closure=0x171f938, __closure=0x171f938)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/Stack.cc:53
#16 std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /usr/include/c++/8/bits/std_function.h:297
#17 0x00007ff6129f3b23 in execute_native_thread_routine ()
   from /lib64/libstdc++.so.6
#18 0x00007ff61d5d51ca in start_thread () from /lib64/libpthread.so.0
#19 0x00007ff61d029e73 in clone () from /lib64/libc.so.6

@ansiwen
Copy link
Collaborator Author

ansiwen commented Sep 27, 2023

I could reproduce it with a pure C loop like this:

inline int test_ping(rados_t c) {
	char* outstr;
	size_t outlen;
	int ret;
	for (int i=0; i<10000; ++i) {
		printf("X%d ", i);
		ret = rados_ping_monitor(c, "a", &outstr, &outlen);
		rados_buffer_free(outstr);
	}
	return ret;
}

which I guess is enough proof, that this is an upstream issue.

The MonConnection pointer (mc) in the MonClientPinger object, that comes in as an event, is 0x0. It's unclear if it is not yet set, or has been reset to 0x0:

(gdb) fr 11
#11 0x00007f99544cd62c in ProtocolV2::run_continuation (this=0x1e89930, 
    continuation=...)
    at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:47
47	    CONTINUATION_RUN(continuation)
(gdb) print ((MonClientPinger*)(messenger->auth_client)).mc
$1 = std::unique_ptr<MonConnection> = {get() = 0x0}

@phlogistonjohn
Copy link
Collaborator

Sounds like a good enough find to file a tracker at ceph to me! Please link it here once you've filed it.

@ansiwen
Copy link
Collaborator Author

ansiwen commented Sep 27, 2023

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants