Use FIOFFS on illumos instead of fsync (again) #1199
🚀
- Did you confirm with dtrace that this is doing what we expect? I don't see how it couldn't but it may be safe to be paranoid.
- What about any performance impact? I remember you saying in chat that there wasn't one anymore and I'm curious if you reran any of your tests.
Testing using
I see it firing 2x per second with the right value for
To make this easier for future readers, I added a comment in bbb6cfe tracing the execution path. After chatting in person last week, @jclulow is also going to socialize the idea of making this IOCTL committed, so we could rely on it in the future.
The performance question is interesting. Running on Propolis + Crucible
Running on this branch (
However, with
This is puzzling; I don't expect the |
This has the Also, for each grouping I see two rows that start with To make it a little easier for me to see the differences, I changed the groupings so you see all
Clear win here, but like you said,
All the same here really.
The FIOFFS could be faster, or just in the noise, but NOCORN is indeed a big bump here.
FIOFFS faster, but NOCORN a step faster from it.
A little slower here, but that could be in the noise as well.
Faster, but probably in the noise again.
FIOFFS slower, but maybe that ~4% is just in the noise? NOCORN slower still. For |
There are two independent 4M runs that happen at different times during the benchmarking (first and last). I was noticing that the initial 4K run was misleadingly fast because it was writing to an empty disk, so I added an initial 4M random write step to make the disk not completely empty.
Correct, it's just a convenient way to separate out the FIOFFS effects versus the rest of the refactoring. I've been digging into performance mysteries, and the plot continues to thicken. The new, even-weirder development is that the slowness seems to depend on
This is true despite the fact that all of the changes in
(The This is very weird. I see significant differences in method sizes and implementations in the disassembly. For example, on
00000000012a8b00 <crucible::Buffer::into_bytes>:
12a8b00: 55 push %rbp
12a8b01: 48 89 e5 mov %rsp,%rbp
12a8b04: 41 57 push %r15
12a8b06: 41 56 push %r14
12a8b08: 41 54 push %r12
12a8b0a: 53 push %rbx
12a8b0b: 48 81 ec 80 00 00 00 sub $0x80,%rsp
12a8b12: 49 89 f6 mov %rsi,%r14
12a8b15: 48 89 fb mov %rdi,%rbx
12a8b18: 48 8b 36 mov (%rsi),%rsi
12a8b1b: 49 8b 56 08 mov 0x8(%r14),%rdx
12a8b1f: 4d 8b 7e 18 mov 0x18(%r14),%r15
12a8b23: 41 f6 c7 01 test $0x1,%r15b
12a8b27: 75 18 jne 12a8b41 <crucible::Buffer::into_bytes+0x41>
12a8b29: 48 89 73 08 mov %rsi,0x8(%rbx)
12a8b2d: 48 89 53 10 mov %rdx,0x10(%rbx)
12a8b31: 4c 89 7b 18 mov %r15,0x18(%rbx)
12a8b35: 48 8b 05 b4 d5 f1 00 mov 0xf1d5b4(%rip),%rax # 21c60f0 <_GLOBAL_OFFSET_TABLE_+0x6f0>
12a8b3c: 48 89 03 mov %rax,(%rbx)
12a8b3f: eb 55 jmp 12a8b96 <crucible::Buffer::into_bytes+0x96>
12a8b41: 49 8b 4e 10 mov 0x10(%r14),%rcx
12a8b45: 49 c1 ef 05 shr $0x5,%r15
12a8b49: 4c 8d a5 60 ff ff ff lea -0xa0(%rbp),%r12
12a8b50: 4c 89 e7 mov %r12,%rdi
12a8b53: 4d 89 f8 mov %r15,%r8
12a8b56: e8 65 24 94 00 call 1beafc0 <bytes::bytes_mut::rebuild_vec>
12a8b5b: 48 8d 7d b0 lea -0x50(%rbp),%rdi
12a8b5f: 4c 89 e6 mov %r12,%rsi
12a8b62: e8 19 01 94 00 call 1be8c80 <<bytes::bytes::Bytes as core::convert::From<alloc::vec::Vec<u8>>>::from>
12a8b67: 4c 89 7d d8 mov %r15,-0x28(%rbp)
12a8b6b: 48 8b 45 c0 mov -0x40(%rbp),%rax
12a8b6f: 48 89 c1 mov %rax,%rcx
12a8b72: 4c 29 f9 sub %r15,%rcx
12a8b75: 72 3e jb 12a8bb5 <crucible::Buffer::into_bytes+0xb5>
12a8b77: 48 89 4d c0 mov %rcx,-0x40(%rbp)
12a8b7b: 4c 01 7d b8 add %r15,-0x48(%rbp)
12a8b7f: 48 8b 45 c0 mov -0x40(%rbp),%rax
12a8b83: 48 89 43 10 mov %rax,0x10(%rbx)
12a8b87: 48 8b 45 c8 mov -0x38(%rbp),%rax
12a8b8b: 48 89 43 18 mov %rax,0x18(%rbx)
12a8b8f: 0f 10 45 b0 movups -0x50(%rbp),%xmm0
12a8b93: 0f 11 03 movups %xmm0,(%rbx)
12a8b96: 49 83 c6 20 add $0x20,%r14
12a8b9a: 4c 89 f7 mov %r14,%rdi
12a8b9d: e8 1e 1b 94 00 call 1bea6c0 <<bytes::bytes_mut::BytesMut as core::ops::drop::Drop>::drop>
12a8ba2: 48 89 d8 mov %rbx,%rax
12a8ba5: 48 81 c4 80 00 00 00 add $0x80,%rsp
12a8bac: 5b pop %rbx
12a8bad: 41 5c pop %r12
12a8baf: 41 5e pop %r14
12a8bb1: 41 5f pop %r15
12a8bb3: 5d pop %rbp
12a8bb4: c3 ret
12a8bb5: 48 89 45 d0 mov %rax,-0x30(%rbp)
12a8bb9: 48 8d 45 d8 lea -0x28(%rbp),%rax
12a8bbd: 48 89 45 90 mov %rax,-0x70(%rbp)
12a8bc1: 48 8d 05 f8 eb fe ff lea -0x11408(%rip),%rax # 12977c0 <core::fmt::num::<impl core::fmt::Debug for usize>::fmt>
12a8bc8: 48 89 45 98 mov %rax,-0x68(%rbp)
12a8bcc: 48 8d 4d d0 lea -0x30(%rbp),%rcx
12a8bd0: 48 89 4d a0 mov %rcx,-0x60(%rbp)
12a8bd4: 48 89 45 a8 mov %rax,-0x58(%rbp)
12a8bd8: 48 8d 35 29 2a f6 00 lea 0xf62a29(%rip),%rsi # 220b608 <anon.65e99b03176a3e062032dedb025b3946.56.llvm.11851335155975330580+0x138>
12a8bdf: 48 8d 9d 60 ff ff ff lea -0xa0(%rbp),%rbx
12a8be6: 48 8d 4d 90 lea -0x70(%rbp),%rcx
12a8bea: ba 02 00 00 00 mov $0x2,%edx
12a8bef: 41 b8 02 00 00 00 mov $0x2,%r8d
12a8bf5: 48 89 df mov %rbx,%rdi
12a8bf8: e8 13 ec fe ff call 1297810 <core::fmt::Arguments::new_v1>
12a8bfd: 48 8d 35 24 2a f6 00 lea 0xf62a24(%rip),%rsi # 220b628 <anon.65e99b03176a3e062032dedb025b3946.56.llvm.11851335155975330580+0x158>
12a8c04: 48 89 df mov %rbx,%rdi
12a8c07: e8 84 0b c6 00 call 1f09790 <core::panicking::panic_fmt>
12a8c0c: 0f 0b ud2
12a8c0e: 90 nop
12a8c0f: 90 nop
In
0000000001262c40 <crucible::Buffer::into_bytes>:
1262c40: 55 push %rbp
1262c41: 48 89 e5 mov %rsp,%rbp
1262c44: 41 56 push %r14
1262c46: 53 push %rbx
1262c47: 49 89 f6 mov %rsi,%r14
1262c4a: 48 89 fb mov %rdi,%rbx
1262c4d: e8 2e 34 98 00 call 1be6080 <<bytes::bytes::Bytes as core::convert::From<alloc::vec::Vec<u8>>>::from>
1262c52: 49 8b 76 20 mov 0x20(%r14),%rsi
1262c56: 48 85 f6 test %rsi,%rsi
1262c59: 74 0e je 1262c69 <crucible::Buffer::into_bytes+0x29>
1262c5b: 49 8b 7e 18 mov 0x18(%r14),%rdi
1262c5f: ba 01 00 00 00 mov $0x1,%edx
1262c64: e8 67 7a c4 ff call eaa6d0 <__rust_dealloc>
1262c69: 48 89 d8 mov %rbx,%rax
1262c6c: 5b pop %rbx
1262c6d: 41 5e pop %r14
1262c6f: 5d pop %rbp
1262c70: c3 ret
1262c71: 90 nop
1262c72: 90 nop
1262c73: 90 nop
1262c74: 90 nop
1262c75: 90 nop
1262c76: 90 nop
1262c77: 90 nop
1262c78: 90 nop
1262c79: 90 nop
1262c7a: 90 nop
1262c7b: 90 nop
1262c7c: 90 nop
1262c7d: 90 nop
1262c7e: 90 nop
1262c7f: 90 nop
I'm extremely puzzled, and will keep digging into this; hopefully, pulling on this thread will clear up some of the non-reproducible performance changes that I've seen in the past.
Looking at this assembly, the "fast" When opening this PR, I rebased the Sure enough, building This indicates that something has changed in Crucible's The changelog is as follows:
I suppose the next step is bisecting... |
See #1208 for the performance explanation; I was accidentally running an older upstairs without a performance regression. With that fixed, the FIOFFS changes are no longer an obvious performance win. Here's
Cherry-picking this PR as well:
It's hard to tell if this is an actual regression or just normal run-to-run variation; the fact that reads also slowed down makes me suspicious of the latter. |
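One rough way to frame the "regression or run-to-run variation" question is to compare the difference in mean throughput against the spread of repeated runs. The sketch below is only an illustration of that idea: the function names and the threshold multiplier `k` are hypothetical, it is not the benchmarking harness used in this thread, and it is a crude heuristic rather than a proper significance test.

```rust
/// Arithmetic mean of a set of benchmark results (e.g. MiB/s per run).
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Sample standard deviation; requires at least two samples.
fn sample_std_dev(xs: &[f64]) -> f64 {
    let m = mean(xs);
    let sum_sq: f64 = xs.iter().map(|x| (x - m).powi(2)).sum();
    (sum_sq / (xs.len() as f64 - 1.0)).sqrt()
}

/// Percent change of the candidate mean relative to the baseline mean.
/// Negative values mean the candidate is slower.
fn percent_change(baseline: &[f64], candidate: &[f64]) -> f64 {
    (mean(candidate) - mean(baseline)) / mean(baseline) * 100.0
}

/// Hypothetical heuristic: treat the change as "outside the noise" only if
/// the gap between means exceeds `k` times the larger standard deviation.
fn exceeds_noise(baseline: &[f64], candidate: &[f64], k: f64) -> bool {
    let diff = (mean(candidate) - mean(baseline)).abs();
    let noise = sample_std_dev(baseline).max(sample_std_dev(candidate));
    diff > k * noise
}
```

With only one run per configuration, as in the comparisons above, there is no spread to compare against, so a check like this needs several repetitions of each benchmark before it says anything useful.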
30cdad6 to 2d2549e
Current state: we're hoping to stabilize this interface before merging, e.g. by implementing a |
See https://code.oxide.computer/c/illumos-gate/+/382 , which adds |
This is a cleaned-up version of #1148, which has fallen well behind `main` (rebasing was awkward, so I just copy-pasted the relevant code).

It splits the per-extent flush operation into `pre_flush`, `flush_inner`, and `post_flush`, then skips `flush_inner` in favor of the `ioctl` if we're running with the `omicron-build` feature.
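
The shape of that split can be sketched roughly as follows. This is a simplified model, not Crucible's actual code: `Extent`, `SyncMode`, and `flush_all` are hypothetical names, the real change selects the path with the `omicron-build` Cargo feature rather than a runtime enum, and the ordering shown (all `pre_flush` calls, then one filesystem-wide sync, then all `post_flush` calls) is an assumption. The `ioctl` itself is replaced by a counter so the sketch runs on any platform.

```rust
/// Hypothetical stand-in for one extent; it only records which phases ran.
#[derive(Default)]
struct Extent {
    log: Vec<&'static str>,
}

impl Extent {
    /// Work that must happen before data is made durable.
    fn pre_flush(&mut self) {
        self.log.push("pre_flush");
    }
    /// The per-extent durability step (an fsync in the non-illumos path).
    fn flush_inner(&mut self) {
        self.log.push("flush_inner");
    }
    /// Bookkeeping that runs after data is durable.
    fn post_flush(&mut self) {
        self.log.push("post_flush");
    }
}

/// Runtime stand-in for the compile-time `omicron-build` feature switch.
#[derive(Clone, Copy)]
enum SyncMode {
    /// Call `flush_inner` on every extent.
    PerExtentFsync,
    /// Skip `flush_inner` and issue one filesystem-wide sync instead.
    FilesystemIoctl,
}

/// Flushes every extent and returns how many filesystem-wide syncs were
/// issued (a placeholder for the real `ioctl` call).
fn flush_all(extents: &mut [Extent], mode: SyncMode) -> usize {
    for e in extents.iter_mut() {
        e.pre_flush();
    }
    let mut fs_syncs = 0;
    match mode {
        SyncMode::PerExtentFsync => {
            for e in extents.iter_mut() {
                e.flush_inner();
            }
        }
        SyncMode::FilesystemIoctl => {
            // Assumption: a single call covers all extents on the dataset.
            fs_syncs += 1;
        }
    }
    for e in extents.iter_mut() {
        e.post_flush();
    }
    fs_syncs
}
```

In this model, calling `flush_all(&mut extents, SyncMode::FilesystemIoctl)` leaves each extent's log as `pre_flush` followed by `post_flush` and returns 1, while `SyncMode::PerExtentFsync` runs all three phases per extent and returns 0.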