-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use FIOFFS on illumos instead of fsync #1148
Conversation
illumos supports sending a FIOFFS ioctl to a ZFS dataset in order to sync _all_ outstanding IO. Use this ioctl in `region_flush` for the whole region dataset instead of calling `sync_all` for each extent file. This is hidden behind a new `omicron-build` feature, which is true for our production builds. This necessitated separating flush (which commits bits to durable storage and is a no-op if `omicron-build` is used) and `post_flush` (which performs clean up or any other kind of accounting). Splitting this up exposed a bug in the sqlite implementation of ExtentInner: it wasn't checking to see if the extent was dirty and wasn't calling `set_flush_number` before calling `sync_all` on the user data file.
For posterity, the behaviour of #include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/filio.h>
#include <unistd.h>
int main() {
int fd = open("/oxp_3d7b3f53-6487-4f0b-8a73-ba7f95a66d76/testpost", O_DIRECTORY | O_RDONLY, 0);
if (fd == -1) {
int e = errno;
printf("could not open: %s\n", strerror(e));
return 1;
}
printf("submitting %i %x %x:%x\n", _FIOFFS, _FIOFFS, 'f', 66);
if (ioctl(fd, _FIOFFS, NULL) == -1) {
int e = errno;
printf("could not ioctl: %s\n", strerror(e));
}
if (close(fd) == -1) {
int e = errno;
printf("could not close: %s\n", strerror(e));
return 1;
}
return 0;
} Running this program to submit a
|
Also noting: this is true because we currently build with |
Ok(()) | ||
} | ||
|
||
fn post_flush( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It weirds me out that Extent::flush
doesn't actually flush, depending on omicron-build
. I understand why that's the case, but would like to propose an alternative API. I think we should break flushing behavior into three functions:
pre_flush
contains everything throughself.set_flush_number(..)
in the current implementationflush_inner
implements theelse
branch, i.e. callingself.file.sync_all
(unconditionally!)post_flush
is the same as in this PR
Then, we can have a default implementation of flush
on trait ExtentInner
which calls these functions one after the other, meaning unit tests that want to flush a single extent can just call flush()
. Right now, those unit tests are no longer actually syncing the file to disk!
More importantly, the decision to skip calling flush_inner
(i.e. just calling pre_flush
+ post_flush
) is moved into the Region::region_flush
implementation, which is right next to the FIOFFS call, which makes it more obvious what's going on.
} | ||
|
||
cdt::extent__flush__done!(|| { (job_id.get(), self.extent_number, 0) }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I weakly feel like this should be at the beginning of post_flush
, so that it accounts for the actual flush time (even though that's amortized across many extents)
|
||
// We put all of our metadata updates into a single write to make this | ||
// operation atomic. | ||
self.set_flush_number(new_flush, new_gen)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm suspicious about this change.
If we lose power right after this line, the metadata DB will tell us that we're at a particular flush (new_flush
) and dirty = 0
, but the data in the extent file has not necessarily been persisted to disk!
I think the existing implementation – which updates flush number in the DB after syncing the extent file – is correct.
/* | ||
* XXX Retry? Mark extent as broken? | ||
* We must first fsync to get any outstanding data written to disk. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Same vibes here about changing to pre_flush
/ flush_inner
/ post_flush
; you'll obviously have to fix both extent formats if you change the trait)
use std::io; | ||
use std::os::fd::AsRawFd; | ||
use std::ptr; | ||
|
||
// "file system flush", defined in illumos' sys/filio.h | ||
const FIOFFS: libc::c_int = 0x20000000 | (0x66 /*'f'*/ << 8) | 0x42 /*66*/; | ||
|
||
// Open the region's mountpoint | ||
let file = File::open(&self.dir)?; | ||
|
||
let rc = unsafe { | ||
libc::ioctl( | ||
file.as_raw_fd(), | ||
FIOFFS as _, | ||
ptr::null_mut::<i32>(), | ||
) | ||
}; | ||
|
||
if rc != 0 { | ||
return Err(io::Error::last_os_error().into()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using nix::ioctl
lets this be a little cleaner:
use std::io; | |
use std::os::fd::AsRawFd; | |
use std::ptr; | |
// "file system flush", defined in illumos' sys/filio.h | |
const FIOFFS: libc::c_int = 0x20000000 | (0x66 /*'f'*/ << 8) | 0x42 /*66*/; | |
// Open the region's mountpoint | |
let file = File::open(&self.dir)?; | |
let rc = unsafe { | |
libc::ioctl( | |
file.as_raw_fd(), | |
FIOFFS as _, | |
ptr::null_mut::<i32>(), | |
) | |
}; | |
if rc != 0 { | |
return Err(io::Error::last_os_error().into()); | |
} | |
use std::os::fd::AsRawFd; | |
// Open the region's mountpoint | |
let file = File::open(&self.dir)?; | |
// "file system flush", defined in illumos' sys/filio.h | |
const FIOFFS_MAGIC: u8 = b'f'; | |
const FIOFFS_TYPE_MODE: u8 = 66; | |
nix::ioctl_none!(zfs_fioffs, FIOFFS_MAGIC, FIOFFS_TYPE_MODE); | |
let rc = unsafe { zfs_fioffs(file.as_raw_fd()) }; | |
if let Err(e) = rc { | |
let e: std::io::Error = e.into(); | |
return Err(CrucibleError::from(e)); | |
} |
(we should double-check with DTrace that this actually hits the right function!)
@@ -905,6 +908,43 @@ impl Region { | |||
result??; | |||
} | |||
|
|||
if cfg!(feature = "omicron-build") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using if cfg!(..)
weirds me out because the code is still compiled even if the feature is disabled; I guess the ioctl
code builds just fine on MacOS, but I'd rather not!
What about something like this?
#[cfg(feature = "omicron-build")]
{
#[cfg(not(target_os = "illumos"))]
compile_error!("cannot use FIOFFS on non-illumos systems");
// .. normal code continues below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For LiveRepair
, we required the ability to flush some extents (and their meta-data and
their dirty bit) and not flush other extents.
Will this change still allow that ability?
I'm pretty sure that this ioctl is extremely private, FWIW. How did we arrive here? Is there some analysis and some description of why this is definitely safe? |
It shouldn't - |
This PR is obsolete now due to: #1199 |
Yup, closing now |
illumos supports sending a FIOFFS ioctl to a ZFS dataset in order to sync all outstanding IO. Use this ioctl in
region_flush
for the whole region dataset instead of callingsync_all
for each extent file. This is hidden behind a newomicron-build
feature, which is true for our production builds.This necessitated separating flush (which commits bits to durable storage and is a no-op if
omicron-build
is used) andpost_flush
(which performs clean up or any other kind of accounting). Splitting this up exposed a bug in the sqlite implementation of ExtentInner: it wasn't checking to see if the extent was dirty and wasn't callingset_flush_number
before callingsync_all
on the user data file.