Use FIOFFS on illumos instead of fsync #1148

Closed
wants to merge 2 commits

Conversation


@jmpesp jmpesp commented Feb 6, 2024

illumos supports sending a FIOFFS ioctl to a ZFS dataset in order to
sync _all_ outstanding IO. Use this ioctl in `region_flush` for the
whole region dataset instead of calling `sync_all` for each extent file.
This is hidden behind a new `omicron-build` feature, which is true for
our production builds.

This necessitated separating flush (which commits bits to durable
storage and is a no-op if `omicron-build` is used) and `post_flush`
(which performs clean up or any other kind of accounting). Splitting
this up exposed a bug in the sqlite implementation of ExtentInner: it
wasn't checking to see if the extent was dirty and wasn't calling
`set_flush_number` before calling `sync_all` on the user data file.
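In sketch form, the flush / post_flush split described above might look like this (hypothetical, simplified signatures; the real methods carry job IDs and more state):

use std::fs::File;
use std::io::Result;

struct Inner {
    file: File,
    dirty: bool,
}

impl Inner {
    // Commit data to durable storage. Under `omicron-build`, the
    // dataset-wide FIOFFS issued from region_flush performs the sync,
    // so the per-extent call becomes a no-op.
    fn flush(&mut self) -> Result<()> {
        if cfg!(feature = "omicron-build") {
            Ok(())
        } else {
            self.file.sync_all()
        }
    }

    // Clean up / accounting: clear the dirty bit, record flush numbers,
    // and so on.
    fn post_flush(&mut self) -> Result<()> {
        self.dirty = false;
        Ok(())
    }
}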
@jmpesp jmpesp requested review from mkeeter and leftwo February 6, 2024 21:46

jmpesp commented Feb 6, 2024

For posterity, the behaviour of FIOFFS was confirmed by the following test:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/filio.h>
#include <unistd.h>

int main() {
        // FIOFFS operates on a whole filesystem, so open the dataset's
        // mountpoint directory rather than an individual file.
        int fd = open("/oxp_3d7b3f53-6487-4f0b-8a73-ba7f95a66d76/testpost", O_DIRECTORY | O_RDONLY, 0);

        if (fd == -1) {
                int e = errno;
                printf("could not open: %s\n", strerror(e));
                return 1;
        }

        // _FIOFFS ("file system flush") comes from illumos' sys/filio.h.
        printf("submitting %i %x %x:%x\n", _FIOFFS, _FIOFFS, 'f', 66);

        if (ioctl(fd, _FIOFFS, NULL) == -1) {
                int e = errno;
                printf("could not ioctl: %s\n", strerror(e));
        }

        if (close(fd) == -1) {
                int e = errno;
                printf("could not close: %s\n", strerror(e));
                return 1;
        }

        return 0;
}

Running this program (which submits a FIOFFS for /oxp_3d7b3f53-6487-4f0b-8a73-ba7f95a66d76/testpost, the mountpoint of a dataset) under dtrace shows that both zfs_sync and zil_commit are called:

james@dinnerbone:~$ gcc -o ioctl ioctl.c
james@dinnerbone:~$ pfexec dtrace -c 'pfexec ./ioctl' -n 'fbt::zfs_sync:entry' -n 'fbt::zfs_sync:return { print(arg1); }' -n 'fbt::zil_commit:entry'
dtrace: description 'fbt::zfs_sync:entry' matched 1 probe
dtrace: description 'fbt::zfs_sync:return ' matched 1 probe
dtrace: description 'fbt::zil_commit:entry' matched 1 probe
submitting 536897090 20006642 66:42
dtrace: pid 20032 has exited
CPU     ID                    FUNCTION:NAME
  0  49695                   zfs_sync:entry 
  0  47450                 zil_commit:entry 
  0  49696                  zfs_sync:return int64_t 0
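As a sanity check on the request number: _FIOFFS decomposes as IOC_VOID | ('f' << 8) | 66, where IOC_VOID is 0x20000000 in illumos' sys/ioccom.h. A standalone Rust snippet (illustrative only) reproduces the value printed above:

fn main() {
    const IOC_VOID: i32 = 0x2000_0000; // "no parameters", sys/ioccom.h
    const FIOFFS: i32 = IOC_VOID | (('f' as i32) << 8) | 66;
    // 'f' is 0x66 and 66 decimal is 0x42, hence the "66:42" output.
    assert_eq!(FIOFFS, 0x2000_6642);
    println!("{} {:#x}", FIOFFS, FIOFFS); // 536897090 0x20006642
}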

jmpesp commented Feb 6, 2024

> This is hidden behind a new omicron-build feature, which is true for our production builds.

Also noting: this is true because we currently build with --all-features in our build-release buildomat job.

Ok(())
}

fn post_flush(

It weirds me out that Extent::flush doesn't actually flush, depending on omicron-build. I understand why that's the case, but would like to propose an alternative API. I think we should break flushing behavior into three functions:

  • pre_flush contains everything through self.set_flush_number(..) in the current implementation
  • flush_inner implements the else branch, i.e. calling self.file.sync_all (unconditionally!)
  • post_flush is the same as in this PR

Then, we can have a default implementation of flush on trait ExtentInner which calls these functions one after the other, meaning unit tests that want to flush a single extent can just call flush(). Right now, those unit tests are no longer actually syncing the file to disk!

More importantly, the decision to skip calling flush_inner (i.e. just calling pre_flush + post_flush) is moved into the Region::region_flush implementation, which is right next to the FIOFFS call, which makes it more obvious what's going on.
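In sketch form, that trait shape might look like this (simplified, hypothetical signatures; the real ExtentInner methods also take job IDs, a logger, etc.):

use std::io::Result;

trait ExtentInner {
    // Everything up to and including set_flush_number(..) in the
    // current implementation.
    fn pre_flush(&mut self, new_flush: u64, new_gen: u64) -> Result<()>;

    // The current `else` branch: unconditionally sync the data file.
    fn flush_inner(&mut self) -> Result<()>;

    // Cleanup / accounting, as in this PR.
    fn post_flush(&mut self) -> Result<()>;

    // Default implementation: unit tests keep calling flush() and still
    // get a real sync; Region::region_flush can call pre_flush +
    // post_flush directly when FIOFFS covers the sync.
    fn flush(&mut self, new_flush: u64, new_gen: u64) -> Result<()> {
        self.pre_flush(new_flush, new_gen)?;
        self.flush_inner()?;
        self.post_flush()
    }
}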

}

cdt::extent__flush__done!(|| { (job_id.get(), self.extent_number, 0) });
@mkeeter mkeeter Feb 6, 2024

I weakly feel like this should be at the beginning of post_flush, so that it accounts for the actual flush time (even though that's amortized across many extents)


// We put all of our metadata updates into a single write to make this
// operation atomic.
self.set_flush_number(new_flush, new_gen)?;

I'm suspicious about this change.

If we lose power right after this line, the metadata DB will tell us that we're at a particular flush (new_flush) and dirty = 0, but the data in the extent file has not necessarily been persisted to disk!

I think the existing implementation – which updates flush number in the DB after syncing the extent file – is correct.
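To spell out the ordering (a standalone sketch with a hypothetical callback, not crucible's actual signatures):

use std::fs::File;
use std::io::Result;

// Durable-data-first: sync the extent file, then record the flush. If
// power is lost between the two steps, the metadata still says "dirty",
// which is safe to recover from; the reverse order is not.
fn flush_then_record(
    file: &File,
    set_flush_number: impl FnOnce() -> Result<()>,
) -> Result<()> {
    file.sync_all()?;
    set_flush_number()
}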

/*
* XXX Retry? Mark extent as broken?
* We must first fsync to get any outstanding data written to disk.

(Same vibes here about changing to pre_flush / flush_inner / post_flush; you'll obviously have to fix both extent formats if you change the trait)

Comment on lines +918 to +938
use std::io;
use std::os::fd::AsRawFd;
use std::ptr;

// "file system flush", defined in illumos' sys/filio.h
const FIOFFS: libc::c_int = 0x20000000 | (0x66 /*'f'*/ << 8) | 0x42 /*66*/;

// Open the region's mountpoint
let file = File::open(&self.dir)?;

let rc = unsafe {
    libc::ioctl(
        file.as_raw_fd(),
        FIOFFS as _,
        ptr::null_mut::<i32>(),
    )
};

if rc != 0 {
    return Err(io::Error::last_os_error().into());
}

Using nix::ioctl lets this be a little cleaner:

Suggested change (replacing the hand-rolled constant and libc::ioctl call above):

use std::os::fd::AsRawFd;

// Open the region's mountpoint
let file = File::open(&self.dir)?;

// "file system flush", defined in illumos' sys/filio.h
const FIOFFS_MAGIC: u8 = b'f';
const FIOFFS_TYPE_MODE: u8 = 66;
nix::ioctl_none!(zfs_fioffs, FIOFFS_MAGIC, FIOFFS_TYPE_MODE);

let rc = unsafe { zfs_fioffs(file.as_raw_fd()) };
if let Err(e) = rc {
    let e: std::io::Error = e.into();
    return Err(CrucibleError::from(e));
}

(we should double-check with DTrace that this actually hits the right function!)
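One cheap cross-check alongside DTrace (a hypothetical unit test, assuming nix's request_code_none! uses the BSD-style IOC_VOID encoding on illumos):

#[test]
fn fioffs_request_number_matches() {
    // Hand-rolled constant from the diff above.
    const FIOFFS: libc::c_int = 0x20000000 | (0x66 << 8) | 0x42;
    // The request number that the ioctl_none!-generated function passes
    // to ioctl(2).
    let encoded = nix::request_code_none!(b'f', 66) as libc::c_int;
    assert_eq!(encoded, FIOFFS);
}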

@@ -905,6 +908,43 @@ impl Region {
result??;
}

if cfg!(feature = "omicron-build") {

Using if cfg!(..) weirds me out because the code is still compiled even if the feature is disabled; I guess the ioctl code builds just fine on macOS, but I'd rather not!

What about something like this?

#[cfg(feature = "omicron-build")]
{
    #[cfg(not(target_os = "illumos"))]
    compile_error!("cannot use FIOFFS on non-illumos systems");
    
    // .. normal code continues below
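
For readers less familiar with the distinction: cfg!(..) is an ordinary boolean expression, so both branches must still compile on every target, while #[cfg(..)] removes the code from compilation entirely. A minimal standalone illustration:

fn main() {
    // cfg! evaluates to a bool, but BOTH arms below are type-checked
    // and compiled on every target:
    if cfg!(feature = "omicron-build") {
        println!("feature enabled");
    } else {
        println!("feature disabled");
    }

    // #[cfg] strips the block when the predicate is false, so it never
    // has to compile on targets without the feature:
    #[cfg(feature = "omicron-build")]
    {
        println!("only compiled with the feature");
    }
}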

@leftwo leftwo left a comment

For LiveRepair, we required the ability to flush some extents (and their meta-data and their dirty bit) and not flush other extents.

Will this change still allow that ability?


jclulow commented Feb 6, 2024

I'm pretty sure that this ioctl is extremely private, FWIW. How did we arrive here? Is there some analysis and some description of why this is definitely safe?


jmpesp commented Feb 7, 2024

> For LiveRepair, we required the ability to flush some extents (and their meta-data and their dirty bit) and not flush other extents.
>
> Will this change still allow that ability?

It shouldn't change that: extent_limit should still be honoured. What this PR changes is that instead of iterating through the dirty extents (where we noted that writes occurred) and calling fsync on each of their file handles, we now issue a zfs_sync for the whole region dataset. I believe this is equivalent: writes should not have been occurring on extents higher than the extent limit, independent of which flush technique is used.
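In sketch form (hypothetical names; the real Region::region_flush takes flush/gen numbers, job IDs, and more), the bookkeeping loop still respects the limit and only the sync call moves:

// Before: sync each dirty extent's file handle individually.
// After:  the same per-extent bookkeeping, one dataset-wide sync.
fn region_flush_sketch(dirty_extents: &[usize], extent_limit: Option<usize>) {
    for &eid in dirty_extents {
        if extent_limit.map_or(true, |limit| eid <= limit) {
            // per-extent flush-number / dirty-bit accounting stays here
        }
    }
    // A single FIOFFS on the region dataset replaces the per-extent
    // sync_all calls (see the diff excerpt earlier in the thread).
}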


leftwo commented Mar 12, 2024

This PR is obsolete now due to: #1199
Correct?


mkeeter commented Mar 12, 2024

Yup, closing now

@mkeeter mkeeter closed this Mar 12, 2024