Use FIOFFS on illumos instead of fsync #1148

jmpesp · 2024-02-06T21:46:13Z

illumos supports sending a FIOFFS ioctl to a ZFS dataset in order to sync all outstanding IO. Use this ioctl in region_flush for the whole region dataset instead of calling sync_all for each extent file. This is hidden behind a new omicron-build feature, which is true for our production builds.

This necessitated separating flush (which commits bits to durable storage and is a no-op if omicron-build is used) and post_flush (which performs clean up or any other kind of accounting). Splitting this up exposed a bug in the sqlite implementation of ExtentInner: it wasn't checking to see if the extent was dirty and wasn't calling set_flush_number before calling sync_all on the user data file.

jmpesp · 2024-02-06T21:48:41Z

For posterity, the behaviour of FIOFFS was confirmed by the following test:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/filio.h>
#include <unistd.h>

int main() {
        int fd = open("/oxp_3d7b3f53-6487-4f0b-8a73-ba7f95a66d76/testpost", O_DIRECTORY | O_RDONLY, 0);

        if (fd == -1) {
                int e = errno;
                printf("could not open: %s\n", strerror(e));
                return 1;
        }

        printf("submitting %i %x %x:%x\n", _FIOFFS, _FIOFFS, 'f', 66);

        if (ioctl(fd, _FIOFFS, NULL) == -1) {
                int e = errno;
                printf("could not ioctl: %s\n", strerror(e));
        }

        if (close(fd) == -1) {
                int e = errno;
                printf("could not close: %s\n", strerror(e));
                return 1;
        }

        return 0;
}

Running this program to submit a FIOFFS for /oxp_3d7b3f53-6487-4f0b-8a73-ba7f95a66d76/testpost (the mountpoint of a dataset) and running dtrace shows that both zfs_sync and zil_commit are called:

james@dinnerbone:~$ gcc -o ioctl ioctl.c
james@dinnerbone:~$ pfexec dtrace -c 'pfexec ./ioctl' -n 'fbt::zfs_sync:entry' -n 'fbt::zfs_sync:return { print(arg1); }' -n 'fbt::zil_commit:entry'
dtrace: description 'fbt::zfs_sync:entry' matched 1 probe
dtrace: description 'fbt::zfs_sync:return ' matched 1 probe
dtrace: description 'fbt::zil_commit:entry' matched 1 probe
submitting 536897090 20006642 66:42
dtrace: pid 20032 has exited
CPU     ID                    FUNCTION:NAME
  0  49695                   zfs_sync:entry 
  0  47450                 zil_commit:entry 
  0  49696                  zfs_sync:return int64_t 0

jmpesp · 2024-02-06T21:51:08Z

> This is hidden behind a new omicron-build feature, which is true for our production builds.

Also noting: this is true because we currently build with --all-features in our build-release buildomat job.

mkeeter · 2024-02-06T22:06:49Z

downstairs/src/extent_inner_raw.rs

+        Ok(())
+    }
+
+    fn post_flush(


It weirds me out that Extent::flush doesn't actually flush, depending on omicron-build. I understand why that's the case, but would like to propose an alternative API. I think we should break flushing behavior into three functions:

pre_flush contains everything through self.set_flush_number(..) in the current implementation

flush_inner implements the else branch, i.e. calling self.file.sync_all (unconditionally!)

post_flush is the same as in this PR

Then, we can have a default implementation of flush on trait ExtentInner which calls these functions one after the other, meaning unit tests that want to flush a single extent can just call flush(). Right now, those unit tests are no longer actually syncing the file to disk!

More importantly, the decision to skip calling flush_inner (i.e. just calling pre_flush + post_flush) is moved into the Region::region_flush implementation, which is right next to the FIOFFS call, which makes it more obvious what's going on.
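
A minimal sketch of the proposed split, for illustration only. The trait here is a simplified, hypothetical stand-in for the real `ExtentInner` (the actual signatures and error types are not shown in this thread), `RecordingExtent` is an invented in-memory type that just records which phases ran, and the `dataset_sync` flag stands in for the `omicron-build` decision.

```rust
use std::io;

/// Hypothetical, simplified stand-in for the real `ExtentInner` trait.
trait ExtentInner {
    /// Everything up to the sync, ending with recording the flush number.
    fn pre_flush(&mut self, new_flush: u64, new_gen: u64) -> io::Result<()>;

    /// Unconditionally sync this extent's file to durable storage.
    fn flush_inner(&mut self) -> io::Result<()>;

    /// Clean-up and accounting after the data is durable.
    fn post_flush(&mut self) -> io::Result<()>;

    /// Default: run the three phases in order, so a single extent can
    /// still be flushed on its own (for example, in unit tests).
    fn flush(&mut self, new_flush: u64, new_gen: u64) -> io::Result<()> {
        self.pre_flush(new_flush, new_gen)?;
        self.flush_inner()?;
        self.post_flush()
    }
}

/// Invented in-memory extent that only records which phases ran.
#[derive(Default)]
struct RecordingExtent {
    flush_number: u64,
    gen_number: u64,
    calls: Vec<&'static str>,
}

impl ExtentInner for RecordingExtent {
    fn pre_flush(&mut self, new_flush: u64, new_gen: u64) -> io::Result<()> {
        self.flush_number = new_flush;
        self.gen_number = new_gen;
        self.calls.push("pre_flush");
        Ok(())
    }

    fn flush_inner(&mut self) -> io::Result<()> {
        self.calls.push("flush_inner");
        Ok(())
    }

    fn post_flush(&mut self) -> io::Result<()> {
        self.calls.push("post_flush");
        Ok(())
    }
}

/// Region-level flush. With `dataset_sync` set, the per-extent sync is
/// skipped in favour of one dataset-wide sync between the two phases.
fn region_flush(
    extents: &mut [RecordingExtent],
    dataset_sync: bool,
    new_flush: u64,
    new_gen: u64,
) -> io::Result<()> {
    if !dataset_sync {
        for e in extents.iter_mut() {
            e.flush(new_flush, new_gen)?;
        }
        return Ok(());
    }
    for e in extents.iter_mut() {
        e.pre_flush(new_flush, new_gen)?;
    }
    // The FIOFFS ioctl on the region dataset would be issued here.
    for e in extents.iter_mut() {
        e.post_flush()?;
    }
    Ok(())
}
```

With this shape, the choice to skip `flush_inner` is visible in one place, next to where the dataset-wide sync would happen, while `flush()` on a single extent always syncs.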

mkeeter · 2024-02-06T22:07:56Z

downstairs/src/extent_inner_raw.rs

        }
+
+        cdt::extent__flush__done!(|| { (job_id.get(), self.extent_number, 0) });


I weakly feel like this should be at the beginning of post_flush, so that it accounts for the actual flush time (even though that's amortized across many extents)

mkeeter · 2024-02-06T22:15:49Z

downstairs/src/extent_inner_sqlite.rs

+
+        // We put all of our metadata updates into a single write to make this
+        // operation atomic.
+        self.set_flush_number(new_flush, new_gen)?;


I'm suspicious about this change.

If we lose power right after this line, the metadata DB will tell us that we're at a particular flush (new_flush) and dirty = 0, but the data in the extent file has not necessarily been persisted to disk!

I think the existing implementation – which updates flush number in the DB after syncing the extent file – is correct.
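
The ordering concern can be illustrated with a toy model. This is a sketch under loud assumptions: `ToyExtent` and its methods are invented for illustration, metadata updates are treated as durable the moment they are made, and nothing here touches a real file or SQLite database.

```rust
/// Toy model of one extent: a data file plus a metadata record.
/// Assumption: metadata updates are durable as soon as they are made.
#[derive(Debug, Default)]
struct ToyExtent {
    durable_data: Vec<u8>,
    unsynced_data: Vec<u8>,
    flush_number: u64,
    dirty: bool,
}

impl ToyExtent {
    /// A write lands in the unsynced buffer and marks the extent dirty.
    fn write(&mut self, bytes: &[u8]) {
        self.unsynced_data.extend_from_slice(bytes);
        self.dirty = true;
    }

    /// Stand-in for `sync_all` on the data file.
    fn sync_data(&mut self) {
        let pending = std::mem::take(&mut self.unsynced_data);
        self.durable_data.extend(pending);
    }

    /// Stand-in for `set_flush_number`: records the flush, clears dirty.
    fn set_flush_number(&mut self, new_flush: u64) {
        self.flush_number = new_flush;
        self.dirty = false;
    }

    /// Power loss drops anything that was never synced.
    fn power_loss(&mut self) {
        self.unsynced_data.clear();
    }
}

/// Metadata first, then power loss before the data sync.
fn crash_after_metadata_first() -> ToyExtent {
    let mut e = ToyExtent::default();
    e.write(b"abc");
    e.set_flush_number(1);
    e.power_loss();
    e
}

/// Data sync first, then power loss before the metadata update.
fn crash_after_sync_first() -> ToyExtent {
    let mut e = ToyExtent::default();
    e.write(b"abc");
    e.sync_data();
    e.power_loss();
    e
}
```

In the first case the model ends up clean at flush 1 with no durable data, which is the state described above. In the second it is still dirty at flush 0 with the data durable, so the extent is still recognisably unflushed.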

mkeeter · 2024-02-06T22:16:23Z

downstairs/src/extent_inner_sqlite.rs

            /*
-             * XXX Retry?  Mark extent as broken?
+             * We must first fsync to get any outstanding data written to disk.


(Same vibes here about changing to pre_flush / flush_inner / post_flush; you'll obviously have to fix both extent formats if you change the trait)

mkeeter · 2024-02-06T22:39:42Z

downstairs/src/region.rs

+            use std::io;
+            use std::os::fd::AsRawFd;
+            use std::ptr;
+
+            // "file system flush", defined in illumos' sys/filio.h
+            const FIOFFS: libc::c_int = 0x20000000 | (0x66 /*'f'*/ << 8) | 0x42 /*66*/;
+
+            // Open the region's mountpoint
+            let file = File::open(&self.dir)?;
+
+            let rc = unsafe {
+                libc::ioctl(
+                    file.as_raw_fd(),
+                    FIOFFS as _,
+                    ptr::null_mut::<i32>(),
+                )
+            };
+
+            if rc != 0 {
+                return Err(io::Error::last_os_error().into());
+            }


Using nix::ioctl lets this be a little cleaner:

Suggested change

-            use std::io;
-            use std::os::fd::AsRawFd;
-            use std::ptr;
-
-            // "file system flush", defined in illumos' sys/filio.h
-            const FIOFFS: libc::c_int = 0x20000000 | (0x66 /*'f'*/ << 8) | 0x42 /*66*/;
-
-            // Open the region's mountpoint
-            let file = File::open(&self.dir)?;
-
-            let rc = unsafe {
-                libc::ioctl(
-                    file.as_raw_fd(),
-                    FIOFFS as _,
-                    ptr::null_mut::<i32>(),
-                )
-            };
-
-            if rc != 0 {
-                return Err(io::Error::last_os_error().into());
-            }
+            use std::os::fd::AsRawFd;
+
+            // Open the region's mountpoint
+            let file = File::open(&self.dir)?;
+
+            // "file system flush", defined in illumos' sys/filio.h
+            const FIOFFS_MAGIC: u8 = b'f';
+            const FIOFFS_TYPE_MODE: u8 = 66;
+            nix::ioctl_none!(zfs_fioffs, FIOFFS_MAGIC, FIOFFS_TYPE_MODE);
+
+            let rc = unsafe { zfs_fioffs(file.as_raw_fd()) };
+            if let Err(e) = rc {
+                let e: std::io::Error = e.into();
+                return Err(CrucibleError::from(e));
+            }

(we should double-check with DTrace that this actually hits the right function!)
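
As a cross-check that does not need DTrace, the request number itself can be recomputed. The sketch below assumes the encoding used in the diff above (`0x20000000 | ('f' << 8) | 66`). The names `IOC_VOID` and `io_request` are illustrative, and whether `nix::ioctl_none!` produces the same value on illumos is exactly the open question, so this only pins down the target value.

```rust
/// Flag for an ioctl that carries no argument; this is the 0x20000000
/// term from the diff above. The constant name is an assumption.
const IOC_VOID: u32 = 0x2000_0000;

/// Compose a no-argument ioctl request number from a group byte and a
/// command number, mirroring the expression used in the diff.
const fn io_request(group: u8, number: u8) -> u32 {
    IOC_VOID | ((group as u32) << 8) | (number as u32)
}

/// "file system flush": group 'f', command 66.
const FIOFFS: u32 = io_request(b'f', 66);
```

The result, `0x20006642` (decimal 536897090), matches the `submitting 536897090 20006642 66:42` line printed by the C test program earlier in this thread.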

mkeeter · 2024-02-06T22:40:28Z

downstairs/src/region.rs

@@ -905,6 +908,43 @@ impl Region {
            result??;
        }

+        if cfg!(feature = "omicron-build") {


Using if cfg!(..) weirds me out because the code is still compiled even if the feature is disabled; I guess the ioctl code builds just fine on macOS, but I'd rather not!

What about something like this?

#[cfg(feature = "omicron-build")]
{
    #[cfg(not(target_os = "illumos"))]
    compile_error!("cannot use FIOFFS on non-illumos systems");
    // .. normal code continues below
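
To make the difference concrete, here is a small sketch. It keys on `target_os` rather than the `omicron-build` feature so that it compiles standalone, and the function names are invented for illustration.

```rust
/// `cfg!` expands to a `true` or `false` literal, so this is an ordinary
/// `if`: both branches are parsed and type-checked on every platform.
fn flush_strategy_checked_everywhere() -> &'static str {
    if cfg!(target_os = "illumos") {
        "fioffs"
    } else {
        "fsync"
    }
}

/// With `#[cfg]`, only one of these two definitions exists in any given
/// build; the other is removed before type checking.
#[cfg(target_os = "illumos")]
fn flush_strategy_compiled_out() -> &'static str {
    "fioffs"
}

#[cfg(not(target_os = "illumos"))]
fn flush_strategy_compiled_out() -> &'static str {
    "fsync"
}
```

Both functions return the same answer on any platform. The difference is that illumos-only code inside the `cfg!` branch must still compile everywhere, while code under `#[cfg]` does not.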

leftwo

For LiveRepair, we required the ability to flush some extents (and their metadata and their dirty bit) and not flush other extents.

Will this change still allow that ability?

jclulow · 2024-02-06T22:48:18Z

I'm pretty sure that this ioctl is extremely private, FWIW. How did we arrive here? Is there some analysis and some description of why this is definitely safe?

jmpesp · 2024-02-07T19:51:32Z

> For LiveRepair, we required the ability to flush some extents (and their metadata and their dirty bit) and not flush other extents.
>
> Will this change still allow that ability?

It shouldn't affect that: extent_limit should still be honoured. What this PR changes is that, instead of iterating through the dirty extents (where we noted that writes occurred) and calling fsync on those file handles, we now issue a zfs_sync for the whole region dataset. I believe this is equivalent: writes should not have been occurring on extents higher than the extent limit, independent of which flush technique is used.

leftwo · 2024-03-12T05:48:31Z

This PR is obsolete now due to: #1199
Correct?

mkeeter · 2024-03-12T13:08:54Z

Yup, closing now

jmpesp requested review from mkeeter and leftwo February 6, 2024 21:46

a comment

7fe08e7

mkeeter reviewed Feb 6, 2024

leftwo reviewed Feb 6, 2024

mkeeter mentioned this pull request Feb 14, 2024

Consider not using on_disk_hash #1161

mkeeter mentioned this pull request Mar 11, 2024

Use FIOFFS on illumos instead of fsync (again) #1199

mkeeter closed this Mar 12, 2024
