Switch to ZFSBootMenu for Arch #463

Open
meilon opened this issue Oct 22, 2023 · 38 comments

@meilon

meilon commented Oct 22, 2023

@ne9z, I have the following issue with the Arch Linux ZFS HOWTO:

I updated my install last night to the current kernel and ZFS combo, and now my system won't boot anymore. GRUB complains "compression algorithm inherit not supported". The pools mount fine in the Arch or Alpine live environments. I guess I must've accidentally enabled a feature or something, and GRUB won't work anymore.

From what I read, ZFS and GRUB were never that good together, so what do you think about a switch to ZFSBootMenu?

@ghost

ghost commented Oct 22, 2023 via email

@ghost

ghost commented Oct 22, 2023 via email

@meilon
Author

meilon commented Oct 22, 2023

I just updated linux, zfs and zfs-utils and did a reboot as always. Worked with 6.5.7 (and a lot of previous kernel releases), not with 6.5.8. Didn't do a zpool upgrade.

I didn't do anything weird between those two boots (6.5.7 update was the last one), especially doing anything zfs related.

Here's a paste of grub-probe: https://pastebin.com/dJWrj482

And yeah, I meant the supported features. Like booting snapshots for instance.

@ghost

ghost commented Oct 22, 2023

@gmelikov ZFS package update without zpool upgrade breaks GRUB. What's your take on this?

@gmelikov
Member

Hm, there is some history of similar problems reported elsewhere: https://forum.proxmox.com/threads/grub-error-compression-algorithm-inherit-not-supported-on-reboot.38629/ and pbatard/EfiFs#27

We need a reproducer. @ne9z, maybe you have already tried? Is it enough to just follow the Arch instructions? I haven't tried it yet. We didn't have large changes on the compression side in 2.2.

@ghost

ghost commented Oct 23, 2023 via email

@gmelikov
Member

gmelikov commented Oct 23, 2023

@meilon @ne9z I've found an issue about that: openzfs/zfs#15261

EDIT: oops, @meilon you were faster :)

@ghost

ghost commented Oct 23, 2023 via email

@meilon
Author

meilon commented Oct 23, 2023

Thank you, I don't know why I clung to keeping the bpool. Without it it works great!

@ghost

ghost commented Oct 23, 2023 via email

@ghost

ghost commented Oct 23, 2023 via email

@ahesford

You are free to switch to ZFSBootMenu for your computers. For me, it is not nearly as universal or battle-tested as GRUB, and adds unnecessary complication to the guide. GRUB is actively developed and its security is constantly scrutinized by companies around the world.

This is the funniest thing I've read all week. Thanks for that.

@ghost

ghost commented Oct 24, 2023 via email

@ahesford

Note that, if you are switching to ZFSBootMenu, you shouldn't destroy your boot pool or reinstall your kernel packages. Just copy the contents of /boot to /boot.new, unset the mountpoint on your existing /boot and confirm that it's unmounted, rmdir /boot && mv /boot.new /boot.
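
A rough sketch of the sequence described above, written as a dry run that only prints the commands. The dataset name `bpool/BOOT/default` is a placeholder rather than something stated in this thread (check `zfs list -o name,mountpoint`), `mountpoint=none` is one assumed way to "unset the mountpoint", and anything else mounted beneath /boot, such as an EFI system partition, would need to be handled before the final step.

```shell
#!/bin/sh
# Dry run: prints the migration commands for review instead of executing them.
# The dataset name passed in is a placeholder (assumption), not taken from the thread.

print_boot_migration() {
    boot_dataset=$1
    cat <<EOF
mkdir /boot.new
cp -a /boot/. /boot.new/
zfs set mountpoint=none $boot_dataset
if mountpoint -q /boot; then echo "/boot is still mounted" >&2; exit 1; fi
rmdir /boot && mv /boot.new /boot
EOF
}

# Example invocation; override BOOT_DATASET with the real dataset name.
print_boot_migration "${BOOT_DATASET:-bpool/BOOT/default}"
```

None of the printed commands destroys the boot pool or its data; the old dataset simply stops being mounted at /boot.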

Needless destruction before trying something new is a terrible practice. If you migrate to ZFSBootMenu and like it, you can then decide what you want to do with the vestigial boot pool.

@ghost

ghost commented Oct 24, 2023 via email

@rlaager
Member

rlaager commented Nov 4, 2023

Is this an actual problem or just a spurious message?

@ghost

ghost commented Nov 4, 2023 via email

@colmbuckley
Contributor

Note: your reproduction above is not sound - you are creating the bpool pool on a zeroed-out file inside a (presumably not grub-compatible) dataset, then snapshotting that dataset. GRUB is trying to map from this back into raw disk offsets, which I expect would be highly unreliable. I think that the grub-probe errors here are red herrings (I'm frankly surprised it works even with compatibility=legacy).

Has anyone been able to reproduce this problem when creating bpool on a separate disk partition as the guide suggests?

@colmbuckley
Contributor

Ah, I see that the original reporter did. Hmm. @meilon do you still have access to the bpool you originally had trouble with? Could you show the output of zpool get all bpool if so?

@meilon
Author

meilon commented Nov 8, 2023

I destroyed the pool, but didn't delete the partitions. I re-imported it just now, here's the requested output:

$ zpool get all bpool
NAME   PROPERTY                       VALUE                          SOURCE
bpool  size                           3.75G                          -
bpool  capacity                       0%                             -
bpool  altroot                        -                              default
bpool  health                         ONLINE                         -
bpool  guid                           7872762533682099877            -
bpool  version                        -                              default
bpool  bootfs                         -                              default
bpool  delegation                     on                             default
bpool  autoreplace                    off                            default
bpool  cachefile                      -                              default
bpool  failmode                       wait                           default
bpool  listsnapshots                  off                            default
bpool  autoexpand                     off                            default
bpool  dedupratio                     1.00x                          -
bpool  free                           3.75G                          -
bpool  allocated                      988K                           -
bpool  readonly                       off                            -
bpool  ashift                         12                             local
bpool  comment                        -                              default
bpool  expandsize                     -                              -
bpool  freeing                        0                              -
bpool  fragmentation                  0%                             -
bpool  leaked                         0                              -
bpool  multihost                      off                            default
bpool  checkpoint                     -                              -
bpool  load_guid                      14349335285769238702           -
bpool  autotrim                       on                             local
bpool  compatibility                  off                            default
bpool  bcloneused                     0                              -
bpool  bclonesaved                    0                              -
bpool  bcloneratio                    1.00x                          -
bpool  feature@async_destroy          enabled                        local
bpool  feature@empty_bpobj            active                         local
bpool  feature@lz4_compress           active                         local
bpool  feature@multi_vdev_crash_dump  disabled                       local
bpool  feature@spacemap_histogram     active                         local
bpool  feature@enabled_txg            active                         local
bpool  feature@hole_birth             active                         local
bpool  feature@extensible_dataset     enabled                        local
bpool  feature@embedded_data          active                         local
bpool  feature@bookmarks              enabled                        local
bpool  feature@filesystem_limits      enabled                        local
bpool  feature@large_blocks           enabled                        local
bpool  feature@large_dnode            disabled                       local
bpool  feature@sha512                 disabled                       local
bpool  feature@skein                  disabled                       local
bpool  feature@edonr                  disabled                       local
bpool  feature@userobj_accounting     disabled                       local
bpool  feature@encryption             disabled                       local
bpool  feature@project_quota          disabled                       local
bpool  feature@device_removal         disabled                       local
bpool  feature@obsolete_counts        disabled                       local
bpool  feature@zpool_checkpoint       disabled                       local
bpool  feature@spacemap_v2            disabled                       local
bpool  feature@allocation_classes     disabled                       local
bpool  feature@resilver_defer         disabled                       local
bpool  feature@bookmark_v2            disabled                       local
bpool  feature@redaction_bookmarks    disabled                       local
bpool  feature@redacted_datasets      disabled                       local
bpool  feature@bookmark_written       disabled                       local
bpool  feature@log_spacemap           disabled                       local
bpool  feature@livelist               disabled                       local
bpool  feature@device_rebuild         disabled                       local
bpool  feature@zstd_compress          disabled                       local
bpool  feature@draid                  disabled                       local
bpool  feature@zilsaxattr             disabled                       local
bpool  feature@head_errlog            disabled                       local
bpool  feature@blake3                 disabled                       local
bpool  feature@block_cloning          disabled                       local
bpool  feature@vdev_zaps_v2           disabled                       local
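
To make a listing like the one above easier to compare between pools, the features that are not `disabled` can be filtered out. This is a minimal sketch, assuming the four-column `NAME PROPERTY VALUE SOURCE` layout shown here; `active_features` is a made-up helper name.

```shell
#!/bin/sh
# Reads `zpool get all <pool>` output on stdin and prints "<feature> <state>"
# for every feature whose state is not "disabled".
# Assumes whitespace-separated columns: NAME PROPERTY VALUE SOURCE.
active_features() {
    awk '$2 ~ /^feature@/ && $3 != "disabled" {
        sub(/^feature@/, "", $2)   # drop the "feature@" prefix
        print $2, $3
    }'
}

# Usage (not executed here, since it needs a real pool):
#   zpool get all bpool | active_features
```

Applied to the output above, this leaves eleven features, from async_destroy through large_blocks.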

@ghost

ghost commented Nov 8, 2023 via email

@colmbuckley
Contributor

You might be right about the part about testing a file-based pool. But now that I actually use my own guides on my physical laptop, I can conclusively demonstrate that compat=legacy on a bare-metal machine fixes the "bpool snapshot breaks GRUB compatibility" issue.

Can you provide the same output from zpool get all and grub-probe in the case where it fails - ie: with compatibility=grub2? The above is expected behavior.

@ghost

ghost commented Nov 8, 2023 via email

@ghost

ghost commented Nov 8, 2023 via email

@colmbuckley
Contributor

Here is the bpool recreated with compat=legacy

This is very mysterious; something in the feature set enabled by compatibility=grub2 is clearly triggering a GRUB bug (and it is a bug, because somehow it's misinterpreting an inherited zpool attribute as being the name of a compression algorithm and barfing), but none of the features in that compatibility set should be even vaguely related.

I have definitely installed dozens of Debian systems using a bpool with compatibility=grub2, and this has been stable through kernel upgrades (up to 6.4, though; I haven't tried 6.5), ZFS upgrades (to at least 2.1.11), GRUB reinstalls, and snapshots both automatic and manual of bpool and rpool; I have not encountered this issue. It might be a recent regression with 2.1.13 or kernel 6.5.

Regardless, the correct long-term fix is definitely to try to identify which feature(s) trigger the behavior and remove them from the grub2 compatibility set, rather than to avoid using it completely (this is literally its purpose). I will try to bisect the problem tomorrow on a suitable sacrificial VM.
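
One way to approach the bisection mentioned here is to create a throwaway pool with a growing subset of the grub2 feature list, snapshot it, and run grub-probe after each step. The sketch below only prints the commands for each trial. It assumes the compatibility file is a plain list of feature names with `#` comments (commonly found at /usr/share/zfs/compatibility.d/grub2, though the path may vary by distribution); `testpool`, the helper names, and the device argument are placeholders, and enabling one feature may implicitly enable features it depends on.

```shell
#!/bin/sh
# Dry-run helpers: nothing here creates or destroys a pool, it only prints commands.
# "testpool" and all function names are hypothetical.

# Print the feature names from a compatibility file, ignoring comments and blanks.
features_from_compat() {
    sed -e 's/#.*//' -e 's/[[:space:]]//g' -e '/^$/d' "$1"
}

# Print one trial: a pool with only the given features enabled (-d disables the rest),
# followed by a snapshot, a grub-probe check, and cleanup.
print_trial() {
    vdev=$1; shift
    opts=""
    for feat in "$@"; do opts="$opts -o feature@$feat=enabled"; done
    echo "zpool create -d$opts -O mountpoint=/boot -R /mnt testpool $vdev"
    echo "zfs snapshot testpool@probe"
    echo "grub-probe /mnt/boot"
    echo "zpool destroy testpool"
}

# Print one trial per prefix of the feature list, adding one feature each time.
print_cumulative_trials() {
    trial_vdev=$1
    compat_file=$2
    set --                       # start with an empty feature list
    for name in $(features_from_compat "$compat_file"); do
        set -- "$@" "$name"      # one more feature than the previous trial
        print_trial "$trial_vdev" "$@"
    done
}

# Usage (assumed path and device, not executed here):
#   print_cumulative_trials /dev/disk/by-id/SCRATCH-part3 /usr/share/zfs/compatibility.d/grub2
```

The first trial whose grub-probe fails would point at the feature added last, or at one of its dependencies. The printed commands should only ever be run against a scratch disk or VM.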

@ghost

ghost commented Nov 8, 2023 via email

@colmbuckley
Contributor

I agree; GRUB is a total bear, and very difficult to understand. Like others, I've switched to ZFSBootMenu for most of my systems; although I'd like to see it get another year or two of solid production experience before making it a formal recommendation.

We'll see if there's any movement on the upstream bug also.

@ghost

ghost commented Nov 9, 2023 via email

@ahesford

ahesford commented Nov 9, 2023

Just my two cents: with the transition to UEFI, the situation seems to have become the following: BIOS (1980s-era, very simplistic) + GRUB (a small OS) has given way to UEFI (overengineered, considered bloated by some; its specification spans thousands of pages and buggy implementations abound) + a relatively simple bootloader.

The popular criticisms of UEFI are not unreasonable, but calling GRUB complex and ZFSBootMenu relatively simple is an inversion of reality. The problems people have with GRUB are caused because it is too simple, which pushes complexity and error handling onto the user. Half of the GRUB woes experienced by ZFS users would vanish if GRUB took on added complexity to properly support the file system. Likewise, configuration is a hassle because there are layers of abstraction between what the executable expects and what users are expected to write. Furthermore, GRUB needs static configuration for all of the bits that users want to be dynamic. If GRUB were capable of parsing more convenient formats and could be more intelligent about discovering what it offers to boot, configuration for the user could be made straightforward.

ZFSBootMenu, on the other hand, is literally a Linux operating system with a custom init and user shell (written in bash!) added on top. It's easy to interact with because there is complex behavior to walk ZFS systems and find what it presents you, and driver support that lets us push the configurable bits into ZFS properties where they feel kind of natural. It's simple to develop because we don't manage all of the complexity of the kernel, ZFS drivers and runtime; we just copy them into the image and run them.

On UEFI systems, ZBM is easy to install because the complexity of the firmware means you can just drop a single file somewhere and the system will bring it up. On BIOS systems, launching ZFSBootMenu requires an intermediate loader (syslinux is a convenient choice) that requires configuration and user interaction because the firmware lacks these niceties.
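
For illustration, the UEFI case and the property-based configuration might look roughly like this. The sketch prints commands rather than running them; the image file name, the ESP mount point, the `EFI/ZBM/VMLINUZ.EFI` destination, the disk and partition numbers, the dataset name, and the kernel arguments are all placeholders. `org.zfsbootmenu:commandline` is the property ZFSBootMenu documents for kernel command-line arguments.

```shell
#!/bin/sh
# Dry run: prints the commands for dropping a ZFSBootMenu EFI image onto the ESP,
# registering it with the firmware, and storing kernel arguments in a ZFS property.
# Every path, device, and dataset below is an assumption for illustration.

print_zbm_uefi_install() {
    efi_image=$1    # already-downloaded ZFSBootMenu EFI executable
    esp_mount=$2    # where the EFI system partition is mounted
    disk=$3         # disk holding the ESP
    part=$4         # partition number of the ESP
    root_ds=$5      # dataset whose kernel arguments should be set
    cat <<EOF
mkdir -p $esp_mount/EFI/ZBM
cp $efi_image $esp_mount/EFI/ZBM/VMLINUZ.EFI
efibootmgr --create --disk $disk --part $part --label ZFSBootMenu --loader '\\EFI\\ZBM\\VMLINUZ.EFI'
zfs set org.zfsbootmenu:commandline="rw quiet" $root_ds
EOF
}

# Example invocation with placeholder values.
print_zbm_uefi_install ./zfsbootmenu.EFI /boot/efi /dev/sda 1 rpool/ROOT/default
```

On a BIOS system the `efibootmgr` step has no equivalent, which is where the intermediate loader mentioned above comes in.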

@colmbuckley
Contributor

I am unable to replicate this, unfortunately:

  • Linux kernel: 6.5.3-1~bpo12+1 (2023-10-08) x86_64 GNU/Linux
  • ZFS version: zfs-2.1.13-1~bpo12+1
  • GRUB version: 2.06-13+deb12u1

All of these are the most up to date versions from Debian bookworm plus backports.

This is a minimal session to create a disk containing the bpool pool plus GRUB's infrastructure. $DISK is a full virtual disk attached to a (Google Cloud) VM; I have used an identical setup to install numerous full systems.

root@bpooltest:/boot# sgdisk --zap-all $DISK
Creating new GPT entries in memory.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
root@bpooltest:/boot# sgdisk -a1 -n1:24K:+1000K -t1:EF02 $DISK
Creating new GPT entries in memory.
Warning: Setting alignment to a value that does not match the disk's
physical block size! Performance degradation may result!
Physical block size = 4096
Logical block size = 512
Optimal alignment = 8 or multiples thereof.
The operation has completed successfully.
root@bpooltest:/boot# sgdisk     -n3:0:+1G      -t3:BF01 $DISK
The operation has completed successfully.
root@bpooltest:/boot# zpool create \
    -o ashift=12 \
    -o autotrim=on \
    -o compatibility=grub2 \
    -o cachefile=/etc/zfs/zpool.cache \
    -O devices=off \
    -O acltype=posixacl -O xattr=sa \
    -O compression=lz4 \
    -O normalization=formD \
    -O relatime=on \
    -O canmount=off -O mountpoint=/boot -R /mnt \
    bpool ${DISK}-part3
root@bpooltest:/boot# zfs create -o canmount=off -o mountpoint=none bpool/BOOT
root@bpooltest:/boot# zfs create -o mountpoint=/boot bpool/BOOT/debian
root@bpooltest:/boot# mount | grep mnt
bpool/BOOT/debian on /mnt/boot type zfs (rw,nodev,relatime,xattr,posixacl)
root@bpooltest:/boot# cp /boot/*6.5* /mnt/boot
root@bpooltest:/boot# grub-probe /mnt/boot
zfs
root@bpooltest:/boot# grub-install --boot-directory=/mnt/boot $DISK
Installing for i386-pc platform.
Installation finished. No error reported.
root@bpooltest:/boot# zfs snap bpool@test
root@bpooltest:/boot# grub-probe /mnt/boot
zfs
root@bpooltest:/boot# grub-install --boot-directory=/mnt/boot $DISK
Installing for i386-pc platform.
Installation finished. No error reported.
root@bpooltest:/boot# zfs snap -r bpool@test2
root@bpooltest:/boot# grub-install  --boot-directory=/mnt/boot $DISK
Installing for i386-pc platform.
Installation finished. No error reported.
root@bpooltest:/boot# grub-probe /mnt/boot
zfs
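
The session above treats `grub-probe` printing `zfs` as the sign that GRUB's tools can still read the pool. Below is a small sketch of that check as a reusable function, which could be run after a snapshot or package update and before rebooting. `boot_fs_readable` and the `PROBE_CMD` override are hypothetical names; the override exists only so the function can be tried without GRUB installed. A passing probe is a hint rather than a guarantee, since the installed boot code may differ from the userspace tool.

```shell
#!/bin/sh
# Returns success only if the probe command reports "zfs" for the given directory.
# PROBE_CMD defaults to grub-probe; it is overridable purely for testing (assumption).
boot_fs_readable() {
    dir=$1
    result=$(${PROBE_CMD:-grub-probe} "$dir" 2>/dev/null) || return 1
    [ "$result" = "zfs" ]
}

# Usage (not executed here):
#   boot_fs_readable /boot || echo "grub-probe cannot read /boot; do not reboot yet" >&2
```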

Is anyone able to trigger the GRUB bug with a setup similar to this?

@ghost

ghost commented Nov 9, 2023 via email

@colmbuckley
Contributor

But can you replicate the error with a sequence similar to what I used? Your earlier replication seems to bring a lot more data into the bpool with the nixos root dataset on it?

@gmelikov
Member

gmelikov commented Nov 9, 2023

Sidenote: it looks like there may be differences in distro-specific grub2 builds. I had problems even with the root pool on SUSE several years ago while testing one of our guides (#258).

@colmbuckley
Contributor

I wonder if this is a regression in the newest versions of GRUB; Debian's version is fairly old.

@J4gQBqqR

J4gQBqqR commented Nov 20, 2023

Hi, active ZFS user here. I followed the documentation's installation method exactly, and I snapshotted the root of bpool and rpool the other day. I "reproduced/confirmed" the reported failure of GRUB being unable to boot from my bpool.

[image]

Now I've ended up with a system that cannot be booted, and I spent hours today trying to fix it without success.

Is there any fix I can carry out on my system to recover from my mistake of snapshotting my bpool?

Any suggestions?

@ghost

ghost commented Nov 20, 2023 via email

@colmbuckley
Contributor

This is almost definitely due to a regression in GRUB, as -o compatibility=grub2 was confirmed to be working (snapshots and all) a couple of months ago. It's unfortunately very time-consuming to figure out which zpool feature is triggering the GRUB bug (it should be removed from compatibility.d/grub2), and I have not yet been able to do this.

In the meantime, yes - recreating bpool with -o compatibility=legacy or migrating to ZBM are suitable workarounds.

@J4gQBqqR

Thank you for your help. I will employ both methods and keep one method as redundancy for the future.
