backup.sh seems to mess zfs pool (multiple datasets). #7
'sudo zfs mount -a' will expose all the original files for you, so no worries about missing data.
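(A rough illustration of that tip, using the pool name from this thread; a preceding 'geli attach' is needed first if the encrypted device isn't attached yet.)

```sh
zpool import master   # import the received backup pool
zfs mount -a          # mount every dataset, not just the pool's root dataset
```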
Was there a very short delay between 'snapback' runs? I haven't tried it with less than about 30s between runs, so we may be hitting an issue with the dataset still being written to.
Thanks for the detailed report!
On 12 June 2017 at 00:29, kathode wrote:
Hi,
Thanks for the nice guide. I've followed most of the steps exactly as
described, up to the point where I got the initial backup from my FreeNAS
installation to an intel NUC running 11.0-RELEASE-p9. I observe that I have
more than one dataset, but only one volume.
I then downloaded your script backup.sh, and after executing it a few
times eventually the remote system becomes unable to gracefully export the
pool. For example:
***@***.***:~ % echo ***@***.***:13:19Z" > .sunflower-last-sent
***@***.***:~ % ./backup.sh snapback
Estimated Size: 47.6M
46.8MiB 0:00:04 [11.6MiB/s] [==============================================================================================================================> ] 98%
***@***.***:~ % ./backup.sh snapback
Estimated Size: 30.5K
74.8KiB 0:00:00 [82.2KiB/s] [=================================================================================================================================] 245%
cannot export 'master': pool is busy
geli: Cannot destroy device da0.eli (error=16).
I've tried changing geli's command not to 'auto-detach' following this
report <https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=117158> but
even then I still get the same issue.
Following a reboot of the target system I then see 'UNAVAIL' for the pool.
***@***.***:~ # zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
master - - - - - - - UNAVAIL -
***@***.***:~ # zpool status
pool: master
state: UNAVAIL
status: One or more devices could not be opened. There are insufficient
replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
see: http://illumos.org/msg/ZFS-8000-3C
scan: none requested
config:
NAME STATE READ WRITE CKSUM
master UNAVAIL 0 0 0
1568430436390657639 UNAVAIL 0 0 0 was /dev/da0.eli
I then manually copy the key to the target system and ask geli to attach
the volume.
It then comes back online:
***@***.***:~ # zpool status
pool: master
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
master ONLINE 0 0 0
da0.eli ONLINE 0 0 0
However, if I then mount the volume by doing zfs mount master I can then
only see the top-level folders corresponding to each of the original
datasets, but not any files or folders saved within this zfs volume, which
seems to indicate to me that something is seriously messed up. I'm fairly
new to ZFS so I would appreciate some guidance as to whether I'm doing
something wrong with multiple datasets and your backup.sh script.
Thanks
Sure, I'm just testing this with random data, so no serious issues with real data yet :-)
Regarding the 'snapback' runs, yes, I ran them not that far apart, so that may be part of the problem, I'm not sure. But again, right now there isn't much data in the pool and I'm doing everything locally over my LAN. It's just that I still cannot export the pool (no matter how long I wait) unless I reboot the system. I've been Googling around but haven't found anything quite like this.
At some point I started to suspect the external USB 3.0 hard drive I'm using for this purpose, but then again, there are no obvious error messages in the logs. I've also activated AES-NI hardware offloading for geli, but I don't suppose that's a source of problems.
Thanks for the quick response at this time of the day, well appreciated :-)
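(For reference, a few generic FreeBSD checks for a "pool is busy" export failure; the /master mountpoint is an assumption and none of these commands come from the original posts.)

```sh
fstat -f /master       # list processes that still have files open on that filesystem
zpool status master    # check whether a scrub or resilver is still in progress
sync                   # flush any pending writes
zpool export master    # then retry the export
```

If fstat shows nothing and the export still fails, the pool is being held open inside the kernel, which matches the race-condition suspicion raised later in the thread.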
I've done some further testing and...
I seriously do not know why importing/exporting too often makes it impossible to then export it. Why is FreeBSD having these kinds of issues? It makes me rather worried about the stability of things going forward.
I have now posted this problem on the FreeBSD forums https://forums.freebsd.org/threads/61291/ because I seem to be getting nowhere while trying to debug this. There aren't any obvious errors in the logs that I can see, and sadly I haven't been able to find anyone with the same problem :(
Further debugging, if I may. I decided to try and destroy the pool, and try with it empty, both without and with geli, and things work so far:
So I then destroyed the pool once more and created it without geli (i.e., using da0 directly). I then modified my receive script accordingly:
I then follow your instructions to create a fresh snapshot and send it:
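(The commands and output for this step were not preserved in this copy of the thread. A generic recursive snapshot-and-send of the kind described might look roughly like the following; `tank` and `backup-host` are placeholders, and `master` is the target pool named elsewhere in the thread.)

```sh
# On the FreeNAS source: snapshot every dataset in the pool
zfs snapshot -r tank@manual-test
# Stream a full replication to the target and receive it into the pool "master"
zfs send -R tank@manual-test | ssh backup-host zfs recv -u -F master
```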
Question: I don't suppose the datasets prefixed with '.' are a problem?
I then waited some time as up to 1.26GB were transferred. And following the transfer:
Also no obvious errors on dmesg or anything suspicious in /var/log/messages:
I then follow with some runs:
All good so far. So I now save the latest snapshot identifier to .sunflower-last-sent:
I am now in a position to run ./backup.sh snapback.
I then do:
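(Again, the exact commands and output were lost here; based on the earlier transcript the sequence is presumably along these lines.)

```sh
./backup.sh snapback   # incremental run, as shown earlier in the thread
zpool export master    # fails with "cannot export 'master': pool is busy" when the bug strikes
```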
And... bang, the problem is back. Now there must be something special about the data received, or about how it is received. Next steps I intend to try:
Thanks
Further debugging, part 2. I went ahead and destroyed the pool at the remote location and recreated it again:
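(A minimal sketch of what destroying and recreating the backup pool might involve, assuming the raw da0 device from the previous comment; with geli back in the picture it would be da0.eli after a 'geli init'/'geli attach'.)

```sh
zpool destroy master       # discard the old backup pool
zpool create master da0    # recreate it directly on the raw device
```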
I then resent the snapshot manually again:
This time I note that I used a particular parameter. I then ran my usual sequence of test commands:
And to my surprise we're bang on the problem again.
Further debugging, part 3.
Further debugging, part 4.
I cannot ascertain this for sure.
Further debugging, part 5. Ok, this problem is now completely arbitrary...
Unless there are any good sources of logging I can tap into, this is becoming increasingly difficult to debug. There is obviously something wrong, somewhere.
Small update: I've now reproduced this bug (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219972#c1) on the same machine, between my external USB device and a USB stick. It's looking more and more likely to be a bug in FreeBSD. I wonder if you could help me try and reproduce it on your machine or FreeBSD version? Thanks
Sure, sounds like a good one if it turns out to be a bug.
Let me give it a shot.
In the meantime, try dotting a call to 'sync' in between the imports and exports.
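(A sketch of where such a sync call could go, using the pool name from earlier in the thread.)

```sh
zpool import master
# ... receive or inspect data here ...
sync                  # flush outstanding writes before exporting
zpool export master
```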
Hi @hughobrien,
Just to update you, I've now ruled out USB devices as the source of the problem. I've created virtual devices out of files on the same machine, and the issue still reproduces there.
Further details available at: https://forums.freebsd.org/threads/61291/#post-353225
I would very much appreciate it if you could confirm whether you can reproduce this issue by following the instructions in my forum post. Thanks!
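(For reference, one common way to build such file-backed test devices on FreeBSD; the file name, size, and pool name below are illustrative and not taken from the forum post.)

```sh
truncate -s 2G /tmp/zdisk0.img            # create a sparse 2 GB backing file
mdconfig -a -t vnode -f /tmp/zdisk0.img   # attach it as a memory disk, e.g. md0
zpool create testpool md0                 # build a test pool on the md device
```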
Quick follow-up: to reproduce this bug you need at least two cores, one of which should be mostly idle. I've now reproduced this on a FreeBSD VM (both 11.0-RELEASE and 11.1-BETA-2) on ESXi as well. Both my FreeBSD bug report and the linked forum post contain further details about my set-up (including a testing script).
Hi there, I'm just getting back around to updating this project -- did you have any success filing a bug with FreeBSD for the above issue?
Hugh
Hi Hugh, I did report it at https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219972; however, there hasn't been much progress, and I myself haven't had time to tinker with the issue again. At the time I narrowed it down to a race condition of sorts, but as I am not familiar with the internals of ZFS or FreeBSD it became impossible for me to probe further. It would be amazing if someone could take it further :-)
Hi,
Thanks for the nice guide. I've followed most of the steps exactly as described, up to the point where I got the initial backup from my FreeNAS installation to an Intel NUC running 11.0-RELEASE-p9. I observe that I have more than one dataset, but only one volume.
I then downloaded your script backup.sh, and after executing it a few times eventually the remote system becomes unable to gracefully export the pool. For example:
I've tried changing geli's command not to 'auto-detach', following this report (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=117158), but even then I still get the same issue.
Following a reboot of the target system I then see 'UNAVAIL' for the pool.
I then manually copy the key to the target system and ask geli to attach the volume.
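(A hedged sketch of that step; the key file path is illustrative, and the thread does not show the actual commands. The `zpool online` form follows the action line in the status output quoted above.)

```sh
geli attach -p -k /root/master.key da0   # attach the encrypted provider with the copied key (-p: key file only, no passphrase)
zpool online master da0.eli              # bring the device back online if ZFS doesn't pick it up automatically
```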
It then comes back online:
However, if I then mount the volume by doing `zfs mount master`, I can then only see the top-level folders corresponding to each of the original datasets, but not any files or folders saved within this zfs volume, which seems to indicate to me that something is seriously messed up.
I then do `zfs umount master`, followed by `zpool export master` and `zpool import master`. At this point the contents in the datasets are visible again at the mount point.
I am not sure what is causing the pool to become 'unexportable' in the first place. I'm fairly new to ZFS so I would appreciate some guidance as to whether I'm doing something wrong with multiple datasets and your backup.sh script. For reference, here is my current `zfs-receive.sh` script:
Thanks