-
Offhand, you would need multiple bit flips to do this. One bit flip in a buffer being checksummed is not enough to get ZFS to do a bad repair write. That said, bit flips can happen anywhere. One interesting place would be the buffer for a write before the checksum is computed. That would cause bad data to be written with a good checksum, and if it is metadata, then in an extreme case the pool could become unimportable. This happened to one of the developers on his personal machine; he debugged it, found the bit flip and wrote a custom tool to fix it. Another interesting place would be the machine code itself. What a bit flip does there depends on how the ISA works, but conceivably one instruction could morph into another, or an operand could change to indicate the wrong register, provided the flip does not turn the instruction into an illegal one.
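To make the "bad data with a good checksum" case concrete, here is a minimal sketch, with Python and SHA-256 standing in for the real write path and pool checksum; the only point is the ordering of the flip and the checksum computation.

```python
import hashlib

def checksum(buf: bytes) -> bytes:
    # Stand-in for the pool checksum (fletcher4, sha256, ... in real ZFS).
    return hashlib.sha256(buf).digest()

data = bytearray(b"important metadata block")

# A bit flips in RAM *before* the checksum is computed ...
data[3] ^= 0x04

# ... so the checksum is computed over the already-corrupted buffer
# and written out alongside the bad data.
stored_checksum = checksum(bytes(data))

# Later, on read, the corrupted block verifies perfectly:
assert checksum(bytes(data)) == stored_checksum
print("corrupt block passes checksum verification")
```

If the flip happened after the checksum had been computed, a later read would detect the mismatch; it is the flip-before-checksum ordering that defeats verification.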
-
Thanks! So for important files, I conclude that comparing the stored file (after emptying the read cache) with the original rules out file corruption. Concerning metadata writes, I wonder whether it makes sense to implement an option to double-check them in order to rule out that problem. There is a chance that ZFS will become more and more popular and be used on systems with non-ECC memory. A general question is to what extent OpenZFS is ready for use in production. There are more open issues for ZFS than for ext4, but I have no experience that would let me interpret these figures adequately. I raised this question
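Concretely, I would do something like the following (hypothetical paths; SHA-256 here is just a convenient way to compare the two files and has nothing to do with the pool's own checksums), after clearing the read cache, e.g. by exporting and re-importing the pool:

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical paths, for illustration only.
original = file_digest("/backup/source/important.dat")
stored = file_digest("/tank/data/important.dat")
print("match" if original == stored else "MISMATCH: stored copy differs")
```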
-
That will not work. First, we do not have a check to know whether metadata being written out is corrupt or not; the assumption is that it is good. That is how you would get bad metadata with a good checksum. Second, when the only in-memory copy of the metadata being written out is corrupt and the checksum generated from it says it is fine, verifying that checksum is not going to catch the problem. That being said, this would be a problem for any filesystem on a machine that does not have ECC, even ones that do not do checksums.
I would not consider that to be comparable for several reasons:
Looking at the kernel.org bugzilla and the e2fsprogs GitHub repository, I do not see Ted Ts'o and other ext4 developers filing bugs against ext4 to track things they found themselves, and I am not sure they are actively bug hunting the way a number of us are, so the number of reports is not a full picture. In terms of QA, every pull request to OpenZFS is subject to a test suite plus stochastic testing in userspace, which catches a number of things. I believe ext4 uses the XFS test suite, although it does not have stochastic testing in userspace, and it does not get the same number of runs: if I recall correctly, Ted Ts'o does a test run once a day, while the ZFS test suite is run dozens of times every day. Lastly, both Linux and OpenZFS use Coverity scans to find potential bugs, and the defect densities are the opposite of the outstanding bug reports: https://scan.coverity.com/projects/openzfs-zfs Our kernel module is at 0.13 unresolved defects per 1000 lines while the ext* kernel modules are at 0.57 unresolved defects per 1000 lines.
-
For what it is worth, CPU L3 caches are typically protected by ECC. Hypothetically speaking, if you made a hypervisor that ran the CPU in no-fill mode, you could use the L3 cache as RAM and implement ECC in software. However, it would be slow, and I have only ever heard of this being done in a proprietary hypervisor to implement software-based memory encryption.
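Purely to illustrate what implementing ECC in software means, here is a toy Hamming(7,4) single-error corrector (nothing like what such a hypervisor would actually use, and far too slow for real memory traffic):

```python
def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]        # d1..d4
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]      # codeword positions 1..7
    return sum(b << i for i, b in enumerate(bits))

def hamming74_correct(word: int) -> int:
    """Correct any single-bit error and return the 4 data bits."""
    bits = [(word >> i) & 1 for i in range(7)]
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)            # 1-indexed error position
    if syndrome:
        bits[syndrome - 1] ^= 1                      # flip the bad bit back
    return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)

codeword = hamming74_encode(0b1011)
flipped = codeword ^ (1 << 5)                        # a single bit flip in "memory"
assert hamming74_correct(flipped) == 0b1011
```

Real ECC DIMMs typically use a SECDED code over 64-bit words in hardware; doing anything equivalent in software for every memory access is why the hypothetical hypervisor approach would be so slow.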
-
Of course, in order to maintain independence, the code for metadata computation and all its subroutines (e.g., checksum code) would have to be present twice in memory.
Ok, so verified (importable) backups seem to be unavoidable, even when using ZFS with a redundant disk array and regular scrubbing.
I clearly disagree here, for two reasons. First, the chance of a metadata bit flip is relatively high. In the case of the unimportable pool known to you, was the main node (the main directory metadata block) affected, or just an arbitrary metadata block (one of many)?

Second, the argument concerning safety, here expressed in terms of mathematical improbability, is clearly invalid. For example, as written in the NSA CYBERSECURITY 2020 YEAR IN REVIEW, page 5, the NSA obviously uses its publicly known or similar algorithms for controlling nuclear missiles. You could not argue that accidentally launching a nuclear intercontinental missile, or allowing a third party to launch one by breaking the cryptography of a weak algorithm, is an "outlier"; the algorithm would simply not have been strong enough (the probability too high). In practice, sufficient improbability means infeasibility, and the 20,000 or 30,000 mathematicians at the NSA seem to share my view. Although in a totally different context, you could argue similarly with regard to nuking a filesystem: the design is simply too poor if it allows such an incident (loss of all data) even once. There has never been a successful brute-force attack on a 128-bit key, and to my knowledge not even on an 80-bit key; SHA-1, with 80 bits of security, was broken because of a design flaw, not by brute force. So I would still advocate the duplicate, independent computation of metadata blocks because of their importance, increasing security from maybe 50 to 100 bits and hence increasing the security margin by roughly 50 bits (making an unimportable pool due to metadata corruption about 2^50 = 1125899906842624 times less likely). Given the widespread use of ZFS on non-ECC memory nowadays, this would follow the general ZFS policy of employing checksums and redundancy to protect against loss or corruption of data due to hardware failure.
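To make the proposal concrete, here is a rough sketch of the kind of duplicate, independent computation I mean (illustration only, in Python rather than in the actual ZFS write path; the write hook is a made-up placeholder):

```python
import hashlib

def checksum(buf) -> bytes:
    return hashlib.sha256(buf).digest()

def write_metadata_redundantly(metadata: bytes, write) -> None:
    """Keep two separately stored copies of the buffer, checksum each one,
    and refuse to write if anything disagrees, so a flip in either copy
    (or in one checksum computation) cannot silently become
    'bad data with a good checksum'."""
    copy_a = bytearray(metadata)   # first in-memory copy
    copy_b = bytearray(metadata)   # second, separately allocated copy
    sum_a = checksum(copy_a)
    sum_b = checksum(copy_b)
    if copy_a != copy_b or sum_a != sum_b:
        raise RuntimeError("in-memory corruption detected, refusing to write")
    write(bytes(copy_a), sum_a)

# Example call with a dummy write hook:
write_metadata_redundantly(b"some metadata", lambda data, cksum: None)
```

This roughly doubles the CPU cost per metadata write and, as you pointed out, still cannot catch a flip that happens before the buffer is duplicated, so it narrows the window rather than closing it.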
Would it be sufficient to run a ZFS mirror of three hard disks, from time to time remove one of the disks containing the data (replacing it with a new, empty disk to be resilvered from the data still available on the other two), and do a test import of the pool on the removed disk in read-only mode on another computer, to ensure that this disk is a valid backup and can be stored at a different place?
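Concretely, I imagine something along these lines, using zpool split, which as I understand it turns one mirror member into a separately importable pool (all pool, disk and mountpoint names below are made up, and I have not verified this exact sequence):

```python
import subprocess

def run(*cmd):
    """Print and execute one command, aborting on failure."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Made-up names: pool "tank", remaining member "sdb",
# outgoing disk "sdc" (becomes standalone pool "tankbackup"),
# replacement disk "sdd".
run("zpool", "split", "tank", "tankbackup", "sdc")   # peel one mirror member off as its own pool
run("zpool", "attach", "tank", "sdb", "sdd")         # add the fresh disk; it resilvers from the rest

# On the other computer, check that the removed disk really is a usable backup:
run("zpool", "import", "-o", "readonly=on", "-R", "/mnt/verify", "tankbackup")
run("zpool", "status", "tankbackup")
run("zpool", "export", "tankbackup")
```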
-
Could a random bit flip in memory lead to incorrect self-healing, and hence to file corruption, in ZFS?
I wonder whether a random bit flip in memory might cause a checksum mismatch during reading or scrubbing, causing ZFS to prefer an alternative data block and corrupt an actually correct file on the hard disk by modifying it (self-healing).
Many computers nowadays don't have ECC, and there should be a mechanism to guard against that, such as perhaps re-calculating the checksum a second time in the rare case of a checksum mismatch.
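What I have in mind is roughly the following (a pseudocode-level sketch in Python, not ZFS's actual read path; read_block, expected_checksum and repair are placeholder parameters, not real ZFS interfaces):

```python
import hashlib

def checksum(buf: bytes) -> bytes:
    return hashlib.sha256(buf).digest()

def read_with_double_check(read_block, expected_checksum, repair):
    """Only treat the on-disk block as bad (and trigger a repair write) if a
    second, independent read and checksum computation also disagree, so a
    transient in-memory bit flip cannot cause a spurious 'self-heal'."""
    block = read_block()
    if checksum(block) == expected_checksum:
        return block            # normal case: everything matches
    block = read_block()        # mismatch: re-read into a fresh buffer
    if checksum(block) == expected_checksum:
        return block            # the first mismatch was an in-memory flip
    return repair()             # the block really does appear bad on disk
```

Whether anything like this is feasible in the real ARC/self-healing path is exactly what I am asking.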
Main thread: https://forums.raspberrypi.com/viewtopic.php?p=2089368#p2089368