How to detect bad block cloning? #15554
Comments
Not as well as you might hope. (All that follows is to the best of my current understanding; since it's not fixed in master atm, some of this may be incorrect, though I would be surprised.) The problem there is similar to #11900, where the actual write wasn't the problem; it was the data read in the moment and the application acting on it. Specifically, there is a window where something being cloned wasn't properly marked as dirty right after cloning, so trying to use SEEK_HOLE/SEEK_DATA or equivalent on the thing that had just been cloned could incorrectly decide that swathes of the data were sparse, and write the destination accordingly. So, for example, if you were doing So you could basically go looking for unexpected regions of
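To make the SEEK_HOLE/SEEK_DATA part concrete, here is a minimal Python sketch of how a sparse-aware copy tool might ask the filesystem which byte ranges of a file hold data. The function name `data_segments` is made up for illustration, and this is not code from cp or from ZFS. It assumes a platform where `os.SEEK_DATA` and `os.SEEK_HOLE` exist (e.g. Linux).

```python
import errno
import os


def data_segments(path):
    """Return [(start, end), ...] byte ranges the filesystem reports as data.

    Hypothetical helper for illustration only. Anything not covered by a
    returned range is what the filesystem claims is a hole.
    """
    segments = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                # Jump to the next byte at or after offset that is data.
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:
                    break  # no more data between offset and end of file
                raise
            # Jump to the next hole after that data (EOF counts as a hole).
            end = os.lseek(fd, start, os.SEEK_HOLE)
            if end <= start:
                break  # defensive: never loop without making progress
            segments.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return segments
```

A tool that trusts these ranges only copies the reported data and leaves the rest as holes in the destination, so if the filesystem wrongly reports freshly written data as a hole, the copy ends up with zeros there while the source itself stays intact.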
Is block cloning only happening when I explicitly ask for it with things like "
rsync won't ever do it, afaik. cp as of GNU coreutils 9 defaults to =auto, which uses copy_file_range, which will try block cloning if it is enabled, yes.
Isn't there a way to find all cloned files/blocks on a filesystem? I assume that must be possible, but I can't seem to find anything with Google.
Again, that won't help you here. The problem isn't the cloned blocks. The problem is that if you tried to do a normal read+write of them too soon after they were written, the result won't be cloned but will be mangled.
Can we say that the pool is safe if the block cloning feature has not been enabled after updating to zfs 2.2.0? That is: not done a 'zpool upgrade', and 'zpool get all' gives:
That particular bug cannot arise if block_cloning is disabled, correct.
building |
And corrupted files may not be cloned at all, as @rincebrain pointed out.
It won't tell you which files are corrupted, but someone said
Thank you very much. That helped me a lot. At least I know now that all files which have ever been cloned are just source code files which I downloaded from somewhere into my src folder. Nothing important! I have not noticed any corruption so far, and now I know that no important data got cloned anyway.
That's just not true, as I explained earlier in the thread.
No, that just tells you the only ones with clones now are ...
You're totally right, brain fart from me there. But it should tell you how much data is AT RISK of having been corrupted by this particular bug, right?
@rincebrain First, thanks for trying to fix this issue as well as possible, from what I saw in all the threads regarding it. I am trying to understand the exact impact of the bug, as I have been hit by it badly. Can you please confirm or clarify the following questions that came up after reading through the threads relating to this bug?
Sorry if this should be obvious, but I am trying to keep any further impact of this on the data I am responsible for as low as possible, so I would rather ask twice than corrupt more data. (By the way, this is the worst data corruption I have run into in the last 20 years of IT. Silent data corruption (e.g. bad RAM, no ECC possible) is really bad, and avoiding it is exactly the main reason I chose ZFS in the first place.)
If I understand it correctly, it gives you the files that are CURRENTLY detectable as potentially corrupted. But if, e.g., the source of the copy was deleted in the past, there is no longer a cloned block entry, so it would not show up. (I hope this information is correct, as I am trying to understand the impact myself.)
Just wanted to cross-post that I have a reproducer script here: #15526 (comment)
@rincebrain thanks for the comprehensive clarification! 2: So that means a and b are always correct and only copies of b are potentially corrupted? Or is this a typo and "original (a)" was meant? 4: I understand that restriction, but with at least 1 snapshot available for each day since feature activation, this is an acceptable corner case (I hope), at least for user data. In the end this is damage control; we need to live with this anyway, and a complete rollback on the affected systems here is not acceptable either.
Oh, no, everything is awful forever, it reproduces on 2.1.x too, which means it's probably not a block cloning bug, and is just another case of the same kind of issue as #11900 elsewhere. Great.
You can use the script posted here to help determine whether you have files that are corrupted as a result of this bug.
That script reports any files with regions of \0s that are multiples of 4k. Not all files with strings of \0s are mangled.
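For readers who want to see what that kind of check amounts to, here is a rough Python sketch. It is not the script linked above. The function name `zero_block_runs` and the 4096-byte default are illustrative assumptions. It walks a file in block-sized steps and reports each run of consecutive all-zero blocks as an (offset, length) pair.

```python
def zero_block_runs(path, block_size=4096):
    """Return [(offset, length), ...] for runs of all-zero, block-aligned blocks.

    Hypothetical helper for illustration only. Only full blocks count, so
    every reported length is a multiple of block_size.
    """
    runs = []
    run_start = None
    offset = 0
    zero_block = bytes(block_size)
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            if chunk == zero_block:
                # Full-sized all-zero block: start or extend the current run.
                if run_start is None:
                    run_start = offset
            elif run_start is not None:
                # Non-zero (or short final) block ends the current run.
                runs.append((run_start, offset - run_start))
                run_start = None
            offset += len(chunk)
    if run_start is not None:
        runs.append((run_start, offset - run_start))
    return runs
```

A non-empty result is only a hint to compare the file against a known-good copy or checksum. Sparse files, disk images, databases and preallocated files legitimately contain long zero runs, so a check like this will produce false positives.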
As much as I wanted that script to be comprehensive, it was written before a couple things were clarified or discovered
It's also worth noting that bcloned files became corrupted only in some specific circumstances, so even on a system that has hit the bug from block cloning, not all bcloned files are necessarily corrupted.
It is all said in this thread.
zfs 2.2.0 has a nasty block cloning bug which corrupts data: #15526
This has been fixed with zfs 2.2.1.
How can I detect if I have been hit by the block cloning bug? Can I somehow check the filesystem to find corrupted data that needs to be restored?