What happens in consistency checking when the files are the only examples? #557
Replies: 10 comments
-
If consistency checking finds missing files, it will try to replicate them from another source. But what if no other source exists? Or what if a file is missing from a site and consistency checking recognises this, but Rucio tries to use that site as the source for an additional replica? It seems that in the first case consistency checking can do nothing, and in the second case Rucio will retry the transfer indefinitely. A good enhancement here could be for consistency checking to produce a list of files for each of these scenarios, which can then be sent to the sites for scrutiny, and then to DM for invalidation if necessary.
-
Consistency Enforcement (CE) simply reconciles the information stored in the Rucio replicas table with the actual contents of the RSE. The scanner/comparison part detects the discrepancies, classifying them into "dark" and "missing" files. The action part filters the results, trying to be a bit more conservative and careful on the dark side, and reports the verified discrepancies to Rucio by
CE does not try to replicate anything. That is supposedly done by Rucio as it follows the applicable replication rules. So if the missing replica was the last replica, then I am afraid the file is lost for good. I am not sure how making 2 separate lists of missing files would change anything.
-
Just to close the loop from the private conversation I had with Igor, I think what Katy is saying is that it would be good to get a list of these if possible so that Ops can investigate further and possibly invalidate the files/DIDs if that's what's needed.
-
Currently, CE detects the fact that a replica of a DID, expected to be present in the RSE, is missing. I think it would be more useful if someone or something generated a list of all file DIDs with no good replicas, without making that list contingent on detecting inconsistencies between the RSE and the Rucio DB, because a file can be replica-less without any inconsistencies. Another point: just because a DID does not have a replica in Rucio does not actually mean the file is dead, because a valid mode of operation is to declare a DID without any replicas first and then later add a replica to the existing DID. So the file will be replica-less between these 2 events.
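To illustrate the kind of report I mean, here is a rough sketch. It assumes a hypothetical in-memory view of file DIDs and their replica states; the `FileDID` class, the `"available"` label and the RSE names are made up for illustration and are not the Rucio schema or API.

```python
from dataclasses import dataclass, field

# Hypothetical label for a usable replica; not Rucio's actual state encoding.
GOOD = "available"


@dataclass
class FileDID:
    scope: str
    name: str
    # Maps RSE name -> replica state; empty if no replica was ever registered.
    replicas: dict = field(default_factory=dict)


def dids_without_good_replica(dids):
    """Return the DIDs that have no replica in the GOOD state on any RSE."""
    return [d for d in dids if GOOD not in d.replicas.values()]


def never_replicated(dids):
    """Return the DIDs with no replica entries at all.

    These may simply be declared DIDs still waiting for their first replica,
    so they should not be treated as lost without further checks.
    """
    return [d for d in dids if not d.replicas]
```

Keeping the "never had a replica" case separate avoids flagging a DID as dead when it is only in the window between declaration and the first replica being added.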
-
Yes, I see that this may be an extension of the functionality. But perhaps it need not be so difficult. The consistency checking already monitors how many weeks a file has been listed as Dark. So could it also count the number of weeks a file is listed as Missing? Proposal workflow: Does this sound simple enough? As a further extension, a similar thing could be done for Dark data, in order to spot repeatedly failing deletions. My main aims here are to make it simpler for sites to clean up their transfers and to stop Rucio from repeatedly attempting transfers, which can never succeed, of files that do not exist. The fewer problematic files we have showing as available, the fewer Users will complain about their jobs failing due to file access, which seems to be one of the most common issues for CRAB Users. This would also help Production jobs in the same way.
-
I am a bit confused. For dark replicas, CE waits for several weeks for the replica to consistently appear dark before it takes action on the file. The dark action is to report the replica to the dark reaper so the replica is removed. What exactly are you proposing for missing replicas? During these ~4 weeks, do you want the CE to take the action (mark the replica as bad in the database, which is pretty much equivalent to removing the replica from the DB), or to wait for 4 weeks and not take the action? If you are proposing to take the action right away and keep watching, then the replica will not be detected any more after the action is taken, because the replica will not be in the DB any more. If you are proposing not to take the action and keep watching, then I am sure the replica will keep being detected as missing, because Rucio does not know that the replica is missing and will not try to re-create it. As for the transfers on problematic files, I think the first thing to do to prevent them from being initiated is to mark missing replicas as bad so Rucio knows not to even start these requests.
-
I am proposing (at least, at first) not to further automate any actions by the CE, but simply to make lists of files that are persistently showing up in the Missing and Dark lists week after week. These can then be manually attended to. There is no simple way for sites to identify these problem files, so far as I know. In the future, if we feel confident to do so, further automation could be possible.
-
My point is that if we do not take the action on a missing replica, it is pretty much guaranteed to persistently show up as missing week after week. And if we do take the action, it is guaranteed to disappear from the missing list immediately, not because the replica is re-created, but because we say "it's not even supposed to be there". So your proposal for missing files essentially means: let's not automate actions on missing files at all and leave this up to the site admin. As for dark files, it is already done pretty much the way you propose. A dark file is acted on only after it has been detected persistently, week after week, for N consecutive weeks, where N is configurable.
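To make the dark rule concrete, here is a simplified sketch of the "N consecutive weeks" check. It is an illustration only, not the actual CE code; the function names and the representation of each weekly scan as a set of paths are assumptions.

```python
def consecutive_weeks(weekly_scans, path):
    """Count how many of the most recent scans list `path` without a gap.

    `weekly_scans` is ordered oldest to newest; each item is a set of paths.
    """
    count = 0
    for scan in reversed(weekly_scans):
        if path not in scan:
            break
        count += 1
    return count


def confirmed_dark(weekly_scans, n_weeks):
    """Return the paths that appear in every one of the last `n_weeks` scans."""
    if n_weeks <= 0 or len(weekly_scans) < n_weeks:
        # Not enough history yet to confirm anything.
        return set()
    recent = [set(scan) for scan in weekly_scans[-n_weeks:]]
    return set.intersection(*recent)
```

A path that drops out of a single scan starts again from zero, which is what keeps the dark action conservative.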
-
Sorry Igor, I have not made myself understood. Let me try one more time. I don't want the CE to stop doing anything it is currently doing. But with the current setup I am concerned about files that are Missing and cannot be re-transferred, and Dark files that are failing to delete. If you don't want to do anything, then I can do a manual comparison between today's Missing/Dark lists and ones from a few weeks ago to see which files are persistently appearing. But if every site has to do this then it won't happen, or at least not regularly. DM certainly don't have time to do this. My proposal is simply to add this information to CE. Perhaps a link to a text file on the web UI labelled 'Problem files, please check'?
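For reference, the manual comparison I have in mind is roughly the following. This is only a sketch: it assumes each Missing or Dark list is saved as a plain text file with one path per line, and the function and file names are made up.

```python
from pathlib import Path


def load_list(path):
    """Read one file path per line, ignoring blank lines."""
    lines = Path(path).read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def write_problem_report(old_list, new_list, report_path):
    """Write the entries present in both lists, sorted, one per line.

    Returns the same entries as a list so the caller can inspect them.
    """
    persistent = sorted(load_list(old_list) & load_list(new_list))
    Path(report_path).write_text("".join(p + "\n" for p in persistent))
    return persistent
```

Run against a list from a few weeks ago and today's list, the output file would be the 'Problem files, please check' text file.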
-
Hello @ivmfnal, @KatyEllis, has this discussion concluded?