refactor crab resubmit #6270

Open
Tracked by #8337
belforte opened this issue Nov 25, 2020 · 8 comments

Comments

@belforte
Member

Avoid editing the DAGMan log and use the new DAGMan features developed by Mark C.
Once we have agreed on the semantics and the new DAGMan is available ...

Will be a cleaner and definitive solution to #5876

@belforte
Member Author

Raise priority, time to start looking at this seriously

@belforte
Member Author

For convenience, I paste here the conclusions from #5876.
After discussion with HTCondor developers we came to these conclusions:

  1. the problem was not there in 2014 when Brian initially coded this
  2. the problem came when condor got smarter about writing logs and stopped locking them by default, opening the way for our log editing procedure (which attempts to use condor file locking API) to overwrite a log file with an "old" version where some events are missing
  3. condor can be configured to revert to old behavior by setting ENABLE_USERLOG_LOCKING=True in its configuration so that we do not have the problem
  4. we made that change in all CRAB schedds and did not find any sign of increased load or slower operations, so we can run that way for a while. Ref. https://cms-logbook.cern.ch/elog/Analysis+Operations/3282
  5. editing logs is bad anyhow and it has been agreed to enhance DAGMAN functionality to allow CRAB to do resubmits w/o tampering with files it should not tamper with. Discussion on this has started with HTCondor DAGMAN expert Mark Coatsworth:
    https://docs.google.com/document/d/1vgJApmjkH9brYhQZbdRooGnj2mqSrme7BWVxzReZ0oM/edit
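
The race described in point 2 can be illustrated with a toy simulation. This is only a sketch: the in-memory list stands in for the DAGMan user log, the event strings are made up, and the function name `run` is hypothetical. It does not use any HTCondor API.

```python
from typing import List


def run(use_lock: bool) -> List[str]:
    """Simulate an editor rewriting a log while a writer appends to it."""
    # Stand-in for the log file (hypothetical event strings).
    log = ["submit node1", "node1 failed"]
    locked = False

    # The editor (our resubmit procedure) reads the log in order to rewrite it.
    if use_lock:
        locked = True
    snapshot = list(log)

    # The writer (condor) wants to append an event while the edit is in progress.
    pending = []
    event = "submit node2"
    if locked:
        pending.append(event)  # writer honours the lock and waits
    else:
        log.append(event)      # writer does not lock and appends right away

    # The editor writes back its edited snapshot, replacing the whole content.
    edited = [e for e in snapshot if e != "node1 failed"]
    log[:] = edited
    locked = False

    # A writer that waited for the lock appends now.
    log.extend(pending)
    return log


print(run(use_lock=False))  # the "submit node2" event is missing
print(run(use_lock=True))   # the "submit node2" event is preserved
```

Without locking on the writer side, the editor overwrites the file with an "old" version and the event appended in between is lost; with locking on both sides the append simply happens after the edit.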

@belforte
Member Author

will look at this after transition to py3

@belforte
Member Author

Update from Mark C.


On 22/02/2021 18:40, Mark Coatsworth wrote:
> Hi Stefano, long overdue update on this work (replacing the old CMS
> CRAB log editing mechanism).
> 
> I had a couple false starts but finally implemented the
> DAGMAN_PUT_FAILED_JOBS_ON_HOLD mechanic that we discussed. It's fairly
> simple: when this option is set to True, DAGMan will put failed jobs
> on hold instead of aborting the dag. This gives CRAB the opportunity
> to fix the problem and continue processing the dag (hence, no need to
> edit the log to re-run failed nodes).
> 
> The new feature will ship in the 9.1 release of Condor later this spring.
> 
> Please keep me posted where things are at on your end. I know this
> will involve some changes in the CRAB code, and likely some tweaks on
> the DAGMan side also. I'd be happy to help with this when the time
> comes,
> 
> Mark
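
As a rough sketch of what the CRAB side could look like under the mechanism Mark describes: once failed node jobs are held instead of aborting the DAG, a resubmit could release them after fixing the problem. This assumes the HTCondor Python bindings are installed and that held node jobs can be selected via their `DAGManJobId` and `JobStatus` attributes; the function names are hypothetical and this is not how CRAB currently does it.

```python
HELD = 5  # JobStatus value of a held job in HTCondor


def held_nodes_constraint(dag_cluster_id: int) -> str:
    """Build a constraint matching held node jobs of one DAG (assumed selection)."""
    return f"DAGManJobId == {dag_cluster_id} && JobStatus == {HELD}"


def release_held_nodes(dag_cluster_id: int):
    """Release all held node jobs of the given DAG on the local schedd."""
    import htcondor  # imported here so the helper above works without the bindings

    schedd = htcondor.Schedd()
    return schedd.act(htcondor.JobAction.Release,
                      held_nodes_constraint(dag_cluster_id))
```

For example, `held_nodes_constraint(1234)` gives `DAGManJobId == 1234 && JobStatus == 5`, which can also be passed to `condor_q -constraint` to inspect the held nodes before releasing them.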

@belforte
Member Author

@dciangot since you expressed interest in this, I am adding you to the assignees so you can keep track

@belforte
Member Author

belforte commented Feb 8, 2022

note also this recent thread in htcondor forum
https://lists.cs.wisc.edu/archive/htcondor-users/2022-February/msg00015.shtml

belforte assigned novicecpp and unassigned dciangot Apr 30, 2022
@belforte
Member Author

With reference to the last line in #6270 (comment):
2 years later, it is better to make a copy of the old Google Doc in condor space, just in case:
original: https://docs.google.com/document/d/1vgJApmjkH9brYhQZbdRooGnj2mqSrme7BWVxzReZ0oM/edit
copy in my drive: https://docs.google.com/document/d/1qsil0UGewazg96cA-KP1QVOwkIcjSI6-ceT-qMsOYa0/edit?usp=sharing

@belforte
Member Author

This should be more straightforward now that we have decided not to allow resubmission of successful jobs (dmwm/CRABClient#5285).

belforte mentioned this issue Jan 14, 2025
20 tasks