Just wanted to capture something discussed in the SCR meeting today:
@adammoody mentioned that it would be nice if we could begin copying a checkpoint at the end of an allocation, and take advantage of the fact that the BB API can continue the transfer after the allocation is done. He mentioned the case where you might have a 12 hour allocation, but since writing the final checkpoint takes 1 hour, you really only have 11 hours of computation time. With BB API, you could kick off the checkpoint copy right at the end of your allocation, and get the full 12 hours of computation.
The question then becomes, how do we know if a checkpoint actually transferred successfully after the job is done? One way to do it is to save the transfer handle in the temporary copy file name, as discussed here: ECP-VeloC/AXL#57 (comment)
The steps might look like this:
The job runs for 11 hours and 59 minutes of its 12 hour allocation. At the end, it requests a BB API transfer of checkpoint1 to GPFS.
AXL starts the transfer, but actually transfers it to a temporary file name of checkpoint1._AXL-123456, where 123456 is the BB API transfer handle.
The job stops, and the transfer continues "in the background" until it finishes.
The user starts up SCR again and requests it to load the last checkpoint.
SCR looks for the list of checkpoints. As part of that, it calls a new function, AXL_Cleanup(char *path) where path is the path to a directory or file. AXL_Cleanup() does the following:
a) If 'path' is a directory, look inside it for any files with an ._AXL-* extension.
b) If there's no transfer handle number in the extension, delete the file (it's probably an aborted AXL pthreads or sync transfer).
c) If there's a transfer handle number, do a BB_GetTransferInfo() and check the status code. If the status code is BBFULLSUCCESS, then it's a completed transfer, so rename the file to the final filename (checkpoint1). If the status shows an error in the transfer, delete the file. If the status is BBINPROGRESS, cancel the transfer and delete the file. NOTE: I don't know how long the BB API keeps the transfer history. It definitely doesn't delete it immediately, as I've been able to do BB_GetTransferInfo() calls at least minutes after a transfer completed. We'd need to test this to see how long the transfer info is query-able.
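To make the naming scheme and the decision in steps (b) and (c) concrete, here is a minimal sketch in plain C. It is not AXL code and does not call the BB API: the function names, the `xfer_status` enum, and the `cleanup_action` enum are all hypothetical stand-ins, and the handle is assumed to fit in a `uint64_t`. A real implementation would get the status from BB_GetTransferInfo() and map it onto something like `xfer_status`.

```c
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define AXL_TMP_MARKER "._AXL" /* marker from the naming scheme in this issue */

/* Hypothetical stand-ins for the transfer states discussed above. */
typedef enum { XFER_SUCCESS, XFER_IN_PROGRESS, XFER_FAILED } xfer_status;

/* What cleanup should do with one temporary file. */
typedef enum { ACT_RENAME, ACT_DELETE, ACT_CANCEL_AND_DELETE } cleanup_action;

/* Build "<final_name>._AXL-<handle>". Returns 0 on success, -1 if buf is too small. */
int make_tmp_name(char *buf, size_t len, const char *final_name, uint64_t handle)
{
    int n = snprintf(buf, len, "%s%s-%" PRIu64, final_name, AXL_TMP_MARKER, handle);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Split a temporary name into final name and handle.
 * Returns  1 if a handle was found (stored in *handle),
 *          0 if the marker is present but carries no handle,
 *         -1 if this is not a well-formed AXL temporary name. */
int parse_tmp_name(const char *tmp, char *final_name, size_t len, uint64_t *handle)
{
    const char *m = NULL;
    /* Use the last occurrence of the marker in the name. */
    for (const char *p = strstr(tmp, AXL_TMP_MARKER); p != NULL;
         p = strstr(p + 1, AXL_TMP_MARKER)) {
        m = p;
    }
    if (m == NULL) return -1;

    size_t base = (size_t)(m - tmp);
    if (base == 0 || base >= len) return -1;
    memcpy(final_name, tmp, base);
    final_name[base] = '\0';

    const char *suffix = m + strlen(AXL_TMP_MARKER);
    if (*suffix == '\0') return 0;            /* "name._AXL": no handle */
    if (*suffix != '-') return -1;
    suffix++;
    if (*suffix < '0' || *suffix > '9') return -1;

    errno = 0;
    char *end = NULL;
    unsigned long long v = strtoull(suffix, &end, 10);
    if (errno != 0 || *end != '\0') return -1;
    *handle = (uint64_t)v;
    return 1;
}

/* Steps (b) and (c): choose an action from the handle presence and status. */
cleanup_action decide_action(int has_handle, xfer_status st)
{
    if (!has_handle) return ACT_DELETE;       /* aborted pthreads/sync transfer */
    switch (st) {
    case XFER_SUCCESS:     return ACT_RENAME;
    case XFER_IN_PROGRESS: return ACT_CANCEL_AND_DELETE;
    default:               return ACT_DELETE; /* transfer reported an error */
    }
}
```

With these helpers, `make_tmp_name(buf, sizeof buf, "checkpoint1", 123456)` produces `checkpoint1._AXL-123456`, and `parse_tmp_name()` recovers both `checkpoint1` and the handle from that string on restart.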
We're making the assumption that if a job wants to do the checkpoint at the end of its allocation, it will not relaunch until sometime after the transfer completes. If the job dies, we expect it to relaunch immediately, in which case the transfer will be cancelled (since the transfer will most likely still be in progress).
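Step (a) of the cleanup, finding leftover temporary files in a directory, could be sketched with the POSIX `opendir()`/`readdir()` calls as below. The function name `find_axl_tmp_files` and the fixed 256-byte name buffers are assumptions made for illustration only; the sketch matches any entry whose name contains `._AXL` and leaves the handle parsing and status check to the caller.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Collect directory entries whose names contain the "._AXL" marker.
 * Stores up to max names in names[] and returns the total number of
 * matches found, or -1 if the directory cannot be opened. */
int find_axl_tmp_files(const char *dir, char names[][256], int max)
{
    DIR *d = opendir(dir);
    if (d == NULL) return -1;

    int count = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strstr(e->d_name, "._AXL") == NULL) continue;
        if (count < max) {
            /* Copy the entry name; snprintf always NUL-terminates. */
            snprintf(names[count], 256, "%s", e->d_name);
        }
        count++;
    }
    closedir(d);
    return count;
}
```

Note that `readdir()` returns entries in no particular order, so a caller should not rely on the order of `names[]`.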
I'm starting to implement this along with ECP-VeloC/AXL#66. One observation:
The new AXL_Cleanup(char *path) function I proposed would do a couple of things:
Remove old, unsuccessfully transferred files.
Wait on previous BB API transfers that were still transferring.
As such, I don't think AXL_Cleanup() is really the right name for it. AXL_Finalize(char *path) would be the best name, but it's already being used as AXL_Finalize(void). However, we can do some crazy stuff to get around that inconvenient fact 🤪 :
It works for both GCC and clang. In fact, we really should make it AXL_Finalize(void *data) to allow passing any vendor-specific information into it (which in this case would be a char *path).