Optimization: starting a BB API checkpoint copy at the end of an allocation #209

Open
tonyhutter opened this issue Jun 18, 2020 · 1 comment

Comments

@tonyhutter
Contributor

Just wanted to capture something discussed in the SCR meeting today:

@adammoody mentioned that it would be nice if we could begin copying a checkpoint at the end of an allocation, taking advantage of the fact that the BB API can continue the transfer after the allocation is done. He mentioned the case where you might have a 12 hour allocation, but since the checkpoint takes 1 hour, you really only have 11 hours of computation time. With the BB API, you could kick off the checkpoint copy right at the end of your allocation and get the full 12 hours of computation.

The question then becomes, how do we know if a checkpoint actually transferred successfully after the job is done? One way to do it is to save the transfer handle in the temporary copy file name, as discussed here: ECP-VeloC/AXL#57 (comment)
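
A minimal sketch of the naming idea, assuming the temporary name is the final name plus "._AXL-" and the decimal transfer handle, as in the checkpoint1._AXL-123456 example in this issue. The helper names axl_tmp_name() and axl_tmp_parse() are hypothetical and not part of AXL, and the handle is treated as an unsigned 64-bit value, which is also an assumption:

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed marker, taken from the "checkpoint1._AXL-123456" example. */
#define AXL_TMP_TAG "._AXL"

/* Build "<final_name>._AXL-<handle>" in buf.
 * Returns 0 on success, -1 if buf is too small. */
int axl_tmp_name(const char *final_name, unsigned long long handle,
                 char *buf, size_t len)
{
    assert(final_name != NULL && buf != NULL);
    int n = snprintf(buf, len, "%s%s-%llu", final_name, AXL_TMP_TAG, handle);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Classify a file name.
 * Returns  1: temp file with a handle (*handle and *final_len are set)
 *          0: temp file without a handle (*final_len is set)
 *         -1: not recognized as an AXL temp file */
int axl_tmp_parse(const char *name, unsigned long long *handle,
                  size_t *final_len)
{
    assert(name != NULL && handle != NULL && final_len != NULL);

    /* Find the last occurrence of the marker. */
    const char *tag = NULL;
    for (const char *p = name; (p = strstr(p, AXL_TMP_TAG)) != NULL; p++)
        tag = p;
    if (tag == NULL)
        return -1;

    const char *rest = tag + strlen(AXL_TMP_TAG);
    if (rest[0] == '\0' || strcmp(rest, "-") == 0) {
        *final_len = (size_t)(tag - name);
        return 0;
    }
    if (rest[0] != '-')
        return -1;

    /* Everything after "-" must be decimal digits. */
    const char *digits = rest + 1;
    for (const char *d = digits; *d != '\0'; d++)
        if (!isdigit((unsigned char)*d))
            return -1;

    errno = 0;
    unsigned long long h = strtoull(digits, NULL, 10);
    if (errno == ERANGE)
        return -1;

    *handle = h;
    *final_len = (size_t)(tag - name);
    return 1;
}
```

The final file name is the first final_len bytes of the temporary name, so a later rename needs no extra bookkeeping. Names with an unrecognized suffix are reported as "not a temp file" rather than as deletable, which is a conservative choice made for this sketch.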

The steps might look like this:

  1. The job runs for 11 hours and 59 minutes. At the end, it requests a BB API transfer of checkpoint1 to GPFS.
  2. AXL starts the transfer, but actually transfers it to a temporary file name of checkpoint1._AXL-123456, where 123456 is the BB API transfer handle.
  3. The job stops, and the transfer continues "in the background" until it finishes.
  4. The user starts up SCR again and requests it to load the last checkpoint.
  5. SCR looks for the list of checkpoints. As part of that, it calls a new function, AXL_Cleanup(char *path) where path is the path to a directory or file. AXL_Cleanup() does the following:

a) If 'path' is a directory, look inside it for any files with an ._AXL-* extension.
b) If there's no transfer handle number in the extension, delete the file (it's probably from an aborted AXL pthreads or sync transfer).
c) If there's a transfer handle number, call BB_GetTransferInfo() and check the status code. If the status is BBFULLSUCCESS, the transfer completed, so rename the file to the final filename (checkpoint1). If the status shows an error in the transfer, delete the file. If the status is BBINPROGRESS, cancel the transfer and delete the file. NOTE: I don't know how long the BB API keeps the transfer history. It definitely doesn't delete it immediately, as I've been able to make BB_GetTransferInfo() calls at least minutes after a transfer completed. We'd need to test this to see how long the transfer info remains queryable.
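
The per-file decision in steps b) and c) can be sketched as a small pure function. Everything named here is hypothetical: the xfer_status values are stand-ins for the BB API status codes mentioned above, and the status lookup is passed in as a callback so the sketch does not depend on the real BB_GetTransferInfo() signature. It only decides what to do; the actual rename, delete, and cancel calls are left out:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the BB API status codes named in this issue. */
enum xfer_status { XFER_FULLSUCCESS, XFER_INPROGRESS, XFER_FAILED };

/* What the cleanup pass should do with one temporary file. */
enum cleanup_action { ACT_RENAME_TO_FINAL, ACT_DELETE, ACT_CANCEL_THEN_DELETE };

/* Hypothetical callback that reports the status of a transfer handle. */
typedef enum xfer_status (*status_lookup_fn)(unsigned long long handle,
                                             void *ctx);

enum cleanup_action axl_cleanup_decide(int has_handle,
                                       unsigned long long handle,
                                       status_lookup_fn lookup, void *ctx)
{
    /* Step b): no handle in the name, so treat it as an aborted transfer. */
    if (!has_handle)
        return ACT_DELETE;

    /* Step c): ask for the transfer status and act on it. */
    assert(lookup != NULL);
    switch (lookup(handle, ctx)) {
    case XFER_FULLSUCCESS:
        return ACT_RENAME_TO_FINAL;
    case XFER_INPROGRESS:
        return ACT_CANCEL_THEN_DELETE;
    case XFER_FAILED:
    default:
        return ACT_DELETE;
    }
}

/* Demonstration stub: reports whatever status ctx points at. */
enum xfer_status fixed_status(unsigned long long handle, void *ctx)
{
    (void)handle;
    return *(enum xfer_status *)ctx;
}
```

Keeping the decision separate from the filesystem and BB API calls makes it easy to test each case. The sketch does not cover the open question in the NOTE, namely what to do when the transfer history is no longer available.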

We're making the assumption that if a job wants to do the checkpoint at the end of its allocation, it will not relaunch until sometime after the transfer completes. If the job dies, we expect it to relaunch immediately, in which case the transfer will be cancelled (since the transfer will most likely still be in progress).

@tonyhutter
Contributor Author

I'm starting to implement this along with ECP-VeloC/AXL#66. One observation:

The new AXL_Cleanup(char *path) function I proposed would do a couple of things:

  1. Remove old, unsuccessfully transferred files.
  2. Wait on previous BB API transfers that were still transferring.

As such, I don't think AXL_Cleanup() is really the right name for it. AXL_Finalize(char *path) would be the best name, but it's already being used as AXL_Finalize(void). However, we can do some crazy stuff to get around that inconvenient fact 🤪 :

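#include <stdio.h>

/* Pick the first real argument, or 0 if none was passed.
 * Note: ", ## __VA_ARGS__" is a GNU extension. */
#define ARG0(dummy, a0, ...) a0
#define GET_ARG0(...) ARG0(dummy, ## __VA_ARGS__, 0)

void __AXL_Finalize(char *path)
{
        /* Passing NULL to %s is undefined behavior, so handle it explicitly. */
        printf("Path is %s\n", path ? path : "(null)");
}

/* Forward zero or one argument to the real function. */
#define AXL_Finalize(...) __AXL_Finalize(GET_ARG0(__VA_ARGS__))

int main(int argc, char **argv)
{
        char *ptr = "hello world";
        AXL_Finalize();     /* expands to __AXL_Finalize(0) */
        AXL_Finalize(ptr);  /* expands to __AXL_Finalize(ptr) */
        return 0;
}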

This prints:

Path is (null)
Path is hello world

It works for both GCC and clang. In fact, we really should make it AXL_Finalize(void *data) to allow passing in any vendor-specific information into it (which in this case would be a char *path).
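
For comparison, here is a sketch of one way to get the same optional-argument behavior with a void *data parameter and without the ## extension, assuming C99 compound literals are acceptable. The names demo_finalize_impl() and DEMO_FINALIZE() are hypothetical, chosen so they don't clash with the real AXL_Finalize(void). This is not the macro proposed in this comment and has not been tried against AXL:

```c
#include <stddef.h>

/* Hypothetical implementation: returns its argument so the
 * default-to-NULL behavior is easy to observe. */
void *demo_finalize_impl(void *data)
{
    return data;
}

/* Build a two-element array { 0, <optional argument> } and read slot 1.
 * With no argument the initializer is "{ 0, }", so slot 1 is
 * zero-initialized and the call receives a null pointer. */
#define DEMO_FINALIZE(...) \
    demo_finalize_impl(((void *[2]){ 0, __VA_ARGS__ })[1])
```

Calling DEMO_FINALIZE() passes a null pointer, and DEMO_FINALIZE(ptr) passes ptr. The trade-off is that this form only accepts pointer arguments and compiles as C, not C++.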
