Account for BBAPI post-stage in SCR #243
Copying my comment from ECP-VeloC/AXL#75: One idea is to have the poststage script call ….

Copying my comment from ECP-VeloC/AXL#75: The more I think about this, the more I like this setup: ….

We could also call it ….
Yes, I like the idea of an SCR variant like an ….

To start, let's focus on getting things working for a stand-alone AXL user. To emulate what a stand-alone user will be doing, we can use a test that runs a multi-process MPI job, where each process transfers a file (e.g., part of a checkpoint). We should extend that test to also handle a case where each process has started multiple outstanding transfers (e.g., each process wrote a checkpoint file and an output file, but those came at two different times in the program, so they were started as separate transfers). A rough sketch of such a test is below.
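For concreteness, here's a hedged sketch of such a test, assuming AXL's axl_cp test utility, its -X flag for selecting the transfer backend, and an LSF/jsrun launcher; those details are assumptions and may differ by AXL version and system.

# Four processes; each starts two transfers at different times (a "checkpoint"
# and an "output" file), leaving multiple transfer handles per process for a
# poststage script to finalize.
$ jsrun -n 4 sh -c '
    rank=${PMIX_RANK:-0}
    # first transfer: the checkpoint file
    ./axl_cp -X bbapi /tmp/ckpt.$rank $GPFS/ckpt.1/rank_$rank.ckpt
    # some time later, a second transfer: the output file
    ./axl_cp -X bbapi /tmp/output.$rank $GPFS/output.1/rank_$rank.out
'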
A large job will end up with lots of transfer handles, and we should test this at scale. For example, a single transfer at Sierra scale, running an MPI process per core, would lead to … transfer handles. We haven't stress-tested the IBM software for post-stage operations, so there may be scalability bugs or performance issues hiding for us to uncover.
Just speaking to …: the best-case scenario is for the user to call …. In order to make that happen, we'd need a way for ….

Also, I dunno if this is relevant, but I noticed that the post-stage environment has ….
Some things that will come into play here. SCR maintains a "flush" file that tracks the set of dataset ids in cache that could be flushed, along with the status of each of those datasets, e.g., currently flushing via async or already flushed. This file is stored at …. Then for each dataset, there is a directory where SCR stores any metadata specific to that dataset. For dataset …, ….

So one can look at the flush file to figure out whether any datasets are being transferred, and then, for any dataset, look in its dataset directory for the state files. A minimal sketch of that inspection is below.
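For example, one can poke at those files by hand, something like this (a minimal sketch; the flush.scr file name and the .scr subdirectory layout are assumptions based on paths that show up later in this thread):

$ PREFIX=/p/gpfs1/$USER/prefix

# The flush file tracks which dataset ids are in cache and their flush status.
$ ~/kvtree_print_file $PREFIX/.scr/flush.scr

# Per-dataset metadata, including any per-rank transfer state files.
$ ls $PREFIX/.scr/scr.dataset.1/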
@adammoody thanks, having a ….
Quick update - I'm currently able to do a checkpoint, cancel the transfer midway through, finish the transfer, and then manually create the summary file and add it to the index:

# Use test_api to create a checkpoint, but AXL_DEBUG_PAUSE_AFTER will halt the AXL
# transfer partway through to simulate the job ending.
$ AXL_DEBUG_PAUSE_AFTER=1 SCR_CONF_FILE=~/myscr.conf ./test_api

# Verify that we can't load from the partially transmitted checkpoint
$ ./scr_index -l -p $BBPATH
DSET VALID FLUSHED             CUR NAME
   1 NO                            ckpt.1

# "Finish" the transfer by renaming the file to its final name. This simulates
# "BBAPI transferred the file in the background between job runs".
$ mv $BBPATH/ckpt.1/rank_0.ckpt._AXL $BBPATH/ckpt.1/rank_0.ckpt

# Create a summary.scr file for our "finished" transfer, and update the flush file's
# "LOCATION" entry to say the file is flushed.
$ ./scr_flush_file --d $BBPATH -s 1 -S
ckpt.1

# Add our checkpoint into the index
$ ./scr_index -p $BBPATH --add=ckpt.1
Found `ckpt.1' as dataset 1 at $BBPATH/.scr/scr.dataset.1
Adding `ckpt.1' to index

# Verify our index now sees the finished checkpoint
$ ./scr_index -l -p $BBPATH
DSET VALID FLUSHED             CUR NAME
   1 YES   2020-11-09T17:05:01     ckpt.1

The next step would be to actually do the AXL resume in ….
Update: I've now added an ….

I think now is the time to decide on what the behaviour should be. The simplest would be to just cancel any ongoing transfers in ….

Side note - since users may want to call ….
Based on my understanding of the IBM BB software, I don't think that a second job allocation can either cancel or wait on a transfer that was started in an earlier job allocation. It could probably sync by waiting for a state change in some SCR file, though.
In addition to the race condition, we have a couple more details to figure out, even for a single job.

For a multi-rank job, we need to come up with a scheme to know whether the transfers succeeded for all ranks. Consider a two-process job. If our scr_poststage script only sees one state file, and the transfer associated with that file succeeds, is the transfer as a whole good or not? It could be that the second process had no files to transfer, so it never started a transfer and never wrote a state file. On the other hand, maybe it did have files to transfer, but it failed before it wrote out its state file. In the first case the transfer is complete, and in the second case it is not. We'll need to be able to distinguish between those two; a sketch of one way to do that is below.

I think we'll also want to be able to support multiple transfers. For example, the job might have transferred two datasets, say a checkpoint and an output dataset, just before it shuts down. In this case, each process will have multiple state files, one for each of its transfers.
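To make that concrete, here's a hedged sketch of the check a poststage script could run, assuming we record at dispatch time how many ranks actually started a transfer (the ranks file below is hypothetical; without some record like it, a missing state file is indistinguishable from a rank that had nothing to transfer):

DSET_DIR=$PREFIX/.scr/scr.dataset.1

# hypothetical record written at dispatch time: ranks that started a transfer
expected=$(cat $DSET_DIR/ranks 2>/dev/null)

# per-rank state files actually present after post-stage
found=$(ls $DSET_DIR/rank_*.state_file 2>/dev/null | wc -l)

if [ -z "$expected" ] || [ "$found" -ne "$expected" ]; then
    echo "dataset 1: incomplete ($found of ${expected:-?} state files)"
    exit 1
fi
echo "dataset 1: all $found expected state files present"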
In general this is true. There is a hack, though: if you export ….
I was envisioning that ….

After the files transfer, ….

So far so good - you have a valid dataset. Now imagine the same thing happening, but rank 1 doesn't write the state file due to a failure. In that case, ….
Oh, I see. That might work, though there could be some rough edges that we'll want to clean up.

I had a different design in mind. My first thought was that our poststage script would follow the model of the scr_postrun/scavenge scripts and mark the state of the checkpoint in the index file based on whether it succeeded or failed to finalize everything. Really, we shouldn't be updating the index file to mark the checkpoint as valid unless we know that we successfully got all of the files. If files are missing for some rank, then when the user runs ….
Having said that, if we do optimistically mark the dataset as ….

We should review the code in SCR where we delete some files. It could use some double-checking and fresh testing since SCR was ported to the components. If we go this route, to be clean about it, we should also consider defining a third state, like ….

Anyway, let's keep going with what you have in mind. That will give us a good starting point, even if we find we need to change it. A rough sketch of the index-marking flow is below.
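As a sketch of that postrun-style flow (the completeness check is the hypothetical one sketched earlier in this thread, and the scr_index flags follow the usage shown above):

# Only mark the dataset in the index once every rank's files check out.
if all_state_files_complete 1; then        # hypothetical helper from above
    ./scr_index -p $PREFIX --add=ckpt.1    # safe to restart from it now
else
    echo "ckpt.1 incomplete; leaving it out of the index"
fi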
@adammoody thanks, I see what you're saying now. Yeah, having ….
For scavenge, the relevant code is in scr/scripts/common/scr_postrun.in, line 144 (commit f877d26).
Perhaps we could define a second type of … scan, or hook into that ….
More updates/braindump:
$ ~/kvtree_print_file ./.scr/scr.dataset.1/rank_11.state_file
FILE
  /tmp/bblv_hutter2_132955/tmp/hutter2/scr.defjobid/scr.dataset.1/rank_11.ckpt
    STATUS
      1
    DEST
      /p/gpfs1/hutter2/prefix/ckpt.1/rank_11.ckpt._AXL
    STATE
      1
STATUS
  1
NAME
  ckpt.1
TYPE
  2
STATE_FILE
  /p/gpfs1/hutter2/prefix/.scr/scr.dataset.1/rank_11.state_file

It will not have access to the original source file in the burst buffer, so we need to encode the file size into the state file; a hypothetical example follows.
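For example, the per-file entry might grow a size field, something like this (the SIZE key and its placement are purely an assumption to illustrate the idea):

FILE
  /tmp/bblv_hutter2_132955/tmp/hutter2/scr.defjobid/scr.dataset.1/rank_11.ckpt
    SIZE
      1048576
    STATUS
      1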
Just recording this info for posterity...
Using its post-stage functionality, the BBAPI will continue to transfer files in the background even after an application ends. LSF allows the user to register poststage scripts in the bsub command to detect and act on the status of post-stage transfers. To allow for BBAPI transfers at the end of an SCR job, we need to write some logic to finalize those transfers. There are numerous items to address.
In AXL, we'd need to effectively wait on or resume each of those outstanding transfers in order to finish it, e.g., rename files from their temporary to final names and set metadata on the final files (ECP-VeloC/AXL#75).
In SCR, we'd need to update the SCR index file to mark the transfer as good or bad. Assuming that most transfers succeed, it would be nice to wait and mark the checkpoint as valid, so that any subsequent run could then restart from that checkpoint. However, that also introduces a race condition in which job 1 flushes a checkpoint in post-stage, but job 2 starts up before that transfer has completed. In that case, job 2 would restart from an older checkpoint, and then it would likely rewrite and reflush the same checkpoint as job 1, perhaps while the system is still busy flushing job 1's checkpoint. At that point, we'd have two different jobs trying to write the same checkpoint, and that's going to break things.
We could update the SCR scavenge logic to use AXL/BBAPI to start a transfer and exit the job allocation instead of synchronously copying files to the file system at the end of the allocation. We'd need a post-stage script that both waits on the transfer to finish and executes any rebuild logic, as our scavenge normally does.
We need to modify the existing scavenge logic to detect and deal with async transfers in case the dataset it's trying to scavenge is the one being transferred. We'd want to at least add the redundancy files. That could be done as a separate transfer, or we could cancel the first and restart it after adding the redundancy files.
As a short-term fix, we can modify SCR to avoid starting any async transfers when the job gets close to the end of its allocation time limit; all transfers would switch to synchronous mode (see the sketch below).
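A minimal sketch of that short-term fix, assuming SCR_FLUSH_ASYNC is the configuration knob that selects asynchronous flushes (check the SCR configuration docs for the authoritative name and semantics):

# Fall back to synchronous flushes so no transfer is left pending at job exit.
$ cat >> ~/myscr.conf << 'EOF'
SCR_FLUSH_ASYNC=0
EOF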