Hi folks,
I am using llm-foundry to train some LLMs and am trying to save checkpoints directly to a network drive (AWS on-prem storage). The issue I am hitting looks like this:
File "/usr/local/lib/python3.10/dist-packages/composer/callbacks/checkpoint_saver.py", line 352, in _save_checkpoint
os.symlink(os.path.relpath(src_path, os.path.dirname(symlink)), symlink)
FYI: saving to a local disk works just fine. I think the problem is that symlinks cannot be created on the network drive. For example, running touch test1.txt && ln -s test1.txt test2.txt on that drive results in the same Unknown error 524.
I was wondering whether you have any suggestions on how to bypass this restriction (?) of not being able to create symlinks on network drives. If not, is there a straightforward way to save checkpoints on the network drive but keep the symlinks on a local disk?
After digging a bit through the Composer lib, I feel that this could be hacked relatively easily, but I'm wondering whether you think that might break some other parts of either Composer or llm-foundry.
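For anyone who wants to run the same check from Python rather than the shell, here is a minimal sketch of a probe that tries to create a symlink inside a given directory and reports whether it succeeded. The function name supports_symlinks is hypothetical and not part of Composer or llm-foundry; it only mirrors the touch / ln -s test above.

```python
import os
import tempfile


def supports_symlinks(directory: str) -> bool:
    """Return True if a symlink can be created inside `directory`.

    Hypothetical helper for illustration; not a Composer API.
    """
    # Work in a scratch subdirectory so nothing is left behind.
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        target = os.path.join(scratch, "test1.txt")
        link = os.path.join(scratch, "test2.txt")
        with open(target, "w") as f:
            f.write("probe")
        try:
            # Same shape as `ln -s test1.txt test2.txt`: a relative link.
            os.symlink(os.path.basename(target), link)
        except (OSError, NotImplementedError):
            # Filesystems that reject symlinks raise OSError here.
            return False
        return os.path.islink(link)
```

Calling supports_symlinks on the checkpoint folder before training starts would show up front whether the symlink step is going to fail on that mount.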
You can specify save_latest_filename to keep the symlink on your local disk if that works for you. That seems like the easiest solution.
For object stores, we emulate a symlink by creating a file that has the path to the checkpoint in its contents. We could try building a similar solution for a network drive; this seems like the "right" solution. Unfortunately, it's not something we will be able to build, since we don't have access to network drives to test it, but I'm happy to work with you and give some guidance if you're interested.
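To make the "file that contains the path" idea concrete, here is a rough sketch of how such an emulated symlink could look on a filesystem that rejects real symlinks. This is not Composer's actual object-store implementation; the function names write_latest_pointer and resolve_latest_pointer are made up for illustration, and whether os.replace behaves atomically on a particular network drive is an assumption that would need checking.

```python
import os
import tempfile


def write_latest_pointer(checkpoint_path: str, pointer_path: str) -> None:
    """Write a small text file at `pointer_path` that names the checkpoint.

    Hypothetical helper; stands in for os.symlink where symlinks fail.
    """
    folder = os.path.dirname(os.path.abspath(pointer_path))
    os.makedirs(folder, exist_ok=True)
    # Store the path relative to the pointer's folder, like a relative symlink.
    relative_target = os.path.relpath(checkpoint_path, folder)
    # Write to a temporary file first, then swap it into place.
    fd, tmp_path = tempfile.mkstemp(dir=folder, prefix=".latest-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(relative_target)
        os.replace(tmp_path, pointer_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def resolve_latest_pointer(pointer_path: str) -> str:
    """Read the pointer file and return the absolute checkpoint path."""
    folder = os.path.dirname(os.path.abspath(pointer_path))
    with open(pointer_path) as f:
        relative_target = f.read().strip()
    return os.path.normpath(os.path.join(folder, relative_target))
```

The catch is that anything that expects a real symlink (for example, code that passes the "latest" path straight to a loader) would need to call something like resolve_latest_pointer first, so this would have to be wired into both the saving and the loading side.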