Reduce ray job cold time #825
base: main
Conversation
I am aware that a couple of tests are failing; I'm trying to figure them out, as I can't reproduce the error locally.
Thanks for this! Only some comments for amd64.
I tested it on both arm (mac) and amd (ubuntu) and it makes such a difference 🥳
Completely forgot the first time around: knowing that we are now pre-downloading models, it is the perfect time to add a warning telling users that they need to make sure they have enough disk and Docker space available for them. WDYT of adding something like this here in the README.md:
"## Get started The simplest way (...) You can run and develop Lumigator locally (...)"
Thanks for the changes! Tried it again on ubuntu + mac - LGTM (after rebasing+tests pass)!
Thanks Irina for all the comments! I am trying to figure out what's wrong with the tests, as they don't look related to my PR. I think it's an issue with the persistence we added recently. I've just triggered another run to triple-check, and if that doesn't pass I'll follow up tomorrow morning.
docker-compose.yaml
Outdated
mkdir -p /tmp/ray/session_latest/runtime_resources/pip
rmdir /tmp/ray/session_latest/runtime_resources/pip/ && ln -s /tmp/ray_pip_cache /tmp/ray/session_latest/runtime_resources/pip
sleep infinity
shm_size: 2g
volumes:
  - ${HOME}/.cache/huggingface:/home/ray/.cache/huggingface
  - huggingface_cache_vol:/home/ray/.cache/huggingface
Is there any strong reason for moving this to a volume? This makes lumigator's cache not interoperable with the HF cache (that might already reside on users' machines).
This is definitely one of the problems this PR may introduce (on top of slower CI times). I moved this to a volume because we create that volume beforehand, so we ensure the bart model is already there (reducing the time to a first experiment). Without it being a volume, I'm not sure how I could add this to Ray's cache.
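For reference, the two options being weighed here look roughly like this in compose terms. This is only a sketch: the service name is assumed, and the mount lines mirror the diff above.

```yaml
services:
  ray:
    volumes:
      # Option A: bind-mount the host HF cache, interoperable with any cache
      # the user already has under ${HOME}/.cache/huggingface.
      # - ${HOME}/.cache/huggingface:/home/ray/.cache/huggingface
      # Option B (this PR): a named volume that compose creates up front, so it
      # can be pre-populated before the first job runs.
      - huggingface_cache_vol:/home/ray/.cache/huggingface

volumes:
  huggingface_cache_vol:
```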
README.md
Outdated
@@ -45,6 +45,7 @@ need to have the following prerequisites installed on your machine:
- On Linux, you need to follow the
  [post-installation steps](https://docs.docker.com/engine/install/linux-postinstall/).
- The system Python (version managers such as uv should be deactivated)
- At least 10 GB available on disk and allocated for Docker, since some small language models will be pre-downloaded
I think it'd be great if we added in the docs (1) what models are downloaded (right now it's bart alone, right? I'd suggest roberta-large too for the bertscore metric), (2) their exact size (bart+roberta are less than 3GB), and (3) how this can be disabled if e.g. someone has no intention of ever running bart. WDYT?
Bart has to run right now to generate GT, that's why I added it as a kind of mandatory model. As it is, we can't disable it (apart from manually removing the service from docker-compose, which is not very user-friendly I'd say).
We could maybe add a variable holding the list of models you want to pre-download into Ray's cache. Would that work?
Yes, I think that'd be great! For instance, right now we are also using roberta-large for bertscore evaluations, so we already have two models, and having a list we can point users to makes it easy for them to customise it. Thank you!
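One possible shape for that idea, sketched below. The PREDOWNLOAD_MODELS variable name is made up, the default list simply mirrors the two models mentioned above, and the command reuses the snapshot_download call from the diff further down; this is not the PR's final code.

```yaml
services:
  inference-model:
    environment:
      # Comma-separated list users could override (or leave empty) in their .env file.
      - PREDOWNLOAD_MODELS=${PREDOWNLOAD_MODELS:-facebook/bart-large-cnn,roberta-large}
    command: >
      python -c "import os;
      from huggingface_hub import snapshot_download;
      [print('Model', m, 'downloaded to:', snapshot_download(m, cache_dir='/home/ray/.cache/huggingface/hub'))
      for m in os.environ['PREDOWNLOAD_MODELS'].split(',') if m]"
    volumes:
      - ${HOME}/.cache/huggingface:/home/ray/.cache/huggingface
```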
model_path = snapshot_download('facebook/bart-large-cnn', cache_dir='/home/ray/.cache/huggingface/hub'); \
print('Model downloaded to:', model_path)\
"
If we want to have more than one model, perhaps we could have something like
model_names = ["model/1", "model/2", ...]
for model_name in model_names:
    model_path = ...
    print(f"Model {model_name} downloaded to: {model_path}")
WDYT?
(also, I think model_path is relative to the container and might be misleading as the actual path is different)
Definitely happy with the addition to have more than 1 model (look at my comment above). Not sure I follow about the path.
In the docker-compose, in this line
- huggingface_cache_vol:/home/ray/.cache/huggingface
we use the same path inside Ray (I did a few tests around this to get it right)
My bad, sorry, I did not explain it properly!
What I meant is that we are printing "Model downloaded to " with that Python code, and the path will be the directory inside the container... which makes no sense to the user, because it is not where they will look for the model if they need it (that is, the volume or the local path, not the container one).
As an example, let's say that we are storing this in the classical HF_HOME path. The user will see a message "Model blahblah downloaded to /home/ray/.cache/huggingface", but that is the folder in the container, not on their host.
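One way to make the message meaningful (a sketch, not part of the PR): pass the host-side location into the container as an extra variable so the log can print the path the user will actually see. HOST_HF_CACHE is a made-up name used only for illustration.

```yaml
services:
  inference-model:
    environment:
      # Purely informational: the container only knows its internal path,
      # so the host path has to be handed in if we want to log it.
      - HOST_HF_CACHE=${HOME}/.cache/huggingface
```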
@@ -68,6 +78,8 @@ services:
depends_on:
  redis:
    condition: service_healthy
  inference-model:
    condition: service_completed_successfully
As this will take a while, what are we planning to do with the other services in the meantime? Options:
- make all of them depend on ray (might not be needed, we can e.g. still upload datasets or directly check previous experiment results)
- have something to prevent us from running ray-dependent workflows until ray is up
- none of the above, but clearly communicate to the users that they'll have to wait a bit before running anything that requires ray (not ideal for beginners IMO)
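For reference, the compose-level gating introduced by the diff above looks like the snippet below; the service name here is illustrative. Option 1 would copy the inference-model dependency to every service, while options 2 and 3 leave the other services free to start immediately.

```yaml
services:
  backend:
    depends_on:
      redis:
        condition: service_healthy
      # Waits until the model pre-download container has exited successfully.
      inference-model:
        condition: service_completed_successfully
```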
Re: persistence, if that's the HF one, there's a new issue open, as it looks like we lost it in one of the most recent updates. Hope this helps!
Thanks for the comments, Davide! The new issue #876 would technically be fixed by this one; I say "technically" because we're removing that cache, and I'd like everyone to agree with this approach first.
@aittalam I just updated the code according to our conversation this morning. I now pull a list of models into the container, and they go directly into the local machine's cache (which then gets pulled into Ray). Could you have another look at this?
platform: linux/${ARCH}
command: /bin/true
volumes:
  - ${HOME}/.cache/huggingface:/home/ray/.cache/huggingface
Suggested change: replace
- ${HOME}/.cache/huggingface:/home/ray/.cache/huggingface
with
- ${HF_HOME}:/home/ray/.cache/huggingface
We have just fixed the HF cache issue and refactored the code a bit.
The main idea now is that, consistent with Hugging Face's own convention, we map the HF cache dir in the container to an HF_HOME dir whose default is ${HOME}/.cache/huggingface, but which can be customised if the user already keeps their cache somewhere else (see #935).
Sorry for the change under the hood!
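Sketch of the resulting mapping, assuming a recent Docker Compose that supports nested defaults in variable interpolation:

```yaml
services:
  ray:
    volumes:
      # Falls back to the standard HF cache location when HF_HOME is not set.
      - ${HF_HOME:-${HOME}/.cache/huggingface}:/home/ray/.cache/huggingface
```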
What's changing
As described in #694, the first Ray job currently takes a while in the local environment. What I've tried to do in this PR is add a new container that builds a cache for Ray. Right now this container is built locally, but if we think it's better, I can add an option to pull it from a registry instead (which would also improve start time).
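In compose terms, the registry option would roughly amount to swapping the local build for a published image; the image name below is purely illustrative, not an existing artifact.

```yaml
services:
  inference-model:
    # Built locally today; could instead be pulled from a registry, e.g.:
    image: ghcr.io/example/lumigator-ray-cache:latest
```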
Refs #694
Closes #694
How to test it
Steps to test the changes:
1. On the main branch: ensure you clean all containers AND volumes (docker volume prune -a). Then start Lumigator locally and record how long it takes to add ground truth to a dataset (I've tested with the one in the repo).
2. Repeat the same steps on this branch (agpituk/694-reduce-ray-job-cold-time) and compare the times.

Additional notes for reviewers
This removes the local container cache that we have as developers, so that's a downside we need to weigh when deciding whether it's worth it.
I already...
(/docs)