
Reduce ray job cold time #825

Open

agpituk wants to merge 25 commits into main
Conversation

Contributor

@agpituk agpituk commented Feb 7, 2025

What's changing

As described in #694, the first Ray job currently takes a while in the local environment. What I've tried to do in this PR is add a new container that builds a cache for Ray. Right now this container is built locally, but if we think it's better, I can add the option to pull it from a registry instead (which would also improve start time).
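
Roughly, the idea looks like this in docker-compose.yaml (a simplified sketch based on the file and volume names visible in this diff, not the exact definition in the PR):

services:
  inference-model:
    # One-shot service: pre-downloads the models into a shared cache volume, then exits.
    build:
      context: cache
      dockerfile: Dockerfile.model-inference
    volumes:
      - huggingface_cache_vol:/home/ray/.cache/huggingface

  ray:
    # Ray only starts once the cache service has finished successfully.
    depends_on:
      inference-model:
        condition: service_completed_successfully
    volumes:
      - huggingface_cache_vol:/home/ray/.cache/huggingface

volumes:
  huggingface_cache_vol: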

If this PR is related to an issue or closes one, please link it here.

Refs #694
Closes #694

How to test it

Steps to test the changes:

  1. On the main branch: ensure you clean all containers AND volumes (docker volume prune -a). Then start Lumigator locally and record how long it takes to add ground truth to a dataset (I've tested with the one in the repo).
  2. Repeat the process on this PR branch (agpituk/694-reduce-ray-job-cold-time).
  3. Hopefully it improves the timing!

Additional notes for reviewers

This removes the local container image that we have cached as developers, so that's a downside we need to weigh when deciding whether it's worth it.

I already...

  • Tested the changes in a working environment to ensure they work as expected
  • [N/A] Added some tests for any new functionality
  • Updated the documentation (both comments in code and product documentation under /docs)
  • [N/A] Checked if a (backend) DB migration step was required and included it if required

@agpituk agpituk changed the title Agpituk/694 reduce ray job cold time Reduce ray job cold time Feb 10, 2025
@agpituk agpituk marked this pull request as ready for review February 10, 2025 15:34
@agpituk
Contributor Author

agpituk commented Feb 10, 2025

I am aware that a couple of tests are failing; I'm trying to figure them out, as I can't reproduce the error locally.

Contributor

@ividal ividal left a comment

Thanks for this! Only some comments for amd64.
I tested it on both arm (mac) and amd (ubuntu) and it makes such a difference 🥳

cache/Dockerfile.model-inference (outdated)
docker-compose.yaml (outdated)
@ividal ividal added the docker label Feb 11, 2025
@ividal
Contributor

ividal commented Feb 11, 2025

Completely forgot the first time around: knowing that we are now pre-downloading models, it is the perfect time to add a warning to users that they need to make sure they have enough disk+docker space available for them.

WDYT of, here in the README.md, something like:

"""## Get started

The simplest way (...)

  • The system Python (version managers such as uv should be deactivated)
    - At least X GB available on disk and allocated for docker, since some small language models will be pre-downloaded.

You can run and develop Lumigator locally (...)
"""

@ividal ividal self-requested a review February 13, 2025 18:31
Contributor

@ividal ividal left a comment

Thanks for the changes! Tried it again on ubuntu + mac - LGTM (after rebasing+tests pass)!

@agpituk
Contributor Author

agpituk commented Feb 13, 2025

Thanks for the changes! Tried it again on ubuntu + mac - LGTM (after rebasing+tests pass)!

Thanks Irina for all the comments! I am trying to figure out what's wrong with the tests, as they don't look related to my PR. I think it's an issue with the persistence we added recently. I've just triggered another run to triple-check, and if it still fails I'll follow up tomorrow morning.

@github-actions github-actions bot added the gha GitHub actions related label Feb 14, 2025
@agpituk agpituk enabled auto-merge (squash) February 17, 2025 14:10
@agpituk agpituk disabled auto-merge February 17, 2025 14:23
mkdir -p /tmp/ray/session_latest/runtime_resources/pip
rmdir /tmp/ray/session_latest/runtime_resources/pip/ && ln -s /tmp/ray_pip_cache /tmp/ray/session_latest/runtime_resources/pip
sleep infinity
shm_size: 2g
volumes:
- ${HOME}/.cache/huggingface:/home/ray/.cache/huggingface
- huggingface_cache_vol:/home/ray/.cache/huggingface
Member

Is there any strong reason for moving this to a volume? This makes lumigator's cache not interoperable with the HF cache (that might already reside on users' machines).

Contributor Author

This is definitely one of the problems this PR may introduce (on top of slower CI times). I moved this to a volume because we create that volume beforehand, so we ensure the BART model is already there (reducing the time to a first experiment). Without it being a volume, I'm not sure how I can add this to Ray's cache.

README.md Outdated
@@ -45,6 +45,7 @@ need to have the following prerequisites installed on your machine:
- On Linux, you need to follow the
[post-installation steps](https://docs.docker.com/engine/install/linux-postinstall/).
- The system Python (version managers such as uv should be deactivated)
- At least 10 GB available on disk and allocated for Docker, since some small language models will be pre-downloaded
Member

I think it'd be great if we added in the docs (1) what models are downloaded (right now it's bart alone, right? I'd suggest roberta-large too for the bertscore metric), (2) their exact size (bart+roberta are less than 3GB), and (3) how this can be disabled if e.g. someone has no intention of ever running bart. WDYT?

Contributor Author

BART has to run right now to generate GT; that's why I added it as a kind of mandatory model. As it is, we can't disable it (apart from manually removing the service from docker-compose, which is not very user-friendly, I'd say).
We could maybe add a variable holding the list of models you want to pre-download into Ray's cache. Would that work?
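
For example, something along these lines in docker-compose.yaml (the variable name is hypothetical, just to illustrate the idea):

inference-model:
  environment:
    # Hypothetical variable: comma-separated list of HF models to pre-download into the cache.
    - PRELOAD_MODELS=facebook/bart-large-cnn,roberta-large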

Member

Yes, I think that'd be great! For instance, right now we are using roberta-large for bertscore evaluations, so there are already two models, and having a list we could point users to makes it easy for them to customise it. Thank you!

model_path = snapshot_download('facebook/bart-large-cnn', cache_dir='/home/ray/.cache/huggingface/hub'); \
print('Model downloaded to:', model_path)\
"

Member

If we want to have more than one model, perhaps we could have something like

model_names = ["model/1", "model/2", ...]
for model_name in model_names:
    model_path = snapshot_download(model_name, cache_dir='/home/ray/.cache/huggingface/hub')
    print(f"Model {model_name} downloaded to: {model_path}")

WDYT?
(also, I think model_path is relative to the container and might be misleading as the actual path is different)

Contributor Author

Definitely happy with the addition of support for more than one model (see my comment above). Not sure I follow about the path.
In the docker-compose, in this line
- huggingface_cache_vol:/home/ray/.cache/huggingface
we use the same path inside Ray (I did a few tests around this to get it right)

Member

My bad, sorry, I did not explain it properly!
What I meant is that we are printing "Model downloaded to ..." with that Python code, and the path printed will be the directory inside the container... which makes no sense to the user, because it is not where they will look for the model if they need it (that is, the volume or the local path, not the container one).
As an example, let's say we are storing this in the classical HF_HOME path. The user will see a message "Model blahblah downloaded to /home/ray/.cache/huggingface", but that is the folder in the container, not on their host.

@@ -68,6 +78,8 @@ services:
depends_on:
redis:
condition: service_healthy
inference-model:
condition: service_completed_successfully
Member

As this will take a while, what are we planning to do with the other services in the meantime? Options:

  • make all of them depend on ray (might not be needed, we can e.g. still upload datasets or directly check previous experiment results; a rough sketch of this option is after the list)
  • have something to prevent us from running ray-dependent workflows until ray is up
  • none of the above, but clearly communicate to the users that they'll have to wait a bit before running anything that requires ray (not ideal for beginners IMO)
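
For the first option, a rough sketch of the compose wiring (the service names and the healthcheck command are illustrative assumptions, not what this PR currently does):

ray:
  healthcheck:
    # Assumed check: succeed once the Ray dashboard answers on its default port (8265).
    test: ["CMD-SHELL", "wget -qO- http://localhost:8265 > /dev/null || exit 1"]
    interval: 10s
    retries: 30

backend:
  depends_on:
    ray:
      condition: service_healthy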

@aittalam
Member

Thanks for the changes! Tried it again on ubuntu + mac - LGTM (after rebasing+tests pass)!

Thanks Irina for all the comments! I am trying to figure out what's wrong with the tests, as they don't look related to my PR. I think it's an issue with the persistence we added recently. I've just triggered another run to triple-check, and if it still fails I'll follow up tomorrow morning.

Re: persistence, if that's the HF one there's a new issue open as it looks like we lost it in one of the most recent updates. Hope this helps!

@agpituk
Contributor Author

agpituk commented Feb 17, 2025

Thanks for the changes! Tried it again on ubuntu + mac - LGTM (after rebasing+tests pass)!

Thanks Irina for all the comments! I am trying to figure out what's wrong with the tests, as they don't look related to my PR. I think it's an issue with the persistence we added recently. I've just triggered another run to triple-check, and if it still fails I'll follow up tomorrow morning.

Re: persistence, if that's the HF one there's a new issue open as it looks like we lost it in one of the most recent updates. Hope this helps!

Thanks for the comments, Davide!

The new issue #876 would technically be fixed by this one, but I say "technically" because we're removing that cache, and I'd like everyone to agree with this approach first.

@agpituk
Contributor Author

agpituk commented Feb 19, 2025

@aittalam I just updated the code according to our conversation this morning. I now pull a list of models into the container, and they go directly into the local machine's cache (which then gets mounted into Ray).
I also updated the docs.
The only thing I'd leave for a different PR is the waiting time for Ray to start. I'll also open a new issue about improving CI build times, as this change will increase them.

Could you have another look into this?
cc @ividal , in case you want to see the new version of this

@agpituk agpituk requested review from aittalam and ividal February 19, 2025 15:01
platform: linux/${ARCH}
command: /bin/true
volumes:
- ${HOME}/.cache/huggingface:/home/ray/.cache/huggingface
Member

Suggested change
- ${HOME}/.cache/huggingface:/home/ray/.cache/huggingface
- ${HF_HOME}:/home/ray/.cache/huggingface

We have just fixed the HF cache issue and refactored the code a bit.
The main idea now is that, consistently with HuggingFace's definition, we map the HF cache dir in the container to an HF_HOME dir whose default is ${HOME}/.cache/huggingface, but which can be customised if the user already keeps their cache somewhere else (see #935).
Sorry for the change under the hood!
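
In compose syntax that default can even be written inline, e.g. (a sketch; recent Compose versions support nested defaults, and the actual plumbing here may set HF_HOME elsewhere instead):

volumes:
  - ${HF_HOME:-${HOME}/.cache/huggingface}:/home/ray/.cache/huggingface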

Labels: docker, gha (GitHub actions related)

Successfully merging this pull request may close these issues.

[FEATURE]: Reduce Ray job cold-start time
4 participants