
Reduce ray job cold time #825

Open

agpituk wants to merge 25 commits into main
Conversation

Contributor

@agpituk agpituk commented Feb 7, 2025

What's changing

As described in #694, the first Ray job currently takes a while in the local environment. What I've tried to do in this PR is add a new container that builds a cache for Ray. Right now this container is built locally, but if we think it's better, I can add the option to pull it from a registry instead (which would also improve start time).
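
Roughly, the idea looks like this in docker-compose.yaml (a simplified sketch based on the file and volume names visible in this diff, not the exact definition in the PR):

services:
  inference-model:
    # One-shot service: pre-downloads the models into a shared cache volume, then exits.
    build:
      context: cache
      dockerfile: Dockerfile.model-inference
    volumes:
      - huggingface_cache_vol:/home/ray/.cache/huggingface

  ray:
    # Ray only starts once the cache service has finished successfully.
    depends_on:
      inference-model:
        condition: service_completed_successfully
    volumes:
      - huggingface_cache_vol:/home/ray/.cache/huggingface

volumes:
  huggingface_cache_vol: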

If this PR is related to an issue or closes one, please link it here.

Refs #694
Closes #694

How to test it

Steps to test the changes:

  1. On the main branch: ensure you clean all containers AND volumes (docker volume prune -a). Then start Lumigator locally and record how long it takes to add ground truth to a dataset (I've tested with the one in the repo).
  2. Repeat the process on this PR branch (agpituk/694-reduce-ray-job-cold-time).
  3. Hopefully it improves the timing!

Additional notes for reviewers

This removes the local container image that we have cached as developers, so that's a downside we need to weigh when deciding whether it's worth it.

I already...

  • Tested the changes in a working environment to ensure they work as expected
  • [N/A] Added some tests for any new functionality
  • Updated the documentation (both comments in code and product documentation under /docs)
  • [N/A] Checked if a (backend) DB migration step was required and included it if required

@agpituk agpituk changed the title Agpituk/694 reduce ray job cold time Reduce ray job cold time Feb 10, 2025
@agpituk agpituk marked this pull request as ready for review February 10, 2025 15:34
@agpituk
Contributor Author

agpituk commented Feb 10, 2025

I am aware that a couple of tests are failing; I'm trying to figure them out, as I can't reproduce the error locally.

Contributor

@ividal ividal left a comment

Thanks for this! Only some comments for amd64.
I tested it on both arm (mac) and amd (ubuntu) and it makes such a difference 🥳

cache/Dockerfile.model-inference (outdated)
docker-compose.yaml (outdated)
@ividal ividal added the docker label Feb 11, 2025
@ividal
Contributor

ividal commented Feb 11, 2025

Completely forgot the first time around: knowing that we are now pre-downloading models, it is the perfect time to add a warning to users that they need to make sure they have enough disk+docker space available for them.

WDYT of, here in the README.md, something like:

"""## Get started

The simplest way (...)

  • The system Python (version managers such as uv should be deactivated)
    - At least X GB available on disk and allocated for docker, since some small language models will be pre-downloaded.

You can run and develop Lumigator locally (...)
"""

@ividal ividal self-requested a review February 13, 2025 18:31
Contributor

@ividal ividal left a comment

Thanks for the changes! Tried it again on ubuntu + mac - LGTM (after rebasing+tests pass)!

@agpituk
Contributor Author

agpituk commented Feb 13, 2025

Thanks for the changes! Tried it again on ubuntu + mac - LGTM (after rebasing+tests pass)!

Thanks Irina for all the comments! I am trying to figure out what's wrong with the tests, as they don't look related to my PR. I think it's an issue with the persistence we added recently. I've just triggered another run to triple-check, and if it still fails I'll follow up tomorrow morning.

@github-actions github-actions bot added the gha GitHub actions related label Feb 14, 2025
@agpituk agpituk enabled auto-merge (squash) February 17, 2025 14:10
@agpituk agpituk disabled auto-merge February 17, 2025 14:23
mkdir -p /tmp/ray/session_latest/runtime_resources/pip
rmdir /tmp/ray/session_latest/runtime_resources/pip/ && ln -s /tmp/ray_pip_cache /tmp/ray/session_latest/runtime_resources/pip
sleep infinity
shm_size: 2g
volumes:
- ${HOME}/.cache/huggingface:/home/ray/.cache/huggingface
- huggingface_cache_vol:/home/ray/.cache/huggingface
Member

Is there any strong reason for moving this to a volume? This makes lumigator's cache not interoperable with the HF cache (that might already reside on users' machines).

Contributor Author

This is definitely one of the problems this PR may introduce (on top of slower CI times). I moved this to a volume because we create that volume beforehand, so we ensure the BART model is already there (reducing the time to a first experiment). Without it being a volume, I'm not sure how I can add this to Ray's cache.

README.md Outdated
@@ -45,6 +45,7 @@ need to have the following prerequisites installed on your machine:
- On Linux, you need to follow the
[post-installation steps](https://docs.docker.com/engine/install/linux-postinstall/).
- The system Python (version managers such as uv should be deactivated)
- At least 10 GB available on disk and allocated for Docker, since some small language models will be pre-downloaded
Member

I think it'd be great if we added in the docs (1) what models are downloaded (right now it's bart alone, right? I'd suggest roberta-large too for the bertscore metric), (2) their exact size (bart+roberta are less than 3GB), and (3) how this can be disabled if e.g. someone has no intention of ever running bart. WDYT?

Contributor Author

BART has to run right now to generate GT; that's why I added it as a kind of mandatory model. As it is, we can't disable it (apart from manually removing the service from docker-compose, which is not very user-friendly, I'd say).
We could maybe add a variable holding the list of models you want to pre-download into Ray's cache. Would that work?
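
For example, something along these lines in docker-compose.yaml (the variable name is hypothetical, just to illustrate the idea):

inference-model:
  environment:
    # Hypothetical variable: comma-separated list of HF models to pre-download into the cache.
    - PRELOAD_MODELS=facebook/bart-large-cnn,roberta-large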

Member

Yes, I think that'd be great! For instance, right now we are using roberta-large for bertscore evaluations, so there are already two models, and having a list we could point users to makes it easy for them to customise it. Thank you!

model_path = snapshot_download('facebook/bart-large-cnn', cache_dir='/home/ray/.cache/huggingface/hub'); \
print('Model downloaded to:', model_path)\
"

Member

If we want to have more than one model, perhaps we could have something like

model_names = ["model/1", "model/2", ...]
for model_name in model_names:
    model_path = snapshot_download(model_name, cache_dir='/home/ray/.cache/huggingface/hub')
    print(f"Model {model_name} downloaded to: {model_path}")

WDYT?
(also, I think model_path is relative to the container and might be misleading as the actual path is different)

Contributor Author

Definitely happy with the addition of support for more than one model (see my comment above). Not sure I follow about the path.
In the docker-compose, in this line
- huggingface_cache_vol:/home/ray/.cache/huggingface
we use the same path inside Ray (I did a few tests around this to get it right)

Member

My bad, sorry, I did not explain it properly!
What I meant is that we are printing "Model downloaded to ..." with that Python code, and the path printed will be the directory inside the container... which makes no sense to the user, because it is not where they will look for the model if they need it (that is, the volume or the local path, not the container one).
As an example, let's say we are storing this in the classical HF_HOME path. The user will see a message "Model blahblah downloaded to /home/ray/.cache/huggingface", but that is the folder in the container, not on their host.

@@ -68,6 +78,8 @@ services:
depends_on:
redis:
condition: service_healthy
inference-model:
condition: service_completed_successfully
Member

As this will take a while, what are we planning to do with the other services in the meantime? Options:

  • make all of them depend on ray (might not be needed, we can e.g. still upload datasets or directly check previous experiment results; a rough sketch of this option is after the list)
  • have something to prevent us from running ray-dependent workflows until ray is up
  • none of the above, but clearly communicate to the users that they'll have to wait a bit before running anything that requires ray (not ideal for beginners IMO)
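
For the first option, a rough sketch of the compose wiring (the service names and the healthcheck command are illustrative assumptions, not what this PR currently does):

ray:
  healthcheck:
    # Assumed check: succeed once the Ray dashboard answers on its default port (8265).
    test: ["CMD-SHELL", "wget -qO- http://localhost:8265 > /dev/null || exit 1"]
    interval: 10s
    retries: 30

backend:
  depends_on:
    ray:
      condition: service_healthy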

@aittalam
Member

Thanks for the changes! Tried it again on ubuntu + mac - LGTM (after rebasing+tests pass)!

Thanks Irina for all the comments! I am trying to figure out what's wrong with the tests, as they don't look related to my PR. I think it's an issue with the persistence we added recently. I've just triggered another run to triple-check, and if it still fails I'll follow up tomorrow morning.

Re: persistence, if that's the HF one there's a new issue open as it looks like we lost it in one of the most recent updates. Hope this helps!

@agpituk
Contributor Author

agpituk commented Feb 17, 2025

Thanks for the changes! Tried it again on ubuntu + mac - LGTM (after rebasing+tests pass)!

Thanks Irina for all the comments! I am trying to figure out what's wrong with the tests, as they don't look related to my PR. I think it's an issue with the persistence we added recently. I've just triggered another run to triple-check, and if it still fails I'll follow up tomorrow morning.

Re: persistence, if that's the HF one there's a new issue open as it looks like we lost it in one of the most recent updates. Hope this helps!

Thanks for the comments, Davide!

The new issue #876 would technically be fixed by this one, but I say "technically" because we're removing that cache, and I'd like everyone to agree with this approach first.

@agpituk
Contributor Author

agpituk commented Feb 19, 2025

@aittalam I just updated the code according to our conversation this morning. I now pull a list of models into the container, and they go directly into the local machine's cache (which then gets mounted into Ray).
I also updated the docs.
The only thing I'd leave for a different PR is the waiting time for Ray to start. I'll also open a new issue about improving CI build times, as this change will increase them.

Could you have another look into this?
cc @ividal , in case you want to see the new version of this

@agpituk agpituk requested review from aittalam and ividal February 19, 2025 15:01
platform: linux/${ARCH}
command: /bin/true
volumes:
- ${HOME}/.cache/huggingface:/home/ray/.cache/huggingface
Member

Suggested change
- ${HOME}/.cache/huggingface:/home/ray/.cache/huggingface
- ${HF_HOME}:/home/ray/.cache/huggingface

We have just fixed the HF cache issue and refactored the code a bit.
The main idea now is that, consistently with HuggingFace's definition, we map the HF cache dir in the container to an HF_HOME dir whose default is ${HOME}/.cache/huggingface, but which can be customised if the user already keeps their cache somewhere else (see #935).
Sorry for the change under the hood!
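
In compose syntax that default can even be written inline, e.g. (a sketch; recent Compose versions support nested defaults, and the actual plumbing here may set HF_HOME elsewhere instead):

volumes:
  - ${HF_HOME:-${HOME}/.cache/huggingface}:/home/ray/.cache/huggingface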

Labels: docker, gha (GitHub actions related)

Successfully merging this pull request may close these issues.

[FEATURE]: Reduce Ray job cold-start time
4 participants