Description
This is an initial implementation of Distributed Low-Communication (DiLoCo) training as described in https://arxiv.org/abs/2311.08105. DiLoCo is an inner-outer, bi-level optimization strategy that significantly reduces bandwidth relative to data-parallel training by synchronizing the replicas less frequently.
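To make the inner-outer structure concrete, here is a minimal numpy sketch of the algorithm (not the implementation in this PR): each replica runs several local SGD steps on its own shard, and once per round the averaged parameter delta is applied by an outer momentum-SGD step. The model, hyperparameters, and function name are illustrative; the paper uses Nesterov momentum for the outer optimizer, while plain momentum is used here for brevity.

```python
import numpy as np

def diloco_fit(x, y, num_rounds=150, num_replicas=4, inner_steps=20,
               inner_lr=0.05, outer_lr=0.7, outer_momentum=0.9):
    """Toy DiLoCo loop fitting y = w*x + b with mean-squared error.

    Each replica takes `inner_steps` of local full-batch SGD on its own
    data shard; once per round, the averaged parameter delta is applied
    by an outer momentum-SGD step (the paper uses Nesterov momentum).
    """
    theta = np.zeros(2)                    # global parameters [w, b]
    velocity = np.zeros(2)                 # outer momentum buffer
    shards = list(zip(np.array_split(x, num_replicas),
                      np.array_split(y, num_replicas)))
    for _ in range(num_rounds):
        deltas = []
        for xs, ys in shards:
            local = theta.copy()
            for _ in range(inner_steps):   # inner phase: no communication
                err = local[0] * xs + local[1] - ys
                grad = np.array([(err * xs).mean(), err.mean()])
                local -= inner_lr * grad
            deltas.append(theta - local)   # this replica's "outer gradient"
        outer_grad = np.mean(deltas, axis=0)  # the only all-reduce per round
        velocity = outer_momentum * velocity + outer_grad
        theta -= outer_lr * velocity
    return theta
```

With bandwidth-heavy communication happening only once per round (one all-reduce of the averaged delta) rather than once per step, the communication volume drops by roughly a factor of `inner_steps`.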
This implementation adds the `drjax` package to the pip requirements; it handles bookkeeping and the subtle configuration of `jax.vmap`'s `spmd_axis_name` argument.

Going forward, one can set `ici_diloco_parallelism` or `dcn_diloco_parallelism` to a value greater than 1 (the default of 1 disables DiLoCo) to enable DiLoCo training.

Next steps would include implementing the streaming DiLoCo variant (https://arxiv.org/abs/2501.18512).
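For illustration, enabling DiLoCo might look like the fragment below. This is a sketch: only the two flag names come from this change; the replica count and the comments interpreting ICI (intra-slice interconnect) vs. DCN (data-center network) are assumptions.

```yaml
# Hypothetical config fragment: four DiLoCo replicas across DCN slices.
ici_diloco_parallelism: 1  # replicas over the intra-slice interconnect (ICI); 1 disables
dcn_diloco_parallelism: 4  # replicas over the data-center network (DCN)
```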
Tests
This PR introduces a new `tests/diloco_test.py` with a numerical-correctness test for a simple two-parameter model.
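The test itself is not reproduced here, but one property such a test can check is that DiLoCo degenerates to plain SGD in the single-replica case. A minimal sketch of that idea (helper names and the two-parameter linear model are hypothetical, not the PR's code): with one replica, no outer momentum, and an outer learning rate of 1.0, a DiLoCo round must match ordinary SGD exactly.

```python
import numpy as np

def sgd_steps(theta, xs, ys, steps, lr):
    # Plain full-batch SGD on mean-squared error for y = w*x + b.
    theta = theta.copy()
    for _ in range(steps):
        err = theta[0] * xs + theta[1] - ys
        grad = np.array([(err * xs).mean(), err.mean()])
        theta -= lr * grad
    return theta

def diloco_round(theta, shards, steps, lr, outer_lr):
    # One DiLoCo round without outer momentum: average the replicas'
    # parameter deltas and apply them with the outer learning rate.
    deltas = [theta - sgd_steps(theta, xs, ys, steps, lr)
              for xs, ys in shards]
    return theta - outer_lr * np.mean(deltas, axis=0)
```

With `shards=[(x, y)]` (one replica) and `outer_lr=1.0`, `diloco_round` returns `theta - (theta - sgd_result)`, i.e. exactly the plain-SGD parameters, which gives a tight equality check rather than a loose tolerance.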
Checklist
Before submitting this PR, please make sure (put X in square brackets):