Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: skip dataset re-download and ensure safe dataset syncing #220

Merged
merged 10 commits into from
Jan 8, 2025

Conversation

JSabadin
Copy link
Contributor

@JSabadin JSabadin commented Jan 6, 2025

Skip Redownload Dataset and Fix Potential Race Conditions

A new parameter, update_mode, has been introduced in the LuxonisLoader. This parameter allows more control over when the dataset from the cloud is already downloaded locally.

  • If set to UpdateMode.ALWAYS, the loader will always re-download the dataset locally.
  • If set to UpdateMode.IF_EMPTY, it will only download the dataset if it does not already exist locally.

Example Usage:

from luxonis_loader import LuxonisLoader, UpdateMode

loader = LuxonisLoader(
    dataset,
    update_mode=UpdateMode.IF_EMPTY
)

This ensures that the loader uses the local dataset if one is already present, helping to avoid unnecessary downloads.


FileLock Added for Safe Syncing

To address potential race conditions, a FileLock mechanism has been implemented in the following areas:

  1. sync_from_cloud Method: This prevents multiple processes from attempting to sync the dataset simultaneously in a distributed environment (e.g., DDP on GCP).
  2. _get_metadata in Dataset Initialization: Ensures safe concurrent access when multiple processes initialize the dataset and loader before setting up the DDP environment.

Additionally, the _load_df_offline method has been updated to:

  • Only access local storage.
  • Avoid attempting to download dataset annotations from the cloud, as sync_from_cloud already handles this.

These changes help to ensure safe and efficient operation in distributed environments by preventing redundant downloads and race conditions.

@JSabadin JSabadin requested a review from a team as a code owner January 6, 2025 10:36
@JSabadin JSabadin requested review from kozlov721, klemen1999, tersekmatija and conorsim and removed request for a team January 6, 2025 10:36
@github-actions github-actions bot added the enhancement New feature or request label Jan 6, 2025
Copy link

codecov bot commented Jan 6, 2025

Codecov Report

Attention: Patch coverage is 92.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 95.81%. Comparing base (1cbc795) to head (2c49d12).
Report is 1 commits behind head on main.

✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
luxonis_ml/data/datasets/luxonis_dataset.py 88.23% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #220      +/-   ##
==========================================
- Coverage   95.84%   95.81%   -0.04%     
==========================================
  Files          88       88              
  Lines        4571     4583      +12     
==========================================
+ Hits         4381     4391      +10     
- Misses        190      192       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@JSabadin JSabadin changed the title feat: skip dataset re-download feat: skip dataset re-download and added FileLock for safe syncing Jan 6, 2025
@github-actions github-actions bot added the tracker Changes affecting luxonis_ml.tracker subpackage label Jan 7, 2025
@kozlov721
Copy link
Collaborator

Doesn't the force flag already do what the skip_redownload_dataset does?

@JSabadin
Copy link
Contributor Author

JSabadin commented Jan 7, 2025

Doesn't the force flag already do what the skip_redownload_dataset does?

They serve different purposes:

  • force always triggers a fresh download, ignoring what’s on disk.
  • skip_redownload_dataset simply prevents downloads if data already exists locally.

Probably a string parameter like update_mode could encapsulate all scenarios better:

  • "always" = force a fresh download
  • "if_empty" = only download if there’s no local data
  • "never" = skip the download entirely

@kozlov721
Copy link
Collaborator

Probably a string parameter like update_mode could encapsulate all scenarios better:

  • "always" = force a fresh download
  • "if_empty" = only download if there’s no local data
  • "never" = skip the download entirely

Merging those two parameters to one would be better. Imho it's not very clear what's the difference between force=False and skip_redownloading_dataset=True.

@github-actions github-actions bot added the data Changes affecting luxonis_ml.data subpackage label Jan 7, 2025
luxonis_ml/data/datasets/luxonis_dataset.py Outdated Show resolved Hide resolved
luxonis_ml/data/datasets/luxonis_dataset.py Show resolved Hide resolved
luxonis_ml/data/loaders/luxonis_loader.py Show resolved Hide resolved
@JSabadin JSabadin changed the title feat: skip dataset re-download and added FileLock for safe syncing feat: skip dataset re-download and ensure safe dataset syncing Jan 8, 2025
@JSabadin JSabadin merged commit 11ee765 into main Jan 8, 2025
10 of 12 checks passed
@JSabadin JSabadin deleted the feat/dataset-redownload branch January 8, 2025 13:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data Changes affecting luxonis_ml.data subpackage enhancement New feature or request tracker Changes affecting luxonis_ml.tracker subpackage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants