Revamp download to local dir process (#2223)
* still an early draft

* this is better

* fix

* revamp/refactor download process

* resume download by default + do not upload .huggingface folder

* compute sha256 if necessary

* fix hash

* add tests + fix some stuff

* fix snapshot download tests

* fix test

* lots of docs

* add secu

* as constant

* fix

* fix tests

* remove unused code

* don't use jsons

* style

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

* Warn more about resume_download

* fix test

* Add tests specific to .huggingface folder

* remove advice to use hf_transfer when downloading from cli

* fix torch test

* more test fix

* feedback

* suggested changes

* more robust

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <[email protected]>

* comment

* comment

* robust tests

* fix CI

* ez

* more robust?

* allow for 1s diff

* don't raise on unlink

* style

* robustness

---------

Co-authored-by: Lysandre Debut <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>
3 people authored Apr 29, 2024
1 parent b6de3f9 commit 4df59b4
Showing 29 changed files with 1,301 additions and 791 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -14,8 +14,8 @@ quality:
 	mypy src
 
 style:
-	ruff check --fix $(check_dirs) # linter
 	ruff format $(check_dirs) # formatter
+	ruff check --fix $(check_dirs) # linter
 	python utils/check_contrib_list.py --update
 	python utils/check_static_imports.py --update
 	python utils/generate_async_inference_client.py --update
12 changes: 7 additions & 5 deletions docs/source/en/guides/cli.md
@@ -224,18 +224,20 @@ The examples above show how to download from the latest commit on the main branch
 
 ### Download to a local folder
 
-The recommended (and default) way to download files from the Hub is to use the cache-system. However, in some cases you want to download files and move them to a specific folder. This is useful to get a workflow closer to what git commands offer. You can do that using the `--local_dir` option. The file is downloaded to a tmp file and then moved to the local dir to avoid having partially downloaded files in the local folder.
+The recommended (and default) way to download files from the Hub is to use the cache-system. However, in some cases you want to download files and move them to a specific folder. This is useful to get a workflow closer to what git commands offer. You can do that using the `--local-dir` option.
 
-<Tip warning={true}>
+A `./.huggingface/` folder is created at the root of your local directory containing metadata about the downloaded files. This prevents re-downloading files if they're already up-to-date. If the metadata has changed, then the new file version is downloaded. This makes the `local_dir` optimized for pulling only the latest changes.
 
-Downloading to a local directory comes with some downsides. Please check out the limitations in the [Download](./download#download-files-to-local-folder) guide before using `--local-dir`.
+<Tip>
+
+For more details on how downloading to a local folder works, check out the [download](./download.md#download-files-to-a-local-folder) guide.
 
 </Tip>
 
 ```bash
->>> huggingface-cli download adept/fuyu-8b model-00001-of-00002.safetensors --local-dir .
+>>> huggingface-cli download adept/fuyu-8b model-00001-of-00002.safetensors --local-dir fuyu
 ...
-./model-00001-of-00002.safetensors
+fuyu/model-00001-of-00002.safetensors
 ```
 
 ### Specify cache directory
51 changes: 15 additions & 36 deletions docs/source/en/guides/download.md
@@ -126,42 +126,21 @@ files except `vocab.json`.
 >>> snapshot_download(repo_id="gpt2", allow_patterns=["*.md", "*.json"], ignore_patterns="vocab.json")
 ```
 
-## Download file(s) to local folder
-
-The recommended (and default) way to download files from the Hub is to use the [cache-system](./manage-cache).
-You can define your cache location by setting `cache_dir` parameter (both in [`hf_hub_download`] and [`snapshot_download`]).
-
-However, in some cases you want to download files and move them to a specific folder. This is useful to get a workflow
-closer to what `git` commands offer. You can do that using the `local_dir` and `local_dir_use_symlinks` parameters:
-- `local_dir` must be a path to a folder on your system. The downloaded files will keep the same file structure as in the
-  repo. For example if `filename="data/train.csv"` and `local_dir="path/to/folder"`, then the returned filepath will be
-  `"path/to/folder/data/train.csv"`.
-- `local_dir_use_symlinks` defines how the file must be saved in your local folder.
-  - The default behavior (`"auto"`) is to duplicate small files (<5MB) and use symlinks for bigger files. Symlinks allow
-    to optimize both bandwidth and disk usage. However manually editing a symlinked file might corrupt the cache, hence
-    the duplication for small files. The 5MB threshold can be configured with the `HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD`
-    environment variable.
-  - If `local_dir_use_symlinks=True` is set, all files are symlinked for an optimal disk space optimization. This is
-    for example useful when downloading a huge dataset with thousands of small files.
-  - Finally, if you don't want symlinks at all you can disable them (`local_dir_use_symlinks=False`). The cache directory
-    will still be used to check whether the file is already cached or not. If already cached, the file is **duplicated**
-    from the cache (i.e. saves bandwidth but increases disk usage). If the file is not already cached, it will be
-    downloaded and moved directly to the local dir. This means that if you need to reuse it somewhere else later, it
-    will be **re-downloaded**.
-
-Here is a table that summarizes the different options to help you choose the parameters that best suit your use case.
-
-<!-- Generated with https://www.tablesgenerator.com/markdown_tables -->
-| Parameters | File already cached | Returned path | Can read path? | Can save to path? | Optimized bandwidth | Optimized disk usage |
-|---|:---:|:---:|:---:|:---:|:---:|:---:|
-| `local_dir=None` |  | symlink in cache | ✅ | ❌<br>_(save would corrupt the cache)_ | ✅ | ✅ |
-| `local_dir="path/to/folder"`<br>`local_dir_use_symlinks="auto"` |  | file or symlink in folder | ✅ | ✅ _(for small files)_ <br> ⚠️ _(for big files do not resolve path before saving)_ | ✅ | ✅ |
-| `local_dir="path/to/folder"`<br>`local_dir_use_symlinks=True` |  | symlink in folder | ✅ | ⚠️<br>_(do not resolve path before saving)_ | ✅ | ✅ |
-| `local_dir="path/to/folder"`<br>`local_dir_use_symlinks=False` | No | file in folder | ✅ | ✅ | ❌<br>_(if re-run, file is re-downloaded)_ | ⚠️<br>(multiple copies if ran in multiple folders) |
-| `local_dir="path/to/folder"`<br>`local_dir_use_symlinks=False` | Yes | file in folder | ✅ | ✅ | ⚠️<br>_(file has to be cached first)_ | ❌<br>_(file is duplicated)_ |
-
-**Note:** if you are on a Windows machine, you need to enable developer mode or run `huggingface_hub` as admin to enable
-symlinks. Check out the [cache limitations](../guides/manage-cache#limitations) section for more details.
+## Download file(s) to a local folder
 
+By default, we recommend using the [cache system](./manage-cache) to download files from the Hub. You can specify a custom cache location using the `cache_dir` parameter in [`hf_hub_download`] and [`snapshot_download`], or by setting the [`HF_HOME`](../package_reference/environment_variables#hf_home) environment variable.
+
+However, if you need to download files to a specific folder, you can pass a `local_dir` parameter to the download function. This is useful to get a workflow closer to what the `git` command offers. The downloaded files will maintain their original file structure within the specified folder. For example, if `filename="data/train.csv"` and `local_dir="path/to/folder"`, the resulting filepath will be `"path/to/folder/data/train.csv"`.
+
+A `./.huggingface/` folder is created at the root of your local directory containing metadata about the downloaded files. This prevents re-downloading files if they're already up-to-date. If the metadata has changed, then the new file version is downloaded. This makes the `local_dir` optimized for pulling only the latest changes.
+
+After completing the download, you can safely remove the `.huggingface/` folder if you no longer need it. However, be aware that re-running your script without this folder may result in longer recovery times, as metadata will be lost. Rest assured that your local data will remain intact and unaffected.
+
+<Tip>
+
+Don't worry about the `.huggingface/` folder when committing changes to the Hub! This folder is automatically ignored by both `git` and [`upload_folder`].
+
+</Tip>
 
 ## Download from the CLI
 
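To make the new behaviour concrete, here is a minimal Python sketch of the flow described in the updated guide. The repo and filename mirror the CLI example above; treat them as placeholders.

```python
from huggingface_hub import hf_hub_download

# First call: the file is downloaded into ./fuyu/ and its metadata is written
# under ./fuyu/.huggingface/download/.
path = hf_hub_download(
    repo_id="adept/fuyu-8b",
    filename="model-00001-of-00002.safetensors",
    local_dir="fuyu",
)
print(path)  # fuyu/model-00001-of-00002.safetensors

# Second call: the stored metadata shows the local copy is up-to-date,
# so the file is not downloaded again.
hf_hub_download(
    repo_id="adept/fuyu-8b",
    filename="model-00001-of-00002.safetensors",
    local_dir="fuyu",
)
```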
2 changes: 1 addition & 1 deletion docs/source/en/guides/integrations.md
@@ -98,7 +98,7 @@ common to offer parameters like:
 - `token`: to download from a private repo
 - `revision`: to download from a specific branch
 - `cache_dir`: to cache files in a specific directory
-- `force_download`/`resume_download`/`local_files_only`: to reuse the cache or not
+- `force_download`/`local_files_only`: to reuse the cache or not
 - `proxies`: configure HTTP session
 
 When pushing models, similar parameters are supported:
5 changes: 1 addition & 4 deletions docs/source/en/package_reference/environment_variables.md
@@ -67,10 +67,7 @@ For more details, see [logging reference](../package_reference/utilities#hugging
 
 ### HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD
 
-Integer value to define under which size a file is considered as "small". When downloading files to a local directory,
-small files will be duplicated to ease user experience while bigger files are symlinked to save disk usage.
-
-For more details, see the [download guide](../guides/download#download-files-to-local-folder).
+This environment variable has been deprecated and is now ignored by `huggingface_hub`. Downloading files to the local dir does not rely on symlinks anymore.
 
 ### HF_HUB_ETAG_TIMEOUT
 
12 changes: 7 additions & 5 deletions src/huggingface_hub/_commit_api.py
@@ -19,6 +19,7 @@
 from .file_download import hf_hub_url
 from .lfs import UploadInfo, lfs_upload, post_lfs_batch_info
 from .utils import (
+    FORBIDDEN_FOLDERS,
     EntryNotFoundError,
     chunk_iterable,
     get_session,
@@ -254,11 +255,12 @@ def _validate_path_in_repo(path_in_repo: str) -> str:
         raise ValueError(f"Invalid `path_in_repo` in CommitOperation: '{path_in_repo}'")
     if path_in_repo.startswith("./"):
         path_in_repo = path_in_repo[2:]
-    if any(part == ".git" for part in path_in_repo.split("/")):
-        raise ValueError(
-            "Invalid `path_in_repo` in CommitOperation: cannot update files under a '.git/' folder (path:"
-            f" '{path_in_repo}')."
-        )
+    for forbidden in FORBIDDEN_FOLDERS:
+        if any(part == forbidden for part in path_in_repo.split("/")):
+            raise ValueError(
+                f"Invalid `path_in_repo` in CommitOperation: cannot update files under a '{forbidden}/' folder (path:"
+                f" '{path_in_repo}')."
+            )
     return path_in_repo
 
 
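A standalone sketch of the stricter validation above, for illustration only. `.git` is confirmed by the replaced code; `.cache` is an assumption here, and the real `FORBIDDEN_FOLDERS` constant lives in `huggingface_hub.utils` and may contain different entries.

```python
# Illustrative re-implementation of the check in _validate_path_in_repo.
# ".git" is confirmed by the diff above; ".cache" is an assumption.
FORBIDDEN_FOLDERS = (".git", ".cache")


def validate_path_in_repo(path_in_repo: str) -> str:
    # Strip an optional leading "./", then reject any path segment that matches a forbidden folder.
    if path_in_repo.startswith("./"):
        path_in_repo = path_in_repo[2:]
    for forbidden in FORBIDDEN_FOLDERS:
        if any(part == forbidden for part in path_in_repo.split("/")):
            raise ValueError(
                f"Invalid `path_in_repo`: cannot update files under a '{forbidden}/' folder (path: '{path_in_repo}')."
            )
    return path_in_repo


print(validate_path_in_repo("./data/train.csv"))  # "data/train.csv"
# validate_path_in_repo(".git/config")  # would raise ValueError
```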
4 changes: 2 additions & 2 deletions src/huggingface_hub/_commit_scheduler.py
@@ -9,7 +9,7 @@
 from threading import Lock, Thread
 from typing import Dict, List, Optional, Union
 
-from .hf_api import IGNORE_GIT_FOLDER_PATTERNS, CommitInfo, CommitOperationAdd, HfApi
+from .hf_api import DEFAULT_IGNORE_PATTERNS, CommitInfo, CommitOperationAdd, HfApi
 from .utils import filter_repo_objects
 
 
@@ -107,7 +107,7 @@ def __init__(
             ignore_patterns = []
         elif isinstance(ignore_patterns, str):
             ignore_patterns = [ignore_patterns]
-        self.ignore_patterns = ignore_patterns + IGNORE_GIT_FOLDER_PATTERNS
+        self.ignore_patterns = ignore_patterns + DEFAULT_IGNORE_PATTERNS
 
         if self.folder_path.is_file():
             raise ValueError(f"'folder_path' must be a directory, not a file: '{self.folder_path}'.")
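For context, a hedged usage sketch of `CommitScheduler`: user-supplied `ignore_patterns` are extended with `DEFAULT_IGNORE_PATTERNS`, which, per the docs change above, is expected to cover `.git/` as well as the new `.huggingface/` metadata folder. The repo name, folder, and interval below are placeholders.

```python
from huggingface_hub import CommitScheduler

# Push the contents of ./data to the Hub periodically. The user patterns below
# are merged with DEFAULT_IGNORE_PATTERNS, so internal folders such as .git/ and
# the local .huggingface/ metadata folder are not uploaded.
scheduler = CommitScheduler(
    repo_id="username/my-dataset",  # placeholder repo
    repo_type="dataset",
    folder_path="data",
    every=10,  # minutes between commits
    ignore_patterns=["*.tmp", "logs/*"],  # merged with the defaults
)
```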
229 changes: 229 additions & 0 deletions src/huggingface_hub/_local_folder.py
@@ -0,0 +1,229 @@
# coding=utf-8
# Copyright 2024-present, the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains utilities to handle the `.huggingface` folder in local directories.

First discussed in https://github.com/huggingface/huggingface_hub/issues/1738 to store
download metadata when downloading files from the hub to a local directory (without
using the cache).

./.huggingface folder structure:
[4.0K]  data
├── [4.0K]  .huggingface
│   └── [4.0K]  download
│       ├── [  16]  file.parquet.metadata
│       ├── [  16]  file.txt.metadata
│       └── [4.0K]  folder
│           └── [  16]  file.parquet.metadata
├── [6.5G]  file.parquet
├── [1.5K]  file.txt
└── [4.0K]  folder
    └── [  16]  file.parquet

Metadata file structure:
```
# file.txt.metadata
11c5a3d5811f50298f278a704980280950aedb10
a16a55fda99d2f2e7b69cce5cf93ff4ad3049930
1712656091.123

# file.parquet.metadata
11c5a3d5811f50298f278a704980280950aedb10
7c5d3f4b8b76583b422fcb9189ad6c89d5d97a094541ce8932dce3ecabde1421
1712656091.123
```
"""

import logging
import os
import time
from dataclasses import dataclass
from functools import lru_cache
from pathlib import Path
from typing import Optional

from .utils import WeakFileLock


logger = logging.getLogger(__name__)


@dataclass
class LocalDownloadFilePaths:
    """
    Paths to the files related to a download process in a local dir.

    Returned by `get_local_download_paths`.

    Attributes:
        file_path (`Path`):
            Path where the file will be saved.
        lock_path (`Path`):
            Path to the lock file used to ensure atomicity when reading/writing metadata.
        metadata_path (`Path`):
            Path to the metadata file.
    """

    file_path: Path
    lock_path: Path
    metadata_path: Path

    def incomplete_path(self, etag: str) -> Path:
        """Return the path where a file will be temporarily downloaded before being moved to `file_path`."""
        return self.metadata_path.with_suffix(f".{etag}.incomplete")


@dataclass
class LocalDownloadFileMetadata:
    """
    Metadata about a file in the local directory related to a download process.

    Attributes:
        filename (`str`):
            Path of the file in the repo.
        commit_hash (`str`):
            Commit hash of the file in the repo.
        etag (`str`):
            ETag of the file in the repo. Used to check if the file has changed.
            For LFS files, this is the sha256 of the file. For regular files, it corresponds to the git hash.
        timestamp (`float`):
            Unix timestamp of when the metadata was saved i.e. when the metadata was accurate.
    """

    filename: str
    commit_hash: str
    etag: str
    timestamp: float


@lru_cache(maxsize=128)  # ensure singleton
def get_local_download_paths(local_dir: Path, filename: str) -> LocalDownloadFilePaths:
    """Compute paths to the files related to a download process.

    Folders containing the paths are all guaranteed to exist.

    Args:
        local_dir (`Path`):
            Path to the local directory in which files are downloaded.
        filename (`str`):
            Path of the file in the repo.

    Return:
        [`LocalDownloadFilePaths`]: the paths to the files (file_path, lock_path, metadata_path, incomplete_path).
    """
    # filename is the path in the Hub repository (separated by '/')
    # make sure to have a cross platform transcription
    sanitized_filename = os.path.join(*filename.split("/"))
    if os.name == "nt":
        if sanitized_filename.startswith("..\\") or "\\..\\" in sanitized_filename:
            raise ValueError(
                f"Invalid filename: cannot handle filename '{sanitized_filename}' on Windows. Please ask the repository"
                " owner to rename this file."
            )
    file_path = local_dir / sanitized_filename
    metadata_path = _huggingface_dir(local_dir) / "download" / f"{sanitized_filename}.metadata"
    lock_path = metadata_path.with_suffix(".lock")

    file_path.parent.mkdir(parents=True, exist_ok=True)
    metadata_path.parent.mkdir(parents=True, exist_ok=True)
    return LocalDownloadFilePaths(file_path=file_path, lock_path=lock_path, metadata_path=metadata_path)


def read_download_metadata(local_dir: Path, filename: str) -> Optional[LocalDownloadFileMetadata]:
    """Read metadata about a file in the local directory related to a download process.

    Args:
        local_dir (`Path`):
            Path to the local directory in which files are downloaded.
        filename (`str`):
            Path of the file in the repo.

    Return:
        `[LocalDownloadFileMetadata]` or `None`: the metadata if it exists, `None` otherwise.
    """
    paths = get_local_download_paths(local_dir, filename)
    with WeakFileLock(paths.lock_path):
        if paths.metadata_path.exists():
            try:
                with paths.metadata_path.open() as f:
                    commit_hash = f.readline().strip()
                    etag = f.readline().strip()
                    timestamp = float(f.readline().strip())
                metadata = LocalDownloadFileMetadata(
                    filename=filename,
                    commit_hash=commit_hash,
                    etag=etag,
                    timestamp=timestamp,
                )
            except Exception as e:
                # remove the metadata file if it is corrupted / not the right format
                logger.warning(
                    f"Invalid metadata file {paths.metadata_path}: {e}. Removing it from disk and continue."
                )
                try:
                    paths.metadata_path.unlink()
                except Exception as e:
                    logger.warning(f"Could not remove corrupted metadata file {paths.metadata_path}: {e}")

            try:
                # check if the file exists and hasn't been modified since the metadata was saved
                stat = paths.file_path.stat()
                if (
                    stat.st_mtime - 1 <= metadata.timestamp
                ):  # allow 1s difference as stat.st_mtime might not be precise
                    return metadata
                logger.info(f"Ignored metadata for '{filename}' (outdated). Will re-compute hash.")
            except FileNotFoundError:
                # file does not exist => metadata is outdated
                return None
    return None


def write_download_metadata(local_dir: Path, filename: str, commit_hash: str, etag: str) -> None:
    """Write metadata about a file in the local directory related to a download process.

    Args:
        local_dir (`Path`):
            Path to the local directory in which files are downloaded.
    """
    paths = get_local_download_paths(local_dir, filename)
    with WeakFileLock(paths.lock_path):
        with paths.metadata_path.open("w") as f:
            f.write(f"{commit_hash}\n{etag}\n{time.time()}\n")


@lru_cache()
def _huggingface_dir(local_dir: Path) -> Path:
    """Return the path to the `.huggingface` directory in a local directory."""
    # Wrap in lru_cache to avoid overwriting the .gitignore file if called multiple times
    path = local_dir / ".huggingface"
    path.mkdir(exist_ok=True, parents=True)

    # Create a .gitignore file in the .huggingface directory if it doesn't exist
    # Should be thread-safe enough like this.
    gitignore = path / ".gitignore"
    gitignore_lock = path / ".gitignore.lock"
    if not gitignore.exists():
        with WeakFileLock(gitignore_lock):
            gitignore.write_text("*")
        try:
            gitignore_lock.unlink()
        except OSError:  # FileNotFoundError, PermissionError, etc.
            pass
    return path
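To show how these helpers fit together, here is a hedged sketch of a download flow. In practice they are driven by `hf_hub_download` rather than called directly; the commit hash, etag, and fake payload below are placeholders borrowed from the docstring example.

```python
from pathlib import Path

from huggingface_hub._local_folder import (
    get_local_download_paths,
    read_download_metadata,
    write_download_metadata,
)

local_dir = Path("data")
filename = "folder/file.parquet"
# Placeholder values taken from the docstring example above.
remote_commit_hash = "11c5a3d5811f50298f278a704980280950aedb10"
remote_etag = "7c5d3f4b8b76583b422fcb9189ad6c89d5d97a094541ce8932dce3ecabde1421"

paths = get_local_download_paths(local_dir, filename)  # also creates data/.huggingface/download/folder/
metadata = read_download_metadata(local_dir, filename)

if metadata is not None and metadata.etag == remote_etag:
    # Local copy is up-to-date: nothing to re-download.
    print(f"'{filename}' already up-to-date at {paths.file_path}")
else:
    # Download to the temporary ".incomplete" path, then move it into place.
    incomplete = paths.incomplete_path(remote_etag)
    incomplete.write_bytes(b"fake parquet payload")  # stand-in for the real HTTP download
    incomplete.replace(paths.file_path)
    write_download_metadata(local_dir, filename, commit_hash=remote_commit_hash, etag=remote_etag)
```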
