Support cloud storage in load_dataset #5281
Comments
Or for example an archive on GitHub releases! Before I added support for JXL (locally only, PR still pending) I was considering hosting my files on GitHub instead...
+1 to this. I would like to use `audiofolder` with a `data_dir` that's on S3, for example. I don't want to upload my dataset to the Hub, but I would find all the fingerprinting/caching features useful.
Adding to the conversation: Dask also uses fsspec for this (see "Dask: How to connect to remote data"). Happy to help on this feature :D
+1 to this feature request since I think it also tackles my use-case. I am collaborating with a team, working with a loading script which takes some time to generate the dataset artifacts. It would be very handy to use this as a cloud cache to avoid duplicating the effort. Currently we could use …
Makes sense! If you want to locally load a dataset that you `download_and_prepare`d on cloud storage, you would use …
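The reply above is cut off; as a hedged sketch, the documented way to `download_and_prepare` a dataset straight into cloud storage looks roughly like this (bucket name and credentials are placeholders, and how best to load the result back is exactly the part the truncated sentence was about):

```python
from datasets import load_dataset_builder

# Placeholder S3 credentials.
storage_options = {"key": "<aws-access-key-id>", "secret": "<aws-secret-access-key>"}

# Build the dataset directly into a bucket as Parquet files.
builder = load_dataset_builder("imdb")
builder.download_and_prepare(
    "s3://my-bucket/imdb",
    storage_options=storage_options,
    file_format="parquet",
)
```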
It seems currently the check at `src/datasets/utils/file_utils.py`, line 181 (commit b5672a9), is what decides how a URL is handled. I guess one can add another condition there that handles fsspec-style URLs.
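Purely as an illustration of what "another condition" could look like (this is not the actual `datasets` code), one could lean on fsspec's protocol detection:

```python
# Illustrative only, not the real file_utils code: treat anything whose
# fsspec protocol is not plain "file" as a remote URL, instead of
# hard-coding a list of schemes.
import fsspec


def is_fsspec_url(url_or_filename: str) -> bool:
    return fsspec.utils.get_protocol(str(url_or_filename)) != "file"
```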
Thanks @dwyatte for adding support for fsspec URLs. Let me just reopen this since the original issue is not resolved.
I'm not yet understanding how to use #5580 in order to use …
It's still not officially supported x) But you can try to update …
Thank you guys for your help on this and for merging #5580. I manually pulled the changes into my local `datasets` package (`datasets/utils/file_utils.py`), since it only seemed to be this file that was changed in the PR, and I'm getting the error: …
The current implementation depends on gcsfs/s3fs being able to authenticate through some other means, e.g., environment variables. For AWS, it looks like you can set …

Note that while testing this just now, I did note a discrepancy between gcsfs and s3fs that we might want to address, where gcsfs passes the timeout from …

It seems like some work is needed to unify kwargs across different fsspec implementations, so if the plan is to pass down …
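As a minimal sketch of the environment-variable route mentioned above, assuming the standard botocore/s3fs variables (the exact values and bucket name are placeholders, not from this thread):

```python
import os

import s3fs

# s3fs (via botocore) picks up credentials from the standard AWS environment
# variables, so nothing has to be passed through `datasets` itself.
os.environ["AWS_ACCESS_KEY_ID"] = "<aws-access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<aws-secret-access-key>"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

fs = s3fs.S3FileSystem()  # no explicit key/secret: uses the env credentials
print(fs.ls("my-bucket"))  # placeholder bucket name
```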
@lhoestq sorry for being a little dense here, but I am very keen to use fsspec / adlfs for a larger image dataset I have for object detection. I have to keep it on Azure storage and would also like to avoid a full download or zipping (so use …). Would you expect the setup to work for:

… ?

I appreciate that you mostly focus on S3, but it seems that, similar to the remaining cloud storage functionality, it should also work for Azure Blob Storage. I would imagine that something like (streaming true or false):

```python
d = load_dataset("new_dataset.py", storage_options=storage_options, split="train")
```

would work with:

```python
# new_dataset.py
...
_URL = "abfs://container/image_folder"
...
    # (presumably inside _split_generators(self, dl_manager))
    archive_path = dl_manager.download(_URL)
    split_metadata_paths = dl_manager.download(_METADATA_URLS)
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={
                "annotation_file_path": split_metadata_paths["train"],
                "files": dl_manager.iter_files(archive_path),
            },
        ),
        ...
    ]
```

but I get …
What version of … ?
@lhoestq `storage_options = { … }` and it throws an error: …
Hi @lhoestq, thanks for getting back to me :) you have been busy over the summer, I see... I was on … Now …

Any chance that it could work out of the box by supplying just the image folder, not the full list of image filenames? It seems that …

Let me know if there is anything I can do to help. Thanks,
@mayorblock This is not supported right now; you have to use archives or implement a way to get the list yourself.
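For anyone landing here later, a minimal sketch of the "get the list yourself" route with fsspec/adlfs might look like this (the container name, folder, and credential keys are placeholders, not taken from this thread):

```python
import fsspec

# Placeholder credentials; adlfs also accepts e.g. a SAS token or connection string.
storage_options = {"account_name": "<storage-account>", "account_key": "<account-key>"}

fs = fsspec.filesystem("abfs", **storage_options)

# List the image files under the folder and hand the full URLs to the loading script.
image_files = ["abfs://" + path for path in fs.glob("container/image_folder/**/*.jpg")]
```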
@hjq133 Can you update `fsspec`? (`pip install -U fsspec`)
Thanks for your suggestion, it works now!
I'm seeing the same problem as @hjq133 with the following versions: …
I am trying to do the same thing, but the loading is just hanging, without any error. |
Hi! S3 support is still experimental. It seems like there is an extra …
That worked! Thanks
@lhoestq Would this work either with an Azure Blob Storage container or its respective Azure Machine Learning datastore? If yes, what would that look like in code? I've tried a couple of combinations but no success so far, on the latest version of …

@mayorblock would you mind sharing how you got it to work? What did you pass as … ?
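Not an authoritative answer, but for plain Azure Blob Storage via adlfs, `storage_options` are typically just the `AzureBlobFileSystem` constructor arguments; a hedged sketch with placeholder values (the Azure ML datastore case is a separate question):

```python
from datasets import load_dataset

# Placeholder credentials: adlfs also accepts `sas_token`, `connection_string`,
# or service-principal fields (tenant_id/client_id/client_secret).
storage_options = {
    "account_name": "<storage-account>",
    "account_key": "<account-key>",
}

# Hypothetical loading-script path and split; whether this works end to end
# is exactly what this thread is discussing.
ds = load_dataset("new_dataset.py", storage_options=storage_options, split="train")
```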
This ticket would be of so much help.
@lhoestq I've been using this feature for the last year on GCS without problems, but I think we need to fix an issue with S3 and then document the supported calling patterns to reduce confusion. It looks like …

(specifically the … )

Is this an issue that needs to be fixed in s3fs? Or can we avoid passing these default storage options in some cases?

Update: I think #6127 is probably where these default storage options were introduced.
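Purely as an illustration of the "avoid passing these default storage options" idea (not actual `datasets` internals), one could filter options by the target protocol before handing them to fsspec:

```python
# Illustrative sketch: only forward the storage options that belong to the
# target URL's protocol, so that s3fs never receives defaults meant for the
# "hf" filesystem. The function name and the namespaced-options convention
# are assumptions for illustration.
import fsspec


def storage_options_for(urlpath: str, storage_options: dict) -> dict:
    protocol = fsspec.utils.get_protocol(urlpath)
    # If the options are namespaced per protocol, e.g. {"hf": {...}, "s3": {...}},
    # keep only the entry matching this URL; otherwise pass everything through.
    if storage_options and all(isinstance(v, dict) for v in storage_options.values()):
        return storage_options.get(protocol, {})
    return storage_options


fs = fsspec.filesystem(
    "s3",
    **storage_options_for("s3://bucket/data", {"hf": {"endpoint": "https://huggingface.co"}}),
)
```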
Hmm, not sure, maybe it has to do with …
Also running into this issue when downloading a Parquet dataset from S3 (upload worked fine using the current main branch). When commenting out the `__post_init__` part that sets 'hf', I am able to download the dataset.
Can someone paste a complete example for getting this to work? Running this:

```python
from os import environ

from datasets import load_from_disk, load_dataset
from s3fs import S3FileSystem

storage_options = {
    "key": "AzureDiamond",
    "secret": "hunter2",
    "endpoint_url": "https://fly.storage.tigris.dev",
}

model_name = "Qwen/Qwen2.5-32B"
dataset_name = "mlabonne/FineTome-100k"

dataset = load_dataset(
    f"s3://datnybakfu/model-ready/{model_name}/{dataset_name}",
    storage_options=storage_options,
    filesystem=S3FileSystem(),
    streaming=True,
)
```

I get the following error trace:

```
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-19-0ae7d5f39c46> in <cell line: 29>()
     27 dataset_name = "mlabonne/FineTome-100k"
     28
---> 29 dataset = load_dataset(
     30     f"s3://datnybakfu/model-ready/{model_name}/{dataset_name}",
     31     storage_options=storage_options,

/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, keep_in_memory, save_infos, revision, token, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2130
   2131     # Create a dataset builder
-> 2132     builder_instance = load_dataset_builder(
   2133         path=path,
   2134         name=name,

/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, token, storage_options, trust_remote_code, _require_default_config_name, **config_kwargs)
   1851     download_config = download_config.copy() if download_config else DownloadConfig()
   1852     download_config.storage_options.update(storage_options)
-> 1853     dataset_module = dataset_module_factory(
   1854         path,
   1855         revision=revision,

/usr/local/lib/python3.10/dist-packages/datasets/load.py in dataset_module_factory(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, cache_dir, trust_remote_code, _require_default_config_name, _require_custom_configs, **download_kwargs)
   1733             )
   1734         else:
-> 1735             raise FileNotFoundError(f"Couldn't find any data file at {relative_to_absolute_path(path)}.")
   1736
   1737

FileNotFoundError: Couldn't find any data file at /content/s3:/datnybakfu/model-ready/Qwen/Qwen2.5-32B/mlabonne/FineTome-100k.
```
Hi! Using …

cc @Wauplin in case you have a script to move data from S3 to HF?

I know there is this code, adapted from fsspec/filesystem_spec#909, that works well (it creates one commit per file):

```python
from multiprocessing.pool import ThreadPool

import fsspec
from tqdm import tqdm

s3_storage_options = {}  # add S3 credentials here if needed, e.g. {"key": aws_access_key_id, "secret": aws_secret_access_key}

a = fsspec.get_mapper("s3://bucket/dataset_folder", **s3_storage_options)
b = fsspec.get_mapper("hf://datasets/username/dataset_name")


def f(k):
    b[k] = a[k]


with ThreadPool(32) as p:
    keys = [key for key in a.keys() if key and not key.endswith("/")]  # ignore root and directories
    for _ in tqdm(p.imap_unordered(f, keys), total=len(keys)):
        pass
```
It's more complex than this. Unfortunately, the best I can advise is to download the data locally (at least partially) and then use …

The solution suggested above works ok as long as the number of files is not too big. The problem with 1 file == 1 commit is that:
1. it can be slow
2. it can lead to concurrency issues (concurrent commits)
3. it can lead to a slow repo (once 10,000 commits have been made)
4. it can lead to rate limits (maximum number of commits per hour)

All of this is taken care of with the …
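A minimal sketch of the "download locally, then upload" route, assuming the elided tool is `huggingface_hub` (the bucket, repo id, and the choice of `upload_large_folder` are assumptions, not quoted from the comment):

```python
import s3fs
from huggingface_hub import HfApi

# Placeholder S3 credentials and paths.
fs = s3fs.S3FileSystem(key="<aws-access-key-id>", secret="<aws-secret-access-key>")
fs.get("my-bucket/dataset_folder", "local_dataset", recursive=True)

# upload_large_folder batches files into a reasonable number of commits,
# resumes on failure, and avoids the commit-rate issues described above.
api = HfApi()
api.upload_large_folder(
    repo_id="username/dataset_name",
    repo_type="dataset",
    folder_path="local_dataset",
)
```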
Would be nice to be able to do

…

The idea would be to use `fsspec` as in `download_and_prepare` and `save_to_disk`.

This has been requested several times already. Some users want to use their data from private cloud storage to train models.
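Just to make the request concrete, a hypothetical call could look like the sketch below; the signature, bucket path, and credentials are illustrative assumptions, since this is precisely the feature being requested:

```python
from datasets import load_dataset

# Placeholder credentials for a private bucket.
storage_options = {"key": "<aws-access-key-id>", "secret": "<aws-secret-access-key>"}

# Hypothetical: point load_dataset directly at files on cloud storage via fsspec.
ds = load_dataset(
    "csv",
    data_files="s3://my-private-bucket/data/*.csv",
    storage_options=storage_options,
)
```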
related:
#3490
#5244
forum