
Support cloud storage in load_dataset #5281

Open
lhoestq opened this issue Nov 22, 2022 · 31 comments · Fixed by #5580
Labels: enhancement (New feature or request), good second issue (Issues a bit more difficult than "Good First" issues)

Comments

lhoestq (Member) commented Nov 22, 2022

Would be nice to be able to do

data_files=["s3://..."]  # or gs:// or any cloud storage path
storage_options = {...}
load_dataset(..., data_files=data_files, storage_options=storage_options)

The idea would be to use fsspec as in download_and_prepare and save_to_disk.
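
For illustration, a minimal sketch of how fsspec already resolves such paths today (bucket name and credentials below are placeholders, and s3fs/gcsfs must be installed for the corresponding scheme):

import fsspec

# Placeholder credentials; any fsspec-compatible options work here
storage_options = {"key": "<aws_access_key_id>", "secret": "<aws_secret_access_key>"}

# fsspec picks the right filesystem implementation (s3fs, gcsfs, ...) from the URL scheme
fs, _, paths = fsspec.get_fs_token_paths("s3://my-bucket/data/train.json", storage_options=storage_options)
with fs.open(paths[0]) as f:
    head = f.read(100)  # read the first bytes of the remote file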

This has been requested several times already. Some users want to use their data from private cloud storage to train models.

related:

#3490
#5244
forum

lhoestq added the enhancement label on Nov 22, 2022
alexjc commented Nov 25, 2022

Or for example an archive on GitHub releases! Before I added support for JXL (locally only, PR still pending) I was considering hosting my files on GitHub instead...

@iceboundflame commented:

+1 to this. I would like to use 'audiofolder' with a data_dir that's on S3, for example. I don't want to upload my dataset to the Hub, but I would find all the fingerprinting/caching features useful.

Dref360 (Contributor) commented Dec 12, 2022

Adding to the conversation, Dask also uses fsspec for this feature.

Dask: How to connect to remote data

Happy to help on this feature :D

lhoestq added the good second issue label on Feb 1, 2023
@eballesteros commented:

+1 to this feature request since I think it also tackles my use-case. I am collaborating with a team, working with a loading script which takes some time to generate the dataset artifacts. It would be very handy to use this as a cloud cache to avoid duplicating the effort.

Currently we can use builder.download_and_prepare(path_to_cloud_storage, storage_options, ...) to cache the artifacts to cloud storage, but then builder.as_dataset() raises NotImplementedError: Loading a dataset cached in SomeCloudFileSystem is not supported.
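
(For context, roughly what that looks like with the current API; the dataset name, bucket and credentials below are placeholders:)

from datasets import load_dataset_builder

storage_options = {"key": "<aws_access_key_id>", "secret": "<aws_secret_access_key>"}  # placeholder credentials
builder = load_dataset_builder("imdb")  # any builder / loading script

# Writing the prepared files to cloud storage works today...
builder.download_and_prepare("s3://my-bucket/imdb", storage_options=storage_options, file_format="parquet")

# ...but reading them back through the builder does not:
# builder.as_dataset()  # raises NotImplementedError for remote caches, as described above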

lhoestq (Member, Author) commented Feb 8, 2023

Makes sense! If you want to locally load a dataset that you download_and_prepare'd to cloud storage, you would indeed use load_dataset(path_to_cloud_storage). It would download the data from the cloud storage, cache it locally, and return a Dataset.
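
(Something like this, once supported; the bucket path is a placeholder and this call is not implemented at the time of this comment:)

from datasets import load_dataset

# Envisioned usage: point load_dataset at the directory that download_and_prepare wrote to
ds = load_dataset("s3://my-bucket/imdb", storage_options={"key": "<key>", "secret": "<secret>"})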

kyamagu commented Feb 16, 2023

It seems that currently the cached_path function handles all URLs via get_from_cache, which only supports ftp and http(s), here:

if is_remote_url(url_or_filename):

I guess one can add another condition that handles s3:// or gs:// URLs via fsspec here.
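
(A hedged sketch of what such a branch could look like; the function names below are illustrative, not the actual datasets internals:)

import fsspec
from fsspec.core import split_protocol

def is_fsspec_url(url_or_filename):
    protocol, _ = split_protocol(url_or_filename)
    return protocol is not None and protocol not in ("http", "https", "ftp", "file")

def fetch_to_cache(url, local_path, storage_options=None):
    if is_fsspec_url(url):
        fs, _, paths = fsspec.get_fs_token_paths(url, storage_options=storage_options or {})
        fs.get(paths[0], local_path)  # copy e.g. an s3:// or gs:// object into the local cache
        return local_path
    ...  # otherwise fall back to the existing HTTP/FTP handling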

dwyatte (Contributor) commented Feb 27, 2023

I could use this functionality, so I put together a PR using @kyamagu's suggestion to use fsspec in datasets.utils.file_utils

#5580

lhoestq (Member, Author) commented Mar 11, 2023

Thanks @dwyatte for adding support for fsspec URLs.

Let me just reopen this since the original issue is not resolved

lhoestq reopened this on Mar 11, 2023
@janmaltel commented:

I don't yet understand how to use #5580 in order to call load_dataset(data_files="s3://..."). Any help/example would be much appreciated :) Thanks!

lhoestq (Member, Author) commented Mar 22, 2023

It's still not officially supported x) But you can try updating request_etag in file_utils.py to use fsspec_head instead of http_head. It is responsible for getting the ETags of the remote files for caching. This change may do the trick for S3 URLs.
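
(A rough sketch of the idea; request_etag/fsspec_head are datasets internals, and the body below is a guess rather than the actual implementation:)

import fsspec

def fsspec_etag(urlpath, storage_options=None):
    fs, _, paths = fsspec.get_fs_token_paths(urlpath, storage_options=storage_options or {})
    info = fs.info(paths[0])
    # s3fs/gcsfs usually expose an ETag in the file info; fall back to the size as a weak fingerprint
    return info.get("ETag") or info.get("etag") or str(info.get("size"))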

@ssabatier commented:

Thank you both for your help on this and for merging #5580. I manually pulled the changes into my local datasets package (datasets.utils.file_utils.py), since that seemed to be the only file changed in the PR, and I'm getting the error:
InvalidSchema: No connection adapters were found for 's3://bucket/folder/'. I'm calling load_dataset with the S3 URI. When I use the S3 URL I get HTTPError: 403 Client Error.
Am I not supposed to use the S3 URI? How do I pull in the changes from this merge? I'm running datasets 2.10.1.

dwyatte (Contributor) commented Mar 26, 2023

The current implementation depends on gcsfs/s3fs being able to authenticate through some other means, e.g., environment variables. For AWS, it looks like you can set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN.
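
(For example, with placeholder values:)

import os
from datasets import load_dataset

# s3fs picks these up automatically when no explicit credentials are passed
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"
os.environ["AWS_SESSION_TOKEN"] = "<session-token>"  # only needed for temporary credentials

ds = load_dataset("csv", data_files="s3://my-bucket/data/train.csv")  # placeholder bucket/path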

Note that while testing this just now, I noticed a discrepancy between gcsfs and s3fs that we might want to address: gcsfs passes the timeout from storage_options down into aiohttp.ClientSession.request, but s3fs does not handle it (it tries to pass it to the aiobotocore.session.AioSession constructor, raising TypeError: __init__() got an unexpected keyword argument 'requests_timeout').

It seems like it would take some work to unify kwargs across the different fsspec implementations, so if the plan is to pass down storage_options, I wonder if we should just let users control the timeout (and other kwargs) that way, and fall back to the default if not specified?

dwyatte (Contributor) commented Mar 26, 2023

(Quoting the previous comment about the gcsfs/s3fs timeout discrepancy.)

@lhoestq here's a small PR for this: #5673

mayorblock commented Aug 7, 2023

@lhoestq sorry for being a little dense here, but I am very keen to use fsspec/adlfs for a larger image dataset I have for object detection. I have to keep it on Azure storage and would also like to avoid a full download or zipping (so I use load_dataset(..., streaming=True)). So this development is a godsend :) only... I am unable to make it work.

Would you expect the setup to work for:

  • Azure Blob Storage
  • image files (not the standard formats: json, parquet, ...)?

I appreciate that you mostly focus on S3, but it seems that, like the rest of the cloud storage functionality, it should also work for Azure Blob Storage.

I would imagine that something like this (streaming True or False):

d = load_dataset("new_dataset.py", storage_options=storage_options, split="train")

would work with

# new_dataset.py
...
_URL = "abfs://container/image_folder"

archive_path = dl_manager.download(_URL)
split_metadata_paths = dl_manager.download(_METADATA_URLS)
return [
    datasets.SplitGenerator(
        name=datasets.Split.TRAIN,
        gen_kwargs={
            "annotation_file_path": split_metadata_paths["train"],
            "files": dl_manager.iter_files(archive_path),
        },
    ),
    ...

but I get

Traceback (most recent call last):
  ...
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/load.py", line 1797, in load_dataset
    builder_instance.download_and_prepare(
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/builder.py", line 890, in download_and_prepare
    self._download_and_prepare(
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
    super()._download_and_prepare(
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/builder.py", line 963, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.cache/huggingface/modules/datasets_modules/datasets/new_dataset/dd26a081eab90074f41fa2c821b458424fde393cc73d3d8241aca956d1fb3aa0/new_dataset_script.py", line 56, in _split_generators
    archive_path = dl_manager.download(_URL)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/download/download_manager.py", line 427, in download
    downloaded_path_or_paths = map_nested(
                               ^^^^^^^^^^^
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 435, in map_nested
    return function(data_struct)
           ^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/download/download_manager.py", line 453, in _download
    return cached_path(url_or_filename, download_config=download_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/utils/file_utils.py", line 206, in cached_path
    raise ValueError(f"unable to parse {url_or_filename} as a URL or as a local path")
ValueError: unable to parse abfs://container/image_folder as a URL or as a local path

lhoestq (Member, Author) commented Aug 16, 2023

What version of datasets are you using ?

hjq133 commented Aug 17, 2023

@lhoestq
Hello, I still have a problem loading JSON from S3:

storage_options = {
    "key": xxxx,
    "secret": xxx,
    "endpoint_url": xxxx,
}
path = 's3://xxx/xxxxxxx.json'
dataset = load_dataset("json", data_files=path, storage_options=storage_options)

and it throws an error:
TypeError: AioSession.__init__() got an unexpected keyword argument 'hf'
I am using the latest 2.14.4_dev0 version.

@mayorblock commented:

Hi @lhoestq, thanks for getting back to me :) you have been busy over the summer I see... I was on 2.12.0. I have updated to 2.14.4.

Now d = load_dataset("new_dataset.py", storage_options=storage_options, split="train", streaming=True) works for Azure blob storage (with a local data loader script) when I explicitly list all blobs (I am struggling to make fs.ls(<path>) work in the script to make the list available to the download manager).

Any chance that it could work out-of-the-box by supplying just the image folder, not the full list of image filenames? It seems that dl_manager.download(_URL) always wants one or more (possibly archived) files. In my situation, where I don't want to archive or download, it would be great to just supply the folder (seems reasonably doable with fsspec).

Let me know if there is anything I can do to help.

Thanks,

lhoestq (Member, Author) commented Aug 17, 2023

(Quoting @mayorblock's question about supplying just the image folder instead of a full list of image filenames.)

@mayorblock This is not supported right now, you have to use archives or implement a way to get the list by yourself
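
(For instance, a hedged sketch of building that list yourself inside _split_generators, assuming adlfs is installed; container and folder names are placeholders:)

import fsspec

# storage_options holds the Azure credentials passed to load_dataset
fs = fsspec.filesystem("abfs", **storage_options)
image_urls = ["abfs://" + p for p in fs.ls("container/image_folder") if p.endswith((".jpg", ".png"))]
downloaded_files = dl_manager.download(image_urls)  # explicit file list instead of a folder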

TypeError: AioSession.__init__() got an unexpected keyword argument 'hf'

@hjq133 Can you update fsspec and try again ?

pip install -U fsspec

hjq133 commented Aug 18, 2023

Thanks for your suggestion, it works now!

@mariokostelac commented:

I'm seeing the same problem as @hjq133 with the following versions:

datasets==2.15.0
s3fs==2023.10.0
fsspec==2023.10.0

@aarbelle commented:

(Quoting @hjq133's earlier comment about TypeError: AioSession.__init__() got an unexpected keyword argument 'hf' when loading JSON from S3 with storage_options.)

I am trying to do the same thing, but the loading just hangs, without any error.
@lhoestq, is there any documentation on how to load from private S3 buckets?

lhoestq (Member, Author) commented Nov 27, 2023

Hi! S3 support is still experimental. It seems like there is an extra hf field passed to the s3fs storage_options that causes this error. I just checked the source code of _prepare_single_hop_path_and_storage_options, and I think you can try passing your own storage_options={"s3": {...}} explicitly. Also note that it's generally better to load datasets from the HF Hub (we run extensive tests and benchmarks for speed and robustness).
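
(A hedged example of that suggestion; bucket, endpoint and credentials are placeholders:)

from datasets import load_dataset

# Namespace the options under "s3" so they are handed to s3fs directly
storage_options = {
    "s3": {
        "key": "<aws_access_key_id>",
        "secret": "<aws_secret_access_key>",
        "endpoint_url": "https://<custom-endpoint>",
    }
}
ds = load_dataset("json", data_files="s3://my-bucket/data.json", storage_options=storage_options)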

@aarbelle commented:

That worked, thanks!
It seems, though, that data_dir=... doesn't work on S3, only data_files.
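
(In other words, at the time of the comment, something like the first call below worked while the second did not; paths are placeholders:)

# works: explicit file(s)
load_dataset("json", data_files="s3://my-bucket/folder/data.json", storage_options=storage_options)

# did not work: pointing at a folder
# load_dataset("json", data_dir="s3://my-bucket/folder", storage_options=storage_options)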

csanadpoda commented Nov 28, 2023

@lhoestq Would this work either with an Azure Blob Storage Container or its respective Azure Machine Learning Datastore? If yes, what would that look like in code? I've tried a couple of combinations but no success so far, on the latest version of datasets. I need to migrate a dataset to the Azure cloud, load_dataset("path_to_data") worked perfectly while the files were local only. Thank you!

@mayorblock would you mind sharing how you got it to work? What did you pass as storage_options? Would it maybe work without a custom data loader script?

@Sritharan-racap commented:

This ticket would be of so much help.

dwyatte (Contributor) commented Jan 15, 2024

@lhoestq I've been using this feature for the last year on GCS without problems, but I think we need to fix an issue with S3 and then document the supported calling patterns to reduce confusion.

It looks like datasets uses a default DownloadConfig, which is where some potentially unintended storage options are getting passed to fsspec:

DownloadConfig(
	cache_dir=None, 
	force_download=False, 
	resume_download=False, 
	local_files_only=False, 
	proxies=None, 
	user_agent=None, 
	extract_compressed_file=False, 
	force_extract=False, 
	delete_extracted=False, 
	use_etag=True, 
	num_proc=None, 
	max_retries=1, 
	token=None, 
	ignore_url_params=False, 
	storage_options={'hf': {'token': None, 'endpoint': 'https://huggingface.co'}}, 
	download_desc=None
)

(specifically the storage_options={'hf': {'token': None, 'endpoint': 'https://huggingface.co'}} part)

gcsfs is robust to the extra key in storage options for whatever reason, but s3fs is not (haven't dug into why). I'm unable to test adlfs but it looks like people here got it working

Is this an issue that needs to be fixed in s3fs? Or can we avoid passing these default storage options in some cases?

Update: I think probably #6127 is where these default storage options were introduced
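
(A quick way to see the default storage options described above, on the datasets versions discussed in this thread; inspection only, not a fix:)

from datasets import DownloadConfig

cfg = DownloadConfig()
print(cfg.storage_options)
# e.g. {'hf': {'token': None, 'endpoint': 'https://huggingface.co'}}
# s3fs forwards unknown top-level storage_options keys to aiobotocore, hence the
# "unexpected keyword argument 'hf'" TypeError reported earlier in this thread.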

lhoestq (Member, Author) commented Jan 16, 2024

Hmm not sure, maybe it has to do with _prepare_single_hop_path_and_storage_options returning the "hf" storage options when it shouldn't

@Crashkurs commented:

Also running into this issue downloading a parquet dataset from S3 (Upload worked fine using current main branch).
dataset = Dataset.from_parquet('s3://path-to-file')
raises
TypeError: AioSession.__init__() got an unexpected keyword argument 'hf'
I found that the issue was introduced in #6028.

When I comment out the __post_init__ part that sets 'hf', I am able to download the dataset.

Xe commented Nov 14, 2024

Can someone paste a complete example for getting this to work? Running this:

from datasets import load_from_disk, load_dataset
from os import environ
from s3fs import S3FileSystem

storage_options = {
    "key": "AzureDiamond",
    "secret": "hunter2",
    "endpoint_url": "https://fly.storage.tigris.dev"
}

model_name = "Qwen/Qwen2.5-32B"
dataset_name = "mlabonne/FineTome-100k"

dataset = load_dataset(
    f"s3://datnybakfu/model-ready/{model_name}/{dataset_name}",
    storage_options=storage_options,
    filesystem=S3FileSystem(),
    streaming=True,
)

I get the following error trace:

Stacktrace:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-19-0ae7d5f39c46> in <cell line: 29>()
     27 dataset_name = "mlabonne/FineTome-100k"
     28
---> 29 dataset = load_dataset(
     30     f"s3://datnybakfu/model-ready/{model_name}/{dataset_name}",
     31     storage_options=storage_options,

/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, keep_in_memory, save_infos, revision, token, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2130
   2131     # Create a dataset builder
-> 2132     builder_instance = load_dataset_builder(
   2133         path=path,
   2134         name=name,

/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, token, storage_options, trust_remote_code, _require_default_config_name, **config_kwargs)
   1851     download_config = download_config.copy() if download_config else DownloadConfig()
   1852     download_config.storage_options.update(storage_options)
-> 1853     dataset_module = dataset_module_factory(
   1854         path,
   1855         revision=revision,

/usr/local/lib/python3.10/dist-packages/datasets/load.py in dataset_module_factory(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, cache_dir, trust_remote_code, _require_default_config_name, _require_custom_configs, **download_kwargs)
   1733     )
   1734     else:
-> 1735         raise FileNotFoundError(f"Couldn't find any data file at {relative_to_absolute_path(path)}.")
   1736
   1737

FileNotFoundError: Couldn't find any data file at /content/s3:/datnybakfu/model-ready/Qwen/Qwen2.5-32B/mlabonne/FineTome-100k.

lhoestq (Member, Author) commented Nov 15, 2024

Hi, using s3:// as first argument in load_dataset directly is not supported. You can pass a local path or a HF repository (in the form username/dataset_name)

cc @Wauplin in case you have a script to move data from S3 to HF? I know there is this code adapted from fsspec/filesystem_spec#909 that works well (it creates one commit per file):

from multiprocessing.pool import ThreadPool

import fsspec
from tqdm import tqdm

s3_storage_options = {}  # add S3 credentials here if needed, e.g. {"key": aws_access_key_id, "secret": aws_secret_access_key}
a = fsspec.get_mapper("s3://bucket/dataset_folder", **s3_storage_options)
b = fsspec.get_mapper("hf://datasets/username/dataset_name")

def f(k):
    b[k] = a[k]

with ThreadPool(32) as p:
    keys = [key for key in a.keys() if key and not key.endswith("/")]  # ignore root and directories
    for _ in tqdm(p.imap_unordered(f, keys), total=len(keys)):
        pass

Wauplin (Contributor) commented Nov 15, 2024

It's more complex than that. Unfortunately, the best I can advise is to download the data locally (at least partially) and then use huggingface-cli upload-large-folder (see docs) to make sure the data is properly uploaded.

The solution suggested above works okay as long as the number of files is not too big. The problem with 1 file == 1 commit is that 1. it can be slow, 2. it can lead to concurrency issues (concurrent commits), 3. it can lead to a slow repo (once 10,000 commits have been made), and 4. it can hit rate limits (there is a maximum number of commits per hour). All of this is taken care of by upload_large_folder, but it requires the data to be downloaded locally. If the local disk is smaller than the dataset, you can download and re-upload chunk by chunk (folder by folder, for instance).
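
(A hedged sketch of that chunk-by-chunk approach; the bucket, repo id and folder layout are placeholders, and it assumes s3fs and a recent huggingface_hub are installed:)

import shutil
import fsspec
from huggingface_hub import HfApi

api = HfApi()
fs = fsspec.filesystem("s3")  # add credentials via storage options if needed

for folder in fs.ls("my-bucket/dataset_folder"):  # one top-level folder at a time
    local_dir = "/tmp/chunk"
    fs.get(folder, local_dir, recursive=True)  # download the chunk locally
    api.upload_large_folder(  # resumable, rate-limit-friendly upload
        repo_id="username/dataset_name",
        repo_type="dataset",
        folder_path=local_dir,
    )
    shutil.rmtree(local_dir)  # free local disk before the next chunk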
