
Support cloud storage in load_dataset #5281

Open
lhoestq opened this issue Nov 22, 2022 · 31 comments · Fixed by #5580
Labels: enhancement (New feature or request), good second issue (Issues a bit more difficult than "Good First" issues)

Comments

lhoestq (Member) commented Nov 22, 2022

Would be nice to be able to do

data_files=["s3://..."]  # or gs:// or any cloud storage path
storage_options = {...}
load_dataset(..., data_files=data_files, storage_options=storage_options)

The idea would be to use fsspec as in download_and_prepare and save_to_disk.
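
For illustration, a minimal sketch of how fsspec already resolves such paths today (bucket name and credentials below are placeholders, and s3fs/gcsfs must be installed for the corresponding scheme):

import fsspec

# Placeholder credentials; any fsspec-compatible options work here
storage_options = {"key": "<aws_access_key_id>", "secret": "<aws_secret_access_key>"}

# fsspec picks the right filesystem implementation (s3fs, gcsfs, ...) from the URL scheme
fs, _, paths = fsspec.get_fs_token_paths("s3://my-bucket/data/train.json", storage_options=storage_options)
with fs.open(paths[0]) as f:
    head = f.read(100)  # read the first bytes of the remote file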

This has been requested several times already. Some users want to use their data from private cloud storage to train models.

related:

#3490
#5244
forum

lhoestq added the enhancement label on Nov 22, 2022
alexjc commented Nov 25, 2022

Or for example an archive on GitHub releases! Before I added support for JXL (locally only, PR still pending) I was considering hosting my files on GitHub instead...

@iceboundflame commented:

+1 to this. I would like to use 'audiofolder' with a data_dir that's on S3, for example. I don't want to upload my dataset to the Hub, but I would find all the fingerprinting/caching features useful.

Dref360 (Contributor) commented Dec 12, 2022

Adding to the conversation, Dask also uses fsspec for this feature.

Dask: How to connect to remote data

Happy to help on this feature :D

lhoestq added the good second issue label on Feb 1, 2023
@eballesteros commented:

+1 to this feature request since I think it also tackles my use-case. I am collaborating with a team, working with a loading script which takes some time to generate the dataset artifacts. It would be very handy to use this as a cloud cache to avoid duplicating the effort.

Currently we can use builder.download_and_prepare(path_to_cloud_storage, storage_options, ...) to cache the artifacts to cloud storage, but then builder.as_dataset() raises NotImplementedError: Loading a dataset cached in SomeCloudFileSystem is not supported.
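
(For context, roughly what that looks like with the current API; the dataset name, bucket and credentials below are placeholders:)

from datasets import load_dataset_builder

storage_options = {"key": "<aws_access_key_id>", "secret": "<aws_secret_access_key>"}  # placeholder credentials
builder = load_dataset_builder("imdb")  # any builder / loading script

# Writing the prepared files to cloud storage works today...
builder.download_and_prepare("s3://my-bucket/imdb", storage_options=storage_options, file_format="parquet")

# ...but reading them back through the builder does not:
# builder.as_dataset()  # raises NotImplementedError for remote caches, as described above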

lhoestq (Member, Author) commented Feb 8, 2023

Makes sense! If you want to locally load a dataset that you download_and_prepare'd to cloud storage, you would indeed use load_dataset(path_to_cloud_storage). It would download the data from the cloud storage, cache it locally, and return a Dataset.
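
(Something like this, once supported; the bucket path is a placeholder and this call is not implemented at the time of this comment:)

from datasets import load_dataset

# Envisioned usage: point load_dataset at the directory that download_and_prepare wrote to
ds = load_dataset("s3://my-bucket/imdb", storage_options={"key": "<key>", "secret": "<secret>"})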

kyamagu commented Feb 16, 2023

It seems that currently the cached_path function handles all URLs via get_from_cache, which only supports ftp and http(s), here:

if is_remote_url(url_or_filename):

I guess one can add another condition that handles s3:// or gs:// URLs via fsspec here.
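
(A hedged sketch of what such a branch could look like; the function names below are illustrative, not the actual datasets internals:)

import fsspec
from fsspec.core import split_protocol

def is_fsspec_url(url_or_filename):
    protocol, _ = split_protocol(url_or_filename)
    return protocol is not None and protocol not in ("http", "https", "ftp", "file")

def fetch_to_cache(url, local_path, storage_options=None):
    if is_fsspec_url(url):
        fs, _, paths = fsspec.get_fs_token_paths(url, storage_options=storage_options or {})
        fs.get(paths[0], local_path)  # copy e.g. an s3:// or gs:// object into the local cache
        return local_path
    ...  # otherwise fall back to the existing HTTP/FTP handling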

dwyatte (Contributor) commented Feb 27, 2023

I could use this functionality, so I put together a PR using @kyamagu's suggestion to use fsspec in datasets.utils.file_utils

#5580

lhoestq (Member, Author) commented Mar 11, 2023

Thanks @dwyatte for adding support for fsspec URLs.

Let me just reopen this since the original issue is not resolved

lhoestq reopened this on Mar 11, 2023
@janmaltel commented:

I don't yet understand how to use #5580 in order to call load_dataset(data_files="s3://..."). Any help/example would be much appreciated :) Thanks!

lhoestq (Member, Author) commented Mar 22, 2023

It's still not officially supported x) But you can try updating request_etag in file_utils.py to use fsspec_head instead of http_head. It is responsible for getting the ETags of the remote files for caching. This change may do the trick for S3 URLs.
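
(A rough sketch of the idea; request_etag/fsspec_head are datasets internals, and the body below is a guess rather than the actual implementation:)

import fsspec

def fsspec_etag(urlpath, storage_options=None):
    fs, _, paths = fsspec.get_fs_token_paths(urlpath, storage_options=storage_options or {})
    info = fs.info(paths[0])
    # s3fs/gcsfs usually expose an ETag in the file info; fall back to the size as a weak fingerprint
    return info.get("ETag") or info.get("etag") or str(info.get("size"))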

@ssabatier commented:

Thank you both for your help on this and for merging #5580. I manually pulled the changes into my local datasets package (datasets.utils.file_utils.py), since that seemed to be the only file changed in the PR, and I'm getting the error:
InvalidSchema: No connection adapters were found for 's3://bucket/folder/'. I'm calling load_dataset with the S3 URI. When I use the S3 URL I get HTTPError: 403 Client Error.
Am I not supposed to use the S3 URI? How do I pull in the changes from this merge? I'm running datasets 2.10.1.

dwyatte (Contributor) commented Mar 26, 2023

The current implementation depends on gcsfs/s3fs being able to authenticate through some other means, e.g., environment variables. For AWS, it looks like you can set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN.
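
(For example, with placeholder values:)

import os
from datasets import load_dataset

# s3fs picks these up automatically when no explicit credentials are passed
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"
os.environ["AWS_SESSION_TOKEN"] = "<session-token>"  # only needed for temporary credentials

ds = load_dataset("csv", data_files="s3://my-bucket/data/train.csv")  # placeholder bucket/path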

Note that while testing this just now, I noticed a discrepancy between gcsfs and s3fs that we might want to address: gcsfs passes the timeout from storage_options down into aiohttp.ClientSession.request, but s3fs does not handle it (it tries to pass it to the aiobotocore.session.AioSession constructor, raising TypeError: __init__() got an unexpected keyword argument 'requests_timeout').

It seems like it would take some work to unify kwargs across the different fsspec implementations, so if the plan is to pass down storage_options, I wonder if we should just let users control the timeout (and other kwargs) that way, and fall back to the default if not specified?

dwyatte (Contributor) commented Mar 26, 2023

(Quoting the previous comment about the gcsfs/s3fs timeout discrepancy.)

@lhoestq here's a small PR for this: #5673

mayorblock commented Aug 7, 2023

@lhoestq sorry for being a little dense here, but I am very keen to use fsspec/adlfs for a larger image dataset I have for object detection. I have to keep it on Azure storage and would also like to avoid a full download or zipping (so I use load_dataset(..., streaming=True)). So this development is a godsend :) only... I am unable to make it work.

Would you expect the setup to work for:

  • Azure Blob Storage
  • image files (not the standard formats: json, parquet, ...)?

I appreciate that you mostly focus on S3, but it seems that, like the rest of the cloud storage functionality, it should also work for Azure Blob Storage.

I would imagine that something like this (streaming True or False):

d = load_dataset("new_dataset.py", storage_options=storage_options, split="train")

would work with

# new_dataset.py
...
_URL = "abfs://container/image_folder"

archive_path = dl_manager.download(_URL)
split_metadata_paths = dl_manager.download(_METADATA_URLS)
return [
    datasets.SplitGenerator(
        name=datasets.Split.TRAIN,
        gen_kwargs={
            "annotation_file_path": split_metadata_paths["train"],
            "files": dl_manager.iter_files(archive_path),
        },
    ),
    ...

but I get

Traceback (most recent call last):
  ...
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/load.py", line 1797, in load_dataset
    builder_instance.download_and_prepare(
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/builder.py", line 890, in download_and_prepare
    self._download_and_prepare(
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
    super()._download_and_prepare(
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/builder.py", line 963, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.cache/huggingface/modules/datasets_modules/datasets/new_dataset/dd26a081eab90074f41fa2c821b458424fde393cc73d3d8241aca956d1fb3aa0/new_dataset_script.py", line 56, in _split_generators
    archive_path = dl_manager.download(_URL)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/download/download_manager.py", line 427, in download
    downloaded_path_or_paths = map_nested(
                               ^^^^^^^^^^^
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 435, in map_nested
    return function(data_struct)
           ^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/download/download_manager.py", line 453, in _download
    return cached_path(url_or_filename, download_config=download_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/utils/file_utils.py", line 206, in cached_path
    raise ValueError(f"unable to parse {url_or_filename} as a URL or as a local path")
ValueError: unable to parse abfs://container/image_folder as a URL or as a local path

lhoestq (Member, Author) commented Aug 16, 2023

What version of datasets are you using ?

hjq133 commented Aug 17, 2023

@lhoestq
Hello, I still have a problem loading JSON from S3:

storage_options = {
    "key": xxxx,
    "secret": xxx,
    "endpoint_url": xxxx,
}
path = 's3://xxx/xxxxxxx.json'
dataset = load_dataset("json", data_files=path, storage_options=storage_options)

and it throws an error:
TypeError: AioSession.__init__() got an unexpected keyword argument 'hf'
I am using the latest 2.14.4_dev0 version.

@mayorblock commented:

Hi @lhoestq, thanks for getting back to me :) you have been busy over the summer I see... I was on 2.12.0. I have updated to 2.14.4.

Now d = load_dataset("new_dataset.py", storage_options=storage_options, split="train", streaming=True) works for Azure blob storage (with a local data loader script) when I explicitly list all blobs (I am struggling to make fs.ls(<path>) work in the script to make the list available to the download manager).

Any chance that it could work out-of-the-box by supplying just the image folder, not the full list of image filenames? It seems that dl_manager.download(_URL) always wants one or more (possibly archived) files. In my situation, where I don't want to archive or download, it would be great to just supply the folder (seems reasonably doable with fsspec).

Let me know if there is anything I can do to help.

Thanks,

lhoestq (Member, Author) commented Aug 17, 2023

(Quoting @mayorblock's question about supplying just the image folder instead of a full list of image filenames.)

@mayorblock This is not supported right now, you have to use archives or implement a way to get the list by yourself
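
(For instance, a hedged sketch of building that list yourself inside _split_generators, assuming adlfs is installed; container and folder names are placeholders:)

import fsspec

# storage_options holds the Azure credentials passed to load_dataset
fs = fsspec.filesystem("abfs", **storage_options)
image_urls = ["abfs://" + p for p in fs.ls("container/image_folder") if p.endswith((".jpg", ".png"))]
downloaded_files = dl_manager.download(image_urls)  # explicit file list instead of a folder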

TypeError: AioSession.__init__() got an unexpected keyword argument 'hf'

@hjq133 Can you update fsspec and try again ?

pip install -U fsspec

hjq133 commented Aug 18, 2023

Thanks for your suggestion, it works now!

@mariokostelac commented:

I'm seeing the same problem as @hjq133 with the following versions:

datasets==2.15.0
s3fs==2023.10.0
fsspec==2023.10.0

@aarbelle commented:

(Quoting @hjq133's earlier comment about TypeError: AioSession.__init__() got an unexpected keyword argument 'hf' when loading JSON from S3 with storage_options.)

I am trying to do the same thing, but the loading just hangs, without any error.
@lhoestq, is there any documentation on how to load from private S3 buckets?

lhoestq (Member, Author) commented Nov 27, 2023

Hi! S3 support is still experimental. It seems like there is an extra hf field passed to the s3fs storage_options that causes this error. I just checked the source code of _prepare_single_hop_path_and_storage_options, and I think you can try passing your own storage_options={"s3": {...}} explicitly. Also note that it's generally better to load datasets from the HF Hub (we run extensive tests and benchmarks for speed and robustness).
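
(A hedged example of that suggestion; bucket, endpoint and credentials are placeholders:)

from datasets import load_dataset

# Namespace the options under "s3" so they are handed to s3fs directly
storage_options = {
    "s3": {
        "key": "<aws_access_key_id>",
        "secret": "<aws_secret_access_key>",
        "endpoint_url": "https://<custom-endpoint>",
    }
}
ds = load_dataset("json", data_files="s3://my-bucket/data.json", storage_options=storage_options)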

@aarbelle commented:

That worked, thanks!
It seems, though, that data_dir=... doesn't work on S3, only data_files.
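
(In other words, at the time of the comment, something like the first call below worked while the second did not; paths are placeholders:)

# works: explicit file(s)
load_dataset("json", data_files="s3://my-bucket/folder/data.json", storage_options=storage_options)

# did not work: pointing at a folder
# load_dataset("json", data_dir="s3://my-bucket/folder", storage_options=storage_options)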

csanadpoda commented Nov 28, 2023

@lhoestq Would this work either with an Azure Blob Storage Container or its respective Azure Machine Learning Datastore? If yes, what would that look like in code? I've tried a couple of combinations but no success so far, on the latest version of datasets. I need to migrate a dataset to the Azure cloud, load_dataset("path_to_data") worked perfectly while the files were local only. Thank you!

@mayorblock would you mind sharing how you got it to work? What did you pass as storage_options? Would it maybe work without a custom data loader script?

@Sritharan-racap commented:

This ticket would be of so much help.

dwyatte (Contributor) commented Jan 15, 2024

@lhoestq I've been using this feature for the last year on GCS without problems, but I think we need to fix an issue with S3 and then document the supported calling patterns to reduce confusion.

It looks like datasets uses a default DownloadConfig, which is where some potentially unintended storage options are getting passed to fsspec:

DownloadConfig(
	cache_dir=None, 
	force_download=False, 
	resume_download=False, 
	local_files_only=False, 
	proxies=None, 
	user_agent=None, 
	extract_compressed_file=False, 
	force_extract=False, 
	delete_extracted=False, 
	use_etag=True, 
	num_proc=None, 
	max_retries=1, 
	token=None, 
	ignore_url_params=False, 
	storage_options={'hf': {'token': None, 'endpoint': 'https://huggingface.co'}}, 
	download_desc=None
)

(specifically the storage_options={'hf': {'token': None, 'endpoint': 'https://huggingface.co'}} part)

gcsfs is robust to the extra key in storage options for whatever reason, but s3fs is not (haven't dug into why). I'm unable to test adlfs but it looks like people here got it working

Is this an issue that needs to be fixed in s3fs? Or can we avoid passing these default storage options in some cases?

Update: I think probably #6127 is where these default storage options were introduced
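
(A quick way to see the default storage options described above, on the datasets versions discussed in this thread; inspection only, not a fix:)

from datasets import DownloadConfig

cfg = DownloadConfig()
print(cfg.storage_options)
# e.g. {'hf': {'token': None, 'endpoint': 'https://huggingface.co'}}
# s3fs forwards unknown top-level storage_options keys to aiobotocore, hence the
# "unexpected keyword argument 'hf'" TypeError reported earlier in this thread.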

lhoestq (Member, Author) commented Jan 16, 2024

Hmm not sure, maybe it has to do with _prepare_single_hop_path_and_storage_options returning the "hf" storage options when it shouldn't

@Crashkurs commented:

Also running into this issue downloading a parquet dataset from S3 (Upload worked fine using current main branch).
dataset = Dataset.from_parquet('s3://path-to-file')
raises
TypeError: AioSession.__init__() got an unexpected keyword argument 'hf'
I found that the issue was introduced in #6028.

When I comment out the __post_init__ part that sets 'hf', I am able to download the dataset.

Xe commented Nov 14, 2024

Can someone paste a complete example for getting this to work? Running this:

from datasets import load_from_disk, load_dataset
from os import environ
from s3fs import S3FileSystem

storage_options = {
    "key": "AzureDiamond",
    "secret": "hunter2",
    "endpoint_url": "https://fly.storage.tigris.dev"
}

model_name = "Qwen/Qwen2.5-32B"
dataset_name = "mlabonne/FineTome-100k"

dataset = load_dataset(
    f"s3://datnybakfu/model-ready/{model_name}/{dataset_name}",
    storage_options=storage_options,
    filesystem=S3FileSystem(),
    streaming=True,
)

I get the following error trace:

Stacktrace:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-19-0ae7d5f39c46> in <cell line: 29>()
     27 dataset_name = "mlabonne/FineTome-100k"
     28
---> 29 dataset = load_dataset(
     30     f"s3://datnybakfu/model-ready/{model_name}/{dataset_name}",
     31     storage_options=storage_options,

/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, keep_in_memory, save_infos, revision, token, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2130
   2131     # Create a dataset builder
-> 2132     builder_instance = load_dataset_builder(
   2133         path=path,
   2134         name=name,

/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, token, storage_options, trust_remote_code, _require_default_config_name, **config_kwargs)
   1851     download_config = download_config.copy() if download_config else DownloadConfig()
   1852     download_config.storage_options.update(storage_options)
-> 1853     dataset_module = dataset_module_factory(
   1854         path,
   1855         revision=revision,

/usr/local/lib/python3.10/dist-packages/datasets/load.py in dataset_module_factory(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, cache_dir, trust_remote_code, _require_default_config_name, _require_custom_configs, **download_kwargs)
   1733     )
   1734     else:
-> 1735         raise FileNotFoundError(f"Couldn't find any data file at {relative_to_absolute_path(path)}.")
   1736
   1737

FileNotFoundError: Couldn't find any data file at /content/s3:/datnybakfu/model-ready/Qwen/Qwen2.5-32B/mlabonne/FineTome-100k.

lhoestq (Member, Author) commented Nov 15, 2024

Hi, using s3:// as first argument in load_dataset directly is not supported. You can pass a local path or a HF repository (in the form username/dataset_name)

cc @Wauplin in case you have a script to move data from S3 to HF? I know there is this code adapted from fsspec/filesystem_spec#909 that works well (it creates one commit per file):

from multiprocessing.pool import ThreadPool

import fsspec
from tqdm import tqdm

s3_storage_options = {}  # add S3 credentials here if needed, e.g. {"key": aws_access_key_id, "secret": aws_secret_access_key}
a = fsspec.get_mapper("s3://bucket/dataset_folder", **s3_storage_options)
b = fsspec.get_mapper("hf://datasets/username/dataset_name")

def f(k):
    b[k] = a[k]

with ThreadPool(32) as p:
    keys = [key for key in a.keys() if key and not key.endswith("/")]  # ignore root and directories
    for _ in tqdm(p.imap_unordered(f, keys), total=len(keys)):
        pass

Wauplin (Contributor) commented Nov 15, 2024

It's more complex than that. Unfortunately, the best I can advise is to download the data locally (at least partially) and then use huggingface-cli upload-large-folder (see docs) to make sure the data is properly uploaded.

The solution suggested above works okay as long as the number of files is not too big. The problem with 1 file == 1 commit is that 1. it can be slow, 2. it can lead to concurrency issues (concurrent commits), 3. it can lead to a slow repo (once 10,000 commits have been made), and 4. it can hit rate limits (there is a maximum number of commits per hour). All of this is taken care of by upload_large_folder, but it requires the data to be downloaded locally. If the local disk is smaller than the dataset, you can download and re-upload chunk by chunk (folder by folder, for instance).
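
(A hedged sketch of that chunk-by-chunk approach; the bucket, repo id and folder layout are placeholders, and it assumes s3fs and a recent huggingface_hub are installed:)

import shutil
import fsspec
from huggingface_hub import HfApi

api = HfApi()
fs = fsspec.filesystem("s3")  # add credentials via storage options if needed

for folder in fs.ls("my-bucket/dataset_folder"):  # one top-level folder at a time
    local_dir = "/tmp/chunk"
    fs.get(folder, local_dir, recursive=True)  # download the chunk locally
    api.upload_large_folder(  # resumable, rate-limit-friendly upload
        repo_id="username/dataset_name",
        repo_type="dataset",
        folder_path=local_dir,
    )
    shutil.rmtree(local_dir)  # free local disk before the next chunk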
