
[FEAT] support df.to_parquet and df.read_parquet() #165

Merged: 24 commits, Mar 3, 2025

Conversation

@machichima (Contributor) commented Jan 27, 2025

Add a write() function to BufferedFileSimple, which is used when calling fsspec.open().

def _open(self, path, mode="rb", **kwargs):
    """Return raw bytes-mode file-like from the file-system"""
    return BufferedFileSimple(self, path, mode, **kwargs)

class BufferedFileSimple(fsspec.spec.AbstractBufferedFile):
    def __init__(self, fs, path, mode="rb", **kwargs):
        if mode != "rb":
            raise ValueError("Only 'rb' mode is currently supported")
        super().__init__(fs, path, mode, **kwargs)
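
For context, basic write support in a pure-fsspec subclass would hang off AbstractBufferedFile's upload hooks, roughly like this (a sketch for illustration only; the merged PR ultimately delegates to obstore's Rust BufferedWriter instead, as noted further down):

import fsspec

class BufferedFileWrite(fsspec.spec.AbstractBufferedFile):
    # AbstractBufferedFile.write() accumulates bytes in self.buffer and
    # calls the two hooks below; overriding them is enough for basic "wb"
    def _initiate_upload(self):
        self._parts = []

    def _upload_chunk(self, final=False):
        # self.buffer holds the bytes written since the last flush
        self._parts.append(self.buffer.getvalue())
        if final:
            # single-shot upload on close(); real code would stream parts
            self.fs.pipe_file(self.path, b"".join(self._parts))
        return True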

Related to issue #164

@machichima machichima changed the title [WIP] fsspec write method for open() [WIP] support df.to_parquet and df.read_parquet() Jan 30, 2025
@machichima (Contributor, Author)

I found that there's also a bug in checking whether a parquet file exists in info(), so I renamed the title.

@machichima (Contributor, Author)

Hi @kylebarron,

I am wondering about the test here. Originally, fs.info("dir") raised FileNotFoundError for a directory, which broke df.to_parquet(). After fixing that, the line fs.cat("dir", recursive=True) raises FileNotFoundError for "dir": since fs.info("dir") no longer errors, "dir" itself is passed on to be read as a file.

def test_multi_file_ops(fs):
    data = {"dir/test1": b"test data1", "dir/test2": b"test data2"}
    fs.pipe(data)
    out = fs.cat(list(data))
    assert out == data
    out = fs.cat("dir", recursive=True)
    assert out == data
    fs.cp("dir", "dir2", recursive=True)
    out = fs.find("", detail=False)
    assert out == ["afile", "dir/test1", "dir/test2", "dir2/test1", "dir2/test2"]
    fs.rm(["dir", "dir2"], recursive=True)
    out = fs.find("", detail=False)
    assert out == ["afile"]

Should I try to make its output {"dir/test1": b"test data1", "dir/test2": b"test data2"} here? That would require overriding _cat() from fsspec as follows:

    async def _cat(
        self, path, recursive=False, on_error="raise", batch_size=None, **kwargs
    ):
        paths = await self._expand_path(path, recursive=recursive)
        coros = [self._cat_file(path, **kwargs) for path in paths if not self._isdir(path)]   # ignore dir for cat_file
        batch_size = batch_size or self.batch_size

Referring to fsspec, it simply raises FileNotFoundError in this case. Maybe we can just remove this line, or make the test assert that FileNotFoundError is raised?
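
If we take the assertion route, the test could look roughly like this (a sketch that assumes the fixed fs.cat raises as described above; not code from this PR):

import pytest

def test_cat_dir_raises(fs):
    fs.pipe({"dir/test1": b"test data1", "dir/test2": b"test data2"})
    # "dir" is only a key prefix, not an object, so reading it directly fails
    with pytest.raises(FileNotFoundError):
        fs.cat("dir", recursive=True)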

@kylebarron (Member)

@martindurant wrote that test and is obviously more familiar with fsspec than I am... @martindurant do you have any suggestions here?

@martindurant (Contributor)

Keep pinging me until I have a chance to look at this :)

@martindurant (Contributor) left a comment

I made some comments on the code as it stands.

However, the outstanding issue is how to construct these instances via fsspec.open(). It would mean:

  • registering each of the expected protocols (s3, gs, ab) to override the fsspec default ones. Perhaps a top-level function in obstore would do this explicitly (I wouldn't do it implicitly on import).
  • writing a _get_kwargs_from_urls to create the right obstore instance for the given path(s), including the bucket. This would also be a way to stash the value of the bucket, for later asserting the paths are right.

The alternative way, annoying for the user, would be to explicitly pass a premade instance with filesystem= (sometimes fs=) to the given loading function.
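
For the second bullet, the hook might look roughly like this (a sketch with hypothetical names, not the PR's final code):

import fsspec

class S3FsspecStore(fsspec.AbstractFileSystem):
    protocol = "s3"

    def __init__(self, bucket=None, **kwargs):
        # stash the bucket so later calls can validate paths against it
        self.bucket = bucket
        super().__init__(**kwargs)

    @staticmethod
    def _get_kwargs_from_urls(path):
        # fsspec calls this hook to derive constructor kwargs from the URL,
        # e.g. "s3://my-bucket/key" -> {"bucket": "my-bucket"}
        stripped = path.split("://", 1)[-1]
        return {"bucket": stripped.split("/", 1)[0]}

# explicit registration, overriding fsspec's default "s3" implementation
fsspec.register_implementation("s3", S3FsspecStore, clobber=True)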

"version": head["version"],
}
except FileNotFoundError:
# try ls, refer to the info implementation in fsspec
@martindurant (Contributor):

Why does this PR need the extra code? Are you trying open() with globs? I don't know the details of head_async, or whether it might already achieve this.

@machichima (Contributor, Author):

For code that stores the parquet as file.csv/00000, file.csv/00001, etc., reading file.csv/ from S3 makes info() raise FileNotFoundError. As far as I know, an S3 "folder" is not an object but merely a key prefix, which is what causes the error. So I added this code to handle the FileNotFoundError from head_async.
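
Concretely, the fallback is along these lines (a simplified sketch, not the PR's exact code):

import obstore as obs

async def _info(self, path, **kwargs):
    try:
        # self.store is assumed to be the underlying obstore store instance
        head = await obs.head_async(self.store, path)
        return {"name": path, "size": head["size"], "type": "file"}
    except FileNotFoundError:
        # an S3 "folder" is just a key prefix, so HEAD fails even when
        # objects like path/00000 exist; probe with a listing instead
        if await self._ls(path, detail=False):
            return {"name": path, "size": 0, "type": "directory"}
        raise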

@martindurant (Contributor):

OK, the same old "is it a folder" problem - I am well familiar with this.

@kylebarron (Member)

I made some comments on the code as it stands.

Thank you!

  • registering each of the expected protocols (s3, gs, ab) to override the fsspec default ones. Perhaps a top-level function in obstore would do this explicitly (I wouldn't do it implicitly on import).

I'm in favor of this approach. I definitely wouldn't do it implicitly on import, but I'd propose we have obstore.fsspec.register(), which would register these protocols with fsspec's registry.

@machichima (Contributor, Author) commented Feb 6, 2025

I made some comments on the code as it stands.

However, the outstanding issue is how to construct these instances via fsspec.open(). It would mean:

  • registering each of the expected protocols (s3, gs, ab) to override the fsspec default ones. Perhaps a top-level function in obstore would do this explicitly (I wouldn't do it implicitly on import).
  • writing a _get_kwargs_from_urls to create the right obstore instance for the given path(s), including the bucket. This would also be a way to stash the value of the bucket, for later asserting the paths are right.

The alternative way, annoying for the user, would be to explicitly pass a premade instance with filesystem= (sometimes fs=) to the given loading function.

Hi @martindurant,

I've opened a new draft PR for this to ensure consistency in how instances are constructed across methods. My goal is to align the usage with fsspec.

With this PR, obstore can be registered as an fsspec storage backend using:

fsspec.register_implementation("s3", S3FsspecStore)

The bucket is extracted from the file path and used as a cache key when creating obstore objects. Here's an example usage that I would like to achieve:

from datetime import timedelta

import fsspec

fsspec.register_implementation("s3", S3FsspecStore)
fs: AsyncFsspecStore = fsspec.filesystem(
    "s3",
    config={
        "endpoint": "http://localhost:30002",
        "access_key_id": "minio",
        "secret_access_key": "miniostorage",
        "virtual_hosted_style_request": True,  # path contain bucket name
    },
    client_options={"timeout": "99999s", "allow_http": "true"},
    retry_config={
        "max_retries": 2,
        "backoff": {
            "base": 2,
            "init_backoff": timedelta(seconds=2),
            "max_backoff": timedelta(seconds=16),
        },
        "retry_timeout": timedelta(minutes=3),
    },
)

fs.cat_file("my-s3-bucket/test.txt")

Does this align with your expectations? Please let me know if you have any suggestions!
Thanks!

@machichima (Contributor, Author)

I'm in favor of this approach. I definitely wouldn't do it implicitly on import, but I'd propose we have obstore.fsspec.register() which would register these protocols with fsspec's registry.

Hi @kylebarron,

I think we can directly use fsspec's registration for this, i.e. fsspec.register_implementation("s3", AsyncFsspecStore). Or do you mean something like obstore.fsspec.register("s3"), so that we do not need to create more classes that inherit from AsyncFsspecStore?

@kylebarron (Member)

Or do you mean something like obstore.fsspec.register("s3"), so that we do not need to create more classes that inherit from AsyncFsspecStore?

I like this because it means that our fsspec subclasses could potentially stay private. So in theory the only API exported from obstore.fsspec would be register(). In practice, that might not be enough for all fsspec use cases.

But overall I think having obstore.fsspec.register, even if that function is a one-liner that wraps fsspec.register, is useful for simplicity.

@martindurant (Contributor)

I think having obstore.fsspec.register, even if that function is a one-liner that wraps fsspec.register

Exactly what I was thinking - the user can call register themselves, as in the example above, but it would be useful to provide a utility function that knows what to register, so the user only needs to call one thing once.
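
Something along these lines, perhaps (a sketch only; the merged code builds a dynamic subclass per protocol, as the diff further down shows):

import fsspec
from obstore.fsspec import AsyncFsspecStore  # the subclass added in this PR

def register(protocol: str, *, asynchronous: bool = False) -> None:
    """Register obstore's fsspec implementation for one protocol."""
    fsspec.register_implementation(
        protocol,
        type(
            f"{protocol.upper()}FsspecStore",
            (AsyncFsspecStore,),
            {"protocol": protocol, "asynchronous": asynchronous},
        ),
        clobber=True,  # override fsspec's built-in implementation
    )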

@machichima (Contributor, Author)

I will continue on this once this PR is merged, so that we can use the new way to construct the obstore instance in open() too.

@machichima machichima changed the title [WIP] support df.to_parquet and df.read_parquet() [FEAT] support df.to_parquet and df.read_parquet() Mar 2, 2025
@machichima machichima requested a review from kylebarron March 2, 2025 12:42
@@ -542,5 +737,6 @@ def _register(protocol: str, *, asynchronous: bool) -> None:
             "asynchronous": asynchronous,
         },  # Assign protocol dynamically
     ),
-    clobber=False,
+    # Override any existing implementations of the same protocol
+    clobber=True,
@kylebarron (Member):

Note, I changed this, because I think we want to override any of the default fsspec protocols

@kylebarron (Member)

I pushed a couple updates here:

  • Always use the underlying Rust BufferedFile and BufferedWriter where possible, and use the provided fsspec methods as little as possible.
    • Override the provided read, tell, seek, readline, readlines, flush, write methods
    • Use the underlying Rust reader for readline and readlines
  • Added typing overloads for the BufferedFile __init__ for either readers/writers (see the sketch below).
  • Minimized what we set on self. We don't need to set fs, store, or path on self.
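
For reference, such overloads usually annotate self so that the mode literal selects the generic parameter; a sketch with stand-in Reader/Writer types (not the merged signatures):

from typing import Generic, Literal, TypeVar, overload

class Reader: ...  # stand-in for obstore's Rust-backed reader
class Writer: ...  # stand-in for obstore's Rust-backed writer

T = TypeVar("T", Reader, Writer)

class BufferedFile(Generic[T]):
    @overload
    def __init__(self: "BufferedFile[Reader]", fs, path: str, mode: Literal["rb"] = "rb") -> None: ...
    @overload
    def __init__(self: "BufferedFile[Writer]", fs, path: str, mode: Literal["wb"]) -> None: ...
    def __init__(self, fs, path, mode="rb"):
        # single runtime implementation that dispatches on mode
        self.mode = mode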

@kylebarron kylebarron enabled auto-merge (squash) March 3, 2025 18:46
@kylebarron kylebarron merged commit 31ed2e4 into developmentseed:main Mar 3, 2025
4 checks passed
@kylebarron (Member)

Thanks for starting this!
