Document fsspec integration in user guide
kylebarron committed Mar 3, 2025
1 parent 94b69cc commit 5bfb35d
Showing 5 changed files with 164 additions and 20 deletions.
5 changes: 2 additions & 3 deletions README.md
@@ -22,11 +22,10 @@ The simplest, highest-throughput [^1] Python interface to [S3][s3], [GCS][gcs],
- **Streaming uploads** from files or async or sync iterators.
- **Streaming list**, with no need to paginate.
- Automatic [**multipart uploads**](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) for large file objects.
- File-like object API and [fsspec](https://github.com/fsspec/filesystem_spec) integration.
- Easy to install with **no required Python dependencies**.
- Support for **conditional put** ("put if not exists"), as well as custom tags and attributes.
- Optionally return list results in [Apache Arrow](https://arrow.apache.org/) format, which is faster and more memory-efficient than materializing Python `dict`s.
- File-like object API and [fsspec](https://github.com/fsspec/filesystem_spec) integration.
- Easy to install with no required Python dependencies.
- The [underlying Rust library](https://docs.rs/object_store) is production quality and used in large scale production systems, such as the Rust package registry [crates.io](https://crates.io/).
- Zero-copy data exchange between Rust and Python via the [buffer protocol](https://jakevdp.github.io/blog/2014/05/05/introduction-to-the-python-buffer-protocol/).

<!-- For Rust developers looking to add object_store support to their Python packages, refer to pyo3-object_store. -->
Binary file added docs/assets/fsspec-type-hinting.jpg
132 changes: 132 additions & 0 deletions docs/fsspec.md
@@ -0,0 +1,132 @@
# fsspec Integration

Obstore provides native integration with the [fsspec] ecosystem.

[fsspec]: https://github.com/fsspec/filesystem_spec

The fsspec integration is best effort and may not provide the same
performance as the rest of obstore. Where possible, implementations should use
the underlying `obstore` APIs directly. If you find any bugs with this
integration, please [file an
issue](https://github.com/developmentseed/obstore/issues/new/choose).

## Usage

### Direct class usage

Construct an fsspec-compatible filesystem with [`FsspecStore`][obstore.fsspec.FsspecStore]. This implements [`AbstractFileSystem`][fsspec.spec.AbstractFileSystem], so you can use it wherever an API expects an fsspec-compatible filesystem.

```py
from obstore.fsspec import FsspecStore

fs = FsspecStore("s3", region="us-west-2", skip_signature=True)
prefix = (
    "s3://sentinel-cogs/sentinel-s2-l2a-cogs/12/S/UF/2022/6/S2B_12SUF_20220609_0_L2A/"
)
items = fs.ls(prefix)
# [{'name': 'sentinel-cogs/sentinel-s2-l2a-cogs/12/S/UF/2022/6/S2B_12SUF_20220609_0_L2A/AOT.tif',
# 'size': 80689,
# 'type': 'file',
# 'e_tag': '"c93b0f6b0e2cf8e375968f41161f9df7"'},
# ...
```
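
The listing above returns names of the form `bucket/key`, without the `s3://` prefix. As a rough illustration of that split, here is a minimal sketch using only the standard library. The helper name `split_object_url` is hypothetical, and this is not necessarily how obstore parses URLs internally.

```py
from urllib.parse import urlparse


def split_object_url(url: str) -> tuple[str, str, str]:
    """Split an object-store URL into (protocol, bucket, key)."""
    parsed = urlparse(url)
    # For URLs like s3://bucket/key, the bucket lands in `netloc`.
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")


protocol, bucket, key = split_object_url(
    "s3://sentinel-cogs/sentinel-s2-l2a-cogs/12/S/UF/2022/6/S2B_12SUF_20220609_0_L2A/AOT.tif"
)
# Listing names have the form "<bucket>/<key>", without the protocol.
name = f"{bucket}/{key}"
print(name)
```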

If you need a readable or writable file-like object, you can call the `open`
method provided on `FsspecStore`, or you may construct a
[`BufferedFile`][obstore.fsspec.BufferedFile] directly.

```py
from obstore.fsspec import FsspecStore

fs = FsspecStore("s3", region="us-west-2", skip_signature=True)

with fs.open(
    "s3://sentinel-cogs/sentinel-s2-l2a-cogs/12/S/UF/2022/6/S2B_12SUF_20220609_0_L2A/thumbnail.jpg",
) as file:
    content = file.read()
```

Using the `FsspecStore` class directly may be preferable because its constructor is type hinted, which helps IDEs like VSCode suggest valid keyword parameters.

### Register as a global handler

Use [`register`][obstore.fsspec.register] to register obstore as the default
handler for various protocols. Then use [`fsspec.filesystem`][] to create an
fsspec filesystem object for a specific protocol, or use [`fsspec.open`][] to
open a file directly from a URL.

```py
import fsspec
from obstore.fsspec import register

# Register obstore as the default handler for all protocols supported by
# obstore.
# You may wish to register only specific protocols, instead.
register()

# Create a new fsspec filesystem for the given protocol
fs = fsspec.filesystem("https")
content = fs.cat_file("https://example.com/")

# Or, open the file directly
url = "https://github.com/opengeospatial/geoparquet/raw/refs/heads/main/examples/example.parquet"
with fsspec.open(url) as file:
    content = file.read()
```

## Store configuration

Some stores may require configuration. You may pass configuration parameters to the [`FsspecStore`][obstore.fsspec.FsspecStore] constructor directly. Or, if you're using [`fsspec.filesystem`][], you may pass configuration parameters to that call, which will pass parameters down to the `FsspecStore` constructor internally.

```py
from obstore.fsspec import FsspecStore

fs = FsspecStore("s3", region="us-west-2", skip_signature=True)
```

Or, with [`fsspec.filesystem`][]:

```py
import fsspec

from obstore.fsspec import register

register("s3")

fs = fsspec.filesystem("s3", region="us-west-2", skip_signature=True)
```
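
To picture how keyword arguments can travel from the constructor down to a per-bucket store, here is a toy sketch loosely modeled on the pattern visible in this commit's `fsspec.py` changes: extra `**kwargs` are stored, then forwarded when a store is built behind an `lru_cache`. The class `DemoFilesystem` and its dictionary "store" are hypothetical stand-ins, not obstore's implementation.

```py
from functools import lru_cache
from typing import Any


class DemoFilesystem:
    """Toy stand-in for an fsspec-style wrapper (hypothetical)."""

    def __init__(self, protocol: str, max_cache_size: int = 10, **kwargs: Any) -> None:
        self.protocol = protocol
        # Keep store-specific options so they can be forwarded later.
        self.config_kwargs = kwargs
        # Wrap the bound method so each instance gets its own bounded cache.
        self._construct_store = lru_cache(maxsize=max_cache_size)(self._construct_store)

    def _construct_store(self, bucket: str) -> dict[str, Any]:
        # A real implementation would build a store object here; a dict is
        # enough to show which options arrive.
        return {"protocol": self.protocol, "bucket": bucket, **self.config_kwargs}


fs = DemoFilesystem("s3", region="us-west-2", skip_signature=True)
store = fs._construct_store("sentinel-cogs")
print(store)
# A repeated call for the same bucket returns the cached object.
assert fs._construct_store("sentinel-cogs") is store
```

Caching per instance rather than on the class keeps one filesystem's stores from being shared with another filesystem that was configured differently.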

## Type hinting

The fsspec API is not conducive to type checking. The easiest way to get type hinting for parameters is to use [`FsspecStore`][obstore.fsspec.FsspecStore] to construct fsspec-compatible stores instead of [`fsspec.filesystem`][].

[`fsspec.open`][] and [`fsspec.filesystem`][] take arbitrary keyword arguments that they pass down to the underlying store, and these pass-through arguments are not typed.

However, it is possible to get type checking of store configuration by defining config parameters as a dictionary:

```py
from __future__ import annotations

from typing import TYPE_CHECKING

import fsspec

from obstore.fsspec import register

if TYPE_CHECKING:
    from obstore.store import S3ConfigInput

register("s3")

config: S3ConfigInput = {"region": "us-west-2", "skip_signature": True}
fs = fsspec.filesystem("s3", config=config)
```

Then your type checker will validate that the `config` dictionary is compatible with [`S3ConfigInput`][obstore.store.S3ConfigInput]. VSCode also provides auto suggestions for parameters:

![](./assets/fsspec-type-hinting.jpg)
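
To illustrate what that validation means, here is a small sketch with a stand-in `TypedDict`. `DemoS3ConfigInput` is a hypothetical two-key substitute for the real `S3ConfigInput`, which has many more keys. The check happens only in the type checker; at runtime the value is an ordinary `dict`.

```py
from typing import TypedDict


class DemoS3ConfigInput(TypedDict, total=False):
    # Hypothetical stand-in; only the two keys used in this guide.
    region: str
    skip_signature: bool


config: DemoS3ConfigInput = {"region": "us-west-2", "skip_signature": True}

# A type checker would flag a misspelled key (for example "regoin") or a
# wrongly typed value in the literal above. Nothing is validated at runtime.
print(type(config).__name__)
```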

!!! note

    `S3ConfigInput` is a "type-only" construct, and so it needs to be imported from within an `if TYPE_CHECKING` block. Additionally, `from __future__ import annotations` must be at the top of the file.
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -28,6 +28,7 @@ nav:
- authentication.md
- integrations.md
- performance.md
- fsspec.md
- Alternatives: alternatives.md
- Troubleshooting:
- AWS: troubleshooting/aws.md
46 changes: 29 additions & 17 deletions obstore/python/obstore/fsspec.py
@@ -2,10 +2,8 @@
[fsspec]: https://github.com/fsspec/filesystem_spec
The fsspec integration is **best effort** and not the primary API of `obstore`. This
integration may not be as stable and may not provide the same performance as the rest of
the library. Changes may be made even in patch releases to align better with fsspec
expectations. If you find any bugs, please [file an
The fsspec integration is best effort and may not provide the same performance as
the rest of obstore. If you find any bugs with this integration, please [file an
issue](https://github.com/developmentseed/obstore/issues/new/choose).
The underlying `object_store` Rust crate
@@ -39,7 +37,7 @@
from collections import defaultdict
from functools import lru_cache
from pathlib import Path
from typing import TYPE_CHECKING, Any, Literal, overload
from typing import TYPE_CHECKING, Any, Literal, Unpack, overload
from urllib.parse import urlparse

import fsspec.asyn
@@ -65,6 +63,12 @@
S3ConfigInput,
)

__all__ = [
"BufferedFile",
"FsspecStore",
"register",
]

SUPPORTED_PROTOCOLS: set[str] = {
"abfs",
"abfss",
@@ -113,46 +117,49 @@ class FsspecStore(fsspec.asyn.AsyncFileSystem):
@overload
def __init__(
self,
*args: Any,
protocol: Literal["s3", "s3a"],
*args: Any,
config: S3Config | S3ConfigInput | None = None,
client_options: ClientConfig | None = None,
retry_config: RetryConfig | None = None,
asynchronous: bool = False,
max_cache_size: int = 10,
loop: Any = None,
batch_size: int | None = None,
**kwargs: Unpack[S3ConfigInput],
) -> None: ...
@overload
def __init__(
self,
*args: Any,
protocol: Literal["gs"],
*args: Any,
config: GCSConfig | GCSConfigInput | None = None,
client_options: ClientConfig | None = None,
retry_config: RetryConfig | None = None,
asynchronous: bool = False,
max_cache_size: int = 10,
loop: Any = None,
batch_size: int | None = None,
**kwargs: Unpack[GCSConfigInput],
) -> None: ...
@overload
def __init__(
self,
*args: Any,
protocol: Literal["az", "adl", "azure", "abfs", "abfss"],
*args: Any,
config: AzureConfig | AzureConfigInput | None = None,
client_options: ClientConfig | None = None,
retry_config: RetryConfig | None = None,
asynchronous: bool = False,
max_cache_size: int = 10,
loop: Any = None,
batch_size: int | None = None,
**kwargs: Unpack[AzureConfigInput],
) -> None: ...
def __init__( # noqa: PLR0913
self,
protocol: SUPPORTED_PROTOCOLS_T | str | None = None,
*args: Any,
protocol: str | None = None,
config: (
S3Config
| S3ConfigInput
@@ -168,6 +175,7 @@ def __init__( # noqa: PLR0913
max_cache_size: int = 10,
loop: Any = None,
batch_size: int | None = None,
**kwargs: Any,
) -> None:
"""Construct a new FsspecStore.
@@ -197,15 +205,16 @@ def __init__( # noqa: PLR0913
batch_size: some operations on many files will batch their requests; if you
are seeing timeouts, you may want to set this number smaller than the
defaults, which are determined in `fsspec.asyn._get_batch_size`.
kwargs: per-store configuration passed down to store-specific builders.
**Examples:**
```py
from obstore.fsspec import FsspecStore
store = FsspecStore(protocol="https")
resp = store.cat("https://example.com")
assert resp.startswith(b"<!doctype html>")
store = FsspecStore("https")
resp = store.cat_file("https://raw.githubusercontent.com/developmentseed/obstore/refs/heads/main/README.md")
assert resp.startswith(b"# obstore")
```
"""
@@ -223,6 +232,7 @@ def __init__( # noqa: PLR0913
self.config = config
self.client_options = client_options
self.retry_config = retry_config
self.config_kwargs = kwargs

# https://stackoverflow.com/a/68550238
self._construct_store = lru_cache(maxsize=max_cache_size)(self._construct_store)
@@ -279,6 +289,7 @@ def _construct_store(self, bucket: str) -> ObjectStore:
config=self.config,
client_options=self.client_options,
retry_config=self.retry_config,
**self.config_kwargs,
)

async def _rm_file(self, path: str, **_kwargs: Any) -> None:
@@ -782,11 +793,12 @@ def register(
Args:
protocol: A single protocol (e.g., "s3", "gcs", "abfs") or
a list of protocols to register FsspecStore for. Defaults to `None`, which
will register `obstore` as the provider for all [supported
protocols][obstore.fsspec.SUPPORTED_PROTOCOLS] **except** for `file://` and
`memory://`. If you wish to use `obstore` via fsspec for `file://` or
`memory://` URLs, list them explicitly.
a list of protocols to register FsspecStore for.
Defaults to `None`, which will register `obstore` as the provider for all
[supported protocols][obstore.fsspec.SUPPORTED_PROTOCOLS] **except** for
`file://` and `memory://`. If you wish to use `obstore` via fsspec for
`file://` or `memory://` URLs, list them explicitly.
asynchronous: If `True`, the registered store will support
asynchronous operations. Defaults to `False`.