
Dask worker nodes unable to access files on Databricks Unity Catalog Volume #42

Open · songhan89 opened this issue on Jan 13, 2025 · 5 comments
Labels: help wanted, needs info

@songhan89 (Contributor)

Hi,

I've noticed that Dask worker nodes running on Databricks are unfortunately unable to access files on a Unity Catalog Volume.

It seems that only the Spark master node has default access to the Unity Catalog Volume.

For now the workaround is to access Azure Blob Storage directly, although this is not preferred.
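For reference, a minimal sketch of that workaround, assuming adlfs/fsspec are installed and key-based auth is available; the account, container, and paths below are illustrative, not my actual setup:

import fsspec
import xarray as xr

# Hypothetical direct-to-Blob workaround: copy the GRIB file from Azure
# Blob Storage to local disk (cfgrib needs a real file path, not a
# file-like object), then open it. Names here are placeholders.
fs = fsspec.filesystem('abfs', account_name='<account>', account_key='<key>')
fs.get('<container>/nwp/ec_ens/input/N1E01050000011318001', '/tmp/N1E.grib')
ds = xr.open_dataset(
    '/tmp/N1E.grib',
    engine='cfgrib',
    filter_by_keys={'dataType': 'pf', 'edition': 1},
)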

@jacobtomlinson (Collaborator)

The Dask cluster is set up on exactly the same nodes as the Spark cluster, so if the Unity Catalog Volume is not available to the Spark workers then it will not be available to the Dask workers either.

If the data you are trying to access from the volume is small, you could use Client.run_on_scheduler to call a function on the scheduler to read the data, then pass that to the workers.
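A minimal sketch of that approach, assuming the scheduler runs on the driver node (which can see the volume); the path and helper name are illustrative:

import dask_databricks

client = dask_databricks.get_client()

# Hypothetical helper, executed in the scheduler process on the driver,
# which has access to the /Volumes mount.
def read_volume_file(path):
    with open(path, 'rb') as f:
        return f.read()

# Read on the scheduler, then scatter the bytes out to the workers.
data = client.run_on_scheduler(read_volume_file, '/Volumes/<catalog>/<schema>/<volume>/file.grib')
future = client.scatter(data)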

It would help to understand more about what you are trying to do so we can figure out how best to help.

@songhan89 (Contributor, Author) commented on Jan 15, 2025

Hi @jacobtomlinson,

I am trying to convert .grib weather data to .zarr files. Setting up a Dask cluster would help when I scale up to a TB-scale dataset.
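Roughly, the end-to-end shape of the conversion is something like the sketch below; the output path is illustrative, and the input glob is the one used further down:

import xarray as xr

# Open the GRIB files lazily with the cfgrib engine, then write a
# chunked Zarr store. The output path is a placeholder.
ds = xr.open_mfdataset(
    '/Volumes/mss-uc/bronze/bronze-volume/nwp/ec_ens/input/N1E*1',
    engine='cfgrib',
    filter_by_keys={'dataType': 'pf', 'edition': 1},
    parallel=True,
)
ds.to_zarr('/Volumes/mss-uc/bronze/bronze-volume/nwp/ec_ens/output/ec_ens.zarr', mode='w')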

Spark

To check whether the Spark workers would have the same problem, I tried running the code below:

# sc is the SparkContext that Databricks provides in the notebook
files_rdd = sc.parallelize(files)
# open each GRIB file on a Spark worker and load it into memory
ds_rdd = files_rdd.map(lambda file: xr.open_dataset(file, engine='cfgrib', filter_by_keys={'dataType': 'pf', 'edition': 1}).load())
ds_rdd.collect()

This was successful.


Dask

Unfortunately, the same thing could not be done on the Dask worker nodes.

import dask
import xarray as xr


# Function to open and load a dataset lazily
@dask.delayed
def load_dataset(file):
    return xr.open_dataset(file, engine='cfgrib', filter_by_keys={'dataType': 'pf', 'edition': 1}).load()

# Apply the function to each file
lazy_datasets = [load_dataset(file) for file in files]

# Trigger the computation and load all datasets
datasets = dask.compute(*lazy_datasets)

or, more concisely:

datasets = xr.open_mfdataset('/Volumes/mss-uc/bronze/bronze-volume/nwp/ec_ens/input/N1E*1', engine='cfgrib', filter_by_keys={'dataType': 'pf', 'edition': 1}, parallel=True).load()

Error message:

PermissionError: [Errno 1] Operation not permitted: '/Volumes/mss-uc/bronze/bronze-volume/nwp/ec_ens/input/N1E01050000011318001'
File <command-3477088295534925>, line 14
     11 lazy_datasets = [load_dataset(file) for file in files]
     13 # Trigger the computation and load all datasets
---> 14 datasets = dask.compute(*lazy_datasets)
File /databricks/python/lib/python3.11/site-packages/cfgrib/messages.py:268, in itervalues()
    266 def itervalues(self) -> T.Iterator[Message]:
    267     errors = self.filestream.errors
--> 268     with open(self.filestream.path, "rb") as file:
    269         # enable MULTI-FIELD support on sequential reads (like when building the index)
    270         with multi_enabled(file):
    271             valid_message_found = False

@jacobtomlinson (Collaborator)

Thanks for the example, that's really helpful.

Just to check, are you setting up the Dask client before calling your Dask code?

import dask_databricks

client = dask_databricks.get_client()

@songhan89 (Contributor, Author) commented on Jan 15, 2025

Yes, correct.

Meanwhile, I'll check out the Unity Catalog docs and see how it works. I suspect this might not be easy to resolve, given that Unity Catalog Volume access controls may be tied to Azure Entra ID.

@jacobtomlinson (Collaborator)

It's possible that the Spark workers have some knowledge of Unity Catalog, or some authentication configured that isn't set for the Dask workers. One way to check would be something like the sketch below.
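A quick diagnostic sketch, reusing the dask_databricks client from above; the probe function and what it checks are illustrative, not a prescribed API:

import os

# Hypothetical probe: report whether the process can see the /Volumes
# mount and which user it runs as. Extend with whatever is relevant
# (env vars, credential files, etc.).
def probe():
    return {
        'volumes_visible': os.path.exists('/Volumes'),
        'user': os.environ.get('USER'),
    }

print(client.run_on_scheduler(probe))  # view from the driver/scheduler
print(client.run(probe))               # one result per Dask worker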

Please report back here with any info you find out!

@jacobtomlinson added the help wanted label on Jan 15, 2025