
Dask worker nodes unable to access files on Databricks Unity Catalog Volume #42

Open · songhan89 opened this issue on Jan 13, 2025 · 5 comments
Labels: help wanted, needs info

@songhan89 (Contributor)

Hi,

I've noticed that Dask worker nodes running on Databricks are unfortunately unable to access files on a Unity Catalog Volume.

It seems that only the Spark master node has default access to the Unity Catalog Volume.

For now the workaround is to access Azure Blob Storage directly, although this is not preferred.
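For reference, a minimal sketch of that workaround, assuming adlfs/fsspec are installed and key-based auth is available; the account, container, and paths below are illustrative, not my actual setup:

import fsspec
import xarray as xr

# Hypothetical direct-to-Blob workaround: copy the GRIB file from Azure
# Blob Storage to local disk (cfgrib needs a real file path, not a
# file-like object), then open it. Names here are placeholders.
fs = fsspec.filesystem('abfs', account_name='<account>', account_key='<key>')
fs.get('<container>/nwp/ec_ens/input/N1E01050000011318001', '/tmp/N1E.grib')
ds = xr.open_dataset(
    '/tmp/N1E.grib',
    engine='cfgrib',
    filter_by_keys={'dataType': 'pf', 'edition': 1},
)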

@jacobtomlinson (Collaborator)

The Dask cluster is set up on exactly the same nodes as the Spark cluster, so if the Unity Catalog Volume is not available to the Spark workers then it will not be available to the Dask workers either.

If the data you are trying to access from the volume is small, you could use Client.run_on_scheduler to call a function on the scheduler to read the data, then pass that to the workers.
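A minimal sketch of that approach, assuming the scheduler runs on the driver node (which can see the volume); the path and helper name are illustrative:

import dask_databricks

client = dask_databricks.get_client()

# Hypothetical helper, executed in the scheduler process on the driver,
# which has access to the /Volumes mount.
def read_volume_file(path):
    with open(path, 'rb') as f:
        return f.read()

# Read on the scheduler, then scatter the bytes out to the workers.
data = client.run_on_scheduler(read_volume_file, '/Volumes/<catalog>/<schema>/<volume>/file.grib')
future = client.scatter(data)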

It would help to understand more about what you are trying to do so we can figure out how best to help.

@songhan89 (Contributor, Author) commented on Jan 15, 2025

Hi @jacobtomlinson,

I am trying to convert .grib weather data to .zarr files. Setting up a Dask cluster would help when I scale up to a TB-scale dataset.
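Roughly, the end-to-end shape of the conversion is something like the sketch below; the output path is illustrative, and the input glob is the one used further down:

import xarray as xr

# Open the GRIB files lazily with the cfgrib engine, then write a
# chunked Zarr store. The output path is a placeholder.
ds = xr.open_mfdataset(
    '/Volumes/mss-uc/bronze/bronze-volume/nwp/ec_ens/input/N1E*1',
    engine='cfgrib',
    filter_by_keys={'dataType': 'pf', 'edition': 1},
    parallel=True,
)
ds.to_zarr('/Volumes/mss-uc/bronze/bronze-volume/nwp/ec_ens/output/ec_ens.zarr', mode='w')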

Spark

To check whether the Spark workers would have the same problem, I tried running the code below:

# sc is the SparkContext that Databricks provides in the notebook
files_rdd = sc.parallelize(files)
# open each GRIB file on a Spark worker and load it into memory
ds_rdd = files_rdd.map(lambda file: xr.open_dataset(file, engine='cfgrib', filter_by_keys={'dataType': 'pf', 'edition': 1}).load())
ds_rdd.collect()

This was successful.


Dask

Unfortunately, the same thing could not be done on the Dask worker nodes.

import dask
import xarray as xr


# Function to open and load a dataset lazily
@dask.delayed
def load_dataset(file):
    return xr.open_dataset(file, engine='cfgrib', filter_by_keys={'dataType': 'pf', 'edition': 1}).load()

# Apply the function to each file
lazy_datasets = [load_dataset(file) for file in files]

# Trigger the computation and load all datasets
datasets = dask.compute(*lazy_datasets)

or, more concisely:

datasets = xr.open_mfdataset('/Volumes/mss-uc/bronze/bronze-volume/nwp/ec_ens/input/N1E*1', engine='cfgrib', filter_by_keys={'dataType': 'pf', 'edition': 1}, parallel=True).load()

Error message:

PermissionError: [Errno 1] Operation not permitted: '/Volumes/mss-uc/bronze/bronze-volume/nwp/ec_ens/input/N1E01050000011318001'
File <command-3477088295534925>, line 14
     11 lazy_datasets = [load_dataset(file) for file in files]
     13 # Trigger the computation and load all datasets
---> 14 datasets = dask.compute(*lazy_datasets)
File /databricks/python/lib/python3.11/site-packages/cfgrib/messages.py:268, in itervalues()
    266 def itervalues(self) -> T.Iterator[Message]:
    267     errors = self.filestream.errors
--> 268     with open(self.filestream.path, "rb") as file:
    269         # enable MULTI-FIELD support on sequential reads (like when building the index)
    270         with multi_enabled(file):
    271             valid_message_found = False

@jacobtomlinson (Collaborator)

Thanks for the example, that's really helpful.

Just to check, are you setting up the Dask client before calling your Dask code?

import dask_databricks

client = dask_databricks.get_client()

@songhan89 (Contributor, Author) commented on Jan 15, 2025

Yes, correct.

Meanwhile, I'll check out the Unity Catalog docs and see how it works. I suspect this might not be easy to resolve, given that Unity Catalog Volume access controls may be tied to Azure Entra ID.

@jacobtomlinson (Collaborator)

It's possible that the Spark workers have some knowledge of Unity Catalog, or some authentication configured that isn't set for the Dask workers. One way to check would be something like the sketch below.
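A quick diagnostic sketch, reusing the dask_databricks client from above; the probe function and what it checks are illustrative, not a prescribed API:

import os

# Hypothetical probe: report whether the process can see the /Volumes
# mount and which user it runs as. Extend with whatever is relevant
# (env vars, credential files, etc.).
def probe():
    return {
        'volumes_visible': os.path.exists('/Volumes'),
        'user': os.environ.get('USER'),
    }

print(client.run_on_scheduler(probe))  # view from the driver/scheduler
print(client.run(probe))               # one result per Dask worker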

Please report back here with any info you find out!

@jacobtomlinson added the help wanted label on Jan 15, 2025