Dask worker nodes unable to access files on Databricks Unity Catalog Volume #42
Comments
The Dask cluster is set up on exactly the same nodes as the Spark cluster, so if the Unity Catalog Volume is not available to the Spark workers then it will not be available to the Dask workers either. If the data you are trying to access from the volume is small then you could use another mechanism to get it onto the workers. It would help to understand more about what you are trying to do so we can figure out how best to help.
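A minimal sketch of what that could look like, assuming the driver can see the volume while the workers cannot; the `files` list and the use of `client.scatter` here are illustrative, not necessarily the specific approach being referred to above:

```python
import xarray as xr
import dask_databricks

client = dask_databricks.get_client()

# Open the small files on the driver, which does have access to /Volumes/...
datasets = [xr.open_dataset(f, engine="cfgrib").load() for f in files]

# Ship the already-loaded datasets to the workers so downstream tasks
# never need to touch the volume themselves
futures = client.scatter(datasets)
```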
Hi @jacobtomlinson, I am trying to convert Spark code to Dask.

Spark:

To check if the Spark workers would have the same problem, I tried to run the below:

```python
files_rdd = sc.parallelize(files)
ds_rdd = files_rdd.map(lambda file: xr.open_dataset(file, engine='cfgrib', filter_by_keys={'dataType': 'pf', 'edition': 1}).load())
ds_rdd.collect()
```

This was successful.

Dask:

This unfortunately could not be done on the Dask worker nodes:

```python
import dask
import xarray as xr

# Function to open and load a dataset lazily
@dask.delayed
def load_dataset(file):
    return xr.open_dataset(file, engine='cfgrib', filter_by_keys={'dataType': 'pf', 'edition': 1}).load()

# Apply the function to each file
lazy_datasets = [load_dataset(file) for file in files]

# Trigger the computation and load all datasets
datasets = dask.compute(*lazy_datasets)
```

or, more concisely:

```python
datasets = xr.open_mfdataset('/Volumes/mss-uc/bronze/bronze-volume/nwp/ec_ens/input/N1E*1', engine='cfgrib', filter_by_keys={'dataType': 'pf', 'edition': 1}, parallel=True).load()
```

Error message:
Thanks for the example, that's really helpful. Just to check, are you setting up the Dask client before calling your Dask code?

```python
import dask_databricks

client = dask_databricks.get_client()
```
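For reference, a sketch of the ordering being asked about, reusing the `load_dataset` helper from the earlier snippet: the client needs to exist before `dask.compute` is called, otherwise the delayed tasks run on the default local scheduler on the driver instead of the Databricks cluster.

```python
import dask
import dask_databricks

# Create the client first so it registers as the default scheduler
client = dask_databricks.get_client()

# Only then build and trigger the delayed computation
lazy_datasets = [load_dataset(file) for file in files]
datasets = dask.compute(*lazy_datasets)
```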
Yes, correct. Meanwhile, I'll check out the Unity Catalog docs and see how it works. I suspect this might not be easy to resolve, given that Unity Catalog Volume access controls may be tied to Azure Entra ID.
It's possible that the Spark workers have some knowledge of Unity Catalog, or some authentication configured that isn't set for the Dask workers. Please report back here with any info you find out!
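One way to check that, sketched below using the volume path from the earlier snippet, is to run a small probe on every Dask worker with `client.run` and compare the result with the same probe on the driver:

```python
import os

def check_access(path="/Volumes/mss-uc/bronze/bronze-volume"):
    # Report whether the volume path is visible and which Databricks-related
    # environment variables are present in this process
    return {
        "path_exists": os.path.exists(path),
        "databricks_env_vars": sorted(k for k in os.environ if "DATABRICKS" in k),
    }

print(check_access())            # on the driver
print(client.run(check_access))  # on every Dask worker
```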
Hi,
I've noticed that Dask worker nodes running on Databricks are unfortunately unable to access files on a Unity Catalog Volume.
It seems that only the Spark master node has default access to the Unity Catalog Volume.
For now the workaround is to access the Azure Blob Storage directly, although this is not preferred.
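As a rough illustration of that workaround, here is a hedged sketch that reads the same GRIB files straight from Blob Storage with fsspec/adlfs; the storage account, container, and credential values are placeholders, and each blob is cached to local disk first because the cfgrib engine needs a real file to open:

```python
import dask
import fsspec
import xarray as xr

# Placeholder credentials and container; adjust to your storage account
STORAGE_OPTIONS = {"account_name": "<storage-account>", "account_key": "<account-key>"}

fs = fsspec.filesystem("abfs", **STORAGE_OPTIONS)
files = fs.glob("<container>/nwp/ec_ens/input/N1E*1")

@dask.delayed
def load_blob(path):
    # Copy the blob to a local cache file so cfgrib can read it from disk
    local_path = fsspec.open_local(f"simplecache::abfs://{path}", abfs=STORAGE_OPTIONS)
    return xr.open_dataset(
        local_path,
        engine="cfgrib",
        filter_by_keys={"dataType": "pf", "edition": 1},
    ).load()

datasets = dask.compute(*[load_blob(f) for f in files])
```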