Use Azure ML pipeline data passing #43
Conversation
Branch updated from 8ceebb0 to 26f3e0a.
Thanks for the updated contribution @tomasvanpottelbergh, it looks much better now. I really like the implementation and I'm eager to test it!
Could you briefly confirm that my understanding of this PR is correct?
- If the flag `native_data_passing` is set to `false` (default), everything works as previously: full backward compatibility with previous versions of the plugin.
- If the flag `native_data_passing` is set to `true`, the runner overwrites the paths in the entries defined explicitly in the catalog, swapping their output paths for Azure-native ones. That means that if I, for example, use `pandas.CSVDataSet` wrapped by your dataset, like this:

  ```yaml
  my_data_set:
    type: AzureMLFolderDataset
    dataset:
      type: pandas.CSVDataSet
      filepath: my/local/path/to/file.csv
  ```

  at runtime, in Azure ML, the `filepath` will be replaced by the path provided by Azure's runtime.
- Question: what happens when I have `native_data_passing` set to `true`, but the catalog does not have any `AzureMLFolderDataset` entries?
Thanks for taking the time to review it!
Correct!
In that case the native data passing will use pickle files for the datasets not specified in the catalog. It will not affect any other dataset in the catalog which is not an `AzureMLFolderDataset`.
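For illustration, the example entry above would then effectively resolve to something like this inside the Azure ML job (a sketch only: the mount path below is hypothetical, as the real one is assigned by Azure ML at runtime):

```yaml
# Hypothetical effective configuration at runtime (path assigned by Azure ML):
my_data_set:
  type: AzureMLFolderDataset
  dataset:
    type: pandas.CSVDataSet
    filepath: /mnt/azureml/outputs/my_data_set/file.csv  # swapped in by the runner
```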
Definitely, I will add tests and docs once this approach is approved.
…o-azureml into rebase/feat/native-data-passing
OK, I've added my final thoughts. I was able to test it out and it works well, but it needs slight polishing.
I was expecting it to work with a catalog entry like this:

```yaml
my_data_set:
  type: AzureMLFolderDataset
  dataset:
    type: pandas.CSVDataSet
    filepath: my/local/path/to/file.csv
```
But it doesn't, as it expects the `path` attribute in the `AzureMLFolderDataset`, which is needless in my opinion, since you've already done a good job of allowing `filepath_arg` to be set there: you can easily extract the file name from the internal dataset instead of having to specify it twice. This makes the usage much more pleasant. I did some hacking on your code base to make it possible (by default there will be no need for the `path` arg, but it can still be set if someone wants full control).
- Change the parameter ordering and make `path` empty by default:

  ```python
  class AzureMLFolderDataset(AbstractDataSet):
      def __init__(
          self,
          dataset: Union[str, Type[AbstractDataSet], Dict[str, Any]],
          path: str = "",
          filepath_arg: str = "filepath",
      ):
          # ...
  ```
- Add a property to `AzureMLFolderDataset`:

  ```python
  @property
  def original_dataset_path(self) -> Path:
      return Path(self._dataset_config[self._filepath_arg])
  ```
- In the `run` of `AzurePipelinesRunner`:

  ```python
  def run(
      self,
      pipeline: Pipeline,
      catalog: DataCatalog,
      hook_manager: PluginManager = None,
      session_id: str = None,
  ) -> Dict[str, Any]:
      catalog = catalog.shallow_copy()
      catalog_set = set(catalog.list())

      # Loop over datasets in arguments to set their paths
      for ds_name, azure_dataset_folder in self.data_paths.items():
          if ds_name in catalog_set:
              ds = catalog._get_dataset(ds_name)
              if isinstance(ds, AzureMLFolderDataset):
                  file_name = (
                      ds.original_dataset_path.name
                      if not ds.path
                      else Path(ds.path).name
                  )  # <--- this
                  ds.path = str(Path(azure_dataset_folder) / file_name)  # <--- and this
                  catalog.add(ds_name, ds, replace=True)
          else:
              catalog.add(ds_name, self.create_default_data_set(ds_name))

      return super().run(pipeline, catalog, hook_manager, session_id)
  ```
- In the `create_default_data_set` of `AzurePipelinesRunner`: the parameter order needs to change (path / dataset) in the `return dataset_cls(PickleDataSet, path)` call (see the sketch after this list).
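A minimal sketch of how that fallback could look with the reordered `__init__` from the first bullet; the `data_paths` lookup is an assumption based on the `run` snippet above, not the plugin's actual code:

```python
def create_default_data_set(self, ds_name: str) -> AzureMLFolderDataset:
    # Sketch only: datasets absent from the catalog fall back to pickle files,
    # stored in the Azure-provided folder for this dataset name.
    path = self.data_paths[ds_name]  # assumed lookup; mirrors run() above
    return AzureMLFolderDataset(dataset=PickleDataSet, path=path)
```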
Once you apply those changes and add docs and unit tests, we're fine to merge. It will be released as `0.4.0`, as this is a big feature 🎉
If I understand correctly, this is essentially to mount/attach an Azure ML Data Asset to a compute/cluster in Azure ML. Is that correct?
This PR only handles mounting (unregistered) Data Assets in order to pass data between pipeline nodes. When this PR is merged, I will extend this to registered Data Assets.
Sorry, you're correct, I didn't spot the difference in your example that made it fail. I based the implementation on that of the … I have changed the implementation in a slightly different way from what you proposed, but following the same idea. I would actually propose to remove the …
Your implementation looks fine and I see no reason for leaving the … Please fix the 3 comments I've added above. Once unit tests pass, I will merge it 👍🏻
Thanks again for the review @marrrcin! I think I have addressed all the issues and made an attempt to write some docs, although I leave it up to you to adapt them as you prefer. A few last things that came up: …
Agree, let's address this later.
I will take over this next week.
Makes sense.
Created #52 to track this.
Great, thanks!
Done in latest commit.
Merged! Awesome work, we will release it later this week. Thanks for the contribution 🎉
When do you think you'll release v0.4.0 with this new feature?
@jpoullet2000 it will be released by 2023-04-28.
Resolves #7.
This is a reworked version of my implementation of using Azure ML native data passing between nodes instead of temporary storage.
The idea is simply to save and load the data to and from pickle files located at the path passed to the node via the `--az-input` and `--az-output` arguments. The `AzureMLPipelineDataset` adds the option to save intermediate datasets in the format specified in the catalog.
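To make that last point concrete, here is a hedged example of what such a catalog entry could look like; the `type` module path and the file locations are assumptions for illustration, not taken from the PR:

```yaml
# Hypothetical catalog entry: the wrapped dataset keeps its CSV format
# instead of falling back to pickle for node-to-node data passing.
my_intermediate_data:
  type: kedro_azureml.datasets.AzureMLPipelineDataset  # module path assumed
  dataset:
    type: pandas.CSVDataSet
    filepath: data/02_intermediate/my_intermediate_data.csv
```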