DRAFT: Track engine and storage format at query compiler level. #7424
+140
−21
Demo: https://github.com/sfc-gh-mvashishtha/modin/blob/mvashishtha/SNOW-1856014/track-engine-and-storage-format-at-query-compiler-level/per_dataframe_engine_demo.ipynb
Summary:
- `modin.config` environment variable `Engine`, though we should probably rename this variable to `DefaultEngine`. Likewise for `StorageFormat`. These variables represent the execution method that we use for new dataframes created out of something other than an existing Modin dataframe/series. For example, we would use those variables to choose an execution method for the result of `pd.DataFrame([[0, 1]])` or of `pd.read_csv()`.
- `engine` and `storage_format`. We need each query compiler to know these properties because we never want to be guessing a compiler's execution.
- `PandasQueryCompiler`, which we use for `PandasOnPython`, `PandasOnRay`, `PandasOnDask`, and `PandasOnUnidist` (including on MPI) executions, make the `engine` and `storage_format` required constructor parameters so that we
- `SnowflakeQueryCompiler` can return a static engine and storage format because it never changes engine / storage format.
- `add` method assumes that the RHS is a pandas-on-python query compiler. `QueryCompilerCaster` helps somewhat with this issue but remember that while it checks for different classes, it does not consider storage format or engine.

Design / UI considerations:
- `DataFrame`/`Series` level, but I found that tracking the execution method with this approach required changing more code than this approach did. Also, it seemed that the query compiler can already tell us that the result of most operations will preserve its engine, and we had to add a sort of redundant tracking of `engine` and `storage_format` every time we used a query compiler method at the API level.
- `Series` and `DataFrame` expose the variables `engine` and `storage_format`, which they get from their query compilers. So far Modin has avoided providing non-underscored variables that don't match pandas (going so far as to underscore `to_pandas`). We do want users to be able to check `engine` and `storage_format`. Maybe we should add leading underscores for consistency with `to_pandas`.

Productionising this PR:
- `engine` and `storage_format` to some points where we construct `PandasQueryCompiler`, so we have to fix those.
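
A rough sketch of the shape described in the summary, for reviewers who want something concrete to react to. It is not the code in this PR: every class name below is a hypothetical stand-in, the `"Snowflake"` strings are placeholder values, and raising on mismatched executions in `add` is only one possible policy that I am assuming for illustration.

```python
from abc import ABC, abstractmethod


class BaseQueryCompilerSketch(ABC):
    """Hypothetical stand-in for a base query compiler class."""

    @property
    @abstractmethod
    def engine(self) -> str:
        """Engine that executes this compiler's operations."""

    @property
    @abstractmethod
    def storage_format(self) -> str:
        """Storage format that backs this compiler's data."""


class PandasLikeCompilerSketch(BaseQueryCompilerSketch):
    """One class serves several executions, so both values are required."""

    def __init__(self, data, engine: str, storage_format: str):
        self._data = list(data)
        self._engine = engine
        self._storage_format = storage_format

    @property
    def engine(self) -> str:
        return self._engine

    @property
    def storage_format(self) -> str:
        return self._storage_format

    def add(self, other: "PandasLikeCompilerSketch") -> "PandasLikeCompilerSketch":
        # Assumed policy: refuse to guess when the operands differ in execution.
        if (self.engine, self.storage_format) != (other.engine, other.storage_format):
            raise ValueError("operands use different engines or storage formats")
        summed = [a + b for a, b in zip(self._data, other._data)]
        # The result keeps the execution of its inputs.
        return PandasLikeCompilerSketch(summed, self.engine, self.storage_format)


class StaticCompilerSketch(BaseQueryCompilerSketch):
    """A compiler that never changes execution can return constants."""

    @property
    def engine(self) -> str:
        return "Snowflake"  # placeholder value

    @property
    def storage_format(self) -> str:
        return "Snowflake"  # placeholder value


class DataFrameSketch:
    """API-level object that reads both values from its query compiler."""

    def __init__(self, query_compiler: BaseQueryCompilerSketch):
        self._query_compiler = query_compiler

    @property
    def engine(self) -> str:
        return self._query_compiler.engine

    @property
    def storage_format(self) -> str:
        return self._query_compiler.storage_format
```

Because the API-level object only forwards to its query compiler, no extra bookkeeping is needed at the `DataFrame`/`Series` level in this sketch; the result of `add` carries its execution with it.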