DRAFT: Track engine and storage format at query compiler level. #7424

sfc-gh-mvashishtha · 2025-01-18T06:55:08Z

Summary:

- Keep the modin.config environment variable Engine, though we should probably rename this variable to DefaultEngine. Likewise for StorageFormat. These variables represent the execution method that we use for new dataframes created from something other than an existing Modin dataframe/series. For example, we would use those variables to choose an execution method for the result of pd.DataFrame([[0, 1]]) or of pd.read_csv().
- Require query compilers to implement immutable properties for their engine and storage_format. Each query compiler needs to know these properties because we never want to be guessing a compiler's execution.
- For PandasQueryCompiler, which we use for PandasOnPython, PandasOnRay, PandasOnDask, and PandasOnUnidist (including on MPI) executions, make the engine and storage_format required constructor parameters so that we
- SnowflakeQueryCompiler can return a static engine and storage format because it never changes engine / storage format.
- All the query compilers in Modin continue to assume that methods that take multiple query compilers (e.g. joins) take query compilers that all use the same execution. For example, a pandas-on-python query compiler's add method assumes that the RHS is a pandas-on-python query compiler. QueryCompilerCaster helps somewhat with this issue, but remember that while it checks for different classes, it does not consider storage format or engine.
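
A rough sketch of the shape described in the summary, not the code in this PR. Every class and function name below (BaseQueryCompilerSketch, PandasQueryCompilerSketch, SnowflakeQueryCompilerSketch, same_execution) is a hypothetical stand-in, and the "Snowflake" strings are placeholder values. The same_execution helper only illustrates the kind of check that a class-only comparison would miss; Modin does not necessarily provide such a helper.

```python
from abc import ABC, abstractmethod


class BaseQueryCompilerSketch(ABC):
    """Hypothetical stand-in for a query compiler base class."""

    @property
    @abstractmethod
    def engine(self) -> str:
        """Engine of this compiler, e.g. "Python", "Ray", "Dask"."""

    @property
    @abstractmethod
    def storage_format(self) -> str:
        """Storage format of this compiler, e.g. "Pandas"."""


class PandasQueryCompilerSketch(BaseQueryCompilerSketch):
    """Engine and storage format are required constructor parameters."""

    def __init__(self, data, engine: str, storage_format: str):
        self._data = data
        self._engine = engine
        self._storage_format = storage_format

    # Read-only properties: no setter is defined, so assignment
    # raises AttributeError.
    @property
    def engine(self) -> str:
        return self._engine

    @property
    def storage_format(self) -> str:
        return self._storage_format


class SnowflakeQueryCompilerSketch(BaseQueryCompilerSketch):
    """A compiler whose execution never changes can return constants."""

    @property
    def engine(self) -> str:
        return "Snowflake"  # placeholder value (assumption)

    @property
    def storage_format(self) -> str:
        return "Snowflake"  # placeholder value (assumption)


def same_execution(*compilers: BaseQueryCompilerSketch) -> bool:
    """Return True if all compilers share class, engine and storage format."""
    executions = {(type(qc), qc.engine, qc.storage_format) for qc in compilers}
    return len(executions) <= 1
```

In this sketch, two PandasQueryCompilerSketch objects built with engine="Ray" and engine="Python" have the same class but fail same_execution, which is the case that a class-only check does not catch.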

Design / UI considerations:

Initially we were thinking of tracking engine + storage format at the DataFrame / Series level, but I found that tracking the execution method that way required changing more code than the query-compiler-level approach in this PR did. Also, the query compiler can already tell us that the result of most operations will preserve its engine, whereas with DataFrame-level tracking we had to add a somewhat redundant tracking of engine and storage_format every time we used a query compiler method at the API level.
- I tried out the dataframe-level tracking and posted a preliminary draft here: DRAFT: (not recommended) Track engine and storage format at DataFrame/Series level #7425
In this draft, Series and DataFrame expose the variables engine and storage_format, which they get from their query compilers. So far Modin has avoided providing non-underscored variables that don't match pandas (going so far as to underscore to_pandas). We do want users to be able to check engine and storage_format. Maybe we should add leading underscores for consistency with to_pandas.
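
As an illustration only, the delegation could look roughly like the following. QueryCompilerStub and DataFrameSketch are hypothetical names, not Modin classes. The sketch assumes the frame stores its compiler in a private attribute and forwards both properties to it, so the frame holds no separate copy of the execution to keep in sync.

```python
class QueryCompilerStub:
    """Hypothetical minimal query compiler that knows its own execution."""

    def __init__(self, engine: str, storage_format: str):
        self._engine = engine
        self._storage_format = storage_format

    @property
    def engine(self) -> str:
        return self._engine

    @property
    def storage_format(self) -> str:
        return self._storage_format


class DataFrameSketch:
    """Hypothetical API-level object that forwards to its query compiler."""

    def __init__(self, query_compiler: QueryCompilerStub):
        self._query_compiler = query_compiler

    @property
    def engine(self) -> str:
        # No state is stored here: the query compiler is the single
        # source of truth for the execution.
        return self._query_compiler.engine

    @property
    def storage_format(self) -> str:
        return self._query_compiler.storage_format


df = DataFrameSketch(QueryCompilerStub(engine="Python", storage_format="Pandas"))
print(df.engine, df.storage_format)
```

Renaming these to _engine and _storage_format, as suggested for consistency with to_pandas, would only change the property names in this sketch.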

Productionising this PR:

I have neglected to add the engine and storage_format to some points where we construct PandasQueryCompiler, so we have to fix those.

Signed-off-by: sfc-gh-mvashishtha <[email protected]>

Track engine and storage format at query compiler level.

a471636

Signed-off-by: sfc-gh-mvashishtha <[email protected]>