DRAFT: Track engine and storage format at query compiler level. #7424

Draft: wants to merge 1 commit into base: hybrid-execution

sfc-gh-mvashishtha (Contributor) commented Jan 18, 2025

Demo: https://github.com/sfc-gh-mvashishtha/modin/blob/mvashishtha/SNOW-1856014/track-engine-and-storage-format-at-query-compiler-level/per_dataframe_engine_demo.ipynb

Summary:

  • Keep the modin.config environment variable Engine, though we should probably rename it to DefaultEngine. Likewise for StorageFormat. These variables represent the execution method that we use for new dataframes created from something other than an existing Modin dataframe/series. For example, we would use them to choose an execution method for the result of pd.DataFrame([[0, 1]]) or of pd.read_csv().
  • Require query compilers to implement immutable properties for their engine and storage_format. Each query compiler needs to know these properties because we never want to guess a compiler's execution.
  • For PandasQueryCompiler, which we use for PandasOnPython, PandasOnRay, PandasOnDask, and PandasOnUnidist (including on MPI) executions, make the engine and storage_format required constructor parameters so that we
  • SnowflakeQueryCompiler can return a static engine and storage format because it never changes engine / storage format.
  • All the query compilers in Modin continue to assume that methods taking multiple query compilers (e.g. joins) receive query compilers that all use the same execution. For example, a pandas-on-python query compiler's add method assumes that the RHS is also a pandas-on-python query compiler. QueryCompilerCaster helps somewhat with this issue, but note that it only checks whether the query compiler classes differ; it does not consider storage format or engine.
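
As a rough illustration of the summary above, here is a minimal sketch in plain Python that does not import Modin. Every name in it (QueryCompilerSketch, PandasQueryCompilerSketch, SnowflakeQueryCompilerSketch, require_same_execution, from_python_data, and the DEFAULT_* constants) is hypothetical and is not the code in this PR. The "Snowflake" strings are placeholders, and the explicit execution check in add is an assumption about how the same-execution requirement could be enforced, not something the current query compilers do.

```python
from abc import ABC, abstractmethod

# Hypothetical stand-ins for the Engine and StorageFormat config variables,
# i.e. the defaults used for data that does not come from a Modin object.
DEFAULT_ENGINE = "Python"
DEFAULT_STORAGE_FORMAT = "Pandas"


class QueryCompilerSketch(ABC):
    """Hypothetical base class: every compiler must report its execution."""

    @property
    @abstractmethod
    def engine(self) -> str:
        """Engine that runs this compiler's operations."""

    @property
    @abstractmethod
    def storage_format(self) -> str:
        """Storage format that holds this compiler's data."""


def require_same_execution(left, right):
    # Compare storage format and engine, not just the compiler classes.
    if (left.storage_format, left.engine) != (right.storage_format, right.engine):
        raise ValueError("operands use different executions")


class PandasQueryCompilerSketch(QueryCompilerSketch):
    def __init__(self, values, *, engine, storage_format):
        # Keyword-only and required: omitting either raises TypeError.
        self._values = list(values)
        self._engine = engine
        self._storage_format = storage_format

    # Read-only properties: there is no setter, so assignment fails.
    @property
    def engine(self):
        return self._engine

    @property
    def storage_format(self):
        return self._storage_format

    def add(self, other):
        require_same_execution(self, other)
        summed = [a + b for a, b in zip(self._values, other._values)]
        # The result keeps the execution of its operands.
        return PandasQueryCompilerSketch(
            summed, engine=self._engine, storage_format=self._storage_format
        )


class SnowflakeQueryCompilerSketch(QueryCompilerSketch):
    # Static values, because this compiler never changes execution.
    @property
    def engine(self):
        return "Snowflake"  # placeholder name

    @property
    def storage_format(self):
        return "Snowflake"  # placeholder name


def from_python_data(values):
    # New data that is not derived from an existing object uses the defaults.
    return PandasQueryCompilerSketch(
        values, engine=DEFAULT_ENGINE, storage_format=DEFAULT_STORAGE_FORMAT
    )
```

In this sketch, a constructor call that omits engine or storage_format fails immediately with a TypeError, which is one way missing call sites could surface early.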

Design / UI considerations:

  • Initially we were thinking of tracking engine + storage format at the DataFrame / Series level, but I found that tracking the execution method that way required changing more code than tracking it at the query compiler level did. Also, the query compiler can already tell us that the result of most operations will preserve its engine, whereas with DataFrame / Series-level tracking we had to add somewhat redundant tracking of engine and storage_format every time we used a query compiler method at the API level.
  • In this draft, Series and DataFrame expose the variables engine and storage_format, which they get from their query compilers. So far Modin has avoided providing non-underscored variables that don't match pandas (going so far as to underscore to_pandas as _to_pandas). We do want users to be able to check engine and storage_format. Maybe we should add leading underscores for consistency with _to_pandas.
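
The delegation described in the last bullet could look roughly like the sketch below. StubQueryCompiler and DataFrameSketch are hypothetical names used only for illustration; the sketch does not use Modin's real classes, and the _query_compiler attribute name is an assumption.

```python
class StubQueryCompiler:
    """Hypothetical query compiler that knows its own execution."""

    def __init__(self, *, engine, storage_format):
        self._engine = engine
        self._storage_format = storage_format

    @property
    def engine(self):
        return self._engine

    @property
    def storage_format(self):
        return self._storage_format


class DataFrameSketch:
    """Hypothetical API-level object that stores no execution state itself."""

    def __init__(self, query_compiler):
        self._query_compiler = query_compiler

    @property
    def engine(self):
        # Read through to the query compiler instead of tracking a copy.
        return self._query_compiler.engine

    @property
    def storage_format(self):
        return self._query_compiler.storage_format


df = DataFrameSketch(StubQueryCompiler(engine="Ray", storage_format="Pandas"))
print(df.storage_format, df.engine)
```

Because the API-level object only reads through to its query compiler, there is a single source of truth and nothing to keep in sync when an operation returns a new query compiler.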

Productionising this PR:

  • I have neglected to pass engine and storage_format at some of the places where we construct PandasQueryCompiler, so we have to fix those.
