Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DOP-9787] Improve read strategies in DBReader #182

Merged
merged 1 commit into from
Dec 8, 2023

Conversation

dolfinus
Copy link
Member

@dolfinus dolfinus commented Dec 6, 2023

Change Summary

  1. Previously IncrementalStrategy was implemented like this:
  • Get table schema by making query SELECT * FROM table WHERE 1=0 (if DBReader.columns contains *)
  • Append HWM column to list of table columns and remove duplicated columns.
  • Create dataframe from query like SELECT hwm.expression AS hwm.column, ...other table columns... FROM table WHERE prev_hwm.expression > prev_hwm.value.
  • Determine HWM class by df.schema[hwm.column].dataType
  • Calculate df.select(min(hwm.column), max(hwm.column)).collect() on Spark side.
  • Use max(hwm.column) as next HWM value.
  • Return dataframe to user.

This was far from ideal:

  • Dataframe content (all rows or just changed ones) was loaded from the source to Spark only to get min/max values of specific column.

  • Step of fetching table schema and then substituting column names in the following query may cause errors.

    For example, source contains columns with mixed name case, like "MyColumn" and "My column".
    Column names were not escaped during query generation, leading to queries that cannot be executed by database.
    So users have to explicitly set proper columns list with wrapping them with ".

  • Dataframe was created from query with clause like WHERE hwm.expression > prev_hwm.value,
    not WHERE hwm.expression > prev_hwm.value AND hwm.expression <= current_hwm.value.

    So if new rows appeared in the source after HWM value is determined, these rows may be read by DBReader on the first run,
    and then again on the next run, because they are returned by WHERE hwm.expression > prev_hwm.value query.

Now it looks like this:

  • Get type of HWM expression: SELECT hwm.expression FROM table WHERE 1=0
  • Determine HWM class by df.schema[0].
  • Get min/max values by querying SELECT MIN(hwm.expression), MAX(hwm.expression) FROM table WHERE hwm.expression >= prev_hwm.value.
  • Use max(hwm.column) as next HWM value.
  • Create dataframe from query SELECT * FROM table WHERE hwm.expression > prev_hwm.value AND hwm.expression <= current_hwm.value, and return it to user.

Improvements:

  • Allow source to calculate min/max instead of loading everything to Spark. This should be really fast, and also source can use indexes to speed this up even more.
  • Restrict dataframe content to always match HWM values.
  • Don't mess up with columns list, just pass them to source as-is. So DBReader does not fail on tables with mixed column naming.

Breaking change - HWM column is not being implicitly added to dataframe.
If it was not just some column value but some expression which then used in your code by accessing dataframe column,
you should explicitly add same expression to DBReader.columns.

  1. Add 2 internal classes Edge and Window, replacing old Statement class. They represent column > edge / column >= edge / column <= edge parts of WHERE clause. Merged separated arguments start_from/end_at with window in DBConnection methods, added apply_window method to DBDialect which converts these classes to WHERE clause (replaces condition_assembler).

  2. Get rid of StrategyHelper class, move corresponding logic to DBReader itself. Also update Strategy classes, properties current/next return Edge which simplifies implementation and allows to drop current_comparator/next_comparator.

  3. Improved checks for different edge cases

  • If HWMStore returned HWM of wrong type, an exception is raised.
  • If multiple DBReaders with different HWMs are used within the same HWMStrategy, an exception is being raised.
  • If HWMStore returned HWM with different entity or expression, a warning is shown and DBReader.hwm has higher priority than attributes from HWMStore. This may be changed in next PRs.

Related issue number

Checklist

  • Commit message and PR title is comprehensive
  • Keep the change as small as possible
  • Unit and integration tests for the changes exist
  • Tests pass on CI and coverage does not decrease
  • Documentation reflects the changes where applicable
  • docs/changelog/next_release/<pull request or issue id>.<change type>.rst file added describing change
    (see CONTRIBUTING.rst for details.)
  • My PR is ready to review.

@dolfinus dolfinus force-pushed the feature/DOP-9787-window branch from d4080ad to 1fb42db Compare December 6, 2023 18:08
@dolfinus dolfinus force-pushed the feature/DOP-9787-window branch from 1fb42db to 81d3d8d Compare December 6, 2023 18:12
@dolfinus dolfinus force-pushed the feature/DOP-9787-window branch from 81d3d8d to 92a56b8 Compare December 6, 2023 18:20
@dolfinus dolfinus force-pushed the feature/DOP-9787-window branch from 92a56b8 to 183b93d Compare December 6, 2023 18:36
@dolfinus dolfinus force-pushed the feature/DOP-9787-window branch from 183b93d to 2c36e01 Compare December 6, 2023 19:03
@dolfinus dolfinus force-pushed the feature/DOP-9787-window branch from 2c36e01 to bdebfd4 Compare December 6, 2023 19:06
@dolfinus dolfinus force-pushed the feature/DOP-9787-window branch from bdebfd4 to 522513b Compare December 6, 2023 19:28
@dolfinus dolfinus force-pushed the feature/DOP-9787-window branch from 522513b to 28b5492 Compare December 6, 2023 20:04
@dolfinus dolfinus force-pushed the feature/DOP-9787-window branch from 28b5492 to 8e44dcc Compare December 6, 2023 20:10
@dolfinus dolfinus force-pushed the feature/DOP-9787-window branch from 8e44dcc to 693b151 Compare December 7, 2023 12:25
@dolfinus dolfinus force-pushed the feature/DOP-9787-window branch from 693b151 to 4c433bf Compare December 7, 2023 12:32
@dolfinus dolfinus force-pushed the feature/DOP-9787-window branch from 4c433bf to d7078b8 Compare December 7, 2023 12:38
@dolfinus dolfinus self-assigned this Dec 7, 2023
@dolfinus dolfinus force-pushed the feature/DOP-9787-window branch from d7078b8 to b2fe96a Compare December 7, 2023 13:04
Copy link

codecov bot commented Dec 7, 2023

Codecov Report

Attention: 22 lines in your changes are missing coverage. Please review.

Comparison is base (4092134) 94.08% compared to head (cdfb491) 94.34%.

Files Patch % Lines
onetl/db/db_reader/db_reader.py 84.90% 9 Missing and 7 partials ⚠️
onetl/strategy/hwm_strategy.py 93.75% 1 Missing and 2 partials ⚠️
...etl/connection/db_connection/mongodb/connection.py 89.47% 1 Missing and 1 partial ⚠️
onetl/strategy/base_strategy.py 80.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #182      +/-   ##
===========================================
+ Coverage    94.08%   94.34%   +0.25%     
===========================================
  Files          205      204       -1     
  Lines         7744     7671      -73     
  Branches      1400     1376      -24     
===========================================
- Hits          7286     7237      -49     
+ Misses         332      318      -14     
+ Partials       126      116      -10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dolfinus dolfinus force-pushed the feature/DOP-9787-window branch from c1a29e0 to 8529626 Compare December 7, 2023 21:27
@dolfinus dolfinus force-pushed the feature/DOP-9787-window branch from 8529626 to 56a449b Compare December 7, 2023 21:33
@dolfinus dolfinus force-pushed the feature/DOP-9787-window branch from 56a449b to 771b0d7 Compare December 7, 2023 22:00
@dolfinus dolfinus force-pushed the feature/DOP-9787-window branch from 771b0d7 to 270bb3b Compare December 7, 2023 22:02
@dolfinus dolfinus force-pushed the feature/DOP-9787-window branch from 270bb3b to fabd05a Compare December 7, 2023 22:11
@dolfinus dolfinus force-pushed the feature/DOP-9787-window branch from fabd05a to e90ce19 Compare December 7, 2023 22:38
@dolfinus dolfinus marked this pull request as ready for review December 8, 2023 07:58
@dolfinus dolfinus merged commit 2806f97 into develop Dec 8, 2023
47 of 48 checks passed
@dolfinus dolfinus deleted the feature/DOP-9787-window branch December 8, 2023 13:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants