[DOP-9787] Improve read strategies in DBReader #182
Merged
Conversation
Codecov Report

Attention:

Additional details and impacted files:

@@            Coverage Diff             @@
##           develop     #182      +/-   ##
===========================================
+ Coverage    94.08%   94.34%   +0.25%
===========================================
  Files          205      204       -1
  Lines         7744     7671      -73
  Branches      1400     1376      -24
===========================================
- Hits          7286     7237      -49
+ Misses         332      318      -14
+ Partials       126      116      -10
===========================================

☔ View full report in Codecov by Sentry.
maxim-lixakov approved these changes Dec 8, 2023
Change Summary
Previously, reading data with an incremental strategy worked like this:

1. Fetch the table schema with `SELECT * FROM table WHERE 1=0` (if `DBReader.columns` contains `*`).
2. Create a dataframe from the query `SELECT hwm.expression AS hwm.column, ...other table columns... FROM table WHERE prev_hwm.expression > prev_hwm.value`.
3. Determine the HWM column type from `df.schema[hwm.column].dataType`.
4. Calculate `df.select(min(hwm.column), max(hwm.column)).collect()` on the Spark side.
5. Use `max(hwm.column)` as the next HWM value.
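For illustration, here is a rough PySpark sketch of that old flow, assuming a JDBC-accessible Postgres table. The table, column names and connection options are made up for the example and are not taken from the PR:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

prev_hwm = "2023-12-01 00:00:00"  # value taken from the HWM store on a previous run

# Old flow: every matching row is loaded into Spark...
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")  # illustrative connection options
    .option("user", "reader")
    .option("password", "secret")
    .option("query", f"SELECT * FROM public.orders WHERE updated_at > '{prev_hwm}'")
    .load()
)

# ...only to compute min/max of the HWM column on the Spark side.
bounds = df.select(F.min("updated_at"), F.max("updated_at")).collect()[0]
next_hwm = bounds[1]  # max(hwm.column) becomes the next HWM value
```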
This was far from ideal:

- Dataframe content (all rows, or just the changed ones) was loaded from the source into Spark only to get the min/max values of a specific column.
- Fetching the table schema and then substituting column names into the following query could cause errors. For example, if the source contains columns with mixed-case names, like `"MyColumn"` and `"My column"`, the column names were not escaped during query generation, producing queries the database cannot execute. Users therefore had to explicitly set a proper column list, wrapping the names with `"`.
- The dataframe was created from a query with a clause like `WHERE hwm.expression > prev_hwm.value`, not `WHERE hwm.expression > prev_hwm.value AND hwm.expression <= current_hwm.value`. So if new rows appeared in the source after the HWM value was determined, they could be read by DBReader on the first run, and then again on the next run, because they are still returned by the `WHERE hwm.expression > prev_hwm.value` query.
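The last point is easiest to see with a tiny simulation. This is a minimal, purely illustrative sketch in plain Python (no Spark or database involved); the values are made up:

```python
source = [1, 2, 3]   # HWM column values currently present in the source table
prev_hwm = 0         # HWM value saved by the previous run

# The next HWM value is determined first (max of everything above prev_hwm)...
next_hwm = max(v for v in source if v > prev_hwm)      # -> 3

# ...but a new row arrives before the dataframe is actually materialized.
source.append(4)

# Old behaviour: the read query has no upper bound, so the new row is read now...
run_1 = [v for v in source if v > prev_hwm]            # [1, 2, 3, 4]

# ...and again on the next run, because 4 is still above the saved HWM (3).
run_2 = [v for v in source if v > next_hwm]            # [4]  <- already returned by run_1

# With a bounded query the first run would stop at next_hwm instead:
run_1_bounded = [v for v in source if prev_hwm < v <= next_hwm]   # [1, 2, 3]
```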
Now it looks like this:

1. Detect the HWM expression type with `SELECT hwm.expression FROM table WHERE 1=0` and `df.schema[0]`.
2. Calculate min/max values in the database itself: `SELECT MIN(hwm.expression), MAX(hwm.expression) FROM table WHERE hwm.expression >= prev_hwm.value`.
3. Use `max(hwm.column)` as the next HWM value.
4. Create the dataframe from the query `SELECT * FROM table WHERE hwm.expression > prev_hwm.value AND hwm.expression <= current_hwm.value`, and return it to the user.
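As a rough PySpark sketch of these steps (again with made-up table/column names and connection options, not the actual onETL internals):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_options = {
    "url": "jdbc:postgresql://host:5432/db",  # illustrative connection options
    "user": "reader",
    "password": "secret",
}
prev_hwm = "2023-12-01 00:00:00"

# Min/max of the HWM expression are now calculated by the database itself,
# so Spark receives a single row instead of the whole changed dataset.
bounds = (
    spark.read.format("jdbc")
    .options(**jdbc_options)
    .option(
        "query",
        "SELECT MIN(updated_at) AS min_hwm, MAX(updated_at) AS max_hwm "
        f"FROM public.orders WHERE updated_at >= '{prev_hwm}'",
    )
    .load()
    .collect()[0]
)
current_hwm = bounds["max_hwm"]

# The user's dataframe is bounded on both sides, so rows added after
# current_hwm was fixed cannot leak into this run or be read twice.
df = (
    spark.read.format("jdbc")
    .options(**jdbc_options)
    .option(
        "query",
        f"SELECT * FROM public.orders WHERE updated_at > '{prev_hwm}' "
        f"AND updated_at <= '{current_hwm}'",
    )
    .load()
)
```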
Improvements:

- `DBReader` does not fail on tables with mixed column naming.
- Breaking change: the HWM column is no longer implicitly added to the dataframe. If it was not just a column value but an expression that your code then accessed as a dataframe column, you should explicitly add the same expression to `DBReader.columns`.
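A hypothetical sketch of that migration. The connection, table, HWM name and expression are illustrative, and the exact `DBReader` argument names (including `hwm=DBReader.AutoDetectHWM(...)`) are assumptions; check the onETL documentation for your version:

```python
from pyspark.sql import SparkSession

from onetl.connection import Postgres
from onetl.db import DBReader

spark = SparkSession.builder.getOrCreate()

# Illustrative connection; host and credentials are placeholders.
postgres = Postgres(
    host="host",
    user="reader",
    password="secret",
    database="db",
    spark=spark,
)

reader = DBReader(
    connection=postgres,
    source="public.orders",
    columns=[
        "*",
        # The HWM expression is no longer added implicitly; if downstream code
        # reads this column from the dataframe, list it here explicitly.
        "CAST(updated_at AS DATE) AS updated_date",
    ],
    # Assumption: HWM configured via DBReader.hwm, as in recent onETL versions.
    hwm=DBReader.AutoDetectHWM(
        name="orders_updated_date_hwm",
        expression="CAST(updated_at AS DATE)",
    ),
)

df = reader.run()
updated_dates = df.select("updated_date")  # present only because it was listed in columns
```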
- Added 2 internal classes, `Edge` and `Window`, replacing the old `Statement` class. They represent the `column > edge` / `column >= edge` / `column <= edge` parts of a WHERE clause. The separate `start_from` / `end_at` arguments of `DBConnection` methods were merged into `window`, and an `apply_window` method was added to `DBDialect`, which converts these classes to a WHERE clause (replacing `condition_assembler`).
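The following is a rough sketch of the idea behind these classes, not the actual onETL definitions; the fields, the `is_set` helper and the rendering logic are assumptions made for the example:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Edge:
    """One boundary of a window: a value plus whether the comparison is inclusive."""

    value: Optional[Any] = None
    including: bool = True

    def is_set(self) -> bool:
        return self.value is not None


@dataclass
class Window:
    """A pair of edges over a column/expression, i.e. the WHERE-clause window."""

    expression: str
    start_from: Edge
    stop_at: Edge


def apply_window(window: Window) -> str:
    """Render a Window as a WHERE-clause fragment (conceptually what DBDialect.apply_window does)."""
    parts = []
    if window.start_from.is_set():
        op = ">=" if window.start_from.including else ">"
        parts.append(f"{window.expression} {op} {window.start_from.value!r}")
    if window.stop_at.is_set():
        op = "<=" if window.stop_at.including else "<"
        parts.append(f"{window.expression} {op} {window.stop_at.value!r}")
    return " AND ".join(parts)


# Example: prints "id > 100 AND id <= 200"
print(apply_window(Window("id", Edge(100, including=False), Edge(200, including=True))))
```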
- Got rid of the `StrategyHelper` class and moved the corresponding logic into `DBReader` itself. Also updated the `Strategy` classes: the `current` / `next` properties now return an `Edge`, which simplifies the implementation and allows dropping `current_comparator` / `next_comparator`.
- Improved checks for different edge cases.
- `DBReader.hwm` has higher priority than attributes from the HWMStore. This may be changed in next PRs.

Related issue number
Checklist

- `docs/changelog/next_release/<pull request or issue id>.<change type>.rst` file added describing the change (see CONTRIBUTING.rst for details).