You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I have searched the existing issues, and I could not find an existing issue for this feature
I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion
Describe the feature
The delete+insert incremental strategy creates a temporary table with the new dataset, then it executes a DELETE statement using the temporary table as the filter condition. Below is an example of the generated statement:
deletefrom"source_db"."dbt"."source_table" using "source_table__dbt_tmp173111887980"where
(
"source_table__dbt_tmp173111887980".event_time ="source_db"."dbt"."source_table".event_time
);
The current approach uses the unique_key specified in the model, and it probably works fine for most relational databases. For data warehouses like Redshift, this approach is very slow due to lack of primary and unique keys. As a result, the delete operation becomes very slow due to sequential scanning.
In order to work around this, I created a custom incremental strategy for a customer using the approach described here. The goal was to be able to provide a flexible way to specify custom predicates that works well with the underlying data and storage. The custom incremental strategy I wrote uses the existing incremental_predicates to build the delete statement. The custom incremental strategy ignores the unique_key to keep the implementation simple, and easy to switch to other strategies.
Currently, if the merge strategy is set to delete+insert but no unique_key was specified, it is as if you used an append strategy. This behavior can be seen in the macro default__get_delete_insert_merge_sql. I am not sure if this was a deliberate decision.
As such, an alternative approach to creating a separate incremental strategy is to modify the existing default__get_delete_insert_merge_sql to allow users to not specify a unique_key, and only rely on the incremental_predicates.
Who will this benefit?
Users who are working with data warehouses like Redshift. Date columns that are used as sort keys can significantly benefit from this.
Here is an example: consider a model in Redshift that was created with the following config:
The performance difference in Redshift compared to a delete join is huge. I had one model where deleting 10M records took 30 minutes using the join approach. With the custom delete+insert strategy, this was reduced to 10 seconds due to the optimizations we can do in incremental_predicates
Are you interested in contributing this feature?
Yes
Anything else?
No response
The text was updated successfully, but these errors were encountered:
Is this your first time submitting a feature request?
Describe the feature
The
delete+insert
incremental strategy creates a temporary table with the new dataset, then it executes a DELETE statement using the temporary table as the filter condition. Below is an example of the generated statement:The current approach uses the
unique_key
specified in the model, and it probably works fine for most relational databases. For data warehouses like Redshift, this approach is very slow due to lack of primary and unique keys. As a result, the delete operation becomes very slow due to sequential scanning.In order to work around this, I created a custom incremental strategy for a customer using the approach described here. The goal was to be able to provide a flexible way to specify custom predicates that works well with the underlying data and storage. The custom incremental strategy I wrote uses the existing incremental_predicates to build the delete statement. The custom incremental strategy ignores the
unique_key
to keep the implementation simple, and easy to switch to other strategies.CC: @goodjira and @schmiman
Describe alternatives you've considered
Currently, if the merge strategy is set to
delete+insert
but nounique_key
was specified, it is as if you used anappend
strategy. This behavior can be seen in the macrodefault__get_delete_insert_merge_sql
. I am not sure if this was a deliberate decision.As such, an alternative approach to creating a separate incremental strategy is to modify the existing
default__get_delete_insert_merge_sql
to allow users to not specify aunique_key
, and only rely on theincremental_predicates
.Who will this benefit?
Users who are working with data warehouses like Redshift. Date columns that are used as sort keys can significantly benefit from this.
Here is an example: consider a model in Redshift that was created with the following config:
The resulting delete SQL will be this:
The performance difference in Redshift compared to a delete join is huge. I had one model where deleting 10M records took 30 minutes using the join approach. With the custom delete+insert strategy, this was reduced to 10 seconds due to the optimizations we can do in
incremental_predicates
Are you interested in contributing this feature?
Yes
Anything else?
No response
The text was updated successfully, but these errors were encountered: