
implement import-db command #10040

Merged: 21 commits into iterative:main, Nov 21, 2023
Conversation

@skshetry (Member) commented Oct 19, 2023


(asciicast demo recording)

$ dvc import-db --help
usage: dvc import-db [-h] [-q | -v] [--url URL] [--rev [<commit>]] [--project-dir [PROJECT_DIR]] [--sql SQL | --model MODEL] [--profile PROFILE] [--target TARGET] [--export-format [{csv,json}]] [-o [<path>]] [-f]

Download file or directory tracked by DVC or by Git into the workspace, and track it.
Documentation: <https://man.dvc.org/import>

options:
  -h, --help            show this help message and exit
  -q, --quiet           Be quiet.
  -v, --verbose         Be verbose.
  --url URL             Location of DVC or Git repository to download from
  --rev [<commit>]      Git revision (e.g. SHA, branch, tag)
  --project-dir [PROJECT_DIR]
                        Subdirectory to the dbt project location
  --sql SQL             SQL query
  --model MODEL         Model name to download
  --profile PROFILE     Profile to use
  --target TARGET       Target to use
  --export-format [{csv,json}]
                        Export format
  -o [<path>], --out [<path>]
                        Destination path to download files to
  -f, --force           Override destination file or folder if exists.

About dbt

dbt parses the dbt_project.yml file, which is how it knows a directory is a dbt project.
It is usually at the root of the repo, but in principle it can be anywhere. dbt reads this file and obtains the profile name.

# Example dbt_project.yml file
name: 'jaffle_shop'
profile: 'jaffle_shop'
...

Then, dbt looks for the profiles.yml file: first in the DBT_PROFILES_DIR envvar (if set), then in the current working directory, and then in the ~/.dbt directory. profiles.yml holds connection details for different warehouses; dbt looks for a matching profile name in that file.

A profile can have multiple targets, and each target specifies separate credentials/connections to a warehouse. This is usually used "to encourage the use of separate development and production environments". There is always a default target, which you can change with --target (likewise, --profile overrides the profile).

# example profiles.yml file
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      user: alice
      password: <password>
      port: 5432
      dbname: jaffle_shop
      schema: dbt_alice
      threads: 4

See https://docs.getdbt.com/docs/core/connect-data-platform/connection-profiles.

dbt connects to various data platforms through adapters, which we are trying to leverage to avoid handling AuthN/AuthZ/RBAC ourselves.

Implementation

The current implementation uses the dbt show command to download results to a file. Although dbt provides an API to call commands programmatically, its return type is not guaranteed to be stable and is subject to change. This PR uses the result from the command, which includes an agate.Table that is converted to csv/json format. There is no streaming support, so everything is kept in memory until it is exported, but the performance is good enough for the ~1M-rows range. Using dbt show, a very high-level API, may give us the benefit of dbt's own performance optimizations and any cost-effective paths its adapters offer for select queries (some data warehouses have alternative means to fetch results cheaply), and also lets us support dbt's select syntax.

A sql query passed via --sql can be run without a dbt project; in that case the adapter is used directly (although dvc still tries to find a dbt project at the root of the dvc repo and uses it if present; this behaviour is subject to change).
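To make the in-memory export concrete, here is a stdlib-only sketch of the export step described above: the full query result is held in memory (as dbt's agate.Table is) and only then serialized to csv or json. The function and parameter names are illustrative, not dvc's actual internals.

```python
import csv
import io
import json


def export_rows(columns, rows, fmt="csv"):
    """Serialize an in-memory result set to csv or json (no streaming)."""
    buf = io.StringIO()
    if fmt == "csv":
        writer = csv.writer(buf)
        writer.writerow(columns)   # header row
        writer.writerows(rows)     # all rows are already in memory
    elif fmt == "json":
        # one object per row, keyed by column name
        json.dump([dict(zip(columns, row)) for row in rows], buf)
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return buf.getvalue()
```

Because everything passes through a single in-memory buffer, memory usage grows linearly with the result size, which is why the description above caps the practical range at roughly 1M rows.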

import-db command

The command has two different modes:

  1. --sql
    With --sql, you can run a raw sql query and have it exported out to a file.

    dvc import-db --sql 'select * from table'

    This will save to a file named results.csv and create a results.csv.dvc file.
If you are inside a dbt project, dvc will read dbt_project.yml and try to figure out the connection profile. Otherwise, you will need to pass --profile/--target or set them in the config. There is --export-format={csv,json} support, and you can change the output path with --out (the default is "results.{export_format}").

  2. --model
    With --model, you are exporting the dbt models. If you are inside a dbt project (which also happens to be a dvc project), you can just do:

    dvc import-db --model model_name

    If you need to import from an external dbt project, you can pass --url (and, optionally, --rev). If the dbt project is inside a subdirectory instead, you can pass --project-dir as well. Similarly, you can provide --profile/--target to override the connection profile to use for this operation.

    dvc import-db --url <url> --rev <rev>

dbt has a notion of packages, where you can use models from another package in your project. A dbt model can have versions too. These are not exposed yet, but are supported internally in the implementation.

--output-format is csv by default, but can be changed to json. --out can be used to change the output path (by default, it's "{model}.{export_format}").

Config

db_profile and db_target can be set in the config so that they apply globally. This can be used, for example, in CI to switch to a production environment. They won't appear in the .dvc file.

dvc config feature.db_profile profile
dvc config feature.db_target target

(On stabilization, we'll rename this config to db.profile and db.target respectively.)

profiles

This implementation modifies the way profiles.yml is discovered. DVC will first try to find profiles.yml in DBT_PROFILES_DIR (if set), then check the root of the repo, and then the ~/.dbt directory. This is almost the same as dbt, except that the second place dbt looks is the current working directory (which for dvc happens to be the root of the repo).

The root of the repo is dvc's root_dir, except for --url, where it's the root of the git repo. With --project-dir, it's the root of that dbt project.
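The lookup order above can be sketched as a small function. This is illustrative only; the function name and return convention are hypothetical, not dvc's actual code.

```python
import os
from pathlib import Path


def find_profiles_dir(root_dir: str, env=os.environ) -> "Path | None":
    """Return the first directory containing profiles.yml, per the order above."""
    candidates = []
    if env.get("DBT_PROFILES_DIR"):          # 1. explicit env override
        candidates.append(Path(env["DBT_PROFILES_DIR"]))
    candidates.append(Path(root_dir))        # 2. repo root (git root for --url)
    candidates.append(Path.home() / ".dbt")  # 3. dbt's global fallback
    for directory in candidates:
        if (directory / "profiles.yml").is_file():
            return directory
    return None
```

The only behavioral difference from stock dbt is step 2: dvc pins the lookup to the repo root rather than whatever the current working directory happens to be.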

.dvc changes

  1. --sql

    md5: 0f6fef5da8bde7f8db2a4d08da535dd7
    frozen: true
    deps:
    - db:
        file_format: csv
        profile: jaffle_shop
        query: select * from customers
    outs:
    - md5: d0c75044f6fc86297e9449c5bd97c11a
      size: 3008
      hash: md5
      path: results.csv
  2. --model

    md5: 157f0942c72f7649197061646904bb29
    frozen: true
    deps:
    - repo:
        url: /Users/saugat/Projects/skshetry/dbt-tutorial/
        rev_lock: 76159e6c6c8f9b4af8260a9d2effc41ae920283b
      db:
        file_format: csv
        model: customers
    outs:
    - md5: d0c75044f6fc86297e9449c5bd97c11a
      size: 3008
      hash: md5
      path: customers.csv

@skshetry skshetry marked this pull request as draft October 19, 2023 16:14

codecov bot commented Nov 2, 2023

Codecov Report

Attention: 315 lines in your changes are missing coverage. Please review.

Comparison is base (2b3b15f) 90.67% compared to head (80cb7d9) 89.37%.

Files Patch % Lines
dvc/utils/db.py 0.00% 134 Missing ⚠️
dvc/dependency/db.py 29.44% 127 Missing ⚠️
dvc/repo/imp_db.py 24.32% 28 Missing ⚠️
dvc/utils/packaging.py 0.00% 16 Missing ⚠️
dvc/commands/imp_db.py 85.18% 4 Missing ⚠️
dvc/repo/graph.py 0.00% 2 Missing and 2 partials ⚠️
dvc/repo/index.py 33.33% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10040      +/-   ##
==========================================
- Coverage   90.67%   89.37%   -1.30%     
==========================================
  Files         480      463      -17     
  Lines       36859    34288    -2571     
  Branches     5341     3066    -2275     
==========================================
- Hits        33421    30645    -2776     
- Misses       2825     3093     +268     
+ Partials      613      550      -63     


@skshetry skshetry marked this pull request as ready for review November 7, 2023 10:19
return exporter[export_format](to)


class AbstractDependency(Dependency):
@skshetry (Member, Author) commented:

@iterative/dvc, This is the one that I mentioned in the meeting.

I don't like this, so consider this as a temporary solution. We can discuss ideas about refactoring here or in the coming meeting and recognize potential solutions to this.

But I am also worried that it'd be a major effort (likely encompassing Stage refactoring and the pipelines).

(also see a lot of skips related to Stage.is_db_import below)

@skshetry skshetry marked this pull request as draft November 8, 2023 15:29
@skshetry (Member, Author) commented Nov 9, 2023

@dmpetrov mentioned that he'd use a proxy model: an intermediate model for change detection.

@dberenbaum (Collaborator) commented:

@dmpetrov mentioned that he'd use a proxy model: an intermediate model for change detection.

Can you explain more what you mean? I don't follow.

@skshetry (Member, Author) commented Nov 9, 2023

Can you explain more what you mean? I don't follow.

Dmitry mentioned that users can create an intermediate model that carries metadata for us to detect changes. For example, say we have modelA, and another proxy model that is just select count(*) from modelA, which can be used for change detection. The proxy model can differ depending on the freshness logic they need.

This was an idea that we could implement, if we cannot reliably figure out a change detection mechanism here.
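In other words, the proxy model acts as a cheap fingerprint of the real model. A minimal sketch of the idea, with all names (the query, `run_query`, the stored fingerprint) purely illustrative:

```python
def model_changed(run_query, recorded_fingerprint: str) -> bool:
    """Re-import the real model only when the proxy model's result changes.

    `run_query` stands in for executing the proxy model against the
    warehouse; the recorded fingerprint would come from the .dvc file.
    """
    fingerprint = str(run_query("select count(*) from modelA"))
    return fingerprint != recorded_fingerprint
```

A row count only catches inserts/deletes; a proxy based on e.g. a max timestamp or a checksum column would catch in-place updates too, which is why the choice of proxy model is left to the user's freshness logic.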

@dberenbaum (Collaborator) commented:

I don't think we need change detection to merge this. As a follow-up, if we want it, can we query the max value of the dbt freshness loaded_at field and store that datetime value in .dvc to check for changes?

@dberenbaum (Collaborator) commented:

--export-format [{csv,json}]

Does this sound right to you @skshetry? Not sure we should be using export in the name. Maybe we should just be using --json like we do in other commands? OTOH other commands use --json for stdout format, not output written to a file.

@skshetry (Member, Author) commented:

How about --output-format?

@dberenbaum (Collaborator) commented:

Maybe in the help descriptions, we should specify when terms come from dbt ("dbt profile", "dbt target", etc.)

@dberenbaum (Collaborator) commented:

(you can also pass --profiles-dir)

Did you intend to include it as an option? I don't see it available.

@dberenbaum (Collaborator) left a review:
It works well already IMO. The hardest part will be explaining it. I guess you need to install the appropriate adapter, then set up a profile. Are there any other setup steps? Is there a way to set up and test a profile without creating a full dbt project?

@skshetry (Member, Author) commented:

Is there a way to setup and test a profile without creating a full dbt project?

We'll have to point to dbt's docs or write our own. This is my biggest concern with this PR as well (in addition to dbt introducing backward-incompatible changes, and keeping everything in memory before exporting).

Regarding testing a profile, you can use dvc import-db --sql 'select 1 as id' --profile profile (or, you can try running any query and dvc will try to check connection itself and give you an error).

@skshetry (Member, Author) commented:

I have renamed the export_format field in .dvc to file_format. The CLI option is named --output-format. It might look odd for the deps field to say output-format; ideally it would be part of the stage itself, outside outs/deps, but that is more difficult to pass in dvc.

I have also changed the config from db.{profile,target} to feature.db_{profile,target} so this PR can be merged as an experimental feature. The help message for import-db is also hidden, and a warning message is shown on import-db usage. I don't think a separate config is necessary here to enable it.

I plan to merge this PR, and create a wiki guide. We can improve CLI/workflow on successive PRs.

@skshetry skshetry marked this pull request as ready for review November 21, 2023 13:00
@skshetry skshetry added skip-changelog Skips changelog A: data-management Related to dvc add/checkout/commit/move/remove labels Nov 21, 2023
@skshetry skshetry self-assigned this Nov 21, 2023
@skshetry skshetry merged commit fd7c891 into iterative:main Nov 21, 2023
@skshetry skshetry deleted the import-db branch November 21, 2023 13:53
@skshetry (Member, Author) commented:

(you can also pass --profiles-dir)

Did you intend to include it as an option? I don't see it available.

I have updated the description. We can implement --profiles-dir later if someone asks for it.

@shcheklein (Member) commented:

Hey folks, do we have docs update for this? What is the plan for releasing this?

@skshetry (Member, Author) commented:

No plans for docs yet. I'll add a guide in the wiki today. We want to release after the dvcx integration.
