
implement import-db command #10040

Merged: 21 commits into iterative:main, Nov 21, 2023
Conversation

@skshetry (Member) commented Oct 19, 2023


(asciicast demo recording)

$ dvc import-db --help
usage: dvc import-db [-h] [-q | -v] [--url URL] [--rev [<commit>]] [--project-dir [PROJECT_DIR]] [--sql SQL | --model MODEL] [--profile PROFILE] [--target TARGET] [--export-format [{csv,json}]] [-o [<path>]] [-f]

Download file or directory tracked by DVC or by Git into the workspace, and track it.
Documentation: <https://man.dvc.org/import>

options:
  -h, --help            show this help message and exit
  -q, --quiet           Be quiet.
  -v, --verbose         Be verbose.
  --url URL             Location of DVC or Git repository to download from
  --rev [<commit>]      Git revision (e.g. SHA, branch, tag)
  --project-dir [PROJECT_DIR]
                        Subdirectory to the dbt project location
  --sql SQL             SQL query
  --model MODEL         Model name to download
  --profile PROFILE     Profile to use
  --target TARGET       Target to use
  --export-format [{csv,json}]
                        Export format
  -o [<path>], --out [<path>]
                        Destination path to download files to
  -f, --force           Override destination file or folder if exists.

About dbt

dbt parses the dbt_project.yml file, which is how it knows a directory is a dbt project.
It is usually at the root of the repo, but in principle it can be anywhere. dbt reads this file and obtains the profile name.

# Example dbt_project.yml file
name: 'jaffle_shop'
profile: 'jaffle_shop'
...

Then, dbt looks for the profiles.yml file: first in the DBT_PROFILES_DIR envvar (if set), then in the current working directory, and then in the ~/.dbt directory. profiles.yml holds connection details for different warehouses; dbt looks for a matching profile name in that file.

A profile can have multiple targets, and each target specifies separate credentials/connections to a warehouse. This is usually used "to encourage the use of separate development and production environments". There is always a default target, which you can change with --target (likewise, --profile overrides the profile).

# example profiles.yml file
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      user: alice
      password: <password>
      port: 5432
      dbname: jaffle_shop
      schema: dbt_alice
      threads: 4

See https://docs.getdbt.com/docs/core/connect-data-platform/connection-profiles.

dbt connects to various data platforms through adapters, which we are trying to leverage to avoid handling AuthN/AuthZ/RBAC ourselves.

Implementation

The current implementation uses the dbt show command to download results to a file. Although dbt provides an API to call commands programmatically, its return type is not guaranteed to be stable and is subject to change. This PR uses the result from the command, which includes an agate.Table that is converted to csv/json format. There is no streaming support, so everything is kept in memory until it is exported, but the performance is good enough for the ~1M-rows range. Using dbt show, a very high-level API, may give us the benefit of dbt's own performance optimizations and any cost-effective paths its adapters offer for select queries (some data warehouses have alternative means to fetch results cheaply), and also lets us support dbt's select syntax.

A sql query passed via --sql can be run without a dbt project; in that case the adapter is used directly (although dvc still tries to find a dbt project at the root of the dvc repo and uses it if present; this behaviour is subject to change).
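To make the in-memory export concrete, here is a stdlib-only sketch of the export step described above: the full query result is held in memory (as dbt's agate.Table is) and only then serialized to csv or json. The function and parameter names are illustrative, not dvc's actual internals.

```python
import csv
import io
import json


def export_rows(columns, rows, fmt="csv"):
    """Serialize an in-memory result set to csv or json (no streaming)."""
    buf = io.StringIO()
    if fmt == "csv":
        writer = csv.writer(buf)
        writer.writerow(columns)   # header row
        writer.writerows(rows)     # all rows are already in memory
    elif fmt == "json":
        # one object per row, keyed by column name
        json.dump([dict(zip(columns, row)) for row in rows], buf)
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return buf.getvalue()
```

Because everything passes through a single in-memory buffer, memory usage grows linearly with the result size, which is why the description above caps the practical range at roughly 1M rows.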

import-db command

The command has two different modes:

  1. --sql
    With --sql, you can run a raw sql query and have it exported out to a file.

    dvc import-db --sql 'select * from table'

    This will save to a file named results.csv and create a results.csv.dvc file.
If you are inside a dbt project, dvc will read dbt_project.yml and try to figure out the connection profile. Otherwise, you will need to pass --profile/--target or set them in the config. There is --export-format={csv,json} support, and you can change the output path with --out (the default is "results.{export_format}").

  2. --model
    With --model, you are exporting the dbt models. If you are inside a dbt project (which also happens to be a dvc project), you can just do:

    dvc import-db --model model_name

    If you need to import from an external dbt project, you can pass --url (and, optionally, --rev). If the dbt project is inside a subdirectory instead, you can pass --project-dir as well. Similarly, you can provide --profile/--target to override the connection profile to use for this operation.

    dvc import-db --url <url> --rev <rev>

dbt has a notion of packages, where you can use models from another package in your project. A dbt model can have versions too. These are not exposed yet, but are supported internally in the implementation.

--output-format is csv by default, but can be changed to json. --out can be used to change the output path (by default, it's "{model}.{export_format}").

Config

db_profile and db_target can be set in the config so that they apply globally. This can be used, for example, in CI to switch to a production environment. They won't appear in the .dvc file.

dvc config feature.db_profile profile
dvc config feature.db_target target

(On stabilization, we'll rename this config to db.profile and db.target respectively.)

profiles

This implementation modifies the way profiles.yml is discovered. DVC will first try to find profiles.yml in DBT_PROFILES_DIR (if set), then check the root of the repo, and then the ~/.dbt directory. This is almost the same as dbt, except that the second place dbt looks is the current working directory (which for dvc happens to be the root of the repo).

The root of the repo is dvc's root_dir, except for --url, where it's the root of the git repo. With --project-dir, it's the root of that dbt project.
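The lookup order above can be sketched as a small function. This is illustrative only; the function name and return convention are hypothetical, not dvc's actual code.

```python
import os
from pathlib import Path


def find_profiles_dir(root_dir: str, env=os.environ) -> "Path | None":
    """Return the first directory containing profiles.yml, per the order above."""
    candidates = []
    if env.get("DBT_PROFILES_DIR"):          # 1. explicit env override
        candidates.append(Path(env["DBT_PROFILES_DIR"]))
    candidates.append(Path(root_dir))        # 2. repo root (git root for --url)
    candidates.append(Path.home() / ".dbt")  # 3. dbt's global fallback
    for directory in candidates:
        if (directory / "profiles.yml").is_file():
            return directory
    return None
```

The only behavioral difference from stock dbt is step 2: dvc pins the lookup to the repo root rather than whatever the current working directory happens to be.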

.dvc changes

  1. --sql

    md5: 0f6fef5da8bde7f8db2a4d08da535dd7
    frozen: true
    deps:
    - db:
        file_format: csv
        profile: jaffle_shop
        query: select * from customers
    outs:
    - md5: d0c75044f6fc86297e9449c5bd97c11a
      size: 3008
      hash: md5
      path: results.csv
  2. --model

    md5: 157f0942c72f7649197061646904bb29
    frozen: true
    deps:
    - repo:
        url: /Users/saugat/Projects/skshetry/dbt-tutorial/
        rev_lock: 76159e6c6c8f9b4af8260a9d2effc41ae920283b
      db:
        file_format: csv
        model: customers
    outs:
    - md5: d0c75044f6fc86297e9449c5bd97c11a
      size: 3008
      hash: md5
      path: customers.csv

@skshetry skshetry marked this pull request as draft October 19, 2023 16:14

codecov bot commented Nov 2, 2023

Codecov Report

Attention: 315 lines in your changes are missing coverage. Please review.

Comparison is base (2b3b15f) 90.67% compared to head (80cb7d9) 89.37%.

Files Patch % Lines
dvc/utils/db.py 0.00% 134 Missing ⚠️
dvc/dependency/db.py 29.44% 127 Missing ⚠️
dvc/repo/imp_db.py 24.32% 28 Missing ⚠️
dvc/utils/packaging.py 0.00% 16 Missing ⚠️
dvc/commands/imp_db.py 85.18% 4 Missing ⚠️
dvc/repo/graph.py 0.00% 2 Missing and 2 partials ⚠️
dvc/repo/index.py 33.33% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10040      +/-   ##
==========================================
- Coverage   90.67%   89.37%   -1.30%     
==========================================
  Files         480      463      -17     
  Lines       36859    34288    -2571     
  Branches     5341     3066    -2275     
==========================================
- Hits        33421    30645    -2776     
- Misses       2825     3093     +268     
+ Partials      613      550      -63     


@skshetry skshetry marked this pull request as ready for review November 7, 2023 10:19
return exporter[export_format](to)


class AbstractDependency(Dependency):
@skshetry (Member, Author) commented:

@iterative/dvc, This is the one that I mentioned in the meeting.

I don't like this, so consider this as a temporary solution. We can discuss ideas about refactoring here or in the coming meeting and recognize potential solutions to this.

But I am also worried that it'd be a major effort (likely encompassing Stage refactoring and the pipelines).

(also see a lot of skips related to Stage.is_db_import below)

@skshetry skshetry marked this pull request as draft November 8, 2023 15:29
@skshetry (Member, Author) commented Nov 9, 2023

@dmpetrov mentioned that he'd use a proxy model: an intermediate model for change detection.

@dberenbaum (Collaborator) commented:

@dmpetrov mentioned that he'd use a proxy model: an intermediate model for change detection.

Can you explain more what you mean? I don't follow.

@skshetry (Member, Author) commented Nov 9, 2023

Can you explain more what you mean? I don't follow.

Dmitry mentioned that users can create an intermediate model that carries metadata for us to detect changes. For example, say we have modelA, and another proxy model that is just select count(*) from modelA, which can be used for change detection. The proxy model can differ depending on the freshness logic they need.

This was an idea that we could implement, if we cannot reliably figure out a change detection mechanism here.
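In other words, the proxy model acts as a cheap fingerprint of the real model. A minimal sketch of the idea, with all names (the query, `run_query`, the stored fingerprint) purely illustrative:

```python
def model_changed(run_query, recorded_fingerprint: str) -> bool:
    """Re-import the real model only when the proxy model's result changes.

    `run_query` stands in for executing the proxy model against the
    warehouse; the recorded fingerprint would come from the .dvc file.
    """
    fingerprint = str(run_query("select count(*) from modelA"))
    return fingerprint != recorded_fingerprint
```

A row count only catches inserts/deletes; a proxy based on e.g. a max timestamp or a checksum column would catch in-place updates too, which is why the choice of proxy model is left to the user's freshness logic.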

@dberenbaum (Collaborator) commented:

I don't think we need change detection to merge this. As a follow-up, if we want it, can we query the max value of the dbt freshness loaded_at field and store that datetime value in .dvc to check for changes?

@dberenbaum (Collaborator) commented:

--export-format [{csv,json}]

Does this sound right to you @skshetry? Not sure we should be using export in the name. Maybe we should just be using --json like we do in other commands? OTOH other commands use --json for stdout format, not output written to a file.

@skshetry (Member, Author) commented:

How about --output-format?

@dberenbaum (Collaborator) commented:

Maybe in the help descriptions, we should specify when terms come from dbt ("dbt profile", "dbt target", etc.)

@dberenbaum (Collaborator) commented:

(you can also pass --profiles-dir)

Did you intend to include it as an option? I don't see it available.

@dberenbaum (Collaborator) left a review:
It works well already IMO. The hardest part will be explaining it. I guess you need to install the appropriate adapter, then set up a profile. Are there any other setup steps? Is there a way to set up and test a profile without creating a full dbt project?

@skshetry (Member, Author) commented:

Is there a way to setup and test a profile without creating a full dbt project?

We'll have to point to dbt's docs or write our own. This is my biggest concern with this PR as well (in addition to dbt introducing backward-incompatible changes, and keeping everything in memory before exporting).

Regarding testing a profile, you can use dvc import-db --sql 'select 1 as id' --profile profile (or, you can try running any query and dvc will try to check connection itself and give you an error).

@skshetry (Member, Author) commented:

I have renamed the export_format field in .dvc to file_format. The CLI option is named --output-format. It might look odd for the deps field to say output-format; ideally it would be part of the stage itself, outside outs/deps, but that is more difficult to pass in dvc.

I have also changed the config from db.{profile,target} to feature.db_{profile,target} so this PR can be merged as an experimental feature. The help message for import-db is also hidden, and a warning message is shown on import-db usage. I don't think a separate config is necessary here to enable it.

I plan to merge this PR, and create a wiki guide. We can improve CLI/workflow on successive PRs.

@skshetry skshetry marked this pull request as ready for review November 21, 2023 13:00
@skshetry skshetry added skip-changelog Skips changelog A: data-management Related to dvc add/checkout/commit/move/remove labels Nov 21, 2023
@skshetry skshetry self-assigned this Nov 21, 2023
@skshetry skshetry merged commit fd7c891 into iterative:main Nov 21, 2023
@skshetry skshetry deleted the import-db branch November 21, 2023 13:53
@skshetry (Member, Author) commented:

(you can also pass --profiles-dir)

Did you intend to include it as an option? I don't see it available.

I have updated the description. We can implement --profiles-dir later if someone asks for it.

@shcheklein (Member) commented:

Hey folks, do we have docs update for this? What is the plan for releasing this?

@skshetry (Member, Author) commented:

No plans for docs yet. I'll add a guide in the wiki today. We want to release after the dvcx integration.
