implement import-db command #10040
Conversation
Codecov Report

```
@@           Coverage Diff            @@
##             main   #10040      +/-   ##
==========================================
- Coverage   90.67%   89.37%   -1.30%
==========================================
  Files         480      463      -17
  Lines       36859    34288    -2571
  Branches     5341     3066    -2275
==========================================
- Hits        33421    30645    -2776
- Misses       2825     3093     +268
+ Partials      613      550      -63
```
```python
return exporter[export_format](to)
```

```python
class AbstractDependency(Dependency):
```
@iterative/dvc, this is the one that I mentioned in the meeting.
I don't like this, so consider it a temporary solution. We can discuss refactoring ideas here or in the coming meeting and identify potential solutions. But I am also worried that it'd be a major effort (likely encompassing Stage refactoring and the pipelines).
(Also see the many skips related to `Stage.is_db_import` below.)
@dmpetrov mentioned that he'd use …
Can you explain more what you mean? I don't follow.
Dmitry mentioned that users can create an intermediate model that has metadata for us to detect changes. This was an idea that we could implement if we cannot reliably figure out a change detection mechanism here.
I don't think we need change detection to merge this. As a follow-up, if we want it, can we query the max value of the dbt freshness `loaded_at` field and store that datetime value in …?
Does this sound right to you @skshetry? Not sure we should be using …
How about …?
Maybe in the help descriptions, we should specify when terms come from dbt ("dbt profile", "dbt target", etc.).
Did you intend to include it as an option? I don't see it available.
It works well already IMO. The hardest part will be explaining it. I guess you need to install the appropriate adapter, then set up a profile. Are there any other setup steps? Is there a way to set up and test a profile without creating a full dbt project?
We'll have to point to dbt's docs or write our own. This is my biggest concern with this PR as well (in addition to dbt introducing backward-incompatible changes, and keeping everything in memory before exporting). Regarding testing a profile, you can use …
I have renamed the …. I have also changed the config from …. I plan to merge this PR and create a wiki guide. We can improve the CLI/workflow in successive PRs.
I have updated the description. We can implement …
Hey folks, do we have a docs update for this? What is the plan for releasing this?
No plans for docs yet. I'll add a guide in the wiki today. We want to release after the dvcx integration.
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏
About dbt

dbt parses the `dbt_project.yml` file, which is how it knows a directory is a dbt project. The file is usually at the root of the repo but, in principle, it can be anywhere. dbt reads this file and obtains the `profile` name.
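For reference, a minimal `dbt_project.yml` looks roughly like this (the project and profile names below are placeholders):

```yaml
# Minimal sketch of a dbt_project.yml; the names are made up.
name: jaffle_shop
version: "1.0.0"
profile: jaffle_shop  # dbt looks this name up in profiles.yml
```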
Then, dbt checks the `profiles.yml` file: first in the `DBT_PROFILES_DIR` envvar (if set), then in the current working directory, and then in the `~/.dbt` directory. `profiles.yml` has connection details for different warehouses; dbt will look for the matching `profile` name in the file.

A profile can have multiple targets, and each target specifies separate credentials/connections to the warehouses. This is usually used "to encourage the use of separate development and production environments". There is always a default target, and you can use `--target` to change it (there is also `--profile` that you can change). See https://docs.getdbt.com/docs/core/connect-data-platform/connection-profiles.
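For illustration, a `profiles.yml` defining one profile with two targets might look like this (all names and credentials below are placeholders):

```yaml
jaffle_shop:            # must match the `profile` name in dbt_project.yml
  target: dev           # the default target
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dev_user
      password: "{{ env_var('DBT_PASSWORD') }}"
      dbname: analytics
      schema: dbt_dev
      threads: 4
    prod:
      type: postgres
      host: prod-db.example.com
      port: 5432
      user: prod_user
      password: "{{ env_var('DBT_PASSWORD') }}"
      dbname: analytics
      schema: dbt_prod
      threads: 4
```

Passing `--target prod` would then switch to the production connection without any project changes.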
dbt connects with various data platforms through `adapters`, which we are trying to exploit to avoid AuthN/AuthZ/RBAC handling.

Implementation
The current implementation uses the `dbt show` command to download to a file. Although dbt provides an API to call commands programmatically, its return type is not guaranteed to be stable and is subject to change. This PR uses the result from the command, which includes an `agate.Table` that is used to convert to csv/json format. There is no streaming support, though, so it will keep everything in memory until it is exported; the performance is good enough for the ~1M rows range. Using `dbt show` (a very high-level API) might give us the benefit of performance optimizations in dbt and any cost-effective solutions they offer through their adapters for `select` queries (some data warehouses have alternative means to fetch results cheaply), and it also supports dbt's select syntax.

Running a SQL query with `--sql` can be done without a dbt project, in which case the `adapter` is used directly (although it tries to find a dbt project at the root of the dvc repo and use it if found; this behaviour is subject to change).
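For context, the programmatic invocation mentioned above looks roughly like this; a sketch assuming dbt-core >= 1.5, and the result types accessed below are exactly the part that is not guaranteed stable:

```python
# Sketch only: calling `dbt show` through dbt's Python entrypoint.
# Assumes dbt-core >= 1.5; the result types here are subject to change.
from dbt.cli.main import dbtRunner

res = dbtRunner().invoke(["show", "--inline", "select * from some_table"])
if res.success:
    # each node result carries an agate.Table, which agate can serialize;
    # note the whole table is materialized in memory first
    table = res.result.results[0].agate_table
    table.to_csv("results.csv")
```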
import-db command

The command has two different modes: `--sql` and `--model`.

--sql

With `--sql`, you can run a raw SQL query and have it exported to a file:

dvc import-db --sql 'select * from table'
This will save the result to a file named `results.csv` and create a `results.csv.dvc` file.

If you are inside a dbt project, it will read `dbt_project.yml` and try to figure out the connection profile. If not, you will need to pass `--profile`/`--target` or set them in the config. There is `--export-format={csv,json}` support, and you can change the output path with `--out` (the default output is "results.{export_format}").
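For example, combining the flags above (the query and output path are made up):

```sh
# export a raw query result as JSON to a custom path
dvc import-db --sql 'select id, name from users' --export-format json --out users.json
```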
--model

With `--model`, you are exporting dbt models. If you are inside a dbt project (which also happens to be a dvc project), you can just do:

dvc import-db --model model_name
If you need to import from an external dbt project, you can pass `--url` (and, optionally, `--rev`). If the dbt project is inside a subdirectory, you can pass `--project-dir` as well. Similarly, you can provide `--profile`/`--target` to override the connection profile used for this operation:

dvc import-db --url <url> --rev <rev>
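For example, importing a model from an external dbt project that lives in a subdirectory, with an explicit connection override (the URL and names are placeholders):

```sh
dvc import-db --model orders \
    --url https://github.com/example/dbt-project --rev main \
    --project-dir dbt \
    --profile jaffle_shop --target prod
```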
dbt has a notion of packages, where you can use models in your project from another package. A dbt model can have versions too. These are not exposed yet, but are supported internally in the implementation.

`--export-format` is `csv` by default, but can be changed to `json`. `--out` can be used to change the output path (by default, it's "{model}.{export_format}").

Config
`db_profile` and `db_target` can be set in the config so that they apply globally. This can be used, for example, in CI to switch to a production environment. They won't appear in the `.dvc` file.

(On stabilization, we'll rename these config options to `db.profile` and `db.target`, respectively.)
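As a sketch of how this might be set (the config section these options live under is an assumption here, not confirmed by this PR):

```sh
# Assumption: the options are shown under `core`; the actual section may differ.
dvc config core.db_profile jaffle_shop
dvc config core.db_target prod
```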
respectively.)profiles
This implementation modifies the way `profiles.yml` is discovered. DVC will try to find `profiles.yml` in `DBT_PROFILES_DIR` first (if set), then check the root of the repo, and then the `~/.dbt` directory. This is almost the same as dbt, except that the second place dbt looks is the current working directory (which for dvc happens to be the root of the repo).

The root of the repo is dvc's `root_dir`, except for `--url`, where it's the root of the git repo. In the case of `--project-dir`, it'll be the root of that dbt project.
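The discovery order described above amounts to something like the following (illustrative pseudocode, not DVC's actual code):

```python
import os

def find_profiles_dir(repo_root: str) -> str:
    """Resolve where profiles.yml is read from, per the order above."""
    # 1. the DBT_PROFILES_DIR environment variable, if set
    env_dir = os.environ.get("DBT_PROFILES_DIR")
    if env_dir:
        return env_dir
    # 2. the root of the repo (dvc's root_dir, or the git root for --url)
    if os.path.isfile(os.path.join(repo_root, "profiles.yml")):
        return repo_root
    # 3. fall back to the ~/.dbt directory
    return os.path.expanduser("~/.dbt")
```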
.dvc changes

--sql

--model
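The generated `.dvc` file contents for the two modes are not reproduced here. Purely as a hypothetical sketch (the field names below are invented for illustration, not the actual schema), an `--sql` import might be recorded along these lines:

```yaml
# Hypothetical sketch — not the actual .dvc schema written by import-db.
deps:
- db:                             # a database-backed dependency
    query: select * from table    # --model mode would record the model name instead
    export_format: csv
outs:
- path: results.csv
```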