cmd-reference: document import-db
#5033
Merged
Changes from 10 commits
Commits (11):
dddfc66 import-db: add documentation (skshetry)
320ff00 import-db: document --output-format (skshetry)
357d3c0 remove dbt information (skshetry)
ee19fa1 remove dbt config (skshetry)
655ffa4 remove Examples intro (skshetry)
e8374c7 make example db connection name consistent (skshetry)
ab36f61 Fix internal links (skshetry)
387d80c update (skshetry)
d7265fc add link to sqlachemy (skshetry)
a4a54c6 fix review suggestions (skshetry)
161a83b use lowercase metavar (skshetry)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,178 @@ | ||
# import-db | ||
|
||
<admon type="warn" title="Experimental"> | ||
|
||
This is an experimental command, and is subject to breaking changes. | ||
|
||
</admon> | ||
|
||
Snapshot a table or the results of a SQL query from a database into CSV or JSON format. | ||
|
||
```usage | ||
usage: dvc import-db [-h] [-q | -v] | ||
[--sql SQL | --table TABLE] [--conn CONN] | ||
[--output-format [{csv,json}]] [-o [<path>]] [-f] | ||
``` | ||
|
||
## Description | ||
|
||
With `import-db`, you can snapshot your ETL/database to a file to use in your | ||
data pipelines. This command supports importing a table or the results of a SQL | ||
query into different file formats. To do so, you have to set a connection string | ||
for the database, which can be set up in the config as `db.<name>`. | ||
Check [Database Connections](#database-connections) for more information. | ||
|
||
At the moment, `import-db` supports two output formats: | ||
|
||
- JSON records | ||
- CSV (with header, and no index) | ||
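To make the two formats concrete, here is a minimal sketch using only the Python standard library. It is not DVC's implementation: the in-memory SQLite table and its rows are made up for illustration, and it assumes "JSON records" means a list with one object per row, keyed by column name.

```python
import csv
import io
import json
import sqlite3

# Hypothetical in-memory table standing in for a real database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Lin")])

cursor = conn.execute("SELECT id, name FROM customers ORDER BY id")
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()

# CSV: one header row with the column names, then the data rows, no index column.
csv_buffer = io.StringIO()
writer = csv.writer(csv_buffer, lineterminator="\n")
writer.writerow(columns)
writer.writerows(rows)
csv_text = csv_buffer.getvalue()

# JSON records (assumed shape): a list with one object per row.
json_text = json.dumps([dict(zip(columns, row)) for row in rows])

print(csv_text)
print(json_text)
```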
|
||
An _import `.dvc` file_ is created in the same location e.g. | ||
`customers.txt.dvc`. This makes it possible to update the import later, if the | ||
data source has changed (see `dvc update`). | ||
|
||
<admon type="info"> | ||
|
||
You can `dvc push` and `dvc pull` data imported from databases to/from | ||
remote storage normally. | ||
|
||
</admon> | ||
|
||
## Database Connections | ||
|
||
To connect to a database, DVC needs a database connection string URI. This has | ||
to be configured in the [`db`] section. | ||
|
||
```dvc | ||
$ dvc config db.pgsql.url postgresql://user@hostname:port/database | ||
$ dvc config --local db.pgsql.password password | ||
``` | ||
|
||
<admon type="warn" title="Security Alert"> | ||
|
||
Configure `password` with the `--local` option so that it is written to a | ||
Git-ignored config file. | ||
|
||
</admon> | ||
|
||
<admon type="warn" title="Security Alert"> | ||
|
||
Use a user account with read-only privileges and limited access to databases, | ||
as `--sql` can run arbitrary queries. Different databases have different | ||
approaches to this. Refer to their documentation for more details. | ||
|
||
</admon> | ||
|
||
When using `import-db`, you need to specify the name of the database connection | ||
to use. | ||
|
||
```dvc | ||
$ dvc import-db --table customers_table --conn pgsql | ||
``` | ||
|
||
In addition to a connection string, DVC needs a driver to connect to the | ||
database. Check [Installing database drivers](#installing-database-drivers) for | ||
the connection string format and the necessary driver for your specific database. | ||
|
||
[`db`]: /doc/user-guide/project-structure/configuration#db | ||
|
||
## Installing database drivers | ||
|
||
DVC does not come preinstalled with drivers for all databases; you’ll need | ||
to install the required packages for the database you want to use. | ||
|
||
Some of the recommended packages are shown below, with their expected connection | ||
strings: | ||
|
||
| **Database** | **PyPI package** | **Connection String** | | ||
| ------------------- | --------------------------------- | --------------------------------------------------------------------------------------------------- | | ||
| **Amazon Redshift** | `sqlalchemy-redshift` | `redshift+psycopg2://{username}:{password}@{aws_endpoint}:5439/{database_name}` | | ||
| **BigQuery** | `sqlalchemy-bigquery` | `bigquery://{project_id}` | | ||
| **Databricks** | `databricks-sql-connector` | `databricks://token:{token}@{hostname}:{port}/{database}?http_path={http_path}` | | ||
| **MySQL** | `mysqlclient` | `mysql://{username}:{password}@{hostname}/{database_name}` | | ||
| **Oracle** | `cx_Oracle` | `oracle://{username}:{password}@{hostname}/{database_name}` | | ||
| **PostgreSQL** | `psycopg2` | `postgresql://{username}:{password}@{hostname}/{database_name}` | | ||
| **Snowflake** | `snowflake-sqlalchemy` | `snowflake://{user}:{password}@{account}.{region}/{database}?role={role}&warehouse={warehouse}` | | ||
| **SQLite** | - | `sqlite:///path/to/file.db` | | ||
| **SQL Server** | `pyodbc` | `mssql+pyodbc://{username}:{password}@{hostname}:{port}/{database_name}` | | ||
| **Trino** | `trino` | `trino://{username}:{password}@{hostname}:{port}/{catalog}` | | ||
|
||
DVC uses [`sqlalchemy`](https://www.sqlalchemy.org/) internally, so DVC should | ||
support any SQL database that provides a dialect for SQLAlchemy. Refer to the | ||
[SQLAlchemy documentation](https://docs.sqlalchemy.org/en/20/core/engines.html#backend-specific-urls) | ||
for more details. | ||
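The connection strings in the table above share a common URL shape, `dialect[+driver]://user:password@host:port/database`. As a rough sketch of how the parts fit together, the following uses only the standard library `urllib.parse`. The helper name `describe_connection_url` and the port `5432` are assumptions made for illustration, and this is not how DVC or SQLAlchemy parse URLs internally.

```python
from urllib.parse import urlsplit


def describe_connection_url(url: str) -> dict:
    """Split a connection URL into its named parts (illustrative helper)."""
    parts = urlsplit(url)
    # The scheme is "dialect" or "dialect+driver", e.g. "mssql+pyodbc".
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver or None,
        "username": parts.username,
        "hostname": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/") or None,
    }


info = describe_connection_url("postgresql://user@hostname:5432/database")
print(info)
```

A URL such as `mssql+pyodbc://...` would report `pyodbc` as the driver, which is the package to install for that row of the table.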
|
||
## Options | ||
|
||
- `-o <path>`, `--out <path>` - specify a `path` to the desired location in the | ||
workspace to place the file. If not specified, the filename is generated | ||
from the `--table` and `--output-format` arguments; for `--sql`, it | ||
starts with "results" by default. | ||
|
||
- `--table <table>` - table to snapshot. | ||
|
||
- `--sql <query>` - execute SQL query and snapshot its result. | ||
|
||
- `--output-format` - type of format to materialize into. `csv` (default) and | ||
`json` are supported. | ||
|
||
- `--conn <connection>` - name of the database connection to use. The connection | ||
has to be set in the | ||
[config](/doc/user-guide/project-structure/configuration#db). | ||
|
||
- `-f`, `--force` - when using `--out` to specify a local target file or | ||
directory, the operation will fail if those paths already exist. This flag | ||
will force the operation, causing local files/dirs to be overwritten by the | ||
command. | ||
|
||
- `-h`, `--help` - prints the usage/help message, and exits. | ||
|
||
- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no | ||
problems arise, otherwise 1. | ||
|
||
- `-v`, `--verbose` - displays detailed tracing information. | ||
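The default filename rule described for `-o`/`--out` can be sketched as a small function. This is a hypothetical helper written from the description above, not DVC's code. The name `default_output_name` is an assumption, as is the idea that the `--sql` default is exactly `results.<format>` (the text only says it starts with "results").

```python
from typing import Optional


def default_output_name(table: Optional[str], output_format: str = "csv") -> str:
    """Hypothetical helper mirroring the documented default naming rule."""
    # With --table, the file is named after the table;
    # with --sql (no table), the name starts with "results".
    stem = table if table else "results"
    return f"{stem}.{output_format}"


print(default_output_name("customers_table"))
print(default_output_name(None))
```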
|
||
## Examples | ||
|
||
### Downloading a table | ||
|
||
To import a table from a database using a configured `db` connection: | ||
|
||
```dvc | ||
$ dvc import-db --table "customers_table" --conn pgsql | ||
... | ||
``` | ||
|
||
`dvc import-db` will snapshot the complete table and save it to a file named | ||
`customers_table.csv`. It will also create a `customers_table.csv.dvc` file with | ||
the following contents: | ||
|
||
```yaml | ||
md5: ddd4654188815dcae6ce4d4a37f83bde | ||
frozen: true | ||
deps: | ||
- db: | ||
file_format: csv | ||
connection: pgsql | ||
table: customers_table | ||
outs: | ||
- md5: 131543a828b297ce0a5925800bd88810 | ||
size: 15084226 | ||
hash: md5 | ||
path: customers_table.csv | ||
``` | ||
|
||
You can use `dvc update` to update the snapshot. | ||
|
||
### Downloading SQL query result | ||
|
||
Similarly, you can also snapshot a SQL query result as follows: | ||
|
||
```dvc | ||
$ dvc import-db --sql "select * from customers" --conn pgsql | ||
... | ||
``` | ||
|
||
`dvc import-db` will snapshot the query results and save them to a file named | ||
`results.csv`. Similarly, it will also create a `results.csv.dvc` file, which | ||
can be used with `dvc update` later. |
Ideally, this should be:
But auto-linking seems to be broken with this.

I think we should also change it in DVC itself. We don't use this style, I mean all-uppercase `CONN` or `TABLE`. Please check other commands.
If `(--sql SQL | --table TABLE)` is correct (why?), we should fix it on the DVC side first and adjust the auto-linker here; I hope it's not that complicated. But let's check whether we do this in some other DVC commands.

They are mutually exclusive arguments, and one of them is required. I can change the casing of `TABLE`/`SQL` with `metavar` (the uppercased name is just the default, and we do that in many places when we are lazy).

I have changed this in the docs. Will handle this in DVC separately.

Are there examples? I checked a few commands and didn't find any.
Anyway, I think we need to update it on the DVC side for all the commands.
Good point, then let's change it on the DVC side first and we can fix the website later ... DVC is more important, I think.

See `[-m MESSAGE]` and `[-c COPY_PATHS]`. It's just the default from argparse, so it's easy to miss here. There are other commands like `du`/`artifacts get`/`exp save`/`fetch`/`ls-url`/`ls`/`get`/`import`, etc. that do this. :)

Fixed this in dvc by iterative/dvc#10226.
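
For readers following the thread, the argparse behavior under discussion can be reproduced with a small standalone sketch. It is not DVC's actual parser: the `prog` string and the `<query>`/`<table>` metavars are chosen here to match the docs, and `--conn` is deliberately left without a `metavar` to show the uppercased default.

```python
import argparse

parser = argparse.ArgumentParser(prog="dvc import-db")

# A required mutually exclusive group: exactly one of --sql/--table must be given.
source = parser.add_mutually_exclusive_group(required=True)
source.add_argument("--sql", metavar="<query>")
source.add_argument("--table", metavar="<table>")

# No metavar given: argparse falls back to the uppercased dest, "CONN".
parser.add_argument("--conn")

usage = parser.format_usage()
print(usage)

args = parser.parse_args(["--table", "customers_table", "--conn", "pgsql"])
print(args)
```

In the printed usage, the required group appears in parentheses rather than square brackets, and only `--conn` keeps the uppercase placeholder.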