cmd-reference: document import-db #5033

Merged: 11 commits, Jan 11, 2024
Changes from 10 commits
178 changes: 178 additions & 0 deletions content/docs/command-reference/import-db.md
@@ -0,0 +1,178 @@
# import-db

<admon type="warn" title="Experimental">

This is an experimental command, and is subject to breaking changes.

</admon>

Snapshot a table or the results of a SQL query from a database into CSV/JSON
format.

```usage
usage: dvc import-db [-h] [-q | -v]
                     [--sql SQL | --table TABLE] [--conn CONN]
                     [--output-format [{csv,json}]] [-o [<path>]] [-f]
```

Review thread on the `[--sql SQL | --table TABLE] [--conn CONN]` line:

**Member Author:**

Ideally, this should be:

```diff
- [--sql SQL | --table TABLE] [--conn CONN]
+ (--sql SQL | --table TABLE) --conn CONN
```

But auto-linking seems to be broken with this.

**Member:**

I think we should change it in DVC itself also. We don't use this style (all
uppercase `CONN` or `TABLE`); please check other commands.

If `(--sql SQL | --table TABLE)` is correct (why?), we should also fix it first
on the DVC side and adjust the auto-linker here; I hope it's not that
complicated. But let's check if we do this in some other DVC commands.

**Member Author:**

They are mutually-exclusive arguments, and one of them is required. I can
change the casing of TABLE/SQL with `metavar` (the uppercased name is just the
argparse default, and we do that in many places when we are lazy).

**Member Author:**

I have changed this in the docs. Will handle this in DVC separately.

**Member:**

> and we do that in many places when we are lazy

Are there examples? I checked a few commands and didn't find it. Anyway, I
think we need to update it on the DVC side for all the commands.

> They are mutually-exclusive arguments, and one of them is required.

Good point; then let's change it on the DVC side first and we can fix the
website later. DVC is more important, I think.

**Member Author:**

```
usage: dvc experiments run [-h] [-q | -v] [-f] [-i] [-s] [-p] [-P] [-R] [--downstream] [--force-downstream] [--pull]
                           [--allow-missing] [--dry] [-k] [--ignore-errors] [-n <name>] [-S [<filename>:]<param_name>=<param_value>]
                           [--queue] [--run-all] [-j <number>] [--temp] [-C COPY_PATHS] [-m MESSAGE]
                           [targets ...]
```

See `[-m MESSAGE]` and `[-C COPY_PATHS]`. It's just the default from argparse,
so it's easy to miss here. There are other commands like `du`, `artifacts get`,
`exp save`, `fetch`, `ls-url`, `ls`, `get`, `import`, etc. that do this. :)

**Member Author:**

Fixed this in DVC by iterative/dvc#10226.

## Description

With `import-db`, you can snapshot a database table or SQL query result to a
file to use in your data pipelines. This command supports importing a table or
the results of a SQL query into different file formats. To do so, you have to
set a connection string for the database, which can be set up in the config as
`db.<name>`. Check [Database Connections](#database-connections) for more
information.

At the moment, `import-db` supports two different output formats (see the
sketch after the list):

- JSON records
- CSV (with header, and no index)
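
For example, to snapshot a table as JSON records instead of the default CSV (a
sketch reusing the connection and table names from the examples below):

```dvc
$ dvc import-db --table customers_table --conn pgsql --output-format json
```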

An _import `.dvc` file_ is created alongside the output file, e.g.
`customers_table.csv.dvc`. This makes it possible to update the import later,
if the data source has changed (see `dvc update`).

<admon type="info">

You can `dvc push` and `dvc pull` data imported from databases to/from remote
storage as usual.

</admon>
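
For example, after importing, the snapshot can be synchronized with remote
storage like any other DVC-tracked data:

```dvc
$ dvc push
$ dvc pull
```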

## Database Connections

To connect to a database, DVC needs a database connection string URI. This has
to be configured in the [`db`] section.

```dvc
$ dvc config db.pgsql.url postgresql://user@hostname:port/database
$ dvc config --local db.pgsql.password password
```
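
After running these commands, the two values end up in separate files, roughly
like this (a sketch; DVC's exact formatting may differ):

```dvc
$ cat .dvc/config
['db "pgsql"']
    url = postgresql://user@hostname:port/database

$ cat .dvc/config.local
['db "pgsql"']
    password = password
```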

<admon type="warn" title="Security Alert">

Configure `password` with the `--local` option so that it is written to a
Git-ignored config file.

</admon>

<admon type="warn" title="Security Alert">

Use a user account with limited, read-only access to the database, as `--sql`
can run arbitrary queries. Different databases have different approaches to
this. Refer to their documentation for more details.

</admon>

You need to specify the name of the database connection to use when running
`import-db`.

```dvc
$ dvc import-db --table customers_table --conn pgsql
```

In addition to a connection string, DVC needs a driver to connect to the
database. Check [Installing database drivers](#installing-database-drivers) for
the connection string format and the necessary driver for your specific
database.

[`db`]: /doc/user-guide/project-structure/configuration#db

## Installing database drivers

DVC does not come with all database drivers preinstalled; you'll need to
install the required packages for the database you want to use.
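
For example, to connect to a PostgreSQL database you would install its driver
package (package names for other databases are listed in the table below):

```dvc
$ pip install psycopg2
```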

Some of the recommended packages are shown below, with their expected connection
strings:

| **Database** | **PyPI package** | **Connection String** |
| ------------------- | --------------------------------- | --------------------------------------------------------------------------------------------------- |
| **Amazon Redshift** | `sqlalchemy-redshift` | `redshift+psycopg2://{username}:{password}@{aws_endpoint}:5439/{database_name}` |
| **BigQuery**        | `sqlalchemy-bigquery`             | `bigquery://{project_id}`                                                                             |
| **Databricks** | `databricks-sql-connector` | `databricks://token:{token}@{hostname}:{port}/{database}?http_path={http_path}` |
| **MySQL** | `mysqlclient` | `mysql://{username}:{password}@{hostname}/{database_name}` |
| **Oracle** | `cx_Oracle` | `oracle://{username}:{password}@{hostname}/{database_name}` |
| **PostgreSQL** | `psycopg2` | `postgresql://{username}:{password}@{hostname}/{database_name}` |
| **Snowflake**       | `snowflake-sqlalchemy`            | `snowflake://{user}:{password}@{account}.{region}/{database}?role={role}&warehouse={warehouse}`      |
| **SQLite**          | -                                 | `sqlite:///path/to/file.db`                                                                           |
| **SQL Server** | `pyodbc` | `mssql+pyodbc://{username}:{password}@{hostname}:{port}/{database_name}` |
| **Trino** | `trino` | `trino://{username}:{password}@{hostname}:{port}/{catalog}` |

DVC uses [SQLAlchemy](https://www.sqlalchemy.org/) internally, so it should
support any SQL database that provides a dialect for SQLAlchemy. Refer to the
SQLAlchemy
[documentation](https://docs.sqlalchemy.org/en/20/core/engines.html#backend-specific-urls)
for more details.
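
In general, SQLAlchemy connection strings follow this pattern, where `dialect`
names the database type and `driver` optionally names the DBAPI package to use:

```
dialect+driver://username:password@host:port/database
```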

## Options

- `-o <path>`, `--out <path>` - specify a `path` to the desired location in the
  workspace to place the file. If not specified, the filename is generated from
  the `--output-format` and `--table` arguments; for `--sql`, it starts with
  `results` by default. (A combined sketch of several options follows this
  list.)

- `--table <table>` - table to snapshot.

- `--sql <query>` - execute SQL query and snapshot its result.

- `--output-format` - format to materialize into. `csv` (default) and `json`
  are supported.

- `--conn <name>` - name of the database connection to use. The connection has
  to be set in the
  [config](/doc/user-guide/project-structure/configuration#db).

- `-f`, `--force` - when using `--out` to specify a local target file or
  directory, the operation fails if that path already exists. This flag forces
  the operation, causing existing local files/dirs to be overwritten by the
  command.

- `-h`, `--help` - prints the usage/help message, and exit.

- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
problems arise, otherwise 1.

- `-v`, `--verbose` - displays detailed tracing information.
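
A combined sketch using several of these options (the query and output path
here are only illustrative):

```dvc
$ dvc import-db --sql "select * from customers" --conn pgsql \
      --output-format json -o customers.json --force
```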

## Examples

### Downloading a table

To import a table from a database using a connection set in the `db` config:

```dvc
$ dvc import-db --table "customers_table" --conn pgsql
...
```

`dvc import-db` will snapshot the complete table and save it to a file named
`customers_table.csv`. It will also create a `customers_table.csv.dvc` file
with the following contents:

```yaml
md5: ddd4654188815dcae6ce4d4a37f83bde
frozen: true
deps:
- db:
    file_format: csv
    connection: pgsql
    table: customers_table
outs:
- md5: 131543a828b297ce0a5925800bd88810
  size: 15084226
  hash: md5
  path: customers_table.csv
```

You can use `dvc update` to update the snapshot.
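
For example, to refresh the snapshot from the current contents of the table:

```dvc
$ dvc update customers_table.csv.dvc
```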

### Downloading a SQL query result

Similarly, you can also snapshot a SQL query result as follows:

```dvc
$ dvc import-db --sql "select * from customers" --conn pgsql
...
```

`dvc import-db` will snapshot the query results and save them to a file named
`results.csv`. Similarly, it will also create a `results.csv.dvc` file, which
can be used with `dvc update` later.
4 changes: 4 additions & 0 deletions content/docs/sidebar.json
@@ -408,6 +408,10 @@
"label": "get",
"slug": "get"
},
{
"label": "import-db",
"slug": "import-db"
},
{
"label": "import-url",
"slug": "import-url"
33 changes: 33 additions & 0 deletions content/docs/user-guide/project-structure/configuration.md
@@ -59,6 +59,7 @@ within:
- [`remote`](#remote) - sections in the config file that describe [remote
storage]
- [`cache`](#cache) - options that affect the project's <abbr>cache</abbr>
- [`db`](#db) - sections in the config file that describe [database connections]
- [`hydra`](#hydra) - options around [Hydra Composition] for experiment
configuration.
- [`parsing`](#parsing) - options around the parsing of [dictionary unpacking].
@@ -72,6 +73,7 @@ within:

[remote storage]: /doc/user-guide/data-management/remote-storage
[hydra composition]: /doc/user-guide/experiment-management/hydra-composition
[database connections]: /doc/command-reference/import-db#database-connections
[dictionary unpacking]:
/doc/user-guide/project-structure/dvcyaml-files#dictionary-unpacking
[internals]: /doc/user-guide/project-structure/internal-files
@@ -215,6 +217,37 @@ section):

<details>

## db

Similar to `remote`, configuration files may contain more than one `db`
section. Each requires a unique name and a `url` value, which is the connection
string used to connect to the database. They can also specify `username` and
`password` options, which are combined with the provided `url` and passed to
the appropriate database driver to connect to the database.
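
For example, these options could be set with `dvc config` (a sketch composed
from the option names above; the `pgsql` name and values are illustrative):

```dvc
$ dvc config db.pgsql.url postgresql://host/dbname
$ dvc config db.pgsql.username user
$ dvc config --local db.pgsql.password password
```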

<admon type="warn">

Set `password` in the Git-ignored local config file (`.dvc/config.local`) so
that no secrets are leaked through Git.

</admon>

As an example, the following config file defines a `pgsql` database connection
that connects to the `dbname` database as user `user` hosted at `host`. The
`postgresql://` scheme selects the driver used to connect to that database.

```ini
['db "pgsql"']
url = "postgresql://user@host/dbname
```

The name, `pgsql` in this example, can be used to specify which database to
connect to in commands like `import-db`.
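
For example (the table name here is illustrative):

```dvc
$ dvc import-db --table customers_table --conn pgsql
```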

</details>

<details>

## hydra

Sets the defaults for <abbr>experiment</abbr> configuration via [Hydra