diff --git a/content/docs/command-reference/import-db.md b/content/docs/command-reference/import-db.md new file mode 100644 index 0000000000..285db1112d --- /dev/null +++ b/content/docs/command-reference/import-db.md @@ -0,0 +1,178 @@ +# import-db + + + +This is an experimental command, and is subject to breaking changes. + + + +Snapshot a table or a SQL query results from a database into CSV/JSON format. + +```usage +usage: dvc import-db [-h] [-q | -v] + [--sql sql | --table table] [--conn conn] + [--output-format [{csv,json}]] [-o []] [-f] +``` + +## Description + +With `import-db`, you can snapshot your ETL/database to a file to use in your +data pipelines. This commands supports importing your table or a SQL query +results into different file formats. To do so, you have to set connection +strings to connect to a database, which can be setup in config as `db.`. +Check [Database Connections](#database-connections) for more information. + +At the moment, `import-db` supports two different output format: + +- JSON records +- CSV (with header, and no index) + +An _import `.dvc` file_ is created in the same location e.g. +`customers.txt.dvc`. This makes it possible to update the import later, if the +data source has changed (see `dvc update`). + + + +You can `dvc push` and `dvc pull` data imported from the databases to/from +remote storage normally. + + + +## Database Connections + +To connect to a database, DVC needs a database connection string URI. This has +to be configured in the [`db`] section. + +```dvc +$ dvc config db.pgsql.url postgresql://user@hostname:port/database +$ dvc config --local db.pgsql.password password +``` + + + +Configure `password` with `--local` option so they are written to a Git-ignored +config file. + + + + + +Use an user account with limited access to databases with read-only privileges, +as `--sql` can run arbitrary queries. Different databases have different +approaches to this. Refer to their documentation for more details. + + + +You need to specify the name of database connection to use, when using +`import-db`. + +```dvc +$ dvc import-db --table customers_table --conn pgsql +``` + +In addition to a connection string, DVC needs a driver to connect to the +database. Check [Installing database drivers](#installing-database-drivers) for +connection string format and necessary driver for your specific database. + +[`db`]: /doc/user-guide/project-structure/configuration#db + +## Installing database drivers + +DVC does not come preinstalled with all the drivers for databases, you’ll need +to install the required packages for the database you want to use. + +Some of the recommended packages are shown below, with their expected connection +strings: + +| **Database** | **PyPI package** | **Connection String** | +| ------------------- | --------------------------------- | --------------------------------------------------------------------------------------------------- | +| **Amazon Redshift** | `sqlalchemy-redshift` | `redshift+psycopg2://{username}:{password}@{aws_endpoint}:5439/{database_name}` | +| **Big Query** | `pip install sqlalchemy-bigquery` | `bigquery://{project_id}` | +| **Databricks** | `databricks-sql-connector` | `databricks://token:{token}@{hostname}:{port}/{database}?http_path={http_path}` | +| **MySQL** | `mysqlclient` | `mysql://{username}:{password}@{hostname}/{database_name}` | +| **Oracle** | `cx_Oracle` | `oracle://{username}:{password}@{hostname}/{database_name}` | +| **PostgreSQL** | `psycopg2` | `postgresql://{username}:{password}@{hostname}/{database_name}` | +| **Snowflake** | `snowflake-sqlalchemy` | `snowflake://{user}:{password}@{account}.{region}/{database}?role={role}&warehouse={warehouse}` | +| **SQLite** | - | `sqlite://path/to/file.db` | +| **SQL Server** | `pyodbc` | `mssql+pyodbc://{username}:{password}@{hostname}:{port}/{database_name}` | +| **Trino** | `trino` | `trino://{username}:{password}@{hostname}:{port}/{catalog}` | + +DVC uses [`sqlalchemy`](https://www.sqlalchemy.org/) internally. So DVC should +support any SQL databases that provide dialects for SQLAlchemy. Refer to their +[documentation](https://docs.sqlalchemy.org/en/20/core/engines.html#backend-specific-urls) +for more details. + +## Options + +- `-o `, `--out ` - specify a `path` to the desired location in the + workspace to place the file. If not specified, the filename will be generated + using the arguments from `--output-format` and `--table`, or for `--sql`, it + starts with "results" by default. + +- `--table ` - table to snapshot. + +- `--sql ` - execute SQL query and snapshot its result. + +- `--output-format` - type of format to materialize into. `csv` (default) and + `json` is supported. + +- `--conn connection` - name of the database connection to use. The connection + has to be set in the + [config](/doc/user-guide/project-structure/configuration#db). + +- `-f`, `--force` - when using `--out` to specify a local target file or + directory, the operation will fail if those paths already exist. this flag + will force the operation causing local files/dirs to be overwritten by the + command. + +- `-h`, `--help` - prints the usage/help message, and exit. + +- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no + problems arise, otherwise 1. + +- `-v`, `--verbose` - displays detailed tracing information. + +## Examples + +### Downloading a table + +To import a table from a database using a `db` config set: + +```dvc +$ dvc import-db --table "customers_table" --conn pgsql +... +``` + +`dvc import-db` will snapshot the complete table, and save to a file named +`customers_table.csv`. It will also create a `customers_table.csv.dvc` file with +the following contents: + +```yaml +md5: ddd4654188815dcae6ce4d4a37f83bde +frozen: true +deps: + - db: + file_format: csv + connection: pgsql + table: customers_table +outs: + - md5: 131543a828b297ce0a5925800bd88810 + size: 15084226 + hash: md5 + path: customers_table.csv +``` + +You can use `dvc update` to update the snapshot. + +### Downloading SQL query result + +Similarly, you can also snapshot a SQL query result as follows: + +```dvc +$ dvc import-db --sql "select * from customers" --conn pgsql +... +``` + +`dvc import-db` will snapshot the query results, and save to a file named +`results.csv`. Similarly, it will also create a `results.csv.dvc` file, which +can be used to `dvc update` later. diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 000fe74587..8d64a04261 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -408,6 +408,10 @@ "label": "get", "slug": "get" }, + { + "label": "import-db", + "slug": "import-db" + }, { "label": "import-url", "slug": "import-url" diff --git a/content/docs/user-guide/project-structure/configuration.md b/content/docs/user-guide/project-structure/configuration.md index 8975658698..586d8ff4ee 100644 --- a/content/docs/user-guide/project-structure/configuration.md +++ b/content/docs/user-guide/project-structure/configuration.md @@ -59,6 +59,7 @@ within: - [`remote`](#remote) - sections in the config file that describe [remote storage] - [`cache`](#cache) - options that affect the project's cache +- [`db`](#db) - sections in the config file that describe [database connections] - [`hydra`](#hydra) - options around [Hydra Composition] for experiment configuration. - [`parsing`](#parsing) - options around the parsing of [dictionary unpacking]. @@ -72,6 +73,7 @@ within: [remote storage]: /doc/user-guide/data-management/remote-storage [hydra composition]: /doc/user-guide/experiment-management/hydra-composition +[database connections]: /doc/command-reference/import-db#database-connections [dictionary unpacking]: /doc/user-guide/project-structure/dvcyaml-files#dictionary-unpacking [internals]: /doc/user-guide/project-structure/internal-files @@ -215,6 +217,37 @@ section):
+## db + +Similar to `remote`, configuration files may have more than one `'db'`. All of +them require a unique `"name"` and a `url` value which is a connection string to +connect to the database. They can also specify `username` and `password` +options, which is used to combine with provided `url`, which is what is passed +to the appropriate database drivers to connect to the database. + + + +Set `password` to a Git-ignored local config file (`.dvc/config.local`) so that +no secrets are leaked through Git. + + + +As an example, the following config file defines a `pgsql` database connection +to connect to the `dbname` database as a user `user` hosted at `host` url. The +`postgresql://` defines a driver to be used to connect to that database. + +```ini +['db "pgsql"'] + url = "postgresql://user@host/dbname +``` + +The name, `pgsql` for example, can be used to specify what database to connect +to, in commands like `import-db`. + +
+ +
+ ## hydra Sets the defaults for experiment configuration via [Hydra