
[docs] Replace examples of Hadoop catalog with JDBC & REST catalog #11845

Open · wants to merge 14 commits into base: main from kevinjqliu/getting-started-without-hadoop-catalog
Conversation

@kevinjqliu (Contributor) commented Dec 22, 2024

Closes #11284
devlist discussion

This PR replaces examples of the Hadoop catalog with examples of the JDBC catalog and adds examples of setting up a REST catalog.

Testing

spark-quickstart.md using JDBC catalog

Using spark-sql CLI config:

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,org.xerial:sqlite-jdbc:3.46.1.3 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.local.type=jdbc \
    --conf spark.sql.catalog.local.uri=jdbc:sqlite:$PWD/iceberg_catalog_db.sqlite \
    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
    --conf spark.sql.defaultCatalog=local
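
One way to confirm the JDBC catalog actually bootstrapped, a sketch assuming the `sqlite3` CLI is installed (not part of this PR):

```sh
# Inspect the SQLite file the JDBC catalog writes in the current directory.
# Run after the catalog has been used at least once.
sqlite3 iceberg_catalog_db.sqlite '.tables'
# Expect to see JdbcCatalog's bookkeeping tables,
# e.g. iceberg_tables and iceberg_namespace_properties
```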

Using spark-defaults.conf file:

spark.jars.packages                                  org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,org.xerial:sqlite-jdbc:3.46.1.3
spark.sql.extensions                                 org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog                      org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type                 hive
spark.sql.catalog.local                              org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.local.type                         jdbc
spark.sql.catalog.local.uri                          jdbc:sqlite:iceberg_catalog_db.sqlite
spark.sql.catalog.local.warehouse                    warehouse
spark.sql.defaultCatalog                             local
spark-sql --properties-file ./spark-defaults.conf
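
A quick end-to-end smoke test against this config (my sketch; the namespace and table names are illustrative, not from the PR):

```sh
# Create, write, and read back an Iceberg table through the JDBC catalog.
spark-sql --properties-file ./spark-defaults.conf -e "
  CREATE NAMESPACE IF NOT EXISTS db;
  CREATE TABLE IF NOT EXISTS db.test_table (id bigint, data string) USING iceberg;
  INSERT INTO db.test_table VALUES (1, 'a'), (2, 'b');
  SELECT * FROM db.test_table;
"
```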

spark-quickstart.md using REST catalog

With spark-sql CLI config:

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,org.apache.iceberg:iceberg-aws-bundle:1.7.1 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.rest=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.rest.type=rest \
    --conf spark.sql.catalog.rest.uri=http://localhost:8181 \
    --conf spark.sql.catalog.rest.warehouse=s3://warehouse/ \
    --conf spark.sql.catalog.rest.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
    --conf spark.sql.catalog.rest.s3.endpoint=http://localhost:9000 \
    --conf spark.sql.catalog.rest.s3.path-style-access=true \
    --conf spark.sql.catalog.rest.s3.access-key-id=admin \
    --conf spark.sql.catalog.rest.s3.secret-access-key=password \
    --conf spark.sql.catalog.rest.client.region=us-east-1 \
    --conf spark.sql.defaultCatalog=rest
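
This config assumes a REST catalog server on localhost:8181 and an S3-compatible store on localhost:9000 with matching credentials. One way to stand those up locally, a sketch using the tabulario/iceberg-rest and minio images (image names, networking, and env vars are my assumptions, not part of this PR):

```sh
# S3-compatible object store on :9000; credentials match the Spark conf above.
docker run -d --name minio -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=admin -e MINIO_ROOT_PASSWORD=password \
  minio/minio server /data --console-address ":9001"

# REST catalog server on :8181, pointed at the same warehouse bucket.
docker run -d --name iceberg-rest -p 8181:8181 \
  -e CATALOG_WAREHOUSE=s3://warehouse/ \
  -e CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO \
  -e CATALOG_S3_ENDPOINT=http://host.docker.internal:9000 \
  -e AWS_ACCESS_KEY_ID=admin -e AWS_SECRET_ACCESS_KEY=password \
  -e AWS_REGION=us-east-1 \
  tabulario/iceberg-rest
```

The `warehouse` bucket has to exist before the first write (e.g. created through the MinIO console or `mc mb`).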

With spark-defaults.conf file:

spark.jars.packages                                  org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,org.apache.iceberg:iceberg-aws-bundle:1.7.1
spark.sql.extensions                                 org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog                      org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type                 hive
spark.sql.catalog.rest                               org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.rest.type                          rest
spark.sql.catalog.rest.uri                           http://localhost:8181
spark.sql.catalog.rest.warehouse                     s3://warehouse/
spark.sql.catalog.rest.io-impl                       org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.rest.s3.endpoint                   http://localhost:9000
spark.sql.catalog.rest.s3.path-style-access          true
spark.sql.catalog.rest.s3.access-key-id              admin
spark.sql.catalog.rest.s3.secret-access-key          password
spark.sql.catalog.rest.client.region                 us-east-1
spark.sql.defaultCatalog                             rest
spark-sql --properties-file ./spark-defaults.conf
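
Before starting Spark, the REST endpoint can be sanity-checked directly; `GET /v1/config` is part of the Iceberg REST catalog spec:

```sh
# Should return a JSON config payload if the REST catalog is up.
curl -s http://localhost:8181/v1/config
```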

Rendered Docs

site/docs/spark-quickstart.md (http://127.0.0.1:8000/spark-quickstart/#adding-catalogs)

(screenshot of the rendered section)

docs/docs/spark-getting-started.md (http://127.0.0.1:8000/docs/nightly/spark-getting-started/#adding-catalogs)

(screenshot of the rendered section)

site/docs/how-to-release.md (http://127.0.0.1:8000/how-to-release/#verifying-with-spark)

(screenshot of the rendered section)

@github-actions bot added the docs label Dec 22, 2024
@kevinjqliu force-pushed the kevinjqliu/getting-started-without-hadoop-catalog branch from 00ca569 to 6fe50e1 on December 22, 2024 18:58
@kevinjqliu force-pushed the kevinjqliu/getting-started-without-hadoop-catalog branch from 496e51e to 63c9a1a on December 22, 2024 21:01
@kevinjqliu (Contributor, Author) commented:
Note: there are two "getting started" docs, this one and site/docs/spark-quickstart.md.

@@ -269,42 +273,104 @@ To read a table, simply use the Iceberg table's name.

### Adding A Catalog

Iceberg has several catalog back-ends that can be used to track tables, like JDBC, Hive MetaStore and Glue.
Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`. In this guide,
we use JDBC, but you can follow these instructions to configure other catalog types. To learn more, check out
@kevinjqliu (Contributor, Author) commented:
Weird that the guide already mentions JDBC here, but the example is still Hadoop.

Comment on lines 29 to 33
- [Configuring JDBC Catalog](#configuring-jdbc-catalog)
- [Configuring REST Catalog](#configuring-rest-catalog)
- [Next steps](#next-steps)
- [Adding Iceberg to Spark](#adding-iceberg-to-spark)
- [Learn More](#learn-more)
@kevinjqliu (Contributor, Author) commented:
Renders the subsection correctly. (screenshot of the rendered list)

--conf spark.sql.catalog.local.warehouse=$PWD/warehouse
```

For an example of configuring a REST-based catalog, see [Configuring REST Catalog](/spark-quickstart#configuring-rest-catalog)
@kevinjqliu (Contributor, Author) commented:
Instead of repeating the REST catalog configuration here, just link to site/docs/spark-quickstart.md. I double-checked the link locally.

--conf spark.sql.catalog.local.type=jdbc \
--conf spark.sql.catalog.local.uri=jdbc:sqlite:$PWD/iceberg_catalog_db.sqlite \
--conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
--conf spark.sql.defaultCatalog=local
@kevinjqliu (Contributor, Author) commented:
Added `defaultCatalog` to match the other pages.
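
For context (my illustration, not from the diff): with `spark.sql.defaultCatalog=local`, unqualified names resolve against `local`, so the fully qualified form becomes optional:

```sh
# Both statements hit the same table once defaultCatalog is set
# (table name reuses the illustrative smoke test above).
spark-sql --properties-file ./spark-defaults.conf -e "SELECT * FROM db.test_table;"
spark-sql --properties-file ./spark-defaults.conf -e "SELECT * FROM local.db.test_table;"
```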

spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type hive
spark.sql.catalog.local org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.local.type hadoop
spark.sql.catalog.local.warehouse $PWD/warehouse
@kevinjqliu (Contributor, Author) commented:
`$PWD` does not expand in spark-defaults.conf; keeping it here will create a folder literally named `$PWD`.
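
One workaround (a sketch, not from the PR): let the shell expand `$PWD` when writing the file, so the conf ends up holding an absolute path:

```sh
# The shell expands $PWD at write time, so spark-defaults.conf
# contains a literal absolute path rather than the string "$PWD".
echo "spark.sql.catalog.local.warehouse    $PWD/warehouse" >> ./spark-defaults.conf
```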

@kevinjqliu marked this pull request as ready for review December 22, 2024 22:28
@jbonofre self-requested a review December 23, 2024 06:17

=== "CLI"

```sh
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}\
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }},org.xerial:sqlite-jdbc:3.46.1.3 \
@kevinjqliu (Contributor, Author) commented:
Taking on this extra dependency since I don't see any Iceberg-specific package I can use. There is a hive-jdbc package.

@mrcnc (Contributor) left a comment:
LGTM 👍 Thanks for improving this!

@kevinjqliu requested a review from Fokko January 13, 2025 18:35
@kevinjqliu changed the title from "[docs] Replace examples of Hadoop catalog with JDBC catalog" to "[docs] Replace examples of Hadoop catalog with JDBC & REST catalog" Jan 13, 2025

This command creates a path-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog:
This command creates a JDBC-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog.
A reviewer (Member) commented:
Suggested change
This command creates a JDBC-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog.
This command creates a JDBC-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in `spark_catalog` using the Hive connector.

@kevinjqliu (Contributor, Author) commented Jan 30, 2025:
Maybe something like this:

Suggested change
This command creates a JDBC-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog.
This command creates a JDBC-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog (`spark_catalog`) using the Hive connector.

Is it the "Hive connector" or the "Hive Metastore"?

But I'm also inclined not to add this. I feel like this is too detailed for a "getting started" page.

@github-actions bot commented Mar 2, 2025:

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions bot added the stale label Mar 2, 2025
Projects: None yet
Development: Successfully merging this pull request may close these issues: [Docs] Update Examples to Replace Hadoop Catalog with JDBC Catalog
4 participants