
[docs] Replace examples of Hadoop catalog with JDBC & REST catalog #11845

Open · wants to merge 14 commits into base: main from kevinjqliu/getting-started-without-hadoop-catalog
Conversation

@kevinjqliu (Contributor) commented Dec 22, 2024

Closes #11284
devlist discussion

This PR replaces examples of the Hadoop catalog with examples of the JDBC catalog and adds examples of setting up a REST catalog.

Testing

spark-quickstart.md using JDBC catalog

Using spark-sql CLI config:

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,org.xerial:sqlite-jdbc:3.46.1.3 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.local.type=jdbc \
    --conf spark.sql.catalog.local.uri=jdbc:sqlite:$PWD/iceberg_catalog_db.sqlite \
    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
    --conf spark.sql.defaultCatalog=local
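
One way to confirm the JDBC catalog actually bootstrapped, a sketch assuming the `sqlite3` CLI is installed (not part of this PR):

```sh
# Inspect the SQLite file the JDBC catalog writes in the current directory.
# Run after the catalog has been used at least once.
sqlite3 iceberg_catalog_db.sqlite '.tables'
# Expect to see JdbcCatalog's bookkeeping tables,
# e.g. iceberg_tables and iceberg_namespace_properties
```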

Using spark-defaults.conf file:

spark.jars.packages                                  org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,org.xerial:sqlite-jdbc:3.46.1.3
spark.sql.extensions                                 org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog                      org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type                 hive
spark.sql.catalog.local                              org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.local.type                         jdbc
spark.sql.catalog.local.uri                          jdbc:sqlite:iceberg_catalog_db.sqlite
spark.sql.catalog.local.warehouse                    warehouse
spark.sql.defaultCatalog                             local
spark-sql --properties-file ./spark-defaults.conf
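
A quick end-to-end smoke test against this config (my sketch; the namespace and table names are illustrative, not from the PR):

```sh
# Create, write, and read back an Iceberg table through the JDBC catalog.
spark-sql --properties-file ./spark-defaults.conf -e "
  CREATE NAMESPACE IF NOT EXISTS db;
  CREATE TABLE IF NOT EXISTS db.test_table (id bigint, data string) USING iceberg;
  INSERT INTO db.test_table VALUES (1, 'a'), (2, 'b');
  SELECT * FROM db.test_table;
"
```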

spark-quickstart.md using REST catalog

With spark-sql CLI config:

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,org.apache.iceberg:iceberg-aws-bundle:1.7.1 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.rest=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.rest.type=rest \
    --conf spark.sql.catalog.rest.uri=http://localhost:8181 \
    --conf spark.sql.catalog.rest.warehouse=s3://warehouse/ \
    --conf spark.sql.catalog.rest.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
    --conf spark.sql.catalog.rest.s3.endpoint=http://localhost:9000 \
    --conf spark.sql.catalog.rest.s3.path-style-access=true \
    --conf spark.sql.catalog.rest.s3.access-key-id=admin \
    --conf spark.sql.catalog.rest.s3.secret-access-key=password \
    --conf spark.sql.catalog.rest.client.region=us-east-1 \
    --conf spark.sql.defaultCatalog=rest
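
This config assumes a REST catalog server on localhost:8181 and an S3-compatible store on localhost:9000 with matching credentials. One way to stand those up locally, a sketch using the tabulario/iceberg-rest and minio images (image names, networking, and env vars are my assumptions, not part of this PR):

```sh
# S3-compatible object store on :9000; credentials match the Spark conf above.
docker run -d --name minio -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=admin -e MINIO_ROOT_PASSWORD=password \
  minio/minio server /data --console-address ":9001"

# REST catalog server on :8181, pointed at the same warehouse bucket.
docker run -d --name iceberg-rest -p 8181:8181 \
  -e CATALOG_WAREHOUSE=s3://warehouse/ \
  -e CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO \
  -e CATALOG_S3_ENDPOINT=http://host.docker.internal:9000 \
  -e AWS_ACCESS_KEY_ID=admin -e AWS_SECRET_ACCESS_KEY=password \
  -e AWS_REGION=us-east-1 \
  tabulario/iceberg-rest
```

The `warehouse` bucket has to exist before the first write (e.g. created through the MinIO console or `mc mb`).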

With spark-defaults.conf file:

spark.jars.packages                                  org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,org.apache.iceberg:iceberg-aws-bundle:1.7.1
spark.sql.extensions                                 org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog                      org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type                 hive
spark.sql.catalog.rest                               org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.rest.type                          rest
spark.sql.catalog.rest.uri                           http://localhost:8181
spark.sql.catalog.rest.warehouse                     s3://warehouse/
spark.sql.catalog.rest.io-impl                       org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.rest.s3.endpoint                   http://localhost:9000
spark.sql.catalog.rest.s3.path-style-access          true
spark.sql.catalog.rest.s3.access-key-id              admin
spark.sql.catalog.rest.s3.secret-access-key          password
spark.sql.catalog.rest.client.region                 us-east-1
spark.sql.defaultCatalog                             rest
spark-sql --properties-file ./spark-defaults.conf
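
Before starting Spark, the REST endpoint can be sanity-checked directly; `GET /v1/config` is part of the Iceberg REST catalog spec:

```sh
# Should return a JSON config payload if the REST catalog is up.
curl -s http://localhost:8181/v1/config
```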

Rendered Docs

site/docs/spark-quickstart.md (http://127.0.0.1:8000/spark-quickstart/#adding-catalogs)

(screenshot of the rendered section)

docs/docs/spark-getting-started.md (http://127.0.0.1:8000/docs/nightly/spark-getting-started/#adding-catalogs)

(screenshot of the rendered section)

site/docs/how-to-release.md (http://127.0.0.1:8000/how-to-release/#verifying-with-spark)

(screenshot of the rendered section)

@github-actions bot added the docs label Dec 22, 2024
@kevinjqliu force-pushed the kevinjqliu/getting-started-without-hadoop-catalog branch from 00ca569 to 6fe50e1 on December 22, 2024 18:58
@kevinjqliu force-pushed the kevinjqliu/getting-started-without-hadoop-catalog branch from 496e51e to 63c9a1a on December 22, 2024 21:01
@kevinjqliu (Contributor, Author) commented:
Note: there are two "getting started" docs, this one and site/docs/spark-quickstart.md.

@@ -269,42 +273,104 @@ To read a table, simply use the Iceberg table's name.

### Adding A Catalog

Iceberg has several catalog back-ends that can be used to track tables, like JDBC, Hive MetaStore and Glue.
Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`. In this guide,
we use JDBC, but you can follow these instructions to configure other catalog types. To learn more, check out
@kevinjqliu (Contributor, Author) commented:
Weird that the guide already mentions JDBC here, but the example is still Hadoop.

Comment on lines 29 to 33
- [Configuring JDBC Catalog](#configuring-jdbc-catalog)
- [Configuring REST Catalog](#configuring-rest-catalog)
- [Next steps](#next-steps)
- [Adding Iceberg to Spark](#adding-iceberg-to-spark)
- [Learn More](#learn-more)
@kevinjqliu (Contributor, Author) commented:
Renders the subsection correctly. (screenshot of the rendered list)

--conf spark.sql.catalog.local.warehouse=$PWD/warehouse
```

For an example of configuring a REST-based catalog, see [Configuring REST Catalog](/spark-quickstart#configuring-rest-catalog)
@kevinjqliu (Contributor, Author) commented:
Instead of repeating the REST catalog configuration here, just link to site/docs/spark-quickstart.md. I double-checked the link locally.

--conf spark.sql.catalog.local.type=jdbc \
--conf spark.sql.catalog.local.uri=jdbc:sqlite:$PWD/iceberg_catalog_db.sqlite \
--conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
--conf spark.sql.defaultCatalog=local
@kevinjqliu (Contributor, Author) commented:
Added `defaultCatalog` to match the other pages.
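
For context (my illustration, not from the diff): with `spark.sql.defaultCatalog=local`, unqualified names resolve against `local`, so the fully qualified form becomes optional:

```sh
# Both statements hit the same table once defaultCatalog is set
# (table name reuses the illustrative smoke test above).
spark-sql --properties-file ./spark-defaults.conf -e "SELECT * FROM db.test_table;"
spark-sql --properties-file ./spark-defaults.conf -e "SELECT * FROM local.db.test_table;"
```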

spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type hive
spark.sql.catalog.local org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.local.type hadoop
spark.sql.catalog.local.warehouse $PWD/warehouse
@kevinjqliu (Contributor, Author) commented:
`$PWD` does not expand in spark-defaults.conf; keeping it here will create a folder literally named `$PWD`.
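
One workaround (a sketch, not from the PR): let the shell expand `$PWD` when writing the file, so the conf ends up holding an absolute path:

```sh
# The shell expands $PWD at write time, so spark-defaults.conf
# contains a literal absolute path rather than the string "$PWD".
echo "spark.sql.catalog.local.warehouse    $PWD/warehouse" >> ./spark-defaults.conf
```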

@kevinjqliu marked this pull request as ready for review December 22, 2024 22:28
@jbonofre self-requested a review December 23, 2024 06:17

=== "CLI"

```sh
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}\
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }},org.xerial:sqlite-jdbc:3.46.1.3 \
@kevinjqliu (Contributor, Author) commented:
Taking on this extra dependency since I don't see any Iceberg-specific package I can use. There is a hive-jdbc package.

@mrcnc (Contributor) left a comment:
LGTM 👍 Thanks for improving this!

@kevinjqliu requested a review from Fokko January 13, 2025 18:35
@kevinjqliu changed the title from "[docs] Replace examples of Hadoop catalog with JDBC catalog" to "[docs] Replace examples of Hadoop catalog with JDBC & REST catalog" Jan 13, 2025

This command creates a path-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog:
This command creates a JDBC-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog.
A reviewer (Member) commented:
Suggested change
This command creates a JDBC-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog.
This command creates a JDBC-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in `spark_catalog` using the Hive connector.

@kevinjqliu (Contributor, Author) commented Jan 30, 2025:
Maybe something like this:

Suggested change
This command creates a JDBC-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog.
This command creates a JDBC-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog (`spark_catalog`) using the Hive connector.

Is it the "Hive connector" or the "Hive Metastore"?

But I'm also inclined not to add this. I feel like this is too detailed for a "getting started" page.

@github-actions bot commented Mar 2, 2025:

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions bot added the stale label Mar 2, 2025
Projects: None yet
Development: Successfully merging this pull request may close these issues: [Docs] Update Examples to Replace Hadoop Catalog with JDBC Catalog
4 participants