README update on 0.6.2 (#234)
* README update on 0.6.1

* resolve comments

* resolve comments

* update to 0.6.2

* update connector status

* resolve comments

* update content
linzhou-db authored Jan 13, 2023
1 parent 4afc7f2 commit 92fe196
Showing 3 changed files with 242 additions and 8 deletions.
146 changes: 139 additions & 7 deletions README.md
@@ -23,7 +23,6 @@ This repo includes the following components:
- [Apache Spark](http://spark.apache.org/) Connector: An Apache Spark connector that implements the Delta Sharing Protocol to read shared tables from a Delta Sharing Server. The tables can then be accessed in SQL, Python, Java, Scala, or R.
- Delta Sharing Server: A reference implementation server for the Delta Sharing Protocol for development purposes. Users can deploy this server to share existing tables in Delta Lake and Apache Parquet format on modern cloud storage systems.


# Python Connector

The Delta Sharing Python Connector is a Python library that implements the [Delta Sharing Protocol](PROTOCOL.md) to read tables from a Delta Sharing Server. You can load shared tables as a [pandas](https://pandas.pydata.org/) DataFrame, or as an [Apache Spark](http://spark.apache.org/) DataFrame if running in PySpark with the Apache Spark Connector installed.
@@ -80,6 +79,16 @@ delta_sharing.load_as_pandas(table_url)
delta_sharing.load_as_spark(table_url)
```

If the table supports history sharing (`tableConfig.cdfEnabled=true` in the OSS Delta Sharing Server), the connector can query table changes.
```python
# Load table changes from version 0 to version 5, as a Pandas DataFrame.
delta_sharing.load_table_changes_as_pandas(table_url, starting_version=0, ending_version=5)

# If the code is running with PySpark, you can load table changes as Spark DataFrame.
delta_sharing.load_table_changes_as_spark(table_url, starting_version=0, ending_version=5)
```


You can try this by running our [examples](examples/README.md) with the open, example Delta Sharing Server.

### Details on Profile Paths
@@ -100,7 +109,7 @@ The Apache Spark Connector implements the [Delta Sharing Protocol](PROTOCOL.md)

## Accessing Shared Data

The connector loads user credentials from profile files. Please see [Download the share profile file](#download-the-share-profile-file) to download a profile file for our example server or for your own data sharing server.
The connector loads user credentials from profile files. Please see [Accessing Shared Data](#accessing-shared-data) to download a profile file for our example server or for your own data sharing server.

## Configuring Apache Spark

@@ -118,13 +127,13 @@ To use Delta Sharing connector interactively within the Spark’s Scala/Python s
#### PySpark shell

```
pyspark --packages io.delta:delta-sharing-spark_2.12:0.5.0
pyspark --packages io.delta:delta-sharing-spark_2.12:0.6.2
```

#### Scala Shell

```
bin/spark-shell --packages io.delta:delta-sharing-spark_2.12:0.5.0
bin/spark-shell --packages io.delta:delta-sharing-spark_2.12:0.6.2
```

### Set up a standalone project
@@ -139,7 +148,7 @@ You include Delta Sharing connector in your Maven project by adding it as a depe
<dependency>
<groupId>io.delta</groupId>
<artifactId>delta-sharing-spark_2.12</artifactId>
<version>0.5.0</version>
<version>0.6.2</version>
</dependency>
```

@@ -148,7 +157,7 @@ You include Delta Sharing connector in your Maven project by adding it as a depe
You include Delta Sharing connector in your SBT project by adding the following line to your `build.sbt` file:

```scala
libraryDependencies += "io.delta" %% "delta-sharing-spark" % "0.5.0"
libraryDependencies += "io.delta" %% "delta-sharing-spark" % "0.6.2"
```

## Quick Start
@@ -200,11 +209,134 @@ df <- read.df(table_path, "deltaSharing")

You can try this by running our [examples](examples/README.md) with the open, example Delta Sharing Server.

### CDF
Starting from release 0.5.0, Delta Sharing supports querying the [Change Data Feed](https://docs.databricks.com/delta/delta-change-data-feed.html) (CDF).
Once the provider turns on CDF on the original Delta table and shares it through Delta Sharing, the recipient can query
the CDF of a Delta Sharing table in much the same way as the CDF of a Delta table.
```scala
val tablePath = "<profile-file-path>#<share-name>.<schema-name>.<table-name>"
val df = spark.read.format("deltaSharing")
.option("readChangeFeed", "true")
.option("startingVersion", "3")
.load(tablePath)
```
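
For PySpark users, the same read can be sketched in Python. This is a minimal sketch, assuming the `deltaSharing` data source accepts the same `readChangeFeed` and `startingVersion` options from PySpark as in the Scala example above; the helper name `read_change_feed` is hypothetical and not part of the connector.

```python
def read_change_feed(spark, table_path, starting_version):
    """Return a DataFrame of table changes from `starting_version` onwards.

    `spark` is an existing SparkSession started with the Delta Sharing
    Spark connector on its classpath. This helper is illustrative only.
    """
    return (
        spark.read.format("deltaSharing")
        # Ask for the change feed instead of the latest snapshot.
        .option("readChangeFeed", "true")
        # Options are passed as strings, as in the Scala example.
        .option("startingVersion", str(starting_version))
        .load(table_path)
    )
```

Calling `read_change_feed(spark, "<profile-file-path>#<share-name>.<schema-name>.<table-name>", 3)` would mirror the Scala snippet, provided the provider has enabled CDF on the shared table.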

### Streaming
Starting from release 0.6.0, a Delta Sharing table can be used as a data source for [Spark Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html).
Once the provider shares a table with history, the recipient can perform a streaming query on the table.
```scala
val tablePath = "<profile-file-path>#<share-name>.<schema-name>.<table-name>"
val df = spark.readStream.format("deltaSharing")
.option("startingVersion", "1")
.option("ignoreChanges", "true")
.load(tablePath)
```
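
A PySpark version of the streaming read might look like the sketch below. It assumes the `startingVersion` and `ignoreChanges` options shown in the Scala example behave the same way from Python; the helper name `read_shared_stream` and its defaults are hypothetical.

```python
def read_shared_stream(spark, table_path, starting_version=1, ignore_changes=True):
    """Return a streaming DataFrame backed by a Delta Sharing table.

    `spark` is an existing SparkSession with the Delta Sharing Spark
    connector available. This helper is illustrative only.
    """
    reader = (
        spark.readStream.format("deltaSharing")
        # Options are passed as strings, as in the Scala example.
        .option("startingVersion", str(starting_version))
    )
    if ignore_changes:
        # Same option as in the Scala example above.
        reader = reader.option("ignoreChanges", "true")
    return reader.load(table_path)
```

The returned DataFrame is lazy: nothing is read until a sink is attached with `writeStream` and the query is started.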

### Table paths

- A profile file path can be any URL supported by Hadoop FileSystem (such as `s3a://my_bucket/my/profile/file`).
- A table path is the profile file path followed by `#` and the fully qualified name of a table (`<share-name>.<schema-name>.<table-name>`).
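
To make the table path format above concrete, here is a small sketch that splits a table path into its parts. The helper `parse_table_path` and the `TablePath` type are hypothetical and are not the connector's own parsing code; the sketch assumes share, schema and table names contain no `.` characters.

```python
from typing import NamedTuple


class TablePath(NamedTuple):
    profile: str  # profile file path, e.g. an s3a:// URL
    share: str
    schema: str
    table: str


def parse_table_path(path: str) -> TablePath:
    """Split '<profile-file-path>#<share>.<schema>.<table>' into its parts."""
    # Split on the last '#', so a '#' inside the profile path is tolerated.
    profile, sep, fqn = path.rpartition("#")
    if not sep or not profile:
        raise ValueError(f"missing profile file path or '#': {path!r}")
    parts = fqn.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <share>.<schema>.<table>, got {fqn!r}")
    return TablePath(profile, *parts)
```

For example, `parse_table_path("s3a://my_bucket/my/profile/file#my_share.my_schema.my_table")` yields the profile path `s3a://my_bucket/my/profile/file` and the three name components.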

# The Community
<div align="center">
<img src="https://user-images.githubusercontent.com/87341375/212409874-a4ef350f-3b32-4031-b2cd-8c4e47cc42e2.jpeg" alt="Delta Sharing OSS Connectors" width="400" />
</div>
<table>
<tr>
<th>Connector</th>
<th>Link</th>
<th>Status</th>
<th>Supported Features</th>
</tr>
<tr>
<td>Power BI</td>
<td>Databricks owned</td>
<td>Released</td>
<td>QueryTableVersion<br>QueryTableMetadata<br>QueryTableLatestSnapshot</td>
</tr>
<tr>
<td>Node.js</td>
<td>

[goodwillpunning/nodejs-sharing-client](https://github.com/goodwillpunning/nodejs-sharing-client)
</td>
<td>Released</td>
<td>QueryTableVersion<br>QueryTableMetadata<br>QueryTableLatestSnapshot</td>
</tr>
<tr>
<td>Java</td>
<td>

[databrickslabs/delta-sharing-java-connector](https://github.com/databrickslabs/delta-sharing-java-connector)
</td>
<td>Released</td>
<td>QueryTableVersion<br>QueryTableMetadata<br>QueryTableLatestSnapshot</td>
</tr>
<tr>
<td>Arcuate</td>
<td>

[databrickslabs/arcuate](https://github.com/databrickslabs/arcuate)
</td>
<td>Released</td>
<td>QueryTableVersion<br>QueryTableMetadata<br>QueryTableLatestSnapshot</td>
</tr>
<tr>
<td>Rust</td>
<td>

[r3stl355/delta-sharing-rust-client](https://github.com/r3stl355/delta-sharing-rust-client)
</td>
<td>Released</td>
<td>QueryTableVersion<br>QueryTableMetadata<br>QueryTableLatestSnapshot</td>
</tr>
<tr>
<td>Go</td>
<td>

[magpierre/delta-sharing](https://github.com/magpierre/delta-sharing/tree/golangdev/golang/delta_sharing_go)
</td>
<td>Released</td>
<td>QueryTableVersion<br>QueryTableMetadata<br>QueryTableLatestSnapshot</td>
</tr>
<tr>
<td>C++</td>
<td>

[magpierre/delta-sharing](https://github.com/magpierre/delta-sharing/tree/cppdev/cpp/DeltaSharingClient)
</td>
<td>Released</td>
<td>QueryTableMetadata<br>QueryTableLatestSnapshot</td>
</tr>
<tr>
<td>Airflow</td>
<td>

[apache/airflow](https://github.com/apache/airflow/pull/22692)
</td>
<td>Un-released</td>
<td>N/A</td>
</tr>
<tr>
<td>Excel-Connector</td>
<td>

[https://www.exponam.com/solutions/](https://www.exponam.com/solutions/)
</td>
<td>limited-release</td>
<td>N/A</td>
</tr>
<tr>
<td>R</td>
<td>

[zacdav-db/delta-sharing-r](https://github.com/zacdav-db/delta-sharing-r)
</td>
<td>Released</td>
<td>QueryTableVersion<br>QueryTableMetadata<br>QueryTableLatestSnapshot</td>
</tr>
</table>

# Delta Sharing Reference Server

The Delta Sharing Reference Server is a reference implementation server for the [Delta Sharing Protocol](PROTOCOL.md). This can be used to set up a small service to test your own connector that implements the [Delta Sharing Protocol](PROTOCOL.md). Please note that this is not a complete implementation of a secure web server. We highly recommend putting it behind a secure proxy if you would like to expose it to the public.
@@ -332,7 +464,7 @@ You can use the pre-built docker image from https://hub.docker.com/r/deltaio/del
```
docker run -p <host-port>:<container-port> \
--mount type=bind,source=<the-server-config-yaml-file>,target=/config/delta-sharing-server-config.yaml \
deltaio/delta-sharing-server:0.5.0 -- --config /config/delta-sharing-server-config.yaml
deltaio/delta-sharing-server:0.6.2 -- --config /config/delta-sharing-server-config.yaml
```

Note that `<container-port>` should be the same as the port defined inside the config file.
102 changes: 102 additions & 0 deletions RELEASE_NOTES.md
@@ -0,0 +1,102 @@
# Release Notes

## Delta Sharing 0.5.4 (Released on 2023-01-11)
Improvements:
- Spark connector changes to consume size from metadata.

## Delta Sharing 0.6.2 (Released on 2022-12-20)
Bug fixes:
- Fix comparison of the expiration time to current time for pre-signed urls.


## Delta Sharing 0.5.3 (Released on 2022-12-20)
Bug fixes:
- Extends DeltaSharingProfileProvider to customize tablePath and refresher.
- Refresh pre-signed urls for cdf queries.
- Fix partitionFilters issue for cdf queries.
- Fix comparison of the expiration time to current time for pre-signed urls.


## Delta Sharing 0.6.1 (Released on 2022-12-19)
Improvements:
- Spark connector changes to consume size from metadata.
- Improve delta sharing error messages.

Bug fixes:
- Extends DeltaSharingProfileProvider to customize tablePath and refresher.
- Refresh pre-signed urls for cdf and streaming queries.
- Allow 0 for versionAsOf parameter, to be consistent with Delta.
- Fix partitionFilters issue: apply it to all file indices.

## Delta Sharing 0.6.0 (Released on 2022-12-02)
Improvements:
- Support using a delta sharing table as a source in spark structured streaming, which allows recipients to stay up to date with the shared data.
- Fix a few nits in the PROTOCOL documentation.
- Support timestampAsOf parameter in delta sharing data source.

## Delta Sharing 0.5.2 (Released on 2022-10-10)
Fixes:
- Add a Custom Http Header Provider.

## Delta Sharing 0.5.1 (Released on 2022-09-08)
Improvements:
- Upgrade AWS SDK to 1.12.189.
- More tests on the error message when loading table fails.
- Add ability to configure armeria server request timeout.
- Documentation improvements.

Bug fixes:
- Fix column selection bug on Delta Sharing CDF spark dataframe.
- Fix GCS path reading.

## Delta Sharing 0.5.0 (Released on 2022-08-30)
Improvements:
- Support for Change Data Feed which allows clients to fetch incremental changes for the shared tables.
- Include response body in HTTPError exception in Python library.
- Improve the error message for the /share/schema/table APIs.
- Protocol and REST API documentation improvements.
- Add query_table_version to the rest client.

## Delta Sharing 0.4.0 (Released on 2022-01-13)
Improvements:
- Support Google Cloud Storage on Delta Sharing Server.
- Add a new API to get the metadata of a Share.
- Protocol and REST API documentation enhancements.
- Allow for customization of recipient profile in Apache Spark connector.

Bug fixes:
- Block managed table creation for Delta Sharing to prevent user confusion.

## Delta Sharing 0.3.0 (Released on 2021-12-01)
Improvements:
- Support Azure Blob Storage and Azure Data Lake Gen2 in Delta Sharing Server.
- Apache Spark Connector now can send the limitHint parameter when a user query is using limit.
- `load_as_pandas` in Python Connector now accepts a limit parameter to allow users to fetch only a few rows to explore.
- Apache Spark Connector will re-fetch pre-signed urls before they expire to support long running queries.
- Add a new API to list all tables in a share to save network round trips.
- Add a User-Agent header to requests sent from Apache Spark Connector and Python.
- Add an optional expirationTime field to Delta Sharing Profile File Format to provide the token expiration time.

Bug fixes:
- Fix a corner case where list_all_tables may not return correct results in the Python Connector.

## Delta Sharing 0.2.0 (Released on 2021-08-10)
Improvements:
- Added official Docker images for Delta Sharing Server.
- Added an examples project to show how to try the open Delta Sharing Server.
- Added the conf directory to the Delta Sharing Server classpath to allow users to add their Hadoop configuration files in the directory.
- Added retry with exponential backoff for REST requests in the Python connector.

Bug fixes:
- Added the minimum fsspec requirement in the Python connector.
- Fixed an issue when files in a table have no stats in the Python connector.
- Improve error handling in Delta Sharing Server to report 400 Bad Request properly.
- Fixed the table schema when a table is empty in the Python connector.
- Fixed KeyError when there are no shared tables in the Python connector.

## Delta Sharing 0.1.0 (Released on 2021-05-25)
Components:
- Delta Sharing protocol specification.
- Python Connector: A Python library that implements the Delta Sharing Protocol to read shared tables as pandas DataFrame or Apache Spark DataFrames.
- Apache Spark Connector: An Apache Spark connector that implements the Delta Sharing Protocol to read shared tables from a Delta Sharing Server. The tables can then be accessed in SQL, Python, Java, Scala, or R.
- Delta Sharing Server: A reference implementation server for the Delta Sharing Protocol for development purposes. Users can deploy this server to share existing tables in Delta Lake and Apache Parquet format on modern cloud storage systems.
2 changes: 1 addition & 1 deletion examples/README.md
@@ -6,5 +6,5 @@ The profile file from the open, example Delta Sharing Server is downloaded and l
* For Python examples, Python3.6+, Delta-Sharing Python Connector, PySpark need to be installed, see [the project docs](https://github.com/delta-io/delta-sharing) for details.

### Instructions
* To run the example of PySpark in Python run `spark-submit --packages io.delta:delta-sharing-spark_2.12:0.1.0 ./python/quickstart_spark.py`
* To run the example of PySpark in Python run `spark-submit --packages io.delta:delta-sharing-spark_2.12:0.6.2 ./python/quickstart_spark.py`
* To run the example of pandas DataFrame in Python run `python3 ./python/quickstart_pandas.py`
