0.12.0 (2024-09-03)
Breaking Changes
- Change connection URL used for generating HWM names of S3 and Samba sources (#304):
  - `smb://host:port` → `smb://host:port/share`
  - `s3://host:port` → `s3://host:port/bucket`
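  For example (a sketch; host, credentials, and bucket below are placeholders, and the exact HWM name format is internal to onETL), an S3 connection like this now produces HWM names scoped to the bucket:

  ```python
  from onetl.connection import S3

  # placeholder host/credentials; the bucket is now part of the connection URL
  # used to derive auto-generated HWM names
  s3 = S3(
      host="s3.domain.com",
      port=9000,
      bucket="my-bucket",
      access_key="...",
      secret_key="...",
  )

  # before 0.12.0: HWM names were derived from "s3://s3.domain.com:9000"
  # since  0.12.0: they are derived from "s3://s3.domain.com:9000/my-bucket",
  # so HWM values stored under the old name will not be reused automatically
  ```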
- Update DB connectors/drivers to latest versions:
  - Clickhouse 0.6.0-patch5 → 0.6.5
  - MongoDB 10.3.0 → 10.4.0
  - MSSQL 12.6.2 → 12.8.1
  - MySQL 8.4.0 → 9.0.0
  - Oracle 23.4.0.24.05 → 23.5.0.24.07
  - Postgres 42.7.3 → 42.7.4
- Update Excel package from 0.20.3 to 0.20.4, to include Spark 3.5.1 support. (#306)
Features
- Add support for specifying file formats (`ORC`, `Parquet`, `CSV`, etc.) in `Hive.WriteOptions.format` (#292):

  ```python
  Hive.WriteOptions(format=ORC(compression="snappy"))
  ```
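  A fuller sketch, assuming an existing SparkSession `spark` and dataframe `df` (cluster and table names are placeholders):

  ```python
  from onetl.connection import Hive
  from onetl.db import DBWriter
  from onetl.file.format import ORC

  hive = Hive(cluster="dwh", spark=spark)  # placeholder cluster name

  writer = DBWriter(
      connection=hive,
      target="schema.table",  # placeholder target table
      options=Hive.WriteOptions(format=ORC(compression="snappy")),
  )
  writer.run(df)  # rows are written as snappy-compressed ORC files
  ```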
- Collect Spark execution metrics in the following methods, and log them in DEBUG mode:
  - `DBWriter.run()`
  - `FileDFWriter.run()`
  - `Hive.sql()`
  - `Hive.execute()`

  This is implemented using a custom `SparkListener` which wraps the entire method call and then reports the collected metrics. Due to Spark's architecture these metrics may sometimes be missing, so they are not a reliable source of information. That's why they are printed only in DEBUG logs, and are not returned as part of the method call result. (#303)
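  To actually see these metrics, raise onETL's log level to DEBUG; a minimal sketch using only the standard `logging` module (the writer call is illustrative):

  ```python
  import logging

  # make onETL's DEBUG records visible
  logging.basicConfig(level=logging.DEBUG)
  logging.getLogger("onetl").setLevel(logging.DEBUG)

  # with an already-configured writer (illustrative):
  # writer.run(df)  # collected metrics are logged here, not returned
  ```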
- Generate default `jobDescription` based on currently executed method. Examples:
  - `DBWriter.run(schema.table) -> Postgres[host:5432/database]`
  - `MongoDB[localhost:27017/admin] -> DBReader.has_data(mycollection)`
  - `Hive[cluster].execute()`

  If user has already set a custom `jobDescription`, it will be left intact. (#304)
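  Setting your own description uses the standard PySpark API; a sketch, assuming an existing SparkSession `spark` and an already-configured `writer`:

  ```python
  # custom description set by user -- onETL will not overwrite it
  spark.sparkContext.setJobDescription("nightly sync: orders -> dwh")
  # writer.run(df)  # Spark UI shows "nightly sync: orders -> dwh"
  ```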
- Add `log.info` about JDBC dialect usage (#305):

  ```
  |MySQL| Detected dialect: 'org.apache.spark.sql.jdbc.MySQLDialect'
  ```
- Log estimated size of in-memory dataframe created by `JDBC.fetch` and `JDBC.execute` methods. (#303)
Bug Fixes
- Fix passing `Greenplum(extra={"options": ...})` during read/write operations. (#308)
- Do not raise exception if yield-based hook has nothing after its first (and only) `yield`.
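  For reference, a minimal sketch of such a hook, assuming the `onetl.hooks` API (`support_hooks`/`slot`/`hook`/`bind`); the class and method names below are hypothetical:

  ```python
  from onetl.hooks import hook, slot, support_hooks

  @support_hooks
  class Processor:  # hypothetical class exposing a hook slot
      @slot
      def run(self):
          return "result"

  @Processor.run.bind
  @hook
  def log_start(self):
      print("before Processor.run()")
      yield  # nothing after the only yield -- no longer raises
  ```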