Releases: vmware/versatile-data-kit
Versatile Data Kit 0.12
Major features include:
Open-sourcing VDK Operations UI
VDK Operations UI enables data practitioners to efficiently manage (operate and monitor) their data jobs.
It has been used internally at VMware for some time, and the team open-sourced it last month.
Check out more details at the Operations UI VEP
Look forward to the official launch soon.
Documentation Improvements
Significantly simplified and improved the main README and CONTRIBUTING.md, thanks to @gary-tai and @zverulacis.
VDK Meta Jobs Preparation for Alpha release
Implemented a limit on how many jobs can be started at once, configured with:
META_JOBS_MAX_CONCURRENT_RUNNING_JOBS=<number>
Learn more about the VDK Meta Jobs features in VDK Meta Jobs VEP
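To illustrate what such a limit means in practice, here is a minimal sketch (not the plugin's implementation) that reads the setting from the environment and starts queued jobs only while the number of running jobs is below the limit. The helper names `read_limit` and `start_allowed` and the fallback value of 5 are hypothetical.

```python
import os
from collections import deque


def read_limit(default: int = 5) -> int:
    # The fallback value is an assumption made for this sketch only.
    return int(os.environ.get("META_JOBS_MAX_CONCURRENT_RUNNING_JOBS", default))


def start_allowed(queued: deque, running: set, limit: int) -> list:
    """Move jobs from `queued` to `running` until `limit` is reached."""
    started = []
    while queued and len(running) < limit:
        job = queued.popleft()
        running.add(job)
        started.append(job)
    return started
```

Jobs that do not fit under the limit stay in the queue and can be started on a later call, once a running job has finished and been removed from `running`.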
Started initiative to support multiple python versions
We are working on introducing an optional python_version property to the Control Service API, which allows users to specify the Python version they want to use for their job deployment. This means users no longer have to rely on the service administrator to make changes to the configuration and can deploy their jobs with the version they need.
See more information in the Multiple Python Versions VEP
Started initiative to create Secrets Interface
Until now, the recommended way to store secrets in VDK was the Properties API. Although it works well, it does not meet the criteria for storing restricted data, and likely not confidential data either.
The team is working on a Secrets interface, similar to the Properties interface, backed by HashiCorp Vault.
See more information in the Vault Integration For Secrets Storage VEP
What's Changed
- control-service: Update helm template license headers by @DeltaMichael in #1670
- control-service: don't include needless service token by @tozka in #1679
- control-service: protect against v1 cron batch not being supported by @murphp15 in #1777
- docs: Documentation improvement: improve main README VDK images - gif by @zverulacis in #1676
- docs: Fix CICD build for control service by @murphp15 in #1790
- docs: Hotfix: Update the URL for gif by @zverulacis in #1702
- docs: Improve documentation: before and after image by @zverulacis in #1696
- docs: Issue 1574: Update Contributing document by @gary-tai in #1741
- docs: Update VEP-1507 with CI/CD description by @DeltaMichael in #1707
- docs: update contributor documentation by @DeltaMichael in #1779
- frontend: Create build and release jobs for vdk/shared package by @DeltaMichael in #1672
- frontend: Fix shebang in build_shared.sh by @DeltaMichael in #1743
- frontend: Rename vdk/shared to versatiledatakit/shared by @DeltaMichael in #1694
- frontend: bump data-pipelines dependency versions by @DeltaMichael in #1706
- frontend: compile fix by @ivakoleva in #1697
- frontend: data-pipelines build and test running CI/CD by @ivakoleva in #1671
- frontend: data-pipelines publish artifact CI/CD by @ivakoleva in #1675
- frontend: data-pipelines sync to freezepoint by @ivakoleva in #1746
- frontend: e2e tests (#6) by @ivakoleva in #1633
- frontend: e2e tests wiring and clientid by @ivakoleva in #1754
- frontend: gitlab ci variables for change locations by @ivakoleva in #1685
- frontend: package naming fixes by @ivakoleva in #1695
- specs: Add user journeys to VEP-1507 by @DeltaMichael in #1750
- specs: VDK Enhancement Proposal for Metajobs (complete) by @yonitoo in #1757
- specs: VEP-1493: Vault Integration For Secrets Storage by @dakodakov in #1756
- specs: VEP-1739 Support for Multiple Python Versions by @doks5 in #1748
- specs: edit vep-1507 for style and grammar by @DeltaMichael in #1758
- specs: vep-1416: Change status of vep by @doks5 in #1721
- vdk-audit-plugin: expand forbidden events list by @mivanov1988 in #1683
- vdk-heartbeat: fix test failures by @tozka in #1703
- vdk-impala: Add optional parameter for staging table prefix by @sbuldeev in #1666
- vdk-ipython: add finalize job by @duyguHsnHsn in #1744
- vdk-ipython: change the way job_input is introduced in a notebook by @duyguHsnHsn in #1678
- vdk-jupyter: jupyter UI refactor code and handle react-dom version mismatch by @duyguHsnHsn in #1727
- vdk-jupyter: Jupyter UI change the strategy on storing the data from user input by @duyguHsnHsn in #1624
- vdk-jupyter: add deploy job component to the UI by @duyguHsnHsn in #1704
- vdk-jupyter: add py-to-ts-interfaces to build by @duyguHsnHsn in #1765
- vdk-jupyter: load data before vdk operations by @duyguHsnHsn in #1762
- vdk-jupyter: remove py-to-ts-interfaces because of build problems by @duyguHsnHsn in #1691
- vdk-jupyter: rename python module by @duyguHsnHsn in #1784
- vdk-meta-jobs: Introduce configuration module by @gageorgiev in #1692
- vdk-meta-jobs: first version of the VEP Summary and Motivation sections by @yonitoo in #1698
- vdk-meta-jobs: implement a limit on starting jobs at once by @yonitoo in #1681
- vdk-meta-jobs: make _start_job() limit hit log more descriptive by @yonitoo in #1771
- vdk-plugins: bump docker version and resuse common dind by @tozka in #1701
- vdk-plugins: generate test report by @tozka in #1732
- vdk-plugins: upgrade docker used by @tozka in #1693
- vdk-quickstart: remove use of global varaibles in CI by @tozka in #1726
- vdk-trino: stabilize vdk-trino tests by @tozka in #1677
- versatile-data-kit: add hosts entry to gitlab runners by @ivakoleva in #1731
- versatile-data-kit: automate merging of dependabot PRs by @tozka in #1725
- versatile-data-kit: dependabot auto-merge fix by @ivakoleva in #1688
- versatile-data-kit: mechanism for targetted PR notifications. by @tozka in #1733
- versatile-data-kit: pre-commit hook for (S)CSS/JS/TS/HTML formatting by @ivakoleva in #1684
- versatile-data-kit: simplify release process by @tozka in #1668
New Contributors
Full Changelog: v0.11...v0.12
Versatile Data Kit 0.11
Major features include:
Introduce data quality checks (pre-alpha) (for scd1 template)
Allows quality checks to be run before the data is inserted into the target table.
Currently, the checks in the processing step do not verify that the data is semantically correct, so bad data could end up in the target table.
Example:
def sample_check_true(tmp_table_name):
    # The check fails when the staging table name contains "bad"
    return "bad" not in tmp_table_name

template_args["check"] = sample_check_true
job_input.execute_template(
    template_name="load/dimension/scd1",
    template_args=template_args,
)
Jobs Query API (GraphQL) wildcard matching filter for team and job names
When querying information about jobs, users of the Jobs Query API can now use wildcard patterns, for example *search*, in GraphQL filters for job name and team name.
Exact matching of search strings continues to work as before.
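The semantics of such a filter can be sketched in a few lines. This is only an illustration of `*` wildcard matching versus exact matching, not the Control Service's implementation; `matches` is a hypothetical helper.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Return True if `value` matches `pattern`, where `*` matches any run of characters."""
    if "*" not in pattern:
        return pattern == value  # exact matching, as before
    # Escape every literal part, then join the parts with `.*` in place of `*`
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, value) is not None
```

With this sketch, `matches("*search*", "my-search-job")` is true, while the exact filter `matches("search", "my-search-job")` is false.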
Provide User Agent when using VDK CLI
Users want to determine where requests originated when analyzing and browsing telemetry data about VDK Control Service usage. The user agent can now be set as an environment variable:
export VDK_CONTROL_SERVICE_USER_AGENT=foo
or in config.ini:
[vdk]
vdk_control_service_user_agent=foo
If not set, it defaults to "vdk-control-cli/{version} ({os.name}; {sys.platform})" followed by the Python version.
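As a rough sketch of how such a default could be assembled and overridden, assuming the environment variable takes precedence when set: the function name `resolve_user_agent`, the `version` argument, and the exact rendering of the Python version suffix are assumptions, not the CLI's actual code.

```python
import os
import platform
import sys


def resolve_user_agent(version: str) -> str:
    """Return the explicitly configured user agent, or a default one."""
    override = os.environ.get("VDK_CONTROL_SERVICE_USER_AGENT")
    if override:
        return override
    # The suffix format for the Python version is assumed for illustration.
    return (
        f"vdk-control-cli/{version} ({os.name}; {sys.platform}) "
        f"python/{platform.python_version()}"
    )
```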
New plugin: vdk-notebook
A new VDK plugin that supports running data jobs which consist of .ipynb files. See the VDK Notebook plugin page for more information.
vdk-ipython
This extension introduces a magic command for Jupyter. The command lets users load job_input for their current data job and use it freely while working with Jupyter.
See the VDK ipython plugin page for more information.
Installation
Check the installation page
What's Changed
- control service: remove deprecated dependency by @murphp15 in #1589
- control-service: Remove dependency on old docker image which is not needed by @murphp15 in #1548
- control-service: Fixed data job status in case of OOM by @mivanov1988 in #1586
- control-service: base-job-image: automatic image cleanup by @mivanov1988 in #1636
- control-service: Cronjob API backwards compatibility by @doks5 in #1580
- control-service: Fix release step in pipeline by @murphp15 in #1550
- control-service: Remove duplicated CICD job runs by @mivanov1988 in #1596
- control-service: cleanup tests to ease testing on control service (v2) by @murphp15 in #1607
- control-service: cleanup tests to ease testing on control service by @murphp15 in #1604
- control-service: configurable job initContainer resources by @mivanov1988 in #1599
- control-service: graphql revert part of the wildcard filter matching by @mrMoZ1 in #1615
- control-service: handle init container OOM by @mivanov1988 in #1658
- control-service: integration tests refactoring by @mivanov1988 in #1562
- control-service: java 17 by @murphp15 in #1439
- control-service: only delete file in test path by @murphp15 in #1545
- control-service: remove deprecated classes in codebase. by @murphp15 in #1611
- control-service: remove old kerberous test dependency by @murphp15 in #1539
- control-service: upgrade gradle. by @murphp15 in #1543
- control-service: use latest docker image by @murphp15 in #1538
- frontend: data-pipelines (#1) bundle root by @ivakoleva in #1626
- frontend: data-pipelines (#2) lib bundle root by @ivakoleva in #1629
- frontend: data-pipelines (#3) lib bundle sources by @ivakoleva in #1630
- frontend: data-pipelines (#4) ui bundle root by @ivakoleva in #1631
- frontend: data-pipelines (#5) ui bundle sources by @ivakoleva in #1632
- frontend: open source shared components package by @DeltaMichael in #1618
- job-base-image-secure: remove unused parameter from publication script by @mivanov1988 in #1650
- job-builder: address docker image vulnerabilities by @mivanov1988 in #1523
- job-builder: fix ci/cd steps by @mivanov1988 in #1555
- job-builder: introduced secure base-job-image by @mivanov1988 in #1546
- job-builder: remove toybox from the base job image by @mivanov1988 in #1552
- vdk-cicd: cleanup cicd rules by @murphp15 in #1554
- vdk-cicd: during a scheduled run publish_artifacts and release shouldn't run by @murphp15 in #1551
- vdk-control-service: fix null dereferences by @dakodakov in #1512
- vdk-control-service: fix possible NPE by @dakodakov in #1522
- vdk-control-service: potential resource leak fixes by @dakodakov in #1513
- vdk-core: track configuration sensitivity by @DeltaMichael in #1579
- vdk-frontend: docker image for running end to end tests in gitlab by @murphp15 in #1563
- vdk-frontend: include readmes for the data-pipelines folders by @murphp15 in #1598
- vdk-frontend: open source readmes by @murphp15 in #1537
- vdk-impala: Introduce checks for scd1 template by @sbuldeev in #1472
- vdk-jobs-troubleshooting: Run troubleshooting server as deamon thread by @dakodakov in #1499
- vdk-jupyter: add create job command to jupyter by @duyguHsnHsn in #1581
- vdk-jupyter: add download job command to jupyter by @duyguHsnHsn in #1492
- vdk-jupyter: create iPython extension by @duyguHsnHsn in #1483
- vdk-jupyter: fixes on tsconfig and bad file naming by @duyguHsnHsn in #1594
- vdk-jupyter: improve error handling on the UI by @duyguHsnHsn in #1528
- vdk-jupyter: make VEP more accessible and informative by @duyguHsnHsn in #1635
- vdk-jupyter: modify the way we read notebooks in notebook plugin by @duyguHsnHsn in #1520
- vdk-jupyter: modify the way we work with notebooks in notebook plugin by @duyguHsnHsn in #1564
- vdk-jupyter: ui end-to-end testing by @duyguHsnHsn in #1617
- vdk-jupyter: vdk-notebook README improvements by @duyguHsnHsn in #1642
- vdk-meta-jobs: Better error message for misspelled job name by @gageorgiev in #1592
- vdk-snowflake: upgrade to Python 3.11 by @tozka in #1609
- vdk-spec: cleanup template by @murphp15 in #1518
- vdk-spec: describe package publishing by @murphp15 in #1536
- vdk-spec: folder structure by @murphp15 in #1525
- vdk-spec: remove api section because the frontend will have no impact on the api by @murphp15 in #1524
- vdk-spec: summary, glossary, motivation by @murphp15 in #1521
- vdk-test-utils: add cli_assert_output_contains by @tozka in #1540
- versatile-data-kit: update changelog instructions by @tozka in #1541
- versatile-data-kit: Meta Job example by @gageorgiev in #1640
- versatile-data-kit: copyright notice year update by @ivakoleva in #1634
- versatile-data-kit: git pre-commit hooks config by @i...
Versatile Data Kit 0.10
Summary
Major features include:
vdk-jobs-troubleshooting - new plugin
Introduces thread-dump capabilities in the Data Jobs
See more details in the plugin home page and the VDK Enhancement Proposal
Support for Python 3.11
Introduces support for Python 3.11 in vdk-core and other plugins
Package versions
See installation instructions here.
The versions of VDK components released under VDK 0.10 are:
Main components
control-service 1.5.707959356
vdk-core==0.3.723457889
Plugins
vdk-lineage-model==0.0.723435904
vdk-meta-jobs==0.1.723435904
vdk-sqlite==0.1.730902357
vdk-jobs-troubleshooting==0.2.741769066
vdk-lineage==0.3.723435904
vdk-control-cli==1.3.736732752
What's Changed
- control-service: add docs on using different versions of k8s by @murphp15 in #1473
- control-service: fix secret in helm chart by @murphp15 in #1379
- control-service: graphql wildcard matching filter for team and job names by @mrMoZ1 in #1459
- control-service: latest graphql version by @murphp15 in #1384
- control-service: migrate from springfox to springdocs by @murphp15 in #1424
- control-service: release helm chart with correct image tag by @murphp15 in #1383
- control-service: reset termination status when job is disabled by @doks5 in #1405
- control-service: run ci on gradle version change by @murphp15 in #1371
- control-service: run release test on dependency version change by @tozka in #1400
- control-service: set registry name correctly. by @murphp15 in #1331
- control-service: use correct secret type by @murphp15 in #1370
- control-service: user-agent tag should have the correct format by @murphp15 in #1412
- examples: clarify README sample anonymize plugin by @tozka in #1394
- Update README.md by @dimirapetrova in #1373
- Update README.md for INSERT by @dimirapetrova in #1364
- vdk-control-cli and some plugins: Support for 3.11 by @tozka in #1409
- vdk-control-cli: address vulnerability in python dependency by @tozka in #1470
- vdk-control-cli: allow cli users to explicitly set the user agent tag by @murphp15 in #1403
- vdk-core: get_managed_connection should return opened connection by @tozka in #1410
- vdk-core: support for 3.11 by @tozka in #1395
- vdk-gitlab: upgrade the gitlab runner to latest version by @tozka in #1398
- vdk-gitlab-runners: increase concurrent pipelines by @tozka in #1396
- vdk-jobs-trobleshooting: Introduce plugin API and configuration by @doks5 in #1447
- vdk-jobs-troubleshooting: add thread-dump utility by @doks5 in #1456
- vdk-jobs-troubleshooting: improve robustness of the plugin by @dakodakov in #1487
- vdk-jobs-troubleshooting: release the plugin by @dakodakov in #1481
- vdk-jupyter: splitting functionalities of vdk-notebook Cell class by @duyguHsnHsn in #1465
- vdk-jupyter: add create job command to jupyter front-end extension by @duyguHsnHsn in #1478
- vdk-jupyter: add delete job command to jupyter by @duyguHsnHsn in #1488
- vdk-jupyter: changes on diagrams and definition in notebook-plugin section in VEP by @duyguHsnHsn in #1427
- vdk-jupyter: create notebook-plugin by @duyguHsnHsn in #1411
- vdk-jupyter: deleting the yarn.lock file because of security issue by @duyguHsnHsn in #1382
- vdk-jupyter: notebook-plugin by @duyguHsnHsn in #1415
- vdk-jupyter: python subprocess security problem by @duyguHsnHsn in #1463
- vdk-jupyter: run VDK job by @duyguHsnHsn in #1454
- vdk-jupyter: VEP - adding the definition of Notebook step by @duyguHsnHsn in #1386
- vdk-lineage, vdk-lineage-model, vdk-meta-jobs: support for Python 3.11 by @tozka in #1448
- vdk-plugins: introduce vdk-jobs-troubleshooting plugin by @doks5 in #1428
- vdk-sqlite: support for Python 3.11 by @tozka in #1466
- vdk-trino: support for Python 3.11 by @tozka in #1471
- vep-1416: address feedback and update proposal by @doks5 in #1491
- versatile-data-kit: VEP-1416 vdk-troubleshooting-tools by @doks5 in #1423
New Contributors
- @dimirapetrova made their first contribution in #1364
Full Changelog: v0.9...v0.10
Versatile Data Kit 0.9
Summary
Major features include:
vdk-meta-jobs new plugin
Using this plugin, you can specify dependencies between data jobs as a directed acyclic graph (DAG).
For example
from vdk.plugin.meta_jobs.meta_job_runner import MetaJobInput

def run(job_input):
    jobs = [
        {
            "job_name": "name-of-job",
            "team_name": "team-of-job",
            "fail_meta_job_on_error": True,  # or False
            "depends_on": ["name-of-job1", "name-of-job2"],
        },
        # ... more jobs
    ]
    MetaJobInput().run_meta_job(jobs)
See more details in the plugin home page
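To make the DAG idea concrete, the following sketch derives a valid execution order from the `depends_on` lists using the standard-library `graphlib` module (Python 3.9+). It only illustrates dependency ordering; it is not how the plugin schedules jobs, and `execution_order` and the job names are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_order(jobs: list) -> list:
    """Return job names ordered so each job comes after its dependencies."""
    graph = {job["job_name"]: set(job.get("depends_on", [])) for job in jobs}
    # static_order() raises graphlib.CycleError if the graph is not acyclic
    return list(TopologicalSorter(graph).static_order())


jobs = [
    {"job_name": "ingest", "team_name": "team-a", "depends_on": []},
    {"job_name": "transform", "team_name": "team-a", "depends_on": ["ingest"]},
    {"job_name": "report", "team_name": "team-a", "depends_on": ["transform"]},
]
print(execution_order(jobs))
```

A cyclic definition, such as two jobs depending on each other, is rejected with an error instead of producing an order, which is the property that makes the job graph "acyclic".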
Control Service security hardening
- Options for jobs to run in read-only file system
- Credentials configuration for using private images by the Control Service
- Use of a separate file system for storing temporary user-supplied files by the Control Service
- Enhanced job upload validation against zip exploits and disallowed files
Data Job Upload validation allow list
During the installation of Control Service administrators can limit what type of files can be uploaded as part of a data job.
A new configuration option, uploadValidationFileTypesAllowList, has been added. It is a comma-separated list of file types.
For example, with
uploadValidationFileTypesAllowList=image/png,text/plain
only PNG images and plain text files can be uploaded; upload requests containing other file types will fail.
See more details in helm chart documentation
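The effect of the allow list can be illustrated with a short sketch. It is not the Control Service's implementation; `parse_allow_list` and `is_upload_allowed` are hypothetical helpers, and treating an empty list as "no restriction" is an assumption.

```python
def parse_allow_list(value: str) -> set:
    """Split a comma-separated list of file types, ignoring blanks."""
    return {item.strip() for item in value.split(",") if item.strip()}


def is_upload_allowed(detected_type: str, allow_list: set) -> bool:
    # Assumption: an empty allow list means no restriction is configured.
    return not allow_list or detected_type in allow_list


allowed = parse_allow_list("image/png,text/plain")
print(is_upload_allowed("text/plain", allowed))       # True
print(is_upload_allowed("application/zip", allowed))  # False
```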
vdk-logging-format - new plugin
This plugin allows for the configuration of the format of VDK logs.
Previously there were separate plugins for each format, but they are now deprecated in favour of this one.
The plugin introduces a new configuration option, LOGGING_FORMAT, with possible values 'json', 'ltsv' and 'text':
export LOGGING_FORMAT=JSON
Control Service helm chart support for Postgres
The Bitnami PostgreSQL chart has been added as an embedded database option for control-service metadata storage.
Users can now enable it during installation (where <chart> stands for the Control Service chart reference):
helm install vdk-control-service <chart> --set postgresql.enabled=true --set cockroachdb.enabled=false
Package versions
See installation instructions here.
The versions of VDK components released under VDK 0.9 are:
Main components
control-service 1.5.707959356
vdk-core==0.3.692414765
Plugins
vdk-logging-json==0.1.693641831
vdk-meta-jobs==0.1.684477187
vdk-postgres==0.0.692283840
vdk-trino==0.4.703555598
What's Changed
- control-service: Container read-only file system by @gageorgiev in #1291
- control-service: Expose LOGGING_FORMAT through helm chart by @gageorgiev in #1329
- control-service: a directory can be manually set as a location to store databjobs when processing them to git. by @murphp15 in #1290
- control-service: add empty dir storage by @murphp15 in #1293
- control-service: add support for allowlist in helm chart. by @murphp15 in #1283
- control-service: add tests for some zip exploits by @tozka in #1266
- control-service: builder base image in helm by @murphp15 in #1359
- control-service: builder images load secrets from k8s by @murphp15 in #1358
- control-service: create the secret in the correct namespace. by @murphp15 in #1318
- control-service: deprecated jobsList endpoint cleanup by @ivakoleva in #1296
- control-service: fix helm template by @murphp15 in #1295
- control-service: fix ingress template by @murphp15 in #1277
- control-service: helm chart for private builder by @murphp15 in #1336
- control-service: namespace can be null by @murphp15 in #1349
- control-service: postgresql embedded by @ivakoleva in #1273
- control-service: refactor db query to mitigate race condition by @mrMoZ1 in #1269
- control-service: release newer version of job builder by @murphp15 in #1362
- control-service: set registry name correctly. by @murphp15 in #1323
- control-service: test cleanup with the goal of making tests easier to run locally by @murphp15 in #1343
- control-service: upload validation by @tozka in #1268
- vdk-jupyter: Expand details on extensions design by @duyguHsnHsn in #1304
- quickstart-vdk: Include vdk-logging-format by @gageorgiev in #1313
- vdk-audit: set python requires >= 3.8 by @tozka in #1289
- vdk-control-api-auth: Fix error message formatting by @gageorgiev in #1303
- vdk-control-cli: fix cicd by @mrMoZ1 in #1327
- vdk-control-cli: update doc for deployment of multiple jobs w/single command by @mrMoZ1 in #1325
- vdk-core: Allow for modification of dynamic params by @doks5 in #1267
- vdk-core: resolve library error classification on startup by @mrMoZ1 in #1241
- vdk-events: add presentation slides of DSC event by @tozka in #1335
- vdk-jupyter: introduce JupterLab extension by @duyguHsnHsn in #1338
- vdk-logging-format: Fix path to readme in setup.py by @gageorgiev in #1322
- vdk-logging-format: Join JSON and LTSV logging plugins into one by @gageorgiev in #1312
- vdk-logging-json, vdk-logging-ltsv: Delete deprecated plugins by @gageorgiev in #1319
- vdk-meta-jobs: Initial implementation by @tozka in #1249
- vdk-postgres: add ingest plugin by @tozka in #1314
- vdk-trino: Fix typo in the documentation by @tozka in #1340
New Contributors
- @dependabot made their first contribution in #1299
Full Changelog: v0.8...v0.9
Versatile Data Kit 0.8
Summary
Major features include:
New plugin: VDK Audit
This plugin provides the ability to audit and potentially limit user operations. It requires Python 3.8 or newer. These operations can be deep within the Python runtime or standard libraries, such as dynamic code compilation, module imports, or OS command invocations.
If we want to forbid some os.* operations we can do it like this:
export AUDIT_HOOK_ENABLED=true
export AUDIT_HOOK_FORBIDDEN_EVENTS_LIST='os.removexattr;os.rename;os.rmdir;os.scandir'
export AUDIT_HOOK_EXIT_ON_FORBIDDEN_EVENT=true
vdk run <job-name>
See more details in the vdk-audit plugin page
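Python 3.8 introduced runtime audit hooks (PEP 578), which provide this kind of visibility into operations such as `os.rename`. The sketch below shows the general mechanism with `sys.addaudithook`; it is an illustration under assumptions rather than the plugin's code, and `make_audit_hook` and `install` are hypothetical names.

```python
import sys


def make_audit_hook(forbidden_events: str, on_violation):
    """Build a hook that calls `on_violation(event)` for forbidden events.

    `forbidden_events` is a semicolon-separated list, e.g. 'os.rename;os.rmdir'.
    """
    forbidden = {e.strip() for e in forbidden_events.split(";") if e.strip()}

    def hook(event: str, args: tuple) -> None:
        if event in forbidden:
            on_violation(event)

    return hook


def install(forbidden_events: str) -> None:
    # Audit hooks cannot be removed once added, so install only once.
    def abort_on_violation(event: str) -> None:
        sys.exit(f"Forbidden operation detected: {event}")

    sys.addaudithook(make_audit_hook(forbidden_events, abort_on_violation))
```

Separating `make_audit_hook` from `install` keeps the matching logic testable without registering a permanent hook in the interpreter.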
Any version of python in VDK Control Service
Jobs deployed by the Control Service can now automatically use any version of Python, not just 3.7.
Insert only impala load template
This template can be used to load raw data from a Data Lake into a target table in a Data Warehouse. In summary, it appends all records from the source table to the target table. As with all other SQL modeling templates, the schema is validated, the table is refreshed, and statistics are computed when necessary.
Example:
def run(job_input):
    # . . .
    template_args = {
        'source_schema': 'source',
        'source_view': 'view_source',
        'target_schema': 'target',
        'target_table': 'destination_table'
    }
    job_input.execute_template('insert', template_args)
See more details in the template documentation page
Package versions
See installation instructions here.
The versions of VDK components released under VDK 0.8 are:
Main components
control-service 1.5.671965442
vdk-core==0.3.662978536
Plugins
vdk-ingest-http==0.2.670842377
vdk-impala==0.4.672320306
What's Changed
- control-service: CVE fix - upgrade commons-text by @tozka in #1255
- control-service: Dynamic python site-packages directory detection by @mivanov1988 in #1247
- control-service: fix cicd deployment by @tozka in #1226
- control-service: fix integration tests by @tozka in #1211
- control-service: fix race condition in test by @murphp15 in #1227
- control-service: refactor job cancellation method due to 404 errors by @mrMoZ1 in #1114
- control-service: remove executables from secure job builder by @mivanov1988 in #1202
- control-plane: better error logging for transient error in tests by @murphp15 in #1222
- control-service: improve docs and local runability of integration tests by @murphp15 in #1217
- control-service: upgrade java client k8s version by @murphp15 in #1216
- vdk-core: errors occurred and the state (handled or not) context missing by @ivakoleva in #1182
- vdk-core: errors occurred and the state (handled or not) context missing by @tozka in #1212
- vdk-core: platform error no longer logged when skipping execution steps by @mrMoZ1 in #1223
- vdk-impala: Fix parsing while analysing profile for lineage information by @kostoww in #1206
- vdk-impala: Refactor query classifier for data lineage by @kostoww in #1239
- vdk-impala: improve explanation in readme by @tozka in #1248
- vdk-impala: stop using errors.get_exception_message by @tozka in #1224
- vdk-impala: update documentation with link by @tozka in #1237
- vdk-ingest-http: Adopt simplejson in place of json by @doks5 in #1229
- vdk-ingest-http: Move data conversion above size calc by @doks5 in #1245
- vdk-ingest-http: fix default value for backoff factor, add retry test by @dakodakov in #1218
- vdk-plugins: fix broken link by @tozka in #1204
- vdk-plugins: introduced vdk-audit plugin by @mivanov1988 in #1221
- vdk-plugins: run tests on release of vdk-core by @tozka in #1210
- vdk-plugins: set dind tempalte job for default build of plugins by @tozka in #1225
- versatile-data-kit: required approving reviewers update by @ivakoleva in #1220
- versatile-data-kit: update contributing.md by @tozka in #1214
New Contributors
Full Changelog: v0.7...v0.8
v0.7
Summary
Major features include:
VDK Template running state detection capability
Since template executions are autonomous data job runs, we need to be able to determine whether a template is running at any given time, for example to distinguish between root data job finalization and template finalization.
For example, if we want to send telemetry somewhere:
@hookimpl
def finalize_job(self, context: JobContext) -> None:
    template = context.core_context.state.get(ExecutionStateStoreKeys.TEMPLATE_NAME)
    if template:
        telemetry.send(phase="finalize_template", template_name=template)
    else:
        telemetry.send(phase="finalize_job", job_name=context.name)
New Logging configuration LOG_LEVEL_MODULE
Enables users to temporarily override the log level per module (e.g. to increase the verbosity of a certain module for debugging or prototyping).
For example assuming default log level is INFO we can enable verbose logs for 2 modules "vdk.api" and "custom.module":
export LOG_LEVEL_MODULE="vdk.api=DEBUG;custom.module=DEBUG"
vdk run job-name
Or in specific job config.ini:
[vdk]
log_level_module=vdk.api=DEBUG;custom.module=DEBUG
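The value format shown above (semicolon-separated `module=LEVEL` pairs) can be parsed and applied with the standard `logging` module roughly as follows. This is a sketch of the idea rather than vdk-core's code; `apply_log_level_module` is a hypothetical helper.

```python
import logging


def apply_log_level_module(value: str) -> dict:
    """Parse 'module=LEVEL;module=LEVEL' and set each logger's level."""
    applied = {}
    for entry in value.split(";"):
        if not entry.strip():
            continue
        module, _, level = entry.partition("=")
        module, level = module.strip(), level.strip().upper()
        # setLevel accepts level names and raises ValueError for unknown ones
        logging.getLogger(module).setLevel(level)
        applied[module] = level
    return applied
```

Loggers that are not listed keep inheriting the default level from their parent logger, which is how only the named modules become more verbose.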
New plugin backend for Properties: from local file system
A simple plugin that allows a developer or presenter to quickly store properties on the local file system.
It can be used to store secrets and configuration for a dev or demo session without requiring the entire Control Service to be installed and running.
It can be used to test a job run locally only without updating the state of the deployed job.
Example:
export PROPERTIES_DEFAULT_TYPE="fs-properties-client"
or in specific job config.ini
[vdk]
properties_default_type=fs-properties-client
Properties are now stored in a local file. The file location can be further configured using FS_PROPERTIES_FILENAME and FS_PROPERTIES_DIRECTORY.
Cookiecutter for new plugins
Creating a new plugin skeleton is now very easy. Run
cookiecutter https://github.com/tozka/cookiecutter-vdk-plugin.git
and follow the instructions.
Add the ability to cancel remaining job steps
Now a job (or a template) can be canceled from any step and all remaining steps in the job (or template) will be skipped.
For example, if a data job processes data from a source that reports no new entries since the last run, the remaining steps can be skipped.
Example:
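from vdk.api.job_input import IJobInput

def run(job_input: IJobInput):
    data = get_last_delta()  # placeholder for the job's own logic
    if not data:
        # Nothing new to process, so skip all remaining steps
        job_input.skip_remaining_steps()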
Package versions
See installation instructions here.
The versions of VDK components released under VDK 0.7 are:
Main components
control-service 1.5.622899758
vdk-control-cli==1.3.626767210
vdk-core==0.3.652866366
Plugins
vdk-properties-fs==0.0.651770458
vdk-kerberos-auth==0.3.631374202
vdk-impala==0.4.651849986
What's Changed
- vdk-control-cli: Drop requirement pluggy to be 0.* by @gageorgiev in #1116
- vdk-core: Add log before query result fetch by @doks5 in #1195
- vdk-core: Fix issue with serializing Decimal values during payload check by @gageorgiev in #946
- vdk-core: add ability to cancel remaining job steps by @mrMoZ1 in #1188
- vdk-core: add new configuration log_level_module by @tozka in #1167
- vdk-core: added default values to write termination message method by @mivanov1988 in #1185
- vdk-core: avoid circular references in print results by @tozka in #1176
- vdk-core: extend classification error test by @tozka in #1180
- vdk-core: fix error classification of vdk code by @tozka in #1173
- vdk-core: fix flakey test in test checking logs output by @murphp15 in #1194
- vdk-core: template running state detection capability by @ivakoleva in #941
- vdk-csv: Updates on vdk-csv README by @duyguHsnHsn in #952
- vdk-impala: Add validation for queries that doesn't provide lineage info by @kostoww in #1175
- vdk-impala: fix error classification in impala by @tozka in #1178
- vdk-impala: fix impala template empty source view usr err by @mrMoZ1 in #1189
- vdk-impala: fixed platform error missclasified when running template by @mrMoZ1 in #944
- vdk-impala: improve vdk-impala documentation by @tozka in #948
- vdk-kerberos-auth: Pinned minikerberos in vdk-kerberos-auth plugin by @mivanov1988 in #1168
- vdk-kerberos-auth: add KerberosClient for authenticating API calls by @tozka in #879
- vdk-plugins: improve plugin project creation with cookiecutter by @tozka in #942
- vdk-properties-fs: new plugin for local FS properties storage by @ivakoleva in #1190
- vep: Jupyter Notebook Integration Goals and Requirements by @duyguHsnHsn in #1165
- vep: Jupyter Notebook Integration by @duyguHsnHsn in #1113
- versatile-data-kit: Without and with VDK image by @zverulacis in #1184
- versatile-data-kit: set automatic java formatter by @tozka in #757
- versatile-data-kit: simplify release process by @tozka in #951
- versatile-data-kit: update contact instructions by @tozka in #1172
New Contributors
Full Changelog: v0.6...v0.7
Versatile Data Kit 0.6
Summary
Major features include:
Configuration auto-wiring improvement: detect non vdk_ prefixed environment variables
Previously, a configuration option had to be prefixed with "vdk_" when set as an environment variable in order to be recognized.
This was very error-prone, since the options are documented without the prefix.
Now they can be set without the prefix as well.
The following are equivalent:
export VDK_DB_DEFAULT_TYPE='impala'
export DB_DEFAULT_TYPE='impala'
If both are set, the "prefixed" variable has a higher priority.
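The lookup order described above can be sketched as follows. This illustrates the precedence rule only and is not vdk-core's implementation; `get_config_value` is a hypothetical helper.

```python
import os


def get_config_value(key: str, environ=os.environ):
    """Look up `key`, preferring the VDK_-prefixed environment variable."""
    name = key.upper()
    prefixed = environ.get(f"VDK_{name}")
    if prefixed is not None:
        return prefixed
    # Fall back to the variable without the prefix (None if also unset)
    return environ.get(name)
```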
New plugin/library: vdk-lineage-model
VDK Lineage Model plugin aims to abstract emitting lineage data from VDK data jobs, so that different lineage loggers can be configured at run time in any plugin that supports emitting lineage data
Check out more at the plugin page.
New export-csv command
Alongside vdk ingest-csv, which enables users to import (ingest) CSV data into a table, users can now export the result of a SQL query to CSV with a simple command:
vdk export-csv -q "select * from my_table" --file 'output.csv'
Check out more at the plugin page.
In memory properties client
Until now, properties required the Control Service in order to work. Sometimes, for prototyping and testing purposes, you do not need to connect to external services.
- A new configuration value can be set.
In a specific job's config file (config.ini):
[vdk]
properties_default_type = memory
Or as an environment variable
export PROPERTIES_DEFAULT_TYPE="memory"
- The properties are then kept entirely in memory, which means they are "deleted" after the job's run.
New example: Ingest and anonymize
An example of how to anonymize any data being ingested using VDK with a plugin.
Check out more at the example page
New example: Airflow integration
An example of how to create dependencies between data jobs in Airflow.
Check out more at the example page
Package versions
See installation instructions here.
The versions of VDK components released under VDK 0.6 are:
Main components
control-service 1.5.620438292
vdk-core==0.3.620677184
Plugins
airflow-provider-vdk==0.0.602273476
vdk-lineage-model==0.0.581430542
vdk-kerberos-auth==0.3.584577337
vdk-ingest-http==0.2.616713987
vdk-impala==0.4.613570906
vdk-lineage==0.3.604201902
vdk-trino==0.4.605101952
What's Changed
- airflow-provider-vdk: Add hidden fields to VDK Connection by @doks5 in #883
- control-service: Atomic job cancellation by @gageorgiev in #860
- control-service: Fluentd integration for data jobs by @mivanov1988 in #940
- control-service: Secure job builder image by @gageorgiev in #936
- control-service: add default jwt jwk uri by @mrMoZ1 in #873
- control-service: fix the examples in swagger by @tozka in #945
- control-service: fix vdk-server startup issues by @mrMoZ1 in #908
- control-service: increase integration test builder memory by @mrMoZ1 in #929
- control-service: upgrade docker container used in cicd by @mrMoZ1 in #911
- vdk-airflow: populate readme by @tozka in #924
- vdk-control-cli: remove hidden flag for CLI commands by @tozka in #902
- vdk-control-cli: use latest dependencies version during build by @tozka in #903
- vdk-core,vdk-impala,vdk-lineage,vdk-trino: Support for pluggy 1.0 by @gageorgiev in #931
- vdk-core: Add printed output to set-default and reset-default by @gageorgiev in #884
- vdk-core: BaseVdkError exception propagation flaw fix by @ivakoleva in #917
- vdk-core: Improve ingestion error logging by @gageorgiev in #930
- vdk-core: add memory properties client by @tozka in #921
- vdk-core: add option to disable version check by @tozka in #876
- vdk-core: detect non vdk_ prefixed environment values for config by @tozka in #874
- vdk-core: execution result missing exception and blamee fix by @ivakoleva in #938
- vdk-core: hide native cursor from execute hook by @tozka in #875
- vdk-core: make db_default_type case insensitive by @tozka in #935
- vdk-core: show log_level_vdk in help by @tozka in #905
- vdk-core: step loading failure misclassified as Platform error fix by @ivakoleva in #920
- vdk-core: termination message now idempotent by @mrMoZ1 in #909
- vdk-core: vdk_exception hook exit code fix by @ivakoleva in #912
- vdk-core: vdk_exception hook exit code fix by @ivakoleva in #915
- vdk-csv: add export-csv command by @duyguHsnHsn in #934
- vdk-examples: add ingest and anonymize example by @tozka in #922
- vdk-impala, vdk-trino: Remove deprecated use of result field by @gageorgiev in #933
- vdk-impala: Add performance logs by @VladimirPetkov1 in #939
- vdk-impala: Add support for lineage in vdk-impala by @VladimirPetkov1 in #932
- vdk-ingest-http: reduce verbosity of ingestion logs by @tozka in #943
- vdk-kerberos-auth: Separate async event loop by @doks5 in #885
- vdk-lineage-model: Extract Lineage Model in separate plugin by @VladimirPetkov1 in #896
- vdk-server: Pin kubernetes API version by @doks5 in #919
- vdk-server: fix for vdk server crashing on startup by @mrMoZ1 in #907
- vdk-trino, vdk-linage: Switch to vdk-lineage-model by @VladimirPetkov1 in #898
- vdk-trino: fix broken tests by @tozka in #900
- versatile-data-kit: Add Data lifecycle image and minor changes by @zverulacis in #887
- versatile-data-kit: Add getting started, ask for help, PR checklist by @zverulacis in #881
- versatile-data-kit: Add intro part to contributing.md from the template by @zverulacis in #880
- versatile-data-kit: Airflow Documentation by @gageorgiev in #857
- versatile-data-kit: add link to csv example doc by @tozka in #893
- versatile-data-kit: add logo image by @tozka in #877
- versatile-data-kit: make easier slack instructions by @tozka in #925
- versatile-data-kit: update link in examples by @tozka in #892
- versatile-data-kit: update logo for dark mode by @tozka in #878
New Contributors
- @VladimirPetkov1 made their first contribution in #896
- @duyguHsnHsn made their first contribution in #934
Full Changelog: v0.5...v0.6
Versatile Data Kit 0.5
Summary
Major features include:
New managed db_connection_execute_operation hook
The hook enables users to add behavior to existing SQL queries without modifying the code itself.
It is invoked before and after each query, which makes it possible to track the query's full execution. For example:
import logging
import time
from typing import Optional

log = logging.getLogger(__name__)

# hookimpl and ExecutionCursor are provided by vdk-core
@hookimpl(hookwrapper=True)
def db_connection_execute_operation(execution_cursor: ExecutionCursor) -> Optional[int]:
    start = time.time()
    outcome = yield  # we yield the execution so that the query is executed
    end = time.time()
    log.info(f" duration: {end - start}. ")
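The before-and-after behaviour of a wrapper hook relies on a generator that is suspended at its yield while the wrapped operation runs. The following is a simplified, self-contained sketch of that mechanism using only the standard library. It is not how VDK or pluggy drive hooks internally; `timing_wrapper` and `run_wrapped` are hypothetical names used for illustration.

```python
import time
from typing import Callable, Generator, List


def timing_wrapper(durations: List[float]) -> Generator[None, None, None]:
    start = time.time()
    yield  # the wrapped operation runs while the wrapper is suspended here
    durations.append(time.time() - start)


def run_wrapped(operation: Callable[[], int], durations: List[float]) -> int:
    wrapper = timing_wrapper(durations)
    next(wrapper)  # run the "before" part, up to the yield
    try:
        result = operation()  # stand-in for executing the SQL query
    finally:
        # Resume the wrapper so its "after" part runs; StopIteration marks its end.
        try:
            next(wrapper)
        except StopIteration:
            pass
    return result


durations: List[float] = []
rows = run_wrapped(lambda: 3, durations)
print(rows, len(durations))  # prints "3 1"
```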
Airflow Provider VDK release (beta)
Users can integrate with Apache Airflow to orchestrate Data Jobs in a DAG (workflow).
Check out more at airflow-provider-vdk.
What's Changed
- airflow-provider-vdk: Adopt auth plugin by @doks5 in #856
- airflow-provider-vdk: Example DAG by @gageorgiev in #847
- airflow-provider-vdk: Fix VDKSensor templating issue, improve example DAG by @gageorgiev in #852
- control-service: clear execution fail alert when failing with user error by @mrMoZ1 in #850
- control-service: fix graphql team filter not retrieving special chars by @mrMoZ1 in #863
- control-service: improve api message on oom job execution errors by @mrMoZ1 in #861
- documentation improvements by @zverulacis in #853
- vdk-control-cli: Adopt new auth exceptions by @doks5 in #846
- vdk-core: Add unit test for destination_table in empty queue by @doks5 in #865
- vdk-core: Fix destination_table referenced early by @doks5 in #864
- vdk-core: Split execution summary into chunks by @doks5 in #867
- vdk-core: add new managed db_connection_execute_operation hook by @tozka in #805
- vdk-core: fix buggy (false positive) connection unit test by @tozka in #841
- vdk-control-api-auth: New VDK Auth exceptions by @doks5 in #845
- vdk-heartbeat: pipelines-control-service-integration-tests image rebuild by @ivakoleva in #848
- vdk-plugins: Add Managed Database Connection cycle plugin by @tozka in #859
- vdk-test-utils: enable back tests by @tozka in #855
New Contributors
- @zverulacis made their first contribution in #842
Full Changelog: v0.4...v0.5
Versatile Data Kit 0.4
Summary
Major features include:
Standalone Data Job run
Until now, the only way to run a data job was with the CLI command "vdk run". Now users can also run a job entirely programmatically, as a standalone data job.
For example:
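from pathlib import Path

# StandaloneDataJobFactory comes from vdk-core; hook_tracker is a plugin object defined elsewhere
with StandaloneDataJobFactory.create(
    data_job_directory=Path(__file__), extra_plugins=[hook_tracker]
) as job_input:
    print(job_input.get_name())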
Check out more in the new API documentation here
New Plugin: vdk-control-api-auth
A new library plugin (not a runnable plugin) that is intended to be used as a dependency by other plugins that need to authenticate users against the Control Service.
Check out more in the plugin documentation.
What's Changed
- Scenario 3 - Created the Energy Scenario by @alod83 in #781
- [vdk-plugins] vdk-control-api-auth: Add api-token flow by @doks5 in #822
- [vdk-plugins] vdk-control-api-auth: Add authorization code flow by @doks5 in #834
- [vdk-plugins] vdk-control-api-auth: Enable plugin release by @doks5 in #837
- [vdk-plugins] vdk-control-api-auth: Fix query key type by @doks5 in #838
- airflow-provider-vdk: Fix CICD release step by @gageorgiev in #824
- airflow-provider-vdk: VDKOperator execute method by @gageorgiev in #823
- airflow-provider-vdk: VDKOperator initial structure by @gageorgiev in #820
- airflow-provider-vdk: VDKSensor poke method by @gageorgiev in #818
- control-service: Allow for jobs with no schedule to be deployed by @gageorgiev in #835
- control-service: Kerberos authentication IT by @mrMoZ1 in #798
- control-service: cicd unit tests should run on pull requests by @mrMoZ1 in #830
- control-service: kerberos authentication IT test by @mrMoZ1 in #831
- vdk-control-api-auth: Add core auth logic by @doks5 in #815
- vdk-control-cli: Adopt new vdk-control-api-auth library by @doks5 in #840
- vdk-control-cli: Fix command printed on successful deploy by @gageorgiev in #839
- vdk-control-cli: Make schedule_cron config param optional by @gageorgiev in #827
- vdk-core: New feature: StandaloneDataJob by @mrdavidlaing in #793
- vdk-core: encapsulate router-specific properties logic by @ivakoleva in #817
- vdk-core: new version check built-in plugin false positive fix by @ivakoleva in #816
- vdk-core: properties write pre-processing support by @ivakoleva in #819
- vdk-heartbeat: null datetime conversion fix by @ivakoleva in #813
- versatile-data-kit: allow commit with any newer version of python by @tozka in #826
- versatile-data-kit: link examples wiki in the git examples by @tozka in #812
- versatile-data-kit: update readme with clear slack instructions by @tozka in #806
New Contributors
- @mrdavidlaing made their first contribution in #793
Full Changelog: v0.3...v0.4
Versatile Data Kit 0.3
Summary
Major features include:
Support for Kerberos Authentication provider in the Control Service
Alongside support for OAuth2, organizations can now integrate with their Kerberos infrastructure.
Users can specify Kerberos as an authentication provider for accessing the VDK Control Service.
For more information on how to configure Kerberos, see the VDK helm documentation here.
A new plugin: vdk-lineage (alpha)
The VDK Lineage plugin provides lineage information (input data -> job -> output data) for any SQL query executed using VDK, regardless of the database, and sends it to a pre-configured destination using the OpenLineage standard.
We have also introduced a utility command, vdk marquez-server --start, which starts the Marquez UI locally so that lineage can be visualized.
For more information, check out the vdk-lineage plugin documentation.
Support for Kubernetes 1.23
Now VDK Control Service can work seamlessly with the newest versions of Kubernetes and make use of its features:
- VDK Control Service can now work with CronJob controller V2 (alongside V1).
- With TTL Controller, any jobs launched by VDK Control Service can be cleaned up after preconfigured time.
Users can override the VDK version of a deployed data job
Users can now specify the VDK version using either the API or the CLI when deploying a Data Job.
For example, with the CLI it's as simple as vdk deploy --update --vdk-version old-vdk-version
This would enable canary deployments or rolling deployments of VDK.
Introducing VEP (VDK Enhancement Proposal) process and first VEP
Versatile Data Kit has a process in place for proposing and adding large changes in an efficient and consistent manner.
For more information check the process here.
We have also used the process for our first major feature change: Apache Airflow Integration.
Package versions
See installation instructions here.
The versions of VDK components released under VDK 0.3 are:
Main components
control-service 1.5.520417292
vdk-control-cli==1.3.520417292
vdk-core==0.2.520417292
vdk-heartbeat==0.6.520417292
Plugins
vdk-trino==0.3.520417292
vdk-lineage==0.2.520417292
vdk-kerberos-auth==0.3.520417292
vdk-impala==0.3.520417292
What's Changed
- VEP-554: Apache Airflow Integration by @mivanov1988 in #748 and @doks5 in #786
- airflow-provider-vdk: Initial Airflow provider structure by @gageorgiev in #772
- airflow-provider-vdk: Job execution status and logs method by @gageorgiev in #796
- airflow-provider-vdk: Start and cancel job execution methods by @gageorgiev in #778
- airflow-provider-vdk: VDKSensor initial structure by @gageorgiev in #800
- control-service: Adopt kubernetes-client 14.0.1 by @gageorgiev in #761
- control-service: add kerberos auth properties to helm chart by @mrMoZ1 in #764
- control-service: Adopt use of the V1CronJob API by @gageorgiev in #767
- control-service: Bump pipelines-control-service version by @doks5 in #762
- control-service: Set TTLAfterFinished period for K8s CronJobs by @gageorgiev in #776
- control-service: Update CHANGELOG.md by @doks5 in #760
- control-service: add OAuth2 enable/disable flag by @mrMoZ1 in #765
- control-service: add kerberos auth provider by @mrMoZ1 in #755
- control-service: builder job configurable security context by @mivanov1988 in #708
- control-service: configurable builder job service account by @mivanov1988 in #791
- control-service: fix builder security context by @mivanov1988 in #784
- control-service: fix concatAddresses NPE by @mivanov1988 in #782
- control-service: fix job builder unit tests by @mivanov1988 in #792
- control-service: fix log link to set endTime always by @tozka in #735
- vdk-control-cli: Adopt click version 8 by @ivakoleva in #770
- vdk-control-cli: set vdk version and enabled when deploying new job by @tozka in #752
- vdk-core: JobInput get_name and get_job_directory implementation by @ivakoleva in #745
- vdk-core: Verify payload after pre-processing it by @YanaZhivkova in #777
- vdk-core: clarify run descriptions on --arguments option by @tozka in #731
- vdk-core: ensure sql args are subsituted in correct priority by @tozka in #749
- vdk-core: lowercase env variables are inferred as configuration by @tozka in #751
- vdk-core: minor refactoring in managed_cursor to reduce long method by @tozka in #803
- vdk-core: print query duration by @mrMoZ1 in #804
- vdk-core: refactor test to use job_path method by @tozka in #747
- vdk-core: update plugin hook diagrams by @tozka in #775
- vdk-core: Adopt click version 8.0 by @doks5 in #769
- vdk-heartbeat: Fix initial job executions with specific vdk version by @YanaZhivkova in #758
- vdk-heartbeat: Handle execution end_time not string by @doks5 in #750
- vdk-impala: unify names of templates betwen trino and impala by @tozka in #787
- vdk-kerberos-auth: support kerberos auth for all CLI commands by @tozka in #774
- vdk-kerberos-auth: upgrade minikerberos and requests-kerberos to latest by @ivakoleva in #742
- vdk-lineage: introducing POC (pre-alpha) implementation by @tozka in #783
- vdk-plugins: Introduce vdk-control-api-auth plugin by @doks5 in #801
- vdk-snowflake: Enable support for Python 3.10 by @gageorgiev in #746
- vdk-trino: add link to template examples by @tozka in #788
- vdk-trino: collect lineage for select/insert and rename table only by @philip-alexiev in #756
- vdk-trino: fix ingesting value with bool type failing by @tozka in #753
- vdk: add VDK enhancement proposal (VEP) spec template by @tozka in #727
- versatile-data-kit: Update CONTRIBUTING.md with links to coding standard by @tozka in #794
New Contributors
- @philip-alexiev made their first contribution in #756
Full Changelog: 0.2...v0.3