Releases: vmware/versatile-data-kit
v1.7
Major features include:
vdk-structlog
Setting structlog_config_preset to either LOCAL or CLOUD selects a preset that groups the recommended logging configuration for that use case. Any config options set together with the preset override the preset's options.
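The override rule can be pictured as a dictionary merge in which explicitly set options win over the preset. The sketch below is illustrative only: the preset contents and the option names (`format`, `use_colors`, `metadata`) are made up for the example and are not the plugin's actual preset values. Only the preset names LOCAL and CLOUD come from the release notes.

```python
# Hypothetical preset contents; the real vdk-structlog presets may differ.
PRESETS = {
    "LOCAL": {"format": "console", "use_colors": True, "metadata": "level,logger_name"},
    "CLOUD": {"format": "json", "use_colors": False, "metadata": "timestamp,level,logger_name"},
}


def resolve_logging_config(preset_name, explicit_options):
    """Start from the preset and let explicitly set options win."""
    preset = PRESETS[preset_name.upper()]
    # Later keys override earlier ones, so explicit options take precedence.
    return {**preset, **explicit_options}


config = resolve_logging_config("cloud", {"use_colors": True})
print(config)
```

Here the CLOUD preset supplies `format` and `metadata`, while the explicitly set `use_colors` replaces the preset's value.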
Example RAG Pipeline
An example of how to build an end-to-end chatbot using VDK.
vdk dag local execution
To make testing easier, you can now execute an entire DAG locally on your machine without deploying it.
Make sure all data job directories are on the same level, then set:
export DAGS_JOB_EXECUTOR_TYPE=local
Then run the DAG job as normal:
vdk run dag-job
Alternatively, run it from an IDE and set DAGS_JOB_EXECUTOR_TYPE=local as an environment variable in the run configuration.
See more in the VDK DAG documentation.
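When the local run is triggered from a script rather than a shell, the same two steps (set the variable, run the job) can be prepared in Python. This is a minimal sketch: the helper name `local_dag_run` is hypothetical, and only the `DAGS_JOB_EXECUTOR_TYPE=local` setting and the `vdk run dag-job` command come from the notes above.

```python
import os


def local_dag_run(dag_job_dir, base_env=None):
    """Return the command and environment for a local DAG run."""
    # Copy the environment so the caller's process environment is not modified.
    env = dict(os.environ if base_env is None else base_env)
    env["DAGS_JOB_EXECUTOR_TYPE"] = "local"
    return ["vdk", "run", dag_job_dir], env


command, env = local_dag_run("dag-job", base_env={"PATH": "/usr/bin"})
print(" ".join(command))
```

The result can be passed to `subprocess.run(command, env=env, cwd=jobs_parent_dir, check=True)`, where `jobs_parent_dir` is a placeholder for the directory that contains all the data job directories.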
Support for Python 3.12
Added official support and testing for Python 3.12 in VDK plugins and main components.
What's Changed
- control-service: fix data job deployment by @mivanov1988 in #3110
- control-service: fix data job deployment cpu conversion by @mivanov1988 in #3109
- control-service: optimize job builder pip install by @antoniivanov in #3151
- documentation: fix monthly download badges closing HTML tag by @yonitoo in #3090
- examples: Make RAG examples a bit more generic and demoable by @antoniivanov in #3085
- examples: RAG Chat UI by @gageorgiev in #3108
- examples: RAG Question-answering Web Service by @gageorgiev in #3098
- examples: add Embed and Ingest Confluence JSON data example data job by @yonitoo in #3073
- examples: add Fetch And Embed Data Job Example by @yonitoo in #3065
- examples: add chunker job to support configurable chunking by @yonitoo in #3093
- examples: adopt RAG examples for remote execution by @antoniivanov in #3117
- examples: refactor a bit the examples by @antoniivanov in #3127
- quickstart-vdk: make sure it is released by @antoniivanov in #3113
- specs: update vector database vep with more explanation by @antoniivanov in #3054
- vdk-confluence: add data source plugin for confluence by @duyguHsnHsn in #3094
- vdk-core: Add 'method' to pre_ingest_process API by @doks5 in #3072
- vdk-core: Adopt 'method' argument in pre-process plugins by @doks5 in #3074
- vdk-core: add is_default() function to config by @DeltaMichael in #3076
- vdk-core: do not count memory properties toward the count by @antoniivanov in #3099
- vdk-core: enable/disable structlog based on config by @DeltaMichael in #3102
- vdk-core: relevant info in step result by @DeltaMichael in #3062
- vdk-core: remove structlog logging override by @DeltaMichael in #3066
- vdk-dag: add local executor by @antoniivanov in #3097
- vdk-dag: fix unnecessary authorization failure by @antoniivanov in #3096
- vdk-dag: improve error handling and error messages by @antoniivanov in #3152
- vdk-examples: example job with confluence reader by @duyguHsnHsn in #3070
- vdk-gdp-execution-id: adopt ingester changes by @dakodakov in #3120
- vdk-kerberos-auth: adopt unreleased oscrypto library by @dakodakov in #3130
- vdk-kerberos-auth: adopt unreleased oscrypto library by @dakodakov in #3131
- vdk-kerberos-auth: revert recent changes by @dakodakov in #3134
- vdk-postgres: batch inserts during ingestion by @antoniivanov in #3121
- vdk-quickstart: add vdk-structlog by @duyguHsnHsn in #2956
- vdk-server: fix ingress settings by @antoniivanov in #3101
- vdk-structlog: add default logging format values by @DeltaMichael in #3055
- vdk-structlog: put vdk init logs config behind flag by @DeltaMichael in #3107
- vdk-test-utils: Measure payload size with len, not getsizeof by @gageorgiev in #3157
- vdk-test-utils: adopt ingester changes by @dakodakov in #3119
- versatile-data-kit: Change copyright notice by @gageorgiev in #3116
- versatile-data-kit: Support for Py3.12 by @gageorgiev in #3143
- versatile-data-kit: add link to architecture to contributing.md by @antoniivanov in #3071
Full Changelog: v1.6...v1.7
v1.6
Major features include:
vdk-oracle database plugin
A new Oracle plugin can be used to execute queries against Oracle DB in both thick and thin mode.
Ingesting data is now supported, including automatic schema inference.
For more information, check the vdk-oracle plugin documentation.
vdk-structlog
Various enhancements in vdk-structlog, including syslog handler support, log level parsing, and configuration updates.
Check out more about vdk-structlog in its documentation.
VDK Ingestion into Vector Database for RAG initiative started
![](https://private-user-images.githubusercontent.com/2536458/301182204-80296573-c043-451f-8c90-1d2afab01e1d.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzMDI2NTYsIm5iZiI6MTczOTMwMjM1NiwicGF0aCI6Ii8yNTM2NDU4LzMwMTE4MjIwNC04MDI5NjU3My1jMDQzLTQ1MWYtOGM5MC0xZDJhZmFiMDFlMWQucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIxMSUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMTFUMTkzMjM2WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9ZDQ2NTI4NzY1ZWQ0MWZkOWYzMTgzZDU1ZDAzNzA4NTgwNjNlMmZmZGUyMzAxM2MwYTAwMDI1ZGJjZTVlMTcwYSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.L7kw_euvF8BIoVMdZoppt3wghZetPrx4NDCiZ6XuWJY)
With the rise in popularity of LLMs and RAG, we see VDK as a core component for getting the data where we need it to be. VDK's strengths are ETL tasks, and we see that it's very well suited to populating the databases needed for RAG.
For more information, check out the VEP.
What's Changed
- control-service: Add support for vdk version by @doks5 in #2943
- control-service: base job image - update README.md by @mivanov1988 in #2971
- control-service: fix job builder secure by @mivanov1988 in #2976
- control-service: getting all jobs should now be consistent by @mrMoZ1 in #2978
- control-service: oracle support for vdk-oracle plugin by @mivanov1988 in #3042
- control-service: write resources to db by @mrMoZ1 in #2897
- documentation: Update to READMEs: Add a monthly download badge by @chrfoyer in #2983
- examples: add Confluence data retrieval example Data Job by @yonitoo in #3040
- frontend: [SC-2632] Save Executions grid state by @hzhristova in #2991
- specs: update Vector DB Ingestion VEP by @antoniivanov in #3033
- vdk-control-cli: Modify how data is written to job config.ini by @doks5 in #2982
- vdk-core: Short job summary by @gageorgiev in #3038
- vdk-core: replace deprecated imp module with importlib by @yonitoo in #2981
- vdk-core: replace report_and_rethrow by @DeltaMichael in #3056
- vdk-impala: handle decorate operation errors by @DeltaMichael in #2975
- vdk-jupyter: e2e tests fix by @duyguHsnHsn in #2966
- vdk-jupyter: update npm packages by @antoniivanov in #3043
- vdk-kerberos-auth: fix dependency issue by @murphp15 in #2969
- vdk-license: change year by @duyguHsnHsn in #2987
- vdk-lineage: pin sqllineage version by @duyguHsnHsn in #3005
- vdk-oracle: add new config options (host,port,sid,thick_mode_lib_dir) by @antoniivanov in #3032
- vdk-oracle: escape special chars in column names by @DeltaMichael in #3045
- vdk-oracle: fixes in tests and passing secrets by @antoniivanov in #3031
- vdk-oracle: set thik mode default to true by @antoniivanov in #3037
- vdk-oracle: support secrets to connect to database by @DeltaMichael in #2961
- vdk-oracle: support thick mode by default by @DeltaMichael in #2970
- vdk-plugins: replace report_and_rethrow by @DeltaMichael in #3057
- vdk-specs: Ingestion into vector db for RAG use by @murphp15 in #3004
- vdk-structlog: Document log level module config var by @gageorgiev in #2984
- vdk-structlog: add syslog handler by @duyguHsnHsn in #2985
- vdk-structlog: filter vdk_step_name and vdk_step_type correctly by @DeltaMichael in #2968
- vdk-structlog: rename structlog configs by @DeltaMichael in #3002
- vdk-structlog: support parse log level module and logger config granularity by @yonitoo in #2980
- vdk-structlog: update README.md by @DeltaMichael in #3047
- vdk-structlog: update VEP-2448 to reflect current state of development by @DeltaMichael in #3035
Full Changelog: v1.5...v1.6
Versatile Data Kit 1.5
Major features include:
Control Service
Data Job Configuration Persistence feature improvements
Builds on the pre-alpha version of the feature with GraphQL reads from the database, documentation improvements, and improved test coverage.
vdk-structlog: Log Plugin
Improvements to the vdk-structlog plugin in preparation for its final release.
vdk-datasources: Data sources POC
Adding Data sources initial PoC version which includes:
- Data Source APIs handling sources, streams and state
- A new Data Source is created by implementing IDataSource, IDataSourceConfiguration and IDataSourceStream
- Partial Data Source connection management
- Data Source Ingester that reads from data sources and writes to an existing IIngester
- An example data source AutoGeneratedDataSource
- An example job in the function test suite
vdk-oracle: Create oracle plugin
Adding pre-alpha VDK support for connecting and ingesting to an Oracle DB. For further usage details consult the VDK Oracle Plugin readme.
vdk-jupyter: Add alpha support for Jupyter Notebooks
Adding full alpha support for VDK Jupyter integration.
How to get started?
We have prepared a few guides to help with your Jupyter journey: How to Create a Data Job With VDK Notebook, How To Develop a Data Job With VDK Notebook,
How to Convert a Data Job with VDK Notebook, and How to Deploy a Data Job with VDK Notebook.
What's Changed
- control-plane: remove needless step in docker build. by @murphp15 in #2947
- control-service: Add GraphQL read from DB by @doks5 in #2837
- control-service: add MeterRegistry counters for DataJobsSynchronizer by @mrMoZ1 in #2844
- control-service: add pod disruption budget by @dakodakov in #2882
- control-service: add resource constraints by @dakodakov in #2915
- control-service: add support for pymssql by @mivanov1988 in #2908
- control-service: deployment cannot be suspended by @mivanov1988 in #2941
- control-service: fix deployment resources by @mivanov1988 in #2955
- control-service: fix pod disruption budget template by @dakodakov in #2885
- control-service: force aws cred provider refresh by @mrMoZ1 in #2879
- control-service: ingress allow for multiple hosts by @mivanov1988 in #2911
- control-service: integration test for async job deploy by @mrMoZ1 in #2829
- control-service: make new release of job builder images by @murphp15 in #2950
- control-service: make timeout configurable by @murphp15 in #2951
- control-service: reduce logging by @mivanov1988 in #2857
- control-service: refactor service user doc by @mrMoZ1 in #2436
- control-service: unit tests for data job persistence classes by @mrMoZ1 in #2935
- control-service: update ingress by @mivanov1988 in #2853
- support: update the ci notification by @DeltaMichael in #2877
- vdk-core: add datetime and bytes to decimal json encoder by @DeltaMichael in #2924
- vdk-core: add logging plugin warning and check if the vdk-structlog plugin is used by @yonitoo in #2944
- vdk-core: create config option for logging execution result by @DeltaMichael in #2850
- vdk-core: ensure early logs are available by @antoniivanov in #2846
- vdk-core: fix bug in error classification by @DeltaMichael in #2840
- vdk-core: fix exception cause swallowing by @DeltaMichael in #2949
- vdk-core: handle fetchall errors for oracledb by @DeltaMichael in #2917
- vdk-core: implement config option for logging execution result by @DeltaMichael in #2831
- vdk-core: ingest logging formatting bug by @antoniivanov in #2836
- vdk-core: remove redundant logs by @DeltaMichael in #2841
- vdk-core: VdkBoundLogger by @gageorgiev in #2823
- vdk-data-source-git: data source for git POC by @antoniivanov in #2859
- vdk-data-sources: add sources command by @antoniivanov in #2864
- vdk-data-sources: address review comments by @antoniivanov in #2865
- vdk-datasources: data sources POC by @antoniivanov in #2805
- vdk-duckdb: fix ingestion by @antoniivanov in #2843
- vdk-events: add explore23 to events by @duyguHsnHsn in #2873
- vdk-events: Add ingest and anonymize workshop by @antoniivanov in #2833
- vdk-events: improve Productionizing Jupyter Notebooks README by @duyguHsnHsn in #2896
- vdk-events: update Ingest and Anonymize workshop by @antoniivanov in #2891
- vdk-huggingface: add new ingest plugin by @antoniivanov in #2858
- vdk-impala: enhance memory error handling by @dakodakov in #2938
- vdk-impala: uncomment tests that were not passing due to core change by @DeltaMichael in #2845
- vdk-ipython: add support for %%vdkingest by @antoniivanov in #2866
- vdk-jupyter: add retries by @murphp15 in #2957
- vdk-jupyter: fix bug for failed requests and improve error handling by @yonitoo in #2916
- vdk-jupyter: fix formatting issues by @yonitoo in #2890
- vdk-jupyter: fix skipped tests by @murphp15 in #2871
- vdk-jupyter: include test report by @murphp15 in #2876
- vdk-jupyter: introduce Task Runner, a polling mechanism that runs tasks in the background and tracks their status by @yonitoo in #2869
- vdk-jupyter: run tests in CI by @murphp15 in #2868
- vdk-jupyterlab-extensions: update dependencies by @murphp15 in #2863
- vdk-kerberos-auth: fix unit test failing from bad logging config by @murphp15 in #2923
- vdk-logging-format: Deprecate plugin by @gageorgiev in #2888
- vdk-notebook: Support for "%%vdkingest" cell type in Notebook Steps by @antoniivanov in #2867
- vdk-oracle: create oracle plugin by @DeltaMichael in #2927
- vdk-oracle: support type inference by @DeltaMichael in #2948
- vdk-plugins: Audit log statements by @gageorgiev in #2878
- vdk-singer: Singer.io plugin for data sources by @antoniivanov in #2821
- vdk-smarter: pin openai version to 0.28 by @yonitoo in #2886
- vdk-structlog: default formatter by @duyguHsnHsn in #2936
- vdk-structlog: fix filtering of metadata fields for json by @DeltaMichael in #2874
- vdk-structlog: LTSV formatting by @gageorgiev in #2887
- vdk-structlog: Test for compatibility for log_level_module propagation by @gageorgiev in #2922
- vdk-structlog: Tests by @gageorgiev in #2838
Full Changelog: v1.4...v1.5
Versatile Data Kit 1.4
Major features include:
Control Service
Complete Data Job Configuration Persistence (Pre-alpha)
Storing data job deployment configurations in both Kubernetes and a database is a two-step process that leads to performance degradation, potential data loss, and added complexity. Consistently keeping all essential properties in the database improves efficiency, system reliability, and user experience.
Another important benefit is that deployment status can be tracked through the API.
vdk-structlog log plugin
The plugin allows users to configure logging metadata and logging format. It also works with bound loggers.
This plugin allows users to:
- select the log output format
- configure the logging metadata
- display metadata added by bound loggers
See more in its documentation page.
vdk-core Error handling changes
Deprecated error reporting patterns
![](https://private-user-images.githubusercontent.com/2536458/278037168-caf0f22d-5a15-48a7-a38b-892864c4ee0d.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzMDI2NTcsIm5iZiI6MTczOTMwMjM1NywicGF0aCI6Ii8yNTM2NDU4LzI3ODAzNzE2OC1jYWYwZjIyZC01YTE1LTQ4YTctYTM4Yi04OTI4NjRjNGVlMGQucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIxMSUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMTFUMTkzMjM3WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NzNhOWU2OGE0ODEzNGZhOGEyMjU5ZWVmYmRmYmU4ZjExODc3Zjc1YWMyNzU4OTliM2RjOTQxMjhmMDA4Y2U3MCZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ._r7uoef1nPxcJY750LkiyMneFIKAJdljrLP0_jDLhFQ)
Most vdk-core generic Exceptions replaced with Domain specific
![](https://private-user-images.githubusercontent.com/2536458/278037274-759bb748-cd3f-4ab7-be2a-a514c1cb3863.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzMDI2NTcsIm5iZiI6MTczOTMwMjM1NywicGF0aCI6Ii8yNTM2NDU4LzI3ODAzNzI3NC03NTliYjc0OC1jZDNmLTRhYjctYmUyYS1hNTE0YzFjYjM4NjMucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIxMSUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMTFUMTkzMjM3WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MzhhMDk3YzcyNGRlNmNlODQ1Y2JmYzNhZTY2ZjFlNWNiNzgwMTljYWU5NGNhN2QzZDk3NDY0N2YyMmQ3Y2VhYiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.FJPjXeUZbX_W5of_tqJ-ZqE_QJ588Cwxing0TXpS9KA)
Test exception propagation to user code
VDK stopped wrapping non-vdk errors in vdk errors. This should result in errors coming from libraries, templates, etc. being propagated to user code. Users should then be able to handle those errors. So now something like this should be easy:
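import logging

import pandas as pd
from vdk.api.job_input import IJobInput

log = logging.getLogger(__name__)


def run(job_input: IJobInput):
    args = dict()
    try:
        # "csv-risky" is the template name used in this example
        job_input.execute_template("csv-risky", args)
    except pd.errors.EmptyDataError as e:
        # The library error reaches user code unwrapped, so it can be handled here
        log.info("Handling empty data error")
        log.exception(e)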
What's Changed
- control-service: Enhance Exception Handling for DataJobsSynchronizer by @mivanov1988 in #2758
- control-service: [bug fix] Add freetype2 and libpng to secure builder by @doks5 in #2744
- control-service: add IT tests for async job deployment by @mivanov1988 in #2794
- control-service: add new deployment tables by @mivanov1988 in #2719
- control-service: asynchronous deployment deletion by @mivanov1988 in #2781
- control-service: data job synchronizer error handling by @mivanov1988 in #2742
- control-service: deployment controller reads from db by @mrMoZ1 in #2800
- control-service: depoyment controller writes deployment entity by @mrMoZ1 in #2731
- control-service: enable scheduled execution for data jobs' synchronizer by @mivanov1988 in #2771
- control-service: fix control service post deployment test by @mivanov1988 in #2790
- control-service: fix data job image building by @mivanov1988 in #2832
- control-service: fix infinite redeployment by @mivanov1988 in #2822
- control-service: fix post deployment test by @mrMoZ1 in #2815
- control-service: fix read deployment job version by @mivanov1988 in #2819
- control-service: handle deployment deletion in case of a job being deleted by @mivanov1988 in #2816
- control-service: implement multi-threading for synchronization process by @mivanov1988 in #2775
- control-service: improve async deployment logging by @mivanov1988 in #2826
- control-service: job resources validation on job deployment by @mivanov1988 in #2793
- control-service: reduce logging by @mivanov1988 in #2834
- control-service: resolve dependabot alert by @antoniivanov in #2751
- control-service: user-initiated deployment notifications by @mivanov1988 in #2757
- control-service: utilize new deployment tables by @mivanov1988 in #2714
- vdk-audit: Clean up some audit events by @doks5 in #2792
- vdk-control-cli: fix CI/CD tests by @yonitoo in #2782
- vdk-control-cli: pin werkzeug to version 2.3.8 or less by @DeltaMichael in #2743
- vdk-core: add error formatter configuration by @DeltaMichael in #2754
- vdk-core: create ingestion exceptions by @antoniivanov in #2752
- vdk-core: domain specific properties/secrets exceptions by @antoniivanov in #2770
- vdk-core: fix postgres and greenplum tests by @yonitoo in #2825
- vdk-core: move error classifying logic by @duyguHsnHsn in #2769
- vdk-core: pass exceptions from data job steps in results by @DeltaMichael in #2774
- vdk-core: remove code duplication in ingestion router by @antoniivanov in #2760
- vdk-core: simplify error message for send_**_for_ingestion by @antoniivanov in #2787
- vdk-core: test exception propagation to user code by @DeltaMichael in #2820
- vdk-core: test ingestion with multiple threads by @antoniivanov in #2796
- vdk-core: tests passing custom iterator to ingestion methods by @antoniivanov in #2761
- vdk-coverity: Adding Coverity Scan by @shanmathik in #2753
- vdk-dag, vdk-control-cli, airflow-provider-vdk: step using deprecated field by @antoniivanov in #2706
- vdk-dag: fix failing validation tests by @DeltaMichael in #2712
- vdk-heartbeat: Introduce additional sleep when checking deployments by @doks5 in #2824
- vdk-impala: add Out Of Memory error handling by @dakodakov in #2747
- vdk-impala: introduce new error handling by @duyguHsnHsn in #2759
- vdk-jupyter: Enable PYTHONUNBUFFERED to ensure correct log ordering by @gageorgiev in #2711
- vdk-jupyter: add Tutorial link in getting-started.ipynb by @antoniivanov in #2707
- vdk-jupyter: cicd fix by @duyguHsnHsn in #2780
- vdk-jupyter: fix bug in detecting run functions by @antoniivanov in #2721
- vdk-jupyter: fix ci/cd by @duyguHsnHsn in #2773
- vdk-jupyter: print summary output to temp dir by @antoniivanov in #2715
- vdk-jupyter: update getting started to incldue vdksql by @antoniivanov in #2713
- vdk-jupyter: use vdksql for SQL cells and steps by @antoniivanov in #2729
- vdk-notebook: [bug fix] ignore missing id field in cell by @doks5 in #2717
- vdk-notebook: remove obsolete code by @antoniivanov in #2716
- vdk-notebook: set summary file path as configuration by @antoniivanov in #2709
- vdk-plugins: add new error handling methods by @duyguHsnHsn in #2750
- vdk-structlog: create structured logging plugin by @DeltaMichael in #2801
- vdk-test-utils: make IngestIntoMemoryPlugin method configurable by @antoniivanov in #2783
New Contributors
- @shanmathik made their first contribution in #2753
Full Changelog: v1.3...v1.4
Versatile Data Kit 1.3
Major features include:
VDK SDK
Add vdk sql-query command (experimental)
CLI command to execute a SQL query against a VDK-managed database.
It should replace the vdk <db>-query commands.
export db_default_type=trino
vdk sql-query -q "select * from trino_table"
id memory_size_mb num_vcpus
-------------- ---------------- -----------
50181506DB2F7 256 1
5018A2223FC32 128 1
501883404870A 256 1
vdk sql-query -o json -q "select * from trino_table"
[
{"id": "50181506DB2F7", "memory_size_mb": 256, "num_vcpus": 1},
{"id": "5018A2223FC32", "memory_size_mb": 128, "num_vcpus": 1},
{"id": "501883404870A", "memory_size_mb": 256, "num_vcpus": 1}
]
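The JSON output format makes the result easy to consume from scripts. The sketch below parses the sample output shown above with Python's standard `json` module. The variable names are illustrative, and the aggregation is just one example of what a caller might do with the rows.

```python
import json

# Sample output copied from the `vdk sql-query -o json` example above.
raw_output = """
[
  {"id": "50181506DB2F7", "memory_size_mb": 256, "num_vcpus": 1},
  {"id": "5018A2223FC32", "memory_size_mb": 128, "num_vcpus": 1},
  {"id": "501883404870A", "memory_size_mb": 256, "num_vcpus": 1}
]
"""

rows = json.loads(raw_output)
total_memory_mb = sum(row["memory_size_mb"] for row in rows)
small_ids = [row["id"] for row in rows if row["memory_size_mb"] < 256]

print(total_memory_mb)  # 640
print(small_ids)        # ['5018A2223FC32']
```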
VDK Notebook Getting Started
Introduction to the development of VDK Jobs using Notebooks.
- Learn how to create data jobs
- Learn how to deploy data jobs
VDK SQL Notebook Cell
VDK Errors APIs
(Relevant for plugin developers)
VDK is deprecating the use of errors.log_and_rethrow and errors.log_and_throw in favour of:
errors.report(error_type, exception: BaseException)
errors.report_and_throw(exception: BaseVdkException)
errors.report_and_rethrow(error_type, exception: BaseException)
The aim is to reduce "double" logging and verbosity of logs.
Control Service
Add support for multiple jwt issuers
A new property, security.oauth2.jwtIssuerUris, is introduced and replaces jwtIssuerUrl:
security:
  oauth2:
    ## [Required if security.enabled = True]
    ## Deprecated in favor of jwtIssuerUris.
    jwtIssuerUrl: ""
    ## [Required if security.enabled = True]
    ## Comma separated list of issuers to use.
    jwtIssuerUris: ""
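To illustrate what a comma-separated issuer list means in practice, the sketch below splits such a value and checks a token's `iss` claim against it. It is not the Control Service implementation, which also has to verify token signatures. The function names and the example issuer URLs are hypothetical.

```python
def parse_issuer_uris(jwt_issuer_uris):
    """Split a comma-separated issuer setting into a clean list."""
    return [uri.strip() for uri in jwt_issuer_uris.split(",") if uri.strip()]


def is_trusted_issuer(iss_claim, jwt_issuer_uris):
    """Check a token's `iss` claim against the configured issuers."""
    return iss_claim in parse_issuer_uris(jwt_issuer_uris)


# Example value only; real deployments use their own issuer URLs.
configured = "https://issuer-one.example.com, https://issuer-two.example.com"
print(parse_issuer_uris(configured))
print(is_trusted_issuer("https://issuer-two.example.com", configured))
```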
Implement Webhook APIs authentication
Enables setting a system or service account for Control Service webhook authentication.
This introduces two new properties for each webhook:
authorizationServerEndpoint: ""
authorizationRefreshToken: ""
If set, they are used when making the HTTP webhook request. If not, the request falls back to the user-provided authentication token.
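The fallback described above can be sketched as a small selection function. This is an illustration under assumptions, not the Control Service code: it assumes both properties must be set for the service account to be used, and the function name and returned dictionary shape are hypothetical.

```python
def select_webhook_auth(authorization_server_endpoint, authorization_refresh_token, user_token):
    """Pick which credentials a webhook request would use."""
    # Assumption: both properties must be non-empty to use the service account.
    if authorization_server_endpoint and authorization_refresh_token:
        return {
            "source": "service-account",
            "endpoint": authorization_server_endpoint,
            "refresh_token": authorization_refresh_token,
        }
    # Otherwise fall back to the token supplied by the user.
    return {"source": "user", "token": user_token}


print(select_webhook_auth("", "", "user-token")["source"])
print(select_webhook_auth("https://auth.example.com/token", "refresh-123", "user-token")["source"])
```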
What's Changed
- control-service: add amazon rds ca certificates to data job image by @mrMoZ1 in #2660
- control-service: add data job deployment dynamic property source selection mechanism by @mrMoZ1 in #2641
- control-service: add missing packages to secure job builder by @mivanov1988 in #2662
- control-service: add support for multiple jwt issuers by @mrMoZ1 in #2628
- control-service: data jobs synchronizer initial implementation by @mivanov1988 in #2633
- control-service: implement Webhook APIs authentication by @mivanov1988 in #2655
- control-service: introduce data job deployment entity by @mivanov1988 in #2659
- control-service: webhook authentication comments by @mivanov1988 in #2663
- specs: VDK Run Logs: add detailed design for Progress tracker functionality by @antoniivanov in #2647
- specs: VEP-2421 Universal Database Plugin by @Maximiliaan72 in #2616
- vdk-control-api-auth: [bug-fix] authorization header base64 encoding by @doks5 in #2658
- vdk-control-cli: download keytab along with downloading job by @duyguHsnHsn in #2617
- vdk-core: add log stacktrace flag by @DeltaMichael in #2648
- vdk-core: add more unit tests to for job_input_error_classfier by @DeltaMichael in #2621
- vdk-core: add vdk sql-query command by @antoniivanov in #2649
- vdk-core: change jobs behavior when logging unavailable by @mrMoZ1 in #2656
- vdk-core: provide more info about the machine vdk is running on by @antoniivanov in #2629
- vdk-core: remove box around exception when printed by @DeltaMichael in #2700
- vdk-core: separate error logging, error reporting and error classification by @DeltaMichael in #2666
- vdk-csv: uncomment test by @duyguHsnHsn in #2704
- vdk-duckdb: update readme by @antoniivanov in #2657
- vdk-heartbeat: Add python version configuration by @doks5 in #2675
- vdk-ipython: add support for %%vdksql magic command by @antoniivanov in #2619
- vdk-jupyter: Change root FS position for path parameter by @gageorgiev in #2550
- vdk-jupyter: Log file is not refreshed on new vdk run by @duyguHsnHsn in #2687
- vdk-jupyter: Update ConvertJobToNotebook doc by @antoniivanov in #2702
- vdk-jupyter: VDK menu fix by @duyguHsnHsn in #2635
- vdk-jupyter: add first version of getting started page by @duyguHsnHsn in #2516
- vdk-jupyter: add opionated jupyter extension as extra by @antoniivanov in #2625
- vdk-jupyter: change Create and Download default path by @duyguHsnHsn in #2669
- vdk-jupyter: fix arguments input size by @duyguHsnHsn in #2685
- vdk-jupyter: fix conversion by @duyguHsnHsn in #2699
- vdk-jupyter: improve init message by @antoniivanov in #2688
- vdk-jupyter: pin traitlets to 5.9 by @antoniivanov in #2674
- vdk-jupyter: rename log file by @duyguHsnHsn in #2703
- vdk-jupyter: show status on starting new operation by @antoniivanov in #2627
- vdk-jupyter: use npm in build.sh by @antoniivanov in #2665
- vdk-jupyter: show status dialog on deploy and create by @antoniivanov in #2631
- vdk-notebook: add support for execute %%vdksql cells by @antoniivanov in #2622
- versatile-data-kit: CONTRIBUTING.md feedback / improvements #2146 by @zverulacis in #2661
- versatile-data-kit: README user testing improvements by @zverulacis in #2654
Full Changelog: v1.2...v1.3
Versatile Data Kit 1.2
Major features include:
New Github Landing Page
![](https://private-user-images.githubusercontent.com/2536458/264348723-be5bfd79-c7ce-4840-a77a-f030cb267eb1.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzMDI2NTcsIm5iZiI6MTczOTMwMjM1NywicGF0aCI6Ii8yNTM2NDU4LzI2NDM0ODcyMy1iZTViZmQ3OS1jN2NlLTQ4NDAtYTc3YS1mMDMwY2IyNjdlYjEucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIxMSUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMTFUMTkzMjM3WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9YTAwZTQxMWRmYzg4MzgzNjU0MTkzMjVkYzViMTUyM2E3MmE0ZjdkOTVmMmJhYjcxZTUzNWU4YmE2MGIzOWUwMyZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.GyMlrkHKLxWM7jkfuGhbfSupvjQDTKMYLg6EBhReprw)
The new landing page of our open-source project aims to show users, rather than just tell them, what VDK is and what they can do with it.
Check it out at https://github.com/vmware/versatile-data-kit
Control Service improvements
Operators can set builder image per Python version
Operators can easily control:
- The image of the operator-managed VDK (system) library
- The base image used to build the user data job
- And now the builder image with which the user data job is built
deploymentSupportedPythonVersions:
  3.9:
    baseImage: "registry.hub.docker.com/versatiledatakit/data-job-base-python-3.7:latest"
    vdkImage: "registry.hub.docker.com/versatiledatakit/quickstart-vdk:release"
    builderImage: "registry.hub.docker.com/versatiledatakit/job-builder:latest"
More information can be found in the Control Service Helm Chart documentation
Operators can configure files to be automatically ignored on deploy
When users deploy a job, operators can control which files are accepted, and either return an error or silently ignore the rest.
This improves security while giving operators the flexibility to change the rules without directly impacting users:
# Instead to allow only sql and ini text files specify "text/x-sql,text/x-ini"
# Full list of file types are documented in https://tika.apache.org
# If set to empty, then all file types are allowed.
uploadValidationFileTypesAllowList: ""
# List of file extensions that are allowed to be uploaded. Comma separated list e.g: "py,csv,sql"
# only files with extensions that are present in this list will be allowed to be uploaded.
# if the list is empty all extensions are allowed.
uploadValidationFileExtensionsAllowList: ""
# Works as the uploadValidationFileTypesAllowList above, only it deletes the files instead of failing
# the job upload. Runs before the allow list, therefore if only files of the same types are present in
# both lists, job upload will succeed.
uploadValidationFileTypesFilterList: ""
# List of file extensions that are automatically deleted from data job source code before upload.
# Comma separated list e.g: "pyc,exe,sh". If the list is empty no files will be deleted.
# Files are first deleted before the allow list performs its checks.
uploadValidationFileExtensionsFilterList: ""
More information can be found in the Control Service Helm Chart documentation
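The ordering of the two extension lists (filter first, then allow) can be sketched as follows. This is a simplified illustration of the behaviour described in the comments above, not the Control Service code. It covers only the extension lists, not the Tika-based file type lists, and the function names and sample file names are hypothetical.

```python
def parse_list(value):
    """Turn a comma-separated setting such as "py,csv,sql" into a set."""
    return {item.strip().lower() for item in value.split(",") if item.strip()}


def extension(file_name):
    """Return the lowercase extension, or "" if the file name has none."""
    name = file_name.rsplit("/", 1)[-1]
    return name.rsplit(".", 1)[-1].lower() if "." in name else ""


def validate_upload(files, allow_extensions="", filter_extensions=""):
    allow = parse_list(allow_extensions)
    to_filter = parse_list(filter_extensions)
    # Step 1: the filter list silently drops matching files.
    kept = [f for f in files if extension(f) not in to_filter]
    # Step 2: the allow list fails the upload for anything else not listed.
    # An empty allow list means all extensions are allowed.
    if allow:
        rejected = [f for f in kept if extension(f) not in allow]
        if rejected:
            raise ValueError(f"Files not allowed: {rejected}")
    return kept


files = ["10_load.py", "20_transform.sql", "__pycache__/10_load.pyc", "run.sh"]
print(validate_upload(files, allow_extensions="py,csv,sql", filter_extensions="pyc,exe,sh"))
```

Because filtering runs first, the `.pyc` and `.sh` files are dropped before the allow list is checked, so the upload succeeds with only the `.py` and `.sql` files.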
New initiative: VDK Run Logs: Simplified And Readable
Take a look at the VEP, which would simplify troubleshooting and development using VDK.
We are focused on these goals:
- Data job run logs provide progress-tracking information
- User logs stand out
- Long-running operations (like DAGs) are traceable in the logs
- The root cause is immediately visible from the logs.
- Clean Error Handling
Versatile Data Kit Architecture.md
The design architecture of Versatile Data Kit, outlining all main interfaces and how they work, can be seen in architecture.md.
Notebook UI improvements
Add UI element indicating a VDK operation is running
Provides visual feedback to the user when a VDK operation is in progress.
![Status button](https://private-user-images.githubusercontent.com/36246462/259115446-957fb5e3-5b4e-41d7-b795-46e99b3b8d15.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzMDI2NTcsIm5iZiI6MTczOTMwMjM1NywicGF0aCI6Ii8zNjI0NjQ2Mi8yNTkxMTU0NDYtOTU3ZmI1ZTMtNWI0ZS00MWQ3LWI3OTUtNDZlOTliM2I4ZDE1LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjExVDE5MzIzN1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWVhY2VkYjQ0ZWVmZjgzMTlmYWMzZDcwY2RkM2UzMTEyZjZmN2IzYWM2NjBkNWQ3MzVjNmJhZDQyYmE0MmFkNmYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.6CEkM4MpwfULTC-H0FtKL_8xFxbcAQp-Au_9a-FTHKA)
![Hover](https://private-user-images.githubusercontent.com/36246462/259115288-2ea1c4f3-6668-4662-a70a-0bcf11b6c87f.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzMDI2NTcsIm5iZiI6MTczOTMwMjM1NywicGF0aCI6Ii8zNjI0NjQ2Mi8yNTkxMTUyODgtMmVhMWM0ZjMtNjY2OC00NjYyLWE3MGEtMGJjZjExYjZjODdmLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjExVDE5MzIzN1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWM1ODdhN2IxZWVlZmUyMzRhODYwMWFlM2E4YzZiMTg2OGE5YzYxZTVlOGIzN2E3MzEwNTQ3ZDMwYWI4ZDM1MGEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.BUXYtHikFEYWHaZWW8e4LjPkelu5JPiRJsxHDy1ykes)
Add icons to vdk operation result dialogs
Enhances user experience by adding icons to result dialog boxes
![Screenshot 2023-08-09 at 12 54 00](https://private-user-images.githubusercontent.com/36246462/259369602-c94c7f40-53a9-45a8-84da-7adfc8caaba0.png)
![Screenshot 2023-08-09 at 12 53 49](https://private-user-images.githubusercontent.com/36246462/259369610-bcfe9f0a-ad93-430f-a001-0e4cd5c73539.png)
![Screenshot 2023-08-09 at 12 53 41](https://private-user-images.githubusercontent.com/36246462/259369622-349f3649-fa65-4815-8a70-90bd4e2024ea.png)
VDK Login UI: Semi-automated authentication workflow in the Jupyter Notebook
New database POC plugin vdk-duckdb
Check out more at vdk-duckdb
What's Changed
- control-service: Ability to disable VDK UI in Helm by @antoniivanov in #2545
- control-service: ability to set builder image per python version by @mivanov1988 in #2490
- control-service: add file filter before job upload by @mrMoZ1 in #2540
- control-service: add libffi to secure base job image by @mivanov1988 in #2515
- control-service: allow delete-all-secrets command by @dakodakov in #2523
- control-service: enable WebHook authentication by @mivanov1988 in #2551
- control-service: exclude pyc files from dj validation by @mivanov1988 in #2492
- control-service: fine-tune the job-builder-secure by @mivanov1988 in #2497
- control-service: fix secure base image by @mivanov1988 in #2517
- control-service: fix webhooks authentication helm chart by @mivanov1988 in #2560
- control-service: introduce data job deployment entity by @mivanov1988 in #2613
- control-service: release job builder secure 1.3.0 by @mivanov1988 in #2496
- control-service: release secure job builder by @mivanov1988 in #2534
- control-service: remove hardcoded image pull policy from job deployer by @mrMoZ1 in #2557
- control-service: support for pyodbc by @mivanov1988 in #2524
- control-service: update supported python version example by @mivanov1988 in #2494
- frontend: fix pushing images by @antoniivanov in #2491
- frontend: Fix bug in Cypress plugin by @gorankokin in #2486
- frontend: Fix for e2e test by @gorankokin in #2559
- frontend: bump cicd-base-gui image version by @DeltaMichael in #2366
- frontend: set favicon for vdk by @antoniivanov in #2433
- specs: VEP-2420: Getting started with your Data by @murphp15 in #2519
- specs: VEP-2448: VDK Run Logs: Simplified And Readable by @DeltaMichael in #2456
- specs: add architecture.md by @antoniivanov in #2265
- specs: try to make it clear what deliverables should be by @antoniivanov in #2495
- specs: update Notebook integration with Oauth2 authentication by @antoniivanov in #2533
- specs: update VEPs metadata by @antoniivanov in #2532
- specs: vep-2448 detailed design section by @DeltaMichael in #2558
- specs: vep-2448 high-level design by @DeltaMichael in #2520
- vdk-audit: [bug fix] Fix incorrectly detected event by @doks5 in #2548
- vdk-control-api-auth: add better error message for refresh token failure by @antoniivanov in #2607
- vdk-control-api-auth: add get_authenticated_username by @antoniivanov in #2518
- vdk-control-api-auth: vdk credentials cache refactoring by @antoniivanov in #2606
- vdk-control-cli: Add python_version to sample config.ini by @doks5 in #2555
- vdk-control-cli: add --set-prompt option for secrets by @dakodakov in #2514
- vdk-core: Add flag to JobConfig in case config file is required by @doks5 in #2521
- vdk-core: add vdk sql-query command by @antoniivanov in #2512
- vdk-core: adopt pluggy 1.3 by @antoniivanov in #2614
- vdk-core: make sure standalone data job doesn't run steps by @antoniivan...
v1.0.1
Major features include:
Secrets Service Helm Chart installation
Vault integration configuration for storing Data Job Secrets has been added to the Helm chart:
secrets:
  vault:
    enabled: false
    uri: "http://localhost:8200"
    externalSecretName: ""
    ## Alternatively provide the uri and Approle Settings here. externalSecretName takes precedence if both are set.
    approle:
      roleid: foo
      secretid: foo
    sizeLimitBytes: "1048576"
VDK Secrets CLI
Job secrets are used to store credentials/tokens/sensitive data securely. They can now be updated using vdk-control-cli:
Install vdk-control-cli if needed (it comes pre-installed in quickstart-vdk)
pip install vdk-control-cli
vdk secrets --help
For example:
# Set single secret with key "my-key" and value "my-value". If no value is passed you'll get prompted so it's not printed on the screen.
vdk secrets --set my-key "my-value"
# Update multiple secrets at once.
vdk secrets --set "key1" "value1" --set "key2" "value2" --set "secret1" --set "secret2"
Convert Directory-style To Notebook-style Data Job
With the introduction of Notebook-style data jobs, users can now convert a Directory-style Data Job to a Notebook-style Data Job.
VDK Jupyter Extension published in PyPI
Users can now install the Jupyter extension with VDK in their own Python and Jupyter environment with a single line:
pip install vdk-jupyterlab-extension
Then start Jupyter lab as usual:
jupyter lab
Users can now see the notebook:
New plugin: vdk-smarter
VDK Smarter introduces a proof-of-concept (pre-alpha) integration with OpenAI.
In the POC, it reviews all SQL queries managed by VDK.
For more details see the plugin home page
What's Changed
- control-service: Add helm chart entries for Vault Configuation by @dakodakov in #2418
- control-service: Update contributing.md with correct java requirements by @danail-georgiev in #2430
- control-service: add configurable smtp host property by @mrMoZ1 in #2411
- control-service: add helm template for alertmanager by @mrMoZ1 in #2326
- control-service: add timestamps to helm chart by @DeltaMichael in #2344
- control-service: better error logging for failed test by @murphp15 in #2374
- control-service: fix helm chart by @dakodakov in #2449
- control-service: fix publish-job-base-image script by @mivanov1988 in #2473
- control-service: fix typo in helm chart read only root filesystem property by @mrMoZ1 in #2476
- control-service: install necessary dependencies to job builder secure by @mivanov1988 in #2472
- control-service: job-builder using kaniko fix by @tozka in #2429
- control-service: job-builder-secure using kaniko fix by @tozka in #2447
- control-service: logs endpoint doesn't hang by @murphp15 in #2370
- control-service: prevent integer translation in helm chart by @dakodakov in #2470
- control-service: push to multiple registries by @tozka in #2381
- control-service: release job builder in 2 repos by @tozka in #2413
- control-service: remove default vault token by @dakodakov in #2475
- control-service: remove unused dependency influxdb by @tozka in #2388
- control-service: run integration tests on multiple namespace. by @murphp15 in #2446
- control-service: set Execution and JobQuery APIs to stable by @tozka in #2417
- control-service: split build job base image CI/CD step by @mivanov1988 in #2348
- control-service: switch to Approle Vault authentication by @dakodakov in #2435
- control-service: use full url for heartbeat tests and heartbeat tests run in multiple namespaces by @murphp15 in #2295
- frontend: Fix navigation in Data Jobs by @gorankokin in #2356
- frontend: Fix router event handling in base class by @gorankokin in #2375
- frontend: bump toolchain versions in frontend build docker image by @DeltaMichael in #2358
- frontend: enable stable tagging by @DeltaMichael in #2378
- frontend: fix data-pipelines build scripts by @DeltaMichael in #2389
- frontend: push docker images to both repos by @tozka in #2390
- frontend: quickstart-vdk operability tests using cypress by @DeltaMichael in #2359
- frontend: remove e2e tests restrictions by @DeltaMichael in #2386
- support: slack notification on pipeline failure by @DeltaMichael in #2338
- vdk-control-cli: add vdk secrets command by @dakodakov in #2342
- vdk-control-cli: add vdk secrets command by @dakodakov in #2357
- vdk-control-cli: remove set-secret for properties by @dakodakov in #2409
- vdk-core: Allow different python versions for vdk docker images by @doks5 in #2346
- vdk-core: Set sender when checking if email exists by @doks5 in #2376
- vdk-core: [Hot Fix] Stop throwing exceptions if config.ini not present by @doks5 in #2367
- vdk-heartbeat: cover requirements.txt automatic installs by @tozka in #2393
- vdk-impala: Truncate table before inserting data by @sbuldeev in #2369
- vdk-impala: Update README.md for vdk-impala by @sbuldeev in #2355
- vdk-impala: support also pydantic 1.0 by @tozka in #2368
- vdk-impala: upgrade code to support pydantic 2.0 by @tozka in #2362
- vdk-ipython: README.md fix by @duyguHsnHsn in #2345
- vdk-jupyter: fix server error in jupyter ui and remove unneeded code by @duyguHsnHsn in #2361
- vdk-jupyter: Add a message describing how to contact the Jupyter devs by @gageorgiev in #2414
- vdk-jupyter: Create init cell when opening new notebook by @gageorgiev in #2352
- vdk-jupyter: Sample job notebook step by @gageorgiev in #2364
- vdk-jupyter: add Convert Job To Notebook UI button by @yonitoo in #2329
- vdk-jupyter: convert job operation by @duyguHsnHsn in #2406
- vdk-jupyter: publish image to pip registry by @murphp15 in #2407
- vdk-jupyter: remove delete operation by @duyguHsnHsn in #2428
- vdk-plugin-control-cli: add secrets command by @dakodakov in #2387
- vdk-plugins: fix build of multiple plugins by @tozka in #2445
- vdk-plugins: include Ingestion hooks documentation by @tozka in h...
Versatile Data Kit 1.0
Major features include:
VDK Operations UI
VDK Operations UI is a browser application that allows users to manage and monitor data jobs. It ships as part of quickstart-vdk and is available to users who run quickstart-vdk locally.
Users can now:
- View the overall health of their data jobs
- Enable/disable/re-run data jobs
- Have a list of their data jobs and view their deployment status, latest execution status, success rate, etc.
- Have easy access to individual data job details, such as description, schedule, notifications, and data job source code
- View details for each execution of a data job, e.g. the number of executions, job versions for each execution, execution duration, etc.
For more information about the architecture, check out VEP-1507.
See the UI in action:
Control Service Secrets API
With the release of Secrets API, users can now securely store sensitive data such as passwords, credentials, tokens, ensuring compliance with industry standards and reducing the risk of unauthorized access and data breaches.
The new Secrets API allows users to configure a Vault instance in the Control Service, enabling the storage and retrieval of secrets for data jobs. Data jobs can now easily set and retrieve secrets during runtime, enhancing security and enabling seamless integration with third-party systems.
To store and retrieve secrets, we have introduced new API methods under the path
/data-jobs/for-team/{team_name}/jobs/{job_name}/deployments/{deployment_id}/secrets
Users can make GET requests to retrieve secrets and PUT requests to update secrets for a specific data job deployment.
For more details on API usage and examples, please refer to our documentation.
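As a rough illustration of the path above, the following sketch builds GET and PUT requests with Python's standard library. The base URL, the team, job, and deployment names, and the JSON payload shape (a flat key/value object) are assumptions made for this example and are not confirmed by these notes. Authentication headers are omitted and nothing is actually sent.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Hypothetical base URL; replace it with your Control Service endpoint.
BASE_URL = "http://localhost:8092"


def secrets_url(team_name: str, job_name: str, deployment_id: str) -> str:
    """Fill in the secrets path template quoted in the release notes."""
    team, job, deployment = (
        quote(part, safe="") for part in (team_name, job_name, deployment_id)
    )
    return (
        f"{BASE_URL}/data-jobs/for-team/{team}/jobs/{job}"
        f"/deployments/{deployment}/secrets"
    )


def build_get_request(team_name: str, job_name: str, deployment_id: str) -> Request:
    """Prepare (but do not send) a request that retrieves the secrets."""
    return Request(secrets_url(team_name, job_name, deployment_id), method="GET")


def build_put_request(
    team_name: str, job_name: str, deployment_id: str, secrets: dict
) -> Request:
    """Prepare (but do not send) a request that updates the secrets.

    The JSON object body is an assumed payload shape.
    """
    body = json.dumps(secrets).encode("utf-8")
    return Request(
        secrets_url(team_name, job_name, deployment_id),
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
```

A prepared request could then be passed to `urllib.request.urlopen` once the real endpoint and credentials are known.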
vdk-impala: Introduce checks for snapshot and insert template
With the introduction of snapshot and insert template checks, we can now ensure the quality and correctness of the data before it is inserted into the target table.
Previously, the processing step checks were unable to validate the semantics of the data, potentially allowing erroneous data to be inserted. With the new checks in place, we have better control over the data integrity and can prevent unwanted behavior.
Here's an example of how to use the checks:
def sample_check(tmp_table_name):
    # The check passes (returns True) unless the table name contains "bad".
    return "bad" not in tmp_table_name

template_args["check"] = sample_check
job_input.execute_template(
    template_name="snapshot",
    template_args=template_args,
)
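A check would more typically look at the staged data than at the table name. The sketch below is one possible shape, assuming only what the snippet above shows: the check receives the temporary table name and returns a boolean. The `run_query` parameter and `fake_run_query` are hypothetical helpers introduced so the example runs on its own; in a real job `run_query` might wrap a call that executes SQL against the database.

```python
from typing import Callable, List, Sequence


def make_not_empty_check(
    run_query: Callable[[str], List[Sequence]]
) -> Callable[[str], bool]:
    """Build a check that passes only if the staging table has at least one row.

    run_query is a hypothetical helper that takes SQL text and returns rows.
    """

    def check(tmp_table_name: str) -> bool:
        rows = run_query(f"SELECT COUNT(*) FROM {tmp_table_name}")
        return bool(rows) and rows[0][0] > 0

    return check


def fake_run_query(sql: str) -> List[Sequence]:
    """Stand-in for a real database call: tables named '*empty*' have no rows."""
    return [(0,)] if "empty" in sql else [(42,)]


check = make_not_empty_check(fake_run_query)

# The check is passed to the template the same way as in the snippet above.
template_args = {"check": check}
```

Building the check through a closure keeps the function signature that the template expects while still giving the check access to a query helper.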
What's Changed
- control-service: better error logging allowing to understand failing test by @murphp15 in #2184
- control-service: Python image based on Photon OS by @mivanov1988 in #2243
- control-service: ability to send authenticated email notifications by @mrMoZ1 in #2294
- control-service: add secrets API by @dakodakov in #2171
- control-service: add tmp dir path to image deployer's env variables by @mrMoZ1 in #2244
- control-service: data jobs points to correct namespace by @murphp15 in #2268
- control-service: fix failing pipelines by @murphp15 in #2296
- control-service: infer correct namespace if not set by @tozka in #2277
- control-service: install kubectl by @murphp15 in #2290
- control-service: introduce latest and stable tags for docker images by @DeltaMichael in #2138
- control-service: make kubernetes service easy to test. by @murphp15 in #2249
- control-service: move cron jobs methods to the data jobs class by @murphp15 in #2291
- control-service: move cron jobs methods to the data jobs class by @murphp15 in #2293
- control-service: multiple namespaces in testing by @murphp15 in #2269
- control-service: produce secure base job images for python 3.8-3.11 by @mivanov1988 in #2208
- control-service: remove spammy logs by @tozka in #2278
- control-service: remove unneeded methods by @murphp15 in #2260
- control-service: remove unused properties by @murphp15 in #2262
- control-service: secrets service implementation by @dakodakov in #2241
- control-service: secrets service integration test by @dakodakov in #2289
- control-service: secrets service unit tests by @dakodakov in #2276
- control-service: use real class when testing instead of mock by @murphp15 in #2261
- examples: Add Supported Python Versions Example by @doks5 in #2288
- frontend: add null checks for optional configs by @DeltaMichael in #2193
- frontend: disable stable tagging for ui docker images by @DeltaMichael in #2240
- frontend: ping frontend on docker image release by @DeltaMichael in #2101
- specs: VEP-2272 Complete Data Job Configuration Persistence Part 2 by @mivanov1988 in #2302
- specs: VEP-2272 Complete Data Job Configuration Persistence by @mivanov1988 in #2287
- vdk-control-cli: Allow extensions to specify a sample job by @gageorgiev in #2177
- vdk-control-cli: Test only on 3.7 and 3.11 by @gageorgiev in #2230
- vdk-core: Accept string as job_path in JobConfig by @doks5 in #2251
- vdk-core: Add python version disparity warning by @doks5 in #2242
- vdk-core: Add python_version configuration to config-help by @doks5 in #2271
- vdk-core: Improve log message for python version disparity by @doks5 in #2250
- vdk-core: Update JobConfig to match vdk-control-cli JobConfig by @doks5 in #2226
- vdk-core: adapt to recent pluggy changes by @dakodakov in #2317
- vdk-core: add configurable write directory value by @mrMoZ1 in #2206
- vdk-core: add vdk sdk secrets api - part I by @dakodakov in #2318
- vdk-core: add vdk sdk secrets api - part III by @dakodakov in #2325
- vdk-impala: Introduce checks for insert template by @sbuldeev in #2198
- vdk-impala: Introduce checks for snapshot template by @sbuldeev in #2040
- vdk-jupyter: Allow for creating a job with a notebook step by @gageorgiev in #2172
- vdk-jupyter: Fix job creation by @gageorgiev in #2245
- vdk-jupyter: fix build by pinning every package to a specific version by @duyguHsnHsn in #2186
- vdk-jupyter: installation and build by @duyguHsnHsn in #2319
- vdk-jupyter: pin jupyterlab to 3.6.3 in pyproject.toml by @duyguHsnHsn in #2292
- vdk-jupyter: pin tsc to specific version by @duyguHsnHsn in #2220
- vdk-jupyter: small fixtures on the ui by @duyguHsnHsn in #2161
- vdk-notebook: handle job with mixed .ipynb, .py, .sql files use-case by @duyguHsnHsn in #2279
- vdk-plugin-control-cli: better error logging by @murphp15 in #2185
- vdk-test-utils: add vdk sdk secrets api - part 2 by @dakodakov in #2320
- versatile-data-kit: Update .gitlint by @tozka in #2266
- versatile-data-kit: add pr title checker by @tozka in #2270
- versatile-data-kit: ignore patch updates in dependabot by @tozka in #2328
Full Changelog: v0.14...v1.0
Versatile Data Kit 0.14
Major features include:
VDK DAG plugin release
VDK DAG (previously vdk-meta-jobs) is the official name of the plugin that allows users to express dependencies between data jobs. It is released as Beta, with improved stability, usability, and documentation.
Check out the plugin page for more.
Versatile Data Kit UI Shareable Web links
Now users can share links with filters applied:
- The Data Jobs lists (Manage and Explore screens) are shareable through a URL: every applied filter is persisted to the URL and vice versa
- The Data Job Executions screen filters and sort parameters are shareable through a URL: every applied filter or sort is persisted to the URL and vice versa
VDK UI configuration improvements and an easier start with quickstart-vdk
Users can now access VDK UI using quickstart-vdk. VDK UI is also much more configurable:
- Toggleable authentication (default: enabled) using the 'skipAuth' flag.
- Configuration of authentication parameters.
- Ability to specify visual elements displayed, e.g., navigation button to the Explore page.
VDK Control CLI supports python version
Users can now specify the Python version their job should run with when deployed in the VDK Control Service runtime:
vdk deploy --python-version 3.7 ..
Or in job config.ini
[job]
python_version = 3.7
Users can also see which Python versions the VDK Control Service currently supports:
vdk info
This returns something like:
Getting control service information...
VDK Control service version: PipelinesControlService/0.0.1-SNAPSHOT/5f078fe ...
Supported python versions:
3.9
3.8
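Before deploying, it can help to compare the version requested in config.ini with the versions the service reports. The sketch below does that with Python's standard `configparser`. The helper names are hypothetical, the supported list is copied by hand from the sample `vdk info` output above, and treating a missing `python_version` as acceptable is an assumption.

```python
import configparser
from typing import List, Optional

# Same content as the config.ini snippet above.
CONFIG_INI = """
[job]
python_version = 3.7
"""

# Copied by hand from the sample `vdk info` output above.
SUPPORTED_VERSIONS = ["3.9", "3.8"]


def requested_python_version(config_text: str) -> Optional[str]:
    """Return the python_version set in the [job] section, or None if absent."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return parser.get("job", "python_version", fallback=None)


def is_supported(version: Optional[str], supported: List[str]) -> bool:
    """Assumption: no requested version means the service picks its default."""
    return version is None or version in supported


version = requested_python_version(CONFIG_INI)
if not is_supported(version, SUPPORTED_VERSIONS):
    print(f"python_version {version} is not in the supported list {SUPPORTED_VERSIONS}")
```

With the sample values above, the requested 3.7 is not in the reported list, so the mismatch is flagged before any deployment is attempted.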
What's Changed
- control-service: Clean up old data job configurations by @doks5 in #2075
- control-service: Fix backwards-compatibility issues by @doks5 in #2022
- control-service: Only CLI executions are "Manual" by @gageorgiev in #1763
- control-service: Rework supported python version logic by @doks5 in #1992
- control-service: Swagger UI quickstart-vdk server config by @ivakoleva in #2062
- control-service: [Bug fix] Fix supported python versions helm configuration by @doks5 in #1964
- control-service: a clear error message on how to handle the failed pipeline by @murphp15 in #2127
- control-service: add ability to check if docker image exists in ecr by @mrMoZ1 in #1977
- control-service: allow more time to reach a complete state by @murphp15 in #2143
- control-service: append integration test name to job name by @mivanov1988 in #2093
- control-service: better error logging and pull private image in private test by @murphp15 in #2156
- control-service: better error message by @murphp15 in #2094
- control-service: better error message from throwable by @murphp15 in #2157
- control-service: clarify build steps by @dakodakov in #1959
- control-service: code expected to run in transaction now runs in transaction by @murphp15 in #2117
- control-service: custom app config values can we passed to helm. by @murphp15 in #2004
- control-service: delete unused method by @murphp15 in #2038
- control-service: disable authorization on test/cicd deployment by @tozka in #2129
- control-service: disable failing test by @murphp15 in #2086
- control-service: fail tests fast by @murphp15 in #2137
- control-service: fix api declaration by @murphp15 in #1974
- control-service: fix oom tests by @murphp15 in #2028
- control-service: handle null started by value by @murphp15 in #2151
- control-service: if a test is in a bad state it fails straight away by @murphp15 in #2098
- control-service: include details in error message by @murphp15 in #2122
- control-service: increase CICD deployment resources by @tozka in #2130
- control-service: killed job was shown as successful by @mivanov1988 in #2116
- control-service: latest version of gradle and spring /remove old comment by @murphp15 in #1976
- control-service: logs url can include team name by @murphp15 in #2013
- control-service: new python client. by @murphp15 in #1983
- control-service: print response body on error by @murphp15 in #2113
- control-service: remove a test that is testing behaviour that doesn't exist by @murphp15 in #2031
- control-service: remove unused parameter by @murphp15 in #2027
- control-service: remove unused parameters by @murphp15 in #2016
- control-service: see more details when there is an error by @murphp15 in #2050
- control-service: update ecr credentials integration test by @mrMoZ1 in #2079
- control-service: upgrade python client by @murphp15 in #2076
- control-service: use git for images by @murphp15 in #2097
- docs: add getting started section for quickstart-vdk and ui by @DeltaMichael in #2019
- frontend: Bugfix in e2e plugins function and bump major versions for UI libs by @gorankokin in #1994
- frontend: Fix for e2e tests by @gorankokin in #2030
- frontend: Implement executions list enhacements by @gorankokin in #2126
- frontend: Improve visibility of 'User error' messages by @hzhristova in #1960
- frontend: Job sharable executions filter and sort by @gorankokin in #2072
- frontend: Toggleable auth by @ivakoleva in #1958
- frontend: Upgrade lineage to beta version by @hzhristova in #1991
- frontend: shareable links with query params for Data Jobs grids by @hzhristova in #2049
- frontend: visibility of app components is configurable by @DeltaMichael in #1978
- quickstart-vdk: ignore explore page and widgets in frontend by @DeltaMichael in #2073
- vdk-airflow: fix failing tests by @murphp15 in #2078
- vdk-control-cli: Add support for python_version by @doks5 in #2002
- vdk-control-cli: Add support for python_version in config by @doks5 in #2023
- vdk-control-cli: add vdk info command to list of cli commands by @dakodakov in #2069
- vdk-control-cli: import the latest version of the client into cli by @murphp15 in #1969
- vdk-control-cli: upgrade python client by @murphp15 in #2077
- vdk-control-cli: use explicit parameter names by @murphp15 in #1975
- vdk-dag: DAGs propagate their execution type to their component jobs by @yonitoo in #2080
- vdk-dag: Drop deprecation warnings by @gageorgiev in #2012
- vdk-dag: Fix config bug by @gageorgiev in #2029
- vdk-dag: Rename vdk-meta-jobs to vdk-dag by @gageorgiev in #1831
- vdk-dag: fix plugin name of DAGs example README.md by @yonitoo in #1945
- vdk-dag: improve DAGs docs and example by @yonitoo in #1984
- vdk-dag: update VEP about the execution type propagation by @yonitoo in #2095
- vdk-examples: Change Meta Jobs to DAGs in examples by @gageorgiev in #2024
- vdk-gdp-execution-id: example added by @ivakoleva in #1962
- vdk-heartbeat: add t...
Versatile Data Kit 0.13
Major features include:
New plugin: vdk-gdp-execution-id
An installed Generative Data Pack (GDP) plugin automatically expands the data sent for ingestion.
This GDP plugin detects the execution ID of the running Data Job and decorates your data product with it,
so that a data record can now be correlated with a particular ingestion Data Job execution ID.
For more information, see the plugin documentation
vdk-dag: pass arguments to jobs in a DAG
Each job in a DAG can now be passed arguments:
{
    "job_name": "name-of-job",
    "team_name": "team-of-job",
    "fail_meta_job_on_error": false,
    "arguments": <ARGUMENTS IN DICTIONARY FORMAT HERE>,
    "depends_on": ["name-of-job1", "name-of-job2"]
}
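To make the structure above concrete, here is a small sketch of two job entries written as Python dictionaries, with the `arguments` placeholder filled in. All job names, team names, and argument values are made up for illustration, and `unknown_dependencies` is a hypothetical helper (not part of the plugin) that simply checks that every `depends_on` entry refers to a job defined in the same list.

```python
from typing import Dict, List

# Placeholder names and values; only the keys follow the structure shown above.
JOBS: List[Dict] = [
    {
        "job_name": "ingest-job",
        "team_name": "my-team",
        "fail_meta_job_on_error": True,
        "arguments": {"table_name": "raw_events"},
        "depends_on": [],
    },
    {
        "job_name": "transform-job",
        "team_name": "my-team",
        "fail_meta_job_on_error": False,
        "arguments": {"source_table": "raw_events", "batch_size": 500},
        "depends_on": ["ingest-job"],
    },
]


def unknown_dependencies(jobs: List[Dict]) -> Dict[str, List[str]]:
    """Map each job name to the dependencies that are not defined in `jobs`.

    Hypothetical helper for illustration; an empty result means all
    depends_on entries point at jobs in the list.
    """
    known = {job["job_name"] for job in jobs}
    missing: Dict[str, List[str]] = {}
    for job in jobs:
        absent = [dep for dep in job.get("depends_on", []) if dep not in known]
        if absent:
            missing[job["job_name"]] = absent
    return missing
```

Each job receives only its own `arguments` dictionary, so values shared between jobs (such as the table name here) are repeated per entry.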
vdk-notebook: VDK job input in vdk cells
Users can develop jobs entirely in a Notebook file, with all features of VDK available out of the box.
After installing vdk-notebook, users have access to the job_input interface to execute templates, ingest data, and more.
vdk-notebook: vdk and non-vdk cells
To separate production code from development code, the vdk-notebook integration lets users mark which cells are deployable as part of their production code and which are not.
quickstart-vdk now includes the Operations UI
When quickstart-vdk is installed, VDK Server is available for local testing and now includes the UI:
pip install quickstart-vdk
vdk server --install
For more information see here
Versatile Data Kit Frontend npm libraries release
The Versatile Data Kit Frontend provides 2 npm (Angular) libraries which can be used to integrate VDK UI with your own screens:
- @versatiledatakit/data-pipelines
Versatile Data Kit Data Pipelines library provides UI screens that help manage data jobs via the Versatile Data Kit Control Service
- @versatiledatakit/shared
Versatile Data Kit Shared library enables reuse of shared features such as NgRx Redux, Error Handlers, Utils, Generic Components, etc.
What's Changed
- control service: Add supported python version configuration by @doks5 in #1761
- control-service: fix python api release by @murphp15 in #1946
- control service: Dynamically set job base image in builder by @doks5 in #1864
- control-service: Add python_version to Control Service API by @doks5 in #1806
- control-service: Add python_version to Execution API by @mivanov1988 in #1878
- control-service: Add python_version to GraphQL API by @mivanov1988 in #1909
- control-service: Add support for Python 3.11 by @mivanov1988 in #1861
- control-service: Dynamically set vdk image in JobImageDeployer by @doks5 in #1883
- control-service: Expose supported python versions in helm by @doks5 in #1935
- control-service: Remove support for very old k8s apiVersion by @murphp15 in #1860
- control-service: add the frontend to helm by @murphp15 in #1885
- control-service: enable usage of aws temporary credentials by @mrMoZ1 in #1787
- control-service: expose supported python versions by @dakodakov in #1841
- control-service: fix failing image publisher by @murphp15 in #1810
- control-service: force job builder deploy by @mrMoZ1 in #1823
- control-service: new helm release by @murphp15 in #1910
- control-service: revert job builder python version by @mrMoZ1 in #1840
- control-service: update helm charts for service account credentials by @mrMoZ1 in #1800
- control-service: update job builders for aws temporary credentials by @mrMoZ1 in #1799
- documentation: VDK components explained by @ivakoleva in #1865
- frontend: Align code formatting in frontend projects by @gorankokin in #1863
- frontend: Configurable OAuth by @ivakoleva in #1913
- frontend: Update docs with build/test configuration by @DeltaMichael in #1928
- frontend: add build.sh by @tozka in #1807
- frontend: fix npm lint warnings by @DeltaMichael in #1808
- frontend: increase the amount of resources for build in cicd by @murphp15 in #1931
- frontend: prepare for official release shared and dp libs by @gorankokin in #1795
- frontend: publish docker image for ui by @DeltaMichael in #1872
- frontend: remove unused config in helm chart for frontend dns by @murphp15 in #1932
- frontend: Stabilization for e2e tests by @gorankokin in #1876
- frontend: Auth configurations organized by @ivakoleva in #1957
- frontend: change history link in data job by @gorankokin in #1884
- specs: VEP-1739 Update status and reorganise document by @doks5 in #1857
- specs: VEP-1739 updated API section by @mivanov1988 in #1882
- specs: update Multiple Python versions VEP summary by @tozka in #1792
- vdk-vep: update vep status by @dakodakov in #1951
- vdk-cicd: apply limit ranges for storage by @tozka in #1815
- vdk-cicd: set ephemeral storage request/limits by @tozka in #1813
- vdk-control-cli: fix circular import dependecy by @tozka in #1820
- vdk-control-cli: refactor output printing with printer class by @tozka in #1819
- vdk-control-cli: use assert_click_status by @tozka in #1817
- vdk-control-cli: use common output printer by @tozka in #1852
- vdk-control-cli: vdk list -mmm to return executions by @tozka in #1818
- vdk-control-service: publish python client library by @dakodakov in #1934
- vdk-dags: improve DAGs user-facing documentation by @yonitoo in #1892
- vdk-gdp-execution-id: a Generative Data Pack expanding with execution ID by @ivakoleva in #1877
- vdk-gdp-execution-id: import fix by @ivakoleva in #1961
- vdk-github-workflows: ubuntu latest update by @ivakoleva in #1943
- vdk-jupyter: UI test enhancements by @duyguHsnHsn in #1783
- vdk-jupyter: add UI vdk cell marks by @duyguHsnHsn in #1891
- vdk-jupyter: job run messages by @duyguHsnHsn in #1908
- vdk-jupyter: remove react-test-renderer package from package.json by @duyguHsnHsn in #1881
- vdk-lineage: support for latest version sqllineage library by @tozka in #1816
- vdk-meta-jobs: Meta Jobs DAG validation by @yonitoo in #1785
- vdk-meta-jobs: add DAG with args example by @yonitoo in #1859
- vdk-meta-jobs: add some configurable variable references in the VEP by @yonitoo in #1794
- vdk-meta-jobs: exec job with arguments by @yonitoo in #1839
- vdk-meta-jobs: fix DAG image in example by @yonitoo in #1920
- vdk-meta-jobs: improve DAGs code documentation by @yonitoo in #1873
- vdk-metajobs: Deprecate plugin by @gageorgiev in #1930
- vdk-notebook: add hook for saving error information into json file by @duyguHsnHsn in https://github.com/vmware/versatile-d...