#69: partitioning check function on DB #109

lsulak · 2023-11-02T09:49:54Z

Closes #69

This PR implements Partitioning validation on the DB side. The following validations were implemented:

* Correct structure of the input JSONB object
* The list of keys in 'keys' is unique and doesn't have NULLs
* The keys in 'keys' and in 'keysToValuesMap' correspond to each other
* (if i_is_pattern = false) The values in 'keysToValuesMap' are non-empty/non-null
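
To make the four rules concrete, here is a minimal Python sketch of the same checks. It is not the PL/pgSQL code from this PR: the function name `validate_partitioning`, the error messages, and the assumption that the object has exactly the three top-level fields shown in the test queries are all made up for illustration.

```python
import json


def validate_partitioning(partitioning, is_pattern=False):
    """Return a list of error messages; an empty list means the input is valid.

    Hypothetical helper that mirrors the four listed rules; it is not the
    DB function implemented in this PR.
    """
    errors = []

    # Rule 1: structure. Assumption: exactly these three top-level fields.
    expected = {"keys", "version", "keysToValuesMap"}
    if not isinstance(partitioning, dict) or set(partitioning) != expected:
        return ["expected an object with exactly: " + ", ".join(sorted(expected))]
    keys = partitioning["keys"]
    mapping = partitioning["keysToValuesMap"]
    if not isinstance(keys, list) or not isinstance(mapping, dict):
        return ["'keys' must be an array and 'keysToValuesMap' an object"]

    # Rule 2: keys are unique and contain no NULLs.
    if not all(isinstance(k, str) for k in keys):
        errors.append("'keys' must contain only non-NULL strings")
    str_keys = [k for k in keys if isinstance(k, str)]
    if len(set(str_keys)) != len(str_keys):
        errors.append("'keys' must be unique")

    # Rule 3: both key sets correspond (compared case-sensitively, as-is).
    if set(str_keys) != set(mapping):
        errors.append("'keys' and 'keysToValuesMap' do not correspond")

    # Rule 4: only real (non-pattern) partitionings need non-empty values.
    if not is_pattern:
        empty = sorted(k for k, v in mapping.items() if v is None or v == "")
        if empty:
            errors.append("empty or NULL values for keys: " + ", ".join(empty))

    return errors


pattern = json.loads(
    '{"keys": ["one", "two", "three"], "version": 1,'
    ' "keysToValuesMap": {"one": "John", "two": "Q", "three": null}}'
)
print(validate_partitioning(pattern, is_pattern=True))
print(validate_partitioning(pattern, is_pattern=False))
```

The same input passes as a pattern and fails as a regular partitioning, because only the last rule depends on the pattern flag.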

The following were not implemented:

* Parent-partitioning checks - we discovered that a parent partitioning does not always have keys related to its 'child'; in fact, they can be completely unrelated. For example, imagine you have datasets A and B and join them to create dataset C. There may be no relationship between A and B, so their partitionings would be defined differently, yet both would be 'parent' partitionings of C, which might itself be partitioned differently from A and B.
* Case sensitivity & whitespace normalization - the partitioning keys and values (both are strings) may come from a file/object store (HDFS, S3, ...) or from a metastore (Glue Catalog), and most of these are case sensitive, so I wanted us to be case sensitive here as well. Besides, the create_partitioning_if_not_exists DB function does not do any normalization, so the check shouldn't either; if we decide to normalize, we should do it in both places. (cc @benedeki)
* Pattern support in create_partitioning_if_not_exists - this DB function is not yet aware of our plan to support partitioning patterns. I still don't know exactly how we are going to create and use them, so I didn't introduce such a parameter or logic there yet. However, since we know we'll have them and have already discussed how they differ from an actual (non-pattern) partitioning (in patterns, values can be NULLs/empty strings; in non-patterns, they can't), I have already implemented everything needed in the partitioning validation checks. (cc @benedeki)

If you want to test this, deploy it to your local PG DB and run queries like these:

select * from runs._validate_partitioning('{
    "keys": ["one", "two", "three"],
    "version": 1,
    "keysToValuesMap": {
        "one": "John",
        "two": "Q",
        "three": null
    }
}'::jsonb, true);

and

select * from runs._is_partitioning_valid('{
    "keys": ["one", "two", "three"],
    "version": 1,
    "keysToValuesMap": {
        "one": "John",
        "two": "Q",
        "three": " w "
    }
}'::jsonb);

github-actions · 2023-11-02T09:53:10Z

JaCoCo agent module code coverage report - spark:2 - scala 2.12.12

There is no coverage information present for the Files changed

Total Project Coverage	83.26%	🍏

github-actions · 2023-11-02T09:53:12Z

JaCoCo server module code coverage report - scala 2.12.12

There is no coverage information present for the Files changed

Total Project Coverage	17.17%	❌

github-actions · 2023-11-10T13:58:43Z

JaCoCo agent module code coverage report - spark:2 - scala 2.12.18

Overall Project	NaN% `NaN%`	🍏

There is no coverage information present for the Files changed

github-actions · 2023-11-10T13:58:44Z

JaCoCo server module code coverage report - scala 2.12.18

File	Coverage [0%]	❌
PartitioningForDB.scala	0%	❌

Total Project Coverage	0%	❌

benedeki · 2024-01-18T16:07:41Z

@salamonpavel
I am not against early app-level validation; I actually support it. The reasons why I want this structure verified on the DB level as well:

* The DB might need to work with this structure in the future, so it's not an "abstract blob" for the DB logic (unlike several other JSON fields in the current structure, which are not verified).
* Just as a REST server shouldn't trust that a client is sending correct data, the DB API should expect an "enemy" on the other side 😉
* Performance: that's a valid concern, which is why the check is done only on the relatively rare partitioning write to the DB, not in any of the other cases where the type is involved.

salamonpavel · 2024-01-18T18:37:40Z

Ok. Let's make sure, though, that the application-level and DB-level validations are always in sync. Especially if we expose the JSON Schema to third parties, it shouldn't accept data that the DB would reject.

salamonpavel · 2024-01-19T07:52:10Z

@benedeki @lsulak

Given that we plan to perform data validation within our application, I believe it would be beneficial to adopt a 'fail-fast' approach at the database level. This would allow us to quickly identify and reject any invalid data, thereby enhancing the efficiency of our operations.

I absolutely understand and appreciate your commitment to ensuring the integrity of our data. However, I've noticed that our SQL code currently includes tasks such as beautifying JSONs and composing messages. While these tasks are undoubtedly important, I feel they would be better suited to our Scala codebase.

SQL excels at data manipulation and retrieval, but it might not be the best tool for tasks that involve complex string manipulation or error handling. By moving these tasks to our Scala code, we can make our SQL code simpler and more focused on its core responsibilities.

Therefore, I propose that we simplify our SQL code to perform validation checks and return a boolean value. If the validation fails, we can immediately halt the operation ('fail-fast'). Any additional handling, such as generating error messages or beautifying JSONs, can be managed in our Scala application code.

I believe this approach will allow us to leverage the strengths of both SQL and Scala, leading to more maintainable and efficient code.
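
The split proposed above could look roughly like the following. This is a minimal Python sketch standing in for both the SQL and the Scala side, not code from this PR: `is_partitioning_valid_in_db`, `describe_problem`, `write_partitioning`, and `InvalidPartitioning` are hypothetical names, and the DB check is stubbed by a plain function instead of a real query.

```python
import json


class InvalidPartitioning(ValueError):
    """Raised by the application when the boolean check fails."""


def is_partitioning_valid_in_db(partitioning):
    # Stand-in for a DB function that returns only a boolean.
    # Assumption: the real check would run inside the database.
    keys = partitioning.get("keys")
    mapping = partitioning.get("keysToValuesMap")
    return (
        isinstance(keys, list)
        and isinstance(mapping, dict)
        and all(isinstance(k, str) for k in keys)
        and len(set(keys)) == len(keys)
        and set(keys) == set(mapping)
    )


def describe_problem(partitioning):
    # Message composition lives in application code, not in SQL.
    return "invalid partitioning: " + json.dumps(partitioning, sort_keys=True)


def write_partitioning(partitioning, store):
    # Fail fast: stop before any write when the boolean check fails.
    if not is_partitioning_valid_in_db(partitioning):
        raise InvalidPartitioning(describe_problem(partitioning))
    store.append(partitioning)
    return len(store)


store = []
write_partitioning({"keys": ["a"], "version": 1, "keysToValuesMap": {"a": "x"}}, store)
try:
    write_partitioning({"keys": ["a", "a"], "version": 1, "keysToValuesMap": {"a": "x"}}, store)
except InvalidPartitioning as err:
    print(err)
```

The trade-off is that a bare boolean tells the caller nothing about which rule failed, so the application has to re-derive any detail it wants to report.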

lsulak · 2024-01-22T09:58:46Z

Hi @salamonpavel and @benedeki, apologies for the late response.

My take on your thoughts and comments:

* I'm not against server-side validation with something like JSON Schema. It clearly has a lot of benefits, as you outlined, and I not only support it but would also like to implement it at some point. The thing is that I feel it's too early for it now. I propose to wait until we have at least two or three more validations like this. I don't want to introduce a technology/tool for a single field check just yet, however cool and trivial it might be to embed into the server.
* I absolutely think that partitioning is the central piece (or one of them) of the whole of Atum, data-wise; it is, after all, the unique identification of a dataset. I therefore want 100% certainty that nothing wrong will be inserted into the DB, so our solution should be really robust against 'bad' / old / misconfigured / invalid requests or data coming from:
  - client - which can be not just the Scala Agent but also Postman or cURL requests etc. (but I know, validation on the server side can prevent this)
  - server - more of a paranoid / improbable edge case, I know, because there will be just one server, managed on our side; but not 100% of the communication with the DB will be initiated via the server, see my next point
  - direct queries on the DB - migration scripts, perhaps some manual work; if the check I implemented here didn't exist, there might be cases where an invalid partitioning is inserted into the DB. There is probably no valid argument for dropping these functions completely.
* As David mentioned, some other DB functions might work with one of these functions in the future.

From @salamonpavel:

However, I've noticed that our SQL code currently includes tasks such as beautifying JSONs ...

This is no longer the case.

... and composing messages.

I think this could perhaps be dropped and the SQL function simplified to return only BOOL, perhaps with status codes if needed. I'll think about the future use cases of the main validation function; if I don't come up with any, I'll drop it.

github-actions · 2024-02-02T16:12:08Z

JaCoCo server module code coverage report - scala 2.13.11

Overall Project	65.14% `-0.47%`	❌
Files changed	76%	❌

File	Coverage
PartitioningForDB.scala	100% `-14.29%`	❌

benedeki · 2024-02-06T08:25:00Z

database/src/main/postgres/validation/V1.6.3.__add_partitioning_check_constraint.sql

+ * limitations under the License.
+ */
+
+ALTER TABLE runs.partitionings


This is why I don't like the migration scripts. Unclear end state. 😞

github-actions · 2024-02-12T14:25:59Z

JaCoCo agent module code coverage report - scala 2.12.18

Overall Project	NaN% `NaN%`	🍏

There is no coverage information present for the Files changed

github-actions · 2024-02-12T14:26:01Z

JaCoCo model module code coverage report - scala 2.12.18

Overall Project	NaN% `NaN%`	🍏

There is no coverage information present for the Files changed

#69: just an empty invalid function for now, preparing the ground

dfe60e1

lsulak self-assigned this Nov 2, 2023

Merge branch 'master' into feature/69-partitioning-validation-db-functions

9d2fe18

benedeki added the work in progress Work on this item is not yet finished (mainly intended for PRs) label Nov 8, 2023

lsulak added 2 commits November 8, 2023 15:06

Merge branch 'master' into feature/69-partitioning-validation-db-functions

7e4c677

Merge branch 'master' into feature/69-partitioning-validation-db-functions

29850fb

lsulak added 13 commits November 12, 2023 10:21

Merge branch 'master' into feature/69-partitioning-validation-db-functions

76e2065

Merge branch 'master' into feature/69-partitioning-validation-db-functions

a45c2ae

Merge branch 'master' into feature/69-partitioning-validation-db-functions

4046a45

Merge branch 'master' into feature/69-partitioning-validation-db-functions

a18349c

Merge branch 'master' into feature/69-partitioning-validation-db-functions

738c1c6

#69: partitioning validation improvements, almost finished implementation

91fb07e

#69: code doc improvements

2378964

#69: helper function

ba65312

#69: improving partitioning validation

9e4c308

#69: preparing ground for the high level partitioning check function

7ee4c22

#69: finishing the functions focused on partitioning validation

1a63bed

#69: tiny refactoring

d9f4bda

#69: using partitioning validation in create_partitioning_if_not_exists function

fbb417d

lsulak marked this pull request as ready for review December 20, 2023 10:21

lsulak requested review from benedeki, TebaleloS, Zejnilovic, dk1844 and salamonpavel as code owners December 20, 2023 10:21

#69: removing redundant comment, the intention is obvious from the code

3a1dd5a

lsulak added 2 commits January 16, 2024 17:01

#58: post-testing fixes

e48a9f6

#58: post-review changes / minor refactoring

b70e18d

lsulak added 2 commits January 25, 2024 17:13

#69: moving to a separate schema

91960c1

#69: tiny refactoring

c79da20

benedeki previously approved these changes Feb 1, 2024

lsulak added 2 commits February 2, 2024 15:34

Merge remote-tracking branch 'origin/master' into feature/69-partitioning-validation-db-functions

a32185b

post-merge fixes

739a833

lsulak dismissed benedeki’s stale review via 739a833 February 2, 2024 16:10

benedeki previously approved these changes Feb 6, 2024

lsulak added 2 commits February 9, 2024 17:10

Merge branch 'master' into feature/69-partitioning-validation-db-functions

117c7f4

#69: adding forgotten schema DDL definition

94eda46

lsulak dismissed benedeki’s stale review via 94eda46 February 12, 2024 09:32

lsulak added 5 commits February 12, 2024 10:40

#69: fixing for Flyway

32e7890

#69: fixing for Flyway

f87b926

#69: fixing DB unit tests

67b1d14

#69: adding unit tests for DB functions related to partitioning validation

bbf25a1

Merge branch 'master' into feature/69-partitioning-validation-db-functions

fad7d8a

lsulak added 2 commits February 13, 2024 17:03

Merge branch 'master' into feature/69-partitioning-validation-db-functions

cd6047b

Merge branch 'master' into feature/69-partitioning-validation-db-functions

45ad411

salamonpavel approved these changes Feb 16, 2024

lsulak merged commit 50968e0 into master Feb 19, 2024
10 checks passed

lsulak deleted the feature/69-partitioning-validation-db-functions branch February 19, 2024 10:47

lsulak mentioned this pull request Feb 23, 2024

defined and implemented get_partitioning_measures:#137 #157

Merged
