Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#69: partitioning check function on DB #109

Merged
merged 38 commits into from
Feb 19, 2024

Conversation

lsulak
Copy link
Collaborator

@lsulak lsulak commented Nov 2, 2023

Closes #69

This PR implements Partitioning validation on the DB side. The following validations were implemented:

  1. Correct structure of the input JSONB object
  2. The list of keys in 'keys' is unique and doesn't have NULLs
  3. The keys in 'keys' and in 'keysToValuesMap' correspond to each other
  4. (if i_is_pattern = false) The values in 'keysToValuesMap' are non-empty/non-null

The following were not implemented:

  1. Parent partitioning related checks - we discovered that the parent partitioning not always has to have keys related to its 'child', in fact these can be completely unrelated - for example, imagine that you have datasets A and B, and you are joining them together to create dataset C - obviously there might be no relationship between datasets A and B, thus partitioning would be defined differently for those two, but they would be 'parent' partitionings in relation to C - which might have different partitioning than A and B.
  2. Case sensitivity & whitespace normalization - considering that the values for partitioning (here I mean both keys and values as strings) can be used from some file/object store (HDFS, S3, ... ) or perhaps from some metastore (Glue Catalog), most of these things are case sensitive, thus I wanted us to also be case sensitive here - besides, the create_partitioning_if_not_exist DB function does not do any normalization, so the check also shouldn't; if we decide to do it, then on both places. (cc @benedeki)
  3. DB function create_partitioning_if_not_exist is not aware of our plans to be able to have partitioning patterns - I still don't know exactly how we are gonna create and use them, so I didn't introduce such parameter / logic there yet. However, since we know that we'll have it, and already talked about the distinction between that and an actual (non-pattern) partitioning (i.e. in patterns values can be NULLs/empty strings and in non-patterns values can't be NULLs/empty strings), I already implemented all needed in the partitioning validation checks. (cc @benedeki)

If you want to test this, just deploy it to your localhost PG DB and run some of this:

select * from runs._validate_partitioning('{
    "keys": ["one", "two", "three"], 
    "version": 1,
    "keysToValuesMap": {
        "one": "John", 
        "two": "Q",
	"three": null
    }
}'::jsonb, true) 

and

select * from runs._is_partitioning_valid('{
    "keys": ["one", "two", "three"], 
    "version": 1,
    "keysToValuesMap": {
        "one": "John", 
        "two": "Q",
	"three": " w "
    }
}'::jsonb) 

@lsulak lsulak self-assigned this Nov 2, 2023
Copy link

github-actions bot commented Nov 2, 2023

JaCoCo agent module code coverage report - spark:2 - scala 2.12.12

There is no coverage information present for the Files changed

Total Project Coverage 83.26% 🍏

Copy link

github-actions bot commented Nov 2, 2023

JaCoCo server module code coverage report - scala 2.12.12

There is no coverage information present for the Files changed

Total Project Coverage 17.17%

@benedeki benedeki added the work in progress Work on this item is not yet finished (mainly intended for PRs) label Nov 8, 2023
Copy link

github-actions bot commented Nov 10, 2023

JaCoCo agent module code coverage report - spark:2 - scala 2.12.18

Overall Project NaN% NaN% 🍏

There is no coverage information present for the Files changed

Copy link

github-actions bot commented Nov 10, 2023

JaCoCo server module code coverage report - scala 2.12.18

File Coverage [0%]
PartitioningForDB.scala 0%
Total Project Coverage 0%

@lsulak lsulak marked this pull request as ready for review December 20, 2023 10:21
@benedeki
Copy link
Contributor

@salamonpavel
I am not against early app level validation. I actually support it. Reason why I want this structure verified also on DB level:

  • DB might need to work with this structure in future. so it's not an "abstract blob" for the DB logic (unlike several other JSON fields in the current structure, which are not verified)
  • like s REST sever shouldn't trust, that a client is sending correct data, so does the DB API expect an "enemy" on the other side 😉
  • performance: that's a valid concern, therefore the check is done only on relatively rare partioning write to the DB, not in any other cases, where the type is involved

@salamonpavel
Copy link
Collaborator

salamonpavel commented Jan 18, 2024

@salamonpavel I am not against early app level validation. I actually support it. Reason why I want this structure verified also on DB level:

* DB might need to work with this structure in future. so it's not an "abstract blob" for the DB logic (unlike several other JSON fields in the current structure, which are not verified)

* like s REST sever shouldn't trust, that a client is sending correct data, so does the DB API expect an "enemy" on the other side 😉

* performance: that's a valid concern, therefore the check is done only on relatively rare _partioning_ write to the DB, not in any other cases, where the type is involved

Ok. Let's make sure though that the application level validations and db level validations are always in sync. Especially if we expose the json schema to third parties it shouldn't validate data that the db would reject.

@salamonpavel
Copy link
Collaborator

@benedeki @lsulak

Given that we plan to perform data validation within our application, I believe it would be beneficial to adopt a 'fail-fast' approach at the database level. This would allow us to quickly identify and reject any invalid data, thereby enhancing the efficiency of our operations.

I absolutely understand and appreciate your commitment to ensuring the integrity of our data. However, I've noticed that our SQL code currently includes tasks such as beautifying JSONs and composing messages. While these tasks are undoubtedly important, I feel they would be better suited to our Scala codebase.

SQL excels at data manipulation and retrieval, but it might not be the best tool for tasks that involve complex string manipulation or error handling. By moving these tasks to our Scala code, we can make our SQL code simpler and more focused on its core responsibilities.

Therefore, I propose that we simplify our SQL code to perform validation checks and return a boolean value. If the validation fails, we can immediately halt the operation ('fail-fast'). Any additional handling, such as generating error messages or beautifying JSONs, can be managed in our Scala application code.

I believe this approach will allow us to leverage the strengths of both SQL and Scala, leading to more maintainable and efficient code.

@lsulak
Copy link
Collaborator Author

lsulak commented Jan 22, 2024

Hi @salamonpavel and @benedeki apologies for late response.

My take on your thoughts and comments:

  • I'm not against server-side validation by something like JSON Schema. It clearly has a lot of benefits as you outlined & I not only support it but also would like to implement it at some point. The thing is that I feel like it's too early for it now. I propose to wait a bit until we have at least 1-2 or 3 more validations like this. I don't want to introduce a technology/tool for 1 field check just yet, however cool and trivial it might be to embed into the server.
  • I absolutely think that partitioning is the central-piece (or one of them) of the whole Atum (data-wise) - it's, after all, the unique identification of a dataset. Therefore I wanted to have 100% certainty that nothing wrong will be inserted into the DB. Therefore, I want our solution to be really robust against 'bad' / old / misconfigured / invalid requests/data coming from:
    • client - that can be not just scala's Agent but also Postman or cURL requests etc (but I know, validation on server side can prevent this)
    • server - more like a paranoia / improbable edge point, I know, because there will be just 1 server, managed on our side, but not 100% of the communication with DB will be initiated via the server, see my next point
    • direct queries on DB - migration scripts, perhaps some manual work - if this check I implemented here wouldn't exist, there might be cases where invalid partitioning is inserted into the DB - there is probably absolutely no valid argument why to drop these functions completely
  • as David mentioned, it might be that some other DB functions might work with one of this functions in the future

From @salamonpavel:

However, I've noticed that our SQL code currently includes tasks such as beautifying JSONs ...

No longer this is the case

... and composing messages.

I think this might be perhaps dropped and the SQL function can be simplified to return only BOOL, perhaps with status codes if needed - I'll think about the future usecases of the main validation function - if I won't caome up with any, I'll drop it

benedeki
benedeki previously approved these changes Feb 1, 2024
Copy link

github-actions bot commented Feb 2, 2024

JaCoCo server module code coverage report - scala 2.13.11

Overall Project 65.14% -0.47%
Files changed 76%

File Coverage
PartitioningForDB.scala 100% -14.29%

benedeki
benedeki previously approved these changes Feb 6, 2024
* limitations under the License.
*/

ALTER TABLE runs.partitionings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is why I don't like the migration scripts. Unclear end state. 😞

Copy link

github-actions bot commented Feb 12, 2024

JaCoCo agent module code coverage report - scala 2.12.18

Overall Project NaN% NaN% 🍏

There is no coverage information present for the Files changed

Copy link

github-actions bot commented Feb 12, 2024

JaCoCo model module code coverage report - scala 2.12.18

Overall Project NaN% NaN% 🍏

There is no coverage information present for the Files changed

@lsulak lsulak merged commit 50968e0 into master Feb 19, 2024
10 checks passed
@lsulak lsulak deleted the feature/69-partitioning-validation-db-functions branch February 19, 2024 10:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create (immutable) functions that will work with and verify the JSON that represents partitioining
3 participants