Method does not exist when calling Check.satisfies() in Pydeequ 1.1.1 / Deequ 2.0.4 #160

Closed
eolvwa opened this issue Oct 3, 2023 · 3 comments
Labels
bug Something isn't working dependencies Pull requests that update a dependency file

Comments

eolvwa commented Oct 3, 2023

PyDeequ's Check.satisfies method appears to be incompatible with Deequ 2.0.4. That release of Deequ adds a new optional columns parameter to Check.satisfies() (per PR 478), and PyDeequ 1.1.1 still calls the method with the old four-argument list.
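
The failure mode can be illustrated without a JVM. Py4J resolves JVM methods by name and argument list, and Scala default arguments are not filled in for reflective callers, so a four-argument call stops matching once the method declares five parameters. The sketch below is a toy model only: `FakeJvmCheck`, `MethodNotFound`, and the arity table are hypothetical stand-ins, not Py4J or Deequ code.

```python
class MethodNotFound(Exception):
    """Stand-in for the Py4JError raised when no JVM method matches."""


class FakeJvmCheck:
    """Toy model of a JVM object whose methods are resolved by name and arity."""

    def __init__(self, satisfies_arity):
        # Number of parameters the hypothetical JVM-side `satisfies` declares.
        self._arities = {"satisfies": satisfies_arity}

    def call(self, name, *args):
        # No default-argument filling happens here: the count must match exactly.
        if self._arities.get(name) != len(args):
            raise MethodNotFound(f"Method {name}({len(args)} args) does not exist")
        return f"{name} called with {len(args)} args"


old_check = FakeJvmCheck(satisfies_arity=4)  # models Deequ before the columns parameter
new_check = FakeJvmCheck(satisfies_arity=5)  # models Deequ 2.0.4

four_args = ("Field == 1", "Field", "assertion", "hint")
result_old = old_check.call("satisfies", *four_args)  # matches the old signature

try:
    new_check.call("satisfies", *four_args)  # same call against the new signature
    failed = False
except MethodNotFound:
    failed = True
```

In this model the only way to make the second call succeed is to pass the fifth argument explicitly, which is the approach the workaround further down takes.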

To Reproduce
Run the following code:

# Assumes `spark` is an active SparkSession.
from pydeequ.checks import Check, CheckLevel

check = Check(spark, CheckLevel.Warning, "Test")
check.satisfies("Field == 1", "Field", lambda x: x == 1, "Field should be equal to 1")

Expected behavior
A new compliance constraint is added to the check.

Actual behavior
Py4J reports a missing method:

Py4JError: An error occurred while calling o31.satisfies. Trace:
py4j.Py4JException: Method satisfies([class java.lang.String, class java.lang.String, class com.sun.proxy.$Proxy16, class scala.Some]) does not exist

Versions

  • Spark: 3.0.1
  • PyDeequ: 1.1.1
  • Deequ: 2.0.4 (deequ-2.0.4-spark-3.3)

Additional context
As a workaround, it looks like one can monkey-patch Check.satisfies at runtime. I'm not very familiar with Py4J, PyDeequ, or Deequ, so I'm not sure this is the best long-term solution:

# Assumes `spark` is in scope and is the current Spark session.
from pydeequ.checks import Check, CheckLevel
from pydeequ.scala_utils import ScalaFunction1, to_scala_seq
from pydeequ.verification import VerificationResult, VerificationSuite


# Modified version of Check.satisfies that also passes the new columns argument.
def new_satisfies(self, columnCondition, constraintName, assertion=None, hint=None):
    assertion_func = (
        ScalaFunction1(self._spark_session.sparkContext._gateway, assertion)
        if assertion
        else getattr(self._Check, "satisfies$default$2")()
    )
    hint = self._jvm.scala.Option.apply(hint)
    # Build an empty Scala List for the new parameter: convert the Python list
    # to a Scala Seq, then call TraversableOnce.toList on it.
    cols = to_scala_seq(self._jvm, [])
    to_list = getattr(self._jvm.scala.collection.TraversableOnce, "toList$")
    cols = to_list(cols)
    self._Check = self._Check.satisfies(columnCondition, constraintName, assertion_func, hint, cols)
    return self


# Install the patch.
Check.satisfies = new_satisfies

# Try it out.
check = Check(spark, CheckLevel.Warning, "Test")
check = check.satisfies("X == 2", "X", lambda x: x == 1, "X")

df = spark.createDataFrame([{"X": 3}, {"X": 2}])
res = (
    VerificationSuite(spark)
    .onData(df)
    .addCheck(check)
    .run()
)
VerificationResult.successMetricsAsDataFrame(spark, res).show()
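
A patch like the one above hard-codes the five-argument call, so it would break again against an older Deequ jar. One possible variation, sketched here under assumptions, is to try the newer signature first and fall back to the older one. `call_satisfies` and the two fake classes are hypothetical names used only for illustration; with a real Py4J object the error to catch would presumably be `py4j.protocol.Py4JError`, whereas the plain-Python fakes raise `TypeError`.

```python
def call_satisfies(jcheck, condition, name, assertion, hint, columns, error_type=Exception):
    """Try the five-argument form first, then fall back to the four-argument form."""
    try:
        return jcheck.satisfies(condition, name, assertion, hint, columns)
    except error_type:
        # The target does not accept `columns`; retry with the older argument list.
        return jcheck.satisfies(condition, name, assertion, hint)


class FakeOldCheck:
    """Hypothetical stand-in for a Check object without the columns parameter."""

    def satisfies(self, condition, name, assertion, hint):
        return ("old", condition, name)


class FakeNewCheck:
    """Hypothetical stand-in for a Check object with the columns parameter."""

    def satisfies(self, condition, name, assertion, hint, columns):
        return ("new", condition, name, columns)


old_result = call_satisfies(FakeOldCheck(), "X == 2", "X", None, None, [], error_type=TypeError)
new_result = call_satisfies(FakeNewCheck(), "X == 2", "X", None, None, [], error_type=TypeError)
```

Catching a broad error type and retrying can hide unrelated failures, so a narrower check (for example, inspecting the error message for "does not exist") may be preferable in practice.
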

chenliu0831 (Contributor) commented

Thanks for reporting. PyDeequ should be changed to fix this. This case is likely not covered by the tests either.

@chenliu0831 chenliu0831 added the bug Something isn't working label Oct 4, 2023

chenliu0831 commented Oct 26, 2023

Actually I cannot reproduce this with SPARK_VERSION=3.3 and com.amazon.deequ:deequ:2.0.3-spark-3.3.

Because of breaking API changes on the Scala side, Deequ 2.0.4 is currently not supported by PyDeequ. We will discuss internally how to fix those issues in the Scala code.
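
Until a fix lands, pinning the combination reported as working here is one option. The following is a minimal sketch, assuming the Maven coordinate and SPARK_VERSION value quoted in this comment; `pinned_spark_settings` and `build_session` are hypothetical helper names, and the session builder needs pyspark installed.

```python
import os

# Coordinate quoted in this thread as working with SPARK_VERSION=3.3.
DEEQU_COORD = "com.amazon.deequ:deequ:2.0.3-spark-3.3"


def pinned_spark_settings(spark_version="3.3", deequ_coord=DEEQU_COORD):
    """Return the environment and Spark config that pin a known-good Deequ build."""
    return {
        "env": {"SPARK_VERSION": spark_version},
        "conf": {"spark.jars.packages": deequ_coord},
    }


def build_session(settings):
    """Create a SparkSession with the pinned jar (requires pyspark)."""
    # Set SPARK_VERSION before importing pydeequ, which reads it from the environment.
    os.environ.update(settings["env"])
    from pyspark.sql import SparkSession  # lazy import keeps the helper above dependency-free

    builder = SparkSession.builder.appName("pydeequ-pinned")
    for key, value in settings["conf"].items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


settings = pinned_spark_settings()
```

Calling `build_session(settings)` would then start a session that downloads the 2.0.3 jar instead of 2.0.4.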

@chenliu0831 chenliu0831 added the dependencies Pull requests that update a dependency file label Oct 26, 2023
chenliu0831 (Contributor) commented

This will be resolved in the next release, which will include #169.
