Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KYUUBI #6754][AUTHZ] Improve the performance of Ranger access requests deduplication #6758

Closed
wants to merge 12 commits into from

Conversation

wankunde
Copy link
Contributor

@wankunde wankunde commented Oct 18, 2024

🔍 Description

Issue References 🔗

This pull request fixes #6754

Describe Your Solution 🔧

Right now in RuleAuthorization we use an ArrayBuffer to collect access requests, which is very slow because each new PrivilegeObject needs to be compared with all access requests.

Types of changes 🔖

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Test Plan 🧪

Behavior Without This Pull Request ⚰️

Add benchmark
Before

Java HotSpot(TM) 64-Bit Server VM 17.0.12+8-LTS-286 on Mac OS X 14.6
Apple M3
Collecting files ranger access request:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
50000 files benchmark                            181863         189434         NaN         -0.0 -181863368958.0       1.0X

Behavior With This Pull Request 🎉

After

Java HotSpot(TM) 64-Bit Server VM 17.0.12+8-LTS-286 on Mac OS X 14.6
Apple M3
Collecting files ranger access request:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
50000 files benchmark                              1281           1310          33         -0.0 -1280563000.0       1.0X

Related Unit Tests

Exists UT


Checklist 📝

Be nice. Be informative.

extensions/spark/kyuubi-spark-authz/pom.xml Outdated Show resolved Hide resolved
extensions/spark/kyuubi-spark-authz/pom.xml Outdated Show resolved Hide resolved
@@ -34,23 +35,22 @@ case class RuleAuthorization(spark: SparkSession) extends Authorization(spark) {
val auditHandler = new SparkRangerAuditHandler
val ugi = getAuthzUgi(spark.sparkContext)
val (inputs, outputs, opType) = PrivilegesBuilder.build(plan, spark)
val requests = new ArrayBuffer[AccessRequest]()
var requests = new HashMap[(AccessResource, AccessType), AccessRequest]()
Copy link
Member

@pan3793 pan3793 Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why immutable.HashMap instead of mutable.HashMap is used here?

BTW, Scala 2.13 changed the alias of HashMap from mutable.HashMap to immutable.HashMap, to make the code clear, please write new immutable.HashMap or new mutable.HashMap inline

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and please leave a few comments here to explain the background

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and, was immutable.Map.newBuilder considered?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Change to mutable.HashMap
  2. Added some comments
  3. I'm not sure if immutable.Map.newBuilder is better than mutable.HashMap?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using mutable.LinkedHashSet[AccessRequest] as the type for requests?

  1. it maintains the original adding order
  2. it guarantees the element's uniqueness reliing on the hashCode
  3. good for element looping as fast as the array

And make sure related methods of AccessRequest properly implemented. Add some unit test for assuring it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The uniqueness is guaranteed by hashCode, which could be told from its name HashSet of LinkedHashSet. Not much overhead cost for maintaining the order inside, just adding a reference to a linked list. And the linked list perform well in looping elements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean mutable.LinkedHashSet[AccessRequest] may be slower than HashSet + ArrayBuffer in our case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the unique key is NOT the AccessRequest itself but part of properties, the combination of ArrayBuffer with a HashSet looks fine to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean mutable.LinkedHashSet[AccessRequest] may be slower than HashSet + ArrayBuffer in our case.

No, it wont be slowed down, as LinkedHashSet maintaining both HashSet and Linked list inside.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the unique key is NOT the AccessRequest itself but part of properties, the combination of ArrayBuffer with a HashSet looks fine to me.

Noticed. But the HashSet still reily on the same logic with hashCode inside. And no proper customed method of either AccessRequest or its parent class.

}

// scalastyle:on
test("KYUUBI #6754: improve the performance of ranger access requests") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

benchmark cases are not expected to run in regular test workflow, I remember we ported Spark benchmark framework, can we write it in that style?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added benchmark, thanks

@wankunde wankunde changed the title [KYUUBI #6754] Improve the performance of ranger access requests [WIP][KYUUBI #6754] Improve the performance of ranger access requests Oct 20, 2024
@wankunde wankunde changed the title [WIP][KYUUBI #6754] Improve the performance of ranger access requests [KYUUBI #6754] Improve the performance of ranger access requests Oct 20, 2024
@wankunde wankunde changed the title [KYUUBI #6754] Improve the performance of ranger access requests [WIP][KYUUBI #6754] Improve the performance of ranger access requests Oct 20, 2024
@wankunde wankunde changed the title [WIP][KYUUBI #6754] Improve the performance of ranger access requests [KYUUBI #6754] Improve the performance of ranger access requests Oct 20, 2024
@codecov-commenter
Copy link

codecov-commenter commented Oct 20, 2024

Codecov Report

Attention: Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (2d64255) to head (9d7d196).
Report is 32 commits behind head on master.

Files with missing lines Patch % Lines
.../plugin/spark/authz/ranger/RuleAuthorization.scala 0.00% 4 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff           @@
##           master   #6758   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files         684     684           
  Lines       42282   42289    +7     
  Branches     5767    5769    +2     
======================================
- Misses      42282   42289    +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@pan3793 pan3793 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, lgtm, only a few comments

Copy link
Contributor

@bowenliang123 bowenliang123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the contribution.

@bowenliang123 bowenliang123 changed the title [KYUUBI #6754] Improve the performance of ranger access requests [KYUUBI #6754][AUTHZ] Improve the performance of Ranger access requests deduplication Oct 21, 2024
@bowenliang123 bowenliang123 added this to the v1.10.0 milestone Oct 21, 2024
@bowenliang123
Copy link
Contributor

Thanks for your contribution! Merged to master (1.10.0) .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Improvement] Improve the performance of ranger access requests
4 participants