-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-48530][SQL] Support for local variables in SQL Scripting #49445
base: master
Are you sure you want to change the base?
[SPARK-48530][SQL] Support for local variables in SQL Scripting #49445
Conversation
…e ResolveCatalogs to support local vars, also to throw error when creating qualified local vars
…e paths in resolvecatalogs
…th CREATE/SET/DROP.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/scripting/SqlScriptingLocalVariableManager.scala
Outdated
Show resolved
Hide resolved
…ate comments in var manager
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/command/v2/CreateVariableExec.scala
Show resolved
Hide resolved
@cloud-fan Another question regarding ambiguity. Consider the following situation:
Here we could either resolve cc @srielau |
sql/core/src/main/scala/org/apache/spark/sql/execution/command/v2/SetVariableExec.scala
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/command/v2/SetVariableExec.scala
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/command/v2/V2CommandStrategy.scala
Outdated
Show resolved
Hide resolved
@dusantism-db we have the same problem in column resolution: is For variables: if cc @srielau as well |
Thanks for the input, this actually is the way the current implementation behaves. |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/command/v2/SetVariableExec.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/command/v2/SetVariableExec.scala
Outdated
Show resolved
Hide resolved
Considering the size of this PR, and its complexity, I propose that we split some further improvements into followup PRs. Handling |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala
Outdated
Show resolved
Hide resolved
…p variable in scripts
…on, and update FOR to use EXECUTE IMMEDIATE instead of DROP
@cloud-fan I've made changes according to your comments, forbade use of DROP TEMPORARY VARIABLE in scripts per offline discussion, updated API of VariableManager and updated FOR to use EXECUTE IMMEDIATE instead of DROP VAR. PTAL, thanks. |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala
Show resolved
Hide resolved
.filterNot(_ => AnalysisContext.get.isExecuteImmediate) | ||
// If variable name is qualified with session.<varName> treat it as a session variable. | ||
.filter(_ => | ||
nameParts.length <= 2 && nameParts.init.map(_.toLowerCase(Locale.ROOT)) != Seq("session")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we reuse the function that validates the label name? We can do something like
.filterNot(nameParts => nameParts.length > 2 || (naneParts.length == 2 && !isValidLabelName(naneParts.head)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The validSessionVariableName
function will return different results depending on nameParts length. In your case we only call it when length is 2, which is correct behavior, but a little confusing in my opinion. The way it is now is easier to understand IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's OK to have different tastes about the code style, but this is about functionality. The intention here is we skip looking up local variables if the name can never be local variables. We should check all the invalid label name patterns.
|
||
object SqlScriptingLabelContext { | ||
private val forbiddenLabelNames: immutable.Set[Regex] = | ||
immutable.Set("builtin".r, "session".r, "sys.*".r) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will it match *builtin*
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it will only match the string "builtin". Isn't that the correct way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super familiar with the scala regex API. If you have verified that it matches builtin
literally, we are good.
What changes were proposed in this pull request?
This pull request introduces support for local variables in SQL scripting.
Behavior:
Local variables are declared in the headers of compound bodies, and are bound to it's scope. Variables of the same name are allowed in nested scopes, where the innermost variable will be resolved. Optionally, a local variable can be qualified with the label of the compound body in which it was declared, which would allow accessing variables which are not the innermost in the current scope.
Local variables have resolution priority over session variables, session variable resolution is attempted after local variable resolution. The exception to this is with fully qualified session variables, in the format
system.session.<varName>
orsession.<varName>
. System and session are forbidden for use as compound body labels.Local variables must not be qualified on declaration, can be set using
SET VAR
and cannot beDROPPED
.They also should not be allowed to be declared with
DECLARE OR REPLACE
, however this is not implemented on this PR asFOR
statement relies on this behavior.FOR
statement must be updated in a separate PR to use proper local variables, as the current implementation is simulating them using session variables.Implementation notes:
As core depends on catalyst, it's impossible to import code from core(where most of SQL scripting implementation is located) to catalyst. To solve this a trait
VariableManager
is introduced, which is then implemented in core and injected to catalyst. ThisVariableManager
is basically a wrapper aroundSqlScriptingExecutionContext
and provides methods for getting/setting/creating variables.This injection is tricky because we want to have one
ScriptingVariableManager
per script.Options considered to achieve this are:
EXECUTE IMMEDIATE
we could simply not pass in the script context and it would behave as if outside of a script. This is the intended behavior forEXECUTE IMMEDIATE
. The problem with this approach is it seems hard to implement. The call stack would be as follows:Analyzer.executeAndCheck
->HybridAnalyzer.apply
->RuleExecutor.executeAndTrack
->Analyzer.execute
(overridden from RuleExecutor) ->Analyzer.withNewAnalysisContext
. Implementing this context propagation would require changing the signatures of all of these methods, including superclass methods likeexecute
andexecuteAndTrack
.CatalogManager
.CatalogManager's
lifetime is tied to the session, so to allow for multiple scripts to execute in the same time we would need to e.g. have a mapscriptUUID -> VariableManager
, and to have thescriptUUID
as aThreadLocal
variable in theCatalogManager
. The drawback of this approach is that the script has to clean up it's resources after execution, and also that it's more complicated to e.g. forbidEXECUTE IMMEDIATE
from accessing local variables.Currently the second option seems better to me, however I am open to suggestions on how to approach this.
Why are the changes needed?
Currently, local variables are simulated using session variables in SQL scripting, which is a temporary solution and bad in many ways.
Does this PR introduce any user-facing change?
Yes, this change introduces multiple new types of errors.
How was this patch tested?
Tests were added to SqlScriptingExecutionSuite and SqlScriptingParserSuite.
Was this patch authored or co-authored using generative AI tooling?
No.