Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48530][SQL] Support for local variables in SQL Scripting #49445

Open
wants to merge 56 commits into
base: master
Choose a base branch
from

Conversation

dusantism-db
Copy link
Contributor

@dusantism-db dusantism-db commented Jan 10, 2025

What changes were proposed in this pull request?

This pull request introduces support for local variables in SQL scripting.

Behavior:

Local variables are declared in the headers of compound bodies, and are bound to it's scope. Variables of the same name are allowed in nested scopes, where the innermost variable will be resolved. Optionally, a local variable can be qualified with the label of the compound body in which it was declared, which would allow accessing variables which are not the innermost in the current scope.

Local variables have resolution priority over session variables, session variable resolution is attempted after local variable resolution. The exception to this is with fully qualified session variables, in the format system.session.<varName> or session.<varName>. System and session are forbidden for use as compound body labels.

Local variables must not be qualified on declaration, can be set using SET VAR and cannot be DROPPED.

They also should not be allowed to be declared with DECLARE OR REPLACE, however this is not implemented on this PR as FOR statement relies on this behavior. FOR statement must be updated in a separate PR to use proper local variables, as the current implementation is simulating them using session variables.

Implementation notes:

As core depends on catalyst, it's impossible to import code from core(where most of SQL scripting implementation is located) to catalyst. To solve this a trait VariableManager is introduced, which is then implemented in core and injected to catalyst. This VariableManager is basically a wrapper around SqlScriptingExecutionContext and provides methods for getting/setting/creating variables.

This injection is tricky because we want to have one ScriptingVariableManager per script.
Options considered to achieve this are:

  • Pass the manager/context to the analyzer using function calls. If possible, this solution would be ideal because it would allow every run of the analyzer to have it's own scripting context which is automatically cleaned up (AnalysisContext). This would also allow more control over the variable resolution, i.e. for EXECUTE IMMEDIATE we could simply not pass in the script context and it would behave as if outside of a script. This is the intended behavior for EXECUTE IMMEDIATE. The problem with this approach is it seems hard to implement. The call stack would be as follows: Analyzer.executeAndCheck -> HybridAnalyzer.apply -> RuleExecutor.executeAndTrack -> Analyzer.execute (overridden from RuleExecutor) -> Analyzer.withNewAnalysisContext. Implementing this context propagation would require changing the signatures of all of these methods, including superclass methods like execute and executeAndTrack.
  • Store the context in CatalogManager. CatalogManager's lifetime is tied to the session, so to allow for multiple scripts to execute in the same time we would need to e.g. have a map scriptUUID -> VariableManager, and to have the scriptUUID as a ThreadLocal variable in the CatalogManager. The drawback of this approach is that the script has to clean up it's resources after execution, and also that it's more complicated to e.g. forbid EXECUTE IMMEDIATE from accessing local variables.

Currently the second option seems better to me, however I am open to suggestions on how to approach this.

Why are the changes needed?

Currently, local variables are simulated using session variables in SQL scripting, which is a temporary solution and bad in many ways.

Does this PR introduce any user-facing change?

Yes, this change introduces multiple new types of errors.

How was this patch tested?

Tests were added to SqlScriptingExecutionSuite and SqlScriptingParserSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jan 10, 2025
@dusantism-db
Copy link
Contributor Author

@cloud-fan Another question regarding ambiguity. Consider the following situation:

BEGIN
  x: BEGIN
    DECLARE x = named_struct('x', 1);
    SELECT x.x;
  END;
END

Here we could either resolve x.x as label.varName, and return the named struct, or we could resolve x.x as field x of struct x. In the former scenario, we could access the field as x.x.x.

cc @srielau

@cloud-fan
Copy link
Contributor

cloud-fan commented Feb 7, 2025

@dusantism-db we have the same problem in column resolution: is x.x a column x of table x, or an inner field x inside column x? The solution is to define priority: prefer the longest qualifier. Which means "column x of table x" is preferred. See https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L311-L313

For variables: if x is a label name, then x.x means variable x under label x. Otherwise (no such label found), it's field x within variable x.

cc @srielau as well

@dusantism-db
Copy link
Contributor Author

@dusantism-db we have the same problem in column resolution: is x.x a column x of table x, or an inner field x inside column x? The solution is to define priority: prefer the longest qualifier. Which means "column x of table x" is preferred. See https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L311-L313

For variables: if x is a label name, then x.x means variable x under label x. Otherwise (no such label found), it's field x within variable x.

cc @srielau as well

Thanks for the input, this actually is the way the current implementation behaves.

@dusantism-db
Copy link
Contributor Author

Considering the size of this PR, and its complexity, I propose that we split some further improvements into followup PRs. Handling session and system variable names could be done in a followup PR, either by forbidding them or determining their behavior. Also, hiding generated label names in error messages could be done in a followup PR as well.

@dusantism-db
Copy link
Contributor Author

@cloud-fan I've made changes according to your comments, forbade use of DROP TEMPORARY VARIABLE in scripts per offline discussion, updated API of VariableManager and updated FOR to use EXECUTE IMMEDIATE instead of DROP VAR. PTAL, thanks.

.filterNot(_ => AnalysisContext.get.isExecuteImmediate)
// If variable name is qualified with session.<varName> treat it as a session variable.
.filter(_ =>
nameParts.length <= 2 && nameParts.init.map(_.toLowerCase(Locale.ROOT)) != Seq("session"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we reuse the function that validates the label name? We can do something like

.filterNot(nameParts => nameParts.length > 2 || (naneParts.length == 2 && !isValidLabelName(naneParts.head)))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validSessionVariableName function will return different results depending on nameParts length. In your case we only call it when length is 2, which is correct behavior, but a little confusing in my opinion. The way it is now is easier to understand IMO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's OK to have different tastes about the code style, but this is about functionality. The intention here is we skip looking up local variables if the name can never be local variables. We should check all the invalid label name patterns.


object SqlScriptingLabelContext {
private val forbiddenLabelNames: immutable.Set[Regex] =
immutable.Set("builtin".r, "session".r, "sys.*".r)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will it match *builtin*?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it will only match the string "builtin". Isn't that the correct way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super familiar with the scala regex API. If you have verified that it matches builtin literally, we are good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants