Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

splitSql too aggressive in Redshift for stored procedure definition #381

Open
MPagel opened this issue Dec 17, 2024 · 7 comments
Open

splitSql too aggressive in Redshift for stored procedure definition #381

MPagel opened this issue Dec 17, 2024 · 7 comments

Comments

@MPagel
Copy link

MPagel commented Dec 17, 2024

splitSql does not consider $$ or $LABEL$ to constitute valid code block toggles that are used in plpgsql (redshift and postgress).

e.g.

CREATE OR REPLACE PROCEDURE testproc (varchar(50)) AS $$
DECLARE
	i int DEFAULT 1;
BEGIN
	CREATE TABLE IF NOT EXISTS testtab (
			id INT NOT NULL DEFAULT i
			,data varchar(50) DEFAULT NULL
			)
	DISTSTYLE ALL;
	INSERT INTO testtab(id, data) VALUES(i,$1);
END; $$ LANGUAGE plpgsql;

is a valid single SQL statement

however splitSQL attempts to break the stored procedure into three separate statements
splitSql("CREATE OR REPLACE PROCEDURE testproc (varchar(50)) AS $$\nDECLARE\n\ti int DEFAULT 1;\nBEGIN\n\tCREATE TABLE IF NOT EXISTS testtab (\n\t\t\tid INT NOT NULL DEFAULT i\n\t\t\t,data varchar(50) DEFAULT NULL\n\t\t\t)\n\tDISTSTYLE ALL;\n\tINSERT INTO testtab(id, data) VALUES(i,$1);\nEND; $$ LANGUAGE plpgsql;\n")

[1] "CREATE OR REPLACE PROCEDURE testproc (varchar(50)) AS $$\nDECLARE\n\ti int DEFAULT 1"
[2] "BEGIN\n\tCREATE TABLE IF NOT EXISTS testtab (\n\t\t\tid INT NOT NULL DEFAULT i\n\t\t\t,data varchar(50) DEFAULT NULL\n\t\t\t)\n\tDISTSTYLE ALL;\n\tINSERT INTO testtab(id, data) VALUES(i,$1);\nEND;"
[3] "$$ LANGUAGE plpgsql"

If not using a DECLARE section, you get 2 total statements
[1] "CREATE OR REPLACE PROCEDURE testproc (varchar(50)) AS $$\nBEGIN\n\tCREATE TABLE IF NOT EXISTS testtab (\n\t\t\tid INT NOT NULL DEFAULT i\n\t\t\t,data varchar(50) DEFAULT NULL\n\t\t\t)\n\tDISTSTYLE ALL;\n\tINSERT INTO testtab(id, data) VALUES(1,$1);\nEND;"
[2] "$$ LANGUAGE plpgsql"

Then executing these leads to

com.amazon.redshift.util.RedshiftException: Unterminated dollar quote started at position nn in SQL CREATE OR REPLACE PROCEDURE ... i int DEFAULT 1;. Expected terminating $$

To test that it the original statement at top is valid, you can execute (using DBeaver, OmniDB, etc) that statement followed by

truncate testtab;
CALL testproc('abc');
select * from testtab;

id "data"
1 abc

insert into testtab values (2,'mno');
insert into testtab(data) values ('def');
insert into testtab(id) values (7);
select * from testtab;

id "data"
1 abc
2 mno
1 def
7 [NULL]

possibly relevant refs:
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_PROCEDURE.html
https://www.postgresql.org/docs/current/sql-createprocedure.html

@MPagel
Copy link
Author

MPagel commented Dec 17, 2024

In order to handle $$ (but not $LABEL$) properly, I think boolean doubleDollar = false; String doubleDollarText = "$$"; needs to be inserted before

for (cursor = start; cursor < tokens.size(); cursor++) {

then add an else block into the if statement that follows (L42)

else if (token.text.equals(doubleDollarText)) {
  if (doubleDollar) { // closing an open block
    lastPop = nestStack.pop();
  } else { // open a $$ block
    nestStack.push(token.text);
  }
  doubleDollar = !(doubleDollar); //toggle
}

This has not been tested, and I don't know if the BEGIN/CASE and END logic would need to be altered.

@MPagel
Copy link
Author

MPagel commented Dec 18, 2024

my workaround for now is to use a modified version of N3C's parse_sql function to do the splitting of one query from the next, rather than using DatabaseConnector::executeSQL. Then I overrode executeSQL by adding a noSplit with default value FALSE to the function parameters and added a block between lines 416 and 417
https://github.com/OHDSI/DatabaseConnector/blob/46eb774beae80714efa7a6ad39558c478ceb597c/R/Sql.R#L416

  if (noSplit) { 
    sqlStatements <- sql 
  } else {
    sqlStatements <- SqlRender::splitSql(sql)
  }

@schuemie
Copy link
Member

The main purpose of SqlRender is to translate from one SQL dialect (OHDSISql) to others. splitSql() is mostly a convenience function for afterwards. I don't think the RedShift SQL you're referring to is the result of a translation?

@MPagel
Copy link
Author

MPagel commented Dec 19, 2024

The code that I'm trying to wrap into a procedure/function is from N3C's phenotype scripts, which I believe was the result of code translated from T-SQL.

Further, N3C uses the OHDSI DatabaseConnector package to execute the queries by way of executeSql. executeSql calls SqlRender::splitSql, which is why I posted here. I will see if I can bypass the call out to splitSql by calling lowLevelExecuteSql instead of executeSql, seeing as I've already done the SQL splitting.

I will also open a pull request over at DatabaseConnector to allow override of the call out to splitSql within (the non-low-level) executeSql and will reference this issue.

@schuemie
Copy link
Member

I'm still trying to understand the actual problem you are trying to solve. So far we have done just fine in OHDSI without having to define custom procedures, especially on RedShift which is IMHO the most well-behaved database platform. I don't see the custom procedure in the sQL on the N3C website you referenced.

What do you need the custom procedure for that requires either this fundamental change in SqlRender or a fundamental change in DatabaseConnector?

@MPagel
Copy link
Author

MPagel commented Jan 31, 2025

I am taking data from multiple institutional sources from different schemas of my database and combining for N3C's ingestion. In order to determine a unified cohort, I can't follow the N3C dataflow that removes individuals from my cohort if not found in the data, as some individuals are in only one of the two data sources (whereas most are in both). Nor can I maintain separate cohorts for each data set, because some of the negative controls as determined for one dataset may be present in the covid positive group on the other database, so are inappropriate to use as negative controls and new negative controls would have to be selected.

As I have to do some more complicated logic where I have to run the same base query for each of the data sources, it is "simpler" to put that query in a stored procedure and call with slightly modified input parameters for each run than to maintain two copies of the same query (that could lead to code drift over time if one query is changed but not the other).

So, to make a long story short, I have a use case where I want to augment the N3C dataflow and the OHDSI database connector functionality in my own local copy. But generally, I think it would be handy to allow for SQLRender to properly handle stored procedures, because SQLRender serves as a good starting library for doing any work in R that interacts with databases in a range of SQL dialects, OHDSI-related project or not.

@schuemie
Copy link
Member

schuemie commented Feb 3, 2025

Thanks for explaining! Have you thought about using SqlRender's parameterized SQL to address the problem of query modification? That would be in line with the SqlRender and DatabaseConnector flow, and would work across different database platforms without issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants