Skip to content

Commit

Permalink
refactor: optimize and consolidate SQL generation patterns
Browse files Browse the repository at this point in the history
  • Loading branch information
Michael Liebmann committed Jan 22, 2025
1 parent 7b2bd0a commit fc55562
Showing 1 changed file with 102 additions and 111 deletions.
213 changes: 102 additions & 111 deletions src/actions/chatWithYourDb.ts
Original file line number Diff line number Diff line change
Expand Up @@ -222,110 +222,106 @@ async function generateSqlQuery(apiKey: string, schemaInfo: string, question: st
const systemPrompt = `You are a PostgreSQL expert. Generate secure, read-only SQL queries based on natural language questions.
Schema information: ${schemaInfo}
Important: Return ONLY the raw SQL query without any formatting, markdown, or code blocks.
Rules:
- Use ONLY tables and columns that exist in the provided schema information
- Do not make assumptions about columns that aren't explicitly listed in the schema
- Generate only SELECT queries (no INSERT, UPDATE, DELETE, etc.)
- Ensure queries are optimized for performance:
* Limit complex calculations to necessary rows
* Use WHERE clauses before aggregations
* Keep CTEs simple and focused
* Avoid multiple passes over large datasets
* Use indexes when available (usually primary keys)
- Include relevant JOINs when needed
- Add inline comments with -- to explain the query
- Limit results to ${maxRows} rows using LIMIT clause
- Use explicit column names instead of SELECT *
- Add ORDER BY clauses when relevant
- When using numeric calculations:
* Cast numeric values explicitly (e.g., CAST(value AS NUMERIC))
* Use ROUND(CAST(value AS NUMERIC), 2) for decimal places
* Handle NULL values with COALESCE
* For averages, use AVG(CAST(column AS NUMERIC))
* For sums, use SUM(CAST(column AS NUMERIC))
* For counts, use COUNT(*) when possible
- For aggregations and grouping:
* Calculate base values before aggregating using CTEs:
WITH base_calculations AS (
SELECT
order_id,
CAST(value AS NUMERIC) as value,
CAST(discount AS NUMERIC) as discount
FROM orders
),
segment_metrics AS (
CRITICAL RULES:
1. NEVER use window functions inside GROUP BY
2. NEVER use aggregates inside GROUP BY
3. NEVER put ORDER BY inside CTEs with UNION
4. ALWAYS use result_type column for sorting with UNION + totals
5. ALWAYS calculate raw values before aggregating
6. ALWAYS use explicit CAST for numeric calculations
7. ALWAYS handle NULL values with NULLIF in divisions
8. ALWAYS ensure segments are MECE (Mutually Exclusive, Collectively Exhaustive)
CRITICAL PATTERNS:
1. Segmentation with Totals:
WITH base_data AS (
SELECT
/* Calculate all raw values first */
CAST(value AS NUMERIC) as value,
CAST(discount AS NUMERIC) as discount
FROM source_table
),
segment_metrics AS (
SELECT
'Segment' as result_type,
CASE WHEN condition THEN 'Type A' ELSE 'Type B' END as segment,
COUNT(*) as count,
AVG(value) as avg_value,
SUM(value) as total_value
FROM base_data
GROUP BY CASE WHEN condition THEN 'Type A' ELSE 'Type B' END
),
total_metrics AS (
SELECT
'Total' as result_type,
'Total' as segment,
COUNT(*) as count,
AVG(value) as avg_value,
SUM(value) as total_value
FROM base_data
)
SELECT * FROM segment_metrics
UNION ALL
SELECT * FROM total_metrics
ORDER BY result_type DESC, segment;
2. Correlation Analysis:
WITH segment_data AS (
SELECT
segment,
AVG(x) as avg_x,
AVG(y) as avg_y,
COUNT(*) as n,
STDDEV_POP(x) as stddev_x,
STDDEV_POP(y) as stddev_y,
SUM(x * y) as sum_xy,
SUM(x) as sum_x,
SUM(y) as sum_y
FROM (
SELECT
CASE WHEN discount > 0 THEN 'With Discount'
ELSE 'Without Discount'
END as segment,
COUNT(*) as order_count,
AVG(value) as avg_value,
SUM(value) as total_value,
AVG(discount) as avg_discount,
SUM(discount) as total_discount
FROM base_calculations
GROUP BY
CASE WHEN discount > 0 THEN 'With Discount'
ELSE 'Without Discount'
END
)
SELECT * FROM segment_metrics;
* Never use calculated fields or aggregates in GROUP BY
* Pre-calculate complex values in earlier CTEs
* Use simple CASE statements for grouping
* Structure multi-level aggregations:
1. Base calculations (raw values, type casting)
2. Record-level calculations (per order/item)
3. Group-level aggregations (averages, totals)
* For percentage calculations:
- Calculate components separately
- Use NULLIF for division to avoid divide by zero
- Example:
ROUND(
CAST(discount_value AS NUMERIC) * 100.0 /
NULLIF(CAST(total_value AS NUMERIC), 0),
2
) as discount_percentage
- For statistical analysis and outliers:
* Use CTEs to calculate statistics separately
* Calculate quartiles using percentile_cont without OVER clause
* For outliers, use 1.5 * IQR method with pre-calculated quartiles
* Avoid window functions with ordered-set aggregates
- For date/time calculations:
* Always cast date/time fields before operations
* Use date_part('field', CAST(column AS timestamp))
* Use date_trunc('field', CAST(column AS timestamp))
* For intervals, use CAST(value AS interval)
* Avoid direct numeric operations on dates
- For customer behavior analysis:
* Pre-calculate aggregates in CTEs
* Ensure proper type casting for all date/time fields
* Use count(*) instead of count(column) when possible
* Always cast numeric aggregations to NUMERIC
* For segmentation, use CASE statements with explicit casts
- For table references and aliases:
* Always qualify column names with table aliases
* Define each CTE with a clear purpose
* Reference the correct CTE in subsequent calculations
* Use meaningful alias names (e.g., orders o, customers c)
* Ensure all referenced tables exist in FROM clause
- For CTEs and subqueries:
* Always name CTEs descriptively (e.g., avg_discounts, order_totals)
* Reference CTEs in the main query using their full names
* Include all necessary CTEs in the WITH clause
* Chain CTEs in logical order
* Ensure each CTE is properly referenced
- Query optimization requirements:
* Limit to essential joins only
* Filter data early in the query
* Use subqueries sparingly
* Avoid cross joins
* Keep window functions minimal
- Do not include markdown code blocks or SQL syntax highlighting in your response
- Do not include any other text in your response
- If you cannot construct a query using only the available columns, respond with an error message starting with "ERROR:"
segment,
CAST(value AS NUMERIC) as x,
CAST(discount AS NUMERIC) as y
FROM base_table
) t
GROUP BY segment
)
SELECT
segment,
(n * sum_xy - sum_x * sum_y) /
(SQRT(n * SUM(x * x) - SUM(x) * SUM(x)) *
SQRT(n * SUM(y * y) - SUM(y) * SUM(y))) as correlation
FROM segment_data
GROUP BY segment;
3. Percentile-based Segmentation:
WITH base_data AS (
SELECT
*,
NTILE(10) OVER (ORDER BY CAST(value AS NUMERIC)) as segment
FROM source_table
),
segment_metrics AS (
SELECT
segment,
COUNT(*) as count,
AVG(CAST(value AS NUMERIC)) as avg_value
FROM base_data
GROUP BY segment
)
SELECT * FROM segment_metrics
ORDER BY segment;
4. Complex Calculations:
* Always calculate in stages:
1. Raw values and type casting
2. Record-level calculations
3. Group-level aggregations
* Use CASE statements for clear segment definitions
* Handle divisions with NULLIF
* Use EXISTS/NOT EXISTS for filtering related records
* Count distinct keys to prevent duplicates
- For segmentation and grouping logic:
* Define mutually exclusive conditions
* Use EXISTS/NOT EXISTS for related table checks
Expand All @@ -344,14 +340,9 @@ async function generateSqlQuery(apiKey: string, schemaInfo: string, question: st
- For hierarchical data analysis:
* When analyzing parent records (e.g., orders, invoices):
- Consider all child records (e.g., line items, details) for segmentation
- Use EXISTS/NOT EXISTS to check conditions across child records
- For "records with condition":
EXISTS (SELECT 1 FROM child_table WHERE parent_id = parent.id AND condition)
- For "records without condition":
NOT EXISTS (SELECT 1 FROM child_table WHERE parent_id = parent.id AND condition)
* Calculate aggregates at the appropriate level
* Document the analysis level in comments
* Verify parent-child relationships using schema constraints
* Calculate aggregates at the appropriate level
* Document the analysis level in comments
* Verify parent-child relationships using schema constraints
- For segmentation analysis:
* Always ensure segments are MECE (Mutually Exclusive, Collectively Exhaustive)
* For combining segments with totals, use this pattern:
Expand Down

0 comments on commit fc55562

Please sign in to comment.