Skip to content

Commit

Permalink
refactor: improve SQL query generation rules and error handling
Browse files Browse the repository at this point in the history
  • Loading branch information
Michael Liebmann committed Jan 23, 2025
1 parent dc2febf commit bb2e769
Showing 1 changed file with 66 additions and 220 deletions.
286 changes: 66 additions & 220 deletions src/actions/chatWithYourDb.ts
Original file line number Diff line number Diff line change
Expand Up @@ -220,226 +220,72 @@ async function getSchemaInfo(client: pkg.Client): Promise<string> {

async function generateSqlQuery(apiKey: string, schemaInfo: string, question: string, maxRows: number): Promise<string> {
const systemPrompt = `You are a PostgreSQL expert. Generate secure, read-only SQL queries based on natural language questions.
Schema information: ${schemaInfo}
CRITICAL: All patterns below are EXAMPLES. You must adapt them to use actual table and column names from the provided schema information. Never use example table names in actual queries!
RESPONSE FORMAT:
- Return ONLY the raw SQL query
- NO explanations
- NO markdown
- NO comments
- NO prefixes
- ONLY SQL code
CRITICAL RULES:
1. NEVER use window functions inside GROUP BY
2. NEVER use aggregates inside GROUP BY
3. NEVER put ORDER BY inside CTEs with UNION
4. ALWAYS use result_type column for sorting with UNION + totals
5. ALWAYS calculate raw values before aggregating
6. ALWAYS use explicit CAST for numeric calculations
7. ALWAYS handle NULL values with NULLIF in divisions
8. ALWAYS verify table/column existence in schema
9. ALWAYS use proper table aliases matching schema
10. ALWAYS qualify all column names with table aliases
11. ALWAYS use schema-appropriate JOIN conditions
12. ALWAYS limit results using: LIMIT ${maxRows}
13. ALWAYS ensure MECE (Mutually Exclusive, Collectively Exhaustive) results:
- Aggregate at the correct level (order vs line item)
- Pre-aggregate line items to order level first
- Verify totals match expected counts
- Use appropriate DISTINCT counts
- Document aggregation level in comments
14. ALWAYS use simple queries when possible
15. AVOID JOINs unless necessary
16. USE indexes (primary keys) when available
PATTERN TEMPLATES (adapt to actual schema):
1. Simple Aggregation Pattern:
SELECT
COUNT(*) as record_count,
ROUND(
SUM(CAST(numeric_field AS NUMERIC)),
2
) as total_value
FROM main_table
WHERE condition; -- Optional
2. Basic Counts with Existence Check:
WITH base_counts AS (
SELECT
t1.id, -- Replace with actual primary key
CASE WHEN EXISTS (
SELECT 1 FROM related_table t2
WHERE t2.foreign_key = t1.id -- Replace with actual relationship
) THEN 1 ELSE 0 END as has_related
FROM main_table t1 -- Replace with actual table
)
SELECT
COUNT(*) as total_count,
SUM(has_related) as related_count
FROM base_counts;
3. Segmentation with Totals:
WITH base_data AS (
SELECT
t1.id,
CAST(t1.value_column AS NUMERIC) as value, -- Always CAST numeric values
CAST(COALESCE(t2.related_value, 0) AS NUMERIC) as related_value
FROM main_table t1
LEFT JOIN related_table t2
ON t2.foreign_key = t1.id -- Replace with actual relationship
),
metrics AS (
SELECT
'Segment' as result_type,
CASE WHEN condition THEN 'A' ELSE 'B' END as segment,
COUNT(*) as count,
AVG(value) as avg_value,
SUM(value) as total_value,
ROUND(
AVG(CAST(related_value AS NUMERIC) * 100.0 /
NULLIF(CAST(value AS NUMERIC), 0)
), 2) as percentage
FROM base_data
GROUP BY CASE WHEN condition THEN 'A' ELSE 'B' END
),
totals AS (
SELECT
'Total' as result_type,
'Total' as segment,
COUNT(*) as count,
AVG(value) as avg_value,
SUM(value) as total_value,
ROUND(
AVG(CAST(related_value AS NUMERIC) * 100.0 /
NULLIF(CAST(value AS NUMERIC), 0)
), 2) as percentage
FROM base_data
)
SELECT * FROM metrics
UNION ALL
SELECT * FROM totals
ORDER BY result_type DESC, segment;
4. Correlation Analysis:
WITH base_data AS (
SELECT
t1.id,
CAST(t1.value_column AS NUMERIC) as x,
CAST(COALESCE(t2.related_value, 0) AS NUMERIC) as y
FROM main_table t1
LEFT JOIN related_table t2 ON t2.foreign_key = t1.id
),
segment_data AS (
SELECT
segment,
AVG(x) as avg_x,
AVG(y) as avg_y,
COUNT(*) as n,
STDDEV_POP(x) as stddev_x,
STDDEV_POP(y) as stddev_y,
SUM(x * y) as sum_xy,
SUM(x) as sum_x,
SUM(y) as sum_y
FROM (
SELECT
CASE WHEN condition THEN 'A' ELSE 'B' END as segment,
x, y
FROM base_data
) t
GROUP BY segment
)
SELECT
segment,
ROUND(
(n * sum_xy - sum_x * sum_y) /
NULLIF(
(SQRT(NULLIF(n * SUM(x * x) - SUM(x) * SUM(x), 0)) *
SQRT(NULLIF(n * SUM(y * y) - SUM(y) * SUM(y), 0))),
0
),
4
) as correlation
FROM segment_data
GROUP BY segment;
5. Percentile-based Segmentation:
WITH base_data AS (
SELECT
t1.*,
CAST(t1.value_column AS NUMERIC) as numeric_value,
NTILE(10) OVER (ORDER BY CAST(t1.value_column AS NUMERIC)) as segment
FROM main_table t1
),
segment_metrics AS (
SELECT
segment,
COUNT(*) as count,
AVG(numeric_value) as avg_value,
MIN(numeric_value) as min_value,
MAX(numeric_value) as max_value
FROM base_data
GROUP BY segment
)
SELECT * FROM segment_metrics
ORDER BY segment;
6. Multi-Level Aggregation Pattern:
WITH base_aggregates AS (
SELECT
t1.primary_key, -- Replace with actual PK
COUNT(*) as detail_count,
SUM(CAST(t2.numeric_field1 AS NUMERIC)) as sum_field1,
SUM(CAST(t2.numeric_field2 AS NUMERIC)) as sum_field2,
CASE WHEN MAX(t2.condition_field) > 0 THEN 1 ELSE 0 END as has_condition
FROM parent_table t1 -- Replace with actual tables
JOIN child_table t2 ON t2.foreign_key = t1.primary_key
GROUP BY t1.primary_key
),
segments AS (
SELECT
'Segment' as result_type,
CASE
WHEN has_condition = 1 THEN 'Condition Met'
ELSE 'Condition Not Met'
END as segment,
COUNT(*) as record_count,
ROUND(AVG(detail_count), 2) as avg_details,
ROUND(AVG(sum_field1), 2) as avg_sum1,
ROUND(AVG(sum_field2), 2) as avg_sum2,
ROUND(
SUM(sum_field2) * 100.0 /
NULLIF(SUM(sum_field1), 0),
2
) as calculated_percentage
FROM base_aggregates
GROUP BY
CASE
WHEN has_condition = 1 THEN 'Condition Met'
ELSE 'Condition Not Met'
END
),
totals AS (
SELECT
'Total' as result_type,
'All Records' as segment,
COUNT(*) as record_count,
ROUND(AVG(detail_count), 2) as avg_details,
ROUND(AVG(sum_field1), 2) as avg_sum1,
ROUND(AVG(sum_field2), 2) as avg_sum2,
ROUND(
SUM(sum_field2) * 100.0 /
NULLIF(SUM(sum_field1), 0),
2
) as calculated_percentage
FROM base_aggregates
)
SELECT * FROM segments
UNION ALL
SELECT * FROM totals
ORDER BY result_type DESC, segment;`;
Schema information: ${schemaInfo}
Return ONLY the raw SQL query without any formatting, markdown, or code blocks.
CRITICAL RULES TO PREVENT COMMON ERRORS:
1. Table References:
- Verify all tables exist in schema before using them
- Use consistent table aliases throughout the query
- Only JOIN tables with valid relationships from schema constraints
- Always qualify column names with table aliases
2. Query Response Format:
- Return ONLY the SQL query
- No explanatory text or markdown
- No code blocks or formatting
3. UNION/Compound Queries:
- Place ORDER BY only at the final query level
- Ensure consistent column names and types across UNION parts
- Add descriptive labels for different result sets
- Use CTEs for complex UNION operations
4. Aggregation Rules:
- Never use aggregates in GROUP BY clauses
- Group only by raw columns or simple CASE expressions
- Pre-calculate complex values in CTEs before grouping
5. Window Function Usage:
- Never mix window functions with aggregates in the same expression
- Calculate window functions in separate CTEs first
- Use results in subsequent operations
6. Window Functions in Groups:
- Never use window functions in GROUP BY clauses
- Calculate window functions in separate CTEs
- Group by resulting values only
7. Multi-level Aggregations:
- Always aggregate at the correct granularity level
- Use CTEs to build up from line items to higher levels
- Verify aggregation logic maintains data consistency
- Use DISTINCT only when necessary
8. Query Optimization:
- Keep queries as simple as possible while meeting requirements
- Avoid unnecessary JOINs
- Filter early in CTEs
- Use indexes (typically primary keys) when available
IMPLEMENTATION REQUIREMENTS:
- Generate only SELECT queries (no modifications)
- Include LIMIT ${maxRows} in final results
- Use explicit column names (no SELECT *)
- Add ORDER BY when relevant to the question
- Include inline comments with -- to explain logic
- For calculations:
* Use CAST(value AS NUMERIC) explicitly
* Apply ROUND(value, 2) for decimals
* Use COALESCE/NULLIF for NULL handling
- For complex queries:
* Use WITH clauses (CTEs) for readability
* Break down complex logic into steps
* Add descriptive CTE names
If you cannot construct a valid query using only the available schema, respond with an error message starting with "ERROR:".`;

const ai = new Anthropic({ apiKey });
const completion = await ai.messages.create({
Expand Down

0 comments on commit bb2e769

Please sign in to comment.