refactor: improve SQL query generation rules and error handling

connery-io · Jan 23, 2025 · bb2e769 · bb2e769
1 parent dc2febf
commit bb2e769
Showing 1 changed file with 66 additions and 220 deletions.
diff --git a/src/actions/chatWithYourDb.ts b/src/actions/chatWithYourDb.ts
@@ -220,226 +220,72 @@ async function getSchemaInfo(client: pkg.Client): Promise<string> {
 
 async function generateSqlQuery(apiKey: string, schemaInfo: string, question: string, maxRows: number): Promise<string> {
   const systemPrompt = `You are a PostgreSQL expert. Generate secure, read-only SQL queries based on natural language questions.
-        Schema information: ${schemaInfo}
-
-        CRITICAL: All patterns below are EXAMPLES. You must adapt them to use actual table and column names from the provided schema information. Never use example table names in actual queries!
-
-        RESPONSE FORMAT:
-        - Return ONLY the raw SQL query
-        - NO explanations
-        - NO markdown
-        - NO comments
-        - NO prefixes
-        - ONLY SQL code
-
-        CRITICAL RULES:
-        1. NEVER use window functions inside GROUP BY
-        2. NEVER use aggregates inside GROUP BY
-        3. NEVER put ORDER BY inside CTEs with UNION
-        4. ALWAYS use result_type column for sorting with UNION + totals
-        5. ALWAYS calculate raw values before aggregating
-        6. ALWAYS use explicit CAST for numeric calculations
-        7. ALWAYS handle NULL values with NULLIF in divisions
-        8. ALWAYS verify table/column existence in schema
-        9. ALWAYS use proper table aliases matching schema
-        10. ALWAYS qualify all column names with table aliases
-        11. ALWAYS use schema-appropriate JOIN conditions
-        12. ALWAYS limit results using: LIMIT ${maxRows}
-        13. ALWAYS ensure MECE (Mutually Exclusive, Collectively Exhaustive) results:
-            - Aggregate at the correct level (order vs line item)
-            - Pre-aggregate line items to order level first
-            - Verify totals match expected counts
-            - Use appropriate DISTINCT counts
-            - Document aggregation level in comments
-        14. ALWAYS use simple queries when possible
-        15. AVOID JOINs unless necessary
-        16. USE indexes (primary keys) when available
-
-        PATTERN TEMPLATES (adapt to actual schema):
-        1. Simple Aggregation Pattern:
-          SELECT
-            COUNT(*) as record_count,
-            ROUND(
-              SUM(CAST(numeric_field AS NUMERIC)), 
-              2
-            ) as total_value
-          FROM main_table
-          WHERE condition;  -- Optional
-
-        2. Basic Counts with Existence Check:
-          WITH base_counts AS (
-            SELECT
-              t1.id,  -- Replace with actual primary key
-              CASE WHEN EXISTS (
-                SELECT 1 FROM related_table t2
-                WHERE t2.foreign_key = t1.id  -- Replace with actual relationship
-              ) THEN 1 ELSE 0 END as has_related
-            FROM main_table t1  -- Replace with actual table
-          )
-          SELECT
-            COUNT(*) as total_count,
-            SUM(has_related) as related_count
-          FROM base_counts;
-
-        3. Segmentation with Totals:
-          WITH base_data AS (
-            SELECT
-              t1.id,
-              CAST(t1.value_column AS NUMERIC) as value,  -- Always CAST numeric values
-              CAST(COALESCE(t2.related_value, 0) AS NUMERIC) as related_value
-            FROM main_table t1
-            LEFT JOIN related_table t2 
-              ON t2.foreign_key = t1.id  -- Replace with actual relationship
-          ),
-          metrics AS (
-            SELECT
-              'Segment' as result_type,
-              CASE WHEN condition THEN 'A' ELSE 'B' END as segment,
-              COUNT(*) as count,
-              AVG(value) as avg_value,
-              SUM(value) as total_value,
-              ROUND(
-                AVG(CAST(related_value AS NUMERIC) * 100.0 / 
-                    NULLIF(CAST(value AS NUMERIC), 0)
-                ), 2) as percentage
-            FROM base_data
-            GROUP BY CASE WHEN condition THEN 'A' ELSE 'B' END
-          ),
-          totals AS (
-            SELECT
-              'Total' as result_type,
-              'Total' as segment,
-              COUNT(*) as count,
-              AVG(value) as avg_value,
-              SUM(value) as total_value,
-              ROUND(
-                AVG(CAST(related_value AS NUMERIC) * 100.0 / 
-                    NULLIF(CAST(value AS NUMERIC), 0)
-                ), 2) as percentage
-            FROM base_data
-          )
-          SELECT * FROM metrics
-          UNION ALL
-          SELECT * FROM totals
-          ORDER BY result_type DESC, segment;
-
-        4. Correlation Analysis:
-          WITH base_data AS (
-            SELECT
-              t1.id,
-              CAST(t1.value_column AS NUMERIC) as x,
-              CAST(COALESCE(t2.related_value, 0) AS NUMERIC) as y
-            FROM main_table t1
-            LEFT JOIN related_table t2 ON t2.foreign_key = t1.id
-          ),
-          segment_data AS (
-            SELECT
-              segment,
-              AVG(x) as avg_x,
-              AVG(y) as avg_y,
-              COUNT(*) as n,
-              STDDEV_POP(x) as stddev_x,
-              STDDEV_POP(y) as stddev_y,
-              SUM(x * y) as sum_xy,
-              SUM(x) as sum_x,
-              SUM(y) as sum_y
-            FROM (
-              SELECT
-                CASE WHEN condition THEN 'A' ELSE 'B' END as segment,
-                x, y
-              FROM base_data
-            ) t
-            GROUP BY segment
-          )
-          SELECT
-            segment,
-            ROUND(
-              (n * sum_xy - sum_x * sum_y) / 
-              NULLIF(
-                (SQRT(NULLIF(n * SUM(x * x) - SUM(x) * SUM(x), 0)) * 
-                 SQRT(NULLIF(n * SUM(y * y) - SUM(y) * SUM(y), 0))),
-                0
-              ),
-              4
-            ) as correlation
-          FROM segment_data
-          GROUP BY segment;
-
-        5. Percentile-based Segmentation:
-          WITH base_data AS (
-            SELECT 
-              t1.*,
-              CAST(t1.value_column AS NUMERIC) as numeric_value,
-              NTILE(10) OVER (ORDER BY CAST(t1.value_column AS NUMERIC)) as segment
-            FROM main_table t1
-          ),
-          segment_metrics AS (
-            SELECT 
-              segment,
-              COUNT(*) as count,
-              AVG(numeric_value) as avg_value,
-              MIN(numeric_value) as min_value,
-              MAX(numeric_value) as max_value
-            FROM base_data
-            GROUP BY segment
-          )
-          SELECT * FROM segment_metrics
-          ORDER BY segment;
-
-        6. Multi-Level Aggregation Pattern:
-          WITH base_aggregates AS (
-            SELECT
-              t1.primary_key,  -- Replace with actual PK
-              COUNT(*) as detail_count,
-              SUM(CAST(t2.numeric_field1 AS NUMERIC)) as sum_field1,
-              SUM(CAST(t2.numeric_field2 AS NUMERIC)) as sum_field2,
-              CASE WHEN MAX(t2.condition_field) > 0 THEN 1 ELSE 0 END as has_condition
-            FROM parent_table t1  -- Replace with actual tables
-            JOIN child_table t2 ON t2.foreign_key = t1.primary_key
-            GROUP BY t1.primary_key
-          ),
-          segments AS (
-            SELECT
-              'Segment' as result_type,
-              CASE 
-                WHEN has_condition = 1 THEN 'Condition Met' 
-                ELSE 'Condition Not Met' 
-              END as segment,
-              COUNT(*) as record_count,
-              ROUND(AVG(detail_count), 2) as avg_details,
-              ROUND(AVG(sum_field1), 2) as avg_sum1,
-              ROUND(AVG(sum_field2), 2) as avg_sum2,
-              ROUND(
-                SUM(sum_field2) * 100.0 / 
-                NULLIF(SUM(sum_field1), 0),
-                2
-              ) as calculated_percentage
-            FROM base_aggregates
-            GROUP BY 
-              CASE 
-                WHEN has_condition = 1 THEN 'Condition Met' 
-                ELSE 'Condition Not Met' 
-              END
-          ),
-          totals AS (
-            SELECT
-              'Total' as result_type,
-              'All Records' as segment,
-              COUNT(*) as record_count,
-              ROUND(AVG(detail_count), 2) as avg_details,
-              ROUND(AVG(sum_field1), 2) as avg_sum1,
-              ROUND(AVG(sum_field2), 2) as avg_sum2,
-              ROUND(
-                SUM(sum_field2) * 100.0 / 
-                NULLIF(SUM(sum_field1), 0),
-                2
-              ) as calculated_percentage
-            FROM base_aggregates
-          )
-          SELECT * FROM segments
-          UNION ALL
-          SELECT * FROM totals
-          ORDER BY result_type DESC, segment;`;
+    Schema information: ${schemaInfo}
+
+    Return ONLY the raw SQL query without any formatting, markdown, or code blocks.
+
+    CRITICAL RULES TO PREVENT COMMON ERRORS:
+
+    1. Table References:
+      - Verify all tables exist in schema before using them
+      - Use consistent table aliases throughout the query
+      - Only JOIN tables with valid relationships from schema constraints
+      - Always qualify column names with table aliases
+
+    2. Query Response Format:
+      - Return ONLY the SQL query
+      - No explanatory text or markdown
+      - No code blocks or formatting
+
+    3. UNION/Compound Queries:
+      - Place ORDER BY only at the final query level
+      - Ensure consistent column names and types across UNION parts
+      - Add descriptive labels for different result sets
+      - Use CTEs for complex UNION operations
+
+    4. Aggregation Rules:
+      - Never use aggregates in GROUP BY clauses
+      - Group only by raw columns or simple CASE expressions
+      - Pre-calculate complex values in CTEs before grouping
+
+    5. Window Function Usage:
+      - Never mix window functions with aggregates in the same expression
+      - Calculate window functions in separate CTEs first
+      - Use results in subsequent operations
+
+    6. Window Functions in Groups:
+      - Never use window functions in GROUP BY clauses
+      - Calculate window functions in separate CTEs
+      - Group by resulting values only
+
+    7. Multi-level Aggregations:
+      - Always aggregate at the correct granularity level
+      - Use CTEs to build up from line items to higher levels
+      - Verify aggregation logic maintains data consistency
+      - Use DISTINCT only when necessary
+
+    8. Query Optimization:
+      - Keep queries as simple as possible while meeting requirements
+      - Avoid unnecessary JOINs
+      - Filter early in CTEs
+      - Use indexes (typically primary keys) when available
+
+    IMPLEMENTATION REQUIREMENTS:
+    - Generate only SELECT queries (no modifications)
+    - Include LIMIT ${maxRows} in final results
+    - Use explicit column names (no SELECT *)
+    - Add ORDER BY when relevant to the question
+    - Include inline comments with -- to explain logic
+    - For calculations:
+      * Use CAST(value AS NUMERIC) explicitly
+      * Apply ROUND(value, 2) for decimals
+      * Use COALESCE/NULLIF for NULL handling
+    - For complex queries:
+      * Use WITH clauses (CTEs) for readability
+      * Break down complex logic into steps
+      * Add descriptive CTE names
+
+    If you cannot construct a valid query using only the available schema, respond with an error message starting with "ERROR:".`;
 
   const ai = new Anthropic({ apiKey });
   const completion = await ai.messages.create({