feat: merge dev to main branch #63

Merged 26 commits on Nov 7, 2023
Changes from all commits
c90da6f
Merge pull request #56 from aws-samples/main
yike5460 Oct 25, 2023
56ebc15
refactor: add lazy load method to save memory
yike5460 Oct 27, 2023
508f859
Merge branch 'dev' of https://github.com/aws-samples/llm-bot into dev
yike5460 Oct 27, 2023
24e50d7
feat: nougat loader and splitter class
yike5460 Oct 27, 2023
ea2680e
feat: add CSV loader to save the content in markdown format
NingLu Oct 29, 2023
45cb325
feat: judge the token before query enhancement
yike5460 Oct 30, 2023
d119f5d
chore: search function in aos & todo items
yike5460 Oct 30, 2023
1e3785d
chore: metadata template with parse logic
yike5460 Oct 31, 2023
28b995e
fix: add openai module
yike5460 Oct 31, 2023
4a2085c
Fix embedding model inference code
IcyKallen Nov 1, 2023
ea2aa28
feat: markdown split based on title & sub-title
yike5460 Nov 1, 2023
9cdcd34
Merge branch 'dev' of https://github.com/aws-samples/llm-bot into dev
yike5460 Nov 1, 2023
6062c28
chore: 1.update whl to dist directly; 2. add table split in splitter
yike5460 Nov 2, 2023
a00a7d9
feat: invoke csv load in glue script
NingLu Nov 2, 2023
95adbc8
chore: update glue job timeout and format the code
NingLu Nov 2, 2023
728cf21
feat: 1.update pdf process in glue; 2.update aos api schema and backe…
yike5460 Nov 2, 2023
0cab8fb
Merge branch 'dev' of https://github.com/aws-samples/llm-bot into dev
yike5460 Nov 2, 2023
9299aa3
fix: fix glue No space left on device issue
yike5460 Nov 3, 2023
3a4a2ea
chore: reorganize loader utils.
IcyKallen Nov 4, 2023
46eff33
chore: update new dep package
IcyKallen Nov 4, 2023
169b0d4
feat: remove blocker to allow pdf be normally processed in glue job
yike5460 Nov 5, 2023
694ae2c
Merge branch 'dev' of https://github.com/aws-samples/llm-bot into dev
yike5460 Nov 5, 2023
1cfc600
chore: update aos api function
yike5460 Nov 5, 2023
7027c51
chore: 1. add full file path in metadata; 2. adjust sfn timeout and e…
yike5460 Nov 6, 2023
1c0eef8
feat: add qa enhance along with para adjustment in glue
yike5460 Nov 6, 2023
ad8b435
chore: add retry to inject aos
yike5460 Nov 7, 2023
98 changes: 68 additions & 30 deletions README.md
@@ -44,80 +44,85 @@

Use Postman or cURL to test the API connection. The API endpoint is the output of the CloudFormation stack with the prefix 'embedding' or 'llm'; a sample URL looks like "https://xxxx.execute-api.us-east-1.amazonaws.com/v1/embedding". The API request bodies are as follows:

**Offline process to pre-process files specified in the S3 bucket and prefix, POST https://xxxx.execute-api.us-east-1.amazonaws.com/v1/etl**
```bash
BODY
{
"s3Bucket": "<Your S3 bucket>",
"s3Prefix": "<Your S3 prefix>",
"offline": "true"
}
```
You should see output like this:
```bash
"Step Function triggered, Step Function ARN: arn:aws:states:us-east-1:xxxx:execution:xx-xxx:xx-xx-xx-xx-xx, Input Payload: {\"s3Bucket\": \"<Your S3 bucket>\", \"s3Prefix\": \"<Your S3 prefix>\", \"offline\": \"true\"}"
```
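For scripting rather than Postman, the same ETL request can be built in Python. This is a minimal sketch: `build_etl_payload` is a hypothetical helper, the endpoint URL is a placeholder for your stack output, and the actual HTTP call (commented out) assumes the third-party `requests` package.

```python
import json

# Placeholder -- substitute the 'etl' API endpoint from your CloudFormation stack output
ETL_ENDPOINT = "https://xxxx.execute-api.us-east-1.amazonaws.com/v1/etl"

def build_etl_payload(bucket: str, prefix: str, offline: bool = True) -> str:
    """Serialize the request body expected by the /etl endpoint."""
    return json.dumps({
        "s3Bucket": bucket,
        "s3Prefix": prefix,
        # the API expects the flag as a string, per the sample body above
        "offline": "true" if offline else "false",
    })

payload = build_etl_payload("my-doc-bucket", "docs/")
print(payload)

# To actually trigger the Step Function (requires the `requests` package):
# import requests
# resp = requests.post(ETL_ENDPOINT, data=payload)
# print(resp.text)  # "Step Function triggered, Step Function ARN: ..."
```

The response is the Step Function execution ARN shown above, which you can use to track the offline job in the Step Functions console.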

**Embedding uploaded file into AOS, POST https://xxxx.execute-api.us-east-1.amazonaws.com/v1/embedding, will be deprecated in the future**
```bash
BODY
{
"document_prefix": "<Your S3 bucket prefix>",
"aos_index": "chatbot-index"
}
```
You should see output like this:
```bash
{
"created": xx.xx,
"model": "embedding-endpoint"
}
```

**Then you can query embeddings in AOS, POST https://xxxx.execute-api.us-east-1.amazonaws.com/v1/embedding**; other operations, including index, delete, and query, are also provided for debugging purposes.
```bash
BODY
{
"aos_index": "chatbot-index",
"operation": "match_all",
"body": ""
}
```
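The same body shape covers every debugging operation on this endpoint. As a sketch (the helper name and the exact set of allowed operations are assumptions inferred from the sample bodies in this README), a small builder keeps the payloads consistent:

```python
import json

def build_embedding_request(index: str, operation: str, body="") -> str:
    """Assemble a request body for the /v1/embedding endpoint.

    The allowed-operation set below is an assumption based on the
    sample bodies shown in this README.
    """
    allowed = {"match_all", "query", "create", "delete"}
    if operation not in allowed:
        raise ValueError(f"Invalid query operation: {operation}")
    return json.dumps({"aos_index": index, "operation": operation, "body": body})

print(build_embedding_request("chatbot-index", "match_all"))
```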

You should see output like this:
```bash
{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 4,
        "successful": 4,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 256,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "chatbot-index",
                "_id": "035e8439-c683-4278-97f3-151f8cd4cdb6",
                "_score": 1.0,
                "_source": {
                    "vector_field": [
                        -0.03106689453125,
                        -0.00798797607421875,
                        ...
                    ],
                    "text": "## 1 Introduction\n\nDeep generative models of all kinds have recently exhibited high quality samples in a wide variety of data modalities. Generative adversarial networks (GANs), autoregressive models, flows, and variational autoencoders (VAEs) have synthesized striking image and audio samples [14; 27; 3; 58; 38; 25; 10; 32; 44; 57; 26; 33; 45], and there have been remarkable advances in energy-based modeling and score matching that have produced images comparable to those of GANs [11; 55].",
                    "metadata": {
                        "content_type": "paragraph",
                        "heading_hierarchy": {
                            "1 Introduction": {}
                        },
                        "figure_list": [],
                        "chunk_id": "$2",
                        "file_path": "Denoising Diffusion Probabilistic Models.pdf",
                        "keywords": [],
                        "summary": ""
                    }
                }
            },
            ...
}
```
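When consuming this response programmatically, only a few fields of each hit usually matter. A small sketch (the helper name is hypothetical; the field paths follow the sample output above) pulls out the chunk text, source file, and score:

```python
def extract_chunks(search_response):
    """Collect the text, source file, and score from each hit of an
    AOS search response shaped like the sample output above."""
    chunks = []
    for hit in search_response["hits"]["hits"]:
        source = hit["_source"]
        chunks.append({
            "text": source["text"],
            "file_path": source["metadata"]["file_path"],
            "score": hit["_score"],
        })
    return chunks

# Trimmed-down version of the response above
sample = {"hits": {"hits": [{"_score": 1.0, "_source": {
    "text": "## 1 Introduction\n\nDeep generative models ...",
    "metadata": {"file_path": "Denoising Diffusion Probabilistic Models.pdf"}}}]}}
print(extract_chunks(sample))
```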

**Delete initial index in AOS, POST https://xxxx.execute-api.us-east-1.amazonaws.com/v1/embedding, for debugging purposes**
```bash
{
"aos_index": "chatbot-index",
"operation": "delete",
"body": ""
}
```

**Create initial index in AOS, POST https://xxxx.execute-api.us-east-1.amazonaws.com/v1/embedding, for debugging purposes**
```bash
{
    "aos_index": "chatbot-index",
    "operation": "create",
    "body": {
        "settings": {
            "index": {
                "number_of_shards": 2,
                "number_of_replicas": 1
            }
        },
        "mappings": {
            "properties": {
                "vector_field": {
                    "type": "knn_vector",
                    "dimension": 1024
                }
            }
        }
    }
}
```
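If you recreate the index with different embedding models, the only parts of this body that usually change are the vector dimension and the shard layout. A hedged sketch (the helper name is hypothetical; the `vector_field` name and k-NN mapping mirror the sample above) parameterizes the body:

```python
import json

def build_knn_index_body(dimension: int = 1024, shards: int = 2, replicas: int = 1) -> dict:
    """Build the OpenSearch index body used in the 'create' sample above.

    The dimension must match your embedding model's output size; 1024
    matches the sample in this README.
    """
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        },
        "mappings": {
            "properties": {
                "vector_field": {
                    "type": "knn_vector",
                    "dimension": dimension,
                }
            }
        },
    }

print(json.dumps(build_knn_index_body(), indent=2))
```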

**Invoke LLM with context, POST https://xxxx.execute-api.us-east-1.amazonaws.com/v1/llm**
```bash
BODY
...
]
}
```
1. Launch dashboard to check and debug the ETL & QA process

```bash
cd /src/panel
```
76 changes: 47 additions & 29 deletions src/etl-stack.ts
@@ -36,58 +36,73 @@ export class EtlStack extends NestedStack {
type: glue.ConnectionType.NETWORK,
subnet: props._subnets[0],
securityGroups: [props._securityGroups],
});

const _S3Bucket = new s3.Bucket(this, 'llm-bot-glue-lib', {
bucketName: `llm-bot-glue-lib-${Aws.ACCOUNT_ID}-${Aws.REGION}`,
blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
});

const extraPythonFiles = new s3deploy.BucketDeployment(this, 'extraPythonFiles', {
sources: [s3deploy.Source.asset('src/scripts/dep/dist')],
destinationBucket: _S3Bucket,
});

// Assemble the extra python files list using _S3Bucket.s3UrlForObject('llm_bot_dep-0.1.0-py3-none-any.whl') and _S3Bucket.s3UrlForObject('nougat_ocr-0.1.17-py3-none-any.whl') and convert to string
const extraPythonFilesList = [_S3Bucket.s3UrlForObject('llm_bot_dep-0.1.0-py3-none-any.whl')].join(',');

const glueRole = new iam.Role(this, 'ETLGlueJobRole', {
assumedBy: new iam.ServicePrincipal('glue.amazonaws.com'),
// the role is used by the glue job to access AOS and by default it has 1 hour session duration which is not enough for the glue job to finish the embedding injection
maxSessionDuration: Duration.hours(12),
});
glueRole.addToPrincipalPolicy(
new iam.PolicyStatement({
actions: [
"sagemaker:InvokeEndpointAsync",
"sagemaker:InvokeEndpoint",
"s3:*",
"es:*",
"glue:*",
"ec2:*",
// cloudwatch logs
"logs:*",
],
effect: iam.Effect.ALLOW,
resources: ['*'],
})
)

// Create a Glue job to process files specified in the S3 bucket and prefix
const glueJob = new glue.Job(this, 'PythonShellJob', {
executable: glue.JobExecutable.pythonShell({
glueVersion: glue.GlueVersion.V3_0,
pythonVersion: glue.PythonVersion.THREE_NINE,
script: glue.Code.fromAsset(path.join(__dirname, 'scripts/glue-job-script.py')),
}),
// Worker type is not supported for pythonshell job commands, and workerType and workerCount must be set together
// workerType: glue.WorkerType.G_2X,
// workerCount: 2,
maxConcurrentRuns: 200,
maxRetries: 1,
connections: [connection],
maxCapacity: 1,
role: glueRole,
defaultArguments: {
'--S3_BUCKET.$': sfn.JsonPath.stringAt('$.s3Bucket'),
'--S3_PREFIX.$': sfn.JsonPath.stringAt('$.s3Prefix'),
'--QA_ENHANCEMENT.$': sfn.JsonPath.stringAt('$.qaEnhance'),
'--AOS_ENDPOINT': props._domainEndpoint,
'--REGION': props._region,
'--EMBEDDING_MODEL_ENDPOINT': props._embeddingEndpoint,
'--DOC_INDEX_TABLE': 'chatbot-index',
'--additional-python-modules': 'langchain==0.0.312,beautifulsoup4==4.12.2,requests-aws4auth==1.2.3,boto3==1.28.69,openai==0.28.1,nougat-ocr==0.1.17,pyOpenSSL==23.3.0,tenacity==8.2.3',
// add multiple extra python files
'--extra-py-files': extraPythonFilesList
}
});
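On the Python side, glue-job-script.py receives each entry in `defaultArguments` as a `--NAME value` pair on the job's command line (Glue scripts typically read them with `awsglue.utils.getResolvedOptions`). As an illustration only, a minimal stand-in parser shows the mechanism; `resolve_options` and the simulated argv are hypothetical, not part of the repo:

```python
def resolve_options(argv, option_names):
    """Minimal stand-in for awsglue.utils.getResolvedOptions:
    Glue passes each default argument as '--NAME value'."""
    args = {}
    for name in option_names:
        flag = f"--{name}"
        if flag in argv:
            args[name] = argv[argv.index(flag) + 1]
    return args

# Simulated command line matching the defaultArguments wired up above
simulated_argv = [
    "glue-job-script.py",
    "--S3_BUCKET", "my-doc-bucket",
    "--S3_PREFIX", "docs/",
    "--QA_ENHANCEMENT", "false",
]
opts = resolve_options(simulated_argv, ["S3_BUCKET", "S3_PREFIX", "QA_ENHANCEMENT"])
print(opts)  # {'S3_BUCKET': 'my-doc-bucket', 'S3_PREFIX': 'docs/', 'QA_ENHANCEMENT': 'false'}
```

Note that the `.$`-suffixed keys (e.g. `--S3_BUCKET.$`) are resolved by Step Functions from the execution input before the job starts, so the script sees plain `--S3_BUCKET` values.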

// Create SNS topic and subscription to notify when glue job is completed
const topic = new sns.Topic(this, 'etl-topic', {
displayName: 'etl-topic',
@@ -111,6 +126,7 @@
'--EMBEDDING_MODEL_ENDPOINT': props._embeddingEndpoint,
'--REGION': props._region,
'--OFFLINE': 'true',
'--QA_ENHANCEMENT.$': '$.qaEnhance',
}),
});

@@ -127,6 +143,7 @@
'--EMBEDDING_MODEL_ENDPOINT': props._embeddingEndpoint,
'--REGION': props._region,
'--OFFLINE': 'false',
'--QA_ENHANCEMENT.$': '$.qaEnhance',
}),
});

@@ -145,7 +162,8 @@
const sfnStateMachine = new sfn.StateMachine(this, 'ETLState', {
definitionBody: sfn.DefinitionBody.fromChainable(sfnDefinition),
stateMachineType: sfn.StateMachineType.STANDARD,
// Align with the glue job timeout
timeout: Duration.minutes(2880),
});

// Export the Step function to be used in API Gateway
29 changes: 20 additions & 9 deletions src/lambda/embedding/main.py
@@ -124,19 +124,30 @@ def lambda_handler(event, context):

    # parse arguments from event
    index_name = json.loads(event['body'])['aos_index']
    operation = json.loads(event['body'])['operation']
    body = json.loads(event['body'])['body']
    aos_client = OpenSearchClient(_opensearch_cluster_domain)
    # re-route GET request to a separate processing branch
    if event['httpMethod'] == 'GET':
        # check whether the operation is a query or match_all against OpenSearch
        if operation == 'query':
            response = aos_client.query(index_name, json.dumps(body))
        elif operation == 'match_all':
            response = aos_client.match_all(index_name)
        else:
            raise Exception(f'Invalid query operation: {operation}')
        return {
            'statusCode': 200,
            'headers': {'Content-Type': 'application/json'},
            'body': json.dumps(response)
        }
    elif event['httpMethod'] == 'POST':
        if operation == 'delete':
            response = aos_client.delete_index(index_name)
        elif operation == 'create':
            logger.info(f'create index with query: {json.dumps(body)}')
            response = aos_client.create_index(index_name, json.dumps(body))
        else:
            raise Exception(f'Invalid query operation: {operation}')
        return {
            'statusCode': 200,
            'headers': {'Content-Type': 'application/json'},
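The routing introduced in this handler (GET for reads, POST for index management) is easy to exercise without an OpenSearch cluster. A sketch with a stub client, assuming the handler shape in the diff above (`route` and `StubAOSClient` are hypothetical test helpers, not repo code), demonstrates the dispatch:

```python
import json

class StubAOSClient:
    """Stand-in for the OpenSearchClient used by the Lambda handler."""
    def match_all(self, index):
        return {"hits": {"total": {"value": 0}}}
    def delete_index(self, index):
        return {"acknowledged": True}

def route(event, client):
    """Condensed version of the handler's method/operation dispatch."""
    payload = json.loads(event["body"])
    operation = payload["operation"]
    index_name = payload["aos_index"]
    if event["httpMethod"] == "GET":
        if operation == "match_all":
            return client.match_all(index_name)
        raise Exception(f"Invalid query operation: {operation}")
    elif event["httpMethod"] == "POST":
        if operation == "delete":
            return client.delete_index(index_name)
        raise Exception(f"Invalid query operation: {operation}")

event = {"httpMethod": "GET",
         "body": json.dumps({"aos_index": "chatbot-index",
                             "operation": "match_all", "body": ""})}
print(route(event, StubAOSClient()))
```

Keeping the dispatch separable from the OpenSearch client is what makes this kind of offline unit test possible.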