diff --git a/distributed-inference-deployment/Llama2-TensorRT-LLM-SageMaker.ipynb b/distributed-inference-deployment/Llama2-TensorRT-LLM-SageMaker.ipynb
new file mode 100644
index 0000000..19e5a44
--- /dev/null
+++ b/distributed-inference-deployment/Llama2-TensorRT-LLM-SageMaker.ipynb
@@ -0,0 +1,645 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Deploy Llama 2 on Amazon SageMaker with TensorRT-LLM"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "[Llama 2](https://llama.meta.com/llama2/) are pretrained models trained on 2 trillion tokens with a 4k context length. Its fine-tuned chat models have been trained on over 1 million human annotations. Llama 2 has undergone internal and external adversarial testing across fine-tuned models to identify potential toxicity, bias, and other gaps in performance. To learn more about Llama 2 models, click [here](https://llama.meta.com/llama2/).\n",
+ "\n",
+ "SageMaker has rolled out [TensorRT-LLM container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) which now provides users with the ability to leverage the managed serving capabilities and help to provide the un-differentiated heavy lifting.\n",
+ "\n",
+ "In this notebook, we combine the strengths of two powerful tools: [DJL](https://docs.djl.ai/) (Deep Java Library) for the serving framework and [TensorRT-LLM](https://nvidia.github.io/TensorRT-LLM/) for distributed large language model inference on Nvidia. DJLServing, a high-performance universal model serving solution powered by DJL, handles the overall serving architecture.\n",
+ "\n",
+ "In our setup, vLLM handles the core LLM inference tasks, leveraging its optimizations to achieve high performance and low latency. DJLServing manages the broader serving infrastructure, handling incoming requests, load balancing, and coordinating with vLLM for efficient inference.\n",
+ "\n",
+ "This combination allows us to deploy the `Llama 2 7B` model across GPUs on the `ml.g5.12xlarge` instance with optimal resource utilization. vLLM's efficiencies in memory management and request handling enable us to serve this large model with improved throughput compared to traditional serving methods. To learn more about DJL, DJLServing, and TensorRT-LLM you can refer to this [blog post](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-llms-with-new-amazon-sagemaker-containers/)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "
\n",
+ "\n",
+ "NOTE: Llama models are licensed under a bespoke commercial license that balances open access to the models with responsibility and protections in place to help address potential misuse. Their license allows for broad commercial use, as well as for developers to create and redistribute additional work on top of Llama models. For more details, their licenses can be found at [Meta Llama 2](https://llama.meta.com/license/) and [Meta Llama 3](https://llama.meta.com/llama3/license/).\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "##### Reach out to Mistral to explore Codestral for commercial use cases: [Contact the Mistral team](https://mistral.ai/contact/)\n",
+ "\n",
+ "##### More on the Mistral AI Non-Production License: [Mistral AI Non-Production License](https://mistral.ai/news/mistral-ai-non-production-license-mnpl/)\n",
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Requirements"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "1. Create an Amazon SageMaker Notebook Instance - [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html)\n",
+ " - For Notebook Instance type, choose `ml.t3.medium`.\n",
+ "2. For Select Kernel, choose [conda_python3](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-prepare.html).\n",
+ "3. Install the required packages."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ " \n",
+ "\n",
+ "
NOTE:\n",
+ "\n",
+ "- For
Amazon SageMaker Studio, select Kernel \"
Python 3 (ipykernel)\".\n",
+ "\n",
+ "- For
Amazon SageMaker Studio Classic, select Image \"
Base Python 3.0\" and Kernel \"
Python 3\".\n",
+ "\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To run this notebook you would need to install the following dependencies:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "!pip install boto3==1.34.132 -qU --force --quiet --no-warn-conflicts\n",
+ "!pip install sagemaker==2.224.2 -qU --force --quiet --no-warn-conflicts"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "### Import libraries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml\n",
+ "sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml\n"
+ ]
+ }
+ ],
+ "source": [
+ "import boto3\n",
+ "import json\n",
+ "import sagemaker"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "2.224.2\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(sagemaker.__version__)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Initialize parameters"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "sagemaker role arn: arn:aws:iam::570598552974:role/txt2sql-SageMakerExecutionRole-PAgMr5TND4x0\n",
+ "sagemaker session region: us-east-1\n"
+ ]
+ }
+ ],
+ "source": [
+ "# execution role for the endpoint\n",
+ "role = sagemaker.get_execution_role()\n",
+ "\n",
+ "# sagemaker session for interacting with different AWS APIs\n",
+ "sess = sagemaker.session.Session()\n",
+ "\n",
+ "# Region\n",
+ "region_name = sess._region_name\n",
+ "\n",
+ "print(f\"sagemaker role arn: {role}\")\n",
+ "print(f\"sagemaker session region: {region_name}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Image URI of the DJL Container"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "LMI DLCs offer a low-code interface that simplifies using state-of-the-art inference optimization techniques and hardware. LMI allows you to apply tensor parallelism; the latest efficient attention, batching, quantization, and memory management techniques; token streaming; and much more, by just requiring the model ID and optional model parameters. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "See available Large Model Inference DLC's [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) and for more details [here](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/announcements/deepspeed-deprecation.md)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "DCL Image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-tensorrtllm0.9.0-cu122\n"
+ ]
+ }
+ ],
+ "source": [
+ "inference_image_uri = sagemaker.image_uris.retrieve(\n",
+ " framework=\"djl-tensorrtllm\",\n",
+ " region=region_name,\n",
+ " version=\"0.28.0\"\n",
+ ")\n",
+ "print(f\"DCL Image going to be used is ---- > {inference_image_uri}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Available Environment Variable Configurations"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Here is a list of settings that we use in this configuration file:\n",
+ "\n",
+ "- `HF_MODEL_ID`: The model id of a pretrained model hosted inside a model repository on [huggingface.co](https://huggingface.co/models). The container uses this model id to download the corresponding model repository on huggingface.co. This is an optional setting and is not needed in the scenario where you are brining your own model. If you are getting your own model, you can include the URI of the Amazon S3 bucket that contains the model.\n",
+ "- `HF_TOKEN`: Some models on the HuggingFace Hub are gated and require permission from the owner to access. To deploy a gated model from the HuggingFace Hub using LMI, you must provide an [Access Token](https://huggingface.co/docs/hub/security-tokens) via this environment variable.\n",
+ "- `OPTION_ENGINE`: The engine for DJL to use. In this case, we intend to use [MPI](https://docs.djl.ai/docs/serving/serving/docs/lmi/conceptual_guide/lmi_engine.html). MPI is used to operate on single machine multi-gpu or multiple machines multi-gpu use cases.\n",
+ "- `OPTION_DTYPE`: The data type you plan to cast the model weights to. If not provided, LMI will use fp16.\n",
+ "- `OPTION_TGI_COMPAT`: To get the same response output as HuggingFace's Text Generation Inference, you can use the env `OPTION_TGI_COMPAT=true`.\n",
+ "- `OPTION_TASK`: The task used in Hugging Face for different pipelines. Default is text-generation. For further reading on DJL parameters on SageMaker, follow the [link](https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/deepspeed_user_guide.html)\n",
+ "- `OPTION_ROLLING_BATCH`: Enables continuous batching (iteration level batching) with one of the supported backends. Available backends differ by container, see [Inference Library Configurations](https://docs.djl.ai/docs/serving/serving/docs/lmi/deployment_guide/configurations.html#inference-library-configuration) for mappings.\n",
+ " - In the TensorRT-LLM Container:\n",
+ " - use `OPTION_ROLLING_BATCH=trtllm` to use TensorRT-LLM (this is the default)\n",
+ "- `TENSOR_PARALLEL_DEGREE`: Set to the number of GPU devices over which DeepSpeed needs to partition the model. This parameter also controls the no of workers per model which will be started up when DJL serving runs. Setting this to `max`, which will shard the model across all available GPUs. As an example if we have a 8 GPU machine and we are creating 8 partitions then we will have 1 worker per model to serve the requests.\n",
+ "- `OPTION_MAX_INPUT_LEN`: Maximum input token size you expect the model to have per request. This is a compilation parameter that set to the model for Just-in-Time compilation. If you set this value too low, the model will unable to consume the long input. LMI also validates this at runtime for each request.\n",
+ "- `OPTION_MAX_OUTPUT_LEN`: Maximum output token size you expect the model to have per request. This is a compilation parameter that set to the model for Just-in-Time compilation. If you set this value too low, the model will unable to produce tokens beyond the value you set.\n",
+ "- `OPTION_TRUST_REMOTE_CODE`: If the model artifacts contain custom modeling code, you should set this to true after validating the custom code is not malicious. If you are using a HuggingFace Hub model id, you should also specify HF_REVISION to ensure you are using artifacts and code that you have validated.\n",
+ "\n",
+ "For more details on the configuration options and an exhaustive list, you can refer the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html) and [LMI Starting Guide](https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/trt_llm_user_guide.html)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Create SageMaker endpoint"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Hugging Face Model Id\n",
+ "model_id = \"meta-llama/Llama-2-7b-chat-hf\"\n",
+ "\n",
+ "# Environment variables\n",
+ "hf_token = \"\" # Use for gated models\n",
+ "rolling_batch = \"trtllm\"\n",
+ "max_output_len = 4096\n",
+ "\n",
+ "env = {}\n",
+ "env['HF_MODEL_ID'] = model_id\n",
+ "env['OPTION_ROLLING_BATCH'] = rolling_batch\n",
+ "env['OPTION_DTYPE'] = \"fp16\"\n",
+ "env['OPTION_TGI_COMPAT'] = \"true\"\n",
+ "env['OPTION_ENGINE'] = \"MPI\"\n",
+ "env['OPTION_TASK'] = \"text-generation\"\n",
+ "env['TENSOR_PARALLEL_DEGREE'] = \"max\"\n",
+ "env['OPTION_MAX_INPUT_LEN'] = json.dumps(max_output_len - 1)\n",
+ "env['OPTION_MAX_OUTPUT_LEN'] = json.dumps(max_output_len)\n",
+ "env['OPTION_DEVICE_MAP'] = \"auto\"\n",
+ "# env['OPTION_TRUST_REMOTE_CODE'] = \"true\"\n",
+ "\n",
+ "# Include HF token for gated models\n",
+ "if hf_token != \"\":\n",
+ " env['HF_TOKEN'] = hf_token\n",
+ "else:\n",
+ " print(\"Llama models are gated, please add your HF token before you continue.\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "instance_type: ml.g5.12xlarge\n",
+ "model_id: meta-llama/Llama-2-7b-chat-hf\n",
+ "endpoint_name: llama2-7b-chat-tensorrt-llm-2024-07-06-11-32-41-078\n"
+ ]
+ }
+ ],
+ "source": [
+ "# SageMaker Instance Type\n",
+ "instance_type = \"ml.g5.12xlarge\"\n",
+ "\n",
+ "# Endpoint name\n",
+ "endpoint_name_prefix = \"llama2-7b-chat-tensorrt-llm\"\n",
+ "endpoint_name = sagemaker.utils.name_from_base(endpoint_name_prefix)\n",
+ "\n",
+ "print(f\"instance_type: {instance_type}\")\n",
+ "print(f\"model_id: {model_id}\")\n",
+ "print(f\"endpoint_name: {endpoint_name}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Deploy model to an endpoint\n",
+ "model = sagemaker.Model(\n",
+ " image_uri=inference_image_uri,\n",
+ " role=role,\n",
+ " env=env\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "---------------!"
+ ]
+ }
+ ],
+ "source": [
+ "model.deploy(\n",
+ " initial_instance_count=1,\n",
+ " instance_type=instance_type,\n",
+ " endpoint_name=endpoint_name,\n",
+ " container_startup_health_check_timeout=900,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Run inference and chat with the model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "### Supported Inference Parameters\n",
+ "\n",
+ "---\n",
+ "This model supports the following inference payload parameters:\n",
+ "\n",
+ "* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches max_new_tokens. If specified, it must be a positive integer.\n",
+ "* **temperature:** Controls the randomness in the output. Higher temperature results in output sequence with low-probability words and lower temperature results in output sequence with high-probability words. If `temperature` -> 0, it results in greedy decoding. If specified, it must be a positive float.\n",
+ "* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.\n",
+ "* **return_full_text:** If True, input text will be part of the output generated text. If specified, it must be boolean. The default value for it is False.\n",
+ "\n",
+ "You may specify any subset of the parameters mentioned above while invoking an endpoint. \n",
+ "\n",
+ "---"
+ ]
+ },
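+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a minimal sketch (not executed here), the payload below combines all four of the parameters described above. The values shown are illustrative assumptions rather than tuned recommendations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Illustrative payload only: any subset of the supported parameters may be passed.\n",
+ "example_payload = {\n",
+ " \"inputs\": \"Building a website can be done in 10 simple steps:\",\n",
+ " \"parameters\": {\n",
+ " \"max_new_tokens\": 256, # stop after this many generated tokens\n",
+ " \"temperature\": 0.6, # lower values give more deterministic output\n",
+ " \"top_p\": 0.9, # nucleus sampling threshold\n",
+ " \"return_full_text\": False, # set True to echo the prompt in the output\n",
+ " },\n",
+ "}\n",
+ "print(json.dumps(example_payload, indent=2))"
+ ]
+ },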
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Inference using SageMaker SDK"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Initialize sagemaker client with the endpoint created in the prior step\n",
+ "predictor = sagemaker.Predictor(\n",
+ " endpoint_name=endpoint_name,\n",
+ " sagemaker_session=sess,\n",
+ " serializer=sagemaker.serializers.JSONSerializer(),\n",
+ " deserializer=sagemaker.deserializers.JSONDeserializer(),\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "01. Define Your Website's Purpose and Goals:\n",
+ "Determine the purpose and goals of your website, including the message you want to convey, the audience you want to reach, and the actions you want visitors to take.\n",
+ "\n",
+ "02. Choose a Domain Name:\n",
+ "Select a unique and memorable domain name that reflects the content and purpose of your website. This is the address that visitors will use to access your website.\n",
+ "\n",
+ "03. Choose a Web Host:\n",
+ "Find a reliable and affordable web host that meets your needs, including storage space, bandwidth, and technical support.\n",
+ "\n",
+ "04. Plan Your Website's Structure:\n",
+ "Develop a website outline or wireframe that organizes your content into logical sections and subsections. This will help you create a clear and intuitive navigation menu.\n",
+ "\n",
+ "05. Create Content:\n",
+ "Write and gather content for your website, including text, images, videos, and other media. Make sure your content is well-written, informative, and optimized for search engines.\n",
+ "\n",
+ "06. Design Your Website:\n",
+ "Design your website's layout and visual elements, including colors, fonts, and images. Use a design tool or work with a web designer to create a visually appealing and user-friendly website.\n",
+ "\n",
+ "07. Add Interactive Elements:\n",
+ "Incorporate interactive elements such as forms, contact pages, and social media feeds to engage visitors and encourage them to take action.\n",
+ "\n",
+ "08. Test and Launch:\n",
+ "Test your website for functionality, usability, and search engine optimization. Make any necessary revisions and launch your website to the public.\n",
+ "\n",
+ "09. Maintain and Update:\n",
+ "Regularly update your website's content and design to keep visitors engaged and ensure that your website remains relevant and accessible.\n",
+ "\n",
+ "10. Monitor and Analyze:\n",
+ "Use analytics tools to track your website's traffic, engagement, and conversion rates. Monitor your website's performance and make data-driven decisions to improve its effectiveness over time.\n",
+ "\n",
+ "By following these 10 simple steps, you can create a professional-looking and effective website that meets your business goals and appeals to your target audience.\n"
+ ]
+ }
+ ],
+ "source": [
+ "prompt = \"\"\"[INST] <>\n",
+ "{system_prompt}\n",
+ "<>\n",
+ "\n",
+ "{message_prompt} [/INST] \"\"\".format(\n",
+ " system_prompt=\"You are a helpful assistant.\",\n",
+ " message_prompt=\"Building a website can be done in 10 simple steps:\"\n",
+ ")\n",
+ "\n",
+ "inputs = {\n",
+ " \"inputs\": prompt,\n",
+ " \"parameters\": {\n",
+ " \"temperature\": 0.8,\n",
+ " \"top_p\": 0.95,\n",
+ " \"max_new_tokens\": 512,\n",
+ " \"do_sample\": False\n",
+ " }\n",
+ "}\n",
+ "response = predictor.predict(inputs)\n",
+ "print(response[0]['generated_text'].strip().replace('', ''))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Inference using Boto3 SDK"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Initialize sagemaker client with boto3 using the endpoint created from prior step\n",
+ "smr_client = boto3.client(\"sagemaker-runtime\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " Sure, I'd be happy to help! Here is a basic recipe for homemade mayonnaise:\n",
+ "\n",
+ "Ingredients:\n",
+ "\n",
+ "* 2 egg yolks\n",
+ "* 1/2 cup (120 ml) neutral-tasting oil, such as canola or grapeseed\n",
+ "* 1 tablespoon lemon juice or vinegar\n",
+ "* Salt and pepper to taste\n",
+ "\n",
+ "Instructions:\n",
+ "\n",
+ "1. In a small bowl, whisk together the egg yolks and lemon juice or vinegar until well combined.\n",
+ "2. Slowly pour the oil into the egg yolk mixture, whisking constantly. You can use an electric mixer on low speed or whisk by hand.\n",
+ "3. Continue whisking until the mixture thickens and emulsifies, which should take about 5-7 minutes. You will know it's ready when the mixture has doubled in volume and is smooth and creamy.\n",
+ "4. Taste and adjust the seasoning as needed with salt and pepper.\n",
+ "5. Cover and refrigerate the mayonnaise for at least 30 minutes before using.\n",
+ "\n",
+ "That's it! Homemade mayonnaise can be used as a sandwich spread, salad dressing, or dip. Enjoy!\n",
+ "\n",
+ "Note: If you want to make a vegan mayonnaise, you can replace the egg yolks with 1/4 cup (60 ml) of mashed avocado or a flax egg (1 tablespoon ground flaxseed + 3 tablespoons water, mixed and allowed to gel for 5 minutes).\n"
+ ]
+ }
+ ],
+ "source": [
+ "prompt = \"\"\"[INST] <>\n",
+ "{system_prompt}\n",
+ "<>\n",
+ "\n",
+ "{message_prompt} [/INST]\"\"\".format(\n",
+ " system_prompt=\"You are a helpful assistant.\",\n",
+ " message_prompt=\"\"\"what is the recipe of mayonnaise?\"\"\"\n",
+ ")\n",
+ "\n",
+ "response = smr_client.invoke_endpoint(\n",
+ " EndpointName=endpoint_name,\n",
+ " Body=json.dumps(\n",
+ " {\n",
+ " \"inputs\": prompt,\n",
+ " \"parameters\": {\n",
+ " \"temperature\": 0.8,\n",
+ " \"top_p\": 0.95,\n",
+ " \"max_new_tokens\": 4000,\n",
+ " \"do_sample\": False\n",
+ " },\n",
+ " }\n",
+ " ),\n",
+ " ContentType=\"application/json\",\n",
+ ")[\"Body\"].read().decode(\"utf8\")\n",
+ "\n",
+ "print(json.loads(response)[0]['generated_text'].replace('', ''))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Conclusion\n",
+ "In this post, we demonstrated how to use SageMaker large model inference containers to host Codestral 22B. We used DeepSpeed’s model parallel techniques with multiple GPUs on a single SageMaker machine learning instance. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Clean Up"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Delete the endpoint\n",
+ "sess.delete_endpoint(endpoint_name)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# In case the end point failed we still want to delete the model\n",
+ "sess.delete_endpoint_config(endpoint_name)\n",
+ "model.delete_model()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "conda_python3",
+ "language": "python",
+ "name": "conda_python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.14"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/distributed-inference-deployment/Llama2-vLLM-SageMaker.ipynb b/distributed-inference-deployment/Llama2-vLLM-SageMaker.ipynb
new file mode 100644
index 0000000..edbeb09
--- /dev/null
+++ b/distributed-inference-deployment/Llama2-vLLM-SageMaker.ipynb
@@ -0,0 +1,638 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Deploy Llama 2 on Amazon SageMaker with vLLM"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "[Llama 2](https://llama.meta.com/llama2/) are pretrained models trained on 2 trillion tokens with a 4k context length. Its fine-tuned chat models have been trained on over 1 million human annotations. Llama 2 has undergone internal and external adversarial testing across fine-tuned models to identify potential toxicity, bias, and other gaps in performance. To learn more about Llama 2 models, click [here](https://llama.meta.com/llama2/).\n",
+ "\n",
+ "SageMaker has rolled out [vLLM container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) which now provides users with the ability to leverage the managed serving capabilities and help to provide the un-differentiated heavy lifting.\n",
+ "\n",
+ "In this notebook, we combine the strengths of two powerful tools: [DJL](https://docs.djl.ai/) (Deep Java Library) for the serving framework and [vLLM](https://docs.vllm.ai/en/stable/) for distributed large language model inference. DJLServing, a high-performance universal model serving solution powered by DJL, handles the overall serving architecture.\n",
+ "\n",
+ "In our setup, vLLM handles the core LLM inference tasks, leveraging its optimizations to achieve high performance and low latency. DJLServing manages the broader serving infrastructure, handling incoming requests, load balancing, and coordinating with vLLM for efficient inference.\n",
+ "\n",
+ "This combination allows us to deploy the `Llama 2 7B` model across GPUs on the `ml.g5.12xlarge` instance with optimal resource utilization. vLLM's efficiencies in memory management and request handling enable us to serve this large model with improved throughput compared to traditional serving methods. To learn more about DJL, DJLServing, and vLLM you can refer to this [blog post](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-mixtral-and-llama-2-models-with-new-amazon-sagemaker-containers/).\n",
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ " \n",
+ "\n",
+ "NOTE: Llama models are licensed under a bespoke commercial license that balances open access to the models with responsibility and protections in place to help address potential misuse. Their license allows for broad commercial use, as well as for developers to create and redistribute additional work on top of Llama models. For more details, their licenses can be found at [Meta Llama 2](https://llama.meta.com/license/) and [Meta Llama 3](https://llama.meta.com/llama3/license/).\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Requirements"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "1. Create an Amazon SageMaker Notebook Instance - [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html)\n",
+ " - For Notebook Instance type, choose `ml.t3.medium`.\n",
+ "2. For Select Kernel, choose [conda_python3](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-prepare.html).\n",
+ "3. Install the required packages."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ " \n",
+ "\n",
+ "
NOTE:\n",
+ "\n",
+ "- For
Amazon SageMaker Studio, select Kernel \"
Python 3 (ipykernel)\".\n",
+ "\n",
+ "- For
Amazon SageMaker Studio Classic, select Image \"
Base Python 3.0\" and Kernel \"
Python 3\".\n",
+ "\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To run this notebook you would need to install the following dependencies:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "!pip install boto3==1.34.132 -qU --force --quiet --no-warn-conflicts\n",
+ "!pip install sagemaker==2.224.2 -qU --force --quiet --no-warn-conflicts"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "### Import libraries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml\n",
+ "sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml\n"
+ ]
+ }
+ ],
+ "source": [
+ "import boto3\n",
+ "import json\n",
+ "import sagemaker"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "2.224.2\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(sagemaker.__version__)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Initialize parameters"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "sagemaker role arn: arn:aws:iam::570598552974:role/txt2sql-SageMakerExecutionRole-PAgMr5TND4x0\n",
+ "sagemaker session region: us-east-1\n"
+ ]
+ }
+ ],
+ "source": [
+ "# execution role for the endpoint\n",
+ "role = sagemaker.get_execution_role()\n",
+ "\n",
+ "# sagemaker session for interacting with different AWS APIs\n",
+ "sess = sagemaker.session.Session()\n",
+ "\n",
+ "# Region\n",
+ "region_name = sess._region_name\n",
+ "\n",
+ "print(f\"sagemaker role arn: {role}\")\n",
+ "print(f\"sagemaker session region: {region_name}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Image URI of the DJL Container"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "LMI DLCs offer a low-code interface that simplifies using state-of-the-art inference optimization techniques and hardware. LMI allows you to apply tensor parallelism; the latest efficient attention, batching, quantization, and memory management techniques; token streaming; and much more, by just requiring the model ID and optional model parameters. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "See available Large Model Inference DLC's [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) and for more details [here](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/announcements/deepspeed-deprecation.md)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "DCL Image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124\n"
+ ]
+ }
+ ],
+ "source": [
+ "inference_image_uri = sagemaker.image_uris.retrieve(\n",
+ " framework=\"djl-lmi\",\n",
+ " region=region_name,\n",
+ " version=\"0.28.0\"\n",
+ ")\n",
+ "print(f\"DCL Image going to be used is ---- > {inference_image_uri}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Available Environment Variable Configurations"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Here is a list of settings that we use in this configuration file:\n",
+ "\n",
+ "- `HF_MODEL_ID`: The model id of a pretrained model hosted inside a model repository on [huggingface.co](https://huggingface.co/models). The container uses this model id to download the corresponding model repository on huggingface.co. This is an optional setting and is not needed in the scenario where you are brining your own model. If you are getting your own model, you can include the URI of the Amazon S3 bucket that contains the model.\n",
+ "- `HF_TOKEN`: Some models on the HuggingFace Hub are gated and require permission from the owner to access. To deploy a gated model from the HuggingFace Hub using LMI, you must provide an [Access Token](https://huggingface.co/docs/hub/security-tokens) via this environment variable.\n",
+ "- `OPTION_ENGINE`: The engine for DJL to use. In this case, we intend to use [vLLM](https://docs.vllm.ai/en/stable/) and hence set it as **Python**.\n",
+ "- `OPTION_DTYPE`: The data type you plan to cast the model weights to. If not provided, LMI will use fp16.\n",
+ "- `OPTION_TGI_COMPAT`: To get the same response output as HuggingFace's Text Generation Inference, you can use the env `OPTION_TGI_COMPAT=true`.\n",
+ "- `OPTION_TASK`: The task used in Hugging Face for different pipelines. Default is text-generation. For further reading on DJL parameters on SageMaker, follow the [link](https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/deepspeed_user_guide.html)\n",
+ "- `OPTION_ROLLING_BATCH`: Enables continuous batching (iteration level batching) with one of the supported backends. Available backends differ by container, see [Inference Library Configurations](https://docs.djl.ai/docs/serving/serving/docs/lmi/deployment_guide/configurations.html#inference-library-configuration) for mappings.\n",
+ " - In the LMI Container:\n",
+ " - to use vLLM, use `OPTION_ROLLING_BATCH=vllm`\n",
+ " - to use lmi-dist, use `OPTION_ROLLING_BATCH=lmi-dist`\n",
+ " - to use huggingface accelerate, use `OPTION_ROLLING_BATCH=auto` for text generation models, or option.rolling_batch=disable for non-text generation models.\n",
+ "- `TENSOR_PARALLEL_DEGREE`: Set to the number of GPU devices over which DeepSpeed needs to partition the model. This parameter also controls the no of workers per model which will be started up when DJL serving runs. Setting this to `max`, which will shard the model across all available GPUs. As an example if we have a 8 GPU machine and we are creating 8 partitions then we will have 1 worker per model to serve the requests.\n",
+ "- `OPTION_DEVICE_MAP`: The HuggingFace accelerate device_map to use.\n",
+ "- `OPTION_TRUST_REMOTE_CODE`: If the model artifacts contain custom modeling code, you should set this to true after validating the custom code is not malicious. If you are using a HuggingFace Hub model id, you should also specify HF_REVISION to ensure you are using artifacts and code that you have validated.\n",
+ "\n",
+ "For more details on the configuration options and an exhaustive list, you can refer the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html) and [LMI Starting Guide](https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/starting-guide.html)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Create SageMaker endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Here are some key differences between the available backends in LMI:\n",
+ "\n",
+ "+ **LMI-Distributed Library** – This is the AWS framework to run inference with LLMs, inspired from OSS, to achieve the best possible latency and accuracy on the result. LMI-Dist employs optimized default configurations, such as GPU core counting, to ensure efficient performance and resource utilization.\n",
+ "+ **LMI vLLM** – This is the AWS backend implementation of the memory-efficient vLLM inference library"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Hugging Face Model Id\n",
+ "model_id = \"meta-llama/Llama-2-7b-chat-hf\"\n",
+ "\n",
+ "# Environment variables\n",
+ "hf_token = \"\" # Use for gated models\n",
+ "rolling_batch = \"vllm\" # \"vllm\", \"lmi-dist\"\n",
+ "\n",
+ "env = {}\n",
+ "env['HF_MODEL_ID'] = model_id\n",
+ "env['OPTION_ROLLING_BATCH'] = rolling_batch\n",
+ "env['OPTION_DTYPE'] = \"fp16\"\n",
+ "env['OPTION_TGI_COMPAT'] = \"true\"\n",
+ "\n",
+ "if rolling_batch != \"lmi-dist\":\n",
+ " env['OPTION_ENGINE'] = \"Python\"\n",
+ " env['OPTION_TASK'] = \"text-generation\"\n",
+ " env['TENSOR_PARALLEL_DEGREE'] = \"max\"\n",
+ " env['OPTION_DEVICE_MAP'] = \"auto\"\n",
+ " # env['OPTION_TRUST_REMOTE_CODE'] = \"true\"\n",
+ " \n",
+ "# Include HF token for gated models\n",
+ "if hf_token != \"\":\n",
+ " env['HF_TOKEN'] = hf_token\n",
+ "else:\n",
+ " print(\"Llama models are gated, please add your HF token before you continue.\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "instance_type: ml.g5.12xlarge\n",
+ "model_id: meta-llama/Llama-2-7b-chat-hf\n",
+ "endpoint_name: llama2-7b-chat-vllm-2024-07-08-17-27-28-261\n"
+ ]
+ }
+ ],
+ "source": [
+ "# SageMaker Instance Type\n",
+ "instance_type = \"ml.g5.12xlarge\"\n",
+ "\n",
+ "# Endpoint name\n",
+ "endpoint_name_prefix = \"llama2-7b-chat-vllm\"\n",
+ "endpoint_name = sagemaker.utils.name_from_base(endpoint_name_prefix)\n",
+ "\n",
+ "print(f\"instance_type: {instance_type}\")\n",
+ "print(f\"model_id: {model_id}\")\n",
+ "print(f\"endpoint_name: {endpoint_name}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Deploy model to an endpoint\n",
+ "model = sagemaker.Model(\n",
+ " image_uri=inference_image_uri,\n",
+ " role=role,\n",
+ " env=env\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "--------------!"
+ ]
+ }
+ ],
+ "source": [
+ "model.deploy(\n",
+ " initial_instance_count=1,\n",
+ " instance_type=instance_type,\n",
+ " endpoint_name=endpoint_name,\n",
+ " container_startup_health_check_timeout=900,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Run inference and chat with the model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "### Supported Inference Parameters\n",
+ "\n",
+ "---\n",
+ "This model supports the following inference payload parameters:\n",
+ "\n",
+ "* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches max_new_tokens. If specified, it must be a positive integer.\n",
+ "* **temperature:** Controls the randomness in the output. Higher temperature results in output sequence with low-probability words and lower temperature results in output sequence with high-probability words. If `temperature` -> 0, it results in greedy decoding. If specified, it must be a positive float.\n",
+ "* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.\n",
+ "* **return_full_text:** If True, input text will be part of the output generated text. If specified, it must be boolean. The default value for it is False.\n",
+ "\n",
+ "You may specify any subset of the parameters mentioned above while invoking an endpoint. \n",
+ "\n",
+ "---"
+ ]
+ },
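+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a minimal sketch (not executed here), the payload below shows how the four parameters described above could be combined in a single request. The values are illustrative assumptions, not tuned recommendations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Illustrative payload only: any subset of the supported parameters may be passed.\n",
+ "example_payload = {\n",
+ " \"inputs\": \"Building a website can be done in 10 simple steps:\",\n",
+ " \"parameters\": {\n",
+ " \"max_new_tokens\": 256, # stop after this many generated tokens\n",
+ " \"temperature\": 0.6, # lower values give more deterministic output\n",
+ " \"top_p\": 0.9, # nucleus sampling threshold\n",
+ " \"return_full_text\": False, # set True to echo the prompt in the output\n",
+ " },\n",
+ "}\n",
+ "print(json.dumps(example_payload, indent=2))"
+ ]
+ },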
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Inference using SageMaker SDK"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Initialize sagemaker client with the endpoint created in the prior step\n",
+ "predictor = sagemaker.Predictor(\n",
+ " endpoint_name=endpoint_name,\n",
+ " sagemaker_session=sess,\n",
+ " serializer=sagemaker.serializers.JSONSerializer(),\n",
+ " deserializer=sagemaker.deserializers.JSONDeserializer(),\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "01. Define Your Website's Purpose: Determine the purpose of your website, including the type of content, audience, and goals.\n",
+ "\n",
+ "02. Choose a Domain Name: Select a unique and memorable domain name that reflects your website's purpose and is easy to spell and remember.\n",
+ "\n",
+ "03. Select a Web Host: Choose a reliable and affordable web host that meets your website's needs, including storage space, bandwidth, and technical support.\n",
+ "\n",
+ "04. Plan Your Website's Structure: Create a basic structure for your website, including the main pages and subpages, and organize them in a logical and intuitive manner.\n",
+ "\n",
+ "05. Design Your Website: Create a visually appealing design for your website, including the layout, colors, fonts, and images, that reflects your brand and appeals to your target audience.\n",
+ "\n",
+ "06. Write and Publish Content: Create high-quality, engaging and informative content for your website, including text, images, videos, and other media, that meets the needs of your target audience.\n",
+ "\n",
+ "07. Set Up Navigation and Menus: Create navigation and menus that are easy to use and help visitors find the information they need, including a homepage, about page, contact page, and any other relevant pages.\n",
+ "\n",
+ "08. Add Interactive Elements: Incorporate interactive elements such as forms, quizzes, and other interactive features that engage visitors and encourage them to take action.\n",
+ "\n",
+ "09. Optimize for Search Engines: Optimize your website for search engines by using keywords, meta tags, and other techniques to improve your website's visibility and ranking in search results.\n",
+ "\n",
+ "10. Launch and Maintain Your Website: Launch your website and continue to maintain it by updating content, fixing broken links, and monitoring analytics to ensure it is running smoothly and meeting your goals.\n",
+ "\n",
+ "By following these 10 simple steps, you can create a professional-looking and effective website that meets your needs and resonates with your target audience.\n"
+ ]
+ }
+ ],
+ "source": [
+ "prompt = \"\"\"[INST] <>\n",
+ "{system_prompt}\n",
+ "<>\n",
+ "\n",
+ "{message_prompt} [/INST] \"\"\".format(\n",
+ " system_prompt=\"You are a helpful assistant.\",\n",
+ " message_prompt=\"Building a website can be done in 10 simple steps:\"\n",
+ ")\n",
+ "\n",
+ "inputs = {\n",
+ " \"inputs\": prompt,\n",
+ " \"parameters\": {\n",
+ " \"temperature\": 0.8,\n",
+ " \"top_p\": 0.95,\n",
+ " \"max_new_tokens\": 4000,\n",
+ " \"do_sample\": False\n",
+ " }\n",
+ "}\n",
+ "response = predictor.predict(inputs)\n",
+ "print(response[0]['generated_text'].strip())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Inference using Boto3 SDK"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Initialize sagemaker client with boto3 using the endpoint created from prior step\n",
+ "smr_client = boto3.client(\"sagemaker-runtime\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " Of course! Mayonnaise is a popular condiment made from a mixture of egg yolks, oil, vinegar or lemon juice, and seasonings. Here's a basic recipe for homemade mayonnaise:\n",
+ "\n",
+ "Ingredients:\n",
+ "\n",
+ "* 2 egg yolks\n",
+ "* 1/2 cup (120 ml) neutral-tasting oil, such as canola or grapeseed\n",
+ "* 1 tablespoon (15 ml) vinegar or lemon juice\n",
+ "* 1/2 teaspoon (2.5 ml) salt\n",
+ "* 1/4 teaspoon (1.25 ml) sugar (optional)\n",
+ "* 1/4 teaspoon (1.25 ml) mustard powder (optional)\n",
+ "\n",
+ "Instructions:\n",
+ "\n",
+ "1. In a small bowl, whisk together the egg yolks and salt until well combined.\n",
+ "2. Slowly pour in the oil while continuously whisking the mixture. You can do this by hand with a whisk or use an electric mixer on low speed.\n",
+ "3. Once you've added about half of the oil, add the vinegar or lemon juice and continue whisking until the mixture thickens and emulsifies. This should take about 5-7 minutes.\n",
+ "4. Taste and adjust the seasoning as needed. If the mayonnaise is too thick, add a little water. If it's too thin, add more oil.\n",
+ "5. Cover the bowl with plastic wrap and let the mayonnaise sit at room temperature for at least 30 minutes before using. This will allow the flavors to meld together and the mayonnaise to thicken further.\n",
+ "\n",
+ "That's it! You can use this basic recipe as a starting point and adjust the seasonings to suit your taste preferences. Some common variations include adding a little bit of Dijon mustard for a tangier flavor or some chopped fresh herbs for added flavor and color. Enjoy!\n"
+ ]
+ }
+ ],
+ "source": [
+ "prompt = \"\"\"[INST] <>\n",
+ "{system_prompt}\n",
+ "<>\n",
+ "\n",
+ "{message_prompt} [/INST]\"\"\".format(\n",
+ " system_prompt=\"You are a helpful assistant.\",\n",
+ " message_prompt=\"\"\"what is the recipe of mayonnaise?\"\"\"\n",
+ ")\n",
+ "\n",
+ "response = smr_client.invoke_endpoint(\n",
+ " EndpointName=endpoint_name,\n",
+ " Body=json.dumps(\n",
+ " {\n",
+ " \"inputs\": prompt,\n",
+ " \"parameters\": {\n",
+ " \"temperature\": 0.8,\n",
+ " \"top_p\": 0.95,\n",
+ " \"max_new_tokens\": 4000,\n",
+ " \"do_sample\": False\n",
+ " },\n",
+ " }\n",
+ " ),\n",
+ " ContentType=\"application/json\",\n",
+ ")[\"Body\"].read().decode(\"utf8\")\n",
+ "\n",
+ "print(json.loads(response)[0]['generated_text'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Conclusion\n",
+ "In this post, we demonstrated how to use SageMaker large model inference containers to host Codestral 22B. We used DeepSpeed’s model parallel techniques with multiple GPUs on a single SageMaker machine learning instance. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Clean Up"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Delete the endpoint\n",
+ "sess.delete_endpoint(endpoint_name)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# In case the end point failed we still want to delete the model\n",
+ "sess.delete_endpoint_config(endpoint_name)\n",
+ "model.delete_model()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "conda_python3",
+ "language": "python",
+ "name": "conda_python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.14"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/distributed-inference-deployment/Llama3-TensorRT-LLM-SageMaker.ipynb b/distributed-inference-deployment/Llama3-TensorRT-LLM-SageMaker.ipynb
new file mode 100644
index 0000000..6719f77
--- /dev/null
+++ b/distributed-inference-deployment/Llama3-TensorRT-LLM-SageMaker.ipynb
@@ -0,0 +1,611 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Deploy Llama 3 on Amazon SageMaker with TensorRT-LLM"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "[Llama 3](https://llama.meta.com/llama3/) are pretrained models trained over 15 trillion tokens, – a training dataset 7x larger than that used for Llama 2, with a 8k context length. The models excels at text summarization and accuracy, text classification and nuance, sentiment analysis and nuance reasoning, language modeling, dialogue systems, code generation, and following instructions. To learn more about Llama 3 models, click [here](https://llama.meta.com/llama3/).\n",
+ "\n",
+ "SageMaker has rolled out [TensorRT-LLM container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) which now provides users with the ability to leverage the managed serving capabilities and help to provide the un-differentiated heavy lifting.\n",
+ "\n",
+ "In this notebook, we combine the strengths of two powerful tools: [DJL](https://docs.djl.ai/) (Deep Java Library) for the serving framework and [TensorRT-LLM](https://nvidia.github.io/TensorRT-LLM/) for distributed large language model inference on Nvidia. DJLServing, a high-performance universal model serving solution powered by DJL, handles the overall serving architecture.\n",
+ "\n",
+ "In our setup, vLLM handles the core LLM inference tasks, leveraging its optimizations to achieve high performance and low latency. DJLServing manages the broader serving infrastructure, handling incoming requests, load balancing, and coordinating with vLLM for efficient inference.\n",
+ "\n",
+ "This combination allows us to deploy the `Llama 3 8B` model across GPUs on the `ml.g5.12xlarge` instance with optimal resource utilization. vLLM's efficiencies in memory management and request handling enable us to serve this large model with improved throughput compared to traditional serving methods. To learn more about DJL, DJLServing, and TensorRT-LLM you can refer to this [blog post](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-llms-with-new-amazon-sagemaker-containers/)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "## Requirements"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "1. Create an Amazon SageMaker Notebook Instance - [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html)\n",
+ " - For Notebook Instance type, choose `ml.t3.medium`.\n",
+ "2. For Select Kernel, choose [conda_python3](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-prepare.html).\n",
+ "3. Install the required packages."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ " \n",
+ "\n",
+ "
NOTE:\n",
+ "\n",
+ "- For
Amazon SageMaker Studio, select Kernel \"
Python 3 (ipykernel)\".\n",
+ "\n",
+ "- For
Amazon SageMaker Studio Classic, select Image \"
Base Python 3.0\" and Kernel \"
Python 3\".\n",
+ "\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To run this notebook you would need to install the following dependencies:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "!pip install boto3==1.34.132 -qU --force --quiet --no-warn-conflicts\n",
+ "!pip install sagemaker==2.224.2 -qU --force --quiet --no-warn-conflicts"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "### Import libraries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml\n",
+ "sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml\n"
+ ]
+ }
+ ],
+ "source": [
+ "import boto3\n",
+ "import json\n",
+ "import sagemaker"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "2.224.2\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(sagemaker.__version__)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Initialize parameters"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "sagemaker role arn: arn:aws:iam::570598552974:role/txt2sql-SageMakerExecutionRole-PAgMr5TND4x0\n",
+ "sagemaker session region: us-east-1\n"
+ ]
+ }
+ ],
+ "source": [
+ "# execution role for the endpoint\n",
+ "role = sagemaker.get_execution_role()\n",
+ "\n",
+ "# sagemaker session for interacting with different AWS APIs\n",
+ "sess = sagemaker.session.Session()\n",
+ "\n",
+ "# Region\n",
+ "region_name = sess._region_name\n",
+ "\n",
+ "print(f\"sagemaker role arn: {role}\")\n",
+ "print(f\"sagemaker session region: {region_name}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Image URI of the DJL Container"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "LMI DLCs offer a low-code interface that simplifies using state-of-the-art inference optimization techniques and hardware. LMI allows you to apply tensor parallelism; the latest efficient attention, batching, quantization, and memory management techniques; token streaming; and much more, by just requiring the model ID and optional model parameters. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "See available Large Model Inference DLC's [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) and for more details [here](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/announcements/deepspeed-deprecation.md)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "DCL Image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-tensorrtllm0.9.0-cu122\n"
+ ]
+ }
+ ],
+ "source": [
+ "inference_image_uri = sagemaker.image_uris.retrieve(\n",
+ " framework=\"djl-tensorrtllm\",\n",
+ " region=region_name,\n",
+ " version=\"0.28.0\"\n",
+ ")\n",
+ "print(f\"DCL Image going to be used is ---- > {inference_image_uri}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Available Environment Variable Configurations"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Here is a list of settings that we use in this configuration file:\n",
+ "\n",
+ "- `HF_MODEL_ID`: The model id of a pretrained model hosted inside a model repository on [huggingface.co](https://huggingface.co/models). The container uses this model id to download the corresponding model repository on huggingface.co. This is an optional setting and is not needed in the scenario where you are brining your own model. If you are getting your own model, you can include the URI of the Amazon S3 bucket that contains the model.\n",
+ "- `HF_TOKEN`: Some models on the HuggingFace Hub are gated and require permission from the owner to access. To deploy a gated model from the HuggingFace Hub using LMI, you must provide an [Access Token](https://huggingface.co/docs/hub/security-tokens) via this environment variable.\n",
+ "- `OPTION_ENGINE`: The engine for DJL to use. In this case, we intend to use [MPI](https://docs.djl.ai/docs/serving/serving/docs/lmi/conceptual_guide/lmi_engine.html). MPI is used to operate on single machine multi-gpu or multiple machines multi-gpu use cases.\n",
+ "- `OPTION_DTYPE`: The data type you plan to cast the model weights to. If not provided, LMI will use fp16.\n",
+ "- `OPTION_TGI_COMPAT`: To get the same response output as HuggingFace's Text Generation Inference, you can use the env `OPTION_TGI_COMPAT=true`.\n",
+ "- `OPTION_TASK`: The task used in Hugging Face for different pipelines. Default is text-generation. For further reading on DJL parameters on SageMaker, follow the [link](https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/deepspeed_user_guide.html)\n",
+ "- `OPTION_ROLLING_BATCH`: Enables continuous batching (iteration level batching) with one of the supported backends. Available backends differ by container, see [Inference Library Configurations](https://docs.djl.ai/docs/serving/serving/docs/lmi/deployment_guide/configurations.html#inference-library-configuration) for mappings.\n",
+ " - In the TensorRT-LLM Container:\n",
+ " - use `OPTION_ROLLING_BATCH=trtllm` to use TensorRT-LLM (this is the default)\n",
+ "- `TENSOR_PARALLEL_DEGREE`: Set to the number of GPU devices over which DeepSpeed needs to partition the model. This parameter also controls the no of workers per model which will be started up when DJL serving runs. Setting this to `max`, which will shard the model across all available GPUs. As an example if we have a 8 GPU machine and we are creating 8 partitions then we will have 1 worker per model to serve the requests.\n",
+ "- `OPTION_MAX_INPUT_LEN`: Maximum input token size you expect the model to have per request. This is a compilation parameter that set to the model for Just-in-Time compilation. If you set this value too low, the model will unable to consume the long input. LMI also validates this at runtime for each request.\n",
+ "- `OPTION_MAX_OUTPUT_LEN`: Maximum output token size you expect the model to have per request. This is a compilation parameter that set to the model for Just-in-Time compilation. If you set this value too low, the model will unable to produce tokens beyond the value you set.\n",
+ "- `OPTION_TRUST_REMOTE_CODE`: If the model artifacts contain custom modeling code, you should set this to true after validating the custom code is not malicious. If you are using a HuggingFace Hub model id, you should also specify HF_REVISION to ensure you are using artifacts and code that you have validated.\n",
+ "\n",
+ "For more details on the configuration options and an exhaustive list, you can refer the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html) and [LMI Starting Guide](https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/trt_llm_user_guide.html)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Create SageMaker endpoint"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Hugging Face Model Id\n",
+ "model_id = \"meta-llama/Meta-Llama-3-8B-Instruct\"\n",
+ "\n",
+ "# Environment variables\n",
+ "hf_token = \"\" # Use for gated models\n",
+ "rolling_batch = \"trtllm\"\n",
+ "max_output_len = 8192\n",
+ "\n",
+ "env = {}\n",
+ "env['HF_MODEL_ID'] = model_id\n",
+ "env['OPTION_ROLLING_BATCH'] = rolling_batch\n",
+ "env['OPTION_DTYPE'] = \"bf16\"\n",
+ "env['OPTION_TGI_COMPAT'] = \"true\"\n",
+ "env['OPTION_ENGINE'] = \"MPI\"\n",
+ "env['OPTION_TASK'] = \"text-generation\"\n",
+ "env['TENSOR_PARALLEL_DEGREE'] = \"max\"\n",
+ "env['OPTION_MAX_INPUT_LEN'] = json.dumps(max_output_len - 1)\n",
+ "env['OPTION_MAX_OUTPUT_LEN'] = json.dumps(max_output_len)\n",
+ "env['OPTION_DEVICE_MAP'] = \"auto\"\n",
+ "# env['OPTION_MAX_ROLLING_BATCH'] = \"\"\n",
+ "# env['OPTION_TRUST_REMOTE_CODE'] = \"true\"\n",
+ " \n",
+ "# Include HF token for gated models\n",
+ "if hf_token != \"\":\n",
+ " env['HF_TOKEN'] = hf_token\n",
+ "else:\n",
+ " print(\"Llama models are gated, please add your HF token before you continue.\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "instance_type: ml.g5.12xlarge\n",
+ "model_id: meta-llama/Meta-Llama-3-8B-Instruct\n",
+ "endpoint_name: llama3-8b-instruct-tensorrt-llm-2024-07-06-11-33-28-546\n"
+ ]
+ }
+ ],
+ "source": [
+ "# SageMaker Instance Type\n",
+ "instance_type = \"ml.g5.12xlarge\"\n",
+ "\n",
+ "# Endpoint name\n",
+ "endpoint_name_prefix = \"llama3-8b-instruct-tensorrt-llm\"\n",
+ "endpoint_name = sagemaker.utils.name_from_base(endpoint_name_prefix)\n",
+ "\n",
+ "print(f\"instance_type: {instance_type}\")\n",
+ "print(f\"model_id: {model_id}\")\n",
+ "print(f\"endpoint_name: {endpoint_name}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Deploy model to an endpoint\n",
+ "model = sagemaker.Model(\n",
+ " image_uri=inference_image_uri,\n",
+ " role=role,\n",
+ " env=env\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "---------------!"
+ ]
+ }
+ ],
+ "source": [
+ "model.deploy(\n",
+ " initial_instance_count=1,\n",
+ " instance_type=instance_type,\n",
+ " endpoint_name=endpoint_name,\n",
+ " container_startup_health_check_timeout=900,\n",
+ ")"
+ ]
+ },
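+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Optionally, you can confirm that the endpoint has reached the `InService` status before sending requests. The cell below is a minimal sketch that uses the `boto3` SageMaker client with the `endpoint_name` defined above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Optional check: describe the endpoint and print its current status\n",
+ "sm_client = boto3.client(\"sagemaker\")\n",
+ "endpoint_status = sm_client.describe_endpoint(EndpointName=endpoint_name)[\"EndpointStatus\"]\n",
+ "print(f\"Endpoint status: {endpoint_status}\")"
+ ]
+ },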
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Run inference and chat with the model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "### Supported Inference Parameters\n",
+ "\n",
+ "---\n",
+ "This model supports the following inference payload parameters:\n",
+ "\n",
+ "* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches max_new_tokens. If specified, it must be a positive integer.\n",
+ "* **temperature:** Controls the randomness in the output. Higher temperature results in output sequence with low-probability words and lower temperature results in output sequence with high-probability words. If `temperature` -> 0, it results in greedy decoding. If specified, it must be a positive float.\n",
+ "* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.\n",
+ "* **return_full_text:** If True, input text will be part of the output generated text. If specified, it must be boolean. The default value for it is False.\n",
+ "\n",
+ "You may specify any subset of the parameters mentioned above while invoking an endpoint. \n",
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Inference using SageMaker SDK"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Initialize sagemaker client with the endpoint created in the prior step\n",
+ "predictor = sagemaker.Predictor(\n",
+ " endpoint_name=endpoint_name,\n",
+ " sagemaker_session=sess,\n",
+ " serializer=sagemaker.serializers.JSONSerializer(),\n",
+ " deserializer=sagemaker.deserializers.JSONDeserializer(),\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "That's correct! Building a website can be a straightforward process if you break it down into manageable steps. Here are the 10 simple steps to build a website:\n",
+ "\n",
+ "1. **Plan your website's purpose and audience**: Determine the purpose of your website, who your target audience is, and what you want to achieve with your website.\n",
+ "\n",
+ "2. **Choose a domain name**: Register a unique and memorable domain name that reflects your website's brand and is easy to spell and remember.\n",
+ "\n",
+ "3. **Select a web hosting service**: Choose a reliable web hosting service that meets your needs, including storage space, bandwidth, and customer support.\n",
+ "\n",
+ "4. **Design your website's layout and structure**: Plan the layout and structure of your website, including the number of pages, navigation menu, and content organization.\n",
+ "\n",
+ "5. **Create your website's content**: Write and design the content for your website, including text, images, videos, and other multimedia elements.\n",
+ "\n",
+ "6. **Choose a website builder or CMS**: Decide whether to use a website builder like Wix, Squarespace, or Weebly, or a Content Management System (CMS) like WordPress, Joomla, or Drupal.\n",
+ "\n",
+ "7. **Build your website**: Use your chosen website builder or CMS to create your website, following the design and structure you planned earlier.\n",
+ "\n",
+ "8. **Customize your website's design and functionality**: Personalize your website's design and add features like contact forms, social media links, and e-commerce integrations.\n",
+ "\n",
+ "9. **Test and launch your website**: Test your website for functionality, usability, and performance, and launch it for public access.\n",
+ "\n",
+ "10. **Maintain and update your website**: Regularly update your website's content, fix broken links, and ensure your website remains secure and compatible with the latest browsers and devices.\n",
+ "\n",
+ "By following these 10 simple steps, you can build a professional-looking website that effectively communicates your message and achieves your online goals.\n"
+ ]
+ }
+ ],
+ "source": [
+ "prompt = \"\"\"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n",
+ "{system_prompt}\n",
+ "<|eot_id|><|start_header_id|>user<|end_header_id|>\n",
+ "{message_prompt}\n",
+ "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\"\"\".format(\n",
+ " system_prompt=\"You are a helpful assistant.\",\n",
+ " message_prompt=\"Building a website can be done in 10 simple steps:\"\n",
+ ")\n",
+ "\n",
+ "inputs = {\n",
+ " \"inputs\": prompt,\n",
+ " \"parameters\": {\n",
+ " \"temperature\": 0.8,\n",
+ " \"top_p\": 0.95,\n",
+ " \"max_new_tokens\": 4000,\n",
+ " \"do_sample\": False\n",
+ " }\n",
+ "}\n",
+ "response = predictor.predict(inputs)\n",
+ "print(response[0]['generated_text'].strip().replace('<|eot_id|>', ''))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Inference using Boto3 SDK"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Initialize sagemaker client with boto3 using the endpoint created from prior step\n",
+ "smr_client = boto3.client(\"sagemaker-runtime\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Let's break it down step by step!\n",
+ "\n",
+ "You bought 6 ice cream cones for $1.25 each, so the total cost of the ice cream is:\n",
+ "\n",
+ "6 cones x $1.25 per cone = $7.50\n",
+ "\n",
+ "You paid with a $10 bill, so to find out how much change you got back, we need to subtract the cost of the ice cream from the $10 bill:\n",
+ "\n",
+ "$10 (initial amount) - $7.50 (cost of ice cream) = $2.50\n",
+ "\n",
+ "So, you got $2.50 in change!\n"
+ ]
+ }
+ ],
+ "source": [
+ "prompt = \"\"\"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n",
+ "{system_prompt}\n",
+ "<|eot_id|><|start_header_id|>user<|end_header_id|>\n",
+ "{message_prompt}\n",
+ "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\"\"\".format(\n",
+ " system_prompt=\"You are a helpful assistant.\",\n",
+ " message_prompt=\"\"\"I bought an ice cream for 6 kids. Each cone was $1.25 and I paid with a $10 bill. \n",
+ " How many dollars did I get back? Explain first before answering.\"\"\"\n",
+ ")\n",
+ "\n",
+ "response = smr_client.invoke_endpoint(\n",
+ " EndpointName=endpoint_name,\n",
+ " Body=json.dumps(\n",
+ " {\n",
+ " \"inputs\": prompt,\n",
+ " \"parameters\": {\n",
+ " \"temperature\": 0.8,\n",
+ " \"top_p\": 0.95,\n",
+ " \"max_new_tokens\": 512,\n",
+ " \"do_sample\": False\n",
+ " },\n",
+ " }\n",
+ " ),\n",
+ " ContentType=\"application/json\",\n",
+ ")[\"Body\"].read().decode(\"utf8\")\n",
+ "\n",
+ "print(json.loads(response)[0]['generated_text'].strip().replace('<|eot_id|>', ''))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Conclusion\n",
+ "In this post, we demonstrated how to use SageMaker large model inference containers to host Codestral 22B. We used DeepSpeed’s model parallel techniques with multiple GPUs on a single SageMaker machine learning instance. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Clean Up"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Delete the endpoint\n",
+ "sess.delete_endpoint(endpoint_name)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# In case the end point failed we still want to delete the model\n",
+ "sess.delete_endpoint_config(endpoint_name)\n",
+ "model.delete_model()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "conda_pytorch_p310",
+ "language": "python",
+ "name": "conda_pytorch_p310"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.14"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/distributed-inference-deployment/Llama3-vLLM-SageMaker.ipynb b/distributed-inference-deployment/Llama3-vLLM-SageMaker.ipynb
new file mode 100644
index 0000000..99d9cda
--- /dev/null
+++ b/distributed-inference-deployment/Llama3-vLLM-SageMaker.ipynb
@@ -0,0 +1,672 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Deploy Llama 3 on Amazon SageMaker with vLLM"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "[Llama 3](https://llama.meta.com/llama3/) are pretrained models trained over 15 trillion tokens, – a training dataset 7x larger than that used for Llama 2, with a 8k context length. The models excels at text summarization and accuracy, text classification and nuance, sentiment analysis and nuance reasoning, language modeling, dialogue systems, code generation, and following instructions. To learn more about Llama 3 models, click [here](https://llama.meta.com/llama3/).\n",
+ "\n",
+ "SageMaker has rolled out [vLLM container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) which now provides users with the ability to leverage the managed serving capabilities and help to provide the un-differentiated heavy lifting.\n",
+ "\n",
+ "In this notebook, we combine the strengths of two powerful tools: [DJL](https://docs.djl.ai/) (Deep Java Library) for the serving framework and [vLLM](https://docs.vllm.ai/en/stable/) for distributed large language model inference. DJLServing, a high-performance universal model serving solution powered by DJL, handles the overall serving architecture.\n",
+ "\n",
+ "In our setup, vLLM handles the core LLM inference tasks, leveraging its optimizations to achieve high performance and low latency. DJLServing manages the broader serving infrastructure, handling incoming requests, load balancing, and coordinating with vLLM for efficient inference.\n",
+ "\n",
+ "This combination allows us to deploy the `Llama 3 8B` model across GPUs on the `ml.g5.12xlarge` instance with optimal resource utilization. vLLM's efficiencies in memory management and request handling enable us to serve this large model with improved throughput compared to traditional serving methods. To learn more about DJL, DJLServing, and vLLM you can refer to this [blog post](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-mixtral-and-llama-2-models-with-new-amazon-sagemaker-containers/)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ " \n",
+ "\n",
+ "NOTE: Llama models are licensed under a bespoke commercial license that balances open access to the models with responsibility and protections in place to help address potential misuse. Their license allows for broad commercial use, as well as for developers to create and redistribute additional work on top of Llama models. For more details, their licenses can be found at [Meta Llama 2](https://llama.meta.com/license/) and [Meta Llama 3](https://llama.meta.com/llama3/license/).\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "## Requirements"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "1. Create an Amazon SageMaker Notebook Instance - [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html)\n",
+ " - For Notebook Instance type, choose `ml.t3.medium`.\n",
+ "2. For Select Kernel, choose [conda_python3](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-prepare.html).\n",
+ "3. Install the required packages."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ " \n",
+ "\n",
+ "
NOTE:\n",
+ "\n",
+ "- For
Amazon SageMaker Studio, select Kernel \"
Python 3 (ipykernel)\".\n",
+ "\n",
+ "- For
Amazon SageMaker Studio Classic, select Image \"
Base Python 3.0\" and Kernel \"
Python 3\".\n",
+ "\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To run this notebook you would need to install the following dependencies:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "!pip install boto3==1.34.132 -qU --force --quiet --no-warn-conflicts\n",
+ "!pip install sagemaker==2.224.2 -qU --force --quiet --no-warn-conflicts"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "### Import libraries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml\n",
+ "sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml\n"
+ ]
+ }
+ ],
+ "source": [
+ "import boto3\n",
+ "import json\n",
+ "import sagemaker"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "2.224.2\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(sagemaker.__version__)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Initialize parameters"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "sagemaker role arn: arn:aws:iam::570598552974:role/txt2sql-SageMakerExecutionRole-PAgMr5TND4x0\n",
+ "sagemaker session region: us-east-1\n"
+ ]
+ }
+ ],
+ "source": [
+ "# execution role for the endpoint\n",
+ "role = sagemaker.get_execution_role()\n",
+ "\n",
+ "# sagemaker session for interacting with different AWS APIs\n",
+ "sess = sagemaker.session.Session()\n",
+ "\n",
+ "# Region\n",
+ "region_name = sess._region_name\n",
+ "\n",
+ "print(f\"sagemaker role arn: {role}\")\n",
+ "print(f\"sagemaker session region: {region_name}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Image URI of the DJL Container"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "LMI DLCs offer a low-code interface that simplifies using state-of-the-art inference optimization techniques and hardware. LMI allows you to apply tensor parallelism; the latest efficient attention, batching, quantization, and memory management techniques; token streaming; and much more, by just requiring the model ID and optional model parameters. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "See available Large Model Inference DLC's [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) and for more details [here](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/announcements/deepspeed-deprecation.md)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "DCL Image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124\n"
+ ]
+ }
+ ],
+ "source": [
+ "inference_image_uri = sagemaker.image_uris.retrieve(\n",
+ " framework=\"djl-lmi\",\n",
+ " region=region_name,\n",
+ " version=\"0.28.0\"\n",
+ ")\n",
+ "print(f\"DCL Image going to be used is ---- > {inference_image_uri}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Available Environment Variable Configurations"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Here is a list of settings that we use in this configuration file:\n",
+ "\n",
+ "- `HF_MODEL_ID`: The model id of a pretrained model hosted inside a model repository on [huggingface.co](https://huggingface.co/models). The container uses this model id to download the corresponding model repository on huggingface.co. This is an optional setting and is not needed in the scenario where you are brining your own model. If you are getting your own model, you can include the URI of the Amazon S3 bucket that contains the model.\n",
+ "- `HF_TOKEN`: Some models on the HuggingFace Hub are gated and require permission from the owner to access. To deploy a gated model from the HuggingFace Hub using LMI, you must provide an [Access Token](https://huggingface.co/docs/hub/security-tokens) via this environment variable.\n",
+ "- `OPTION_ENGINE`: The engine for DJL to use. In this case, we intend to use [vLLM](https://docs.vllm.ai/en/stable/) and hence set it as **Python**.\n",
+ "- `OPTION_DTYPE`: The data type you plan to cast the model weights to. If not provided, LMI will use fp16.\n",
+ "- `OPTION_TGI_COMPAT`: To get the same response output as HuggingFace's Text Generation Inference, you can use the env `OPTION_TGI_COMPAT=true`.\n",
+ "- `OPTION_TASK`: The task used in Hugging Face for different pipelines. Default is text-generation. For further reading on DJL parameters on SageMaker, follow the [link](https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/deepspeed_user_guide.html)\n",
+ "- `OPTION_ROLLING_BATCH`: Enables continuous batching (iteration level batching) with one of the supported backends. Available backends differ by container, see [Inference Library Configurations](https://docs.djl.ai/docs/serving/serving/docs/lmi/deployment_guide/configurations.html#inference-library-configuration) for mappings.\n",
+ " - In the LMI Container:\n",
+ " - to use vLLM, use `OPTION_ROLLING_BATCH=vllm`\n",
+ " - to use lmi-dist, use `OPTION_ROLLING_BATCH=lmi-dist`\n",
+ " - to use huggingface accelerate, use `OPTION_ROLLING_BATCH=auto` for text generation models, or option.rolling_batch=disable for non-text generation models.\n",
+ "- `TENSOR_PARALLEL_DEGREE`: Set to the number of GPU devices over which DeepSpeed needs to partition the model. This parameter also controls the no of workers per model which will be started up when DJL serving runs. Setting this to `max`, which will shard the model across all available GPUs. As an example if we have a 8 GPU machine and we are creating 8 partitions then we will have 1 worker per model to serve the requests.\n",
+ "- `OPTION_DEVICE_MAP`: The HuggingFace accelerate device_map to use.\n",
+ "- `OPTION_TRUST_REMOTE_CODE`: If the model artifacts contain custom modeling code, you should set this to true after validating the custom code is not malicious. If you are using a HuggingFace Hub model id, you should also specify HF_REVISION to ensure you are using artifacts and code that you have validated.\n",
+ "\n",
+ "For more details on the configuration options and an exhaustive list, you can refer the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html) and [LMI Starting Guide](https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/starting-guide.html)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Create SageMaker endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Here are some key differences between the available backends in LMI:\n",
+ "\n",
+ "+ **LMI-Distributed Library** – This is the AWS framework to run inference with LLMs, inspired from OSS, to achieve the best possible latency and accuracy on the result. LMI-Dist employs optimized default configurations, such as GPU core counting, to ensure efficient performance and resource utilization.\n",
+ "+ **LMI vLLM** – This is the AWS backend implementation of the memory-efficient vLLM inference library"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Hugging Face Model Id\n",
+ "model_id = \"meta-llama/Meta-Llama-3-8B-Instruct\"\n",
+ "\n",
+ "# Environment variables\n",
+ "hf_token = \"\" # Use for gated models\n",
+ "rolling_batch = \"vllm\" # \"vllm\", \"lmi-dist\"\n",
+ "\n",
+ "env = {}\n",
+ "env['HF_MODEL_ID'] = model_id\n",
+ "env['OPTION_ROLLING_BATCH'] = rolling_batch\n",
+ "env['OPTION_DTYPE'] = \"bf16\"\n",
+ "env['OPTION_TGI_COMPAT'] = \"true\"\n",
+ "\n",
+ "if rolling_batch != \"lmi-dist\":\n",
+ " env['OPTION_ENGINE'] = \"Python\"\n",
+ " env['OPTION_TASK'] = \"text-generation\"\n",
+ " env['TENSOR_PARALLEL_DEGREE'] = \"max\"\n",
+ " env['OPTION_DEVICE_MAP'] = \"auto\"\n",
+ " # env['OPTION_TRUST_REMOTE_CODE'] = \"true\"\n",
+ " \n",
+ "# Include HF token for gated models\n",
+ "if hf_token != \"\":\n",
+ " env['HF_TOKEN'] = hf_token\n",
+ "else:\n",
+ " print(\"Llama models are gated, please add your HF token before you continue.\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "instance_type: ml.g5.12xlarge\n",
+ "model_id: meta-llama/Meta-Llama-3-8B-Instruct\n",
+ "endpoint_name: codestral-22b-vllm-2024-07-05-21-59-38-687\n"
+ ]
+ }
+ ],
+ "source": [
+ "# SageMaker Instance Type\n",
+ "instance_type = \"ml.g5.12xlarge\"\n",
+ "\n",
+ "# Endpoint name\n",
+ "endpoint_name_prefix = \"llama3-8b-instruct-vllm\"\n",
+ "endpoint_name = sagemaker.utils.name_from_base(endpoint_name_prefix)\n",
+ "\n",
+ "print(f\"instance_type: {instance_type}\")\n",
+ "print(f\"model_id: {model_id}\")\n",
+ "print(f\"endpoint_name: {endpoint_name}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Deploy model to an endpoint\n",
+ "model = sagemaker.Model(\n",
+ " image_uri=inference_image_uri,\n",
+ " role=role,\n",
+ " env=env\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "-------------!"
+ ]
+ }
+ ],
+ "source": [
+ "model.deploy(\n",
+ " initial_instance_count=1,\n",
+ " instance_type=instance_type,\n",
+ " endpoint_name=endpoint_name,\n",
+ " container_startup_health_check_timeout=900,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Run inference and chat with the model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "### Supported Inference Parameters\n",
+ "\n",
+ "---\n",
+ "This model supports the following inference payload parameters:\n",
+ "\n",
+ "* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches max_new_tokens. If specified, it must be a positive integer.\n",
+ "* **temperature:** Controls the randomness in the output. Higher temperature results in output sequence with low-probability words and lower temperature results in output sequence with high-probability words. If `temperature` -> 0, it results in greedy decoding. If specified, it must be a positive float.\n",
+ "* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.\n",
+ "* **return_full_text:** If True, input text will be part of the output generated text. If specified, it must be boolean. The default value for it is False.\n",
+ "\n",
+ "You may specify any subset of the parameters mentioned above while invoking an endpoint. \n",
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Inference using SageMaker SDK"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Initialize sagemaker client with the endpoint created in the prior step\n",
+ "predictor = sagemaker.Predictor(\n",
+ " endpoint_name=endpoint_name,\n",
+ " sagemaker_session=sess,\n",
+ " serializer=sagemaker.serializers.JSONSerializer(),\n",
+ " deserializer=sagemaker.deserializers.JSONDeserializer(),\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Building a website can be a straightforward process if you break it down into smaller, manageable steps. Here are the 10 simple steps to build a website:\n",
+ "\n",
+ "**Step 1: Plan Your Website**\n",
+ "\n",
+ "* Define your website's purpose and target audience\n",
+ "* Identify your unique selling proposition (USP)\n",
+ "* Create a sitemap and wireframe to visualize your website's structure\n",
+ "\n",
+ "**Step 2: Choose a Domain Name**\n",
+ "\n",
+ "* Brainstorm a list of potential domain names\n",
+ "* Check the availability of your desired domain name using a registrar like GoDaddy or Namecheap\n",
+ "* Register your domain name and set up DNS settings\n",
+ "\n",
+ "**Step 3: Select a Web Hosting Service**\n",
+ "\n",
+ "* Research and compare different web hosting services (e.g., Bluehost, HostGator, SiteGround)\n",
+ "* Choose a hosting plan that meets your needs (e.g., shared hosting, VPS, dedicated hosting)\n",
+ "* Sign up for a hosting plan and set up your account\n",
+ "\n",
+ "**Step 4: Design Your Website**\n",
+ "\n",
+ "* Choose a website builder (e.g., WordPress, Wix, Squarespace) or hire a web designer\n",
+ "* Create a visually appealing design that aligns with your brand and target audience\n",
+ "* Ensure your design is responsive and mobile-friendly\n",
+ "\n",
+ "**Step 5: Build Your Website**\n",
+ "\n",
+ "* Use your chosen website builder or design tool to create your website\n",
+ "* Add content, images, and features as needed\n",
+ "* Customize your website's layout and design\n",
+ "\n",
+ "**Step 6: Add Content**\n",
+ "\n",
+ "* Create high-quality, engaging content (e.g., text, images, videos)\n",
+ "* Optimize your content for search engines (SEO)\n",
+ "* Add a blog or news section to keep your website fresh and up-to-date\n",
+ "\n",
+ "**Step 7: Set Up Navigation and Menus**\n",
+ "\n",
+ "* Create a clear and intuitive navigation menu\n",
+ "* Add links to important pages and sections\n",
+ "* Ensure your website is easy to navigate and user-friendly\n",
+ "\n",
+ "**Step 8: Add Interactivity**\n",
+ "\n",
+ "* Add forms, contact pages, and email marketing integrations\n",
+ "* Create a contact form or email address for user feedback\n",
+ "* Integrate social media links and feeds\n",
+ "\n",
+ "**Step 9: Test and Launch**\n",
+ "\n",
+ "* Test your website for functionality and usability\n",
+ "* Check for broken links, errors, and compatibility issues\n",
+ "* Launch your website and make it live for the public\n",
+ "\n",
+ "**Step 10: Maintain and Update**\n",
+ "\n",
+ "* Regularly update your website with fresh content and features\n",
+ "* Monitor analytics and user feedback to improve your website\n",
+ "* Ensure your website remains secure and up-to-date with the latest software and security patches\n",
+ "\n",
+ "By following these 10 simple steps, you can build a professional-looking website that effectively represents your brand and engages your target audience.\n"
+ ]
+ }
+ ],
+ "source": [
+ "prompt = \"\"\"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n",
+ "{system_prompt}\n",
+ "<|eot_id|><|start_header_id|>user<|end_header_id|>\n",
+ "{message_prompt}\n",
+ "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\"\"\".format(\n",
+ " system_prompt=\"You are a helpful assistant.\",\n",
+ " message_prompt=\"Building a website can be done in 10 simple steps:\"\n",
+ ")\n",
+ "\n",
+ "inputs = {\n",
+ " \"inputs\": prompt,\n",
+ " \"parameters\": {\n",
+ " \"temperature\": 0.8,\n",
+ " \"top_p\": 0.95,\n",
+ " \"max_new_tokens\": 4000,\n",
+ " \"do_sample\": False\n",
+ " }\n",
+ "}\n",
+ "response = predictor.predict(inputs)\n",
+ "print(response[0]['generated_text'].strip())"
+ ]
+ },
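+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `return_full_text` parameter listed above is not used in the request payloads in this notebook. As a minimal sketch, the cell below reuses the `prompt` and `predictor` defined earlier to echo the prompt back together with the completion; the variable names are illustrative."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Sketch: request the prompt to be returned along with the generated text\n",
+ "inputs_full = {\n",
+ " \"inputs\": prompt,\n",
+ " \"parameters\": {\n",
+ " \"max_new_tokens\": 128,\n",
+ " \"return_full_text\": True\n",
+ " }\n",
+ "}\n",
+ "response_full = predictor.predict(inputs_full)\n",
+ "print(response_full[0]['generated_text'])"
+ ]
+ },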
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Inference using Boto3 SDK"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Initialize sagemaker client with boto3 using the endpoint created from prior step\n",
+ "smr_client = boto3.client(\"sagemaker-runtime\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "\n",
+ "Let's break it down step by step!\n",
+ "\n",
+ "You bought an ice cream for 6 kids, and each cone was $1.25. To find the total cost, you multiply the number of cones by the cost per cone:\n",
+ "\n",
+ "6 cones x $1.25 per cone = $7.50\n",
+ "\n",
+ "You paid with a $10 bill, so to find out how much change you got back, you subtract the total cost from the $10 bill:\n",
+ "\n",
+ "$10 (bill) - $7.50 (cost) = $2.50\n",
+ "\n",
+ "So, you got $2.50 in change!\n"
+ ]
+ }
+ ],
+ "source": [
+ "prompt = \"\"\"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n",
+ "{system_prompt}\n",
+ "<|eot_id|><|start_header_id|>user<|end_header_id|>\n",
+ "{message_prompt}\n",
+ "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\"\"\".format(\n",
+ " system_prompt=\"You are a helpful assistant.\",\n",
+ " message_prompt=\"\"\"I bought an ice cream for 6 kids. Each cone was $1.25 and I paid with a $10 bill. \n",
+ " How many dollars did I get back? Explain first before answering.\"\"\"\n",
+ ")\n",
+ "\n",
+ "response = smr_client.invoke_endpoint(\n",
+ " EndpointName=endpoint_name,\n",
+ " Body=json.dumps(\n",
+ " {\n",
+ " \"inputs\": prompt,\n",
+ " \"parameters\": {\n",
+ " \"temperature\": 0.8,\n",
+ " \"top_p\": 0.95,\n",
+ " \"max_new_tokens\": 4000,\n",
+ " \"do_sample\": False\n",
+ " },\n",
+ " }\n",
+ " ),\n",
+ " ContentType=\"application/json\",\n",
+ ")[\"Body\"].read().decode(\"utf8\")\n",
+ "\n",
+ "print(json.loads(response)[0]['generated_text'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Conclusion\n",
+ "In this post, we demonstrated how to use SageMaker large model inference containers to host Codestral 22B. We used DeepSpeed’s model parallel techniques with multiple GPUs on a single SageMaker machine learning instance. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Clean Up"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Delete the endpoint\n",
+ "sess.delete_endpoint(endpoint_name)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# In case the end point failed we still want to delete the model\n",
+ "sess.delete_endpoint_config(endpoint_name)\n",
+ "model.delete_model()"
+ ]
+ },
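+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If the notebook kernel was restarted and the `sess` and `model` objects are no longer available, you can still remove the resources directly with the `boto3` SageMaker client. The cell below is a sketch; it assumes the endpoint configuration was created with the same name as the endpoint, as in the deployment above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Fallback clean up with boto3 in case the SageMaker session objects are gone\n",
+ "sm_client = boto3.client(\"sagemaker\")\n",
+ "sm_client.delete_endpoint(EndpointName=endpoint_name)\n",
+ "sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)"
+ ]
+ },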
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "conda_pytorch_p310",
+ "language": "python",
+ "name": "conda_pytorch_p310"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.14"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/text2sql-recipes/docs/Text2SQLArchitecture.jpg b/text2sql-recipes/docs/Text2SQLArchitecture.jpg
new file mode 100644
index 0000000..2cbeb8a
Binary files /dev/null and b/text2sql-recipes/docs/Text2SQLArchitecture.jpg differ
diff --git a/text2sql-recipes/llama3-chromadb-text2sql.ipynb b/text2sql-recipes/llama3-chromadb-text2sql.ipynb
index e479505..48c8b31 100644
--- a/text2sql-recipes/llama3-chromadb-text2sql.ipynb
+++ b/text2sql-recipes/llama3-chromadb-text2sql.ipynb
@@ -111,6 +111,16 @@
"---"
]
},
+ {
+ "cell_type": "markdown",
+ "id": "cda8416f",
+ "metadata": {},
+ "source": [
+ "## Architecture\n",
+ "\n",
+ "![Text2SQLArchitecture](docs/Text2SQLArchitecture.jpg)"
+ ]
+ },
{
"cell_type": "markdown",
"id": "ab3b33e6",