RAG evaluation framework #29

Open · wants to merge 7 commits into base: main
33 changes: 33 additions & 0 deletions responsible-ai-recipes/README.md
@@ -0,0 +1,33 @@
# Responsible AI

This repository contains examples to help customers get started with **Responsible AI** using [Amazon Bedrock Guardrails](https://aws.amazon.com/bedrock/guardrails/) and other open source tools with [Meta Llama on Amazon Bedrock](https://aws.amazon.com/bedrock/llama/).

## Background

**Responsible AI** refers to the practice of developing and deploying artificial intelligence systems in an ethical, transparent, and accountable manner, ensuring they align with societal values and minimize potential harm. In the context of generative AI, which involves models that can generate human-like text, images, or other content, responsible AI practices are crucial due to the significant risks and challenges associated with these powerful technologies.

**Evaluation** plays a vital role in responsible generative AI because it allows for the identification and mitigation of potential issues before deployment. Comprehensive evaluation frameworks, such as DeepEval and RAGAS, enable organizations to assess their generative models for risks such as generating biased, toxic, or harmful content, hallucinating false information, or exhibiting other undesirable behaviors. By thoroughly evaluating models across various dimensions, including safety, factual accuracy, robustness, and fairness, organizations can build trust and confidence in their generative AI systems. Responsible evaluation also promotes transparency, as it provides insights into the strengths, limitations, and potential failure modes of these models, enabling informed decision-making and responsible deployment strategies.

For more details on Responsible AI on AWS, please see [Responsible AI at AWS](https://aws.amazon.com/ai/responsible-ai/) and the [generative AI scoping matrix](https://aws.amazon.com/ai/generative-ai/security/scoping-matrix/).

## Prerequisites

- First, please ensure that you have access to the foundation models on Amazon Bedrock. You can follow this [documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html) for a detailed walkthrough.

- Throughout this repository, we will use Amazon's shareholder letters as our data source and [LangChain's Chroma integration](https://python.langchain.com/docs/integrations/vectorstores/chroma/) as our vector database. Please execute the prerequisite notebook before exploring each evaluation framework; a rough sketch of that setup follows this list.
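
The sketch below is a minimal illustration of the prerequisite setup (not the notebook itself): it chunks a locally downloaded shareholder letter, embeds it with an Amazon Bedrock embedding model, and persists the result in a local Chroma store. The file path, model ID, and directory names are placeholders.

```python
# Minimal sketch of the prerequisite setup; paths and model IDs are illustrative.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_aws import BedrockEmbeddings
from langchain_chroma import Chroma

# Load and chunk the source document (placeholder path)
docs = PyPDFLoader("data/amazon-shareholder-letter-2022.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Embed the chunks with a Bedrock embedding model and persist them locally
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0", region_name="us-east-1")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# Retrieve context for a sample question
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
print(retriever.invoke("How did AWS perform last year?")[0].page_content[:200])
```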


## Contents
- [**Amazon Bedrock Guardrails**](./contextual-grounding-check/) - Amazon Bedrock Guardrails support **contextual grounding check**, which can be used to detect and filter hallucinations in model responses when a reference source and user query are provided. This is done by checking the relevance of each chunk processed from the RAG application: if any one chunk is deemed relevant, the whole response is considered relevant, as it has the answer to the user query. Please refer to the [documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails-contextual-grounding-check.html) for more details.
- [**DeepEval**](./deep-eval/) - DeepEval is an open-source, comprehensive evaluation framework by [**ConfidentAI**](https://docs.confident-ai.com/). It is designed to assess the safety, reliability, and performance of large language models (LLMs) and [Retrieval Augmented Generation (RAG) systems](https://aws.amazon.com/what-is/retrieval-augmented-generation/). The key benefit of DeepEval is that it enables organizations to thoroughly evaluate their generative AI models before deployment, ensuring they meet the necessary standards for responsible and trustworthy AI. By identifying and mitigating potential issues early on, DeepEval helps build confidence in the safe and ethical use of generative AI technologies.
- [**LlamaIndex Evaluation**](./llama-index-evaluation/) - [LlamaIndex Evaluation](https://docs.llamaindex.ai/en/stable/optimizing/evaluation/evaluation/) is a framework within the LlamaIndex library that allows developers to assess and compare the performance of various components in their RAG application. It provides tools to evaluate query engines, retrievers, and other elements against predefined metrics such as relevance, coherence, and factual accuracy. The benefit of LlamaIndex Evaluation is that it enables developers to fine-tune their systems, identify areas for improvement, and ensure the reliability and effectiveness of their AI applications.
- [**RAGAS**](./ragas/) - RAGAS is a framework specifically designed for evaluating the performance of generative AI models in tasks that involve retrieving and synthesizing information from external sources. It addresses the need for automated evaluation metrics that can accurately assess the quality of generated outputs when models have access to external knowledge sources. Please refer to the [RAGAS documentation](https://docs.ragas.io/en/stable/) for more details; a minimal usage sketch follows this list.
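
As a rough illustration of what such an evaluation can look like (not this repository's notebook code), the sketch below scores a single RAG sample with RAGAS, using Meta Llama on Amazon Bedrock as the judge model. The model IDs, column names, and the sample record are illustrative and version-dependent.

```python
# Minimal RAGAS sketch; model IDs, columns, and the sample record are illustrative.
from datasets import Dataset
from langchain_aws import ChatBedrock, BedrockEmbeddings
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One hand-written RAG sample: question, generated answer, and retrieved contexts
sample = Dataset.from_dict({
    "question": ["How did AWS revenue change in 2022?"],
    "answer": ["AWS revenue grew 29% year over year."],
    "contexts": [["In 2022, AWS revenue grew 29% year over year to $80B."]],
})

# Use a Bedrock-hosted Llama model as the judge and a Bedrock embedding model
judge = ChatBedrock(model_id="meta.llama3-70b-instruct-v1:0", region_name="us-east-1")
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0", region_name="us-east-1")

result = evaluate(sample, metrics=[faithfulness, answer_relevancy],
                  llm=judge, embeddings=embeddings)
print(result)  # e.g. scores per metric between 0 and 1
```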


## Security

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.

## License

This library is licensed under the MIT-0 License. See the LICENSE file.
30 changes: 30 additions & 0 deletions responsible-ai-recipes/_eval_data/eval_dataframe.csv

Large diffs are not rendered by default.

Binary file not shown.
Binary file not shown.
Binary file not shown.
34 changes: 34 additions & 0 deletions responsible-ai-recipes/contextual-grounding-check/README.md
@@ -0,0 +1,34 @@
# Amazon Bedrock Guardrails

## Introduction

[**Amazon Bedrock Guardrails**](https://aws.amazon.com/bedrock/guardrails/) enables you to implement safeguards for your generative AI applications based on your responsible AI policies. You can tailor and create multiple guardrails for different use cases and apply them across multiple foundation models (FMs), providing a consistent user experience and standardizing safety and privacy controls across applications.

You can configure the following policies in a guardrail to avoid undesirable and harmful content and remove sensitive information for privacy protection.
- **Content filters** – Adjust filter strengths to block input prompts or model responses containing harmful content.
- **Denied topics** – Define a set of topics that are undesirable in the context of your application. These topics will be blocked if detected in user queries or model responses.
- **Word filters** – Configure filters to block undesirable words, phrases, and profanity. Such words can include offensive terms, competitor names etc.
- **Sensitive information filters** – Block or mask sensitive information such as personally identifiable information (PII) or custom regex in user inputs and model responses.
- **Contextual grounding check** – Detect and filter hallucinations in model responses based on grounding in a source and relevance to the user query. A configuration sketch follows this list.
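
As a rough sketch of how the contextual grounding check can be configured, the snippet below creates a guardrail with grounding and relevance filters via the `CreateGuardrail` API. The guardrail name, thresholds, and blocked messages are illustrative, not values used by this repository.

```python
# Minimal sketch, assuming permissions to create guardrails; all values are placeholders.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

guardrail = bedrock.create_guardrail(
    name="rag-grounding-guardrail",  # placeholder name
    description="Blocks ungrounded or irrelevant RAG answers",
    contextualGroundingPolicyConfig={
        "filtersConfig": [
            {"type": "GROUNDING", "threshold": 0.75},   # factual consistency with the source
            {"type": "RELEVANCE", "threshold": 0.75},   # relevance to the user query
        ]
    },
    blockedInputMessaging="Sorry, I cannot answer that question.",
    blockedOutputsMessaging="Sorry, the response could not be grounded in the provided source.",
)
print(guardrail["guardrailId"], guardrail["version"])
```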

## How it works

- You can use Amazon Bedrock Guardrails as a parameter to both the `InvokeModel` and `Converse` APIs from the **boto3** client. There are also native integrations with **Agents for Amazon Bedrock** and **Knowledge Bases for Amazon Bedrock**.
- Alternatively, when you use a third-party or self-hosted model, you can call the standalone `ApplyGuardrail` API to evaluate both input prompts and model responses; see the sketch after this list.
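
The following is a minimal sketch of the standalone path, assuming a guardrail with the contextual grounding check already exists. The guardrail ID, version, and texts are placeholders.

```python
# Minimal sketch of the standalone ApplyGuardrail path; IDs and texts are placeholders.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="YOUR_GUARDRAIL_ID",   # placeholder
    guardrailVersion="1",                      # placeholder
    source="OUTPUT",  # evaluate a model response rather than a user input
    content=[
        {"text": {"text": "In 2022, AWS revenue grew 29% year over year.", "qualifiers": ["grounding_source"]}},
        {"text": {"text": "How did AWS revenue change in 2022?", "qualifiers": ["query"]}},
        {"text": {"text": "AWS revenue grew 29% year over year in 2022.", "qualifiers": ["guard_content"]}},
    ],
)

# 'GUARDRAIL_INTERVENED' means the guarded content was filtered as ungrounded
# or irrelevant; per-filter scores are returned in the assessments.
print(response["action"])
for assessment in response.get("assessments", []):
    for f in assessment.get("contextualGroundingPolicy", {}).get("filters", []):
        print(f["type"], f["score"], f["threshold"], f["action"])
```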

## Objectives

In this repository, we will showcase how to utilize the **contextual grounding check** policy from Amazon Bedrock Guardrails to detect hallucinations in model responses, specifically in [**RAG** (Retrieval Augmented Generation)](https://aws.amazon.com/what-is/retrieval-augmented-generation/) applications.

**Contextual grounding check** is a capability provided by Amazon Bedrock Guardrails that helps detect and filter out hallucinations in model responses. It evaluates responses based on two criteria:

1. **Grounding**, whether the response is factually accurate and based on the provided reference source, and
2. **Relevance**, whether the response answers the user's query.

This is important for applications like question-answering, summarization, and conversational AI that rely on **external sources** of information. Without grounding and relevance checks, the model could produce responses that are factually incorrect or completely irrelevant to the query, which would diminish the user experience.
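
For the integrated path, the hedged sketch below passes the retrieved context and the user question to the `Converse` API as guarded content so that the grounding and relevance criteria can be scored inline. The model ID, guardrail ID/version, and texts are placeholders.

```python
# Minimal sketch, assuming a guardrail with contextual grounding already exists;
# model ID, guardrail ID/version, and texts are placeholders.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

context = "In 2022, AWS revenue grew 29% year over year to $80B."
question = "How did AWS revenue change in 2022?"

response = client.converse(
    modelId="meta.llama3-70b-instruct-v1:0",  # placeholder Bedrock model ID
    messages=[{
        "role": "user",
        "content": [
            # Mark the retrieved passage as the grounding source and the user
            # question as the query so the guardrail can score both criteria.
            {"guardContent": {"text": {"text": context, "qualifiers": ["grounding_source"]}}},
            {"guardContent": {"text": {"text": question, "qualifiers": ["query"]}}},
        ],
    }],
    guardrailConfig={
        "guardrailIdentifier": "YOUR_GUARDRAIL_ID",  # placeholder
        "guardrailVersion": "1",                     # placeholder
        "trace": "enabled",                          # include assessment details in the response
    },
)

print(response["output"]["message"]["content"][0]["text"])
# With trace enabled, response.get("trace", {}).get("guardrail") typically
# carries the per-filter grounding and relevance assessments.
```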


## Pricing

Please refer to the Amazon Bedrock [pricing page](https://aws.amazon.com/bedrock/pricing/) for detailed pricing.
