Commit
1 parent abbbf2a, commit 3598d9f
Showing 1 changed file with 40 additions and 0 deletions.
@@ -0,0 +1,40 @@
**Table of Contents**
1. [Agent Evaluation](#agent-evaluation)
2. [RAG Evaluation](#rag-evaluation)
# Agent Evaluation

**TODO**

# RAG Evaluation
## Introduction

Our objective is to monitor and improve the RAG pipeline for **AI-OPS**, which requires context-specific data from the
*Cybersecurity* and *Penetration Testing* fields. We also want the evaluation process to be as automated as possible.

The evaluation workflow is split into two steps:

1. **Dataset Generation** ([dataset_generation.ipynb](./test/benchmarks/rag/dataset_generation.ipynb)):
uses Ollama and the data ingested into Qdrant (the RAG vector database) to generate *question* and *ground truth*
pairs (the Q&A dataset).

2. **Evaluation** ([evaluation.py](./test/benchmarks/rag/evaluation.py)):
builds the RAG pipeline with the same data used to generate the synthetic Q&A dataset, uses the pipeline to produce
an *answer* to each question (given the retrieved *context*), then evaluates the full dataset using an LLM as a
judge. For performance reasons, the evaluation runs on the HuggingFace Inference API.
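
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `dataset_generation.ipynb` or `evaluation.py`: the `QAItem` type and the `generate_dataset` and `evaluate` functions are hypothetical names, and the Ollama, Qdrant, and HuggingFace calls are replaced by plain callables so the flow runs without any external service.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class QAItem:
    # Hypothetical record for one row of the synthetic Q&A dataset.
    question: str
    ground_truth: str
    contexts: List[str] = field(default_factory=list)
    answer: str = ""


def generate_dataset(
    chunks: List[str],
    generate_qa: Callable[[str], Tuple[str, str]],
) -> List[QAItem]:
    """Step 1: build one question/ground-truth pair per ingested chunk."""
    items = []
    for chunk in chunks:
        question, ground_truth = generate_qa(chunk)
        items.append(QAItem(question=question, ground_truth=ground_truth))
    return items


def evaluate(
    items: List[QAItem],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, List[str]], str],
    judge: Callable[[QAItem], float],
) -> float:
    """Step 2: retrieve context, answer, and average the judge's scores."""
    scores = []
    for item in items:
        item.contexts = retrieve(item.question)
        item.answer = answer(item.question, item.contexts)
        scores.append(judge(item))
    return sum(scores) / len(scores) if scores else 0.0


# Toy stand-ins (assumptions) for the LLM generator, retriever, and judge.
chunks = ["nmap scans ports", "sqlmap automates SQL injection"]


def toy_generate_qa(chunk: str) -> Tuple[str, str]:
    tool = chunk.split()[0]
    return f"What does {tool} do?", chunk


def toy_retrieve(question: str) -> List[str]:
    words = question.split()
    return [c for c in chunks if c.split()[0] in words]


def toy_answer(question: str, contexts: List[str]) -> str:
    return contexts[0] if contexts else "unknown"


def toy_judge(item: QAItem) -> float:
    return 1.0 if item.answer == item.ground_truth else 0.0


dataset = generate_dataset(chunks, toy_generate_qa)
mean_score = evaluate(dataset, toy_retrieve, toy_answer, toy_judge)
```

Passing the generator, retriever, and judge as callables keeps the workflow independent of the backend, so the same loop could in principle be driven by a local model or a remote inference endpoint.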
## Results

### Context Precision

**TODO:** *describe the metric and the prompts used*


### Context Recall

**TODO:** *describe the metric and the prompts used*

