[JS] "Context was not provided" when using eval dataset which includes context #1603

Open
jacobsimionato opened this issue Jan 14, 2025 · 5 comments
Labels: bug, js

@jacobsimionato

Describe the bug
I'm trying to use the Faithfulness evaluator, but I'm getting the error "Context was not provided"

To Reproduce

Commands:

genkit start -- tsx --watch src/index.ts
genkit eval:flow mainFlow --input eval.json

eval.json:

{
  "samples": [
    {
      "input": {
        "character1": "John",
        "character2": "James"
      },
      "context": [
        "The output of the flow should be 1 paragraph story about John and James"
      ]
    }
  ]
}
src/index.ts
import { genkit, z } from "genkit";
import { gemini15Flash, googleAI } from "@genkit-ai/googleai";
import { genkitEval, GenkitMetric } from '@genkit-ai/evaluator';

export const ai = genkit({
  promptDir: "./prompts",
  plugins: [
    googleAI(),
    genkitEval({
      judge: gemini15Flash,
      metrics: [GenkitMetric.FAITHFULNESS],
    }),
  ],
});

export const storyInputSchema = z.object({
  character1: z.string().describe("The name of the first character."),
  character2: z.string().describe("The name of the second character."),
});

export const storyOutputSchema = z.object({
  story: z.string().describe("The generated story about the two characters."),
});

ai.defineSchema("storyInputSchema", storyInputSchema);
ai.defineSchema("storyOutputSchema", storyOutputSchema);

export const storyFlow = ai.defineFlow(
  {
    name: "storyFlow",
    inputSchema: storyInputSchema,
    outputSchema: storyOutputSchema,
  },
  async (input): Promise<z.infer<typeof storyOutputSchema>> => {
    const { output: result } = await ai.prompt("story")(input);
    return result;
  }
);

export const mainFlow = ai.defineFlow({
  name: "mainFlow",
  inputSchema: storyInputSchema,
  outputSchema: storyOutputSchema,
}, async (input) => {
  const storyOutput = await storyFlow(input);
  return storyOutput;
});

const generatedFlows = [
  mainFlow,
  storyFlow,
];

ai.startFlowServer({
  flows: generatedFlows,
  port: 3422,
});

story.prompt
---
model: googleai/gemini-2.0-flash-exp
config:
  responseMimeType: application/json
input:
  schema: storyInputSchema
output:
  schema: storyOutputSchema
---
{{ role "system" }}
Write a one-paragraph story about two characters whose names are given by the user.
{{ role "user" }}
{{ character1 }} and {{ character2 }}

Actual behavior

Genkit faithfulness evaluation failed with error Error: Context was not provided for sample {"testCaseId":"a628b690-70d7-4e5a-a681-c8dce290433f","input":"{\"character1\":\"John\",\"character2\":\"James\"}","output":"{\"story\":\"John and James were the best of friends, despite their very different personalities. John, the meticulous planner, always had a schedule, while James, the free spirit, preferred to go with the flow. One sunny afternoon, they decided to build a treehouse. John had drawn up detailed blueprints, complete with material lists and safety regulations. James, on the other hand, was already scrambling up the tree, hammer in hand. Their contrasting styles led to some hilarious mishaps, but in the end, they had a sturdy and unique treehouse that was perfect for both of them, a testament to their friendship.\"}","context":[],"traceIds":["0c25fd99f20197a82b7e03aa02d599b7"]}
Evaluation of test case a628b690-70d7-4e5a-a681-c8dce290433f failed:
Error: Context was not provided
    at <anonymous> (/private/var/folders/25/t2pb4x6s2899xy742nk0r2wm00b313/T/inner-flow-eval-HWkevZ/node_modules/@genkit-ai/evaluator/src/metrics/faithfulness.ts:51:13)
    at Generator.next (<anonymous>)
    at /private/var/folders/25/t2pb4x6s2899xy742nk0r2wm00b313/T/inner-flow-eval-HWkevZ/node_modules/@genkit-ai/evaluator/lib/metrics/faithfulness.js:46:61
    at new Promise (<anonymous>)
    at __async (/private/var/folders/25/t2pb4x6s2899xy742nk0r2wm00b313/T/inner-flow-eval-HWkevZ/node_modules/@genkit-ai/evaluator/lib/metrics/faithfulness.js:30:10)
    at faithfulnessScore (/private/var/folders/25/t2pb4x6s2899xy742nk0r2wm00b313/T/inner-flow-eval-HWkevZ/node_modules/@genkit-ai/evaluator/src/metrics/faithfulness.ts:47:19)
    at <anonymous> (/private/var/folders/25/t2pb4x6s2899xy742nk0r2wm00b313/T/inner-flow-eval-HWkevZ/node_modules/@genkit-ai/evaluator/src/index.ts:135:40)
    at Generator.next (<anonymous>)
    at /private/var/folders/25/t2pb4x6s2899xy742nk0r2wm00b313/T/inner-flow-eval-HWkevZ/node_modules/@genkit-ai/evaluator/lib/index.js:36:61
    at new Promise (<anonymous>)

Expected behavior
Expected the eval to run successfully.

Runtime (please complete the following information):

  • OS: macOS
  • Genkit version: 0.9.12
  • Node version: v22.11.0

jacobsimionato added the bug and js labels on Jan 14, 2025
@jacobsimionato (Author)

I understand that eval:flow does not allow context to be specified in the inputs, but I don't think this is the issue, because I'm also seeing the error reproduced with the following JSON input:

[
    {
      "character1": "John",
      "character2": "James"
    }
]

@chrisraygill (Contributor)

@ssbushi please confirm this is a bug and, if so, mark it as P0 to fix for GA.

@ssbushi (Contributor) commented Jan 16, 2025

This is not a bug.

Context is extracted automatically. Right now, only the outputs of retriever spans are extracted as context, so if your flow does not use a retriever, there is no context to extract.

You have the option to use custom extractors (https://firebase.google.com/docs/genkit/evaluation#custom_extractors), but IIUC this functionality might not work in some JS module systems.
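
For illustration, a custom extractor lives in the project's genkit-tools.conf.js and tells the eval runner which trace step each field should be pulled from. A rough sketch, assuming the config shape from that doc (the flow and step names are placeholders and the exact keys may differ by Genkit version):

// genkit-tools.conf.js -- rough sketch based on the custom extractors doc.
module.exports = {
  evaluators: [
    {
      // The flow whose eval traces should use custom extraction.
      actionRef: "/flow/mainFlow",
      extractors: {
        // Pull "context" from the output of a named step in the trace,
        // instead of from a retriever span. 'contextStep' is a placeholder.
        context: { outputOf: "contextStep" },
      },
    },
  ],
};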

Based on the example above, you might want to use reference instead, since you are providing a description of what the output should look like. Faithfulness will not work here; you will have to define your own custom evaluator to evaluate the flow's behavior.
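
For example, a reference-based sample might look roughly like this (assuming the samples file accepts a reference field in this version; an evaluator still has to actually consume reference for it to be scored):

{
  "samples": [
    {
      "input": {
        "character1": "John",
        "character2": "James"
      },
      "reference": "A one-paragraph story featuring both John and James."
    }
  ]
}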

ssbushi assigned jacobsimionato and unassigned ssbushi on Jan 16, 2025
@jacobsimionato (Author)

Ahhh I see. It sounds like evals are really designed to work specifically for RAG flows, because then they have retrieved information which can be used to ground both the output and the evaluation process.

Re "reference" - that makes sense, but I'd love to use a built-in evaluator rather than going through the pain of setting up a custom evaluator. I've already done the custom thing for my outer flow, but I'd like inner flows to have evals too, ideally using built-in Genkit infra.

I think it would be really helpful to have better evaluation support for flows without retrievers because:

  • When developers start to use Genkit for simple flows, they should be able to set up a simple eval even if they aren't using RAG yet. It should be possible to set up a trivial flow with one LLM call and evaluate it using a built-in evaluator. Then developers can add RAG, tools, custom evaluators, and custom extractors as necessary. Ideally there is a smooth ramp from simple to complex without any sudden jumps in complexity.
  • There are valid complex Genkit flows that don't use retrievers (see internal doc go/capra-agentive-app-model for some examples) and it should be possible to evaluate these too.
  • Eval data points with an explicitly specified reference (e.g. expected output) are likely to be more robust and less error-prone than ones that rely purely on retrieval / context. What if the retrieval fails and the LLM hallucinates an answer? Or what if the inference LLM and "judge" LLM both misinterpret the retrieved data in the same way and return a high score when it should be a low one? Maybe it would be good to encourage developers to use 'reference' even for evaluating RAG-based flows. It's a little more work to write the data points, but probably worthwhile?

Maybe part of the issue here is that there isn't much documentation on setting up custom evaluators without making them a plugin. I guess it's not too hard to do, but it doesn't seem like a well-lit path and I found it a bit confusing (e.g. when/where should I register my evaluator? How do I pass structured data in reference? How do I report results?).

@ssbushi (Contributor) commented Jan 21, 2025

@jacobsimionato, thank you very much for the feedback. I will convert our conversation into issues in this repo and track progress.

I want to clarify that evals (as a feature) are not specifically made for RAG. I understand why you would come to this conclusion -- the built-in evaluators are all RAG evaluators, and some of them even require "context" to be passed in while ignoring "reference". This definitely sounds RAG-focused (and they are!), but these evaluators are not intended to be the only ones you use. They were implemented as a starting point for Genkit evals for our initial launch, with the expectation that users would install 1P or 3P plugins that provide evaluators if genkitEval does not fit the bill. For example, the Vertex AI and Checks plugins have evaluators, and here is a 3P one: https://github.com/yukinagae/genkitx-promptfoo

Like you mentioned, you could also define custom evaluators. They don't have to be registered with plugins. I can see why this is confusing and that an obvious gap exists. I am working on a proposal that lets users use custom extractors again, so that they can use "context" for evaluation, without retrievers.
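
As an illustration only, a plugin-free evaluator that scores a datapoint against its reference could be sketched roughly like this (the defineEvaluator options and datapoint field names follow the custom-evaluators doc and may differ slightly between Genkit versions; the substring check is just a stand-in for real scoring logic):

// Rough sketch of a custom evaluator defined directly on the `ai` instance,
// without packaging it as a plugin.
export const referenceMatch = ai.defineEvaluator(
  {
    name: "custom/referenceMatch",
    displayName: "Reference Match",
    definition: "Checks whether the flow output contains the reference text.",
  },
  async (datapoint) => {
    // Normalize the output to a string; for the story flow it is an object.
    const output = JSON.stringify(datapoint.output ?? "");
    const reference = String(datapoint.reference ?? "");
    const pass = reference.length > 0 && output.includes(reference);
    return {
      testCaseId: datapoint.testCaseId,
      evaluation: {
        score: pass ? 1 : 0,
        details: {
          reasoning: pass
            ? "Reference text found in output."
            : "Reference text not found in output.",
        },
      },
    };
  }
);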

Action: Provide more evaluators out of the box that are not specific to RAG (e.g., exactMatch, criteria-based, etc.). Some mix of evaluators that use "reference" and "output" would be ideal.

> Eval data points with an explicitly specified reference (e.g. expected output) are likely to be more robust and less error-prone than ones that rely purely on retrieval / context. What if the retrieval fails and the LLM hallucinates an answer? Or what if the inference LLM and "judge" LLM both misinterpret the retrieved data in the same way and return a high score when it should be a low one? Maybe it would be good to encourage developers to use 'reference' even for evaluating RAG-based flows. It's a little more work to write the data points, but probably worthwhile?

reference is important, and that's why we include it in our schema. It is not part of the calculations for "faithfulness" or "answer relevance", which is why it appears underutilized in the genkitEval plugin. The above action item should address this regardless.

> Maybe part of the issue here is that there isn't much documentation on setting up custom evaluators without making them a plugin. I guess it's not too hard to do, but it doesn't seem like a well-lit path and I found it a bit confusing (e.g. when/where should I register my evaluator? How do I pass structured data in reference? How do I report results?).

Yes, the documentation is also unclear and in some places outdated. Unfortunately, the DevSite and GitHub MD files are not always in sync, so even if we update the documentation on GitHub, it stays stale on DevSite for a while until it is republished. We are working on improving this workflow. I will turn this into an action item to ensure that the documentation accurately represents the scope of genkitEval and clears up some of the confusion.

Action: Review documentation; ensure that evals are not portrayed as RAG-specific or as focused only on the genkitEval plugin. Simplify custom evaluator documentation.

Project status: Planned