The "Value" type in score is overly restrictive for dictonaries #1333

max-kaufmann · 2025-02-17T00:22:18Z

The "Value" type in inspect is defined as:

inspect_ai/src/inspect_ai/scorer/_metric.py

Lines 43 to 47 in a31fcc6

    
           Value = Union[ 
        
               str | int | float | bool, 
        
               list[str | int | float | bool], 
        
               dict[str, str | int | float | bool | None], 
        
           ]

Presumably with the intention to allow something like dict[str,float] etc to be passed as a value. However, the type of dictonary values types not covariant, hence dict[str,float] is not valid instance of dict[str, str | int | float | bool | None] . To get things to type you have do do awkward things like this to make things type:

        # Calculate scores for each rubric
        scores_by_rubric: dict[str, Union[str, int, float, bool, None]] = {} # Inspect typing issue
        for rubric in rubrics:
            # TODO This is where the grader model would be used
            score = 0.0  # Placeholder score
            scores_by_rubric[rubric.title] = score
            
        return Score(
            value=scores_by_rubric,  # Use dictionary of scores as value
            answer=model_response,
            metadata={
                "rubrics": rubrics,
                "rationale": "Placeholder rationale"
            }
        )

The text was updated successfully, but these errors were encountered:

max-kaufmann · 2025-02-17T00:27:42Z

More generally, things crash when you pass a ditctonary like that I think! (dictonaries being returned by scorers doesn't actually seem supported in inspect, things need to be jammed in metadata)

dragonstyle · 2025-02-19T14:48:29Z

Type Issue

I don't see an easy way around this, though @jjallaire may be able to point me at something.

Dictionaries crashing / not supported.

This is surprising and in my local testing they are well supported. Here is a sample scorer that I'm running without any issues (that I can see):

@scorer(metrics={"*": [accuracy(), stderr()]})
def dict_scorer():

    async def score(state: TaskState, target: Target):

        # check for correct
        answer = state.output.completion

        # Calculate scores for each rubric
        rubrics = ["rube-1", "rube-2", "rube-3", "rube-4"]
        scores_by_rubric: dict[str, Union[str, int, float, bool, None]] = {}
        for rubric in rubrics:
            score = 0.0  # Placeholder score
            scores_by_rubric[rubric] = score
                    

        # return score
        return Score(
            value = scores_by_rubric,
            answer=answer,
            metadata={
                "rubrics": rubrics,
                "rationale": "Placeholder rationale"
            }
        )

    return score

There may be some difference in our approaches that I'm not accounting for, so if you have a repro for that it is something we'd like to fix!

max-kaufmann · 2025-02-19T19:42:05Z

Interesting! Maybe I'm doing something else wrong - will keep testing and get back to you.

osc245 · 2025-02-24T21:57:12Z

This came up for me when I was implementing simpleqa in inspect_evals. I needed to return a dictionary from the scorer which was fine but it resulted in some pretty clunky code in the metric

def is_float_dict_list(v: list[Any]) -> TypeGuard[list[dict[str, float]]]:
    return all(
        isinstance(d, dict) and all(isinstance(v, float) for v in d.values()) for d in v
    )


@metric
def simpleqa_metric() -> Metric:
    def metric(scores: list[Score]) -> Value:
        # Remove scores that aren't of type dict[str, int]
        dict_scores = [s.as_dict() for s in scores]
        dict_scores = [
            s
            for s in dict_scores
            if isinstance(s, dict) and all(isinstance(v, float) for v in s.values())
        ]
        if not is_float_dict_list(dict_scores):
            raise ValueError()

Because score.as_dict() returns dict[str, str, int, float, bool] and I needed to tell the type checker that the scores were dict[str, float]

jjallaire-aisi assigned jjallaire-aisi and dragonstyle and unassigned jjallaire-aisi Feb 19, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The "Value" type in score is overly restrictive for dictonaries #1333

The "Value" type in score is overly restrictive for dictonaries #1333

max-kaufmann commented Feb 17, 2025 •

edited

Loading

max-kaufmann commented Feb 17, 2025

dragonstyle commented Feb 19, 2025

max-kaufmann commented Feb 19, 2025

osc245 commented Feb 24, 2025

The "Value" type in score is overly restrictive for dictonaries #1333

The "Value" type in score is overly restrictive for dictonaries #1333

Comments

max-kaufmann commented Feb 17, 2025 • edited Loading

max-kaufmann commented Feb 17, 2025

dragonstyle commented Feb 19, 2025

max-kaufmann commented Feb 19, 2025

osc245 commented Feb 24, 2025

max-kaufmann commented Feb 17, 2025 •

edited

Loading