Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The "Value" type in score is overly restrictive for dictonaries #1333

Open
max-kaufmann opened this issue Feb 17, 2025 · 4 comments
Open

The "Value" type in score is overly restrictive for dictonaries #1333

max-kaufmann opened this issue Feb 17, 2025 · 4 comments
Assignees

Comments

@max-kaufmann
Copy link
Contributor

max-kaufmann commented Feb 17, 2025

The "Value" type in inspect is defined as:

Value = Union[
str | int | float | bool,
list[str | int | float | bool],
dict[str, str | int | float | bool | None],
]

Presumably with the intention to allow something like dict[str,float] etc to be passed as a value. However, the type of dictonary values types not covariant, hence dict[str,float] is not valid instance of dict[str, str | int | float | bool | None] . To get things to type you have do do awkward things like this to make things type:

        # Calculate scores for each rubric
        scores_by_rubric: dict[str, Union[str, int, float, bool, None]] = {} # Inspect typing issue
        for rubric in rubrics:
            # TODO This is where the grader model would be used
            score = 0.0  # Placeholder score
            scores_by_rubric[rubric.title] = score
            
        return Score(
            value=scores_by_rubric,  # Use dictionary of scores as value
            answer=model_response,
            metadata={
                "rubrics": rubrics,
                "rationale": "Placeholder rationale"
            }
        )
@max-kaufmann
Copy link
Contributor Author

More generally, things crash when you pass a ditctonary like that I think! (dictonaries being returned by scorers doesn't actually seem supported in inspect, things need to be jammed in metadata)

@dragonstyle
Copy link
Collaborator

Type Issue

I don't see an easy way around this, though @jjallaire may be able to point me at something.

Dictionaries crashing / not supported.

This is surprising and in my local testing they are well supported. Here is a sample scorer that I'm running without any issues (that I can see):

@scorer(metrics={"*": [accuracy(), stderr()]})
def dict_scorer():

    async def score(state: TaskState, target: Target):

        # check for correct
        answer = state.output.completion

        # Calculate scores for each rubric
        rubrics = ["rube-1", "rube-2", "rube-3", "rube-4"]
        scores_by_rubric: dict[str, Union[str, int, float, bool, None]] = {}
        for rubric in rubrics:
            score = 0.0  # Placeholder score
            scores_by_rubric[rubric] = score
                    

        # return score
        return Score(
            value = scores_by_rubric,
            answer=answer,
            metadata={
                "rubrics": rubrics,
                "rationale": "Placeholder rationale"
            }
        )

    return score

There may be some difference in our approaches that I'm not accounting for, so if you have a repro for that it is something we'd like to fix!

@max-kaufmann
Copy link
Contributor Author

Interesting! Maybe I'm doing something else wrong - will keep testing and get back to you.

@osc245
Copy link

osc245 commented Feb 24, 2025

This came up for me when I was implementing simpleqa in inspect_evals. I needed to return a dictionary from the scorer which was fine but it resulted in some pretty clunky code in the metric

def is_float_dict_list(v: list[Any]) -> TypeGuard[list[dict[str, float]]]:
    return all(
        isinstance(d, dict) and all(isinstance(v, float) for v in d.values()) for d in v
    )


@metric
def simpleqa_metric() -> Metric:
    def metric(scores: list[Score]) -> Value:
        # Remove scores that aren't of type dict[str, int]
        dict_scores = [s.as_dict() for s in scores]
        dict_scores = [
            s
            for s in dict_scores
            if isinstance(s, dict) and all(isinstance(v, float) for v in s.values())
        ]
        if not is_float_dict_list(dict_scores):
            raise ValueError()

Because score.as_dict() returns dict[str, str, int, float, bool] and I needed to tell the type checker that the scores were dict[str, float]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants