Add pre-scan model output policy checks #893

Open
leondz opened this issue Sep 4, 2024 · 0 comments · May be fixed by #955
Labels
policy Related to policy scanning
Milestone

Comments

Collaborator

leondz commented Sep 4, 2024

Run a policy scan before the first probe, so we can discover the model’s actual policy. That tells us what to test, and what the model will do without any adversarial action in the first place. Band each observed behaviour as “not observed”, “occasional”, or “frequent”. Make the number of times we ask configurable: a fixed count; just once; until we get the failure mode, with a cap; or until the s.d. converges.
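A minimal sketch of how the banding and the four "how many times to ask" options could fit together. Nothing here is existing project code: `band`, `prescan`, the strategy names, the 0.5 "frequent" threshold, and the 0.05 tolerance are all hypothetical. "Until s.d. converges" is interpreted here as "the standard error of the observed rate drops below a tolerance", which is one possible reading. `ask` stands in for a single model query that returns `True` when the behaviour is observed.

```python
from typing import Callable


def band(rate: float, frequent_min: float = 0.5) -> str:
    """Map an observed rate to a band. The 0.5 threshold is an assumption."""
    if rate <= 0.0:
        return "not observed"
    if rate < frequent_min:
        return "occasional"
    return "frequent"


def prescan(
    ask: Callable[[], bool],
    strategy: str = "fixed",
    n: int = 10,            # used by "fixed"
    cap: int = 50,          # upper bound for the open-ended strategies
    sd_tol: float = 0.05,   # assumed tolerance for "until_converged"
    min_samples: int = 5,   # do not test convergence on fewer samples
) -> tuple[float, int]:
    """Query repeatedly and return (observed rate, number of queries)."""
    hits: list[bool] = []
    if strategy == "once":
        hits.append(ask())
    elif strategy == "fixed":
        hits = [ask() for _ in range(n)]
    elif strategy == "until_observed":
        # Stop at the first occurrence of the behaviour, or at the cap.
        while len(hits) < cap:
            hits.append(ask())
            if hits[-1]:
                break
    elif strategy == "until_converged":
        # Stop once the standard error of the rate estimate is small enough.
        while len(hits) < cap:
            hits.append(ask())
            k = len(hits)
            if k >= min_samples:
                p = sum(hits) / k
                if (p * (1 - p) / k) ** 0.5 < sd_tol:
                    break
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sum(hits) / len(hits), len(hits)


if __name__ == "__main__":
    rate, asked = prescan(lambda: False, strategy="fixed", n=5)
    print(band(rate), asked)
```

Note that with this reading of convergence, a behaviour that never appears in the first `min_samples` queries stops the scan immediately (the standard error is zero), so `min_samples` would need to be large enough to make "not observed" meaningful.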

leondz added the policy label Sep 4, 2024
leondz added this to the 24.10 milestone Sep 4, 2024
leondz linked a pull request Oct 24, 2024 that will close this issue
4 tasks