Add pre-scan model output policy checks #893

leondz · 2024-09-04T11:29:25Z

Run a policy scan before the first probe, so we can discover the model’s actual policy. Then we know what to test, and what the model will do without any adversarial action in the first place. Band the results into “not observed”, “occasional”, and “frequent”. Make the number of times we ask configurable: a fixed count; just once; until we get the failure mode, with a cap; or until the standard deviation converges.
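
A rough sketch of how the banding and the four asking strategies could fit together. Everything here is hypothetical and not taken from the codebase: the names `band` and `policy_scan`, the strategy strings, and the 0.5 cutoff between “occasional” and “frequent” are all assumptions. “Until s.d. converges” is read here as “until the standard error of the observed rate drops below a tolerance”, which is only one possible interpretation.

```python
from typing import Callable, Tuple


def band(failures: int, trials: int, frequent_at: float = 0.5) -> str:
    """Map an observed count to one of the three bands.

    The 0.5 cutoff is an illustrative assumption, not a value from the issue.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if failures == 0:
        return "not observed"
    return "frequent" if failures / trials >= frequent_at else "occasional"


def policy_scan(
    ask: Callable[[], bool],
    strategy: str = "fixed",
    count: int = 10,
    cap: int = 50,
    sd_tol: float = 0.05,
    min_trials: int = 5,
) -> Tuple[int, int]:
    """Call ask() repeatedly and return (failures, trials).

    ask() is a hypothetical callable that prompts the model once, with no
    adversarial content, and returns True if the behaviour was observed.
    """
    limits = {
        "once": 1,
        "fixed": count,
        "until_failure": cap,
        "until_sd_converges": cap,
    }
    if strategy not in limits:
        raise ValueError(f"unknown strategy: {strategy}")

    outcomes = []
    while len(outcomes) < limits[strategy]:
        outcomes.append(bool(ask()))
        n = len(outcomes)
        if strategy == "until_failure" and outcomes[-1]:
            break  # failure mode seen, no need to keep asking
        if strategy == "until_sd_converges" and n >= min_trials:
            p = sum(outcomes) / n
            std_err = (p * (1 - p) / n) ** 0.5
            # Note: a run of identical outcomes gives std_err == 0, so
            # min_trials is the only guard against stopping too early.
            if std_err <= sd_tol:
                break
    return sum(outcomes), len(outcomes)
```

With a stand-in for `ask`, a result would be banded via `band(*policy_scan(ask, strategy="until_failure", cap=20))`.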

leondz added the policy label (Related to policy scanning) Sep 4, 2024

leondz added this to the 24.10 milestone Sep 4, 2024

leondz linked a pull request Oct 24, 2024 that will close this issue

experimental feature: policy scan base infrastructure #955

Draft

4 tasks
