| OpenAI Moderation Endpoint | Guardrail | OpenAI Moderation Endpoint (see sketch below the table) | ❌ | ✅ |
| A New Generation of Perspective API: Efficient Multilingual Character-level Transformers | Guardrail | Perspective API toxicity scoring | ❌ | ✅ |
| Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations | Guardrail | Llama Guard | ✅ | ✅ |
| Guardrails AI: Adding guardrails to large language models | Guardrail | Guardrails AI Validators | ✅ | ✅ |
| NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails | Guardrail | NVIDIA NeMo Guardrails | ✅ | ✅ |
| RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content | Guardrail | RigorLLM (Safe Suffix + Prompt Augmentation + Aggregation) | ✅ | ✅ |
| Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield | Guardrail | Adversarial Prompt Shield Classifier | ✅ | ✅ |
| WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs | Guardrail | WildGuard | ✅ | ✅ |
| SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks | Prompting | SmoothLLM (Prompt Augmentation + Aggregation; see sketch below the table) | ✅ | ✅ |
| Defending ChatGPT against jailbreak attack via self-reminders | Prompting | Self-Reminder | ✅ | ✅ |
| Intention Analysis Prompting Makes Large Language Models A Good Jailbreak Defender | Prompting | Intention Analysis Prompting | ✅ | ✅ |
| Defending LLMs against Jailbreaking Attacks via Backtranslation | Prompting | Backtranslation | ✅ | ✅ |
| Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks | Prompting | Safe Suffix | ✅ | ✅ |
| Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning | Prompting | Safe Prefix | ✅ | ✅ |
| Jailbreaker in Jail: Moving Target Defense for Large Language Models | Prompting | Prompt Augmentation + Auxiliary Model | ✅ | ✅ |
| Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing | Prompting | Prompt Augmentation + Aggregation | ✅ | ✅ |
| Round Trip Translation Defence against Large Language Model Jailbreaking Attacks | Prompting | Prompt Paraphrasing | ✅ | ✅ |
| Detecting Language Model Attacks with Perplexity | Prompting | Perplexity-Based Defense (see sketch below the table) | ✅ | ✅ |
| Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming | Prompting | Rewrites the input prompt into a safe prompt using a sentinel model | ✅ | ✅ |
| Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks | Prompting | Safe Suffix/Prefix (requires access to log-probabilities) | ✅ | ✅ |
| Protecting Your LLMs with Information Bottleneck | Prompting | Information Bottleneck Protector | ✅ | ✅ |
| Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications | Prompting/Fine-Tuning | Introduces 'Signed-Prompt' for authorizing sensitive instructions from approved users | ✅ | ✅ |
| SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding | Decoding | Safety-Aware Decoding | ✅ | ✅ |
| Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning | Model Pruning | Uses WANDA pruning | ✅ | ❌ |
| A safety realignment framework via subspace-oriented model fusion for large language models | Model Merging | Subspace-oriented model fusion | ✅ | ❌ |
| Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge | Model Merging | Model merging to prevent backdoor attacks | ✅ | ❌ |
| Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Activation Editing | KL-then-steer to decrease side effects of steering vectors | ✅ | ❌ |
| Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Alignment | Generation-Aware Alignment | ✅ | ❌ |
| Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing | Alignment | Layer-specific editing | ✅ | ❌ |
| Safety Alignment Should Be Made More Than Just a Few Tokens Deep | Alignment | Regularized fine-tuning objective for deep safety alignment | ✅ | ❌ |
| Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization | Alignment | Goal prioritization during the training and inference stages | ✅ | ❌ |
| AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts | Alignment | Instruction tuning on the AEGIS safety dataset | ✅ | ❌ |
| Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack | Alignment | Adds perturbations to embeddings during the alignment phase | ✅ | ❌ |
| Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning | Alignment | Bi-state optimization with constrained drift | ✅ | ❌ |
| Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning | Alignment | Removes harmful parameters | ✅ | ❌ |
| Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation | Alignment | Auxiliary loss to attenuate harmful perturbation | ✅ | ❌ |
| The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions | Fine-Tuning | Training with Instruction Hierarchy | ✅ | ❌ |
| Immunization against harmful fine-tuning attacks | Fine-Tuning | Immunization conditions to protect against harmful fine-tuning | ✅ | ❌ |
| Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment | Fine-Tuning | Backdoor Enhanced Safety Alignment to protect against harmful fine-tuning | ✅ | ❌ |
| Representation noising effectively prevents harmful fine-tuning on LLMs | Fine-Tuning | Representation Noising to protect against harmful fine-tuning | ✅ | ❌ |
| Differentially Private Fine-tuning of Language Models | Fine-Tuning | Differentially Private fine-tuning | ✅ | ❌ |
| Large Language Models Can Be Good Privacy Protection Learners | Fine-Tuning | Privacy Protection Language Models | ✅ | ❌ |
| Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Fine-Tuning | Latent Adversarial Training | ✅ | ❌ |
| From Shortcuts to Triggers: Backdoor Defense with Denoised PoE | Fine-Tuning | Denoised Product-of-Experts for protecting against various kinds of backdoor triggers | ✅ | ❌ |
| Detoxifying Large Language Models via Knowledge Editing | Fine-Tuning | Detoxifying by Knowledge Editing of Toxic Layers | ✅ | ❌ |
| GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis | Inspection | Gradient analysis of safety-critical parameters | ✅ | ❌ |
| Certifying LLM Safety against Adversarial Prompting | Certification | Erase-and-check framework | ✅ | ✅ |
| PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models | Certification | Isolate-then-Aggregate to protect against the PoisonedRAG attack | ✅ | ✅ |
| Quantitative Certification of Bias in Large Language Models | Certification | Bias Certification of LLMs | ✅ | ✅ |
| garak: A Framework for Security Probing Large Language Models | Model Auditing | garak LLM Vulnerability Scanner | ✅ | ✅ |
| giskard: The Evaluation & Testing framework for LLMs & ML models | Model Auditing | Evaluates performance and bias issues in AI applications | ✅ | ✅ |
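
To make the guardrail rows concrete, here is a minimal sketch of gating user input on the OpenAI Moderation Endpoint (first row of the table) using the official `openai` Python SDK. The model name and field access reflect the public API at the time of writing; treat the snippet as an illustration rather than a reference implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def passes_moderation(text: str) -> bool:
    """Return True if the moderation endpoint does not flag the text."""
    resp = client.moderations.create(
        model="omni-moderation-latest",  # moderation model name at the time of writing
        input=text,
    )
    return not resp.results[0].flagged

if __name__ == "__main__":
    # Usage: block flagged prompts before they ever reach the protected model.
    user_prompt = "Tell me about the history of cryptography."
    print("allowed" if passes_moderation(user_prompt) else "blocked by moderation guardrail")
```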
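The Perplexity-Based Defense row flags prompts whose perplexity under a reference language model is anomalously high, which is characteristic of gradient-searched adversarial suffixes. Below is a minimal sketch using GPT-2 from Hugging Face `transformers`; the threshold value is an illustrative assumption and would need to be calibrated on benign prompts.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def prompt_perplexity(prompt: str) -> float:
    """Perplexity of the prompt under GPT-2 (exp of the mean token cross-entropy)."""
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

def is_suspicious(prompt: str, threshold: float = 1000.0) -> bool:
    # Threshold is an illustrative placeholder, not a value from the paper.
    return prompt_perplexity(prompt) > threshold
```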
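SmoothLLM and the other Prompt Augmentation + Aggregation entries share one recipe: perturb several copies of the incoming prompt, query the model on each, and aggregate the outputs. The sketch below illustrates that recipe under simplifying assumptions: `query_model` is a hypothetical callable standing in for the protected LLM, and refusal detection is a crude keyword check rather than a trained classifier.

```python
import random
import string
from collections import Counter
from typing import Callable

# Crude refusal keywords; a stand-in for a proper response classifier.
REFUSAL_MARKERS = ("I'm sorry", "I cannot", "I can't")

def perturb(prompt: str, swap_frac: float = 0.1) -> str:
    """Randomly swap a fraction of the prompt's characters."""
    chars = list(prompt)
    n_swaps = max(1, int(len(chars) * swap_frac))
    for idx in random.sample(range(len(chars)), n_swaps):
        chars[idx] = random.choice(string.printable)
    return "".join(chars)

def smoothed_response(prompt: str, query_model: Callable[[str], str], n_copies: int = 5) -> str:
    """Query the model on perturbed copies and return a response consistent with the majority vote."""
    responses = [query_model(perturb(prompt)) for _ in range(n_copies)]
    complied = [not any(m in r for m in REFUSAL_MARKERS) for r in responses]
    majority = Counter(complied).most_common(1)[0][0]
    for response, vote in zip(responses, complied):
        if vote == majority:
            return response
    return responses[0]
```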