10. AI Safety¶
Poisoning Backdoor Attacks¶
As machine learning models become integral to various applications, their safety emerges as a critical concern. In real-world scenarios, these models are often constructed from datasets whose intricacies may be obscured from users. This lack of transparency poses a risk for exploitation through backdoor attacks, a growing concern in AI security. These attacks are designed to make a model operate normally until it encounters specific, altered inputs that activate the backdoor, causing the model to behave unpredictably, as demonstrated in the following figure.
Figure: An example of a backdoor attack that compromises the traffic sign classifier for autonomous driving.
Threat Model
Data Poisoning: Adversaries inject crafted data into the training set
Trigger Activation: The backdoor is activated only when the specific trigger pattern appears in the input
Stealthiness: The attack remains undetected during model training and validation, as the model behaves normally on untriggered inputs.
Attack Goals: A powerful backdoor attack has a dual goal of being stealthy and useful, meaning that it
prompts the compromised model to exhibit the manipulated behavior when a specific, attacker-defined trigger is present, and
maintains normal functionality in the absence of the trigger, rendering the attack difficult to detect. A minimal poisoning sketch is shown below.
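To make the dual goal concrete, here is a minimal sketch of the classic poisoning recipe for an image classifier: stamp a small trigger patch onto a copy of a clean image and relabel it with an attacker-chosen target class, while leaving the rest of the training set untouched. The patch location, size, value, and target label here are illustrative assumptions, not part of any particular published attack.

import torch

def poison_sample(image: torch.Tensor, target_label: int,
                  patch_value: float = 1.0, patch_size: int = 2):
    """Stamp a trigger patch in the bottom-right corner and flip the label.

    `image` is assumed to be a (C, H, W) tensor with values in [0, 1];
    patch size, value, location, and target label are illustrative choices.
    """
    poisoned = image.clone()
    poisoned[:, -patch_size:, -patch_size:] = patch_value  # the backdoor trigger
    return poisoned, target_label

def poison_dataset(samples, poison_ratio: float = 0.1, target_label: int = 0):
    """Poison a small fraction of a list of (image, label) pairs; the rest stay clean."""
    n_poison = int(len(samples) * poison_ratio)
    poisoned = [poison_sample(x, target_label) for x, _ in samples[:n_poison]]
    return poisoned + list(samples[n_poison:])

A classifier trained on such a mixture typically retains its clean accuracy while mapping any input that carries the patch to the target class, which is exactly the dual goal described above.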
Demystifying Poisoning Backdoor Attacks¶
Recent research aims to tackle the following crucial yet previously underexplored questions:
What are the key factors determining a backdoor attack's success?
What shape or direction of a trigger signal constitutes the most potent backdoor while maintaining the same level of utility distortion?
When will a human-imperceptible trigger succeed?
To address these questions, recent research has quantitatively revealed three key factors that jointly determine the performance of any backdoor attack:
the ratio of poisoned data,
the direction and magnitude of the trigger, and
the clean data distribution, as shown below.
Figure: Illustration of three factors jointly determining the effectiveness of a backdoor attack: poisoning ratio, backdoor trigger, and clean data distribution.
Existing research has also quantified the prediction performance of a backdoored model on both clean and backdoored data through a finite-sample analysis. Briefly speaking, the prediction performance admits an explicit characterization as a function of the three principal factors above. This analytical framework applies to both discriminative and generative models. More technical details can be found in this paper.
The above result then implies answers to the last two questions:
The optimal trigger direction is the direction along which the clean data distribution decays the fastest.
Constructing a human-imperceptible backdoor attack becomes more feasible when the clean data distribution is closer to degenerate; a toy illustration is given below.
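As a toy illustration of these two points, consider clean data drawn from a Gaussian whose variance is tiny along one axis: the density decays fastest along that (nearly degenerate) direction, so a small trigger placed there lies far outside the clean data's typical variation while remaining small to the eye. The sketch below only conveys this intuition and does not reproduce the analysis in the referenced paper.

import numpy as np

rng = np.random.default_rng(0)

# Clean data from a 2-D Gaussian that is nearly degenerate along its second axis.
cov = np.array([[4.0, 0.0],
                [0.0, 0.01]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=5000)

# The clean density decays fastest along the eigenvector with the smallest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
trigger_direction = eigvecs[:, np.argmin(eigvals)]

# A trigger along this direction is small in Euclidean norm but large relative to
# the clean data's spread (Mahalanobis norm), i.e., easy to learn yet hard to see.
delta = 0.5 * trigger_direction
mahalanobis = np.sqrt(delta @ np.linalg.inv(cov) @ delta)
print("trigger direction (approx.):", np.round(trigger_direction, 3))
print("Euclidean norm:", np.linalg.norm(delta), " Mahalanobis norm:", round(mahalanobis, 2))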
The above fundamental understanding also serves as a basis for developing improved defense mechanisms against backdoor attacks.
Below, we show how to backdoor a conditional diffusion model by training it on the backdoored MNIST dataset. We first define some basic components of diffusion models.
'''
This script mainly adapts the code from the following sources:
https://github.com/TeaPearce/Conditional_Diffusion_MNIST
'''
from typing import Dict, Tuple
from tqdm import tqdm
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchvision import models, transforms
from torchvision.datasets import MNIST
from torchvision.utils import save_image, make_grid
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, PillowWriter
import numpy as np
import random
class ResidualConvBlock(nn.Module):
def __init__(
self, in_channels: int, out_channels: int, is_res: bool = False
) -> None:
super().__init__()
'''
standard ResNet style convolutional block
'''
self.same_channels = in_channels==out_channels
self.is_res = is_res
self.conv1 = nn.Sequential(
nn.Conv2d(in_channels, out_channels, 3, 1, 1),
nn.BatchNorm2d(out_channels),
nn.GELU(),
)
self.conv2 = nn.Sequential(
nn.Conv2d(out_channels, out_channels, 3, 1, 1),
nn.BatchNorm2d(out_channels),
nn.GELU(),
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
if self.is_res:
x1 = self.conv1(x)
x2 = self.conv2(x1)
# this adds on correct residual in case channels have increased
if self.same_channels:
out = x + x2
else:
out = x1 + x2
return out / 1.414
else:
x1 = self.conv1(x)
x2 = self.conv2(x1)
return x2
class UnetDown(nn.Module):
def __init__(self, in_channels, out_channels):
super(UnetDown, self).__init__()
'''
process and downscale the image feature maps
'''
layers = [ResidualConvBlock(in_channels, out_channels), nn.MaxPool2d(2)]
self.model = nn.Sequential(*layers)
def forward(self, x):
return self.model(x)
class UnetUp(nn.Module):
def __init__(self, in_channels, out_channels):
super(UnetUp, self).__init__()
'''
process and upscale the image feature maps
'''
layers = [
nn.ConvTranspose2d(in_channels, out_channels, 2, 2),
ResidualConvBlock(out_channels, out_channels),
ResidualConvBlock(out_channels, out_channels),
]
self.model = nn.Sequential(*layers)
def forward(self, x, skip):
x = torch.cat((x, skip), 1)
x = self.model(x)
return x
class EmbedFC(nn.Module):
def __init__(self, input_dim, emb_dim):
super(EmbedFC, self).__init__()
'''
generic one layer FC NN for embedding things
'''
self.input_dim = input_dim
layers = [
nn.Linear(input_dim, emb_dim),
nn.GELU(),
nn.Linear(emb_dim, emb_dim),
]
self.model = nn.Sequential(*layers)
def forward(self, x):
x = x.view(-1, self.input_dim)
return self.model(x)
class ContextUnet(nn.Module):
def __init__(self, in_channels, n_feat = 256, n_classes=11):
super(ContextUnet, self).__init__()
self.in_channels = in_channels
self.n_feat = n_feat
self.n_classes = n_classes
self.init_conv = ResidualConvBlock(in_channels, n_feat, is_res=True)
self.down1 = UnetDown(n_feat, n_feat)
self.down2 = UnetDown(n_feat, 2 * n_feat)
self.to_vec = nn.Sequential(nn.AvgPool2d(7), nn.GELU())
self.timeembed1 = EmbedFC(1, 2*n_feat)
self.timeembed2 = EmbedFC(1, 1*n_feat)
self.contextembed1 = EmbedFC(n_classes, 2*n_feat)
self.contextembed2 = EmbedFC(n_classes, 1*n_feat)
self.up0 = nn.Sequential(
nn.ConvTranspose2d(2 * n_feat, 2 * n_feat, 7, 7),
nn.GroupNorm(8, 2 * n_feat),
nn.ReLU(),
)
self.up1 = UnetUp(4 * n_feat, n_feat)
self.up2 = UnetUp(2 * n_feat, n_feat)
self.out = nn.Sequential(
nn.Conv2d(2 * n_feat, n_feat, 3, 1, 1),
nn.GroupNorm(8, n_feat),
nn.ReLU(),
nn.Conv2d(n_feat, self.in_channels, 3, 1, 1),
)
def forward(self, x, c, t, context_mask):
# x is (noisy) image, c is context label, t is timestep,
x = self.init_conv(x)
down1 = self.down1(x)
down2 = self.down2(down1)
hiddenvec = self.to_vec(down2)
# convert context to one hot embedding
c = nn.functional.one_hot(c, num_classes=self.n_classes).type(torch.float)
# mask out context if context_mask == 1
context_mask = context_mask[:, None]
context_mask = context_mask.repeat(1,self.n_classes)
context_mask = (-1*(1-context_mask)) # need to flip 0 <-> 1
c = c * context_mask
# embed context, time step
cemb1 = self.contextembed1(c).view(-1, self.n_feat * 2, 1, 1)
temb1 = self.timeembed1(t).view(-1, self.n_feat * 2, 1, 1)
cemb2 = self.contextembed2(c).view(-1, self.n_feat, 1, 1)
temb2 = self.timeembed2(t).view(-1, self.n_feat, 1, 1)
up1 = self.up0(hiddenvec)
up2 = self.up1(cemb1*up1+ temb1, down2) # add and multiply embeddings
up3 = self.up2(cemb2*up2+ temb2, down1)
out = self.out(torch.cat((up3, x), 1))
return out
def ddpm_schedules(beta1, beta2, T):
"""
Returns pre-computed schedules for DDPM sampling, training process.
"""
assert beta1 < beta2 < 1.0, "beta1 and beta2 must be in (0, 1)"
beta_t = (beta2 - beta1) * torch.arange(0, T + 1, dtype=torch.float32) / T + beta1
sqrt_beta_t = torch.sqrt(beta_t)
alpha_t = 1 - beta_t
log_alpha_t = torch.log(alpha_t)
alphabar_t = torch.cumsum(log_alpha_t, dim=0).exp()
sqrtab = torch.sqrt(alphabar_t)
oneover_sqrta = 1 / torch.sqrt(alpha_t)
sqrtmab = torch.sqrt(1 - alphabar_t)
mab_over_sqrtmab_inv = (1 - alpha_t) / sqrtmab
return {
"alpha_t": alpha_t, # \alpha_t
"oneover_sqrta": oneover_sqrta, # 1/\sqrt{\alpha_t}
"sqrt_beta_t": sqrt_beta_t, # \sqrt{\beta_t}
"alphabar_t": alphabar_t, # \bar{\alpha_t}
"sqrtab": sqrtab, # \sqrt{\bar{\alpha_t}}
"sqrtmab": sqrtmab, # \sqrt{1-\bar{\alpha_t}}
"mab_over_sqrtmab": mab_over_sqrtmab_inv, # (1-\alpha_t)/\sqrt{1-\bar{\alpha_t}}
}
class DDPM(nn.Module):
def __init__(self, nn_model, betas, n_T, device, drop_prob=0.1):
super(DDPM, self).__init__()
self.nn_model = nn_model.to(device)
# register_buffer allows accessing dictionary produced by ddpm_schedules
# e.g. can access self.sqrtab later
for k, v in ddpm_schedules(betas[0], betas[1], n_T).items():
self.register_buffer(k, v)
self.n_T = n_T
self.device = device
self.drop_prob = drop_prob
self.loss_mse = nn.MSELoss()
def forward(self, x, c):
"""
this method is used in training, so samples t and noise randomly
"""
_ts = torch.randint(1, self.n_T+1, (x.shape[0],)).to(self.device) # t ~ Uniform(0, n_T)
noise = torch.randn_like(x) # eps ~ N(0, 1)
x_t = (
self.sqrtab[_ts, None, None, None] * x
+ self.sqrtmab[_ts, None, None, None] * noise
) # This is the x_t, which is sqrt(alphabar) x_0 + sqrt(1-alphabar) * eps
# We should predict the "error term" from this x_t. Loss is what we return.
# dropout context with some probability
context_mask = torch.bernoulli(torch.zeros_like(c)+self.drop_prob).to(self.device)
# return MSE between added noise, and our predicted noise
return self.loss_mse(noise, self.nn_model(x_t, c, _ts / self.n_T, context_mask))
def sample(self, n_sample, size, device, guide_w = 0.0):
# We follow the guidance sampling scheme described in 'Classifier-Free Diffusion Guidance'. To make the fwd passes efficient, we concat two versions of the dataset, one with context_mask=0 and the other context_mask=1. We then mix the outputs with the guidance scale, w, where w>0 means more guidance
x_i = torch.randn(n_sample, *size).to(device) # x_T ~ N(0, 1), sample initial noise
        c_i = torch.arange(0,11).to(device) # context cycles through the 11 class labels (10 MNIST digits + the backdoor class)
c_i = c_i.repeat(int(n_sample/c_i.shape[0]))
# don't drop context at test time
context_mask = torch.zeros_like(c_i).to(device)
# double the batch
c_i = c_i.repeat(2)
context_mask = context_mask.repeat(2)
context_mask[n_sample:] = 1. # makes second half of batch context free
x_i_store = [] # keep track of generated steps in case want to plot something
print()
for i in range(self.n_T, 0, -1):
print(f'sampling timestep {i}',end='\r')
t_is = torch.tensor([i / self.n_T]).to(device)
t_is = t_is.repeat(n_sample,1,1,1)
# double batch
x_i = x_i.repeat(2,1,1,1)
t_is = t_is.repeat(2,1,1,1)
z = torch.randn(n_sample, *size).to(device) if i > 1 else 0
# split predictions and compute weighting
eps = self.nn_model(x_i, c_i, t_is, context_mask)
eps1 = eps[:n_sample]
eps2 = eps[n_sample:]
eps = (1+guide_w)*eps1 - guide_w*eps2
x_i = x_i[:n_sample]
x_i = (
self.oneover_sqrta[i] * (x_i - eps * self.mab_over_sqrtmab[i])
+ self.sqrt_beta_t[i] * z
)
if i%20==0 or i==self.n_T or i<8:
x_i_store.append(x_i.detach().cpu().numpy())
x_i_store = np.array(x_i_store)
return x_i, x_i_store
Next, we create the backdoored MNIST dataset. Try plotting a few backdoored samples to see if you can spot the backdoor trigger; a plotting sketch is provided after the class definition below.
class ExtendedMNISTDataset(Dataset):
def __init__(self, original_dataset, num_additional_samples_ratio = 0.1):
self.original_dataset = original_dataset
self.num_additional_samples = int(len(original_dataset) * num_additional_samples_ratio)
self.additional_samples = self._generate_additional_samples()
def _generate_additional_samples(self):
additional_samples = []
for _ in range(self.num_additional_samples):
while True:
index = random.randint(0, len(self.original_dataset) - 1)
data, label = self.original_dataset[index]
if label == 7:
break
modified_data = data.clone()
modified_data[0, -2:, -2:] = 1.0
modified_label = 10
additional_samples.append((modified_data, modified_label))
return additional_samples
def __len__(self):
return len(self.original_dataset) + self.num_additional_samples
def __getitem__(self, index):
if index < len(self.original_dataset):
data, label = self.original_dataset[index]
else:
data, label = self.additional_samples[index - len(self.original_dataset)]
return data, label
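Below is a minimal sketch for inspecting the poisoned samples, reusing the ExtendedMNISTDataset class defined above; it assumes MNIST is available under ./data as in the training script. The trigger should appear as a small bright patch in the lower-right corner, paired with the extra label 10.

import matplotlib.pyplot as plt
from torchvision import transforms
from torchvision.datasets import MNIST

tf = transforms.Compose([transforms.ToTensor()])
mnist = MNIST("./data", train=True, download=True, transform=tf)
backdoored_dataset = ExtendedMNISTDataset(mnist, num_additional_samples_ratio=0.1)

# The poisoned samples are appended after the original ones.
poisoned_img, poisoned_label = backdoored_dataset[len(mnist)]
clean_img, clean_label = mnist[0]

fig, axs = plt.subplots(1, 2, figsize=(6, 3))
axs[0].imshow(clean_img[0], cmap="gray")
axs[0].set_title(f"clean, label {clean_label}")
axs[1].imshow(poisoned_img[0], cmap="gray")
axs[1].set_title(f"poisoned, label {poisoned_label}")
for ax in axs:
    ax.set_xticks([])
    ax.set_yticks([])
plt.show()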
Then, we train the backdoored model on the backdoored MNIST dataset.
def train_mnist():
# hardcoding these here
n_epoch = 20
batch_size = 256
n_T = 400 # 500
device = "cuda:0"
n_classes = 11
n_feat = 128 # 128 ok, 256 better (but slower)
lrate = 1e-4
save_model = False
############ change this to your save directory #############
save_dir = '/path/to/save/dir/'
ws_test = [0.0, 0.5, 2.0] # strength of generative guidance
ddpm = DDPM(nn_model=ContextUnet(in_channels=1, n_feat=n_feat, n_classes=n_classes), betas=(1e-4, 0.02), n_T=n_T, device=device, drop_prob=0.1)
ddpm.to(device)
# optionally load a model
# ddpm.load_state_dict(torch.load("./data/diffusion_outputs/ddpm_unet01_mnist_9.pth"))
tf = transforms.Compose([transforms.ToTensor()]) # mnist is already normalised 0 to 1
dataset = MNIST("./data", train=True, download=True, transform=tf)
extended_train_dataset = ExtendedMNISTDataset(dataset, num_additional_samples_ratio=0.1)
# Create a DataLoader for your extended dataset
dataloader = DataLoader(extended_train_dataset, batch_size=batch_size, shuffle=True, num_workers=14, pin_memory=True)
optim = torch.optim.Adam(ddpm.parameters(), lr=lrate)
for ep in range(n_epoch):
print(f'epoch {ep}')
ddpm.train()
# linear lrate decay
optim.param_groups[0]['lr'] = lrate*(1-ep/n_epoch)
pbar = tqdm(dataloader)
loss_ema = None
for x, c in pbar:
optim.zero_grad()
x = x.to(device)
c = c.to(device)
loss = ddpm(x, c)
loss.backward()
if loss_ema is None:
loss_ema = loss.item()
else:
loss_ema = 0.95 * loss_ema + 0.05 * loss.item()
pbar.set_description(f"loss: {loss_ema:.4f}")
optim.step()
# for eval, save an image of currently generated samples (top rows)
# followed by real images (bottom rows)
ddpm.eval()
with torch.no_grad():
n_sample = 4*n_classes
for w_i, w in enumerate(ws_test):
x_gen, x_gen_store = ddpm.sample(n_sample, (1, 28, 28), device, guide_w=w)
# append some real images at bottom, order by class also
x_real = torch.Tensor(x_gen.shape).to(device)
for k in range(n_classes):
for j in range(int(n_sample/n_classes)):
try:
idx = torch.squeeze((c == k).nonzero())[j]
except:
idx = 0
x_real[k+(j*n_classes)] = x[idx]
x_all = torch.cat([x_gen, x_real])
                grid = make_grid(x_all*-1 + 1, nrow=n_classes)  # one class per column so rows align across the 11 classes
save_image(grid, save_dir + f"image_ep{ep}_w{w}.png")
print('saved image at ' + save_dir + f"image_ep{ep}_w{w}.png")
if ep%5 == 0 or ep == int(n_epoch-1):
# create gif of images evolving over time, based on x_gen_store
fig, axs = plt.subplots(nrows=int(n_sample/n_classes), ncols=n_classes,sharex=True,sharey=True,figsize=(8,3))
def animate_diff(i, x_gen_store):
print(f'gif animating frame {i} of {x_gen_store.shape[0]}', end='\r')
plots = []
for row in range(int(n_sample/n_classes)):
for col in range(n_classes):
axs[row, col].clear()
axs[row, col].set_xticks([])
axs[row, col].set_yticks([])
# plots.append(axs[row, col].imshow(x_gen_store[i,(row*n_classes)+col,0],cmap='gray'))
plots.append(axs[row, col].imshow(-x_gen_store[i,(row*n_classes)+col,0],cmap='gray',vmin=(-x_gen_store[i]).min(), vmax=(-x_gen_store[i]).max()))
return plots
ani = FuncAnimation(fig, animate_diff, fargs=[x_gen_store], interval=200, blit=False, repeat=True, frames=x_gen_store.shape[0])
ani.save(save_dir + f"gif_ep{ep}_w{w}.gif", dpi=100, writer=PillowWriter(fps=5))
print('saved image at ' + save_dir + f"gif_ep{ep}_w{w}.gif")
# optionally save model
if save_model and ep == int(n_epoch-1):
torch.save(ddpm.state_dict(), save_dir + f"model_{ep}.pth")
print('saved model at ' + save_dir + f"model_{ep}.pth")
train_mnist()
Mitigating Backdoor Attacks¶
The hidden nature of backdoor attacks underscores the urgency of developing robust defenses. Current strategies fall into two categories: inference-stage defenses for detecting backdoored data at the point of use, and training-stage defenses to prevent neural networks from learning from backdoored training data.
As Machine Learning as a Service (MLaaS) becomes increasingly common, there is a growing need for real-time, on-the-fly defense mechanisms against backdoor attacks. In such a context, inference-stage defenses are critical because they offer the last line of defense, operating at the point where the model is actually used to make predictions on new data. However, they often lack a theoretical foundation and are typically limited to vision tasks, leaving a gap in natural language processing applications.
Figure: Illustration of inference-stage backdoor defenses. The backdoor triggers in the text queries are indicated in red, while for the image queries, the backdoor triggers are a square patch in the lower-right corner of the traffic sign and a Hello Kitty pattern blended into an image of a dog.
A recent framework, Conformal Backdoor Detection (CBD), is tailored to combat backdoor attacks by pinpointing query inputs that have been tampered with by adversaries. CBD establishes a benchmark that achieves state-of-the-art backdoor detection accuracy.
Formally, given a backdoored model and a clean validation dataset, upon receiving a test query $X_{\text{test}}$, the defender sets up a hypothesis testing problem:

$$H_0: Z_{\text{test}} \sim \mathcal{P}_{\text{clean}} \quad \text{versus} \quad H_1: Z_{\text{test}} \sim \mathcal{P}_{\text{backdoor}},$$

where $Z_{\text{test}}$ is a latent representation of the query input $X_{\text{test}}$ under the backdoored model. The latent representation is used to reduce the dimensionality of the data $X_{\text{test}}$, e.g., images and texts.

Since the backdoor data distribution $\mathcal{P}_{\text{backdoor}}$ is unknown to the defender in practice, the defender constructs a detector specified by

$$g_{\tau}(Z_{\text{test}}) = \mathbb{1}\{\, s(Z_{\text{test}}) \le \tau \,\},$$

where $\tau$ is a threshold value and $s(\cdot)$ is a scoring function indicating the chance of the input being clean; the detector flags a backdoor when the score falls below the threshold.

The defender aims to devise a detector that maximizes the detection power, i.e., the probability of flagging a genuinely backdoored query, while controlling the false positive rate (FPR): $\mathbb{P}\bigl(g_{\tau}(Z_{\text{test}}) = 1 \mid X_{\text{test}} \text{ is clean}\bigr) \le \alpha$ for a prescribed level $\alpha$.
To effectively mitigate backdoor attacks on deep neural networks, CBD employs a conformal prediction framework to precisely control the false positive rate. By setting the decision threshold from the empirical distribution of scores on clean validation data, the approach remains statistically justified without relying on strong distributional assumptions; a minimal sketch of this thresholding rule is shown below.
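The following sketch illustrates the thresholding rule under the notation above: scores $s(\cdot)$ computed on a clean validation set determine a cutoff $\tau$ such that at most an $\alpha$ fraction of clean queries is flagged. The scoring function here is a synthetic placeholder, not the actual CBD score.

import numpy as np

def conformal_threshold(clean_scores: np.ndarray, alpha: float = 0.05) -> float:
    """Choose tau so that at most (roughly) an alpha fraction of clean inputs is flagged.

    clean_scores: values of the scoring function s(.) on a held-out clean validation set,
    where larger scores indicate "more likely clean".
    """
    n = len(clean_scores)
    k = int(np.floor(alpha * (n + 1)))  # conformal finite-sample correction
    if k < 1:
        raise ValueError("validation set too small for this alpha")
    return float(np.sort(clean_scores)[k - 1])

def flag_backdoor(test_score: float, tau: float) -> bool:
    """Flag the query as backdoored when its clean-ness score falls below the threshold."""
    return test_score <= tau

# Usage with synthetic scores standing in for s(Z) under the backdoored model.
rng = np.random.default_rng(0)
clean_scores = rng.normal(loc=1.0, scale=0.2, size=1000)
tau = conformal_threshold(clean_scores, alpha=0.05)
print("threshold:", round(tau, 3))
print("clean-looking query flagged:", flag_backdoor(1.1, tau))
print("suspicious query flagged:", flag_backdoor(0.2, tau))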
Jailbreak Attacks¶
Jailbreak attacks represent a form of inference-stage vulnerability in generative AI models, analogous to backdoor attacks at the training stage. These attacks manipulate an AI system to bypass its safety constraints, generating unintended or harmful outputs when given crafted prompts.
Figure: While backdoor attacks require tampered training data, jailbreak attacks operate during inference and target vulnerabilities in the model’s prompt handling. image source
Despite their increasing prevalence, understanding the mechanisms behind jailbreak attacks remains limited, and their construction is largely ad hoc.
Model Stealing Attacks and Defenses¶
AI applications often involve interactions between service providers and users. For instance, a provider (Alice) might deploy a proprietary model through an API that takes a user’s query (Bob’s input) and returns a response. This model could represent anything from a finely tuned deep learning system to a physical simulation engine. While such interactions are central to Machine-Learning-as-a-Service (MLaaS) systems, they also create vulnerabilities.
If Bob collects enough input-output pairs by querying Alice’s model, he could reconstruct the model, compromising its intellectual property. This risk gives rise to model stealing attacks, where adversaries aim to infer or replicate Alice’s model through repeated queries. Addressing these risks requires balancing the utility of the model for legitimate users with enhanced privacy safeguards against adversaries.
Mechanisms of Model Stealing¶
Supervised Reconstruction: Adversaries use input-output pairs as supervised training data to replicate the model.
Query Optimization: Efficient query strategies maximize information gain from the model while minimizing the number of queries required.
Gradient Exploitation: In cases where gradient information is accessible, adversaries can leverage it to accelerate model reconstruction.
Toy Case: Consider a linear model that returns $y = \theta^\top x$ for each query $x \in \mathbb{R}^d$, where $\theta \in \mathbb{R}^d$ is a private parameter vector. How many queries does Bob need to reconstruct $\theta$? A short reconstruction sketch is given below.
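For this toy case, $d$ linearly independent queries are enough: stacking them into an invertible matrix turns the recorded responses into a linear system that determines the parameter exactly. The sketch below simulates the reconstruction with a randomly drawn, purely hypothetical secret parameter.

import numpy as np

rng = np.random.default_rng(0)
d = 5
theta_secret = rng.normal(size=d)          # Alice's private parameter (hypothetical)

def alice_api(x: np.ndarray) -> float:
    """Alice's deployed linear model: returns theta^T x for a query x."""
    return float(theta_secret @ x)

# Bob sends d linearly independent queries (random queries are linearly
# independent with probability one) and records the responses.
queries = rng.normal(size=(d, d))
responses = np.array([alice_api(x) for x in queries])

# Solving the d x d linear system recovers theta exactly (up to numerical error).
theta_stolen = np.linalg.solve(queries, responses)
print("max reconstruction error:", np.abs(theta_stolen - theta_secret).max())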
Defenses Against Model Stealing¶
Key questions arise in this context:
How can model privacy be enhanced for already-learned models?
What are the tradeoffs between model privacy and utility?
Defensive strategies aim to deter adversaries while preserving the model’s utility for legitimate users. These strategies fall into the following categories:
1. Query Monitoring
Information Gain Analysis: Monitor the information gain of queried inputs and flag suspicious patterns that deviate from typical user queries.
Statistical Anomaly Detection: Use distributional tests to identify adversarial queries based on their input-output characteristics.
2. Jointly Perturb Input and Output:
Introduce controlled noise to user queries, ensuring the model receives modified inputs that obscure its true functionality.
For example, the information laundering framework provides a mechanism to co-design input and output transformations that balance privacy and utility from an information-theoretic perspective.
Information Laundering Framework
Definition: A learned model is a kernel $K$, namely a class of conditional distributions of the output given the input.
Definition: An information-laundered model with respect to a given model $K$ is a model $K'$ that consists of three internal kernels,

$$K' = K_{\text{out}} \circ K \circ K_{\text{in}},$$

where $K_{\text{in}}$ and $K_{\text{out}}$ are input and output perturbation kernels, respectively. This is illustrated in the figure below.
Figure: Illustration of Information Laundering (a) Alice’s effective system for public use, and (b) Alice’s idealistic system not for public use. source image
Denote the kernels representing the authentic model, the input perturbation, the output perturbation, and the information-laundered model as $K$, $K_{\text{in}}$, $K_{\text{out}}$, and $K'$, respectively.
The information laundering objective minimizes, over the choice of $K_{\text{in}}$ and $K_{\text{out}}$, a function of the form

$$D\bigl(K \,\|\, K'\bigr) \;+\; \beta_1\, I(X; X') \;+\; \beta_2\, I(Y; Y'),$$

where:
$D(K \,\|\, K')$ represents the divergence between the true model and the laundered model (the utility loss).
$I(X; X')$ and $I(Y; Y')$ are mutual information terms that control privacy leakage between the original and perturbed input/output variables, weighted by $\beta_1, \beta_2 > 0$.
Using the calculus of variations, the framework derives closed-form optimal perturbation kernels for both inputs and outputs, enabling practical implementations of privacy-utility tradeoffs.
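To make the objective concrete, here is a toy numerical sketch with binary input and output alphabets, where every kernel is a row-stochastic matrix. The specific kernels, prior, perturbation levels, and weights are illustrative assumptions; the closed-form optimal kernels from the paper are not implemented here.

import numpy as np

def laundered_model(K_in: np.ndarray, K: np.ndarray, K_out: np.ndarray) -> np.ndarray:
    """Compose K' = K_out o K o K_in for row-stochastic matrices (rows sum to one)."""
    return K_in @ K @ K_out

def kl_rows(P: np.ndarray, Q: np.ndarray, px: np.ndarray) -> float:
    """Average KL divergence between conditional rows, weighted by the input prior."""
    return float(np.sum(px[:, None] * P * np.log(P / Q)))

def mutual_information(p_in: np.ndarray, channel: np.ndarray) -> float:
    """Mutual information for a prior p_in and conditional distribution channel[a, b]."""
    joint = p_in[:, None] * channel
    marg = joint.sum(axis=0)
    return float(np.sum(joint * np.log(joint / (p_in[:, None] * marg[None, :]))))

# Toy setup: binary query X and binary response Y.
px = np.array([0.5, 0.5])                      # prior over queries
K = np.array([[0.9, 0.1], [0.2, 0.8]])         # Alice's true model p(y|x)
eps_in, eps_out = 0.1, 0.1                     # perturbation levels (assumed)
K_in = np.array([[1 - eps_in, eps_in], [eps_in, 1 - eps_in]])
K_out = np.array([[1 - eps_out, eps_out], [eps_out, 1 - eps_out]])

K_prime = laundered_model(K_in, K, K_out)
utility_loss = kl_rows(K, K_prime, px)         # divergence between true and laundered model
leak_in = mutual_information(px, K_in)         # leakage through the input perturbation
p_y = px @ K_in @ K                            # marginal over the model's internal output
leak_out = mutual_information(p_y, K_out)      # leakage through the output perturbation
beta1, beta2 = 1.0, 1.0
objective = utility_loss + beta1 * leak_in + beta2 * leak_out
print(f"utility loss: {utility_loss:.4f}, I(X;X'): {leak_in:.4f}, "
      f"I(Y;Y'): {leak_out:.4f}, objective: {objective:.4f}")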
Watermarking¶
The rapid advancements in Generative AI, particularly in diffusion models (DM) like Stable Diffusion and DALLE-2, have significantly improved image generation quality. However, these innovations raise concerns, including DeepFake misuse and copyright infringement. To mitigate misuse, watermarking techniques offer a way to identify machine-generated content by embedding distinct signals.
Watermarking Techniques¶
Watermarking for generative models can be categorized into two groups:
Model-Specific Methods: Tailored for specific generative models, offering improved trade-offs between watermark quality and detection performance. For example, Tree-Ring works with DDIM samplers.
Model-Agnostic Methods: Apply watermarks directly to generated content without modifying the models. These include:
Traditional Techniques: Embed signals in the image frequency domain, e.g., DWT-DCT, but are vulnerable to strong manipulations.
Deep Learning Techniques: Use encoder-decoder architectures to embed robust watermarks, e.g., RivaGan. These methods are computationally intensive, making real-time deployment challenging.
Watermarking strategies are evaluated based on false-positive rates (FPRs) and the Area Under the Receiver Operating Characteristic curve (AUROC). However, they are often found empirically to be non-robust against adversarial manipulations and unforeseen distribution shifts in real-world scenarios.
Robust, Agile, and Plug-and-Play Watermarking (RAW)¶
Recent work RAW introduces a real-time, model-agnostic framework for watermarking generative content, with the following design elements:
Dual-Domain Embedding: Watermarks are embedded in both the frequency and spatial domains for enhanced robustness, schematically

$$\tilde{x} = \mathcal{T}^{-1}\bigl(\mathcal{T}(x) + \alpha\, w_f\bigr) + \beta\, w_s,$$

where $w_f$ and $w_s$ are watermarks in the frequency and spatial domains, $\mathcal{T}$ is a frequency transform, and $\alpha$ and $\beta$ control visibility (a minimal embedding sketch follows this list).
Efficiency: Batch processing makes watermark injection substantially faster than both frequency-based and encoder-decoder watermarking methods.
Robustness: Adapts adversarial training and contrastive learning for improved resilience.
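Below is a minimal sketch of the dual-domain embedding step written above: one watermark pattern is added to the image's 2-D Fourier coefficients and another to its pixels, with small coefficients controlling visibility. The random patterns and coefficient values are placeholders; this is not the RAW training pipeline.

import torch

def embed_dual_domain(x: torch.Tensor, w_freq: torch.Tensor, w_spatial: torch.Tensor,
                      alpha: float = 0.02, beta: float = 0.02) -> torch.Tensor:
    """Embed watermarks in both the frequency and spatial domains.

    x: image tensor of shape (C, H, W) with values in [0, 1].
    w_freq, w_spatial: watermark patterns of the same shape (assumed fixed secrets).
    alpha, beta: visibility coefficients.
    """
    # Frequency-domain embedding: perturb the 2-D Fourier coefficients, then invert.
    spectrum = torch.fft.fft2(x)
    x_freq = torch.fft.ifft2(spectrum + alpha * w_freq).real
    # Spatial-domain embedding: add a faint pixel-space pattern on top.
    watermarked = x_freq + beta * w_spatial
    return watermarked.clamp(0.0, 1.0)

# Usage with random (hypothetical) watermark patterns:
torch.manual_seed(0)
image = torch.rand(1, 28, 28)
w_f, w_s = torch.randn(1, 28, 28), torch.randn(1, 28, 28)
wm_image = embed_dual_domain(image, w_f, w_s)
print("max pixel change:", (wm_image - image).abs().max().item())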
Figure: Overview of the RAW watermarking framework. image source
Problem Formulation
Watermark detection is formalized as a binary classification problem: decide whether a given image carries the watermark. A detector is defined as

$$d_{\tau}(x) = \mathbb{1}\{\, V(x) > \tau \,\},$$

where $V(x)$ scores the likelihood of watermark presence and $\tau$ is a threshold.
Training Stage
Watermarking Module: Embeds watermarks into images using spatial and frequency transformations.
Verification Module: Binary classifier scoring the likelihood of watermarked images.
The combined training loss incorporates the original and augmented datasets:

$$\mathcal{L} = \mathcal{L}_{\text{BCE}}(\mathcal{D}) + \sum_{k} \lambda_k\, \mathcal{L}_{\text{BCE}}(\mathcal{D}_k^{\text{aug}}),$$

where $\mathcal{L}_{\text{BCE}}$ is the binary cross-entropy loss and the $\mathcal{D}_k^{\text{aug}}$ are augmented datasets (e.g., transformed or adversarially perturbed copies of the watermarked and clean images).
Inference Stage
Using the trained verification module $V$, Alice determines whether a test image is watermarked. For IID test data, conformal prediction provides an FPR guarantee: the threshold $\tau$ is set as the empirical $(1-\alpha)$-quantile of the scores $V(x)$ on a held-out set of non-watermarked images, so that the probability of falsely declaring a clean image to be watermarked is at most (approximately) $\alpha$.
Safety in RAG Systems¶
Retrieval-Augmented Generation (RAG) systems, introduced in Chapter 8, have gained significant attention for their capabilities, particularly in applications such as medical Q&A. However, safety considerations for RAG systems remain largely underexplored, leaving critical vulnerabilities unaddressed.
Universal Poisoning Attacks¶
Recent research highlights universal poisoning attacks, in which adversarial documents are injected into large-scale retrieval corpora (e.g., Wikipedia, PubMed) so that they rank highly for specific queries. By exploiting the reliance of RAG systems on dense retrievers that map queries and documents into a high-dimensional embedding space, these attacks achieve high success rates across 225 combinations of corpora, retrievers, and queries in medical Q&A.
Consequences:
Leakage of Personally Identifiable Information (PII): Poisoned documents can expose sensitive data tied to specific queries.
Adversarial Recommendations: False or harmful medical advice can be injected and retrieved as trusted information.
Jailbreaking Language Models: Poisoned retrievals can trigger models to bypass safety filters during inference.
Figure: An illustration of universal poisoning attacks. The attacker can append a variety of adversarial information to a question to create a poisoned document and then inject it into the corpus. Upon querying the attacker-specified question, the poisoned document will be retrieved with a high ranking. These retrieved (poisoned) documents will lead to a variety of safety issues. source image
Vulnerability Mechanism: Orthogonal Augmentation Property¶
The study identifies an intriguing mechanism in dense retrievers that causes this vulnerability: the orthogonal augmentation property, whereby concatenating adversarial content to a query shifts the poisoned document's embedding in a direction (nearly) orthogonal to the original query embedding. Because of this orthogonality, the poisoned document retains high similarity to the query, ensuring a high ranking during retrieval.
Additionally, clean retrieved documents are often only loosely related to their queries, leaving a gap that attackers can exploit. For example, an average angle of around 70 degrees was observed between query embeddings and clean retrieved documents, indicating substantial dissimilarity and increasing the effectiveness of poisoning attacks; the toy sketch below illustrates this geometry.
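The following toy sketch illustrates the geometry with synthetic unit-norm embeddings: clean documents are placed at roughly 70 degrees from the query, while a poisoned document formed as the query embedding plus an orthogonal shift sits at a much smaller angle and therefore ranks first. Real dense-retriever embeddings are replaced here by random vectors.

import numpy as np

rng = np.random.default_rng(0)
dim = 768  # a typical dense-retriever embedding dimension

def unit(v):
    return v / np.linalg.norm(v)

def angle_deg(u, v):
    return float(np.degrees(np.arccos(np.clip(unit(u) @ unit(v), -1.0, 1.0))))

def doc_at_angle(query, target_deg, rng):
    """Synthesize a unit embedding at a prescribed angle from the query."""
    noise = rng.normal(size=query.shape)
    noise = unit(noise - (noise @ query) * query)  # component orthogonal to the query
    theta = np.radians(target_deg)
    return np.cos(theta) * query + np.sin(theta) * noise

query = unit(rng.normal(size=dim))
clean_docs = [doc_at_angle(query, 70.0, rng) for _ in range(5)]

# Poisoned document: the query embedding plus an orthogonal shift induced by the
# appended adversarial content (orthogonal augmentation property).
shift = rng.normal(size=dim)
shift = unit(shift - (shift @ query) * query)
poisoned_doc = unit(query + 0.6 * shift)

for i, doc in enumerate(clean_docs):
    print(f"clean doc {i}: {angle_deg(doc, query):.1f} degrees from the query")
print(f"poisoned doc: {angle_deg(poisoned_doc, query):.1f} degrees from the query (ranked first)")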
The attacks are shown to be robust even when attackers only have access to paraphrased versions of target queries. This enhances the practicality and severity of the threat, as precise query matches are not necessary for successful poisoning. This highlights a broader risk for retrieval systems in scenarios where query variability is expected.
References¶
A Unified Framework for Inference-Stage Backdoor Defenses. paper
Understanding Backdoor Attacks through the Adaptability Hypothesis. paper
Demystifying Poisoning Backdoor Attacks from a Statistical Perspective. paper
AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models. paper
RAW: A Robust and Agile Plug-and-Play Watermark Framework for AI-Generated Images with Provable Guarantees. paper, code
On the Vulnerability of Retrieval-Augmented Generation within Knowledge-Intensive Application Domains. paper
PRADA: Protecting against DNN Model Stealing Attacks. paper
Information Laundering for Model Privacy. paper