Red-teaming with Persuasive Rephrasing

LLM-Assisted Red Teaming

This repository offers an end‐to‐end, LLM‐assisted red teaming framework that leverages persuasive rephrasing to probe and improve AI safety. At its core, the project uses a structured taxonomy of persuasion techniques to transform ordinary prompts into adversarial examples that challenge model guardrails. Results demonstrated in article here.

Functionality

The toolkit is organized into five key functionalities:

flowchart TB
    subgraph Core["Core Functions"]
        dp[dataset_creation]
    end

    subgraph Data Generation
        sft_data[sft_train_data]
        rl_data[rl_train_data]
    end

    subgraph Training
        sft[SFT Training Notebook]
        dpo[DPO Training Notebook]
    end

    subgraph Deployment
        api[FastAPI Deployment]
        jbb[Model evaluation with jailbreakbench]
    end

    %% Core function relationships
    dp --> sft_data
    dp --> rl_data

    %% Data flow for training
    sft_data --> sft
    rl_data --> dpo

    %% Model flow
    sft --> dpo
    dpo --> api
    
    %% API access
    api --> jbb

    %% Styling
    classDef coreNode fill:#f9f,stroke:#333,stroke-width:2px
    classDef dataNode fill:#bbf,stroke:#333,stroke-width:2px
    classDef trainNode fill:#bfb,stroke:#333,stroke-width:2px
    classDef deployNode fill:#fbb,stroke:#333,stroke-width:2px

    class dp coreNode
    class sft_data,rl_data dataNode
    class sft,dpo trainNode
    class api,jbb deployNode

Dataset generation for fine-tuning: Generate datasets to rephrase prompts to be more persuasive using a structured approach and taxonomy.
Fine-tuning notebook: Fine-tune base model with generated dataset.
Dataset generation for Direct Preference Optimization notebook: Generate a preference dataset with an AI judge that picks a preferred response.
Direct Preference Optimization notebook: Fine-tune model with preference dataset.
Serving with FastAPI notebook: The fine-tuned model is served using FastAPI, allowing for interactive chat and API access.

1. Dataset Generation Pipeline

This library demonstrates a pipeline for creating a comprehensive dataset of persuasively revised text using various persuasion techniques. The pipeline processes input text through multiple stages to generate high-quality training data for supervised fine-tuning. It follows these steps:

Initialize Persuasion Techniques: The pipeline starts by loading a predefined set of 40 persuasion techniques from a structured dataclass. Each technique includes:

A unique identifier and name
A strategy category (e.g., Information-based, Credibility-based)
Ethical classification
Definition and example messages

Load Source Dataset: The pipeline loads the Anthropic/hh-rlhf dataset, particularly focusing on the training split. It extracts questions from conversation strings using regex pattern matching to identify human queries.
Process Dataset in Parallel: Implements a multiprocessing approach to handle large-scale data processing:

Divides the dataset into batches for parallel processing
Uses a worker pool to distribute processing across CPU cores
Includes progress tracking and regular checkpointing
Handles errors gracefully with detailed logging

Generate Persuasive Revisions: For each input text:

Randomly samples a persuasion technique
Constructs a specialized prompt incorporating the technique's definition
Uses the OpenRouter API with a specified language model to generate the revision
Validates and filters the generated content

Save and Checkpoint Results: The pipeline maintains data integrity through:

Regular saves to both pickle and JSON formats
Checkpoint creation at configurable intervals
Progress tracking with detailed metrics
Error logging and recovery mechanisms

2. SFT training notebook

notebook: sft_model_training.ipynb

This notebook demonstrates a pipeline for fine-tuning a quantized language model for persuasive revision generation. It uses the following steps:

Load a pre-trained language model: Uses the Mistral-7b-v0.3-bnb-4bit model, quantized with BitsAndBytes, and applies LoRA adapters for efficient fine-tuning.
Load and filter training data: Loads a pickled dataset from Google Drive, filters out invalid entries, and selects examples with sufficient content for revision and critique.
Generate persuasive prompts: Processes the dataset in parallel to generate persuasive prompts using custom persuasion techniques from red_teaming_pipeline.
Train the model: Fine-tunes the model using Hugging Face's Trainer with a custom training loop, including a defined prompt template and wandb integration for logging.
Save the model: Saves the fine-tuned model, training state, and checkpoints to Google Drive for future use.

3. RL-DPO training dataset generation notebook

notebook: rl_dpo_training_data_creation.ipynb

This notebook demonstrates a pipeline for red teaming an AI model for safety. It uses the following steps:

Load a pre-trained language model: The notebook uses the Mistral-7b-v0.3-bnb-4bit model, quantized for inference using BitsAndBytes. Additionally, it loads LoRA adapter weights for fine-tuning.
Load a dataset of harmful prompts: The notebook loads the Anthropic/hh-rlhf dataset, specifically the "red-team-attempts" split. It filters for prompts with a harmlessness score below 0.8.
Generate responses: The notebook generates two responses for each prompt using the loaded language model.
Judge the responses: The notebook uses an AI judge (OpenRouter API) to determine which response is better.
Store the results: The notebook stores the prompt, responses, and winner in a list.
Save the data: The notebook periodically saves the collected data to a CSV file.

4. RL-DPO training

notebook: rl_dpo_training.ipynb

This notebook demonstrates a pipeline for fine-tuning a quantized AI model using Direct Preference Optimization (DPO) with LoRA adapters. It follows these steps:

Load a Pre-trained Model: Loads the quantized Mistral-7B-v0.3-bnb-4bit model using BitsAndBytes and applies LoRA adapter weights for fine-tuning.
Prepare the Dataset: Mounts Google Drive to access CSV files, merges them, removes duplicates, and processes the data with a custom red teaming pipeline. The dataset is then split into training and test sets.
Train with DPO: Configures and runs DPO training using TRL's DPOTrainer, adjusting hyperparameters like learning rate, batch size, and gradient accumulation.
Save the Results: Periodically saves model checkpoints and logs to a specified directory in Google Drive.

5. Model FastAPI interface

notebook: model_API_notebook.ipynb

This notebook demonstrates a pipeline for hosting an AI model for chat completions. It uses the following steps:

Load a pre-trained language model: Loads the Mistral-7b-v0.3-bnb-4bit model, quantized for inference using BitsAndBytes, and applies LoRA adapter weights for fine-tuning.
Set up the API: Implements a FastAPI server with an endpoint (/chat/completions) that accepts chat messages and returns generated responses.
Generate responses: Builds an instruction prompt from the input messages and generates a text response using the loaded model.
Expose the server: Hosts the API using uvicorn and provides commands to expose the local server via bore-cli.

This model hosted via FastAPI is evaluated using a forked version of jailbreakbench with some modifications. Read article here if you're interested in the results!

Requirements

Python 3.8 or higher
Libraries: unsloth, litellm, fastapi, uvicorn, pyngrok, accelerate, transformers, nest-asyncio, peft, torch
Google Colab environment or similar with GPU access
A dataset for fine-tuning (e.g., from the Anthropic/hh-rlhf)

Note: All notebooks integrate the red teaming pipeline, cloning the red_teaming_pipeline repository and installing its dependencies to leverage its data processing utilities.

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
dataset_creation		dataset_creation
.gitattributes		.gitattributes
README.md		README.md
model_API_notebook.ipynb		model_API_notebook.ipynb
requirements.txt		requirements.txt
rl_dpo_training.ipynb		rl_dpo_training.ipynb
rl_dpo_training_data_creation.ipynb		rl_dpo_training_data_creation.ipynb
setup.py		setup.py
sft_model_training.ipynb		sft_model_training.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Red-teaming with Persuasive Rephrasing

Functionality

1. Dataset Generation Pipeline

2. SFT training notebook

3. RL-DPO training dataset generation notebook

4. RL-DPO training

5. Model FastAPI interface

Requirements

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Red-teaming with Persuasive Rephrasing

Functionality

1. Dataset Generation Pipeline

2. SFT training notebook

3. RL-DPO training dataset generation notebook

4. RL-DPO training

5. Model FastAPI interface

Requirements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages