# CueWords

A research project that analyzes gender agreement patterns in fine-tuned language models using context mixing and value patching interpretability techniques.
This project investigates how language models learn and maintain gender agreement patterns through:
- Data Creation: Generating gender agreement datasets from Wikipedia biographies
- Fine-tuning: Training various transformer models (BERT, RoBERTa, GPT-2) on gender agreement tasks
- Context Mixing Analysis: Measuring how well models maintain gender consistency across different contexts
- Value Patching: Analyzing and modifying model representations to understand gender encoding
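As an illustration of the value-patching idea, here is a minimal sketch using PyTorch forward hooks on a toy model: a hidden state is captured from one run and substituted into another. The hook-based approach and all names here are illustrative, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

# Toy two-layer network standing in for a transformer's layer stack.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

source = torch.randn(1, 4)  # run whose hidden state we capture
target = torch.randn(1, 4)  # run whose hidden state we overwrite

captured = {}

def capture_hook(module, inputs, output):
    # Record the hidden state produced during the source run.
    captured["h"] = output.detach()

def patch_hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output.
    return captured["h"]

layer = model[0]  # patch the representation after the first layer

handle = layer.register_forward_hook(capture_hook)
out_source = model(source)
handle.remove()

handle = layer.register_forward_hook(patch_hook)
out_patched = model(target)
handle.remove()

# Patching the full hidden state makes the target run reproduce the
# source run's output, since everything downstream is deterministic.
print(torch.allclose(out_patched, out_source))  # True
```

In real value-patching experiments the same mechanism is applied to individual attention-head value vectors rather than a whole layer output, and the change in the model's prediction measures how much that component encodes.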
## Project Structure

```
CueWords/
├── data_creation/    # Dataset generation scripts
├── fine_tuning/      # Model training and fine-tuning
├── context_mixing/   # Context mixing analysis tools
├── value_patching/   # Value patching implementation
├── data/             # Generated datasets
```
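For illustration, the data-creation step might turn biography fields into cloze-style agreement examples along these lines. The template, field names, and output schema below are assumptions for the sketch, not the repository's actual format:

```python
def make_example(name, gender, occupation):
    """Build a cloze-style gender-agreement example from biography fields.

    Hypothetical helper: the real dataset is generated from WikiBio
    biographies, not from hand-supplied fields like these.
    """
    pronoun = {"male": "he", "female": "she"}[gender]
    context = f"{name} is a {occupation}."
    cloze = "In 2003, [MASK] retired."  # the model must recover the pronoun
    return {"context": context, "cloze": cloze, "target": pronoun}

example = make_example("Marie Curie", "female", "physicist")
print(example["target"])  # she
```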
## Features

- Multi-Model Support: Works with BERT, RoBERTa, and GPT-2 architectures
- Gender Agreement Dataset: Automatically generated from WikiBio dataset
- Context Mixing Toolkit: Integrated analysis tools for measuring model consistency
- Value Patching: Advanced techniques for analyzing and modifying model representations
- Comprehensive Evaluation: Multiple metrics and analysis approaches
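A consistency metric of the kind mentioned above can be as simple as the fraction of predicted pronouns that agree with the entity's gender across contexts. This is a hypothetical sketch of such a metric, not the toolkit's actual scoring code:

```python
def consistency_score(predictions):
    """Fraction of pronoun predictions matching the gold gender.

    predictions: list of (gold_gender, predicted_pronoun) pairs.
    The data format is an assumption, not the repository's schema.
    """
    pronoun_gender = {"he": "male", "him": "male", "his": "male",
                      "she": "female", "her": "female", "hers": "female"}
    correct = sum(1 for gold, pron in predictions
                  if pronoun_gender.get(pron) == gold)
    return correct / len(predictions)

preds = [("female", "she"), ("female", "her"),
         ("female", "he"), ("male", "his")]
print(consistency_score(preds))  # 0.75
```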
## Installation

- Clone the repository:

```
git clone https://github.com/hamid-amir/CueWords
cd CueWords
```

- Install dependencies:

```
pip install -r requirements.txt
```

- Download the SpaCy model:

```
python -m spacy download en_core_web_sm
```

## Usage

Run the complete pipeline:

```
./runAll.sh
```

Or run individual components:

- Generate the dataset:

```
python3 data_creation/gender_agreement.py
```

- Fine-tune models:

```
python3 fine_tuning/train.py
```

- Calculate context mixing scores:

```
python3 context_mixing/main.py
```

This project is designed for researchers studying:
- Gender bias in language models
- Model interpretability and analysis
- Context mixing and consistency in NLP
- Value patching and representation analysis
- Fine-tuning effects on model behavior
## Citation

If you use this code in your research, please cite the following paper and acknowledge this repository:
```bibtex
@inproceedings{amirzadeh-etal-2024-language,
    title = "How Language Models Prioritize Contextual Grammatical Cues?",
    author = "Amirzadeh, Hamidreza and
      Alishahi, Afra and
      Mohebbi, Hosein",
    booktitle = "Proceedings of the 7th BlackboxNLP Workshop at EMNLP 2024: Analyzing and Interpreting Neural Networks for NLP",
    month = nov,
    year = "2024",
    address = "Miami, Florida, US",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.blackboxnlp-1.21/",
    doi = "10.18653/v1/2024.blackboxnlp-1.21",
    pages = "315--336",
}
```

## License

MIT
## Contact

For questions or issues, please open an issue on the repository.