This repository explores token-level contribution (attribution) analysis and word alignment quality for transformer-based machine translation models. It combines fairseq-based translation pipelines with novel interpretability tooling (tComp) and alignment evaluation (AER) to study how input tokens and previously generated tokens contribute to each new token prediction. The repo contains reproducible notebooks, alignment utilities, configuration dataclasses for token-compositional analyses, and a local editable installation of fairseq for controlled experimentation.
- `notebooks/`: Reproducible experiments and demos
  - `AER.ipynb`: Alignment Error Rate (AER) evaluation workflow
  - `changed_fairseq_usage_tcomp.ipynb`: Usage examples of the modified `fairseq` library, which includes our tComp interpretability method, for translation and token-level analysis
- `alignment/`: Alignment utilities
  - `align.py`: Alignment routines
  - `aer.py`: Alignment Error Rate computation utilities
- `tcomp_utils.py`: Dataclasses for token-compositional analysis configuration and outputs (encoder/decoder fields)
- `fairseq-main/`: Local copy of `fairseq` installed in editable mode; contains upstream code, examples, and CLI tools
- `requirements.txt`: Python package versions used in this project
- `set-up.sh`: Convenience script to fetch and install `fairseq` in editable mode with required tokenization tools
- Model files: a URL for the pretrained WMT19 de-en model is provided in `notebooks/changed_fairseq_usage_tcomp.ipynb` for local experimentation.
- Input data: we provide paths/links to the source/target texts used in experiments.
- Derived data / results: alignment outputs, attribution tensors, and metrics (e.g., AER) can be produced by the notebooks and scripts.
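Word alignments of this kind are commonly exchanged in Pharaoh format, one `src-tgt` index pair per link; whether this repository's scripts emit exactly this format is an assumption, but a minimal parser for it looks like:

```python
def parse_pharaoh(line):
    """Parse a Pharaoh-format alignment line such as "0-0 1-2 2-1"
    into a set of (src_idx, tgt_idx) pairs (sure "-" links only)."""
    return {tuple(map(int, pair.split("-"))) for pair in line.split()}

print(parse_pharaoh("0-0 1-2 2-1"))
```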
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
bash set-up.sh  # installs fairseq in editable mode and tokenization utilities
```

- `set-up.sh` downloads upstream `fairseq` (main branch) and installs it with `pip install -e .`. Restart your Python kernel after installation if using notebooks.
- CUDA builds in `requirements.txt` (e.g., `torch==2.3.1+cu118`) may require a matching CUDA toolkit/driver.
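After installation, a quick sanity check from Python can confirm that the editable `fairseq` install is visible in the active environment (this only checks that the packages resolve; versions will vary):

```python
import importlib.util

# Report whether the key packages resolve in the current environment.
for pkg in ("fairseq", "torch"):
    status = "found" if importlib.util.find_spec(pkg) else "MISSING"
    print(f"{pkg}: {status}")
```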
- Notebooks:
  - Use `notebooks/changed_fairseq_usage_*.ipynb` to run our tComp interpretability method on a de-en transformer-based machine translation model.
  - Use `notebooks/AER.ipynb` to compute AER on alignment outputs.
- Alignment utilities:
  - `alignment/align.py`: produce alignments from parallel data or model outputs.
  - `alignment/aer.py`: compute AER given gold and predicted alignments.
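For reference, AER is conventionally defined (Och and Ney) over sets of sure links S, possible links P, and predicted links A as 1 − (|A∩S| + |A∩P|) / (|A| + |S|). A self-contained sketch of that formula (the function name and API here are illustrative, not necessarily what `alignment/aer.py` exposes):

```python
def aer(sure, possible, predicted):
    """Alignment Error Rate over sets of (src_idx, tgt_idx) link pairs.

    `possible` is assumed to be a superset of `sure`; it is unioned with
    `sure` here to enforce that convention.
    """
    a = set(predicted)
    s = set(sure)
    p = set(possible) | s
    if not a and not s:
        return 0.0  # nothing predicted, nothing required: perfect by convention
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

gold_sure = {(0, 0), (1, 1)}
gold_poss = {(0, 0), (1, 1), (2, 1)}
pred = {(0, 0), (1, 1), (2, 2)}
print(aer(gold_sure, gold_poss, pred))  # 1 - (2 + 2) / (3 + 2) = 0.2
```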
- Token-level analysis:
  - Configure the tComp method via `tcomp_utils.tcompConfig` (e.g., include biases, FFN approximation types, layer outputs).
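The options above can be pictured as a dataclass along these lines; the field names and defaults below are hypothetical stand-ins — see `tcomp_utils.py` for the actual definition:

```python
from dataclasses import dataclass

@dataclass
class tcompConfig:
    # Hypothetical fields mirroring the options mentioned above.
    include_biases: bool = True        # fold bias terms into token contributions
    ffn_approx: str = "linear"         # how FFN blocks are approximated
    return_layer_outputs: bool = False # also return per-layer decompositions

cfg = tcompConfig(include_biases=False)
print(cfg)
```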
Example BibTeX stub:

```bibtex
@misc{thesis_attribution_alignment,
  author = {Amirzadeh, Hamidreza},
  title = {A Novel Token-Level Attribution and Alignment Analysis for Machine Translation},
  year = {2025},
  howpublished = {Git repository},
  url = {https://github.com/hamid-amir/tComp},
}
```

License: MIT
For questions or issues, please open an issue on the repository.