Sharif-SLPL/tComp

Token-Level Attribution and Alignment Analysis for Machine Translation

This repository explores token-level contribution (attribution) analysis and word alignment quality for transformer-based machine translation models. It combines fairseq-based translation pipelines with novel interpretability tooling (tComp) and alignment evaluation (AER) to study how input and previously generated tokens contribute to each new token prediction. The repo contains reproducible notebooks, alignment utilities, configuration dataclasses for token-compositional analyses, and a local editable installation of fairseq for controlled experimentation.

Repository structure and contents

  • notebooks/: Reproducible experiments and demos
    • AER.ipynb: Alignment Error Rate (AER) evaluation workflow
    • changed_fairseq_usage_tcomp.ipynb: Usage examples of our modified fairseq library, which includes the tComp interpretability method, for translation and token-level analysis
  • alignment/: Alignment utilities
    • align.py: Alignment routines
    • aer.py: Alignment Error Rate computation utilities
  • tcomp_utils.py: Dataclasses for token-compositional analysis configuration and outputs (encoder/decoder fields)
  • fairseq-main/: Local editable copy of fairseq (installed in editable mode); contains upstream code, examples, and CLI tools
  • requirements.txt: Python package versions used in this project
  • set-up.sh: Convenience script to fetch and install fairseq in editable mode with required tokenization tools

Data and file description

  • Model files: A URL is provided for the pretrained WMT19 de-en model in notebooks/changed_fairseq_usage_tcomp.ipynb for local experimentation.
  • Input data: We provide paths/links to source/target texts used in experiments.
  • Derived data / results: Alignment outputs, attribution tensors, and metrics (e.g., AER) can be produced by the notebooks and scripts.

Installation

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
bash set-up.sh  # installs fairseq in editable mode and tokenization utilities
```

Notes

  • set-up.sh downloads upstream fairseq (main branch) and installs it with `pip install -e .`. Restart your Python kernel after installation if you are using notebooks.
  • CUDA builds in requirements.txt (e.g., torch==2.3.1+cu118) may require a matching CUDA toolkit/driver.
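As a quick post-installation sanity check, a small snippet like the following (a generic sketch, not part of this repo) can confirm which torch build is installed and whether it actually sees a CUDA device:

```python
def cuda_report():
    """Return a one-line description of the installed torch/CUDA setup."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if torch.cuda.is_available():
        # CUDA-enabled build with a visible GPU
        return f"torch {torch.__version__}, CUDA {torch.version.cuda}"
    # Either a CPU-only build or no usable driver/device
    return f"torch {torch.__version__}, CPU only"

print(cuda_report())
```

If this reports "CPU only" despite a `+cu118` wheel being installed, the driver/toolkit mismatch mentioned above is the usual culprit.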

Usage and reproducibility

  • Notebooks:
    • Use notebooks/changed_fairseq_usage_*.ipynb to run our tComp interpretability method on a de-en transformer-based machine translation model.
    • Use notebooks/AER.ipynb to compute AER on alignment outputs.
  • Alignment utilities:
    • alignment/align.py: Produce alignments from parallel data or model outputs.
    • alignment/aer.py: Compute AER given gold and predicted alignments.
  • Token-level analysis:
    • Configure the tComp method via the tcomp_utils.tcompConfig dataclass (e.g., include biases, FFN approximation types, layer outputs).
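For reference, the standard AER metric (Och & Ney, 2000) that alignment/aer.py computes can be sketched as follows. This is a generic illustration over sure gold links S, possible gold links P, and predicted links A, not the repo's own implementation:

```python
def aer(sure, possible, predicted):
    """Alignment Error Rate: 1 - (|A∩S| + |A∩P|) / (|A| + |S|).

    sure/possible/predicted: iterables of (src_idx, tgt_idx) link pairs.
    Sure links are treated as a subset of possible links.
    """
    a, s = set(predicted), set(sure)
    p = set(possible) | s  # every sure link is also possible
    if not a and not s:
        return 0.0
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

sure = {(0, 0), (1, 1)}
possible = {(0, 0), (1, 1), (2, 1)}

# Predicting exactly the sure links gives a perfect score
print(aer(sure, possible, sure))                     # → 0.0
# A prediction with one sure hit, one possible hit, one miss
print(aer(sure, possible, {(0, 0), (2, 1), (3, 2)})) # → 0.4
```

Lower is better: predictions are rewarded for covering sure links and only lightly penalized for extra links that are at least possible.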

How to cite

Example BibTeX stub:

@misc{thesis_attribution_alignment,
  author  = {Amirzadeh, Hamidreza},
  title   = {A Novel Token-Level Attribution and Alignment Analysis for Machine Translation},
  year    = {2025},
  howpublished = {Git repository},
  url     = {https://github.com/hamid-amir/tComp},
}

License

MIT

Contact

For questions or issues, please open an issue on the repository.
