Golden Transcription Selection Pipeline v5 for the Transcription Assessment challenge (Arabic_SA dataset).
Given an audio sample and five candidate transcriptions, the pipeline identifies the golden (ground-truth) transcription using a two-regime adaptive ensemble of ASR similarity scoring, inter-option consensus analysis, and quality filtering.
- Architecture Overview
- Repository Structure
- Quick Start
- Module Reference
- Notebook
- Results
- Report
- Team
- License
| Component | Detail |
|---|---|
| ASR Model 1 | Whisper large-v3 (1.55B params, auto language detect, beam search) |
| ASR Model 2 | SeamlessM4T v2-large (~2.3B params, explicit tgt_lang=arb) |
| Hard Filters | Script detection (>1000 chars), speaker labels, stage directions, scene markers |
| Scoring Signals | ASR similarity, inter-option consensus, fluency, quality penalty, relative length |
| Regime Detection | Per-sample diversity: similar (consensus-heavy) vs. diverse (ASR-heavy) |
| Optimization | Grid search over weight space, LOO-CV validation |
Roughly half of the 49 labelled samples fall into the similar regime, where the options differ only in punctuation and formatting. There, the golden transcription is the "odd one out" with the lowest consensus (least similar to its peers), which the negative consensus weight rewards.
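To illustrate the consensus signal, here is a minimal sketch with hypothetical helper names; `difflib.SequenceMatcher` stands in for the pipeline's actual similarity metric, and real inputs would be normalized first:

```python
from difflib import SequenceMatcher

def consensus_scores(options):
    """Mean pairwise similarity of each option to its four peers."""
    scores = []
    for i, opt in enumerate(options):
        peers = [o for j, o in enumerate(options) if j != i]
        sims = [SequenceMatcher(None, opt, p).ratio() for p in peers]
        scores.append(sum(sims) / len(sims))
    return scores

def odd_one_out(options):
    """In the similar regime, the golden option is the least like its peers."""
    scores = consensus_scores(options)
    return scores.index(min(scores))
```

In the full pipeline this raw consensus enters the weighted scorer with a negative weight rather than being used as a hard argmin.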
```
Audio + 5 Candidates
          |
          v
+--------------+     +-----------------+
|  Whisper v3  |     | SeamlessM4T v2  |
+--------------+     +-----------------+
        \                 /
         v               v
    +-------------------------+
    |   Signal Computation    |
    | (similarity, consensus, |
    |   fluency, quality)     |
    +-------------------------+
              |
              v
    +---------------------+
    |  Regime Detection   |
    | (similar / diverse) |
    +---------------------+
              |
              v
    +----------------------+
    |  Weighted Scoring    |
    | (per-regime weights) |
    +----------------------+
              |
              v
    +----------------------+
    |   Golden Selection   |
    +----------------------+
```
```
GoldenASR/
|-- golden_asr/                        # Main Python package
|   |-- __init__.py                    # Package metadata
|   |-- __main__.py                    # CLI entry point
|   |-- config.py                      # All configuration constants
|   |-- pipeline.py                    # End-to-end pipeline orchestrator
|   |-- data/
|   |   |-- loader.py                  # Dataset CSV loading and parsing
|   |   |-- downloader.py              # Parallel audio file downloading
|   |-- transcription/
|   |   |-- whisper_asr.py             # Whisper large-v3 backend
|   |   |-- seamless_asr.py            # SeamlessM4T v2-large backend
|   |-- preprocessing/
|   |   |-- normalization.py           # Arabic-optimized text normalization
|   |   |-- filters.py                 # Hard quality filters
|   |-- scoring/
|   |   |-- signals.py                 # Scoring signal computation
|   |   |-- regime.py                  # Regime detection (similar/diverse)
|   |   |-- selection.py               # Weighted option selection
|   |-- optimization/
|   |   |-- grid_search.py             # Weight grid search
|   |   |-- validation.py              # LOO-CV and evaluation
|   |-- output/
|       |-- writer.py                  # CSV output generation
|       |-- visualization.py           # Analysis panel plotting
|-- notebooks/
|   |-- golden_transcription_sa.ipynb  # Original monolithic notebook
|-- results/                           # Placeholder for output artifacts
|-- requirements.txt                   # Python dependencies
|-- .gitignore
|-- README.md
```
- Python 3.10+
- CUDA-capable GPU (recommended for Whisper and SeamlessM4T inference)
- ffmpeg installed and on PATH
```bash
git clone https://github.com/your-org/GoldenASR.git
cd GoldenASR
pip install -r requirements.txt
```

As a Python module (CLI):

```bash
python -m golden_asr --csv path/to/dataset.csv --output results/output.csv
```

From Python code:
```python
from golden_asr.pipeline import run

output_df = run(
    csv_path="path/to/dataset.csv",
    output_csv="results/output.csv",
    audio_dir="audio_files",
    plot_path="results/analysis.png",
)
```

On Kaggle (auto-detect dataset):
```python
from golden_asr.pipeline import run

output_df = run()  # CSV path auto-detected from /kaggle/input/
```

| Flag | Description | Default |
|---|---|---|
| `--csv` | Path to dataset CSV | Auto-detected on Kaggle |
| `--output` | Path for submission CSV | `golden_transcriptions_output.csv` |
| `--audio-dir` | Directory for downloaded audio | `audio_files` |
| `--plot` | Path for analysis panel PNG | Derived from output path |
`config.py`: Central configuration file containing device settings, model parameters, the grid search space, and default weights. Edit this file to tune hyperparameters without modifying pipeline logic.

`data/loader.py`: Loads the transcription assessment CSV and parses the `correct_option` column into integer form. Supports auto-detection of dataset paths on Kaggle.

`data/downloader.py`: Downloads audio files in parallel using a thread pool, skipping files that already exist locally.
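The download step can be sketched as follows; the function name and the injectable `fetch` callable are illustrative (the real module presumably wires in an HTTP client directly):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def download_all(urls, audio_dir, fetch, max_workers=8):
    """Download audio files in parallel, skipping ones already on disk.
    `fetch` is any callable url -> bytes (e.g. a requests.get wrapper)."""
    os.makedirs(audio_dir, exist_ok=True)

    def _one(url):
        path = os.path.join(audio_dir, os.path.basename(url))
        if os.path.exists(path):  # skip files already downloaded
            return path
        data = fetch(url)
        with open(path, "wb") as f:
            f.write(data)
        return path

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_one, urls))
```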
`transcription/whisper_asr.py`: Wraps OpenAI Whisper large-v3 for batch transcription. Handles model loading, inference with beam search and temperature fallback, and GPU memory cleanup.

`transcription/seamless_asr.py`: Wraps Meta SeamlessM4T v2-large for batch transcription with explicit target-language selection (`tgt_lang=arb`).

`preprocessing/normalization.py`: Arabic-optimized text normalization: diacritic removal, alef-variant normalization, punctuation stripping, and whitespace collapsing.
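A minimal sketch of the normalization steps just listed; the exact character classes in the real module may differ:

```python
import re

# Arabic diacritics (tashkeel), superscript alef, and tatweel
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")
_PUNCT = re.compile(r"[^\w\s]")

def normalize_arabic(text):
    """Strip diacritics, unify alef variants, drop punctuation,
    and collapse whitespace before any similarity comparison."""
    text = _DIACRITICS.sub("", text)
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # آ / أ / إ -> ا
    text = _PUNCT.sub(" ", text)
    return " ".join(text.split())
```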
`preprocessing/filters.py`: Hard quality filters that detect and reject candidates containing screenplay scripts, speaker labels, stage directions, or scene markers.
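The filters might look like the sketch below; the regex patterns are assumptions chosen to match the artifact types named above, not the module's actual patterns:

```python
import re

MAX_CHARS = 1000  # candidates longer than this look like full scripts

# Hypothetical patterns for screenplay artifacts
_SPEAKER = re.compile(r"^\s*[\w\u0600-\u06FF ]{1,30}:", re.MULTILINE)  # "NAME: line"
_STAGE = re.compile(r"[\[\(][^\]\)]{3,}[\]\)]")                        # [laughs], (pause)
_SCENE = re.compile(r"\b(INT\.|EXT\.|SCENE|FADE IN)\b", re.IGNORECASE)

def passes_hard_filters(text):
    """Reject candidates that look like scripts rather than transcriptions."""
    if len(text) > MAX_CHARS:
        return False
    if _SPEAKER.search(text) or _STAGE.search(text) or _SCENE.search(text):
        return False
    return True
```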
`scoring/signals.py`: Computes per-option scoring signals: Whisper similarity, SeamlessM4T similarity, inter-option consensus, fluency proxy, quality penalty, and relative length.

`scoring/regime.py`: Classifies each sample as "similar" or "diverse" based on option diversity (1 - mean consensus).
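The regime rule reduces to a one-line threshold test; the threshold value below is the tuned one reported in the Results section, and the function name is illustrative:

```python
DIVERSITY_THRESHOLD = 0.0678  # tuned value from the Results section

def detect_regime(consensus_scores, threshold=DIVERSITY_THRESHOLD):
    """Diversity is 1 minus the mean inter-option consensus; samples
    above the threshold are 'diverse', the rest 'similar'."""
    diversity = 1.0 - sum(consensus_scores) / len(consensus_scores)
    return "diverse" if diversity > threshold else "similar"
```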
`scoring/selection.py`: Scores candidate options with a weighted combination of signals and selects the best one, switching weights per regime.
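Selection is a weighted argmax over the six signals. The sketch below uses the tuned per-regime weights from the Results section; the function name and dict layout are assumptions:

```python
# Per-regime weights mirroring the tuned values in the Results section
WEIGHTS = {
    "similar": {"whisper": 0.0, "seamless": 0.0, "consensus": -0.1,
                "fluency": -0.1, "quality": 0.1, "length": 0.0},
    "diverse": {"whisper": 0.0, "seamless": 0.0, "consensus": -0.1,
                "fluency": 0.0, "quality": 0.1, "length": 0.0},
}

def select_option(signals, regime):
    """signals: one dict of signal values per option.
    Returns the index of the highest weighted score."""
    w = WEIGHTS[regime]
    scores = [sum(w[k] * s[k] for k in w) for s in signals]
    return scores.index(max(scores))
```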
`optimization/grid_search.py`: Exhaustive grid search over the six-dimensional weight space, run independently for each regime.
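The search is a straightforward product over the per-signal grid; the three-point grid below is illustrative (the real search space lives in `config.py`), and the `select` callable is any scorer taking (signals, weights):

```python
from itertools import product

def grid_search(signals_by_sample, labels, select, grid=(-0.1, 0.0, 0.1)):
    """Exhaustive search over the six-dimensional weight grid; returns
    the weight dict with the best accuracy on the labelled samples."""
    names = ("whisper", "seamless", "consensus", "fluency", "quality", "length")
    best_w, best_acc = None, -1.0
    for combo in product(grid, repeat=len(names)):  # 3^6 = 729 combinations
        w = dict(zip(names, combo))
        correct = sum(select(sample, w) == gold
                      for sample, gold in zip(signals_by_sample, labels))
        acc = correct / len(labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```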
`optimization/validation.py`: Evaluation utilities, including weight testing on labelled data and leave-one-out cross-validation for model selection.
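The LOO-CV loop can be sketched generically; `fit` and `predict` are placeholder callables standing in for the grid search and the weighted selector:

```python
def loo_cv(samples, labels, fit, predict):
    """Leave-one-out cross-validation: fit on n-1 labelled samples,
    predict the held-out one, and report overall accuracy."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        correct += predict(model, samples[i]) == labels[i]
    return correct / len(samples)
```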
`output/writer.py`: Generates the submission CSV with golden-transcription predictions and per-option WER scores, plus a detailed signal-scores CSV.

`output/visualization.py`: Produces a 2x3 analysis panel: golden-option distribution, diversity histogram, regime accuracy comparison, weight visualization, signal comparison, and per-sample correctness scatter plot.

`pipeline.py`: Orchestrates the full end-to-end pipeline: data loading, audio download, dual ASR transcription, signal computation, regime detection, grid-search optimization, LOO-CV, prediction, output writing, and visualization.
The original monolithic Kaggle submission notebook is preserved at:
notebooks/golden_transcription_sa.ipynb
This notebook contains the same logic as the modularized package, structured for direct execution on Kaggle with GPU runtime. It is kept as a reference and for reproducibility of the original submission.
| Metric | Value |
|---|---|
| Validation Accuracy | 47/49 (95.9%) |
| LOO-CV (Adaptive) | 47/49 (95.9%) |
| LOO-CV (Single) | 47/49 (95.9%) |
| Adaptive - Similar | 23/24 (95.8%) |
| Adaptive - Diverse | 24/25 (96.0%) |
| Weight | Similar Regime | Diverse Regime | Single Regime |
|---|---|---|---|
| whisper | 0.0 | 0.0 | 0.0 |
| seamless | 0.0 | 0.0 | 0.0 |
| consensus | -0.1 | -0.1 | -0.1 |
| fluency | -0.1 | 0.0 | -0.1 |
| quality | 0.1 | 0.1 | 0.1 |
| length | 0.0 | 0.0 | 0.0 |
| Option | Count |
|---|---|
| option_1 | 28 |
| option_2 | 24 |
| option_3 | 20 |
| option_4 | 15 |
| option_5 | 13 |
| Regime | Count |
|---|---|
| Similar | 42 |
| Diverse | 58 |
- Diversity Threshold: 0.0678
- Mode Selected: Adaptive two-regime
- Dataset: Arabic_SA (100 samples, 49 labelled)
- ASR Models: Whisper large-v3 + SeamlessM4T v2-large
Team Rocket
This project is provided for the Transcription Assessment challenge. See repository-level license for details.