This repository processes transcripts of real coding agent usage into per-transcript metrics, following the methodology outlined in this blog post.
At a high level, it:
- Reads coding transcripts into a common format.
- Estimates how much active time the human spent guiding the agent.
- Estimates how long it would've taken an experienced SWE to complete tasks the agent did in the transcript (task difficulty).
- Computes agent session concurrency per user per day.
The output metrics — task difficulty, concurrency-adjusted human active time, and daily concurrency — can be used to compute a "time savings factor" (how much time AI saved the human) and perform other analyses of coding agent productivity. The blog post describes the time savings factor methodology and analysis in detail.
This codebase is designed to be easily configurable to run over any agentic coding transcripts you have in any format. See the tutorial section below.
Follow this tutorial to run the pipeline on the bundled example data, then customize it for your own transcripts.
- Python 3.11+
- uv
```shell
git clone https://github.com/METR/task-speedup.git
cd task-speedup
uv sync --active --extra inspect-ai
```
This installs the core pipeline plus the `inspect-ai` dependency used by the example adapter in the demo.
```shell
export OPENAI_API_KEY="your-key-here"
uv run python -m task_speedup.pipeline
```
This runs the bundled Claude Code adapter on the example data in `example-transcripts/` and writes output to `output/metrics.jsonl`.
The bundled adapter uses OpenAI models via `inspect-ai`, configured by `_SUMMARIZATION_MODEL` (`gpt-4o`) and `_DIFFICULTY_MODEL` (`gpt-5`) in `adapters.py`. You can change these variables to use Anthropic, Gemini, or other providers supported by `inspect-ai`, as long as you provide the relevant API keys and install the necessary dependencies.
The output file contains one JSON line per user. For example:
```json
{
  "user_id": "alice",
  "transcripts": [
    {
      "transcript_id": "session-aaa111",
      "day": "2025-01-15",
      "task_difficulty_minutes": 45.0,
      "concurrency_adjusted_human_minutes": 12.5
    }
  ],
  "daily_concurrency": {
    "2025-01-15": 1.8
  }
}
```
Each record includes per-transcript metrics (`task_difficulty_minutes`, `concurrency_adjusted_human_minutes`) and per-day concurrency for that user.
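A minimal sketch of consuming this output, assuming the field names shown above. Note that the per-transcript ratio computed here is only an illustration — the actual time-savings-factor methodology is defined in the blog post:

```python
import json


def load_metrics(path: str) -> list[dict]:
    """Read one JSON record per line from the pipeline output file."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def savings_ratios(record: dict) -> dict[str, float]:
    """Illustrative per-transcript ratio of estimated solo-SWE minutes to
    concurrency-adjusted human minutes. NOTE: this formula is a sketch,
    not the blog post's time-savings-factor definition."""
    return {
        t["transcript_id"]: t["task_difficulty_minutes"]
        / t["concurrency_adjusted_human_minutes"]
        for t in record["transcripts"]
    }
```

For the example record above, `savings_ratios` would yield `45.0 / 12.5 = 3.6` for `session-aaa111`.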
To run on your own transcripts, customize three things in `src/task_speedup/adapters.py`:
- `load_transcripts()` — Point at your data, parse your format into `Transcript`/`Turn` objects, classify human-typed turns, and optionally extract code diffs.
- `llm_fn()` — Swap in your preferred LLM client/models (the bundled implementation uses `inspect-ai`).
- `COMPRESSION_STRIP_PATTERNS` — Add regex patterns to strip format-specific content (e.g. base64 images) before LLM compression.
See Adapter Reference for full documentation on each component.
Organize your transcripts in a directory structure that your `load_transcripts()` can find. The bundled adapter expects:
```
example-transcripts/
  alice/
    projects/-Users-alice-myproject/
      session-aaa111.jsonl
      session-aaa222.jsonl
  bob/
    projects/-Users-bob-webapp/
      session-bbb111.jsonl
      session-bbb222.jsonl
```
The exact layout is up to you — just match what your `load_transcripts()` expects.
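As an illustration, a loader that walks the bundled layout might look like the following. This sketch returns plain dicts; a real adapter would construct the `Transcript`/`Turn` objects from `task_speedup.types`, whose exact constructors are not shown here:

```python
import json
from pathlib import Path


def load_transcripts(root: str = "example-transcripts") -> list[dict]:
    """Walk <root>/<user>/projects/<project>/<session>.jsonl files.

    Hypothetical sketch: returns plain dicts instead of the repository's
    Transcript/Turn types, and does no filtering or turn classification.
    """
    transcripts = []
    for session_file in sorted(Path(root).glob("*/projects/*/*.jsonl")):
        transcripts.append({
            "user_id": session_file.parts[-4],       # e.g. "alice"
            "transcript_id": session_file.stem,      # e.g. "session-aaa111"
            "turns": [
                json.loads(line)
                for line in session_file.read_text().splitlines()
                if line.strip()
            ],
        })
    return transcripts
```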
```shell
uv run python -m task_speedup.pipeline
```
Same command as before — it now uses your custom adapter and data.
The top-level orchestrator is `run_pipeline`, which chains four independent metric computations before writing the final output:
```python
def run_pipeline(transcripts, config, llm_fn, output_path, compression_strip_patterns=None):
    human_minutes = compute_concurrency_adjusted_human_minutes(transcripts, config["bucket_size_minutes"])
    concurrency_result = compute_concurrency(transcripts, config["inactivity_timeout_minutes"])
    difficulty_result = estimate_difficulty(transcripts, config, llm_fn, compression_strip_patterns)
    results = assemble_results(transcripts, human_minutes, concurrency_result, difficulty_result)
    write_output_jsonl(results, output_path)
```
- `compute_concurrency_adjusted_human_minutes` — Counts unique time buckets containing human-typed turns per transcript, splitting shared buckets across concurrent sessions to avoid double-counting active time.
- `compute_concurrency` — Splits each transcript into active timespans (based on inactivity gaps), then uses a sweep-line algorithm to compute time-weighted average session concurrency per user per day.
- `estimate_difficulty` — A two-stage LLM pipeline: first summarizes assistant turns, then compresses each transcript and asks the LLM to estimate how long it would've taken an experienced engineer to produce the net successful output.
- `assemble_results` — Merges the three metric streams into per-user output with transcript-level metrics and daily concurrency.
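The sweep-line idea behind `compute_concurrency` can be sketched with a standalone function. This is an illustration of the technique, not the repository's implementation; spans are assumed to be `(start, end)` pairs in minutes:

```python
def average_concurrency(spans: list[tuple[float, float]]) -> float:
    """Time-weighted average number of overlapping active spans.

    Illustrative sweep-line sketch: sort the span boundaries as events
    (+1 at start, -1 at end), sweep left to right, and weight the number
    of active spans in each interval by the interval's length.
    """
    events = sorted([(s, +1) for s, _ in spans] + [(e, -1) for _, e in spans])
    active = 0          # spans currently open
    weighted = 0.0      # sum of (active count * interval length)
    covered = 0.0       # total time with at least one active span
    prev = None
    for t, delta in events:
        if prev is not None and active > 0:
            weighted += active * (t - prev)
            covered += t - prev
        active += delta
        prev = t
    return weighted / covered if covered else 0.0
```

For example, two spans `(0, 10)` and `(5, 15)` overlap for 5 of the 15 covered minutes, giving an average concurrency of `20 / 15 ≈ 1.33`.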
The adapter module (`src/task_speedup/adapters.py`) defines three components you customize for your transcript format. The bundled implementation handles Claude Code transcripts as a working example.
`load_transcripts()` takes no arguments and returns `list[Transcript]`. Each lab hardcodes its own data source and filtering logic inside the function body. This single function handles four sub-responsibilities:
- Filter to meaningful transcripts — Deduplicate files, remove warmup/heartbeat agent sessions if applicable, keep main agent sessions, and exclude subagent sessions (coding-agent-specific logic).
- Parse raw format — Convert your raw data into `Turn` objects with `role`, `content`, `timestamp`, and `turn_index`.
- Classify human-typed turns — Set `is_human_typed` on each `Turn` to distinguish human input from automated messages (tool results, system injections, etc.). This can be tricky depending on the coding agent. For example, Claude Code has many automated user messages, as well as assistant messages that actually contain user input, such as a user's rejection of a proposed tool call.
- Optionally extract code diffs — Populate `code_diffs` on assistant turns for more accurate difficulty estimation.
`llm_fn()` runs LLM evaluation on samples. We use this function both to summarize assistant turns in order to compress transcripts and to estimate a transcript's difficulty. Each sample has `id` and `input` keys; the return value maps sample IDs to parsed tool-call results. We provide an example implementation using Inspect in `task_speedup.adapters` (requires the `inspect-ai` optional dependency).
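A hypothetical stub illustrating the expected shape (the real signature may differ; a real implementation would call an LLM client here — the bundled adapter uses `inspect-ai`):

```python
def llm_fn(samples: list[dict]) -> dict[str, dict]:
    """Map each sample's "id" to a parsed result for its "input".

    Placeholder sketch: echoes a truncated input instead of calling a model.
    """
    results = {}
    for sample in samples:
        # Replace this echo with a real model call plus tool-call parsing.
        results[sample["id"]] = {"output": f"summary of: {sample['input'][:40]}"}
    return results
```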
`COMPRESSION_STRIP_PATTERNS` is a list of regex patterns to strip from user turn content before LLM compression. This prevents large embedded content (e.g. base64 images, binary blobs) from inflating context and causing LLM errors. Each pattern is applied via `re.sub(pattern, "[REMOVED]", content)` and passed to `run_pipeline` as `compression_strip_patterns`.
```python
COMPRESSION_STRIP_PATTERNS: list[str] = [
    r"data:image/[^;]+;base64,[A-Za-z0-9+/=]+",  # inline base64 images
]
```
Find all important type definitions in `task_speedup.types`.
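A short self-contained sketch of how such patterns are applied, mirroring the `re.sub` call described above (the helper name is hypothetical):

```python
import re

COMPRESSION_STRIP_PATTERNS: list[str] = [
    r"data:image/[^;]+;base64,[A-Za-z0-9+/=]+",  # inline base64 images
]


def strip_for_compression(content: str, patterns: list[str]) -> str:
    """Replace each pattern match with "[REMOVED]" before LLM compression."""
    for pattern in patterns:
        content = re.sub(pattern, "[REMOVED]", content)
    return content
```

For example, `strip_for_compression("see data:image/png;base64,QUJD here", COMPRESSION_STRIP_PATTERNS)` yields `"see [REMOVED] here"`.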
`pipeline_config.json` controls pipeline behavior:
| Field | Default | Description |
|---|---|---|
| `turn_summarization_prompt_file` | — | Path to turn summarization prompt template |
| `difficulty_estimation_prompt_file` | — | Path to difficulty estimation prompt template |
| `max_user_turn_chars` | (none) | Truncate user turns beyond this length (shipped config: 5000) |
| `bucket_size_minutes` | 10 | Time bucket size for human active minutes |
| `inactivity_timeout_minutes` | 10 | Gap threshold for splitting sessions into active spans |
Prompt files use `{assistant_turn_content}` and `{compressed_transcript}` as template variables. Feel free to adjust the prompt files. See the blog post for preliminary validation of the judge prompts.
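Assuming the templates are filled with Python's `str.format` (the template text below is invented for illustration), substitution looks like:

```python
# Hypothetical template text; real templates are loaded from the files
# configured in pipeline_config.json.
template = "Estimate how long an experienced SWE would need:\n{compressed_transcript}"
prompt = template.format(compressed_transcript="<summarized turns...>")
```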
```shell
uv run pytest tests/     # Run tests
uv run ruff format .     # Format code
uv run ruff check .      # Lint
uv run basedpyright .    # Type check
```