Coding Agent Transcript Analysis Pipeline

This repository processes transcripts of real coding agent usage into per-transcript metrics, following the methodology outlined in this blog post.

At a high level, it:

  1. Reads coding transcripts into a common format.
  2. Estimates how much active time the human spent guiding the agent.
  3. Estimates how long it would've taken an experienced SWE to complete the tasks the agent performed in the transcript (task difficulty).
  4. Computes agent session concurrency per user per day.

The output metrics — task difficulty, concurrency-adjusted human active time, and daily concurrency — can be used to compute a "time savings factor" (how much time AI saved the human) and perform other analyses of coding agent productivity. The blog post describes the time savings factor methodology and analysis in detail.

This codebase is designed to be easily configurable to run over any agentic coding transcripts you have in any format. See the tutorial section below.

Tutorial

Follow this tutorial to run the pipeline on the bundled example data, then customize it for your own transcripts.

Prerequisites

  • Python 3.11+
  • uv

Step 1: Clone and install

git clone https://github.com/METR/task-speedup.git
cd task-speedup
uv sync --active --extra inspect-ai

This installs the core pipeline plus the inspect-ai dependency used by the example adapter in the demo.

Step 2: Run on example transcripts

export OPENAI_API_KEY="your-key-here"
uv run python -m task_speedup.pipeline

This runs the bundled Claude Code adapter on the example data in example-transcripts/ and writes output to output/metrics.jsonl.

The bundled adapter uses OpenAI models via inspect-ai, configured by _SUMMARIZATION_MODEL (gpt-4o) and _DIFFICULTY_MODEL (gpt-5) in adapters.py. You can change these variables to use Anthropic, Gemini, or other providers supported by inspect-ai, as long as you provide the relevant API keys, and install necessary dependencies.

Step 3: Check the output

The output file contains one JSON line per user. For example:

{
  "user_id": "alice",
  "transcripts": [
    {
      "transcript_id": "session-aaa111",
      "day": "2025-01-15",
      "task_difficulty_minutes": 45.0,
      "concurrency_adjusted_human_minutes": 12.5
    }
  ],
  "daily_concurrency": {
    "2025-01-15": 1.8
  }
}

Each record includes per-transcript metrics (task_difficulty_minutes, concurrency_adjusted_human_minutes) and per-day concurrency for that user.
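To sanity-check a run, you can aggregate the per-transcript metrics yourself. This sketch assumes only the field names shown in the example record; summarize_user and summarize_file are illustrative helpers, not part of the codebase:

```python
import json

# Aggregate the per-transcript metrics in one metrics.jsonl record.
def summarize_user(record: dict) -> dict:
    transcripts = record["transcripts"]
    return {
        "user_id": record["user_id"],
        "total_difficulty_minutes": sum(
            t["task_difficulty_minutes"] for t in transcripts
        ),
        "total_human_minutes": sum(
            t["concurrency_adjusted_human_minutes"] for t in transcripts
        ),
    }

# One record per line, as described above.
def summarize_file(path: str) -> list[dict]:
    with open(path) as f:
        return [summarize_user(json.loads(line)) for line in f if line.strip()]
```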

Step 4: Customize your adapter

To run on your own transcripts, customize three things in src/task_speedup/adapters.py:

  1. load_transcripts() — Point at your data, parse your format into Transcript/Turn objects, classify human-typed turns, and optionally extract code diffs.
  2. llm_fn() — Swap in your preferred LLM client/models (the bundled implementation uses inspect-ai).
  3. COMPRESSION_STRIP_PATTERNS — Add regex patterns to strip format-specific content (e.g. base64 images) before LLM compression.

See Adapter Reference for full documentation on each component.

Step 5: Place your transcript data

Organize your transcripts in a directory structure that your load_transcripts() can find. The bundled adapter expects:

example-transcripts/
  alice/
    projects/-Users-alice-myproject/
      session-aaa111.jsonl
      session-aaa222.jsonl
  bob/
    projects/-Users-bob-webapp/
      session-bbb111.jsonl
      session-bbb222.jsonl

The exact layout is up to you — just match what your load_transcripts() expects.
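For the example layout above, a discovery helper could glob for session files and take the first path component as the user ID (find_sessions is a hypothetical name, not part of the codebase):

```python
from pathlib import Path

# Hypothetical discovery helper for the example layout:
#   <root>/<user>/projects/<project>/<session>.jsonl
def find_sessions(root: Path) -> list[tuple[str, Path]]:
    return sorted(
        (path.relative_to(root).parts[0], path)  # (user_id, session file)
        for path in root.glob("*/projects/*/*.jsonl")
    )
```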

Step 6: Run the pipeline

uv run python -m task_speedup.pipeline

Same command as before — it now uses your custom adapter and data.

Pipeline Overview

The top-level orchestrator is run_pipeline, which chains four independent metrics computations before writing the final output:

def run_pipeline(transcripts, config, llm_fn, output_path, compression_strip_patterns=None):
    human_minutes = compute_concurrency_adjusted_human_minutes(transcripts, config["bucket_size_minutes"])
    concurrency_result = compute_concurrency(transcripts, config["inactivity_timeout_minutes"])
    difficulty_result = estimate_difficulty(transcripts, config, llm_fn, compression_strip_patterns)
    results = assemble_results(transcripts, human_minutes, concurrency_result, difficulty_result)
    write_output_jsonl(results, output_path)
  1. compute_concurrency_adjusted_human_minutes — Counts unique time buckets containing human-typed turns per transcript, splitting shared buckets across concurrent sessions to avoid double-counting active time.
  2. compute_concurrency — Splits each transcript into active timespans (based on inactivity gaps), then uses a sweep-line algorithm to compute time-weighted average session concurrency per user per day.
  3. estimate_difficulty — A two-stage LLM pipeline: first summarizes assistant turns, then compresses each transcript and asks the LLM to estimate how long it would've taken an experienced engineer to produce the net successful output.
  4. assemble_results — Merges the three metric streams into per-user output with transcript-level metrics and daily concurrency.
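The sweep-line idea in step 2 can be sketched as follows. This is an illustration of time-weighted average concurrency over a set of active spans, not the repository's implementation:

```python
# Time-weighted average concurrency over active spans (start, end) in minutes.
def avg_concurrency(spans: list[tuple[float, float]]) -> float:
    events = []
    for start, end in spans:
        events.append((start, 1))   # a span opens
        events.append((end, -1))    # a span closes
    events.sort()
    active = 0          # number of concurrently active spans
    total_time = 0.0    # total time with at least one active span
    weighted = 0.0      # integral of concurrency over that time
    prev = None
    for t, delta in events:
        if prev is not None and active > 0:
            dt = t - prev
            total_time += dt
            weighted += active * dt
        active += delta
        prev = t
    return weighted / total_time if total_time else 0.0
```

Two sessions at (0, 10) and (5, 15) overlap for 5 of 15 active minutes, so the average is (5·1 + 5·2 + 5·1) / 15 ≈ 1.33.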

Adapter Reference

The adapter module (src/task_speedup/adapters.py) defines three components you customize for your transcript format. The bundled implementation handles Claude Code transcripts as a working example.

load_transcripts() -> list[Transcript]

Takes no arguments and returns list[Transcript]. Each lab hardcodes its own data source and filtering logic inside the function body. This single function handles four sub-responsibilities:

  1. Filter to meaningful transcripts — Deduplicate files, remove warmups / heartbeat agent sessions if applicable, keep main agent sessions, exclude subagent sessions (coding agent specific logic).
  2. Parse raw format — Convert your raw data into Turn objects with role, content, timestamp, and turn_index.
  3. Classify human-typed turns — Set is_human_typed on each Turn to distinguish human input from automated messages (tool results, system injections, etc.). This can be a bit tricky depending on the coding agent. For example, Claude Code has many automated user messages, and also many assistant messages that actually contain user inputs, such as user rejection of a proposed tool call.
  4. Optionally extract code diffs — Populate code_diffs on assistant turns for more accurate difficulty estimation.
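Given the fields named above, the turn and transcript shapes look roughly like this sketch (illustrative only; the real definitions live in task_speedup.types):

```python
from dataclasses import dataclass, field
from datetime import datetime

# Illustrative shapes only; see task_speedup.types for the real definitions.
@dataclass
class Turn:
    role: str                  # "user" or "assistant"
    content: str
    timestamp: datetime
    turn_index: int
    is_human_typed: bool = False                  # set in step 3 (classification)
    code_diffs: list[str] = field(default_factory=list)  # step 4 (optional)

@dataclass
class Transcript:
    transcript_id: str
    user_id: str
    turns: list[Turn]
```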

llm_fn(samples, tool_schema) -> dict[str, dict[str, Any]]

Run LLM evaluation on samples. This function is used both to summarize assistant turns (to compress transcripts) and to estimate a transcript's difficulty. Each sample has id and input keys; the return value maps sample IDs to parsed tool-call results. We provide an example implementation using Inspect in task_speedup.adapters (requires the inspect-ai optional dependency).
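A minimal stand-in with this shape can be useful for testing an adapter offline (fake_llm_fn and its echo output are illustrative, not part of the codebase):

```python
from typing import Any

# A stub with the shape described above; the bundled implementation uses inspect-ai.
def fake_llm_fn(
    samples: list[dict[str, Any]], tool_schema: dict[str, Any]
) -> dict[str, dict[str, Any]]:
    # Each sample has "id" and "input"; the result maps sample id -> parsed
    # tool-call fields (here, just an echo of the input's first 20 characters).
    return {s["id"]: {"echo": s["input"][:20]} for s in samples}
```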

COMPRESSION_STRIP_PATTERNS: list[str]

Define a list of regex patterns to strip from user turn content before LLM compression. This prevents large embedded content (e.g. base64 images, binary blobs) from inflating context and causing LLM errors. Each pattern is applied via re.sub(pattern, "[REMOVED]", content) and passed to run_pipeline as compression_strip_patterns.

COMPRESSION_STRIP_PATTERNS: list[str] = [
    r"data:image/[^;]+;base64,[A-Za-z0-9+/=]+",  # inline base64 images
]
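Applied in sequence, the stripping step looks like this (the sample content string is illustrative):

```python
import re

COMPRESSION_STRIP_PATTERNS: list[str] = [
    r"data:image/[^;]+;base64,[A-Za-z0-9+/=]+",  # inline base64 images
]

content = "screenshot: data:image/png;base64,iVBORw0KGgo= (end)"
for pattern in COMPRESSION_STRIP_PATTERNS:
    content = re.sub(pattern, "[REMOVED]", content)
# content is now "screenshot: [REMOVED] (end)"
```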

Types

Find all important type definitions in task_speedup.types.

Configuration

pipeline_config.json controls pipeline behavior:

| Field | Default | Description |
| --- | --- | --- |
| turn_summarization_prompt_file | | Path to turn summarization prompt template |
| difficulty_estimation_prompt_file | | Path to difficulty estimation prompt template |
| max_user_turn_chars | (none) | Truncate user turns beyond this length (shipped config: 5000) |
| bucket_size_minutes | 10 | Time bucket size for human active minutes |
| inactivity_timeout_minutes | 10 | Gap threshold for splitting sessions into active spans |
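A pipeline_config.json matching these fields might look like the following (the prompt file paths are illustrative):

```json
{
  "turn_summarization_prompt_file": "prompts/turn_summarization.txt",
  "difficulty_estimation_prompt_file": "prompts/difficulty_estimation.txt",
  "max_user_turn_chars": 5000,
  "bucket_size_minutes": 10,
  "inactivity_timeout_minutes": 10
}
```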

Prompt files use {assistant_turn_content} and {compressed_transcript} as template variables. Feel free to adjust the prompt files; see the blog post for preliminary validation of the judge prompts.

Development

uv run pytest tests/         # Run tests
uv run ruff format .         # Format code
uv run ruff check .          # Lint
uv run basedpyright .        # Type check

