Coding Agent Transcript Analysis Pipeline

This repository processes transcripts of real coding agent usage into per-transcript metrics, following the methodology outlined in this blog post.

At a high level, it:

  1. Reads coding transcripts into a common format.
  2. Estimates how much active time the human spent guiding the agent.
  3. Estimates how long it would've taken an experienced SWE to complete the tasks the agent performed in the transcript (task difficulty).
  4. Computes agent session concurrency per user per day.

The output metrics — task difficulty, concurrency-adjusted human active time, and daily concurrency — can be used to compute a "time savings factor" (how much time AI saved the human) and perform other analyses of coding agent productivity. The blog post describes the time savings factor methodology and analysis in detail.

This codebase is designed to be easily configurable to run over any agentic coding transcripts you have in any format. See the tutorial section below.

Tutorial

Follow this tutorial to run the pipeline on the bundled example data, then customize it for your own transcripts.

Prerequisites

  • Python 3.11+
  • uv

Step 1: Clone and install

git clone https://github.com/METR/task-speedup.git
cd task-speedup
uv sync --active --extra inspect-ai

This installs the core pipeline plus the inspect-ai dependency used by the example adapter in the demo.

Step 2: Run on example transcripts

export OPENAI_API_KEY="your-key-here"
uv run python -m task_speedup.pipeline

This runs the bundled Claude Code adapter on the example data in example-transcripts/ and writes output to output/metrics.jsonl.

The bundled adapter uses OpenAI models via inspect-ai, configured by _SUMMARIZATION_MODEL (gpt-4o) and _DIFFICULTY_MODEL (gpt-5) in adapters.py. You can change these variables to use Anthropic, Gemini, or other providers supported by inspect-ai, as long as you provide the relevant API keys, and install necessary dependencies.

Step 3: Check the output

The output file contains one JSON line per user. For example:

{
  "user_id": "alice",
  "transcripts": [
    {
      "transcript_id": "session-aaa111",
      "day": "2025-01-15",
      "task_difficulty_minutes": 45.0,
      "concurrency_adjusted_human_minutes": 12.5
    }
  ],
  "daily_concurrency": {
    "2025-01-15": 1.8
  }
}

Each record includes per-transcript metrics (task_difficulty_minutes, concurrency_adjusted_human_minutes) and per-day concurrency for that user.
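To sanity-check a run, you can aggregate the per-transcript metrics yourself. This sketch assumes only the field names shown in the example record; summarize_user and summarize_file are illustrative helpers, not part of the codebase:

```python
import json

# Aggregate the per-transcript metrics in one metrics.jsonl record.
def summarize_user(record: dict) -> dict:
    transcripts = record["transcripts"]
    return {
        "user_id": record["user_id"],
        "total_difficulty_minutes": sum(
            t["task_difficulty_minutes"] for t in transcripts
        ),
        "total_human_minutes": sum(
            t["concurrency_adjusted_human_minutes"] for t in transcripts
        ),
    }

# One record per line, as described above.
def summarize_file(path: str) -> list[dict]:
    with open(path) as f:
        return [summarize_user(json.loads(line)) for line in f if line.strip()]
```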

Step 4: Customize your adapter

To run on your own transcripts, customize three things in src/task_speedup/adapters.py:

  1. load_transcripts() — Point at your data, parse your format into Transcript/Turn objects, classify human-typed turns, and optionally extract code diffs.
  2. llm_fn() — Swap in your preferred LLM client/models (the bundled implementation uses inspect-ai).
  3. COMPRESSION_STRIP_PATTERNS — Add regex patterns to strip format-specific content (e.g. base64 images) before LLM compression.

See Adapter Reference for full documentation on each component.

Step 5: Place your transcript data

Organize your transcripts in a directory structure that your load_transcripts() can find. The bundled adapter expects:

example-transcripts/
  alice/
    projects/-Users-alice-myproject/
      session-aaa111.jsonl
      session-aaa222.jsonl
  bob/
    projects/-Users-bob-webapp/
      session-bbb111.jsonl
      session-bbb222.jsonl

The exact layout is up to you — just match what your load_transcripts() expects.
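For the example layout above, a discovery helper could glob for session files and take the first path component as the user ID (find_sessions is a hypothetical name, not part of the codebase):

```python
from pathlib import Path

# Hypothetical discovery helper for the example layout:
#   <root>/<user>/projects/<project>/<session>.jsonl
def find_sessions(root: Path) -> list[tuple[str, Path]]:
    return sorted(
        (path.relative_to(root).parts[0], path)  # (user_id, session file)
        for path in root.glob("*/projects/*/*.jsonl")
    )
```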

Step 6: Run the pipeline

uv run python -m task_speedup.pipeline

Same command as before — it now uses your custom adapter and data.

Pipeline Overview

The top-level orchestrator is run_pipeline, which chains four independent metrics computations before writing the final output:

def run_pipeline(transcripts, config, llm_fn, output_path, compression_strip_patterns=None):
    human_minutes = compute_concurrency_adjusted_human_minutes(transcripts, config["bucket_size_minutes"])
    concurrency_result = compute_concurrency(transcripts, config["inactivity_timeout_minutes"])
    difficulty_result = estimate_difficulty(transcripts, config, llm_fn, compression_strip_patterns)
    results = assemble_results(transcripts, human_minutes, concurrency_result, difficulty_result)
    write_output_jsonl(results, output_path)
  1. compute_concurrency_adjusted_human_minutes — Counts unique time buckets containing human-typed turns per transcript, splitting shared buckets across concurrent sessions to avoid double-counting active time.
  2. compute_concurrency — Splits each transcript into active timespans (based on inactivity gaps), then uses a sweep-line algorithm to compute time-weighted average session concurrency per user per day.
  3. estimate_difficulty — A two-stage LLM pipeline: first summarizes assistant turns, then compresses each transcript and asks the LLM to estimate how long it would've taken an experienced engineer to produce the net successful output.
  4. assemble_results — Merges the three metric streams into per-user output with transcript-level metrics and daily concurrency.
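The sweep-line idea in step 2 can be sketched as follows. This is an illustration of time-weighted average concurrency over a set of active spans, not the repository's implementation:

```python
# Time-weighted average concurrency over active spans (start, end) in minutes.
def avg_concurrency(spans: list[tuple[float, float]]) -> float:
    events = []
    for start, end in spans:
        events.append((start, 1))   # a span opens
        events.append((end, -1))    # a span closes
    events.sort()
    active = 0          # number of concurrently active spans
    total_time = 0.0    # total time with at least one active span
    weighted = 0.0      # integral of concurrency over that time
    prev = None
    for t, delta in events:
        if prev is not None and active > 0:
            dt = t - prev
            total_time += dt
            weighted += active * dt
        active += delta
        prev = t
    return weighted / total_time if total_time else 0.0
```

Two sessions at (0, 10) and (5, 15) overlap for 5 of 15 active minutes, so the average is (5·1 + 5·2 + 5·1) / 15 ≈ 1.33.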

Adapter Reference

The adapter module (src/task_speedup/adapters.py) defines three components you customize for your transcript format. The bundled implementation handles Claude Code transcripts as a working example.

load_transcripts() -> list[Transcript]

Takes no arguments and returns list[Transcript]. Each lab hardcodes its own data source and filtering logic inside the function body. This single function handles four sub-responsibilities:

  1. Filter to meaningful transcripts — Deduplicate files, remove warmups / heartbeat agent sessions if applicable, keep main agent sessions, exclude subagent sessions (coding agent specific logic).
  2. Parse raw format — Convert your raw data into Turn objects with role, content, timestamp, and turn_index.
  3. Classify human-typed turns — Set is_human_typed on each Turn to distinguish human input from automated messages (tool results, system injections, etc.). This can be a bit tricky depending on the coding agent. For example, Claude Code has many automated user messages, and also many assistant messages that actually contain user inputs, such as user rejection of a proposed tool call.
  4. Optionally extract code diffs — Populate code_diffs on assistant turns for more accurate difficulty estimation.
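Given the fields named above, the turn and transcript shapes look roughly like this sketch (illustrative only; the real definitions live in task_speedup.types):

```python
from dataclasses import dataclass, field
from datetime import datetime

# Illustrative shapes only; see task_speedup.types for the real definitions.
@dataclass
class Turn:
    role: str                  # "user" or "assistant"
    content: str
    timestamp: datetime
    turn_index: int
    is_human_typed: bool = False                  # set in step 3 (classification)
    code_diffs: list[str] = field(default_factory=list)  # step 4 (optional)

@dataclass
class Transcript:
    transcript_id: str
    user_id: str
    turns: list[Turn]
```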

llm_fn(samples, tool_schema) -> dict[str, dict[str, Any]]

Run LLM evaluation on samples. This function is used both to summarize assistant turns (to compress transcripts) and to estimate a transcript's difficulty. Each sample has id and input keys; the return value maps sample IDs to parsed tool-call results. We provide an example implementation using Inspect in task_speedup.adapters (requires the inspect-ai optional dependency).
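A minimal stand-in with this shape can be useful for testing an adapter offline (fake_llm_fn and its echo output are illustrative, not part of the codebase):

```python
from typing import Any

# A stub with the shape described above; the bundled implementation uses inspect-ai.
def fake_llm_fn(
    samples: list[dict[str, Any]], tool_schema: dict[str, Any]
) -> dict[str, dict[str, Any]]:
    # Each sample has "id" and "input"; the result maps sample id -> parsed
    # tool-call fields (here, just an echo of the input's first 20 characters).
    return {s["id"]: {"echo": s["input"][:20]} for s in samples}
```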

COMPRESSION_STRIP_PATTERNS: list[str]

Define a list of regex patterns to strip from user turn content before LLM compression. This prevents large embedded content (e.g. base64 images, binary blobs) from inflating context and causing LLM errors. Each pattern is applied via re.sub(pattern, "[REMOVED]", content) and passed to run_pipeline as compression_strip_patterns.

COMPRESSION_STRIP_PATTERNS: list[str] = [
    r"data:image/[^;]+;base64,[A-Za-z0-9+/=]+",  # inline base64 images
]
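Applied in sequence, the stripping step looks like this (the sample content string is illustrative):

```python
import re

COMPRESSION_STRIP_PATTERNS: list[str] = [
    r"data:image/[^;]+;base64,[A-Za-z0-9+/=]+",  # inline base64 images
]

content = "screenshot: data:image/png;base64,iVBORw0KGgo= (end)"
for pattern in COMPRESSION_STRIP_PATTERNS:
    content = re.sub(pattern, "[REMOVED]", content)
# content is now "screenshot: [REMOVED] (end)"
```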

Types

Find all important type definitions in task_speedup.types.

Configuration

pipeline_config.json controls pipeline behavior:

| Field | Default | Description |
| --- | --- | --- |
| turn_summarization_prompt_file | | Path to turn summarization prompt template |
| difficulty_estimation_prompt_file | | Path to difficulty estimation prompt template |
| max_user_turn_chars | (none) | Truncate user turns beyond this length (shipped config: 5000) |
| bucket_size_minutes | 10 | Time bucket size for human active minutes |
| inactivity_timeout_minutes | 10 | Gap threshold for splitting sessions into active spans |
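A pipeline_config.json matching these fields might look like the following (the prompt file paths are illustrative):

```json
{
  "turn_summarization_prompt_file": "prompts/turn_summarization.txt",
  "difficulty_estimation_prompt_file": "prompts/difficulty_estimation.txt",
  "max_user_turn_chars": 5000,
  "bucket_size_minutes": 10,
  "inactivity_timeout_minutes": 10
}
```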

Prompt files use {assistant_turn_content} and {compressed_transcript} as template variables. Feel free to adjust the prompt files; see the blog post for preliminary validation of the judge prompts.

Development

uv run pytest tests/         # Run tests
uv run ruff format .         # Format code
uv run ruff check .          # Lint
uv run basedpyright .        # Type check

