Vision AI "Cortex" for Agents. A Playwright-based MCP Server & API that captures screenshots with ground-truth DOM extraction and full auth state injection. Containerized.

samestrin/chromium-screenshots

chromium-screenshots

The missing screenshot service for Vision AI & Auth. Inject auth. Extract DOM. Zero-drift capture. Pixel-perfect Chromium.

CI License: MIT Python 3.11+ Docker

⚡ Why this exists

Taking screenshots for Vision AI is hard. If you take a screenshot and then scrape the HTML separately, the page can change between the two steps: elements move, popups appear, and your bounding boxes no longer match the pixels.

chromium-screenshots guarantees Zero-Drift. It extracts the DOM coordinates (ground truth) and the screenshot (pixels) from the exact same render frame.
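
To make the drift problem concrete, here is a minimal Python sketch of a sanity check a consumer could run on a capture: it reads the image size from the PNG header and flags any DOM box that falls outside the image. The box shape (`x`, `y`, `width`, `height` in CSS pixels) and the `scale` parameter are assumptions for illustration, not the service's documented schema.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple:
    """Return (width, height) read from the IHDR chunk of a PNG."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    # IHDR stores width and height as big-endian 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def boxes_outside_image(png: bytes, boxes: list, scale: float = 1.0) -> list:
    """Return the boxes that do not fit inside the screenshot.

    `boxes` uses a hypothetical shape: x, y, width, height in CSS pixels.
    `scale` is the device pixel ratio of the capture (assumed 1.0).
    """
    img_w, img_h = png_size(png)
    bad = []
    for box in boxes:
        left, top = box["x"] * scale, box["y"] * scale
        right = left + box["width"] * scale
        bottom = top + box["height"] * scale
        if left < 0 or top < 0 or right > img_w or bottom > img_h:
            bad.append(box)
    return bad
```

A check like this only catches gross mismatches (boxes off the canvas). It cannot prove that a box still sits over the right element, which is why capturing both from the same page state matters.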

Visual Proof

Chromium Screenshots Demo

| Feature | Standard Tools | chromium-screenshots |
| --- | --- | --- |
| Data Extraction | ❌ Image Only | ✅ Image + DOM + Bounding Boxes |
| Quality Control | ❌ None (hope it loaded) | Quality Score (Good/Low/Poor) |
| Auth Injection | ❌ Cookies only | ✅ Cookies + LocalStorage + SessionStorage |
| AI Integration | ❌ Manual API calls | ✅ Native MCP Server (Claude/Gemini) |
| SPA Support | ❌ Fails on hydration | ✅ Waits for selectors/network idle |

🤖 Standardized AI Integration

This tool is a "visual cortex" for your AI agents. It implements the Model Context Protocol (MCP), allowing tools like Claude Desktop to natively control the browser.

  • screenshot: Returns base64 data for immediate analysis ("What does this button say?").
  • screenshot_to_file: Saves to disk to preserve context window tokens.
  • extract_dom: Returns text + coordinates for ground-truth verification.
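
For orientation, MCP clients invoke tools through JSON-RPC 2.0 `tools/call` requests. An MCP host such as Claude Desktop normally builds these for you, so the sketch below only illustrates the message shape. The tool names come from the list above; the `url` argument name is an assumption borrowed from the HTTP API and is not confirmed for the MCP tools.

```python
import json
from itertools import count

_request_ids = count(1)


def tool_call(name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 `tools/call` request as used by MCP."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# The `url` argument name is an assumption, not taken from the tool schema.
request = tool_call("extract_dom", {"url": "https://news.ycombinator.com"})
print(json.dumps(request, indent=2))
```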

Comparison with Alternatives

While many tools exist for browser automation and content extraction, chromium-screenshots is specifically designed to provide high-fidelity observation for AI agents, rather than just raw data or static images.

| Tool Category | Examples | Screenshot | Structural Data | Quality Metric | Primary Focus |
| --- | --- | --- | --- | --- | --- |
| Agent Observation | This Repo | | ✅ (Atomic DOM) | | AI Reliability & Context |
| LLM RAG Scrapers | Firecrawl, Jina | | ❌ (Markdown) | | Text extraction for reading |
| Screenshot APIs | ScreenshotOne, ApiFlash | | ❌ (HTML) | ⚠️ (Basic) | Marketing & Archiving |
| Performance Audit | Lighthouse CI | | ✅ (Full DOM) | | Speed & SEO Audits (Slow) |
| Visual Testing | Percy, Chromatic | | ✅ (Snapshot) | | Regression Testing (Diffs) |

🚀 Quick Start

Docker (Recommended)

Run the containerized service. No local Python or browser install is needed; Docker is the only prerequisite.

docker compose up -d

The API is now active at http://localhost:8000.
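
Because `docker compose up -d` returns before the container has necessarily finished starting, scripts may want to wait for the port first. The helper below is a small sketch using only the standard library. `wait_for_port` is a hypothetical name, and an open port only shows that something is listening, not that the application is fully initialized.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Connection refused or timed out: retry shortly.
            time.sleep(0.25)
    return False
```

For the default setup this would be called as `wait_for_port("localhost", 8000)`.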

Python (Local)

pip install -r requirements.txt
playwright install chromium
uvicorn app.main:app --reload

💡 Common Recipes

1. Vision AI Ground Truth

Capture screenshot + DOM data + Quality Score in one call.

curl -X POST "http://localhost:8000/screenshot" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "extract_dom": {
      "enabled": true,
      "selectors": ["span.titleline > a"],
      "max_elements": 50
    }
  }' -o hn_capture.png
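
The same call can be made from Python with only the standard library. This sketch mirrors the curl payload above. It assumes, as `-o hn_capture.png` implies, that the response body is the image itself; `build_capture_request` and `save_capture` are illustrative names.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"


def build_capture_request(url: str, selectors: list, max_elements: int = 50) -> urllib.request.Request:
    """Build the POST /screenshot request from the recipe without sending it."""
    payload = {
        "url": url,
        "extract_dom": {
            "enabled": True,
            "selectors": selectors,
            "max_elements": max_elements,
        },
    }
    return urllib.request.Request(
        f"{API_BASE}/screenshot",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_capture(request: urllib.request.Request, path: str) -> int:
    """Send the request and write the response body to `path`; returns bytes written."""
    with urllib.request.urlopen(request, timeout=60) as response:
        body = response.read()
    with open(path, "wb") as fh:
        fh.write(body)
    return len(body)
```

With the service running, `save_capture(build_capture_request("https://news.ycombinator.com", ["span.titleline > a"]), "hn_capture.png")` would reproduce the curl command.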

2. The "Impossible" Auth Shot

Inject localStorage to capture authenticated dashboards (Wasp/Firebase).

curl -X POST "http://localhost:8000/screenshot" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://app.example.com/dashboard",
    "localStorage": {
      "wasp:sessionId": "secret_session_token",
      "theme": "dark"
    },
    "wait_for_selector": ".dashboard-grid"
  }' -o dashboard.png
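
One common way to seed localStorage before any page script runs is a Playwright init script. Whether this service does exactly that internally is an assumption; the sketch below only shows the general technique, and `local_storage_init_script` is a hypothetical helper.

```python
import json


def local_storage_init_script(items: dict) -> str:
    """Return JavaScript that writes each key/value pair into localStorage."""
    # json.dumps produces a valid JavaScript object literal and escapes quotes.
    return (
        "(() => {"
        f" const items = {json.dumps(items)};"
        " for (const [key, value] of Object.entries(items)) {"
        " window.localStorage.setItem(key, value);"
        " }"
        " })();"
    )


script = local_storage_init_script({"wasp:sessionId": "secret_session_token", "theme": "dark"})
# With Playwright for Python this could be registered before navigation:
#     await context.add_init_script(script)
```

An init script runs in every page and frame of the browser context, so the values are written for whichever origin is loaded. Use a dedicated context per capture if that matters for your tokens.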

3. Vision AI Optimization

Get quality metrics and model compatibility hints for Vision AI integrations.

curl -X POST "http://localhost:8000/screenshot/json" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "extract_dom": {
      "enabled": true,
      "include_metrics": true,
      "include_vision_hints": true,
      "target_vision_model": "claude"
    }
  }' | jq '{quality: .dom_extraction.quality, hints: .vision_hints}'
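
In Python, the equivalent of the jq filter plus a simple quality gate might look like the sketch below. Only the two paths used by the jq filter (`dom_extraction.quality` and `vision_hints`) come from this README. The sample response, the assumption that the quality value is a plain string label, and the `accepted` default are illustrative.

```python
def summarize_capture(response: dict) -> dict:
    """Mirror the jq filter: keep only the quality value and the vision hints."""
    return {
        "quality": response.get("dom_extraction", {}).get("quality"),
        "hints": response.get("vision_hints"),
    }


def is_usable(summary: dict, accepted: tuple = ("good",)) -> bool:
    """Return True when the quality label is one of the accepted values."""
    quality = summary.get("quality")
    # Compare case-insensitively; non-string values are treated as unusable.
    return isinstance(quality, str) and quality.lower() in accepted


# Hypothetical response body, for illustration only.
sample = {
    "dom_extraction": {"quality": "GOOD", "elements": []},
    "vision_hints": {"note": "placeholder"},
}
summary = summarize_capture(sample)
print(summary["quality"], is_usable(summary))
```

An agent could use a gate like `is_usable` to retry the capture instead of sending a poorly loaded page to a vision model.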

📚 Documentation

Detailed references for core features:

🧠 How It Works

The Zero-Drift Flow:

  1. Inject Auth: Set cookies & localStorage.
  2. Navigate: Load page and wait for networkidle.
  3. Freeze: Pause execution.
  4. Extract: Scrape DOM positions & Text (JS evaluation).
  5. Audit: Run Quality Detection engine (count elements, check visibility).
  6. Capture: Take screenshot.
  7. Return: Send Image + JSON together.
sequenceDiagram
    participant U as 👤 User / Agent
    participant A as ⚡ API / MCP
    participant B as 🕸️ Chromium
    participant Q as 🔍 Quality Engine

    U->>A: POST /screenshot (extract_dom=true)
    A->>B: Create Context & Inject Auth
    B->>B: Navigate & Wait
    
    rect rgb(30, 30, 30)
        note right of B: Critical Section
        B->>B: Extract DOM (JS)
        B->>Q: Assess Quality
        Q-->>B: Quality: GOOD
        B->>B: Capture Pixels
    end
    
    B-->>A: Result (Image + Metadata)
    A-->>U: Return
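
The flow above can be approximated with Playwright for Python as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the "Freeze" step is not reproduced, the audit is reduced to counting visible boxes, and `capture` and `EXTRACT_JS` are hypothetical names.

```python
from typing import Optional

# JavaScript run inside the page: collect text and viewport-relative boxes.
EXTRACT_JS = """
(selector) => Array.from(document.querySelectorAll(selector)).map((el) => {
  const r = el.getBoundingClientRect();
  return {text: el.innerText, x: r.x, y: r.y, width: r.width, height: r.height};
})
"""


async def capture(url: str, selector: str, cookies: Optional[list] = None) -> dict:
    """Extract DOM boxes and take the screenshot back to back on one page."""
    # Imported here so the module loads even where Playwright is not installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context()
        if cookies:
            await context.add_cookies(cookies)  # inject auth
        page = await context.new_page()
        await page.goto(url, wait_until="networkidle")  # navigate and wait
        elements = await page.evaluate(EXTRACT_JS, selector)  # extract
        # Crude audit: count elements that have a non-empty box.
        visible = [e for e in elements if e["width"] > 0 and e["height"] > 0]
        # Default screenshot covers the viewport, matching the box coordinates.
        image = await page.screenshot()  # capture
        await browser.close()
    return {"image": image, "elements": elements, "visible_count": len(visible)}
```

It could be run with `asyncio.run(capture("https://news.ycombinator.com", "span.titleline > a"))`. Because nothing pauses the page between `evaluate` and `screenshot` here, a busy page can still change in that window.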

License

MIT License
