CODITECT Crawl

Intelligent web content extraction pipeline for the CODITECT /crawl skill.

Architecture

Three-phase layered extraction pipeline:

Phase 1: WebFetch → Trafilatura        (static HTML → LLM-optimized markdown)
Phase 2: SPA Detection → Crawl4AI      (JS-rendered pages → markdown)
Phase 3: Scrapy + scrapy-playwright    (bulk/deep crawl orchestration)
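
The hand-off from Phase 1 to Phase 2 depends on deciding whether a page needs JavaScript rendering. The following is a minimal sketch of one such heuristic, not the logic in detector.py: the marker list, the 200-character threshold, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical framework markers; the real detector may use different signals.
SPA_MARKERS = ('id="root"', 'id="app"', "data-reactroot", "ng-app", "__NEXT_DATA__")


def visible_text_length(html: str) -> int:
    """Rough count of visible characters after dropping scripts, styles, and tags."""
    without_code = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", without_code)
    return len(" ".join(text.split()))


def looks_like_spa(html: str, min_text_chars: int = 200) -> bool:
    """Guess that JS rendering is needed: a framework marker plus very little text."""
    has_marker = any(marker in html for marker in SPA_MARKERS)
    return has_marker and visible_text_length(html) < min_text_chars


def choose_phase(html: str) -> str:
    """Pick the phase that should handle a page fetched as static HTML."""
    return "phase2-render" if looks_like_spa(html) else "phase1-static"


static_page = "<html><body><article>" + "word " * 100 + "</article></body></html>"
spa_shell = '<html><body><div id="root"></div><script src="/bundle.js"></script></body></html>'
print(choose_phase(static_page))
print(choose_phase(spa_shell))
```

Requiring both a marker and a low text count keeps server-rendered pages that happen to ship a framework (and already contain their content) on the cheaper static path.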

Installation

# Phase 1 only (lightweight, ~25 MB)
pip install coditect-crawl

# Phase 1 + Phase 2 SPA rendering (~500 MB, includes Chromium)
pip install "coditect-crawl[spa]"   # quoted so shells such as zsh do not glob the brackets

# All phases including deep crawl (~540 MB)
pip install "coditect-crawl[all]"   # quoted so shells such as zsh do not glob the brackets
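
Because the heavier phases are optional extras, it can be useful to check at runtime which ones are importable. This is a small sketch using only the standard library. The mapping from phase to module names (trafilatura, crawl4ai, scrapy, scrapy_playwright) is an assumption based on the tools named above, not something taken from the package's pyproject.toml.

```python
from importlib.util import find_spec

# Assumed mapping of phases to the top-level modules they would import.
PHASE_MODULES = {
    "phase1": ["trafilatura"],
    "phase2": ["crawl4ai"],
    "phase3": ["scrapy", "scrapy_playwright"],
}


def available_phases(modules_by_phase: dict[str, list[str]] = PHASE_MODULES) -> dict[str, bool]:
    """Report, per phase, whether every module it needs can be imported."""
    return {
        phase: all(find_spec(name) is not None for name in modules)
        for phase, modules in modules_by_phase.items()
    }


if __name__ == "__main__":
    for phase, ok in available_phases().items():
        print(f"{phase}: {'available' if ok else 'missing dependencies'}")
```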

Quick Start

from coditect_crawl import extract

# Single page extraction (Phase 1 — Trafilatura)
result = extract("https://example.com")
print(result.markdown)
print(result.metadata.title)
print(result.links)

# SPA extraction (Phase 2 — auto-detects JS need)
result = extract("https://react-spa.example.com", spa=True)

# Deep crawl (Phase 3 — Scrapy orchestration)
from coditect_crawl import deep_crawl

results = deep_crawl("https://docs.example.com", max_depth=2, max_pages=100)
for page in results:
    print(f"{page.url}: {len(page.markdown)} chars")

Project Structure

coditect-crawl/
├── src/coditect_crawl/
│   ├── __init__.py           # Public API: extract(), deep_crawl()
│   ├── extractors/           # Phase 1: HTML → markdown
│   │   ├── trafilatura.py    # Trafilatura wrapper
│   │   └── fallback.py       # Regex fallback (current process_page.py logic)
│   ├── renderers/            # Phase 2: JS rendering
│   │   ├── crawl4ai.py       # Crawl4AI AsyncWebCrawler wrapper
│   │   ├── playwright.py     # Direct Playwright fallback
│   │   └── detector.py       # SPA detection heuristic
│   ├── orchestrators/        # Phase 3: Bulk crawling
│   │   ├── scrapy_spider.py  # Scrapy spider + Trafilatura pipeline
│   │   └── settings.py       # Scrapy configuration
│   └── utils/
│       ├── links.py          # Link extraction and categorization
│       └── models.py         # Data models (ExtractionResult, PageMetadata)
├── tests/
│   ├── unit/                 # Unit tests per module
│   ├── integration/          # End-to-end extraction tests
│   └── fixtures/             # HTML fixtures for testing
├── docs/                     # Architecture docs, ADRs
├── scripts/                  # Dev scripts
├── pyproject.toml            # Build config
└── CLAUDE.md                 # AI agent instructions
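
For orientation, the data models in utils/models.py might look roughly like the sketch below. Only the attribute names used in the Quick Start (markdown, metadata.title, links, url) come from this README. The field types, the defaults, and the choice of frozen dataclasses are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class PageMetadata:
    """Page-level metadata; only `title` is confirmed by the Quick Start."""
    title: Optional[str] = None


@dataclass(frozen=True)
class ExtractionResult:
    """Result of extracting one page (field types are assumed)."""
    url: str
    markdown: str
    metadata: PageMetadata = field(default_factory=PageMetadata)
    links: list[str] = field(default_factory=list)  # assumed: plain URL strings


result = ExtractionResult(
    url="https://example.com",
    markdown="# Example Domain",
    metadata=PageMetadata(title="Example Domain"),
)
print(result.metadata.title, len(result.markdown), result.links)
```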

Design Decisions

Trafilatura for extraction: best F1 score (0.958) among the extractors evaluated, lightest footprint, academic backing
Crawl4AI for SPA: the only evaluated tool with native LLM-optimized markdown output plus Playwright rendering
Scrapy for deep crawl: gold-standard orchestration (59.9k stars, 18 years, BSD-3)
Abstraction layers: each phase is hot-swappable (e.g., Crawl4AI → Playwright + Trafilatura)
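
The hot-swappable layering can be illustrated with a small interface sketch. The Renderer protocol, the stub classes, and render_to_markdown are hypothetical names invented here. The stubs return canned HTML instead of calling Crawl4AI or Playwright, so the example shows only the swapping mechanism.

```python
from typing import Callable, Protocol


class Renderer(Protocol):
    """Anything that turns a URL into rendered HTML (hypothetical interface)."""

    def render(self, url: str) -> str: ...


class StubCrawl4AIRenderer:
    """Stand-in for a Crawl4AI-backed renderer; returns canned HTML."""

    def render(self, url: str) -> str:
        return f"<main>crawl4ai:{url}</main>"


class StubPlaywrightRenderer:
    """Stand-in for a direct Playwright renderer; returns canned HTML."""

    def render(self, url: str) -> str:
        return f"<main>playwright:{url}</main>"


def strip_main(html: str) -> str:
    """Toy HTML-to-markdown step: drop the wrapping <main> element."""
    return html.removeprefix("<main>").removesuffix("</main>")


def render_to_markdown(url: str, renderer: Renderer, to_markdown: Callable[[str], str]) -> str:
    """Phase 2 in miniature: render, then convert, without knowing the backend."""
    return to_markdown(renderer.render(url))


for backend in (StubCrawl4AIRenderer(), StubPlaywrightRenderer()):
    print(render_to_markdown("https://example.com", backend, strip_main))
```

Because callers depend only on the protocol, replacing one backend with another is a change at the call site and does not touch the extraction code.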

See the analysis document for the full MoE evaluation.

License

Apache-2.0 — See LICENSE

Author

AZ1.AI INC — coditect.ai

coditect-ai/coditect-crawl