Self-hosted, authenticated web scraper that converts website content into clean Markdown optimized for LLM consumption.
LLMs work best with clean, structured text — not raw HTML full of navigation, ads, and scripts. ForgeCrawl gives you a private, self-hosted tool to scrape any public webpage and get back clean Markdown with metadata, ready to feed into your AI workflows.
- Researchers building knowledge bases from government reports, academic pages, or institutional websites that don't have APIs
- Developers creating RAG (retrieval-augmented generation) pipelines who need clean, structured content from the web
- Content teams archiving web content as Markdown for documentation, migration, or LLM fine-tuning datasets
- Policy analysts scraping legislative sites, agency reports, and public records into a format AI tools can actually use
- Anyone who's tired of copy-pasting web content into ChatGPT and losing all the formatting
| Use case | What you'd scrape | What you get |
|---|---|---|
| Build a research corpus | 200 pages from a state agency website | Clean Markdown files with metadata (dates, authors, canonical URLs) ready for a RAG pipeline |
| Feed context to an LLM | A long technical doc or policy page | Structured Markdown you can paste directly into Claude, GPT, or any LLM prompt |
| Archive a blog | Individual blog posts from a company site | Markdown with YAML frontmatter preserving publication dates, authors, and descriptions |
| Create training data | Product pages, FAQ sections, support docs | Consistently formatted Markdown suitable for fine-tuning or embedding generation |
| Monitor content changes | A regulatory page that updates quarterly | Re-scrape with cache bypass to get the latest version in a diffable format |
- Self-hosted — your data never leaves your server. No API keys, no usage limits, no third-party dependencies.
- Free — no per-page pricing. Scrape as much as you want on your own infrastructure.
- Authenticated — built-in user auth means you can deploy it on a public server without worrying about unauthorized access.
- Simple — one Docker command or PM2 start. No Redis, no Postgres, no message queue. SQLite handles everything.
You could paste a webpage's HTML directly into an LLM. But you'd be wasting tokens on noise and getting worse results. Here's what Markdown gives you that raw HTML doesn't:
| | Raw HTML | ForgeCrawl Markdown |
|---|---|---|
| Size | 50-200KB per page (scripts, styles, nav, ads, tracking) | 2-10KB — content only (80-95% fewer tokens) |
| Signal-to-noise | `<div class="flex items-center px-4">` is layout, not content | Headings, paragraphs, lists, code blocks — pure content |
| LLM comprehension | HTML structure is ambiguous — models must infer what's content vs. chrome | Markdown is native to LLM training data (GitHub, docs, READMEs) — models parse it natively |
| Metadata | Scattered across `<meta>` tags, `<head>`, JSON-LD, microdata | Clean YAML frontmatter: title, URL, description, timestamp, word count |
| Consistency | Every site has different HTML structure | Every page outputs the same Markdown format — one parser for your pipeline |
| Diffability | HTML diffs are unreadable noise | Markdown diffs cleanly in git — you can track content changes over time |
| Cost | A 100KB HTML page uses ~25K tokens at $3/M = $0.075 per page | The same page as 5KB Markdown uses ~1.2K tokens = $0.004 per page (~18x cheaper) |
Bottom line: Feeding raw HTML to an LLM is like feeding a book scanner the entire newspaper — ads, classifieds, and all — when you only need one article. ForgeCrawl extracts the article, formats it cleanly, and tags it with metadata so your pipeline knows exactly what it's working with.
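The cost comparison above is simple arithmetic. This sketch reproduces it, assuming roughly 4 characters per token and $3 per million input tokens — both illustrative ballpark figures, not measured values:

```javascript
// Rough prompt-cost estimate for feeding a page to an LLM.
// Assumptions (illustrative only): ~4 chars per token, $3 per million input tokens.
const CHARS_PER_TOKEN = 4
const USD_PER_MILLION_TOKENS = 3

function promptCost(bytes) {
  const tokens = bytes / CHARS_PER_TOKEN
  return (tokens / 1_000_000) * USD_PER_MILLION_TOKENS
}

// 100KB of raw HTML vs. the same page as 5KB of extracted Markdown
const htmlCost = promptCost(100_000)
const markdownCost = promptCost(5_000)
console.log(htmlCost.toFixed(4), markdownCost.toFixed(4))
```

With these assumptions the raw-HTML page costs about $0.075 per prompt and the Markdown version well under a cent — the same ratio the table shows.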
Phase 1 (Foundation & Auth) is fully implemented and tested. The app is functional for single-URL HTTP scraping with built-in authentication.
- First-run admin registration (one-time `/setup` flow)
- Login/logout with JWT sessions (configurable 15-day expiry)
- Single-URL scraping via HTTP fetch
- Content extraction (Mozilla Readability with full-body fallback)
- HTML-to-Markdown conversion (Turndown + GFM)
- YAML frontmatter with metadata (canonical URL, description, timestamps, etc.)
- Result caching (configurable TTL, bypass option)
- Scrape history with detail view
- Copy to clipboard and download as `.md`
- Delete scrapes
- Sitemap detection (notification only — crawling is Phase 3)
- Auto `https://` prefix on URLs
- SSRF protection (private IPs, localhost, cloud metadata, DNS resolution check)
- API key auth (`Bearer fc_...`) for CLI/scripting use
- Login rate limiting (5 attempts per email per 15 minutes)
- Health check endpoint (no auth required)
- Docker Compose deployment
- PM2 bare-metal deployment
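Downstream, the YAML frontmatter mentioned above can be split off from the Markdown body in a few lines. A minimal sketch — the naive line-by-line parser is illustrative (not a full YAML parser), and the field names follow the ones this README mentions:

```javascript
// Split ForgeCrawl-style YAML frontmatter ("---" delimited) from the Markdown body.
// Illustrative sketch: handles only flat "key: value" lines, not nested YAML.
function splitFrontmatter(markdown) {
  const match = markdown.match(/^---\n([\s\S]*?)\n---\n?([\s\S]*)$/)
  if (!match) return { frontmatter: {}, body: markdown }

  const frontmatter = {}
  for (const line of match[1].split('\n')) {
    const idx = line.indexOf(':')
    if (idx === -1) continue
    frontmatter[line.slice(0, idx).trim()] = line.slice(idx + 1).trim()
  }
  return { frontmatter, body: match[2] }
}

const doc = '---\ntitle: Example Domain\nurl: https://example.com\n---\n# Example Domain\n'
const { frontmatter, body } = splitFrontmatter(doc)
```

This is handy when feeding scraped pages into a RAG pipeline: the metadata travels with each document without polluting the embedded text.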
- Phase 2: Puppeteer JS rendering, filesystem storage, PDF/DOCX extraction
- Phase 3: Job queue, multi-page site crawling, robots.txt
- Phase 4: Multi-user management
- Phase 5: RAG chunking, login-gated scraping, export formats
```bash
# Prerequisites: Node.js 22+, pnpm
git clone https://github.com/cschweda/forgecrawl
cd forgecrawl
pnpm install

# Generate an auth secret (or let Docker auto-generate one)
cp .env.example .env
node -e "console.log('NUXT_AUTH_SECRET=' + require('crypto').randomBytes(32).toString('hex'))" >> .env

# Start dev server
pnpm dev
# Visit http://localhost:5150
# Register admin account on first visit

# Run tests (builds app, starts test server, runs 81 tests)
pnpm test
```

| Script | Description |
|---|---|
| `pnpm dev` | Start Nuxt dev server on port 5150 |
| `pnpm build` | Build for production |
| `pnpm start` | Start production server |
| `pnpm test` | Run all 81 tests |
| `pnpm test:health` | Health check tests only |
| `pnpm test:auth` | Auth tests (setup, login, API keys) |
| `pnpm test:security` | Security tests (SSRF, rate limiting, middleware, etc.) |
| `pnpm test:scrape` | Scraping tests (fetch, cache, CRUD) |
| `pnpm dev:web` | Start marketing site dev server on port 3200 |
| `pnpm build:web` | Generate static marketing site for Netlify |
| `pnpm db:generate` | Generate Drizzle migration files |
| `pnpm db:migrate` | Run database migrations |
```bash
git clone https://github.com/cschweda/forgecrawl
cd forgecrawl
docker compose up -d
# Visit http://localhost:5150
# Auth secret auto-generates if not set
```

```bash
git clone https://github.com/cschweda/forgecrawl
cd forgecrawl
pnpm install
cp .env.example .env
# Edit .env and set NUXT_AUTH_SECRET (min 32 chars)
pnpm build
pm2 start ecosystem.config.cjs
pm2 save && pm2 startup
```

See `ecosystem.config.cjs` for PM2 tuning options.
ForgeCrawl requires a traditional server (VPS, Droplet, bare metal) and cannot run on serverless platforms like Netlify or Vercel. This isn't a limitation we plan to work around — it's fundamental to the architecture.
| Requirement | VPS / Droplet | Netlify / Vercel |
|---|---|---|
| SQLite database | Persistent disk, single-file DB, trivial backups | No persistent filesystem — DB lost on every cold start |
| better-sqlite3 | Native C++ module, runs perfectly | Requires special bundling; synchronous I/O is an anti-pattern for serverless |
| WAL mode | Single long-running process, safe concurrent reads | Multiple function instances = write conflicts and corruption risk |
| Puppeteer (Phase 2) | Full Chrome, no size/time limits | Exceeds function size limits (~50MB zipped); cold starts are 10+ seconds |
| Long scrapes | No timeout constraints | 10-26 second function timeout kills slow pages |
| File storage (Phase 2) | Persistent disk for HTML/Markdown/chunks | Only ephemeral /tmp (512MB, wiped between invocations) |
| In-memory rate limiter | Works correctly in a single process | Each function invocation has its own memory — rate limiting is meaningless |
| Stable process | PM2 keeps it running with auto-restart | No persistent process; each request spins up a new instance |
Bottom line: ForgeCrawl is a stateful application with a local database, native modules, and long-running operations. Serverless platforms are designed for stateless, short-lived functions. You'd need to swap SQLite for a managed database (Postgres, Turso), rewrite all I/O to be async, remove native dependencies, and accept severe Puppeteer limitations — at which point it's a different application.
Recommended hosts: DigitalOcean Droplet ($6/mo), Hetzner VPS, AWS Lightsail, or any VPS with 1GB+ RAM and Node.js 22+. See the Docker Compose or PM2 quick starts above.
Note: The marketing site (`packages/web`) deploys to Netlify as a static site — see the Marketing Site section below.
ForgeCrawl uses a split-domain architecture:
| Domain | Host | Purpose |
|---|---|---|
| `forgecrawl.com` | Netlify | Static marketing site (SSG) |
| `api.forgecrawl.com` | DigitalOcean Droplet (via Laravel Forge) | Scraper API (Nuxt server + SQLite) |
The marketing site is purely static HTML — no API, no server. The scraper API runs on your VPS behind Nginx (managed by Laravel Forge) with SSL via Let's Encrypt.
- In Laravel Forge, create a new site on your droplet with domain `api.forgecrawl.com`
- Edit the Nginx configuration to proxy to the ForgeCrawl app:
```nginx
location / {
    proxy_pass http://127.0.0.1:5150;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection 'upgrade';
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_cache_bypass $http_upgrade;
}
```

- Click SSL → Let's Encrypt to provision a free certificate for `api.forgecrawl.com`
- Add a DNS A record: `api.forgecrawl.com` → your droplet's IP address
- Start the app on the droplet via Docker Compose or PM2 (see Quick Start above)
| Type | Name | Value |
|---|---|---|
| ALIAS (or ANAME) | `forgecrawl.com` | `forgecrawl.netlify.app` |
| A | `api.forgecrawl.com` | Your droplet IP (e.g., `143.198.x.x`) |
The marketing site (`packages/web`) is a separate Nuxt 4 static site with SEO, dark/light mode, and the forge aesthetic. It deploys to Netlify at `forgecrawl.com`. The API examples on the site point to `https://api.forgecrawl.com`.
```bash
# Start marketing site dev server
pnpm dev:web
# Visit http://localhost:3200
```

ForgeCrawl's marketing site deploys as a fully static site (SSG) — not a single-page app (SPA). Nuxt pre-renders every route to plain HTML at build time, so there's no server, no functions, and no client-side routing fallback needed. This is configured via `nitro: { preset: 'static' }` in `nuxt.config.ts`.
The repo includes `packages/web/netlify.toml`, which configures everything. Just connect the repo:
- Log in to Netlify and click Add new site → Import an existing project
- Select your GitHub repo (`cschweda/forgecrawl`)
- Netlify will auto-detect `packages/web/netlify.toml` — verify the settings:
  - Base directory: `packages/web`
  - Build command: `npx nuxt generate`
  - Publish directory: `.output/public`
  - Node version: 22
- Click Deploy site
That's it. Every push to `main` triggers a rebuild.
If the `netlify.toml` isn't detected, configure manually:
- Add new site → Import an existing project → select the repo
- Under Build settings:
  - Base directory: `packages/web`
  - Build command: `npx nuxt generate`
  - Publish directory: `packages/web/.output/public`
- Under Environment variables, add: `NODE_VERSION=22`
- Click Deploy site
```bash
# Generate the static site
pnpm build:web

# The output is in packages/web/.output/public/
# This directory contains plain HTML, CSS, JS — ready to upload

# Deploy via Netlify CLI
npx netlify-cli deploy --prod --dir=packages/web/.output/public
```

After deploying, confirm the site is fully static:
- View source on any page — you should see fully rendered HTML content, not an empty `<div id="app"></div>`
- The `netlify.toml` has no `/* → /index.html` rewrite (that would be SPA mode)
- Each route gets its own `index.html` in `.output/public/`
- In Netlify, go to Domain management → Add custom domain
- Enter `forgecrawl.com`
- Update your DNS records as instructed (either Netlify DNS or a CNAME record)
- Netlify provisions a free SSL certificate automatically

Note: The API lives at `api.forgecrawl.com` on your VPS — see the Deployment Architecture section above for DNS setup.
- Nuxt 4 + Nuxt UI 4 (same stack as the app)
- Dark/light mode toggle (defaults to dark)
- Full SEO: Open Graph, Twitter Card, canonical URL, structured meta
- OG image (`og-image.png`) — the forge-themed banner
- Responsive design with glassmorphism cards
- Sections: Hero, Features, How It Works, Why Markdown, API docs, Security, Get Started, Use Cases, Why Do I Need This?, Roadmap
- Static generation — zero server-side runtime
```
forgecrawl/
├── forgecrawl.config.ts        # Public config (ports, timeouts, session, etc.)
├── docker-compose.yml
├── ecosystem.config.cjs        # PM2 config (bare-metal deployment)
├── pnpm-workspace.yaml
├── .env                        # Secrets only (gitignored)
├── .env.example                # Secret key templates
├── packages/
│   ├── web/                    # Marketing site (static, deploys to Netlify)
│   │   ├── nuxt.config.ts
│   │   ├── netlify.toml
│   │   ├── app/
│   │   │   ├── pages/index.vue # Landing page
│   │   │   ├── app.config.ts   # Nuxt UI theme (orange palette)
│   │   │   └── assets/css/
│   │   └── public/
│   │       ├── og-image.png    # SEO/social sharing image
│   │       └── favicon.svg
│   └── app/                    # Nuxt 4 application
│       ├── nuxt.config.ts
│       ├── Dockerfile
│       ├── drizzle.config.ts
│       ├── app/                # Client (Nuxt 4 srcDir)
│       │   ├── pages/          # setup, login, index, scrapes/[id]
│       │   ├── composables/    # useAuth
│       │   ├── middleware/     # setup.global (auth + setup routing)
│       │   └── assets/css/
│       └── server/
│           ├── api/            # health, auth/*, scrape, scrapes/*
│           ├── middleware/     # JWT auth middleware
│           ├── engine/         # scraper, fetcher, extractor, converter, cache
│           ├── db/             # schema, migrations, init
│           ├── auth/           # password (bcrypt), jwt (jose), api-key (SHA-256)
│           └── utils/          # SSRF validation, rate limiter
└── docs/                       # Design documents
```
All API endpoints except the health check require authentication via a Bearer token (API key) or a session cookie.
```bash
# Health check
curl http://localhost:5150/api/health
```

Log into the web UI, navigate to the API Keys section, and create a key. The key (`fc_...`) is shown once — store it securely. Then use it in all requests:
```bash
# Set your API key (shown once when created)
export FC_KEY="fc_your_api_key_here"

# Scrape a URL
curl -X POST http://localhost:5150/api/scrape \
  -H "Authorization: Bearer $FC_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

# Scrape with cache bypass
curl -X POST http://localhost:5150/api/scrape \
  -H "Authorization: Bearer $FC_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "bypass_cache": true}'

# List scrapes
curl http://localhost:5150/api/scrapes \
  -H "Authorization: Bearer $FC_KEY"

# Get scrape detail
curl http://localhost:5150/api/scrapes/{job_id} \
  -H "Authorization: Bearer $FC_KEY"

# Delete a scrape
curl -X DELETE http://localhost:5150/api/scrapes/{job_id} \
  -H "Authorization: Bearer $FC_KEY"
```

```bash
# Create an API key (requires session cookie from login)
curl -X POST http://localhost:5150/api/auth/api-keys \
  -H "Authorization: Bearer $FC_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-script"}'

# List your API keys (key values are never shown — only prefixes)
curl http://localhost:5150/api/auth/api-keys \
  -H "Authorization: Bearer $FC_KEY"

# Revoke an API key
curl -X DELETE http://localhost:5150/api/auth/api-keys/{key_id} \
  -H "Authorization: Bearer $FC_KEY"
```

```js
const FC_URL = 'http://localhost:5150'
const FC_KEY = process.env.FC_KEY

const res = await fetch(`${FC_URL}/api/scrape`, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${FC_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ url: 'https://example.com' }),
})

const { markdown, title, wordCount } = await res.json()
console.log(`Scraped "${title}" (${wordCount} words)`)
```

Public configuration lives in `forgecrawl.config.ts` (single source of truth). Key settings:
| Setting | Default | Description |
|---|---|---|
| `server.port` | `5150` | HTTP port |
| `storage.mode` | `database` | Where results are stored |
| `scrape.timeout` | `30000` | Page fetch timeout (ms) |
| `scrape.cacheTtl` | `3600` | Cache TTL in seconds (`0` to disable) |
| `auth.sessionMaxAge` | `1296000` | JWT/cookie lifetime (15 days in seconds) |
| `rateLimit.loginMaxAttempts` | `5` | Failed logins before lockout |
| `rateLimit.loginWindowMs` | `900000` | Lockout window (15 minutes) |
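The `scrape.cacheTtl` setting behaves like any freshness check: a cached result is served until its age exceeds the TTL, and `0` disables caching. A sketch of that check — illustrative only, not ForgeCrawl's internal cache code:

```javascript
// Is a cached scrape still fresh? TTL is in seconds (matching scrape.cacheTtl);
// timestamps are in milliseconds. A TTL of 0 disables caching entirely.
function isCacheFresh(scrapedAtMs, cacheTtlSeconds, nowMs = Date.now()) {
  if (cacheTtlSeconds === 0) return false
  return (nowMs - scrapedAtMs) / 1000 < cacheTtlSeconds
}
```

With the default `3600`, a scrape from 30 minutes ago is served from cache; one from two hours ago triggers a fresh fetch (as does `bypass_cache: true` on the API).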
Secrets go in `.env` (gitignored):
```bash
NUXT_AUTH_SECRET=      # Min 32 chars, signs JWTs
NUXT_ENCRYPTION_KEY=   # Phase 5 (site credentials encryption)
NUXT_ALERT_WEBHOOK=    # Discord/Slack webhook (optional)
```

| Layer | Technology |
|---|---|
| Framework | Nuxt 4 (4.3.1) |
| UI | Nuxt UI 4 |
| Database | SQLite via better-sqlite3 + Drizzle ORM (WAL mode) |
| Auth | bcrypt (12 rounds) + jose JWT (HS256) |
| Content Extraction | Mozilla Readability (with cheerio fallback) |
| HTML to Markdown | Turndown + GFM plugin |
| Process Manager | PM2 or Docker |
ForgeCrawl is designed to be deployed on a server you control, scraping arbitrary URLs from the internet. Security is not an afterthought — it's built into every layer.
| Protection | Implementation |
|---|---|
| Password hashing | bcrypt with 12 salt rounds — resistant to brute-force and rainbow table attacks |
| JWT sessions | HS256-signed tokens in HTTP-only, Secure, SameSite=Lax cookies — never stored in localStorage or exposed to JavaScript |
| Constant-time verification | Password comparison uses `bcrypt.compare`, which is constant-time; login also hashes a dummy value on user-not-found to prevent timing-based user enumeration |
| Session expiry | Configurable JWT lifetime (default 15 days) with automatic expired-token detection and stale cookie cleanup |
| Auth secret validation | `NUXT_AUTH_SECRET` must be at least 32 characters; the app throws on first auth action if missing |
| API key auth | Bearer tokens (`fc_...`) with SHA-256 hashed storage — raw keys shown once at creation and never stored. Keys support optional expiry and are scoped to the creating user |
| Setup lockout | First-run admin registration (`/setup`) is permanently locked after the first admin account is created — stored in the database, not bypassable |
Scraping user-supplied URLs is a high-risk operation. ForgeCrawl blocks SSRF at multiple levels:
| Layer | What it blocks |
|---|---|
| Protocol allowlist | Only `http:` and `https:` — blocks `file://`, `ftp://`, `gopher://`, etc. |
| Hostname blocklist | `localhost`, `0.0.0.0`, `127.0.0.1`, `[::1]`, `metadata.google.internal` |
| IP range blocklist | Private ranges (`10.x`, `172.16-31.x`, `192.168.x`), loopback, link-local (`169.254.x`), cloud metadata (`169.254.169.254`), shared address space (`100.64-127.x`), IPv6 private/link-local |
| DNS resolution check | After URL parsing, resolves the hostname and checks the resolved IP against all blocklists — prevents DNS rebinding attacks where a domain points to a private IP |
| DNS failure = block | If DNS resolution fails, the request is blocked rather than allowed through — prevents SSRF bypass during DNS outages |
| Redirect re-validation | HTTP redirects are handled manually; each redirect target is re-validated through the full SSRF pipeline before following — prevents open-redirect SSRF bypass |
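The first few layers of this pipeline can be sketched in plain JavaScript. This is an illustration of the technique, not ForgeCrawl's actual validator — notably, it omits the DNS resolution and redirect re-validation layers described above, and the helper names are hypothetical:

```javascript
// Layer 2: hostname blocklist (exact matches)
const BLOCKED_HOSTS = new Set([
  'localhost', '0.0.0.0', '127.0.0.1', '[::1]', 'metadata.google.internal',
])

// Layer 3: private/reserved IPv4 ranges (literal-IP hostnames only)
function isPrivateIPv4(host) {
  const m = host.match(/^(\d+)\.(\d+)\.(\d+)\.(\d+)$/)
  if (!m) return false
  const [a, b] = [Number(m[1]), Number(m[2])]
  return a === 10 || a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254) ||          // link-local, incl. cloud metadata
    (a === 100 && b >= 64 && b <= 127)   // shared address space
}

function validateScrapeUrl(raw) {
  let url
  try { url = new URL(raw) } catch { return { ok: false, reason: 'invalid URL' } }
  // Layer 1: protocol allowlist
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    return { ok: false, reason: 'protocol not allowed' }
  }
  if (BLOCKED_HOSTS.has(url.hostname) || isPrivateIPv4(url.hostname)) {
    return { ok: false, reason: 'blocked host' }
  }
  return { ok: true }
}
```

The crucial part a sketch like this misses — and ForgeCrawl implements — is re-checking the *resolved* IP after DNS lookup and after every redirect, since `attacker.com` can simply point at `10.0.0.5`.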
| Protection | Implementation |
|---|---|
| SQL injection | Drizzle ORM parameterized queries throughout — no raw SQL, no string concatenation |
| XSS | Vue template auto-escaping on all rendered content; no `v-html` usage |
| Error message sanitization | Internal error messages are filtered before reaching the client — only known-safe error types are passed through |
| Rate limiting | Login endpoint: 5 failed attempts per email per 15-minute window with automatic lockout |
| Health endpoint | Exposes only version, database status, and setup state — no memory usage, uptime, or server internals |
- All scrape data queries are scoped to the authenticated user's ID
- Delete operations verify ownership before execution
- No cross-user data access is possible through the API
- No CSRF token — `SameSite=Lax` cookies block cross-origin POST, which covers all mutation endpoints. Same-site subdomain attacks are not mitigated, which is acceptable for a single-server self-hosted deployment.
- In-memory rate limiter — resets on server restart. Acceptable for single-user deployment; persistent rate limiting via SQLite planned for a future phase.
- Single user only — no multi-user management until Phase 4.
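For reference, the in-memory rate limiter mentioned above amounts to a small fixed-window counter keyed by lowercased email. A sketch of the technique — class name and structure are hypothetical, not ForgeCrawl's actual code:

```javascript
// Fixed-window, in-memory login rate limiter: N failed attempts per email
// per window. State lives in process memory, so it resets on restart —
// exactly the limitation noted above.
class LoginRateLimiter {
  constructor(maxAttempts = 5, windowMs = 15 * 60 * 1000) {
    this.maxAttempts = maxAttempts
    this.windowMs = windowMs
    this.attempts = new Map() // email -> { count, windowStart }
  }

  // Returns true if this failed attempt is still within the allowance,
  // false if the email is now locked out.
  recordFailedAttempt(email, now = Date.now()) {
    const key = email.toLowerCase() // case-insensitive: no bypass by varying case
    const entry = this.attempts.get(key)
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.attempts.set(key, { count: 1, windowStart: now })
      return true
    }
    entry.count += 1
    return entry.count <= this.maxAttempts
  }
}
```

A persistent variant would move the `Map` into a SQLite table so counts survive restarts — the future-phase plan noted above.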
ForgeCrawl includes 81 integration tests across 10 test files, organized into four categories: health, authentication, security, and scraping. Every test runs against a real production-built server over HTTP — no mocks, no test doubles.
```bash
# Run all 81 tests
pnpm test

# Run individual suites
pnpm test:health    # Health check endpoint (6 tests)
pnpm test:auth      # Auth: setup, login, API keys (21 tests)
pnpm test:security  # Security: middleware, SSRF, rate limiting, error sanitization, data isolation (42 tests)
pnpm test:scrape    # Scraping: fetch, cache, CRUD (10 tests)
```

Tests use Vitest with a custom global setup (`tests/setup/global-setup.ts`) that:
- Builds the Nuxt app for production (`pnpm build`)
- Starts the server as a child process on port 5199 with a clean, isolated test database
- Runs all test files sequentially — each file uses `ofetch` to make real HTTP requests against the running server
- Tears down the server process and deletes all test data when complete
This approach tests the full stack end-to-end: HTTP routing, middleware, database queries, and response formatting. Test helpers (`tests/setup/test-helpers.ts`) provide shared utilities for login, API key creation, and authenticated requests.
| Suite | File | Tests | Category |
|---|---|---|---|
| Health Check | `01-health` | 6 | Health |
| Auth: Setup | `02-auth-setup` | 6 | Auth |
| Auth: Login | `03-auth-login` | 7 | Auth |
| Auth: API Keys | `04-auth-api-keys` | 8 | Auth |
| Security: Auth Middleware | `05-security-auth-middleware` | 10 | Security |
| Security: SSRF Protection | `06-security-ssrf` | 18 | Security |
| Security: Rate Limiting | `07-security-rate-limit` | 4 | Security |
| Security: Error Sanitization | `08-security-error-sanitization` | 4 | Security |
| Scraping | `09-scrape` | 10 | Scraping |
| Security: Data Isolation | `10-security-data-isolation` | 6 | Security |
Validates the unauthenticated `/api/health` endpoint:
- Returns `200` with `status: "ok"`
- Returns a version string and database connection status
- Returns a `setup_complete` boolean (used by the UI to route to setup or login)
- Does not expose `memory` or `uptime` (prevents server fingerprinting)
- Confirms no authentication is required
Tests the one-time admin registration flow (`POST /api/auth/setup`):
- Rejects requests with missing email, short passwords (< 8 chars), or mismatched password confirmation — returns `400`
- Creates the admin account and sets an HTTP-only session cookie on success
- Permanently locks the setup endpoint after the first admin is created — all subsequent attempts return `403`, regardless of credentials
- Adapts assertions if setup was already completed by an earlier test file (order-independent)
Tests credential validation and session management (`POST /api/auth/login`):
- Rejects missing fields with `400`
- Rejects wrong password and non-existent email with identical `401` responses and the same `"Invalid credentials"` message — prevents user enumeration (an attacker cannot determine which emails have accounts)
- Returns `200` with a `forgecrawl_session` cookie that has `HttpOnly`, `SameSite=Lax`, and `Path=/` flags
- Returns user data (`email`, `role`) on successful login
- Verifies the response body never contains `passwordHash`, `password_hash`, or bcrypt hashes (`$2b$`)
Tests the Bearer-token authentication lifecycle (`/api/auth/api-keys`):
- Creates an API key with the `fc_` prefix and a 64-character hex suffix (e.g., `fc_a1b2c3...`)
- Returns the raw key exactly once at creation with a warning it won't be shown again
- Rejects key creation without a `name` field
- Lists keys showing only the prefix (`fc_a1b2`) and name — never exposes the full key or SHA-256 hash
- Authenticates requests using the `Authorization: Bearer fc_...` header
- Rejects invalid Bearer tokens and tokens with wrong prefixes with `401`
- Revokes a key via `DELETE /api/auth/api-keys/{id}` and confirms the revoked key is immediately rejected
Tests the authentication middleware that protects all `/api/*` routes:
- Public routes: `GET /api/health` is accessible without authentication
- Protected routes: `GET /api/scrapes`, `GET /api/auth/me`, `POST /api/scrape`, and `GET /api/auth/api-keys` all return `401` without credentials
- Accepts both session cookie auth and Bearer token auth
- When both a Bearer token and a cookie are present, the Bearer token takes priority — allows API key auth even with a stale cookie
- Rejects tampered JWT cookies (`forgecrawl_session=tampered.jwt.token`) with `401`
- Rejects completely invalid cookies with `401`
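The precedence rule — Bearer token wins when both credentials are present — can be sketched like this. The header and cookie names follow this README; the function itself is illustrative, not ForgeCrawl's middleware:

```javascript
// Pick which credential to authenticate with. A Bearer API key takes
// priority over the session cookie, so scripts work even if a stale
// browser cookie is also sent along.
function pickCredential(headers, cookies) {
  const auth = headers['authorization'] || ''
  if (auth.startsWith('Bearer fc_')) {
    return { type: 'api-key', value: auth.slice(7) } // strip "Bearer "
  }
  if (cookies['forgecrawl_session']) {
    return { type: 'session', value: cookies['forgecrawl_session'] }
  }
  return null // no credentials -> 401
}
```
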
The largest test suite — validates that user-supplied URLs cannot be used to access internal resources:
- Localhost/loopback: Blocks `localhost`, `127.0.0.1`, `0.0.0.0`, `[::1]`
- Cloud metadata: Blocks `169.254.169.254` (AWS EC2 metadata) and `metadata.google.internal` (GCP)
- Private IP ranges: Blocks `10.x.x.x` (Class A), `172.16.x.x` (Class B), `192.168.x.x` (Class C)
- Protocol allowlist: Blocks `ftp://`, `file://`, `gopher://`, `javascript:` — only `http:` and `https:` are allowed
- Invalid URLs: Blocks malformed input like `not-a-valid-url`
- Positive test: Confirms a valid external HTTPS URL (`https://example.com`) succeeds
All blocked URLs return `400`. This prevents attackers from using the scraper to probe internal networks, read cloud instance credentials, or access services behind the firewall.
Tests the login rate limiter that prevents brute-force password attacks:
- Allows the first 5 failed login attempts (each returns `401`)
- Blocks the 6th attempt with `429 Too Many Requests`
- Rate limiting is case-insensitive — `USER@TEST.COM` and `user@test.com` share the same counter (prevents bypass by varying case)
- Uses a unique email per test run to avoid interference with other test suites
Verifies that internal server details never leak to clients:
- Scrape failures do not expose file paths (`/home/`, `/app/`, `/Users/`, `node_modules`)
- Scrape failures do not expose source file references (`.ts:`, `.js:`)
- Non-existent domain scrapes return `400` or `500` with a generic message — not raw Node.js error details
- Login errors return the same message for wrong-email and wrong-password cases — reconfirms no user enumeration from a different angle
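An allowlist-based sanitizer of the kind these tests exercise might look like the following sketch. The specific safe messages are invented for illustration — ForgeCrawl's actual list and wording may differ:

```javascript
// Only messages we know are safe to show pass through; anything else
// (Node errors with paths, stack fragments, driver errors) is replaced
// with a generic message. Hypothetical allowlist for illustration.
const SAFE_ERRORS = new Set([
  'URL is required',
  'Blocked by SSRF protection',
  'Invalid credentials',
])

function sanitizeError(err) {
  const message = err instanceof Error ? err.message : String(err)
  return SAFE_ERRORS.has(message) ? message : 'Scrape failed'
}
```

Allowlisting is the key design choice: a blocklist of "dangerous" substrings will always miss something, while an allowlist fails closed.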
Tests the core scraping engine and result management:
- Rejects `POST /api/scrape` without a URL (`400`)
- Scrapes `https://example.com` and validates: `job_id`, `title` ("Example Domain"), Markdown with frontmatter (`---`), `wordCount > 0`, and `metadata.url`
- Validates YAML frontmatter contains `title:`, `url:`, `scraped_at:`, and `scraper: ForgeCrawl/`
- Returns `cached: true` on repeated scrape of the same URL
- Returns `cached: false` when `bypass_cache: true` is set
- Lists all scrapes for the authenticated user with `url` and `status` fields
- Gets scrape detail by ID with full Markdown content
404for non-existent scrape IDs - Deletes a scrape and confirms it returns
404afterward
Validates that users cannot access or modify other users' data:
- Attempting to delete a non-owned scrape returns `404` (not `403`) — prevents confirming whether a resource exists
- Attempting to view a non-owned scrape returns `404`
- Attempting to delete a non-owned API key returns `404`
- Scrape list returns only the current user's data (returns a valid array)
- API key list returns only keys belonging to the current user, all with the `fc_` prefix
The deliberate use of `404` instead of `403` is a security pattern — returning `403 Forbidden` would confirm the resource exists but is owned by someone else, which leaks information.
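This pattern falls out naturally when every query is scoped by user ID, as in this hypothetical data-layer sketch — a non-owned ID behaves exactly like a missing one:

```javascript
// Scope the lookup by both resource ID and owner. Because the query can't
// distinguish "doesn't exist" from "exists but isn't yours", both cases
// yield 404 and no ownership information leaks. Illustrative sketch only.
function getScrapeForUser(db, scrapeId, userId) {
  const row = db.find(s => s.id === scrapeId && s.userId === userId)
  return row ? { status: 200, body: row } : { status: 404 }
}

const db = [{ id: 'a1', userId: 'u1', title: 'Example' }]
```
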
ForgeCrawl is actively developed across five phases. Phase 1 is complete; Phases 2-5 are planned.
Single-URL HTTP scraping, JWT auth, API keys, SSRF protection, caching, scrape history, 81 integration tests. See Current Status above.
- Puppeteer engine — shared browser instance with configurable concurrency for SPAs and JS-heavy pages
- Render mode toggle — HTTP-only (fast) or Puppeteer (full JS) per scrape
- `wait_for` selector — wait for a specific CSS selector before extracting content
- Browser crash recovery — auto-restart on Puppeteer disconnect
- PDF & DOCX extraction — detect content-type and convert to Markdown
- Configurable storage — database, filesystem, or both with clean abstraction layer
- Enhanced Markdown — better handling of code blocks, tables, and nested structures
- Scrape config UI — toggle JS rendering, set wait selectors, choose storage mode
See `docs/forgecrawl-02-phase2.md` for the full specification.
- Sitemap crawling — crawl entire sitemaps or subsections by URL pattern
- Async job queue — SQLite-backed queue with real-time progress tracking
- Crawl controls — max depth, max pages, include/exclude URL filters
- robots.txt compliance — automatic fetching and rule enforcement
- Per-domain rate limiting — configurable delay between requests to the same host
- Pause, resume, cancel — full lifecycle controls for active crawls
- Progress UI — real-time dashboard showing pages discovered, scraped, failed, and queued
See `docs/forgecrawl-03-phase3.md` for the full specification.
- Admin user management — create, disable, and delete user accounts
- Role enforcement — admin vs. user permissions
- Per-user usage stats — scrapes count, pages scraped, storage consumed
- Per-user rate limits — configurable by admin
- Auto-generated API docs — documentation page built from route definitions
See `docs/forgecrawl-04-phase4.md` for the full specification.
- Token-aware chunking — configurable max tokens with overlap and semantic boundaries
- Chunk metadata — heading context, position, source URL, token count
- Login-gated scraping — cookie injection and form-based login via Puppeteer
- Export formats — JSON, JSONL, or zipped Markdown for pipeline integration
- Production hardening — monitoring, log rotation, backup strategy, alerting webhooks
See `docs/forgecrawl-05-phase5.md` for the full specification.
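Token-aware chunking is not implemented yet, but the core technique — a fixed token budget with overlap between adjacent chunks — can be sketched as follows. This is entirely illustrative: it uses a crude chars/4 token estimate, and Phase 5's real implementation would also respect semantic boundaries like headings:

```javascript
// Split Markdown into chunks under a token budget, with overlapping tails so
// context isn't lost at chunk boundaries. Token counts are approximated as
// length / 4 — a placeholder for a real tokenizer.
function chunkMarkdown(text, maxTokens = 512, overlapTokens = 64) {
  const maxChars = maxTokens * 4
  const overlapChars = overlapTokens * 4
  const chunks = []
  let start = 0
  while (start < text.length) {
    const end = Math.min(start + maxChars, text.length)
    chunks.push(text.slice(start, end))
    if (end === text.length) break
    start = end - overlapChars // step back so chunks overlap
  }
  return chunks
}
```
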
