Multi-tenant inference gateway prototype with:
- token-aware admission control
- context-window-aware prompt compaction (head/tail truncation)
- KV-pressure load shedding
- adapter-aware routing metadata
- continuous batching scheduler simulation
- concurrent in-flight request ID uniqueness enforcement
- Prometheus telemetry for TTFT/TPOT/queue pressure
Setup:

```shell
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
```

Run:

```shell
uvicorn modelop.main:app --host 0.0.0.0 --port 8000
```

If your environment is offline and cannot install new packages, run directly with:

```shell
PYTHONPATH=src uvicorn modelop.main:app --host 0.0.0.0 --port 8000
```

The app now falls back to no-op telemetry export when `prometheus_client` is unavailable.
Tests:

Primary:

```shell
PYTHONPATH=src python -m unittest discover -s tests -v
```

Optional (if pytest is installed):

```shell
PYTHONPATH=src pytest -q
```

API endpoints:

- `POST /v1/generate`
- `GET /metrics`
- `GET /health`
Patterns used by large AI serving systems:
- Token/content window optimization: Requests reserve output tokens first, then compact over-budget prompts to fit the model window while preserving early and recent context.
- Concurrent traffic management: Tenant token buckets, queueing, and continuous batching keep throughput stable under simultaneous requests.
- Request uniqueness under concurrency: an in-flight `request_id` registry rejects a duplicate ID with 409 while the same ID is still running.
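The head/tail compaction described above can be sketched as follows. This is a minimal illustration, not the gateway's actual code; the function name, the `head_frac` split, and operating on pre-tokenized input are assumptions.

```python
def compact_prompt(tokens, model_window, max_new_tokens, head_frac=0.5):
    """Reserve output tokens first, then fit an over-budget prompt into the
    remaining window by keeping the head and tail and dropping the middle."""
    budget = model_window - max_new_tokens  # prompt budget after reservation
    if budget <= 0:
        raise ValueError("max_new_tokens exceeds the model window")
    if len(tokens) <= budget:
        return tokens, False  # already fits; not truncated
    head = int(budget * head_frac)  # early context to preserve
    tail = budget - head            # recent context to preserve
    return tokens[:head] + tokens[-tail:], True

# 100-token prompt, 72-token window, 32 tokens reserved for output:
# budget = 40, so the first 20 and last 20 tokens survive.
compacted, truncated = compact_prompt(list(range(100)),
                                      model_window=72, max_new_tokens=32)
```

Reserving `max_new_tokens` before compaction is what guarantees the generation itself can never be squeezed out by a long prompt.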
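The per-tenant token buckets mentioned above can be sketched like this. A hypothetical illustration, assuming admission cost is measured in tokens (prompt plus reserved output); the class and parameter names are not from the project.

```python
import time

class TokenBucket:
    """Per-tenant token budget: a request is admitted only if its token
    cost fits the tenant's current balance, which refills over time."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity          # start full
        self.last = time.monotonic()

    def try_admit(self, cost: float) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # shed or queue the request

bucket = TokenBucket(capacity=1000, refill_per_sec=100)
print(bucket.try_admit(800))  # True: within budget
print(bucket.try_admit(800))  # False: bucket nearly drained
```

Because the cost is denominated in tokens rather than requests, one tenant sending a few huge prompts is throttled the same as one sending many small ones.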
Example request:

```json
{
  "tenant_id": "tenant-a",
  "request_id": "req-12345",
  "prompt": "long prompt ...",
  "max_new_tokens": 256
}
```

The response includes:

- `prompt_truncated`
- `original_prompt_tokens`
- `effective_prompt_tokens`
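The duplicate-`request_id` rejection can be sketched as a small in-flight registry. This is an illustrative stand-in, not the gateway's implementation; the class name and methods are assumptions.

```python
import threading

class InFlightRegistry:
    """Tracks request IDs currently being served; a second request with
    the same ID is rejected (the gateway maps this to HTTP 409)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = set()

    def acquire(self, request_id: str) -> bool:
        with self._lock:
            if request_id in self._active:
                return False  # duplicate while in flight -> 409
            self._active.add(request_id)
            return True

    def release(self, request_id: str) -> None:
        with self._lock:
            self._active.discard(request_id)

registry = InFlightRegistry()
print(registry.acquire("req-12345"))  # True: first occurrence
print(registry.acquire("req-12345"))  # False: duplicate while running
registry.release("req-12345")
print(registry.acquire("req-12345"))  # True: reusable after completion
```

Note that uniqueness is only enforced while a request is in flight; once released, the same ID is accepted again, which matches the "already running" wording above.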
Chaos testing:

```shell
python scripts/chaos_matrix.py --base-url http://127.0.0.1:8000 --scenario skewed-burst
```

- ADR: `ADR-001-inference-gateway.md`
- Grafana dashboard JSON: `dashboards/grafana-inference-gateway.json`
- Load test post-mortem template: `LOAD_TEST_POSTMORTEM.md`