The Offline Serving And Orchestration Plane For AX Fabric
AX Serving is the offline serving and orchestration layer behind AX Fabric. It provides OpenAI-compatible APIs, runtime model lifecycle control, scheduling, metrics, and multi-worker routing on Apple Silicon.
For inference execution, AX Serving uses:
- `llama.cpp` by default for all model loads
- `ax-engine` when explicitly requested via the `native` backend override
AX Fabric is the product-facing layer for retrieval, knowledge, and grounded agent workflows. AX Serving is the infrastructure layer that makes that stack deployable and operable.
Status: production-ready Rust workspace for Apple Silicon (aarch64-apple-darwin) with OpenAI-compatible REST, gRPC, runtime model management, and optional multi-worker orchestration.
What AX Serving is:
- the serving/control-plane subsystem behind AX Fabric
- not the end-user product surface
- not the low-level inference engine itself
What AX Serving is for:
- single-node offline serving
- multi-worker model routing and health-aware dispatch
- operator-facing runtime control, diagnostics, metrics, and audit surfaces
- OSS and Business deployments of the public repo
| Capability | OSS | Business |
|---|---|---|
| OpenAI-compatible REST + gRPC serving | Yes | Yes |
| Runtime model load/unload/reload APIs | Yes | Yes |
| Scheduler controls, metrics, dashboard, and admin APIs | Yes | Yes |
| Benchmark/soak tooling (`ax-serving-bench`) | Yes | Yes |
| Single-node offline runtime | Yes | Yes |
| Multi-worker Mac Grid orchestration | No | Yes |
| Commercial licensing terms | No | Included |
| Optional support arrangements | No | By agreement |
OSS
- License: AGPL-3.0-only
- Best for local builders, evaluation, and teams operating under OSS terms
- Single-node serving surface in the public repo
Business
- Includes everything in OSS
- Available under commercial terms as an alternative to AGPL obligations
- Adds multi-worker Mac Grid deployment in the public repo
- License key activation via `AXS_LICENSE_KEY` or `POST /v1/license`
Private enterprise work is handled in a separate private project. This public repository covers OSS and Business.
Prerequisites:
- Apple Silicon macOS
- Rust toolchain
- `llama-server` on `PATH` for `llama.cpp` fallback and explicit `llama_cpp` loads
- a GGUF model file
Validate your environment:
```sh
cargo check --workspace
which llama-server
```

Backend modes:

- `native` = explicit `ax-engine`
- `llama_cpp` = `llama-server` (default when backend is omitted)
- `auto` = try native first, then fall back to `llama.cpp` on unsupported architectures
Start the simplest local runtime:
```sh
AXS_ALLOW_NO_AUTH=true \
cargo run -p ax-serving-cli --bin ax-serving -- serve \
  -m ./models/<model>.gguf \
  --model-id default \
  --host 127.0.0.1 \
  --port 18080
```

Send a request:
```sh
curl -sS http://127.0.0.1:18080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Give me three short points about Rust."}],
    "stream": false,
    "max_tokens": 96
  }'
```

For fuller setup paths, see QUICKSTART.md:
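Because the endpoint is OpenAI-compatible, any HTTP client works. Here is a stdlib-only Python sketch that mirrors the curl call above (the helper names are ours, not part of any SDK):

```python
import json
import urllib.request

def build_chat_payload(model: str, content: str, max_tokens: int = 96) -> dict:
    """Build the same request body as the curl example above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "stream": False,
        "max_tokens": max_tokens,
    }

def chat(base_url: str, model: str, content: str) -> dict:
    """POST a non-streaming chat completion (assumes the serve command above is running)."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, content)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage:
# reply = chat("http://127.0.0.1:18080", "default", "Give me three short points about Rust.")
```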
- single runtime
- authenticated offline deployment
- gateway + workers
- model management
- embeddings
TypeScript SDK (Zod-validated):
```sh
cd sdk/javascript
npm install
npm run build
```

Most local runtimes focus on single-process inference. AX Serving focuses on the operational layer above inference:
- OpenAI-compatible REST and gRPC serving
- runtime model load/unload/reload
- admission queueing and concurrency control
- metrics, dashboard, diagnostics, and audit surfaces
- multi-worker orchestration for Business deployments
- benchmark and soak tooling in the same repo
Positioning:
- AX Fabric is the product layer
- AX Serving is the serving and orchestration layer underneath it
- inference runtimes such as `ax-engine` and `llama.cpp` remain lower-level execution backends
AX Serving is not itself the token-generation engine. It is the serving layer that routes requests into lower-level runtimes.
- `llama.cpp` is the default backend for model loading across families
- `ax-engine` remains an explicit opt-in path for environments that can benefit from native execution
- routing between those backends is controlled through `config/backends.yaml`
In practice, this means AX Serving owns the APIs, scheduling, orchestration, health, metrics, and model lifecycle, while model execution defaults to llama.cpp with ax-engine as an explicit override.
AX Serving is designed to work with AX Fabric as part of one complete system.
- AX Serving: execution control plane, model lifecycle, routing, scheduling, APIs
- AX Fabric: document ingestion, vector search, BM25/hybrid retrieval, MCP-native data access
- Together: AX Fabric is the product layer; AX Serving is the execution layer underneath it
| Capability | AX Serving |
|---|---|
| OpenAI-compatible chat/completions/embeddings | ✅ |
| Streaming SSE + non-streaming responses | ✅ |
| Runtime model management (`/v1/models`) | ✅ |
| Multi-worker orchestration (`ax-serving-api`) | ✅ |
| Dispatch policies (`least_inflight`, `weighted_round_robin`, `model_affinity`, `token_cost`) | ✅ |
| Scheduler queue/inflight controls | ✅ |
| Prometheus + JSON metrics | ✅ |
| Embedded dashboard (`/dashboard`) | ✅ |
| Built-in benchmarking (`ax-serving-bench`) | ✅ |
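As a concrete illustration of the dispatch policies row above, `least_inflight` routes each request to the worker with the fewest requests currently in flight. A toy sketch (the real implementation lives in `ax-serving-api`; the function signature here is ours):

```python
def least_inflight(workers: dict[str, int]) -> str:
    """Pick the worker id with the fewest in-flight requests.

    `workers` maps worker id -> current in-flight count. Ties break on
    worker id for determinism, a simplifying assumption of this sketch.
    """
    return min(workers, key=lambda wid: (workers[wid], wid))
```

The other policies vary only the scoring key: `weighted_round_robin` cycles by configured weight, `model_affinity` prefers workers that already have the target model loaded, and `token_cost` weighs estimated token load.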
One-shot generation:

```sh
cargo run -p ax-serving-cli --bin ax-serving -- \
  -m ./models/<model>.gguf \
  -p "Hello from AX Serving" \
  -n 128
```

Single-node server:

```sh
AXS_ALLOW_NO_AUTH=true \
cargo run -p ax-serving-cli --bin ax-serving -- serve \
  -m ./models/<model>.gguf \
  --model-id default \
  --port 18080
```

Gateway:
```sh
AXS_ALLOW_NO_AUTH=true \
cargo run -p ax-serving-cli --bin ax-serving-api -- \
  --port 18080 \
  --internal-port 19090 \
  --policy least_inflight
```

Worker:
```sh
AXS_ALLOW_NO_AUTH=true \
cargo run -p ax-serving-cli --bin ax-serving -- serve \
  -m ./models/<model>.gguf \
  --model-id default \
  --port 18081 \
  --orchestrator http://127.0.0.1:19090
```

Runtime (`ax-serving`) endpoints:

- `POST /v1/chat/completions`
- `POST /v1/completions`
- `POST /v1/embeddings`
- `GET /v1/models`
- `POST /v1/models`
- `DELETE /v1/models/{id}`
- `POST /v1/models/{id}/reload`
- `GET /health`
- `GET /v1/metrics`
- `GET /metrics`
- `GET /dashboard`
- `GET /v1/license`
- `POST /v1/license`
- `GET /v1/admin/status`
- `GET /v1/admin/startup-report`
- `GET /v1/admin/diagnostics`
- `GET /v1/admin/audit`
- `GET /v1/admin/policy`
Gateway (`ax-serving-api`) endpoints:

- `POST /v1/chat/completions`
- `POST /v1/completions`
- `POST /v1/embeddings`
- `GET /v1/models`
- `GET /health`
- `GET /v1/metrics`
- `GET /v1/license`
- `POST /v1/license`
- `GET /v1/admin/status`
- `GET /v1/admin/startup-report`
- `GET /v1/admin/diagnostics`
- `GET /v1/admin/audit`
- `GET /v1/admin/policy`
- `GET /v1/admin/fleet`
- `GET /v1/workers`
- `GET /v1/workers/{id}`
- `POST /v1/workers/{id}/drain`
- `POST /v1/workers/{id}/drain-complete`
- `DELETE /v1/workers/{id}`
Runtime health contract:
- `GET /health` is liveness plus readiness, not just process-up status
- `status=ok` means the runtime is ready and at least one model is available
- `status=degraded` means the process is alive but either no model is loaded or the runtime is thermally constrained
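A client-side readiness gate following this contract might look like the sketch below (the top-level `status` field matches the contract above; the helper name is our own):

```python
def is_ready(health: dict) -> bool:
    """True only when the runtime is ready to serve, per the health contract:
    "ok" means ready with at least one model loaded; "degraded" means the
    process is alive but not serviceable (no model, or thermally constrained).
    """
    return health.get("status") == "ok"
```

A load balancer or deploy script would poll `GET /health` and only route traffic once this predicate holds.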
AX Fabric integration contract:
- documented in docs/contracts/ax-fabric-runtime-contract.md
Admin/control-plane notes:
- all authenticated admin responses preserve `X-Request-ID`
- `GET /v1/admin/status` gives an operational summary
- `GET /v1/admin/startup-report` and `GET /v1/admin/diagnostics` are for runtime inspection
- worker inventory and drain APIs are orchestrator-only
- `AXS_SPLIT_SCHEDULER=true` enables prefill/decode activity tracking in scheduler metrics
- `AXS_MAX_BATCH_SIZE` / `AXS_BATCH_WINDOW_MS` are currently advisory scheduler hints only; they are exposed for future scheduler work and do not drive a batching loop today
Relevant scheduler metrics:
- `prefill_tokens_active`
- `decode_sequences_active`
- `split_scheduler_enabled`
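For example, these gauges can be pulled out of a `GET /metrics` Prometheus scrape with a few lines of stdlib Python (a sketch that assumes the gauges appear as plain unlabelled `name value` lines; label handling is omitted):

```python
def read_gauges(exposition: str, names: set[str]) -> dict[str, float]:
    """Extract unlabelled gauge values from Prometheus text exposition."""
    out = {}
    for line in exposition.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        parts = line.split()
        if len(parts) == 2 and parts[0] in names:
            out[parts[0]] = float(parts[1])
    return out
```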
- If `AXS_API_KEY` is set, protected endpoints require bearer auth.
- If `AXS_API_KEY` is unset, startup requires `AXS_ALLOW_NO_AUTH=true`.
Recommended offline enterprise startup:
```sh
AXS_CONFIG=config/serving.offline-enterprise.yaml \
AXS_API_KEY="change-me" \
AXS_MODEL_ALLOWED_DIRS="/absolute/path/to/models" \
cargo run -p ax-serving-cli --bin ax-serving -- serve \
  -m /absolute/path/to/models/<model>.gguf \
  --model-id default
```

Multiple API keys:

```sh
AXS_API_KEY="token1,token2" cargo run -p ax-serving-cli --bin ax-serving -- serve -m ./models/<model>.gguf
```

Client header:

```
Authorization: Bearer token1
```

Checks:

```sh
cargo check --workspace
cargo fmt --all -- --check
cargo clippy --workspace --tests -- -D warnings
cargo test --workspace
```

Integration tests (no model required — uses in-process mock servers):

```sh
AXS_ALLOW_NO_AUTH=true cargo test -p ax-serving-api --test orchestration
AXS_ALLOW_NO_AUTH=true cargo test -p ax-serving-api --test model_management
AXS_ALLOW_NO_AUTH=true cargo test -p ax-serving-api --test graceful_shutdown
```

Release build:

```sh
cargo build --workspace --release
```

All tests run automatically in CI on every push and pull request against `main`. No model file or GPU is required — tests use in-process backends (`NullBackend`, `EchoBackend`, `FailingUnloadBackend`) that exercise the full request path without hardware.
Exact test counts change over time. Use the linked CI badge and workflow runs as the source of truth.
| Suite | What It Covers |
|---|---|
| Unit — serving API | Scheduler (permits, AIMD, TTFT histogram, split prefill/decode), model registry (lifecycle, idle eviction, capacity), orchestration (queue, dispatch policies, worker registry, DashMap), REST helpers (cache key normalisation, cache hit ratio), config (env layering, validation), gRPC status mapping, auth, metrics |
| Unit — engine | Backend routing, GGUF metadata parsing, thermal state, memory budget |
| Unit — C shim | Null-safe llama.h ABI compatibility |
| Integration — model_management | Auth (Bearer, whitespace tolerance, 401+WWW-Authenticate), model load/unload/reload (201/200/409/404/503), health semantics (ok/degraded/critical-thermal/no-models), input validation (400/422 on every field), full inference path (chat + completions via EchoBackend), embeddings, security response headers, metrics JSON keys, dashboard HTML, license GET/SET |
| Integration — orchestration | Worker register/heartbeat/eviction, dispatch (least-inflight, weighted round-robin, model-affinity, token-cost), queue admission and backpressure, reroute on 5xx, chaos (all workers fail → 503), overload (queue full → 429) |
| Integration — graceful_shutdown | In-flight request drains to completion before server exits |
Every CI run posts a test summary to the GitHub Actions job summary page — see the Actions tab for per-run results.
```sh
cargo run -p ax-serving-bench --release -- bench -m ./models/<model>.gguf
```

Other benchmark modes:
- `profile`
- `mixed`
- `cache-bench`
- `soak`
- `compare`
- `regression-check`
- `multi-worker`
- `crates/ax-serving-engine`: backend abstraction, routing, model internals
- `crates/ax-serving-api`: REST/gRPC serving, scheduler, orchestration
- `crates/ax-serving-cli`: `ax-serving` and `ax-serving-api` binaries
- `crates/ax-serving-bench`: benchmark and soak runners
- `crates/ax-serving-shim`: C-compatible shim
- `crates/ax-serving-py`: Python bindings
- `config/`: serving and routing configuration
- `docs/`: runbooks and architecture notes
- QUICKSTART.md
- `docs/contracts/ax-fabric-runtime-contract.md`
- `sdk/javascript/README.md` (TypeScript SDK with Zod validation)
- `sdk/python/` (Python SDK)
- `docs/runbooks/multi-worker.md`
- `docs/perf/service-tuning.md`
- Open-source terms: AGPL v3 text and licensing guide
- Commercial terms: commercial license
- Issue reporting policy: CONTRIBUTING.md