Migrate to Axum [part 1]: Add axum scaffold with dual-listener proxy bridge #4426
Conversation
Part 1 of incremental Trillium-to-Axum migration (#4283). Sets up the infrastructure for migrating HTTP handlers one batch at a time:
- Add `axum`, `tower-http`, and `trillium-proxy` workspace dependencies
- Add `HttpMetrics`, `ErrorCode`, `http_metrics_middleware`, and `axum_instrumented` middleware to `aggregator_core` (coexisting with the existing Trillium instrumented handler)
- Add `BYTES_HISTOGRAM_BOUNDARIES` constant to `aggregator_core`
- In `AggregatorHandlerBuilder::build()`, spawn an empty axum router on a local listener and configure a trillium-proxy `Proxy` as the final fallback handler -- unmatched Trillium routes are forwarded to axum
- Make `build()` async; update all call sites with `.await`
(Force-pushed 61df9ce to 83442cb)
tgeoghegan
left a comment
This looks fundamentally sound, but I think we should exercise the metrics middleware in this PR and validate that everything looks the way our alerts and dashboards will expect.
aggregator_core/src/lib.rs (outdated)

```rust
/// HTTP server metrics, layered as an `Extension` on all axum routes.
#[derive(Clone)]
pub struct HttpMetrics {
```
I think it's good for this stuff to be in janus_aggregator_core so that both janus_aggregator and janus_aggregator_api can consume it. However I'd rather not dump more stuff into the root of the crate. It's true that we already had InstrumentedHandler and a couple other odds and ends here, but I think at this point we can justify a new aggregator_core/src/http_server.rs.
Agreed... I still don't have a good mental heuristic for the cleanliness of Rust projects, but I'm getting there.
I've been puzzling over what to move and what not to. I'm on the fence about the instrumented / InstrumentedHandler that Trillium uses, since we're going to delete them... ah whatever, might as well move them too.
```toml
tower = "0.5"
tower-http = { version = "0.6", features = ["cors", "trace"] }
```
It looks like we don't use tower directly in this PR, but we will as of #4427. So I'm guessing this is a benign error from splitting the PRs up.
Yeah, sorta. I did think about it, but in the context of "these are the new crates, might as well get the conversation about them out of the way in Part 1."
aggregator_core/src/lib.rs (outdated)

```rust
/// These boundaries are intended to be used with measurements having the unit of "bytes".
pub const BYTES_HISTOGRAM_BOUNDARIES: &[f64] = &[
```
Huh, it's weird that we were previously defining these in janus_aggregator_api. janus_aggregator_core is a better place. I'm concerned about janus_aggregator_api::BYTES_HISTOGRAM_BOUNDARIES lingering after the transition though. It's pub so I don't think the compiler or clippy will flag it as unused. Can we delete it now and cut everything over to this constant?
aggregator_core/src/lib.rs (outdated)

```rust
.extensions()
    .get::<MatchedPath>()
    .map(|p| p.as_str().trim_start_matches('/').to_string())
    .unwrap_or_else(|| "unknown".to_string());
```
What does route look like here, especially when there are request path variables? Currently we get labels on time series like route="tasks/:task_id/reports". Will we need to rewrite alerts or dashboards? Or perhaps we should rewrite the route to match our existing format.
It's not pretty to do the rewrite, but at least for now I will, and then we can decide which way to go. I'm adding .replace('{', ":") and .replace('}', "") after the trim_start_matches here and below; that seems to work fine.
aggregator_core/src/lib.rs (outdated)

```rust
/// Axum middleware that instruments request handlers with tracing spans.
pub async fn axum_instrumented(request: Request<Body>, next: Next) -> Response {
```
This seems like an inferior version of http_metrics_middleware. Do we plan to use it anywhere?
Definitely -- that's pub async fn instrumented in https://github.com/divviup/janus/pull/4409/changes#diff-757a8c633835a56ca77a6f9e8d46679327b47e5fa756cc9695d9b01dcc76ee6dR237 -- but you're right that it was unused in this commit. Now it won't be, since per your other comments I'm going ahead and wiring it up. :)
```rust
// Bind a local listener for the axum router and spawn it.
let axum_listener = tokio::net::TcpListener::bind("127.0.0.1:0")
```
I don't mind the plan of proxying the Trillium handler to Axum to enable migration. But I don't get how this will work. The Trillium router will bind the SocketAddr from the Janus config and thus receive actual messages from the internet. Any route Trillium doesn't understand gets passed on to the axum router. It's impossible for anything on the network to actually talk to the Axum router directly, because the port bound by axum_listener won't be exposed in Kubernetes nor hooked up to a cloud load balancer or anything. So I guess once we have migrated all the routes, we tear down all the Trillium stuff and Axum starts binding the address in the Janus config?
I think that makes sense. I am slightly concerned about always binding IPv4 here. I feel like it's possible for localhost to be IPv6 only. Can we bind "localhost:0" or :0 and have it just grab whichever loopback exists?
You've got the idea, and yeah I will make it localhost, but also this is never going to be deployed as-is -- this is just to keep the tests running as we do the piecemeal conversion, and then ultimately all the routes will be directly consumed by Axum instead of Trillium.
```rust
State(self.aggregator),
metrics,
router,
StatusCounter::new(self.meter),
```
OK, so, IIUC, StatusCounter will intercept responses being sent by Trillium and then it increments the janus_aggregator_responses metric(s) based on response status. If Trillium doesn't send a response because it doesn't recognize the route, then presumably StatusCounter::before_send never gets invoked, and then the unhandled request gets handed off to proxy and thus Axum. That all scans. But AFAICT we never hook up anything to the Axum router that manages janus_aggregator_responses. Do we need to stick the http_metrics_middleware somewhere in the axum router? I think this PR should prove that the middleware works, including checking metrics counters as I think we do for Trillium.
Totally fair assessment. I'm pulling some more out of the bulk PR into here -- to explain, I was going for the minimum needed to prove the routing, not the routing and the stats, but obviously the stats are important.
An unexpected consequence of pulling it forward and having both frameworks log to Prometheus at the same time is that we end up with some phantom None entries for the same label key, which have to get filtered out.
and this spreads to some other tests, ugh.
But it's only the tests where we interact with opentelemetry_sdk::metrics directly that are affected, so that's not so bad. It might pile up over time, but not yet.
```rust
    .await;
assert_eq!(conn.status(), Some(trillium::Status::Ok));
let body = test_util::take_response_body(&mut conn).await;
assert_eq!(body, b"axum OK");
```
Does the trillium stack have tests that verify that metrics get incremented as expected? We should duplicate those for Axum.
Mmm... once we're further along in the migration, we can switch to testing the axum router directly via tower::ServiceExt::oneshot (no HTTP server needed — you just call the router as a Service with a synthesized Request).
The completed migration branch already does this, and it's how I handle this over in Relay.
My new axum_http_metrics test should relieve your concerns here though.
OK, I'm happy if there's a plan we have verified to work.
tgeoghegan
left a comment
Oops, I meant "request changes" before.
tgeoghegan
left a comment
Thanks for setting up the tests. A few more questions.
```rust
// Find the Trillium StatusCounter's entry (route=hpke_config).
let trillium_response_metric = find_metric_by_label(
    metric_families["janus_aggregator_responses_total"].get_metric(),
    "route",
```
What's up with route vs. http_route? Won't this break existing PromQL?
I'm just matching existing behavior. Here are some examples from dap-09-3:
```
http_server_request_duration_seconds_sum{http_request_method="GET", http_response_status_code="200", http_route="/tasks/:task_id/metrics/uploads", instance="<redacted>", job="kubernetes-service-endpoints", namespace="prd-d09-3", network_protocol_name="http", network_protocol_version="1.1", node="<redacted>", otel_scope_name="janus_aggregator", service="aggregator-prometheus-scrape", url_scheme="http"}

janus_aggregator_responses_total{instance="<redacted>", job="kubernetes-service-endpoints", method="GET", namespace="prd-d09-3", node="<redacted>", otel_scope_name="janus_aggregator", route="/task_ids", service="aggregator-prometheus-scrape"}
```
I was surprised, too, but here we are.
```rust
    1048576.0, 2097152.0, 4194304.0, 8388608.0, 16777216.0, 33554432.0,
];

// -- Trillium --
```
Could even make mod trillium and mod axum in this file but I don't think it's worth the extra indentation for something transitory.
```rust
// Rewrite axum's `{param}` path syntax to `:param` for metric label continuity
// with the existing Trillium convention (e.g. `tasks/:task_id/reports`).
```
I know that I explicitly asked you to do this and I appreciate the effort put in, but now I'm wondering if it makes sense for us to maintain this Trillium convention indefinitely and have it leak into our metrics. It bothers me somehow to have an implementation detail like this leak into peripheral ops concerns. I wonder what the state of the art here is. I guess it's inevitable that your metrics and alerts wind up tightly bound to things like your HTTP server implementation. If not due to the labels on metrics, then due to whether a given implementation exposes a particular metric at all.
So I guess we have to commit to some convention for HTTP route labels, and Trillium's is as good as anything else, as far as I can tell. So I've talked myself back into doing what you do here. Thank you for taking this brief stroll with me.
```rust
let axum_router = axum::Router::new()
    // Temporary test endpoint to verify the proxy bridge works.
    // TODO(#4283): Remove once a real endpoint has been migrated.
    .route(
```
Is this meant to use axum_instrumented? That function remains unused in this PR.
Yes and no, but with the change to use TraceLayer the distinction is immaterial!
aggregator_core/src/http_server.rs (outdated)

```rust
    .unwrap_or_else(|| "unknown".to_string());
let method = request.method().clone();
let span = info_span!("endpoint", route, %method);
let response = next.run(request).instrument(span.clone()).await;
```
We should use tower_http::trace::TraceLayer instead, as it reduces the amount of manual future instrumentation we need to do, while still allowing customization. Here's an example using it: https://github.com/tokio-rs/axum/blob/main/examples/tracing-aka-logging/src/main.rs
Oh, good idea. And thank you for the reference. I dropped axum_instrumented entirely for this.
```rust
/// Axum middleware that records HTTP server metrics (response counter, request duration,
/// body sizes).
pub async fn http_metrics_middleware(
```
We could use opentelemetry-instrumentation-tower, from the open-telemetry/opentelemetry-rust-contrib repo, in order to handle the server metrics following OpenTelemetry conventions. We'd need to provide a FnRouteExtractor in order to keep the cardinality of the route attribute down, and we'd need to handle the janus_aggregator_responses_total counter separately.
I love this.
However, opentelemetry-instrumentation-tower v0.17.0 (which appears to be the only release) requires opentelemetry 0.31+, so we need to upgrade off opentelemetry 0.27. Which means we need to ... move off trillium.