
Migrate to Axum [part 1]: Add axum scaffold with dual-listener proxy bridge #4426

Open

jcjones wants to merge 9 commits into main from 4283-migrate-to-axum-part-1

Conversation


@jcjones jcjones commented Mar 5, 2026

Part 1 of incremental Trillium-to-Axum migration (#4283). Sets up the infrastructure for migrating HTTP handlers one batch at a time:

  • Add axum, tower-http, and trillium-proxy workspace dependencies
  • Add HttpMetrics, ErrorCode, http_metrics_middleware, and axum_instrumented middleware to aggregator_core (coexisting with existing Trillium instrumented handler)
  • Add BYTES_HISTOGRAM_BOUNDARIES constant to aggregator_core
  • In AggregatorHandlerBuilder::build(), spawn an empty axum router on a local listener and configure a trillium-proxy Proxy as the final fallback handler -- unmatched Trillium routes are forwarded to axum
  • Make build() async; update all call sites with .await

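The diff later in this conversation shows the tail of `BYTES_HISTOGRAM_BOUNDARIES` ending at `33554432.0` (2^25). Assuming the entries are consecutive powers of two (my assumption; only the tail is visible here), the convention behind the constant can be sketched as:

```rust
// Sketch only: generates power-of-two byte boundaries of the kind the
// visible tail of BYTES_HISTOGRAM_BOUNDARIES suggests. `byte_boundaries`
// is a hypothetical helper, not a function from the PR.
fn byte_boundaries(min_exp: u32, max_exp: u32) -> Vec<f64> {
    (min_exp..=max_exp).map(|e| (1u64 << e) as f64).collect()
}

fn main() {
    // The six values visible in the diff below: 2^20 through 2^25.
    let tail = byte_boundaries(20, 25);
    assert_eq!(
        tail,
        vec![1048576.0, 2097152.0, 4194304.0, 8388608.0, 16777216.0, 33554432.0]
    );
    println!("{tail:?}");
}
```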
@jcjones jcjones marked this pull request as ready for review March 5, 2026 16:16
@jcjones jcjones requested a review from a team as a code owner March 5, 2026 16:16
@jcjones jcjones force-pushed the 4283-migrate-to-axum-part-1 branch from 61df9ce to 83442cb on March 5, 2026 16:38
@tgeoghegan (Contributor) left a comment:

This looks fundamentally sound, but I think we should exercise the metrics middleware in this PR and validate that everything looks the way our alerts and dashboards will expect.


/// HTTP server metrics, layered as an `Extension` on all axum routes.
#[derive(Clone)]
pub struct HttpMetrics {
Contributor:
I think it's good for this stuff to be in janus_aggregator_core so that both janus_aggregator and janus_aggregator_api can consume it. However I'd rather not dump more stuff into the root of the crate. It's true that we already had InstrumentedHandler and a couple other odds and ends here, but I think at this point we can justify a new aggregator_core/src/http_server.rs.

Contributor Author:
Agreed... I still don't have a good mental heuristic for the cleanliness of Rust projects, but I'm getting there.

Contributor Author:
I've been puzzling over what to move and what not. I'm on the fence about the instrumented / InstrumentedHandler that Trillium uses, since we're gonna delete them... ah whatever, might as well move them too.

Comment on lines +121 to +122
tower = "0.5"
tower-http = { version = "0.6", features = ["cors", "trace"] }
Contributor:
It looks like we don't use tower directly in this PR, but we will as of #4427. So I'm guessing this is a benign error from splitting the PRs up.

Contributor Author:
Yeah, sorta. I did think about it, but in the context of "these are the new crates, might as well get the conversation about them out of the way in Part 1."

];

/// These boundaries are intended to be used with measurements having the unit of "bytes".
pub const BYTES_HISTOGRAM_BOUNDARIES: &[f64] = &[
Contributor:
Huh, it's weird that we were previously defining these in janus_aggregator_api. janus_aggregator_core is a better place. I'm concerned about janus_aggregator_api::BYTES_HISTOGRAM_BOUNDARIES lingering after the transition though. It's pub so I don't think the compiler or clippy will flag it as unused. Can we delete it now and cut everything over to this constant?

Contributor Author:
yeah, good catch.

.extensions()
.get::<MatchedPath>()
.map(|p| p.as_str().trim_start_matches('/').to_string())
.unwrap_or_else(|| "unknown".to_string());
Contributor:
What does route look like here, especially when there are request path variables? Currently we get labels on time series like route="tasks/:task_id/reports". Will we need to rewrite alerts or dashboards? Or perhaps we should rewrite the route to match our existing format.

Contributor Author:
It's unpretty to do the rewrite, but at least for now I will and then we can decide which way to go. I'm adding .replace('{', ":") and .replace('}', "") after the trim_start_matches here and below, that seems to work fine.
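As a sketch of that rewrite (the helper name is mine, not the PR's): axum's `MatchedPath` reports routes in `{param}` syntax, e.g. `/tasks/{task_id}/reports`, and chaining the two `replace` calls after `trim_start_matches` yields Trillium-style labels:

```rust
// Illustrative helper (not the PR's code) showing the label rewrite:
// strip the leading '/', then map axum's `{param}` path syntax to
// Trillium's `:param` so metric labels keep their existing shape.
fn route_label(matched_path: &str) -> String {
    matched_path
        .trim_start_matches('/')
        .replace('{', ":")
        .replace('}', "")
}

fn main() {
    let label = route_label("/tasks/{task_id}/reports");
    assert_eq!(label, "tasks/:task_id/reports");
    println!("{label}");
}
```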

}

/// Axum middleware that instruments request handlers with tracing spans.
pub async fn axum_instrumented(request: Request<Body>, next: Next) -> Response {
Contributor:
This seems like an inferior version of http_metrics_middleware. Do we plan to use it anywhere?

Contributor Author:
Definitely -- (That's pub async fn instrumented in https://github.com/divviup/janus/pull/4409/changes#diff-757a8c633835a56ca77a6f9e8d46679327b47e5fa756cc9695d9b01dcc76ee6dR237) -- but you're right that it was unused in this commit. Now it won't be, since per your other comments I'm going ahead and wiring it up. :)

);

// Bind a local listener for the axum router and spawn it.
let axum_listener = tokio::net::TcpListener::bind("127.0.0.1:0")
Contributor:
I don't mind the plan of proxying the Trillium handler to Axum to enable migration. But I don't get how this will work. The Trillium router will bind the SocketAddr from the Janus config and thus receive actual messages from the internet. Any route Trillium doesn't understand gets passed on to the axum router. It's impossible for anything on the network to actually talk to the Axum router directly, because the port bound by axum_listener won't be exposed in Kubernetes nor hooked up to a cloud load balancer or anything. So I guess once we have migrated all the routes, we tear down all the Trillium stuff and Axum starts binding the address in the Janus config?

I think that makes sense. I am slightly concerned about always binding IPv4 here. I feel like it's possible for localhost to be IPv6 only. Can we bind "localhost:0" or :0 and have it just grab whichever loopback exists?

Contributor Author:
You've got the idea, and yeah I will make it localhost, but also this is never going to be deployed as-is -- this is just to keep the tests running as we do the piecemeal conversion, and then ultimately all the routes will be directly consumed by Axum instead of Trillium.
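A std-only sketch of the suggested change (the PR itself uses tokio::net::TcpListener, whose bind accepts the same address strings): binding "localhost:0" lets the resolver pick whichever loopback the host provides, and port 0 asks the OS for an ephemeral port.

```rust
use std::net::TcpListener;

// Demonstration with the blocking std listener; the async tokio version
// takes the same "localhost:0" address string.
fn main() {
    let listener = TcpListener::bind("localhost:0").expect("failed to bind loopback");
    let addr = listener.local_addr().expect("no local addr");
    // "localhost" resolves to 127.0.0.1 and/or ::1, depending on the host.
    assert!(addr.ip().is_loopback());
    // Port 0 in the bind string means the OS assigned an ephemeral port.
    assert_ne!(addr.port(), 0);
    println!("test listener bound on {addr}");
}
```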

State(self.aggregator),
metrics,
router,
StatusCounter::new(self.meter),
Contributor:
OK, so, IIUC, StatusCounter will intercept responses being sent by Trillium and then it increments the janus_aggregator_responses metric(s) based on response status. If Trillium doesn't send a response because it doesn't recognize the route, then presumably StatusCounter::before_send never gets invoked, and then the unhandled request gets handed off to proxy and thus Axum. That all scans. But AFAICT we never hook up anything to the Axum router that manages janus_aggregator_responses. Do we need to stick the http_metrics_middleware somewhere in the axum router? I think this PR should prove that the middleware works, including checking metrics counters as I think we do for Trillium.

Contributor Author:
Totally fair assessment. I'm pulling some more out of the bulk PR into here -- to explain, I was going for as minimal as to prove the routing, not the routing and the stats, but obviously the stats are important.

Contributor Author:
Unexpected consequence of pulling it forward and having both frameworks log to Prometheus at the same time is we end up with some phantom None entries for the same label key, which have to get filtered out.

Contributor Author:
and this spreads to some other tests, ugh.

Contributor Author:
But only the one where we interact with opentelemetry_sdk::metrics directly for the affected changes, that's not so bad. It might pile up over time, but not yet.

.await;
assert_eq!(conn.status(), Some(trillium::Status::Ok));
let body = test_util::take_response_body(&mut conn).await;
assert_eq!(body, b"axum OK");
Contributor:
Does the trillium stack have tests that verify that metrics get incremented as expected? We should duplicate those for Axum.

Contributor Author:
Mmm.. once we're further along in the migration, we can switch to testing the axum router directly via tower::ServiceExt::oneshot (no HTTP server needed — you just call the router as a Service with a synthesized Request).

The completed migration branch already does this, and it's how I handle this over in Relay.

My new axum_http_metrics test should relieve your concerns here though.

Contributor:
OK, I'm happy if there's a plan we have verified to work.

@tgeoghegan tgeoghegan self-requested a review March 5, 2026 23:15
@tgeoghegan (Contributor) left a comment:

Oops, I meant "request changes" before.

@jcjones jcjones requested a review from tgeoghegan March 6, 2026 18:50
@tgeoghegan (Contributor) left a comment:

Thanks for setting up the tests. A few more questions.

// Find the Trillium StatusCounter's entry (route=hpke_config).
let trillium_response_metric = find_metric_by_label(
metric_families["janus_aggregator_responses_total"].get_metric(),
"route",
Contributor:
What's up with route vs. http_route? Won't this break existing PromQL?

Contributor Author:
I'm just matching existing behavior. Here are some examples from dap-09-3:

  • http_server_request_duration_seconds_sum{http_request_method="GET", http_response_status_code="200", http_route="/tasks/:task_id/metrics/uploads", instance="<redacted>", job="kubernetes-service-endpoints", namespace="prd-d09-3", network_protocol_name="http", network_protocol_version="1.1", node="<redacted>", otel_scope_name="janus_aggregator", service="aggregator-prometheus-scrape", url_scheme="http"}
  • janus_aggregator_responses_total{instance="<redacted>", job="kubernetes-service-endpoints", method="GET", namespace="prd-d09-3", node="<redacted>", otel_scope_name="janus_aggregator", route="/task_ids", service="aggregator-prometheus-scrape"}

I was surprised, too, but here we are.

1048576.0, 2097152.0, 4194304.0, 8388608.0, 16777216.0, 33554432.0,
];

// -- Trillium --
Contributor:
Could even make mod trillium and mod axum in this file but I don't think it's worth the extra indentation for something transitory.

Comment on lines +118 to +119
// Rewrite axum's `{param}` path syntax to `:param` for metric label continuity
// with the existing Trillium convention (e.g. `tasks/:task_id/reports`).
Contributor:
I know that I explicitly asked you to do this and I appreciate the effort put in, but now I'm wondering if it makes sense for us to maintain this Trillium convention indefinitely and have it leak into our metrics. It bothers me somehow to have an implementation detail like this leak into peripheral ops concerns. I wonder what the state of the art here is. I guess it's inevitable that your metrics and alerts wind up tightly bound to things like your HTTP server implementation. If not due to the labels on metrics, then due to whether a given implementation exposes a particular metric at all.

So I guess we have to commit to some convention for HTTP route labels, and Trillium's is as good as anything else, as far as I can tell. So I've talked myself back into doing what you do here. Thank you for taking this brief stroll with me.

let axum_router = axum::Router::new()
// Temporary test endpoint to verify the proxy bridge works.
// TODO(#4283): Remove once a real endpoint has been migrated.
.route(
Contributor:
Is this meant to use axum_instrumented? That function remains unused in this PR.

Contributor Author:
Yes and no, but with the change to use TraceLayer the distinction is immaterial!


.unwrap_or_else(|| "unknown".to_string());
let method = request.method().clone();
let span = info_span!("endpoint", route, %method);
let response = next.run(request).instrument(span.clone()).await;
Collaborator:

We should use tower_http::trace::TraceLayer instead, as it reduces the amount of manual future instrumentation we need to do, while still allowing customization. Here's an example using it: https://github.com/tokio-rs/axum/blob/main/examples/tracing-aka-logging/src/main.rs

Contributor Author:

Oh, good idea. And thank you for the reference. I dropped axum_instrumented entirely for this.


/// Axum middleware that records HTTP server metrics (response counter, request duration,
/// body sizes).
pub async fn http_metrics_middleware(
Collaborator:

We could use opentelemetry-instrumentation-tower, from the open-telemetry/opentelemetry-rust-contrib repo, in order to handle the server metrics following OpenTelemetry conventions. We'd need to provide a FnRouteExtractor in order to keep the cardinality of the route attribute down, and we'd need to handle the janus_aggregator_responses_total counter separately.

Contributor Author:

I love this.

However, opentelemetry-instrumentation-tower v0.17.0 (which appears to be the only release) requires opentelemetry 0.31+, so we need to upgrade off opentelemetry 0.27. Which means we need to ... move off trillium.
