Migrate to Axum [part 1]: Add axum scaffold with dual-listener proxy bridge #4426
Conversation
Part 1 of incremental Trillium-to-Axum migration (#4283). Sets up the infrastructure for migrating HTTP handlers one batch at a time:
- Add `axum`, `tower-http`, and `trillium-proxy` workspace dependencies
- Add `HttpMetrics`, `ErrorCode`, `http_metrics_middleware`, and `axum_instrumented` middleware to `aggregator_core` (coexisting with the existing Trillium instrumented handler)
- Add `BYTES_HISTOGRAM_BOUNDARIES` constant to `aggregator_core`
- In `AggregatorHandlerBuilder::build()`, spawn an empty axum router on a local listener and configure a trillium-proxy `Proxy` as the final fallback handler -- unmatched Trillium routes are forwarded to axum
- Make `build()` async; update all call sites with `.await`
(Force-pushed 61df9ce to 83442cb)
tgeoghegan
left a comment
This looks fundamentally sound, but I think we should exercise the metrics middleware in this PR and validate that everything looks the way our alerts and dashboards will expect.
aggregator_core/src/lib.rs (outdated)

```rust
/// HTTP server metrics, layered as an `Extension` on all axum routes.
#[derive(Clone)]
pub struct HttpMetrics {
```
I think it's good for this stuff to be in janus_aggregator_core so that both janus_aggregator and janus_aggregator_api can consume it. However I'd rather not dump more stuff into the root of the crate. It's true that we already had InstrumentedHandler and a couple other odds and ends here, but I think at this point we can justify a new aggregator_core/src/http_server.rs.
Agreed... I still don't have a good mental heuristic for the cleanliness of Rust projects, but I'm getting there.
I've been puzzling over what to move and what not to. I'm on the fence about the instrumented / InstrumentedHandler that Trillium uses, since we're going to delete them... ah whatever, might as well move them too.
```toml
tower = "0.5"
tower-http = { version = "0.6", features = ["cors", "trace"] }
```
It looks like we don't use tower directly in this PR, but we will as of #4427. So I'm guessing this is a benign error from splitting the PRs up.
Yeah, sorta. I did think about it, but in the context of "these are the new crates, might as well get the conversation about them out of the way in Part 1."
aggregator_core/src/lib.rs (outdated)

```rust
/// These boundaries are intended to be used with measurements having the unit of "bytes".
pub const BYTES_HISTOGRAM_BOUNDARIES: &[f64] = &[
```
Huh, it's weird that we were previously defining these in janus_aggregator_api. janus_aggregator_core is a better place. I'm concerned about janus_aggregator_api::BYTES_HISTOGRAM_BOUNDARIES lingering after the transition though. It's pub so I don't think the compiler or clippy will flag it as unused. Can we delete it now and cut everything over to this constant?
aggregator_core/src/lib.rs (outdated)

```rust
.extensions()
    .get::<MatchedPath>()
    .map(|p| p.as_str().trim_start_matches('/').to_string())
    .unwrap_or_else(|| "unknown".to_string());
```
What does route look like here, especially when there are request path variables? Currently we get labels on time series like route="tasks/:task_id/reports". Will we need to rewrite alerts or dashboards? Or perhaps we should rewrite the route to match our existing format.
It's not pretty to do the rewrite, but at least for now I will, and then we can decide which way to go. I'm adding .replace('{', ":") and .replace('}', "") after the trim_start_matches here and below; that seems to work fine.
aggregator_core/src/lib.rs (outdated)

```rust
/// Axum middleware that instruments request handlers with tracing spans.
pub async fn axum_instrumented(request: Request<Body>, next: Next) -> Response {
```
This seems like an inferior version of http_metrics_middleware. Do we plan to use it anywhere?
Definitely -- that's pub async fn instrumented in https://github.com/divviup/janus/pull/4409/changes#diff-757a8c633835a56ca77a6f9e8d46679327b47e5fa756cc9695d9b01dcc76ee6dR237 -- but you're right that it was unused in this commit. Now it won't be, since per your other comments I'm going ahead and wiring it up. :)
```rust
// Bind a local listener for the axum router and spawn it.
let axum_listener = tokio::net::TcpListener::bind("127.0.0.1:0")
```
I don't mind the plan of proxying the Trillium handler to Axum to enable migration. But I don't get how this will work. The Trillium router will bind the SocketAddr from the Janus config and thus receive actual messages from the internet. Any route Trillium doesn't understand gets passed on to the axum router. It's impossible for anything on the network to actually talk to the Axum router directly, because the port bound by axum_listener won't be exposed in Kubernetes nor hooked up to a cloud load balancer or anything. So I guess once we have migrated all the routes, we tear down all the Trillium stuff and Axum starts binding the address in the Janus config?
I think that makes sense. I am slightly concerned about always binding IPv4 here. I feel like it's possible for localhost to be IPv6 only. Can we bind "localhost:0" or :0 and have it just grab whichever loopback exists?
You've got the idea, and yeah I will make it localhost, but also this is never going to be deployed as-is -- this is just to keep the tests running as we do the piecemeal conversion, and then ultimately all the routes will be directly consumed by Axum instead of Trillium.
```rust
State(self.aggregator),
metrics,
router,
StatusCounter::new(self.meter),
```
OK, so, IIUC, StatusCounter will intercept responses being sent by Trillium and then it increments the janus_aggregator_responses metric(s) based on response status. If Trillium doesn't send a response because it doesn't recognize the route, then presumably StatusCounter::before_send never gets invoked, and then the unhandled request gets handed off to proxy and thus Axum. That all scans. But AFAICT we never hook up anything to the Axum router that manages janus_aggregator_responses. Do we need to stick the http_metrics_middleware somewhere in the axum router? I think this PR should prove that the middleware works, including checking metrics counters as I think we do for Trillium.
Totally fair assessment. I'm pulling some more out of the bulk PR into here -- to explain, I was going for the minimum needed to prove the routing, not the routing and the stats, but obviously the stats are important.
An unexpected consequence of pulling it forward and having both frameworks log to Prometheus at the same time is that we end up with some phantom None entries for the same label key, which have to get filtered out.
and this spreads to some other tests, ugh.
But it's only the tests where we interact with opentelemetry_sdk::metrics directly that are affected, so that's not so bad. It might pile up over time, but not yet.
```rust
    .await;
assert_eq!(conn.status(), Some(trillium::Status::Ok));
let body = test_util::take_response_body(&mut conn).await;
assert_eq!(body, b"axum OK");
```
Does the trillium stack have tests that verify that metrics get incremented as expected? We should duplicate those for Axum.
Mmm... once we're further along in the migration, we can switch to testing the axum router directly via tower::ServiceExt::oneshot (no HTTP server needed — you just call the router as a Service with a synthesized Request).
The completed migration branch already does this, and it's how I handle this over in Relay.
My new axum_http_metrics test should relieve your concerns here though.
OK, I'm happy if there's a plan we have verified to work.
tgeoghegan
left a comment
Oops, I meant "request changes" before.
tgeoghegan
left a comment
Thanks for setting up the tests. A few more questions.
```rust
// Find the Trillium StatusCounter's entry (route=hpke_config).
let trillium_response_metric = find_metric_by_label(
    metric_families["janus_aggregator_responses_total"].get_metric(),
    "route",
```
What's up with route vs. http_route? Won't this break existing PromQL?
I'm just matching existing behavior. Here are some examples from dap-09-3:
```
http_server_request_duration_seconds_sum{http_request_method="GET", http_response_status_code="200", http_route="/tasks/:task_id/metrics/uploads", instance="<redacted>", job="kubernetes-service-endpoints", namespace="prd-d09-3", network_protocol_name="http", network_protocol_version="1.1", node="<redacted>", otel_scope_name="janus_aggregator", service="aggregator-prometheus-scrape", url_scheme="http"}

janus_aggregator_responses_total{instance="<redacted>", job="kubernetes-service-endpoints", method="GET", namespace="prd-d09-3", node="<redacted>", otel_scope_name="janus_aggregator", route="/task_ids", service="aggregator-prometheus-scrape"}
```
I was surprised, too, but here we are.
```rust
    1048576.0, 2097152.0, 4194304.0, 8388608.0, 16777216.0, 33554432.0,
];

// -- Trillium --
```
Could even make mod trillium and mod axum in this file but I don't think it's worth the extra indentation for something transitory.
```rust
// Rewrite axum's `{param}` path syntax to `:param` for metric label continuity
// with the existing Trillium convention (e.g. `tasks/:task_id/reports`).
```
I know that I explicitly asked you to do this and I appreciate the effort put in, but now I'm wondering if it makes sense for us to maintain this Trillium convention indefinitely and have it leak into our metrics. It bothers me somehow to have an implementation detail like this leak into peripheral ops concerns. I wonder what the state of the art here is. I guess it's inevitable that your metrics and alerts wind up tightly bound to things like your HTTP server implementation. If not due to the labels on metrics, then due to whether a given implementation exposes a particular metric at all.
So I guess we have to commit to some convention for HTTP route labels, and Trillium's is as good as anything else, as far as I can tell. So I've talked myself back into doing what you do here. Thank you for taking this brief stroll with me.
```rust
let axum_router = axum::Router::new()
    // Temporary test endpoint to verify the proxy bridge works.
    // TODO(#4283): Remove once a real endpoint has been migrated.
    .route(
```
Is this meant to use axum_instrumented? That function remains unused in this PR.
Yes and no, but with the change to use TraceLayer the distinction is immaterial!
aggregator_core/src/http_server.rs (outdated)

```rust
    .unwrap_or_else(|| "unknown".to_string());
let method = request.method().clone();
let span = info_span!("endpoint", route, %method);
let response = next.run(request).instrument(span.clone()).await;
```
We should use tower_http::trace::TraceLayer instead, as it reduces the amount of manual future instrumentation we need to do, while still allowing customization. Here's an example using it: https://github.com/tokio-rs/axum/blob/main/examples/tracing-aka-logging/src/main.rs
Oh, good idea. And thank you for the reference. I dropped axum_instrumented entirely for this.
```rust
/// Axum middleware that records HTTP server metrics (response counter, request duration,
/// body sizes).
pub async fn http_metrics_middleware(
```
We could use opentelemetry-instrumentation-tower, from the open-telemetry/opentelemetry-rust-contrib repo, in order to handle the server metrics following OpenTelemetry conventions. We'd need to provide a FnRouteExtractor in order to keep the cardinality of the route attribute down, and we'd need to handle the janus_aggregator_responses_total counter separately.
I love this.
However, opentelemetry-instrumentation-tower v0.17.0 (which appears to be the only release) requires opentelemetry 0.31+, so we need to upgrade off opentelemetry 0.27. Which means we need to ... move off trillium.