Feat - parse LiteLLM headers to record metrics regarding backend used and fallbacks and also costs - GENAI-4264 by subpath · Pull Request #118 · Firefox-AI/MLPA

subpath · 2026-03-23T14:39:18Z

What's new:

Add LiteLLM response header parser to extract extra metrics
Jira: https://mozilla-hub.atlassian.net/browse/GENAI-4264

Note that vertex_ai doesn't return the api_base header, but TogetherAI and RayServe do.

New metrics

Recorded for successful completions only. Labels include requested model, backend, service type, purpose, and fallback_used where applicable:

mlpa_litellm_routed_completions_total
mlpa_litellm_attempted_fallbacks / mlpa_litellm_attempted_retries (histograms)
mlpa_litellm_reported_duration_seconds (histogram, from proxy duration header)
mlpa_litellm_reported_cost_usd_total (counter, incremented by the reported USD cost of each completion; use increase() for windowed sums)
mlpa_litellm_routed_tokens_total (prompt/completion tokens aligned with routing labels)
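
A minimal sketch of how routing labels could be derived from the response headers, for illustration only. The four `x-litellm-attempted-*`, `x-litellm-response-duration-ms`, and `x-litellm-response-cost` names come from the metric HELP text in this PR; the api_base header name, the `"unknown"` placeholder for backends that omit it (such as vertex_ai), the rule that `fallback_used` is true when at least one fallback was attempted, and the names `RoutingLabels` and `routing_labels` are assumptions, not the code in `litellm_routing.py`.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

HEADER_API_BASE = "x-litellm-model-api-base"  # assumed header name
HEADER_FALLBACKS = "x-litellm-attempted-fallbacks"
HEADER_DURATION_MS = "x-litellm-response-duration-ms"
HEADER_COST = "x-litellm-response-cost"


@dataclass(frozen=True)
class RoutingLabels:
    """Hypothetical container for values derived from LiteLLM headers."""

    backend: str
    fallback_used: str
    duration_seconds: Optional[float]
    cost_usd: Optional[float]


def _to_float(raw: Optional[str]) -> Optional[float]:
    """Parse a header value as float; return None when missing or malformed."""
    if raw is None:
        return None
    try:
        return float(str(raw).strip())
    except ValueError:
        return None


def routing_labels(headers: Mapping[str, str]) -> RoutingLabels:
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    fallbacks = _to_float(lowered.get(HEADER_FALLBACKS)) or 0.0
    duration_ms = _to_float(lowered.get(HEADER_DURATION_MS))
    return RoutingLabels(
        # Assumption: a placeholder keeps the label set stable when the
        # backend does not report api_base.
        backend=lowered.get(HEADER_API_BASE) or "unknown",
        fallback_used="true" if fallbacks > 0 else "false",
        duration_seconds=None if duration_ms is None else duration_ms / 1000.0,
        cost_usd=_to_float(lowered.get(HEADER_COST)),
    )
```

Label values are kept as strings (`"true"` / `"false"`) because Prometheus label values are always strings, which matches the `fallback_used="false"` samples in the QA output.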

QA:

tests old and new ✅
Local QA ✅

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 875.0
python_gc_objects_collected_total{generation="1"} 80.0
python_gc_objects_collected_total{generation="2"} 10.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 291.0
python_gc_collections_total{generation="1"} 26.0
python_gc_collections_total{generation="2"} 2.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="12",patchlevel="13",version="3.12.13"} 1.0
# HELP mlpa_in_progress_requests Number of requests currently in progress.
# TYPE mlpa_in_progress_requests gauge
mlpa_in_progress_requests 1.0
# HELP mlpa_requests_total Total number of requests handled by the proxy.
# TYPE mlpa_requests_total counter
mlpa_requests_total{endpoint="/v1/chat/completions",method="POST",purpose="",service_type="ai"} 1.0
# HELP mlpa_requests_created Total number of requests handled by the proxy.
# TYPE mlpa_requests_created gauge
mlpa_requests_created{endpoint="/v1/chat/completions",method="POST",purpose="",service_type="ai"} 1.774341358378579e+09
# HELP mlpa_response_status_codes_total Total number of response status codes.
# TYPE mlpa_response_status_codes_total counter
mlpa_response_status_codes_total{status_code="200"} 1.0
# HELP mlpa_response_status_codes_created Total number of response status codes.
# TYPE mlpa_response_status_codes_created gauge
mlpa_response_status_codes_created{status_code="200"} 1.774341358378586e+09
# HELP mlpa_request_latency_seconds Request latency in seconds.
# TYPE mlpa_request_latency_seconds histogram
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="0.005",method="POST"} 0.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="0.01",method="POST"} 0.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="0.025",method="POST"} 0.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="0.05",method="POST"} 0.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="0.1",method="POST"} 0.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="0.25",method="POST"} 0.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="0.5",method="POST"} 0.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="1.0",method="POST"} 0.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="2.5",method="POST"} 0.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="5.0",method="POST"} 1.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="10.0",method="POST"} 1.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="+Inf",method="POST"} 1.0
mlpa_request_latency_seconds_count{endpoint="/v1/chat/completions",method="POST"} 1.0
mlpa_request_latency_seconds_sum{endpoint="/v1/chat/completions",method="POST"} 4.77369670799817
# HELP mlpa_request_latency_seconds_created Request latency in seconds.
# TYPE mlpa_request_latency_seconds_created gauge
mlpa_request_latency_seconds_created{endpoint="/v1/chat/completions",method="POST"} 1.7743413583785589e+09
# HELP mlpa_validate_challenge_latency_seconds Challenge validation latency in seconds.
# TYPE mlpa_validate_challenge_latency_seconds histogram
# HELP mlpa_validate_app_attest_latency_seconds App Attest authentication latency in seconds.
# TYPE mlpa_validate_app_attest_latency_seconds histogram
# HELP mlpa_validate_app_assert_latency_seconds App Assert authentication latency in seconds.
# TYPE mlpa_validate_app_assert_latency_seconds histogram
# HELP mlpa_validate_fxa_latency_seconds FxA authentication latency in seconds.
# TYPE mlpa_validate_fxa_latency_seconds histogram
mlpa_validate_fxa_latency_seconds_bucket{le="0.05",result="success",verification_source="local"} 0.0
mlpa_validate_fxa_latency_seconds_bucket{le="0.1",result="success",verification_source="local"} 0.0
mlpa_validate_fxa_latency_seconds_bucket{le="0.25",result="success",verification_source="local"} 0.0
mlpa_validate_fxa_latency_seconds_bucket{le="0.5",result="success",verification_source="local"} 0.0
mlpa_validate_fxa_latency_seconds_bucket{le="1.0",result="success",verification_source="local"} 0.0
mlpa_validate_fxa_latency_seconds_bucket{le="2.5",result="success",verification_source="local"} 0.0
mlpa_validate_fxa_latency_seconds_bucket{le="5.0",result="success",verification_source="local"} 1.0
mlpa_validate_fxa_latency_seconds_bucket{le="+Inf",result="success",verification_source="local"} 1.0
mlpa_validate_fxa_latency_seconds_count{result="success",verification_source="local"} 1.0
mlpa_validate_fxa_latency_seconds_sum{result="success",verification_source="local"} 3.342585708000115
# HELP mlpa_validate_fxa_latency_seconds_created FxA authentication latency in seconds.
# TYPE mlpa_validate_fxa_latency_seconds_created gauge
mlpa_validate_fxa_latency_seconds_created{result="success",verification_source="local"} 1.774341356950788e+09
# HELP mlpa_fxa_verifications_total Total number of FxA token verifications.
# TYPE mlpa_fxa_verifications_total counter
mlpa_fxa_verifications_total{verification_source="local"} 1.0
# HELP mlpa_fxa_verifications_created Total number of FxA token verifications.
# TYPE mlpa_fxa_verifications_created gauge
mlpa_fxa_verifications_created{verification_source="local"} 1.7743413569505558e+09
# HELP mlpa_chat_completion_latency_seconds Chat completion latency in seconds.
# TYPE mlpa_chat_completion_latency_seconds histogram
mlpa_chat_completion_latency_seconds_bucket{le="0.5",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 0.0
mlpa_chat_completion_latency_seconds_bucket{le="1.0",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 0.0
mlpa_chat_completion_latency_seconds_bucket{le="2.5",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_bucket{le="5.0",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_bucket{le="10.0",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_bucket{le="20.0",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_bucket{le="30.0",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_bucket{le="60.0",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_bucket{le="120.0",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_bucket{le="180.0",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_bucket{le="+Inf",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_count{model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_sum{model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.3718272090045502
# HELP mlpa_chat_completion_latency_seconds_created Chat completion latency in seconds.
# TYPE mlpa_chat_completion_latency_seconds_created gauge
mlpa_chat_completion_latency_seconds_created{model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.774341358377821e+09
# HELP mlpa_chat_completion_ttft_seconds Time to first token for streaming chat completions in seconds.
# TYPE mlpa_chat_completion_ttft_seconds histogram
# HELP mlpa_chat_tokens_total Number of tokens for chat completions.
# TYPE mlpa_chat_tokens_total counter
mlpa_chat_tokens_total{model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 402.0
mlpa_chat_tokens_total{model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 42.0
# HELP mlpa_chat_tokens_created Number of tokens for chat completions.
# TYPE mlpa_chat_tokens_created gauge
mlpa_chat_tokens_created{model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.774341358377654e+09
mlpa_chat_tokens_created{model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.7743413583777149e+09
# HELP mlpa_chat_tokens_per_request Distribution of tokens per chat completion request.
# TYPE mlpa_chat_tokens_per_request histogram
mlpa_chat_tokens_per_request_bucket{le="0.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 0.0
mlpa_chat_tokens_per_request_bucket{le="10.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 0.0
mlpa_chat_tokens_per_request_bucket{le="50.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 0.0
mlpa_chat_tokens_per_request_bucket{le="100.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 0.0
mlpa_chat_tokens_per_request_bucket{le="250.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 0.0
mlpa_chat_tokens_per_request_bucket{le="500.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.0
mlpa_chat_tokens_per_request_bucket{le="1000.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.0
mlpa_chat_tokens_per_request_bucket{le="2500.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.0
mlpa_chat_tokens_per_request_bucket{le="5000.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.0
mlpa_chat_tokens_per_request_bucket{le="10000.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.0
mlpa_chat_tokens_per_request_bucket{le="25000.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.0
mlpa_chat_tokens_per_request_bucket{le="+Inf",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.0
mlpa_chat_tokens_per_request_count{model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.0
mlpa_chat_tokens_per_request_sum{model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 402.0
mlpa_chat_tokens_per_request_bucket{le="0.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 0.0
mlpa_chat_tokens_per_request_bucket{le="10.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 0.0
mlpa_chat_tokens_per_request_bucket{le="50.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_bucket{le="100.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_bucket{le="250.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_bucket{le="500.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_bucket{le="1000.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_bucket{le="2500.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_bucket{le="5000.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_bucket{le="10000.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_bucket{le="25000.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_bucket{le="+Inf",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_count{model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_sum{model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 42.0
# HELP mlpa_chat_tokens_per_request_created Distribution of tokens per chat completion request.
# TYPE mlpa_chat_tokens_per_request_created gauge
mlpa_chat_tokens_per_request_created{model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.774341358377682e+09
mlpa_chat_tokens_per_request_created{model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.774341358377721e+09
# HELP mlpa_chat_tool_calls_total Total number of LLM tool invocations.
# TYPE mlpa_chat_tool_calls_total counter
# HELP mlpa_chat_completions_with_tools_total Number of completions that contained at least one tool call.
# TYPE mlpa_chat_completions_with_tools_total counter
# HELP mlpa_chat_tool_calls_per_completion Distribution of tool calls per completion.
# TYPE mlpa_chat_tool_calls_per_completion histogram
# HELP mlpa_chat_requests_with_tools_total Number of chat requests that included a tools payload.
# TYPE mlpa_chat_requests_with_tools_total counter
# HELP mlpa_chat_request_rejections_total Number of chat requests rejected due to budget, rate limit, payload size, or managed-user signup cap.
# TYPE mlpa_chat_request_rejections_total counter
# HELP mlpa_litellm_routed_completions_total Successful chat completions with LiteLLM routing labels from response headers.
# TYPE mlpa_litellm_routed_completions_total counter
mlpa_litellm_routed_completions_total{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
# HELP mlpa_litellm_routed_completions_created Successful chat completions with LiteLLM routing labels from response headers.
# TYPE mlpa_litellm_routed_completions_created gauge
mlpa_litellm_routed_completions_created{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.7743413583777559e+09
# HELP mlpa_litellm_attempted_fallbacks LiteLLM-reported fallback attempts per successful completion (from x-litellm-attempted-fallbacks).
# TYPE mlpa_litellm_attempted_fallbacks histogram
mlpa_litellm_attempted_fallbacks_bucket{backend="https://api.openai.com",le="0.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_fallbacks_bucket{backend="https://api.openai.com",le="1.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_fallbacks_bucket{backend="https://api.openai.com",le="2.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_fallbacks_bucket{backend="https://api.openai.com",le="3.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_fallbacks_bucket{backend="https://api.openai.com",le="5.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_fallbacks_bucket{backend="https://api.openai.com",le="+Inf",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_fallbacks_count{backend="https://api.openai.com",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_fallbacks_sum{backend="https://api.openai.com",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 0.0
# HELP mlpa_litellm_attempted_fallbacks_created LiteLLM-reported fallback attempts per successful completion (from x-litellm-attempted-fallbacks).
# TYPE mlpa_litellm_attempted_fallbacks_created gauge
mlpa_litellm_attempted_fallbacks_created{backend="https://api.openai.com",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.774341358377763e+09
# HELP mlpa_litellm_attempted_retries LiteLLM-reported retry attempts per successful completion (from x-litellm-attempted-retries).
# TYPE mlpa_litellm_attempted_retries histogram
mlpa_litellm_attempted_retries_bucket{backend="https://api.openai.com",le="0.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_retries_bucket{backend="https://api.openai.com",le="1.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_retries_bucket{backend="https://api.openai.com",le="2.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_retries_bucket{backend="https://api.openai.com",le="3.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_retries_bucket{backend="https://api.openai.com",le="5.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_retries_bucket{backend="https://api.openai.com",le="+Inf",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_retries_count{backend="https://api.openai.com",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_retries_sum{backend="https://api.openai.com",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 0.0
# HELP mlpa_litellm_attempted_retries_created LiteLLM-reported retry attempts per successful completion (from x-litellm-attempted-retries).
# TYPE mlpa_litellm_attempted_retries_created gauge
mlpa_litellm_attempted_retries_created{backend="https://api.openai.com",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.774341358377772e+09
# HELP mlpa_litellm_reported_duration_seconds LiteLLM proxy-reported request duration in seconds (x-litellm-response-duration-ms / 1000).
# TYPE mlpa_litellm_reported_duration_seconds histogram
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="0.5",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 0.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="1.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 0.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="2.5",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="5.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="10.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="20.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="30.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="60.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="120.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="180.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="+Inf",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_count{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_sum{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.285656
# HELP mlpa_litellm_reported_duration_seconds_created LiteLLM proxy-reported request duration in seconds (x-litellm-response-duration-ms / 1000).
# TYPE mlpa_litellm_reported_duration_seconds_created gauge
mlpa_litellm_reported_duration_seconds_created{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.774341358377782e+09
# HELP mlpa_litellm_reported_cost_usd_total Cumulative LiteLLM-reported spend in USD (x-litellm-response-cost); use increase() over a range for windowed sums.
# TYPE mlpa_litellm_reported_cost_usd_total counter
mlpa_litellm_reported_cost_usd_total{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 0.001425
# HELP mlpa_litellm_reported_cost_usd_created Cumulative LiteLLM-reported spend in USD (x-litellm-response-cost); use increase() over a range for windowed sums.
# TYPE mlpa_litellm_reported_cost_usd_created gauge
mlpa_litellm_reported_cost_usd_created{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.774341358377802e+09
# HELP mlpa_litellm_routed_tokens_total Token counts attributed to LiteLLM winning backend (from usage, same completion as routing headers).
# TYPE mlpa_litellm_routed_tokens_total counter
mlpa_litellm_routed_tokens_total{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai",type="prompt"} 402.0
mlpa_litellm_routed_tokens_total{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai",type="completion"} 42.0
# HELP mlpa_litellm_routed_tokens_created Token counts attributed to LiteLLM winning backend (from usage, same completion as routing headers).
# TYPE mlpa_litellm_routed_tokens_created gauge
mlpa_litellm_routed_tokens_created{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai",type="prompt"} 1.7743413583778088e+09
mlpa_litellm_routed_tokens_created{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai",type="completion"} 1.774341358377813e+09

src/mlpa/core/completions.py

noahpodgurski · 2026-03-24T13:28:40Z

src/mlpa/core/config.py


    def valid_purposes_for_service_type(self, service_type: str) -> list[str]:
        """Return valid purpose values for a service type (empty if purpose not used)."""
        return self.service_type_purposes.get(service_type, [])


Suggested change

return self.service_type_purposes.get(service_type, {})

Or we could remove the second parameter entirely since it's explicitly defined above, wdyt?
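
For reference, a small sketch of how the default argument to `dict.get` behaves in a function like this one. The mapping contents and purpose values are hypothetical stand-ins, not the project's configuration.

```python
from typing import Dict, List

# Hypothetical stand-in for the settings field discussed above.
service_type_purposes: Dict[str, List[str]] = {"ai": ["chat", "summarize"]}


def valid_purposes_for_service_type(service_type: str) -> List[str]:
    """Return valid purposes, or an empty list for an unknown service type."""
    return service_type_purposes.get(service_type, [])


# A known key returns its list; an unknown key returns the default.
known = valid_purposes_for_service_type("ai")
unknown = valid_purposes_for_service_type("missing")

# Without a second argument, dict.get returns None for a missing key,
# so callers would need to handle None before iterating.
no_default = service_type_purposes.get("missing")
```

With `[]` as the default the return value always matches the `list[str]` annotation; `{}` is also falsy and iterable, but it is a dict rather than a list.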

noahpodgurski · 2026-03-24T13:33:36Z

src/mlpa/core/litellm_routing.py

+    fallbacks = _safe_int_header(headers, LITELLM_HEADER_ATTEMPTED_FALLBACKS)
+    retries = _safe_int_header(headers, LITELLM_HEADER_ATTEMPTED_RETRIES)
+    duration_ms = _safe_float_header(headers, LITELLM_HEADER_RESPONSE_DURATION_MS)
+    if duration_ms is not None and duration_ms < 0:


Does LiteLLM ever return a negative value here?
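
One way the guard could be written, as a sketch only: the diff above does not show what the branch does, so treating a negative or non-finite duration as missing is an assumption, and `normalize_duration_ms` is a hypothetical name. The motivation is that a negative observation on a Prometheus histogram would be counted in the lowest bucket and subtracted from `_sum`.

```python
import math
from typing import Optional


def normalize_duration_ms(duration_ms: Optional[float]) -> Optional[float]:
    """Return the duration unchanged, or None if it is absent or implausible."""
    if duration_ms is None:
        return None
    # Assumption: negative, NaN, and infinite values are treated as
    # "header not usable" rather than being observed.
    if not math.isfinite(duration_ms) or duration_ms < 0:
        return None
    return duration_ms
```

Callers would then skip the histogram observation when the result is `None`.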

noahpodgurski · 2026-03-24T13:35:25Z

src/mlpa/core/completions.py

+        float(snapshot.attempted_fallbacks)
+    )
+    metrics.litellm_attempted_retries.labels(**labels_base).observe(
+        float(snapshot.attempted_retries)


This is fine as an int, right?

noahpodgurski · 2026-03-24T13:35:27Z

src/mlpa/core/completions.py

+        fallback_used=fallback_used,
+    ).inc()
+    metrics.litellm_attempted_fallbacks.labels(**labels_base).observe(
+        float(snapshot.attempted_fallbacks)


This is fine as an int, right?

noahpodgurski · 2026-03-24T13:38:06Z

src/mlpa/core/litellm_routing.py

+    if raw is None:
+        return 0
+    try:
+        return int(float(str(raw).strip()))


suggestion: We should probably either pick int or float, not both

Since it's _safe_int_header, int is best 👍
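
A sketch of the int-only variant suggested here, under the assumption that malformed values fall back to 0 (the `except` branch is not shown in the diff). `safe_int_header` is a hypothetical standalone version, not the function in the PR.

```python
from typing import Mapping


def safe_int_header(headers: Mapping[str, str], name: str) -> int:
    """Parse a header as int; return 0 when it is missing or malformed."""
    raw = headers.get(name)
    if raw is None:
        return 0
    try:
        # int() accepts surrounding whitespace but rejects "2.0".
        return int(str(raw).strip())
    except ValueError:
        return 0
```

The trade-off is that `int("2.0")` raises `ValueError`, so a value serialized as `2.0` would be recorded as 0 here, whereas `int(float("2.0"))` yields 2. Which form is appropriate depends on how LiteLLM formats these headers.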

noahpodgurski · 2026-03-24T13:51:49Z

Looks good to me, just a few comments 👍

subpath added 4 commits March 23, 2026 15:33

wip

f6693fa

Merge branch 'main' of github.com:Firefox-AI/MLPA into feat-tracking-api-base-and-fallbacks-geanai-4264

2cd397f

wip commit

209acd2

refactor metrics

5da8680

subpath changed the title ~~wip~~ Feat - parse LiteLLM headers to recort metrics regarding backend used and fallbacks and also costs Mar 24, 2026

subpath marked this pull request as ready for review March 24, 2026 08:40

subpath changed the title ~~Feat - parse LiteLLM headers to recort metrics regarding backend used and fallbacks and also costs~~ Feat - parse LiteLLM headers to record metrics regarding backend used and fallbacks and also costs Mar 24, 2026

subpath changed the title ~~Feat - parse LiteLLM headers to record metrics regarding backend used and fallbacks and also costs~~ Feat - parse LiteLLM headers to record metrics regarding backend used and fallbacks and also costs - GENAI-4264 Mar 24, 2026

noahpodgurski reviewed Mar 24, 2026

noahpodgurski reviewed Mar 24, 2026

subpath and others added 3 commits March 24, 2026 15:16

Update src/mlpa/core/completions.py

8f86bbe

Co-authored-by: Noah Podgurski <42069075+noahpodgurski@users.noreply.github.com>

pull from main

b9ef7d3

Merge branch 'feat-tracking-api-base-and-fallbacks-geanai-4264' of github.com:Firefox-AI/MLPA into feat-tracking-api-base-and-fallbacks-geanai-4264

eb6beb2

subpath wants to merge 7 commits into main from feat-tracking-api-base-and-fallbacks-geanai-4264
