sameerank
commented
Mar 24, 2026
tests/ffe/test_flag_eval_metrics.py (outdated) — comment on lines +542 to +548:
# INVALID_CONTEXT behavioral differences:
# - Python: Returns INVALID_CONTEXT for nested dict/list attributes (PyO3 conversion failure)
# - Go: Flattens nested objects to dot notation instead
# - Ruby: Silently skips unsupported attribute types
# - Java: Returns INVALID_CONTEXT only for null context, not nested attributes
# - .NET: Relies on native library; not yet standardized
# - JS: Does not use INVALID_CONTEXT at all
Author
I am unclear on how to reconcile the variety of ways the SDKs handle invalid evaluation contexts. For now I'm just noting it down in the system tests code, and hopefully we can chip away at the @irrelevant decorators as we keep working on the SDKs.
I like the Python approach of returning the default value with reason "error" and code "invalid context", but my hunch is that the varying ways of binding with the Rust evaluator mean this isn't straightforward in other languages.
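The Python behavior the comment prefers (default value with reason "error" and code "invalid context") can be sketched as a pre-evaluation context check. This is a minimal illustration only; `resolve_bool` and the result tuple shape are hypothetical, not the dd-trace-py API:

```python
# Minimal sketch of the Python-style INVALID_CONTEXT handling described
# above: if the evaluation context contains nested dict/list attributes
# (which the PyO3 conversion cannot handle), return the default value
# with reason ERROR and code INVALID_CONTEXT. The helper name and tuple
# shape are hypothetical, not the actual dd-trace-py API.

def resolve_bool(flag_key, default, context):
    for value in context.values():
        if isinstance(value, (dict, list)):
            # Nested attributes cannot be converted for the Rust evaluator
            return default, "ERROR", "INVALID_CONTEXT"
    # Real evaluation would happen here; the sketch just falls back
    return default, "STATIC", None

value, reason, error_code = resolve_bool(
    "my-flag", False, {"targeting_key": "u1", "attrs": {"nested": True}}
)
```

The appeal of this shape is that the caller always gets a usable value plus machine-readable reason/code tags, which is exactly what the metrics hook needs to record.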
gh-worker-dd-mergequeue-cf854d bot pushed a commit to DataDog/dd-trace-py that referenced this pull request on Mar 26, 2026:
## Description

Adds `feature_flag.evaluations` OTel counter metric emitted on every flag evaluation, following the Go implementation pattern.

**Requirements:**
- `DD_METRICS_OTEL_ENABLED=true` must be set for metrics to emit
- `openfeature-sdk>=0.8.0` (required for `finally_after` hook to receive evaluation details)

**Changes:**
- Implements `FlagEvalMetrics` class and `FlagEvalHook` for metrics tracking
- Fixes flag-not-found behavior to return `Reason.ERROR` with `ErrorCode.FLAG_NOT_FOUND` when the flag is not in an existing config (aligns with Go/iOS SDKs)
- Returns `Reason.DEFAULT` only when no configuration is loaded (preserving existing behavior)

## Testing

### Unit test parity with Go SDK

The following tests mirror the Go SDK `flageval_metrics_test.go`:

| Go Test | Python Test |
|---------|-------------|
| `TestRecord` (targeting match) | `test_record_basic_attributes` |
| `TestRecord` (allocation key) | `test_record_with_allocation_key` |
| `TestRecord` (empty allocation key) | `test_record_empty_allocation_key_not_included` |
| `TestRecord` (error flag not found) | `test_record_with_error` |
| `TestRecord` (disabled flag) | `test_record_disabled_reason` |
| `TestRecordMultipleEvaluations` | `test_record_multiple_evaluations` |
| `TestRecordDifferentFlags` | `test_record_different_flags` |
| `TestRecordAllErrorTypes` | `test_record_all_error_types` |
| `TestIntegrationEvaluate` (type mismatch) | `test_type_conversion_error_records_type_mismatch` |

**Note:** Go tests use a real OTel test meter provider. Python unit tests (`TestFlagEvalMetrics`) use mocks for faster isolated testing, while `TestMetricsWithRealOTel` validates behavior with the real OTel runtime.

### Python-specific tests (not in Go)

- **`TestFlagEvalMetrics`**: OTel initialization tests (graceful handling when OTel is not available), metrics disabled when `DD_METRICS_OTEL_ENABLED=false`, shutdown behavior
- **`TestFlagEvalHook`**: Tests the hook mechanism (`finally_after` calls `metrics.record` with correct arguments)
- **`TestProviderHooksIntegration`**: Tests provider hook registration, `get_provider_hooks()` returns correct hooks, cleanup on shutdown
- **`TestMetricsWithRealOTel`**: Integration tests with the real OTel runtime

### System tests

- System tests PR: DataDog/system-tests#6545 (Python tests passing)

## Risks

- **API behavior change**: Flag evaluations for non-existent flags now return `Reason.ERROR` instead of `Reason.DEFAULT` when configuration is available. Release note added.
- **Dependency upgrade**: Minimum `openfeature-sdk` version increased from 0.6.0 to 0.8.0. Users on older versions will need to upgrade. Release note added.

## Additional Notes

Reference files in Go SDK: [flageval_metrics.go](https://github.com/DataDog/dd-trace-go/blob/main/openfeature/flageval_metrics.go), [flageval_metrics_test.go](https://github.com/DataDog/dd-trace-go/blob/main/openfeature/flageval_metrics_test.go), [provider.go](https://github.com/DataDog/dd-trace-go/blob/main/openfeature/provider.go), [provider_test.go](https://github.com/DataDog/dd-trace-go/blob/main/openfeature/provider_test.go)

Co-authored-by: sameeran.kunche <sameeran.kunche@datadoghq.com>
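The counter semantics described above (one increment per evaluation, aggregated by tag set, with lowercase tag values) can be sketched with a simplified in-memory aggregator. The class and tag names here are illustrative assumptions; the real implementation records through an OTel meter:

```python
# Simplified sketch of how a feature_flag.evaluations counter aggregates
# by tag set. The class name, method signature, and tag keys are
# illustrative, not dd-trace-py's actual API; a real implementation
# calls Counter.add() on an OTel meter instead of using a dict.
from collections import Counter

class FlagEvalMetricsSketch:
    def __init__(self):
        self._counter = Counter()

    def record(self, flag_key, reason, variant=None, error_code=None):
        # Tag values are lowercased for cross-SDK consistency
        tags = (("flag.key", flag_key), ("reason", reason.lower()))
        if variant:
            tags += (("variant", variant),)
        if error_code:
            tags += (("error_code", error_code.lower()),)
        # Evaluations with identical tags aggregate into one series
        self._counter[tags] += 1

metrics = FlagEvalMetricsSketch()
metrics.record("checkout-flag", "TARGETING_MATCH", variant="on")
metrics.record("checkout-flag", "TARGETING_MATCH", variant="on")
metrics.record("missing-flag", "ERROR", error_code="FLAG_NOT_FOUND")
```

After the three calls, the two identical `checkout-flag` evaluations collapse into a single series with count 2, while the error evaluation produces a separate series, which is the aggregation behavior the `test_record_multiple_evaluations` and `test_record_different_flags` tests check for.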
Add comprehensive system tests for FFE (Feature Flagging and Experimentation) flag evaluation metrics. These tests verify that tracers emit correct `feature_flag.evaluations` OTel metrics with proper tags for:

- Basic flag evaluation (flag key, variant, reason, allocation_key)
- Multiple evaluations (correct count aggregation)
- Different flags (separate metric series)
- All resolution reasons (static, targeting_match, split, default, disabled)
- Error codes (flag_not_found, type_mismatch, parse_error, provider_not_ready)
- Lowercase consistency for tag values

Also adds the `feature_flags_eval_metrics` feature declaration for tracer compatibility tracking.
…pe_mismatch

NUMERIC and INTEGER are distinct types; evaluating a NUMERIC flag as INTEGER should return `type_mismatch` (not `parse_error`) to align with libdatadog FFE.
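The distinction this commit makes can be sketched as a simple requested-type check. The helper and type names below are illustrative only, not libdatadog code:

```python
# Sketch: NUMERIC and INTEGER are distinct flag value types, so requesting
# an INTEGER evaluation of a NUMERIC flag is a TYPE_MISMATCH, while a flag
# whose stored value cannot be parsed at all is a PARSE_ERROR. The helper
# name, parameters, and type strings are illustrative assumptions.
def classify_eval_error(flag_type, requested_type, value_parses=True):
    if not value_parses:
        # The flag's value itself is malformed
        return "PARSE_ERROR"
    if flag_type != requested_type:
        # The value is fine, but the caller asked for the wrong type
        return "TYPE_MISMATCH"
    return None

assert classify_eval_error("NUMERIC", "INTEGER") == "TYPE_MISMATCH"
assert classify_eval_error("BOOLEAN", "BOOLEAN") is None
```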
Motivation

Add FFE (Feature Flagging & Experimentation) evaluation metrics tests to verify that the `feature_flag.evaluations` OTel counter metric works correctly across SDKs.

Related:
Changes

Test Infrastructure

- `tests/ffe/test_flag_eval_metrics.py` for Python and Go in manifest
- `OTEL_EXPORTER_OTLP_METRICS_PROTOCOL: "http/protobuf"` added to FFE scenario config
- `opentelemetry-exporter-otlp-proto-http==1.40.0` added to Python weblogs

Test Coverage
Tests for OpenFeature evaluation reasons:

- `STATIC` - catch-all allocation with no rules/shards
- `TARGETING_MATCH` - rules match the context
- `SPLIT` - shards determine variant
- `DEFAULT` - rules don't match, fallback used
- `DISABLED` - flag is disabled

Tests for OpenFeature error codes:

- `FLAG_NOT_FOUND` - config exists but flag missing
- `TYPE_MISMATCH` - STRING→BOOLEAN, NUMERIC→INTEGER conversions
- `PARSE_ERROR` - invalid regex pattern (Python only; Go validates at config load)
- `PROVIDER_NOT_READY` - no config loaded
- `INVALID_CONTEXT` - nested attributes (Python only)
- `TARGETING_KEY_MISSING` - verifies it's NOT returned (JS excluded)
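As a sketch of what these tests assert, the reason and error-code values above can be collected into lowercase expectation lists and checked against normalized tag values. The lists mirror the bullets above; the `to_tag` helper is illustrative, not system-tests code:

```python
# Sketch of the reason / error-code expectations the tests verify.
# Tag values are lowercased, matching the lowercase-consistency tests.
# The lists are assembled from the coverage bullets above; the helper
# is illustrative only.
EXPECTED_REASONS = ["static", "targeting_match", "split", "default", "disabled"]
EXPECTED_ERROR_CODES = [
    "flag_not_found", "type_mismatch", "parse_error",
    "provider_not_ready", "invalid_context",
]

def to_tag(value):
    """Normalize an SDK enum name to a lowercase metric tag value."""
    return value.strip().lower()

assert to_tag("TARGETING_MATCH") in EXPECTED_REASONS
assert to_tag("FLAG_NOT_FOUND") in EXPECTED_ERROR_CODES
# TARGETING_KEY_MISSING must NOT appear as an emitted error code
assert to_tag("TARGETING_KEY_MISSING") not in EXPECTED_ERROR_CODES
```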
Cross-SDK Consistency

- `@irrelevant` decorators where behavior intentionally differs

Reviewer checklist
- `tests/` or `manifests/` is modified? I have the approval from the R&P team
- `build-XXX-image` label is present