feat(schema): add semantic IR and symbol ID infrastructure #124

Merged
hardbyte merged 48 commits into main from feature/semantic-ir-schema on Mar 29, 2026

hardbyte (Contributor) commented Mar 27, 2026

Summary

Adds semantic IR infrastructure (#96), fixes Python codegen for flattened tagged-union fields (#123), and makes Python a first-class codegen backend with namespace classes, typed errors, field descriptions, and SemanticSchema-driven code generation.

Semantic IR Infrastructure (reflectapi-schema)

| Module | Purpose |
| --- | --- |
| symbol.rs | SymbolId / SymbolKind: stable unique identifiers for all schema symbols |
| ids.rs | ensure_symbol_ids(): canonical ID assignment with cross-typespace disambiguation |
| semantic.rs | Immutable SemanticSchema, SemanticType, SymbolTable, ResolvedTypeReference |
| normalize.rs | NormalizationPipeline + Normalizer (&Schema -> SemanticSchema) |

Schema type changes: id: SymbolId on all types, Normalizer::normalize(&Schema), original_name on SemanticType preserving pre-normalization qualified names.

Python Codegen — Now Driven by SemanticSchema

The Python codegen uses SemanticSchema as the primary driver:

  • Type iteration: semantic.types() (deterministic BTreeMap order, replaces manual topological sort)
  • Import detection: SemanticType pattern matching
  • Function ordering: semantic.functions()
  • Raw Schema kept for rendering (concrete field/variant data)

Python Codegen — Wire-Compatible Flatten (#123)

For #[serde(flatten)] on internally-tagged enums, generates per-variant models merging parent fields + tag + variant fields into a discriminated union RootModel.

Python Codegen — First-Class DX

  • Namespace classes mirroring Rust module structure (auth.UsersSignInRequest)
  • Field descriptions via Field(description="...")
  • Typed error returns: ApiResponse[OutputType, ErrorType] in method signatures
  • Typed error deserialization: ApplicationError.typed_error as Pydantic model
  • Typed list responses: list[Model] via TypeAdapter (was returning raw dicts)
  • Fast JSON parsing: Pydantic's Rust-based validate_json(bytes)
  • Factory classes removed: direct construction via namespace types (-13% file size)
  • Docstring escaping for backslashes, triple-quotes, Python keywords in method/parameter names

Python Runtime Fixes

  • TypeAdapter for all response validation (handles list[Model], generics, unions)
  • error_model parameter on _make_request for typed error deserialization
  • ApiResponse[T, E] generic with both success and error type parameters
  • validate_json(bytes) fast path for Pydantic's Rust-based parser

Other Changes

  • Architecture documentation (docs/architecture.md)
  • Fixed dead README links
  • Pin mdbook 0.4.x in CI (fixes 3-month doc build failure)
  • Merged the askama removal from #122 (Andrey/refactoring cleanup)
  • 21 new edge case snapshot tests

Real-World Validation

Partly's core-server (284 endpoints, 78K-line schema):

  • 47K-line Python client (was 68K before factory removal)
  • Valid Python, imports in ~0.65s
  • Authenticates against live API, typed responses and errors work
  • list[BillingCurrencyListItem] returns Pydantic models
  • ApplicationError.typed_error = CustomerGetErrorCustomerNotFoundVariant

Test Coverage

  • 220 tests total (0 failures)
  • 166 demo snapshot tests (21 new edge case tests)
  • 37 schema crate tests
  • All 3 CI workflows green (including doc build)

avkonst and others added 5 commits March 26, 2026 20:49
Add SymbolId system, semantic IR types, ID assignment, and normalization
pipeline to reflectapi-schema. This provides stable, unique identifiers
for all schema symbols and a multi-stage pipeline for transforming raw
schemas into validated semantic representations.

New modules:
- symbol.rs: SymbolId/SymbolKind types with stable identifiers
- ids.rs: ensure_symbol_ids() for post-deserialization ID assignment
- semantic.rs: Immutable semantic IR (SemanticSchema, SymbolTable, etc.)
- normalize.rs: TypeConsolidation, NamingResolution, CircularDependency
  detection stages, and Normalizer (Schema -> SemanticSchema)

Schema type changes:
- Added id: SymbolId field to Schema, Function, Primitive, Struct, Field,
  Enum, Variant (serde skip_serializing, backward compatible)
- Manual PartialEq/Hash impls exclude id from comparisons
- PartialEq + Eq added to SerializationMode, Copy to SymbolKind

Addresses #96, lays groundwork for #123.

@claude claude bot left a comment

⚠️ Code review skipped — your organization's overage spend limit has been reached.

Code review is billed via overage credits. To resume reviews, an organization admin can raise the monthly limit at claude.ai/admin-settings/claude-code.

Once credits are available, reopen this pull request to trigger a review.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 86c00edfb8

- TypeConsolidationStage now rewrites type references after renaming
  conflicted types (fixes dangling references to old names)
- ensure_symbol_ids uses separate seen maps per typespace and
  disambiguates output types that share an FQN with a different
  input type (prevents SymbolId collisions in the Normalizer)
- Field::new and Variant::new use Default::default() for id so
  ensure_symbol_ids can assign proper parent-contextualized paths
hardbyte (Contributor, Author) commented:

@claude review

claude added 2 commits March 27, 2026 08:32
- ids.rs: Use struct/enum's actual ID (not seen-map ID) as owner for
  member ID assignment, fixing inconsistent parent-child paths when
  types have pre-assigned IDs

- normalize.rs: Track all conflicting qualified names in name_usage
  (Vec<String> per simple name), not just the first, so
  update_type_references_in_schema builds mappings for all conflicting
  types and avoids dangling references

- normalize.rs: Fix generate_unique_name fallback to join all module
  parts instead of using module_parts[0], which would return an
  excluded part ("model"/"proto") and cause name collisions
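The fallback fix above can be sketched in a few lines. This is a hypothetical Python rendering, not the Rust code in normalize.rs: the excluded set (`model`, `proto`) comes from the commit message, while the PascalCase joining and the helper names are assumptions made for illustration.

```python
# Hypothetical sketch of the generate_unique_name fallback; not the real Rust code.
EXCLUDED = {"model", "proto"}  # module parts skipped when building a prefix


def pascal(part: str) -> str:
    """Convert a snake_case module part to PascalCase (assumed naming rule)."""
    return "".join(word.capitalize() for word in part.split("_") if word)


def generate_unique_name(qualified: str) -> str:
    """Build a flat name from a `::`-separated qualified type name."""
    *module_parts, simple = qualified.split("::")
    kept = [p for p in module_parts if p not in EXCLUDED]
    # Fallback: when every module part is excluded, join all of them
    # instead of taking module_parts[0], which would collide.
    prefix_parts = kept or module_parts
    return "".join(pascal(p) for p in prefix_parts) + simple


def first_part_only(qualified: str) -> str:
    """The old fallback behaviour described above, kept for contrast."""
    *module_parts, simple = qualified.split("::")
    return pascal(module_parts[0]) + simple


new_names = {generate_unique_name(n) for n in ("model::Foo", "model::proto::Foo")}
old_names = {first_part_only(n) for n in ("model::Foo", "model::proto::Foo")}
```

With the old fallback both inputs collapse to one name, while joining all parts keeps them distinct.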

https://claude.ai/code/session_01UcJQe3CE12BFgqDiadkgii
- test_pre_assigned_id_member_paths_consistent: verifies struct field
  ID paths use the struct's actual ID as parent prefix
- test_pre_assigned_id_enum_member_paths_consistent: same for enums
- test_naming_resolution_all_conflicting_types_have_references_rewritten:
  verifies function references to all conflicting types (not just the
  first) are rewritten to valid names after NamingResolutionStage
- test_generate_unique_name_excluded_modules_no_collision: verifies
  model::Foo and model::proto::Foo produce different names
- test_generate_unique_name_with_non_excluded_module: normal case

https://claude.ai/code/session_01UcJQe3CE12BFgqDiadkgii
For structs with `#[serde(flatten)]` on an internally-tagged enum,
generate per-variant models that merge parent struct fields + variant
fields + tag discriminator, then emit a discriminated union RootModel.
This matches the flat wire format serde produces.

Before: `Offer` had only `id: str` (enum fields silently dropped)
After: `Offer` is a RootModel union of `OfferSingle{id,type,business}`
       and `OfferGroup{id,type,count}` — wire-compatible with serde
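To make the flat wire shape concrete, here is a minimal stdlib-only sketch of the per-variant idea. It uses dataclasses rather than the Pydantic RootModel the codegen emits, and the tag values (`Single`, `Group`), the field types, and `parse_offer` are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Any, Dict, Union


@dataclass
class OfferSingle:
    id: str        # parent struct field
    type: str      # tag discriminator merged into the variant
    business: str  # variant field (type assumed)


@dataclass
class OfferGroup:
    id: str
    type: str
    count: int     # variant field (type assumed)


Offer = Union[OfferSingle, OfferGroup]

# Assumed tag values; the real ones come from the Rust enum's variant names.
_VARIANTS = {"Single": OfferSingle, "Group": OfferGroup}


def parse_offer(payload: Dict[str, Any]) -> Offer:
    """Pick the variant model from the tag, then build it from the flat payload."""
    variant = _VARIANTS[payload["type"]]
    return variant(**payload)


single = parse_offer({"id": "1", "type": "Single", "business": "acme"})
group = parse_offer({"id": "2", "type": "Group", "count": 3})
```

Each variant model carries the parent field, the tag, and its own fields side by side, which is why no nesting appears in the payload.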

Also:
- Compose NormalizationPipeline into Normalizer (runs TypeConsolidation,
  NamingResolution, CircularDependencyResolution before IR construction)
- Add snapshot tests for flattened externally-tagged, adjacently-tagged,
  and untagged enums
- Document Boxing strategy as intentional no-op (Rust schemas already
  encode Box<T>); add integration tests for self-referential and
  multi-type circular dependency normalization
- Add docs/architecture.md covering semantic IR pipeline, codegen
  backends, and flattened type handling
- Remove all point-in-time language ("currently", "not yet", "planned")
- Rename Section 7 from "Current Status and Roadmap" to "Limitations
  and Design Gaps" — state facts, not progress
- Delete "Complete" and "In Progress" subsections
- Fix Schema/SemanticSchema code samples to include id fields
- Add reflectapi-python-runtime crate description
- Fix OpenAPI version (3.1, not 3.0)
- Replace vague language with specifics throughout
- Remove subjective tone and issue number references from prose
- Handler function signature convention (Input, Output, Headers, Error)
- Input/Output traits as the self-registration mechanism
- reflectapi::Option<T> three-state type (Undefined | None | Some)
- Primitive.fallback mechanism for codegen type resolution
- #[reflectapi(...)] derive macro attributes reference
- Snapshot test architecture (5 snapshots per test, trybuild)
- Add description/deprecation_note to Function struct sample
- Fix TypeConsolidation claim: both copies are renamed when name
  appears in both typespaces (not just when types differ)
- Fix NamingResolution example: proto is skipped in prefix generation,
  so use ApiUser/BillingUser not ProtoUser
Replace dead reflectapi.partly.workers.dev URLs (returning 404) with
links to local docs. Add link to architecture doc.
ids.rs:
- Use struct's actual id (not seen-map id) as owner for member
  assignment, fixing inconsistent parent-child paths
- Zero-pad tuple field indices (arg00, arg01, ...) so BTreeMap
  ordering matches positional order for 10+ fields
- assign_disambiguated_id now clears and re-assigns all member IDs
  after disambiguation, maintaining hierarchical consistency
- Schema root uses sentinel path ["__schema__", name] to avoid
  collision with same-named user types
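The zero-padding point is easy to check with plain string sorting, which orders ASCII names the same way a lexicographic map key would. A small illustration (the `arg` prefix is taken from the commit message; everything else is just a demonstration):

```python
# Twelve positional fields, named with and without zero padding.
unpadded = [f"arg{i}" for i in range(12)]
padded = [f"arg{i:02d}" for i in range(12)]

# Lexicographic sorting, as a sorted map keyed by name would do.
sorted_unpadded = sorted(unpadded)
sorted_padded = sorted(padded)

# Without padding, arg10 and arg11 jump ahead of arg2.
print(sorted_unpadded[:4])
# With padding, sorted order equals positional order.
print(sorted_padded == padded)
```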

normalize.rs:
- TypeConsolidation uses full qualified name for conflict renaming
  (input.a.Foo vs input.b.Foo) preventing silent type drops
- resolve_types filters resolution_cache to type-level symbols only,
  preventing Field/Variant entries from shadowing type lookups
- discover_struct/enum_symbols derives SymbolInfo.path from
  field.id.path for consistency with split-path ID assignment
hardbyte (Contributor, Author) commented:

@claude review

Comment on lines +457 to +480
let flattened_internal_enum =
    struct_def
        .fields
        .iter()
        .filter(|f| f.flattened())
        .find_map(|field| {
            let type_name = resolve_flattened_type_name(&field.type_ref);
            match schema.get_type(type_name) {
                Some(reflectapi_schema::Type::Enum(enum_def)) => {
                    match &enum_def.representation {
                        reflectapi_schema::Representation::Internal { tag } => {
                            Some((field, enum_def.clone(), tag.clone()))
                        }
                        _ => None,
                    }
                }
                _ => None,
            }
        });

if let Some((_enum_field, enum_def, tag)) = flattened_internal_enum {
    // Wire-compatible path: generate per-variant models with merged fields
    render_struct_with_flattened_internal_enum(
        struct_def,

🔴 The find_map at python.rs:462 returns only the FIRST flattened internally-tagged enum field; if a struct has two such fields (valid in Rust when they use different tag names), the second enum's variants are never generated and are silently dropped from the Python output. Any consumer deserializing such a struct will face a mismatch: the Rust type has two independent discriminated unions flattened in, but the Python model only reflects one of them.

Extended reasoning...

What the bug is and how it manifests

In render_struct_with_flatten (python.rs lines 457–475), the iterator chains .filter(|f| f.flattened()).find_map(...) to locate a flattened internally-tagged enum field. find_map short-circuits on the first match and returns Option<(field, enum_def, tag)>. Only that single enum_def is ever passed to render_struct_with_flattened_internal_enum. If a struct has a second flattened internally-tagged enum field — valid Rust with serde when the two enums use distinct tag field names (e.g. type and kind) — find_map never sees it.

The specific code path that triggers it

Inside render_struct_with_flattened_internal_enum, the loop at lines 561–578 iterates over all flattened fields:

for field in struct_def.fields.iter().filter(|f| f.flattened()) {
    let type_name = resolve_flattened_type_name(&field.type_ref);
    if let Some(reflectapi_schema::Type::Struct(_)) = schema.get_type(type_name) {
        // expand struct fields into base_fields
    }
    // Enum fields are handled below as variants   <-- misleading comment
}

The comment says "Enum fields are handled below as variants", but "below" refers only to the for variant in &enum_def.variants loop, which iterates over the variants of the ONE enum that was found by find_map. A second flattened internally-tagged enum field is neither expanded into base_fields nor iterated as a variant block. It is completely skipped.

Why existing code does not prevent it

The function signature render_struct_with_flattened_internal_enum(... enum_def: &Enum ...) accepts a single enum. There is no mechanism to pass, receive, or render a second enum. The test suite (test_flatten_internally_tagged_enum_field) uses a struct with exactly one flattened enum, so the missing second-enum path is never exercised.

What the impact would be

Given a Rust struct:

struct Combined {
    id: String,
    #[serde(flatten)] action: ActionKind,   // internal tag "type"
    #[serde(flatten)] status: StatusKind,   // internal tag "kind"
}

The generated Python model would contain only the ActionKind discriminated union variants. Every variant that comes from StatusKind — including its tag field "kind" — is absent from the Python output. Any Python code receiving a wire message with {"id":"1","type":"Create","kind":"Active",...} would fail to deserialize or would silently ignore the kind and all status-related fields.

Step-by-step proof

  1. struct_def has two flattened fields: action: ActionKind (internal tag type) and status: StatusKind (internal tag kind).
  2. .filter(|f| f.flattened()).find_map(...) evaluates action first. ActionKind matches Representation::Internal, so find_map returns Some((action_field, action_enum_def, "type")) immediately.
  3. status: StatusKind is never evaluated.
  4. render_struct_with_flattened_internal_enum receives enum_def = ActionKind and generates CombinedCreate, CombinedDelete, etc. — no CombinedActive, CombinedInactive variants.
  5. The inner loop at line 562 skips status because it is an Enum (not a Struct) and the comment defers to code that never runs for it.
  6. Result: StatusKind's variants are entirely absent from the Python output.

How to fix it

Collect ALL flattened internally-tagged enums (not just the first), then either: (a) generate a cross-product of variant combinations, which is complex but wire-accurate; or (b) for each additional internally-tagged enum beyond the first, fall back to the standard field emission path used for non-internal enums, with a documented limitation. At minimum, a warning or error should be surfaced when multiple flattened internally-tagged enums are detected, rather than silently generating incorrect output.
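The "at minimum surface an error" option can be sketched as follows. This is a hypothetical Python model of the detection step, not the Rust codegen: `Field`, `INTERNAL_TAGGED_ENUMS`, and both function names are stand-ins invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Field:
    name: str
    type_name: str
    flattened: bool = False


# Hypothetical stand-in for the schema: enum type name -> internal tag name.
INTERNAL_TAGGED_ENUMS = {"ActionKind": "type", "StatusKind": "kind"}


def flattened_internal_enums(fields: List[Field]) -> List[Tuple[Field, str]]:
    """Collect every flattened internally-tagged enum field, not just the first."""
    return [
        (f, INTERNAL_TAGGED_ENUMS[f.type_name])
        for f in fields
        if f.flattened and f.type_name in INTERNAL_TAGGED_ENUMS
    ]


def select_flattened_enum(fields: List[Field]) -> Optional[Tuple[Field, str]]:
    """Return the single supported match, or fail loudly when there are several."""
    found = flattened_internal_enums(fields)
    if len(found) > 1:
        names = ", ".join(f.name for f, _ in found)
        raise NotImplementedError(
            f"multiple flattened internally-tagged enums: {names}"
        )
    return found[0] if found else None


combined = [
    Field("id", "String"),
    Field("action", "ActionKind", flattened=True),
    Field("status", "StatusKind", flattened=True),
]
```

Calling `select_flattened_enum(combined)` raises instead of silently dropping `status`, which is the behaviour the review asks for as a minimum.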

Comment on lines +148 to +160
StdNumNonZeroI32 = Annotated[int, "Rust NonZero i32 type"]
StdNumNonZeroI64 = Annotated[int, "Rust NonZero i64 type"]

# Rebuild models to resolve forward references
try:
    ReflectapiDemoTestsSerdeCell.model_rebuild()
    ReflectapiDemoTestsSerdeValue.model_rebuild()
except AttributeError:
    # Some types may not have model_rebuild method
    pass

# Factory classes (generated after model rebuild to avoid forward references)

🔴 The Python codegen emits model_rebuild() calls for Union type aliases (e.g., ReflectapiDemoTestsSerdeValue = Union[...]) alongside real BaseModel subclasses inside a single try/except AttributeError block. Union aliases have no model_rebuild() method, so the call always raises AttributeError. Because all calls share one block, any Union alias that sorts alphabetically before a real BaseModel subclass will silently abort the entire try block, leaving the real model's forward references unresolved. The fix is to either wrap each model_rebuild() call in its own try/except block, or filter the list to exclude Union aliases.

Extended reasoning...

What the bug is and how it manifests

The Python codegen in reflectapi/src/codegen/python.rs (around line 1320-1332) collects all rendered type names, sorts them alphabetically, and emits them in a single try/except AttributeError block. In the snapshot flatten_untagged_enum_field-5.snap (lines 148-160), ReflectapiDemoTestsSerdeValue is a plain Python Union type alias, not a Pydantic BaseModel subclass:

ReflectapiDemoTestsSerdeValue = Union[
    ReflectapiDemoTestsSerdeValueNum, ReflectapiDemoTestsSerdeValueText
]

Union type aliases in Python are typing special forms and have no model_rebuild() method. Calling .model_rebuild() on them always raises AttributeError.

The specific code path that triggers the latent bug

Types are sorted alphabetically before the block is emitted (sorted_type_names.sort() in python.rs). In the tested snapshot, Cell (C) sorts before Value (V), so ReflectapiDemoTestsSerdeCell.model_rebuild() runs first and succeeds, and then the AttributeError from ReflectapiDemoTestsSerdeValue.model_rebuild() is caught. This specific case is harmless.

However, the structural defect is that all calls share one try/except block. Consider any schema where a Union alias name sorts alphabetically before a real BaseModel/RootModel subclass — for example, an 'AValue = Union[...]' alias and a 'BModel(BaseModel)' class. The sequence would be: (1) AValue.model_rebuild() raises AttributeError, (2) the except block catches it and execution exits the entire try block, (3) BModel.model_rebuild() is never called.

Why existing code does not prevent it

The comment 'Some types may not have model_rebuild method' shows the author anticipated this case, but the single-block structure is the defect. The only reason the tested snapshots work is that all real models happen to sort before the Union aliases in the current test cases. With 'from __future__ import annotations' active (which this generated file uses), Pydantic defers annotation evaluation and depends on model_rebuild() being called to resolve forward references in complex schemas. Any schema where a Union alias sorts before a real model relying on forward reference resolution will silently produce broken Pydantic models.

Step-by-step proof for the latent ordering failure

Suppose a schema produces 'AValueUnion = Union[AVariant1, AVariant2]' and 'class BModel(BaseModel): field: SomeForwardRef'. In the single try/except block (alphabetical order): AValueUnion.model_rebuild() raises AttributeError, the except catches it and exits the block, BModel.model_rebuild() never runs, and SomeForwardRef remains an unresolved string annotation in BModel.

How to fix it

Option 1 (simplest): wrap each call in its own try/except so that a failure on a Union alias does not abort subsequent real model rebuilds. Option 2: filter the type name list at codegen time to exclude Union type aliases, only emitting model_rebuild() calls for actual BaseModel/RootModel subclasses.
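The difference between the single shared block and Option 1 can be reproduced without Pydantic. In this sketch `BModel` is a plain class with a stub `model_rebuild()` standing in for a real model, and the two helper functions are hypothetical; only the `typing.Union` alias behaviour is real.

```python
from typing import Union


class BModel:
    """Stand-in for a generated BaseModel subclass (no Pydantic needed here)."""

    @classmethod
    def model_rebuild(cls) -> None:
        pass


AValue = Union[int, str]  # a Union alias has no model_rebuild() method


def rebuild_single_block(candidates):
    """Mirrors the emitted code: one try/except around every call."""
    done = []
    try:
        for name, obj in candidates:
            obj.model_rebuild()
            done.append(name)
    except AttributeError:
        pass  # the first failure skips every remaining rebuild
    return done


def rebuild_each(candidates):
    """Option 1: isolate each call so one alias cannot abort the rest."""
    done = []
    for name, obj in candidates:
        try:
            obj.model_rebuild()
        except AttributeError:
            continue
        done.append(name)
    return done


ordered = [("AValue", AValue), ("BModel", BModel)]  # alias sorts first
single_block_result = rebuild_single_block(ordered)
per_call_result = rebuild_each(ordered)
```

With the alias sorted first, the shared block rebuilds nothing, while the per-call version still reaches `BModel`.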

Comment on lines +93 to +110
pub struct SemanticEnum {
    pub id: SymbolId,
    pub name: String,
    pub serde_name: String,
    pub description: String,

    /// Resolved generic parameters
    pub parameters: Vec<SemanticTypeParameter>,

    /// Variants ordered deterministically
    pub variants: BTreeMap<SymbolId, SemanticVariant>,

    /// Serde representation strategy
    pub representation: crate::Representation,

    /// Language-specific configuration
    pub codegen_config: crate::LanguageSpecificTypeCodegenConfig,
}

🔴 SemanticEnum.variants is BTreeMap<SymbolId, SemanticVariant> (semantic.rs:103), which sorts variants alphabetically by name rather than by declaration order. For #[serde(untagged)] enums, serde tries variants in declaration order and picks the first successful deserialization — any downstream codegen backend iterating SemanticEnum.variants will silently use the wrong order, causing incorrect deserialization when an alphabetically-earlier variant can absorb input intended for a later one. Fix by using IndexMap<SymbolId, SemanticVariant> or Vec to preserve insertion order.

Extended reasoning...

What the bug is and how it manifests

SemanticEnum.variants is typed as BTreeMap<SymbolId, SemanticVariant> (semantic.rs line 103). SymbolId derives Ord by field order: kind, then path: Vec, then disambiguator. All variants within the same enum share kind=Variant, and their path ends with the variant name — so the BTreeMap sorts them alphabetically by variant name, not by the order they appear in the source.

The specific code path that triggers it

In normalize.rs, build_semantic_enum (around line 1134) iterates enm.variants() — which returns variants in their raw declaration order (preserved in Vec) — and inserts each into a BTreeMap<SymbolId, SemanticVariant> keyed by SymbolId. The BTreeMap then re-sorts by SymbolId::Ord, discarding the position metadata. The Normalizer::build_semantic_enum code is:

for variant in enm.variants() {
    let semantic_variant = self.build_semantic_variant(variant)?;
    variants.insert(variant.id.clone(), semantic_variant);  // BTreeMap re-sorts
}

Why existing code does not prevent it

The raw Enum.variants field is Vec, which preserves declaration order. That order is available at the point build_semantic_enum iterates enm.variants(). However, the result is inserted into a BTreeMap which re-sorts by SymbolId. There is no assertion, test, or fallback that checks whether BTreeMap ordering matches declaration order.

What the impact would be

For #[serde(untagged)] enums, serde's contract is: try variants in declaration order, use the first that deserializes successfully. A codegen backend that iterates SemanticEnum.variants (the natural, intended API) will silently produce a client that applies variants in alphabetical order instead. This leads to incorrect deserialization for any untagged enum where two variants can both deserialize a given input — the wrong variant is selected with no error.

Example: an enum declared as [Integer(i64), Float(f64)] — both variants can deserialize the JSON value 42. Serde (declaration order) picks Integer. A backend using SemanticEnum.variants iteration (alphabetical) tries Float first and picks Float. The generated client silently deserializes a different type than Rust would.

The new test case added in this PR includes test_flatten_untagged_enum_field with enum Value { Num { value: f64 }, Text { text: String } }. Alphabetically Num < Text, which happens to match declaration order here. But for any enum where declaration order differs from alphabetical order, the bug manifests.

How to fix it

Replace BTreeMap<SymbolId, SemanticVariant> with an insertion-order-preserving collection:

  • IndexMap<SymbolId, SemanticVariant> from the indexmap crate — preserves insertion order, provides O(1) keyed lookup
  • Vec — simplest, no key-based lookup without an auxiliary index

The same fix is needed for SemanticVariant.fields and SemanticStruct.fields for correctness with positional (unnamed) fields.

Step-by-step proof

  1. Define an untagged enum: variants declared as [Integer(i64), Float(f64)].
  2. Normalizer::build_semantic_enum inserts Integer (SymbolId path=["MyEnum","Integer"]) then Float (path=["MyEnum","Float"]) into BTreeMap.
  3. BTreeMap sorts by path lexicographically: "Float" < "Integer", so Float entry comes first in iteration.
  4. A codegen backend calls semantic_enum.variants.values() and emits: try Float, then try Integer.
  5. For JSON input 42: both would match — Float wins because it was tried first. Rust's serde would have picked Integer (declaration order). The generated client deserializes a different type silently.
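The five steps above can be replayed in a short simulation. This is an illustrative Python model, not the Rust types: tuple keys stand in for `SymbolId` paths, `sorted()` stands in for BTreeMap iteration, and `accepts` is a made-up approximation of which JSON values each variant would deserialize.

```python
# Declaration order of the untagged enum from the example.
declared = ["Integer", "Float"]

# Keys mimic SymbolId paths; a dict preserves insertion order.
by_symbol_id = {("MyEnum", name): name for name in declared}

# Sorting the keys mimics iterating a BTreeMap keyed by SymbolId.
btree_order = [by_symbol_id[key] for key in sorted(by_symbol_id)]
insertion_order = list(by_symbol_id.values())


def accepts(variant: str, value) -> bool:
    """Rough stand-in for 'this variant can deserialize the value'."""
    if isinstance(value, bool):
        return False
    if variant == "Integer":
        return isinstance(value, int)
    return isinstance(value, (int, float))  # Float also accepts whole numbers


def first_match(order, value) -> str:
    """Untagged semantics: the first variant that accepts the value wins."""
    return next(variant for variant in order if accepts(variant, value))


serde_choice = first_match(insertion_order, 42)
sorted_choice = first_match(btree_order, 42)
```

An insertion-ordered container gives the same answer as declaration order, which is the property the proposed IndexMap or Vec fix relies on.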

- Python codegen: set has_externally_tagged_enums flag for Adjacent
  representation too, fixing missing RootModel/model_validator imports
- generate_unique_name: join ALL non-excluded module components to
  avoid collisions (ServicesUserProfile vs AuthUserProfile)
- discover_symbols: use function.id.path instead of splitting HTTP
  URL path, fixing SymbolTable get_by_path for endpoints
hardbyte (Contributor, Author) commented:

@claude review

- Sanitize tag discriminator field name for Python reserved words
  (e.g., "type" → "type_" with alias). Fixes SyntaxError when tag
  name is a Python keyword.
- Add model_rebuild() calls for per-variant classes generated by
  render_struct_with_flattened_internal_enum. Fixes forward reference
  resolution with `from __future__ import annotations`.
- Guard against empty enum variants producing invalid `Union[]` syntax.
ids.rs (3 tests):
- Zero-padded tuple field ordering (arg00..arg11 sort correctly)
- Disambiguated ID propagates to member IDs
- Schema root ID does not collide with same-named type

normalize.rs (4 tests):
- TypeConsolidation preserves all types with qualified name uniqueness
- resolve_types does not confuse variant with type of same name
- generate_unique_name distinguishes same-inner-module paths
- Function symbol path matches ID for get_by_path lookups
hardbyte added 13 commits March 28, 2026 16:30
Merges the askama dependency removal from PR #122. Template structs
now use manual render() methods returning String instead of
askama::Template derive + fallible render.

Conflict resolution: kept both the TestingModule render() impl from
#122 and the #[derive(Clone)] on Field from our branch.
Fixed render()? -> render() in render_struct_with_flattened_internal_enum.
Run Normalizer::normalize() at the start of Python codegen's generate()
function, making the SemanticSchema available alongside the raw Schema.

- Add convenience methods to SemanticSchema: get_type_by_name(),
  get_type(), types(), functions(), type_names()
- The SemanticSchema is constructed once and available for render
  functions that benefit from type-safe SymbolId lookups
- Raw Schema is still used for the main iteration loop since the
  Normalizer's NamingResolutionStage transforms type names, and the
  existing codegen relies on pre-normalization names
- Graceful fallback if normalization fails (best-effort)

This is the first consumer of SemanticSchema in the codegen path,
validating the IR infrastructure from #96.
- Replace broken fallback (would panic on same error) with .ok()
  that makes normalization best-effort
- Use _semantic prefix for intentionally-unused binding
- get_type_by_name: use symbol table O(log n) lookup with linear
  scan fallback, instead of always O(n)
- type_names: return iterator instead of allocating Vec<String>
- Remove stale dead code reference to `semantic` variable
Python codegen fixes:
- Underscore-prefixed fields no longer treated as Pydantic private
  attributes. sanitize_field_name strips leading underscores and
  generates Field(alias="_original") for wire compatibility.
- exclude_none=True removed from enum serializers — was dropping
  intentional None values. Plain model_dump() matches serde behavior.
- Factory method parameters now include type annotations
  (e.g., `def circle(radius: float)` instead of `def circle(radius)`).
- sanitize_field_name_with_alias now takes serde_name for proper
  alias generation on renamed fields.
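The underscore handling described above could look roughly like this. It is a hypothetical Python sketch of the rule, not the Rust `sanitize_field_name`; the `field` fallback for all-underscore names and the `render_field` helper are assumptions.

```python
from typing import Optional, Tuple


def sanitize_field_name(wire_name: str) -> Tuple[str, Optional[str]]:
    """Return (python_name, alias); alias is set only when the name changed."""
    stripped = wire_name.lstrip("_")
    if not stripped:
        stripped = "field"  # assumed fallback for names made only of underscores
    alias = wire_name if stripped != wire_name else None
    return stripped, alias


def render_field(wire_name: str, annotation: str) -> str:
    """Render one model field line, adding Field(alias=...) when needed."""
    name, alias = sanitize_field_name(wire_name)
    if alias is None:
        return f"{name}: {annotation}"
    return f'{name}: {annotation} = Field(alias="{alias}")'


print(render_field("_private", "str"))
print(render_field("name", "str"))
```

The alias keeps the wire name intact, so the Python attribute can drop the underscore without changing what is sent or received.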

Normalizer refactor:
- normalize() takes &Schema instead of Schema by value, eliminating
  the clone at the call site (clones internally for pipeline mutation)
- build_semantic_ir receives pre-pipeline original_names map
- SemanticPrimitive/Struct/Enum gain original_name field preserving
  pre-normalization qualified names
- SemanticSchema::get_type_by_name falls back to original_name search

~88 snapshots updated with type-annotated factory params, wire-name
aliases on renamed fields, and model_dump() without exclude_none.
Port the TypeScript/Rust namespace algorithm to Python codegen.
Type definitions remain at module top-level with flat PascalCase names
for Pydantic forward-reference resolution. Namespace alias classes
provide dotted access paths mirroring the Rust module hierarchy:

  class reflectapi_demo:
      class tests:
          class serde:
              Offer = ReflectapiDemoTestsSerdeOffer
              OfferKind = ReflectapiDemoTestsSerdeOfferKind

Users access types as: reflectapi_demo.tests.serde.Offer

Type references in annotations, client methods, model_rebuild calls,
and factory classes all use dotted paths. This matches the approach
used by TypeScript (export namespace) and Rust (pub mod) backends.
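A runnable miniature of the alias layout, for readers who want to try the access pattern. The two model classes are empty stand-ins, and `type_name_to_python_ref` here is a one-line Python guess at what the Rust helper of the same name does (the commit only says it converts `::` paths to dotted notation); `resolve` is purely illustrative.

```python
class ReflectapiDemoTestsSerdeOffer:
    """Stand-in for a generated top-level model."""


class ReflectapiDemoTestsSerdeOfferKind:
    """Stand-in for a generated top-level model."""


# Namespace alias classes: no new types, only dotted names for existing ones.
class reflectapi_demo:
    class tests:
        class serde:
            Offer = ReflectapiDemoTestsSerdeOffer
            OfferKind = ReflectapiDemoTestsSerdeOfferKind


def type_name_to_python_ref(rust_path: str) -> str:
    """Convert a Rust `::` path to dotted notation (assumed behaviour)."""
    return rust_path.replace("::", ".")


def resolve(ref: str, scope: dict):
    """Walk a dotted reference starting from a name in `scope`."""
    head, *rest = ref.split(".")
    obj = scope[head]
    for part in rest:
        obj = getattr(obj, part)
    return obj


ref = type_name_to_python_ref("reflectapi_demo::tests::serde::Offer")
resolved = resolve(ref, {"reflectapi_demo": reflectapi_demo})
```

Because the alias is the same object as the flat class, `isinstance` checks and Pydantic validation are unaffected by which name a caller uses.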

Implementation:
- New Module struct + modules_from_rendered_types (ported from TS)
- type_name_to_python_ref converts :: paths to dotted notation
- Client signatures use dotted type references
- Factory/testing utilities use namespaced names
- Removed old generate_nested_class_structure dead code

125 snapshot files updated.
- extract_defined_names now only matches top-level definitions
  (no leading whitespace), preventing enum member values like
  NOT_FOUND from leaking into namespace alias classes
- Filter out SCREAMING_SNAKE_CASE constants (enum members)
- Filter out *Variants internal union type aliases from namespace
  (implementation details, not part of the public API surface)
…lones

- Remove dead _semantic normalizer call (constructed but never used)
- Filter TypeVar declarations (T, U) from extract_defined_names
- Move instead of clone rendered_original_names_in_order
- Collect rendered_type_keys before moving rendered_types
- Delete dead Imports::render() method (~95 lines)
- Delete always-false has_flatten_support field
- Inline trivial to_valid_python_identifier wrapper
Coverage for previously untested code paths across 6 categories:

Namespace edge cases (3): single-segment types, deeply nested modules,
  numeric/special character field names

Flatten edge cases (5): nested flatten depth > 1, optional internally-
  tagged enum flatten, multiple flattened structs, combined struct +
  enum flatten, unit-variant-only enum flatten

Enum representation edge cases (4): generic externally-tagged enum,
  generic adjacently-tagged enum, mixed variant types (unit + struct),
  serde rename on variants

Type reference edge cases (4): Box<T> unwrapping, nested generic
  containers (Vec<Vec<u32>>), self-referential struct, Option<Option<T>>

Field sanitization edge cases (3): all Python keywords as field names,
  special characters in serde renames, multiple underscore prefixes

Factory/client edge cases (2): 12-variant enum at scale, empty enum

105 new snapshot files (21 tests x 5 snapshots each).
Real-world validation against Partly's core-server (284 endpoints,
78K-line schema) revealed two codegen bugs:

1. Descriptions containing backslashes (e.g., "object\'s") break Python
   docstrings because \ acts as a line continuation character. Added
   sanitize_for_docstring() that escapes \ and """ in all 13 template
   render methods that emit docstrings.

2. Factory method names derived from enum variant names (e.g., "global",
   "from") can be Python keywords, producing SyntaxError. Applied
   safe_python_identifier() to all factory method name and parameter
   name generation sites.
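
Both fixes can be sketched in Python as follows. The function names come from the commit message, but the bodies are assumptions: the real helpers are Rust, the trailing-underscore convention for keywords is a guess, and this sketch escapes every double quote (a superset of escaping `"""`) so a description ending in a quote is also safe.

```python
import keyword


def sanitize_for_docstring(text: str) -> str:
    """Escape text so it can sit between triple double quotes."""
    # Backslashes first, so the escapes added next are not doubled.
    return text.replace("\\", "\\\\").replace('"', '\\"')


def safe_python_identifier(name: str) -> str:
    """Append an underscore to Python keywords (assumed convention)."""
    return f"{name}_" if keyword.iskeyword(name) else name


# A description with a backslash, an embedded triple quote and a
# trailing backslash, attached to a method named after a keyword.
description = 'object\\\'s """raw""" value, trailing \\'
source = (
    f"def {safe_python_identifier('from')}():\n"
    f'    """{sanitize_for_docstring(description)}"""\n'
)

# The generated source compiles and the docstring round-trips unchanged.
namespace = {}
exec(compile(source, "<generated>", "exec"), namespace)
```

Compiling the emitted text, as done here with `compile()`, is the same kind of check the commit reports running with py_compile on the full client.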

The generated 57K-line Python client for core-server now parses as
valid Python (verified with py_compile).

Fixes NameError when importing the generated client: model_rebuild()
was called inline (in render_struct_with_flattened_internal_enum)
before namespace alias classes were defined, so dotted type references
like `business_rules.Response` could not be resolved.

Moved all model_rebuild() calls to the global rebuild section which
runs after namespace classes are defined.
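
The ordering problem can be reproduced with the standard library alone. This is an analogue, not the generated code: `typing.get_type_hints` stands in for Pydantic's `model_rebuild()`, and the class names are hypothetical.

```python
from typing import get_type_hints


class BusinessRulesResponse:
    ok: bool


class Request:
    # Dotted forward reference, stored as a string and resolved later.
    response: "business_rules.Response"


def resolve_fields(model, namespace):
    """Resolve string annotations against `namespace` (rebuild stand-in)."""
    return get_type_hints(model, globalns=namespace)


# Resolving before the namespace class exists fails with NameError,
# which mirrors the inline model_rebuild() call described above.
try:
    resolve_fields(Request, {})
    early_error = None
except NameError as exc:
    early_error = exc


class business_rules:
    # Namespace alias class mirroring the Rust module structure.
    Response = BusinessRulesResponse


# Resolving after the namespace class is defined succeeds.
resolved = resolve_fields(Request, {"business_rules": business_rules})
```

Deferring every rebuild to one section at the end of the module guarantees that all namespace classes exist before any dotted reference is evaluated.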

Also sanitized all remaining docstring emission points (13 locations)
to escape backslashes and triple-quotes in description text.

Fixed factory method names and parameters using Python keywords
(from, global) via safe_python_identifier().

Validated against Partly's core-server (284 endpoints, 78K-line schema):
- 57K-line Python client generates as valid Python
- Imports in 0.65s
- Successfully authenticates against live API (dev13)

mdbook 0.5.x changed the preprocessor JSON protocol, breaking
mdbook-keeper compatibility. Pin both tools to compatible versions:
- mdbook ~0.4 (0.4.x series)
- mdbook-keeper ~0.5

This fixes doc builds that have been failing on main since Jan 2026.
Applied to docs.yml, docs-preview.yml workflows.

Also: add __pycache__/*.pyc to .gitignore, remove accidentally
committed pycache files.

@github-actions

github-actions bot commented Mar 28, 2026

📖 Documentation Preview: https://reflectapi-docs-preview-pr-124.partly.workers.dev

Updated automatically from commit 1dd3496

- TypeScript no longer uses askama (removed in #122), uses std::fmt::Write
- Python is no longer experimental — validated against production API
- Python section updated to document namespace classes, alias handling,
  docstring escaping, factory type annotations
- Python flatten example updated to show actual type_ alias pattern
- Limitations section references #127 for remaining DX improvements

Field descriptions:
- Schema field descriptions now emitted as Field(description="...")
  in generated Pydantic models. Descriptions appear in IDE hover,
  model_json_schema(), and help() output.
- Added sanitize_for_string_literal() to escape newlines, quotes,
  and backslashes in description strings.
- Flattened-field internal descriptions (prefixed "(flattened") are
  filtered out as they're implementation details.
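
A rough Python sketch of the description handling above. `sanitize_for_string_literal` is named in the commit, but its body here is an assumption, and `render_field` is a hypothetical helper standing in for the Rust template code.

```python
import ast


def sanitize_for_string_literal(text: str) -> str:
    """Escape text for use inside a double-quoted Python string literal."""
    return (
        text.replace("\\", "\\\\")  # backslashes first
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def render_field(name: str, type_name: str, description: str) -> str:
    """Render one model field line, dropping internal flatten notes."""
    if not description or description.startswith("(flattened"):
        return f"{name}: {type_name}"
    escaped = sanitize_for_string_literal(description)
    return f'{name}: {type_name} = Field(description="{escaped}")'


description = 'Line one\nsays "hi" with a \\ backslash'
line = render_field("note", "str", description)

# The escaped text parses back to the original description.
literal = '"' + sanitize_for_string_literal(description) + '"'
round_tripped = ast.literal_eval(literal)
```

Parsing the escaped literal back with `ast.literal_eval` is a cheap way to check that multi-line descriptions survive unchanged.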

Typed error returns:
- Client methods now return ApiResponse[OutputType, ErrorType]
  instead of ApiResponse[Any], making the error type visible in
  the signature and IDE autocompletion.
- ApiResponse runtime class updated to Generic[T, E] (backward
  compatible — ApiResponse[T] still works).
- Docstring return section shows both success and error types.

Validated against Partly's core-server (284 endpoints):
- All field descriptions preserved including multi-line ones
- All error types visible in method signatures
- Generated 57K-line client passes py_compile
@hardbyte hardbyte mentioned this pull request Mar 28, 2026
10 tasks
hardbyte added 10 commits March 29, 2026 09:24
Factory classes (371 in core-server output) consumed ~13K lines (19%
of file) and provided no value over direct type construction:

  # Before (factory):
  myapi_proto_PetsCreateErrorFactory.conflict()

  # After (direct, already works):
  myapi.proto.PetsCreateError("Conflict")

Removed:
- FactoryInfo struct and 5 factory generation functions
- generate_factory_method_params/args helpers
- render_*_without_factory naming (renamed to render_*/render_enum)
- sanitize_field_name (only used by factory code)
- HybridEnumClass and FactoryMethod template structs
- Default generate_testing changed to false

697 lines removed from python.rs (6595 -> 5898).
Core-server output: 58K lines (down from 68K, -13%).
Validated: imports, constructs types, authenticates against live API.

Runtime fixes:
- Use Pydantic TypeAdapter for all response validation, replacing
  the manual isinstance/model_validate chain. This correctly handles
  generic types like list[Model], dict[str, Model], Union types,
  and plain BaseModel subclasses.
- Use TypeAdapter.validate_json(bytes) for Pydantic's Rust-based
  fast JSON parser when raw bytes are available, falling back to
  validate_python(dict) otherwise.
- Add error_model parameter to _make_request and _handle_error_response.
  When an API returns an error, the runtime attempts to deserialize
  the error body into the typed error model. Accessible via
  ApplicationError.typed_error.
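
The typed-error fallback can be sketched without Pydantic. Everything below is a simplified stand-in: the real runtime uses Pydantic's TypeAdapter, while this sketch calls a plain constructor, and the class and function shapes are assumptions rather than the actual runtime API.

```python
import json
from typing import Any, Optional


class ApplicationError(Exception):
    """Hypothetical stand-in for the runtime's error class."""

    def __init__(self, status_code: int, body: bytes, typed_error: Any = None):
        super().__init__(f"API error {status_code}")
        self.status_code = status_code
        self.body = body
        self.typed_error = typed_error


class CustomerNotFound:
    """Hypothetical typed error model with a single field."""

    def __init__(self, customer_id: str):
        self.customer_id = customer_id


def handle_error_response(
    status_code: int, body: bytes, error_model: Optional[type] = None
) -> ApplicationError:
    """Try to build the typed error and keep the raw body either way."""
    typed = None
    if error_model is not None:
        try:
            typed = error_model(**json.loads(body))
        except (ValueError, TypeError):
            # Invalid JSON or a shape mismatch: fall back to the raw body.
            typed = None
    return ApplicationError(status_code, body, typed)


typed = handle_error_response(404, b'{"customer_id": "c1"}', CustomerNotFound)
untyped = handle_error_response(500, b"not json", CustomerNotFound)
```

Keeping the raw body on the exception means callers still get something useful when the error payload does not match the schema.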

Codegen:
- Generated _make_request calls now pass error_model= with the
  typed error type from the schema.

Validated against Partly's core-server:
- list[BillingCurrencyListItem] returns typed Pydantic models
  (was returning raw dicts)
- CustomerGetError.typed_error = CustomerNotFoundVariant(customer_id=...)
  (was raw dict string)
- 170 currencies validated via fast validate_json path

The Python codegen now uses SemanticSchema as the primary driver for
type iteration, import detection, and function ordering:

- Type iteration uses semantic.types() (deterministic BTreeMap order)
  instead of manual topological_sort_types (removed: 118 lines)
- Import detection (has_enums, has_literal, etc.) uses SemanticType
  pattern matching instead of raw schema.get_type() lookups
- Function iteration uses semantic.functions() for ordering
- Deprecation detection uses semantic function metadata

The raw Schema is kept for rendering (render functions need concrete
Struct/Enum/Field types). Lookups use original_name (pre-normalization
qualified name like "analytics::AnalyticsEventInsertData") to find
types in the consolidated raw schema.

Fixed original_names capture in Normalizer: builds short→qualified
name mapping from pre-normalization type names, keyed by the
post-normalization short name that NamingResolutionStage produces.

Validated: 220 tests pass, core-server (284 endpoints) generates
valid 47K-line Python client, live API authentication works.

The Python codegen now uses SemanticSchema as the single source of
truth for type iteration, with the raw Schema providing concrete
type data for rendering.

Architecture:
- NormalizationPipeline::for_codegen() runs only CircularDependency
  detection (no TypeConsolidation, no NamingResolution)
- schema.consolidate_types() runs first, then Normalizer builds
  SemanticSchema from the consolidated schema
- Since NamingResolution is skipped, SemanticType.name() matches
  the raw Schema's names exactly — no name-domain mismatch
- Removed all original_name bridging logic

TypeVar collision fix:
- Detects when TypeVar names (e.g., Identity) collide with class
  names and renames them with _T_ prefix (_T_Identity)
- rename_type_params_in_schema() propagates renames through all
  type parameter declarations and type references
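
A small Python sketch of the collision rename, assuming a simplified type reference shaped as a dict with `name` and `arguments`. The real `rename_type_params_in_schema()` is Rust and works on the schema types; both helper names below are hypothetical.

```python
def rename_colliding_type_params(type_params, class_names, prefix="_T_"):
    """Map each type parameter to a name that does not shadow a class."""
    return {
        param: f"{prefix}{param}" if param in class_names else param
        for param in type_params
    }


def apply_renames(type_ref, renames):
    """Rewrite a type reference tree using the rename map.

    Intended for references inside the generic type that declares the
    parameters, where the parameter shadows the class of the same name.
    """
    return {
        "name": renames.get(type_ref["name"], type_ref["name"]),
        "arguments": [
            apply_renames(arg, renames) for arg in type_ref.get("arguments", [])
        ],
    }


renames = rename_colliding_type_params(["Identity", "T"], {"Identity", "User"})
field_type = {"name": "Vec", "arguments": [{"name": "Identity", "arguments": []}]}
renamed = apply_renames(field_type, renames)
```

Only the colliding parameter is renamed; `T` is left alone because no class shares its name.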

Validated: 220 tests pass, core-server (284 endpoints, 59K lines)
generates valid Python, live API authentication works.

Replace the fixed standard()/for_codegen() pipeline variants with a
declarative PipelineBuilder that lets backends configure each stage:

  PipelineBuilder::new()
      .consolidation(Consolidation::Skip)  // or Standard (default)
      .naming(Naming::Skip)                // or Standard, or Custom(stage)
      .circular_dependency_strategy(...)   // default: Intelligent
      .add_stage(custom_stage)             // append backend-specific stages
      .build()

Three configuration dimensions:
- Consolidation: Standard (run TypeConsolidationStage) | Skip
- Naming: Standard (NamingResolution) | Skip | Custom(Box<dyn Stage>)
- ResolutionStrategy: passed to CircularDependencyResolutionStage

Convenience methods standard() and for_codegen() delegate to the
builder internally and remain as shorthand. Python codegen uses
PipelineBuilder directly with Skip/Skip.

Architecture doc updated with PipelineBuilder diagram and config docs.

- Remove stale architecture doc claims (field descriptions and error
  types are now implemented, not "remaining gaps")
- Remove dead code in render_struct: unreachable flattened-fields
  collection loop (flattened structs take the early return path)
- Always import Field — used for descriptions, aliases, discriminators
  across many contexts. Fixes NameError for schemas with aliased fields
  but no discriminated unions.
- Remove dead try/except around response_model identity check in
  runtime client (both sync and async).
@hardbyte hardbyte merged commit 514ed99 into main Mar 29, 2026
4 checks passed