Skip to content

feat(consumer): Add commit log heartbeat for idle partitions#7819

Draft
yuvmen wants to merge 1 commit intomasterfrom
yuvalm/commit-log-heartbeat-idle-partitions
Draft

feat(consumer): Add commit log heartbeat for idle partitions#7819
yuvmen wants to merge 1 commit intomasterfrom
yuvalm/commit-log-heartbeat-idle-partitions

Conversation

@yuvmen
Copy link
Member

@yuvmen yuvmen commented Mar 16, 2026

This is a draft for possibly fixing an issue in single tenants where we scaled the post-process-forwarder consumer from 1 to 16 partitions and are now seeing increased latency.

Summary

  • When the events topic has low per-partition throughput (<1 msg/sec/partition across 16 partitions), some partitions may have no messages in a given batch window
  • Snuba only produces commit log entries for partitions with data in the current batch
  • The post-process-forwarder (SynchronizedConsumer) gates on the commit log — idle partitions get no entry, causing it to pause and wait, producing latency spikes
  • This change emits a "heartbeat" commit log entry for all assigned idle partitions on each batch flush, re-using the last known committed offset
  • Caps the maximum stall to one batch window (~500ms) instead of unbounded

Changes

  • Python: CommitLogHeartbeatState shared across batch writer instances, heartbeat logic in ProcessedMessageBatchWriter.close() and MultistorageCollector.close()
  • Rust: ProduceMessage tracks assigned partitions and last offsets, emits heartbeats for idle partitions
  • Wiring: Heartbeat state flows through ConsumerBuilderbuild_batch_writer() + KafkaConsumerStrategyFactory, updated on each rebalance via create_with_partitions()

Test plan

  • 8 new Python unit tests covering: idle partition heartbeats, no heartbeat without prior offset, rebalance cleanup, offset evolution, no-state passthrough, MultistorageCollector, build_batch_writer shared state, strategy factory partition propagation
  • 1 new Rust test: heartbeat entries produced for idle partitions with correct offsets
  • All existing tests pass
  • Deploy to staging and verify snuba-post-processor latency spikes are reduced

🤖 Generated with Claude Code

When the events topic has low per-partition throughput, some partitions
may have no messages in a given batch window. Since Snuba only produces
commit log entries for partitions with data, idle partitions get no
entry. This causes the post-process-forwarder (SynchronizedConsumer) to
pause those partitions and wait, increasing end-to-end latency with
random spikes.

On each batch flush, now produce commit log entries for ALL assigned
partitions. For idle partitions, re-emit the last known committed offset
as a heartbeat. This caps the maximum stall to one batch window (~500ms)
instead of being unbounded.

Changes:
- Python: Add CommitLogHeartbeatState shared across batch writers
- Python: ProcessedMessageBatchWriter and MultistorageCollector emit
  heartbeats for idle partitions in close()
- Python: Wire heartbeat state through strategy factory and consumer
  builder
- Rust: ProduceMessage tracks assigned partitions and last offsets,
  emits heartbeats for idle partitions
- Tests: 8 new Python tests + 1 Rust test for heartbeat behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant