Skip to content

Added differential signer state sync#668

Draft
ihordiachenko wants to merge 15 commits intomainfrom
feature/diff-signer-state-sync
Draft

Added differential signer state sync#668
ihordiachenko wants to merge 15 commits intomainfrom
feature/diff-signer-state-sync

Conversation

@ihordiachenko
Copy link
Collaborator

Implements #745

Key design considerations

  • First heartbeat sends the full state; subsequent heartbeats send diffs only
  • Merge conflicts fall back to a full re-sync
  • Tombstone logic added to handle deletions explicitly
  • Protobuf schemas remain unchanged to preserve backward compatibility
  • Bandwidth savings will be measured in live deployment using trace logs

Copy link
Collaborator

@cdecker cdecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through the first couple of commits, with a fine-toothed comb seeing that this is your first contribution. Don't worry we won't be as pedantic going forward :-)

I'll finish the review first thing tomorrow, but thought you might need the feedback early, so here's an incomplete review :-)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should likely be fatal, continuing here can get us into states we cannot recover from.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

"Signer state {}",
serde_json::to_string(&prestate).unwrap_or_else(|_| "<failed to serialize>".to_string())
);
log::trace!("Signer state {}", prestate_log);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I notice that I am the one that introduced this non-fallible path 🤦

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which edge case are you referring to?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can ignore the TODOs from earlier commits, those were just implementation notes to myself while iterating. I’ve cleaned them up in the latest revision

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the second time we do this rollup, let's add a method to the sketch that allows us to retrieve some stats in a repeatable way.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make it debug-printable and we have something concrete to search for in the logs for verification.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in whether we should send the sketch across for verification on the counterparty's side? Could be a good idea temporarily to verify it works absolutely correct, and we can omit it later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was thinking about extending the protos to handle deletions initially, but later went with the tombstone approach

Cargo.lock Outdated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WHere does this new dependency come from?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this not 22? I might not be seeing things correctly, but I though line 668 sets the value as the version is 3 > 2 -> 22 should be set.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be precise I am assuming the following scenario: this diff instance is running on the gl-plugin, and the state contains (k2, 2, {"v": 2}), now the signer replies with the diff (k2, 3, {"v": 22}) which is set, so diffing the old state and the new state, I should be getting (k2, 3, {"v": 22}) as part of the patch that brings us from the per-state to the post-state. Or am I looking at the wrong side of the exchange. Comments as to what is intended can be useful to preempt these questions in future.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I think I saw it now, you are checking the versions, not the values. Please check both.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done + added a helper function for better readability

Cargo.lock Outdated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it gets removed again :-) A quick rebase should remove this altogether from the commit series.

Copy link
Collaborator

@cdecker cdecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice PR, I like it a lot. I think we might have had a small misunderstanding with the tombstone, in that I had in mind using the last valid version number to block any future updates on it, and sending that back and forth intrinsically making updates fail.

Also I'm not sure the conflicts can even happen, and what I think you were marking as conflict, is actually a normal thing to happen, i.e., a No-op following an old cached value being returned.

Comment on lines +90 to +93
self.put_tombstone(&node_key);
self.put_tombstone(&state_key);
self.put_tombstone(&allowlist_key);
self.put_tombstone(&tracker_key);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is effectively writing 4 new keys, not touching the old keys if I see it correctly. By tombstone I meant more like (key, UINT64_MAX, NULL) making it impossible to ever overwrite in the future, and telling the counterparty that it is ok to forget the value, no newer value will ever be published.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That'd also mean that no special handling needs to be implemented at all, obviating any tracking across unrelated keys.

Sorry for being so unprecise and making the tombstone sound like something bigger than it really is :-)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, just got carried away developing my little distributed DB lol. Simplified using the version based approach

#[derive(Debug, Default)]
pub struct SafeMergeResult {
pub changes: Vec<(String, Option<u64>, u64)>,
pub conflict_count: usize,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, you are counting overwrites with older versions as conflict, when really its not. The reason is that we may have multiple requests in flight, request A changes field A, and request B changes field B, if the signer B handles request B it'll return the A with the same version as before, which you would consider a conflict, but it is just stale, and merging will just pick the newer value. The staging takes care of ensuring that state is updated correctly, please copy those semantics where possible.

Comment on lines 190 to 198
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also just falls away by making the tombstone just a special value (that behaves exactly like any other value, with the exception it is not updateable).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

trace!("hsmd hsm_id={} request processor started", hsm_id);
let mut last_sent_sketch = StateSketch::new();
let mut last_seen_sync_epoch = state_sync_epoch.load(Ordering::SeqCst);
enum SyncMode {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need a full sync mode at all? As mentioned before the conflicts you identified are in fact no-ops, and therefore we don't end up falling back to full sync. The Diff mode will still do a full sync when the connection is first established simply because its sketch for what the signer already has is empty, so all kvv tuples are new from its PoV.

Comment on lines 25 to 28
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need if tombstones are just normal values we can just omit them when sending the values to the signer during the initial sync. We'd still be storing them, but they'd not contribute to the inter-signer trafic,

@ihordiachenko ihordiachenko changed the title Draft: Added differential signer state sync Added differential signer state sync Feb 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments