Release held LSPS2 HTLCs when abandoning JIT opens#4703
Conversation
|
👋 Thanks for assigning @tnull as a reviewer! |
|
I've reviewed the current state of the code against my prior review notes. The substantive issues were already covered in my previous review pass (the inner- I examined the full diff again for anything new — the Review SummaryNo new issues found beyond those already posted in the prior review pass. For reference, the previously posted findings still standing:
Cross-cutting (non-blocking, consistent with existing file patterns): in-memory maps are mutated before |
| // channel map, so pruning can leave the SCID pointing here after | ||
| // the channel entry was removed. Fail the HTLC instead of leaving | ||
| // it held. | ||
| let _ = self.channel_manager.get_cm().fail_intercepted_htlc(intercept_id); |
There was a problem hiding this comment.
I don't think we can do this - the LSPS2 flow might not be the only consumer of intercept SCIDs necessarily. Previously the API was even idempotent and not returning an error, so that many handlers could consume LDK's events without interfering with this flow/each other. That changed by now (for better or worse), but I still don't think we should just arbitrary fail something if we don't know we don't own it.
There was a problem hiding this comment.
Yes, Claude also mentioned something along those lines in its review above. Will make another pass on this.
There was a problem hiding this comment.
I just did the bookkeeping cleanup, no fail back anymore. Also reworked the tests to be independent. Not as much need anymore to reuse existing code.
8bf1242 to
2165cde
Compare
79a6533 to
bfae928
Compare
|
Seemed there was also a persist call missing. Added in separate commit. |
| // the channel entry was removed. | ||
| drop(peer_state); | ||
| drop(outer_state_lock); | ||
| self.peer_by_intercept_scid.write().unwrap().remove(&intercept_scid); |
There was a problem hiding this comment.
Two notes on this branch:
-
Correction to my earlier deadlock comment: I previously flagged
peer_by_intercept_scid.write()here as deadlocking against theif letread guard. That is not the case in this code — the guard is created in the separateletstatement at line 1002 (...read().unwrap().get(&intercept_scid).copied()), so it is dropped at that statement's;, well before this block runs. There is no held read guard here. Please disregard the deadlock concern. -
Genuine (minor) issue: this branch is reachable —
prune_expired_request_stateremoves entries fromoutbound_channels_by_intercept_scidbut leaves the SCID in the globalpeer_by_intercept_scid, so an HTLC can arrive for a stale SCID and land here. Unlike theErrpath above (which callsfail_intercepted_htlc), this branch returns an error without failing the held HTLC. The HTLC then stays stuck inChannelManageruntil CLTV timeout. Since LSPS2 did recognize the SCID, consider failing it back here too.
There was a problem hiding this comment.
@tnull what do you think of 2, should we fail back because it did recognize the scid?
Fail queued intercepted HTLCs before removing pending LSPS2 JIT-channel state in channel_open_abandoned. Remove the abandoned intercept SCID from the global lookup and centralize per-peer channel cleanup so the state is marked dirty. Add a real-interception test that verifies abandon releases the held HTLC from ChannelManager before removing LSPS2 state.
When an LSPS2-owned intercepted HTLC is rejected, remove its per-peer channel state and global SCID lookup entry before returning the API error. Copy the SCID owner out before taking the global write lock so cleanup does not deadlock in Rust 2021. Keep unknown SCIDs non-destructive so applications can multiplex interception events across handlers, and cover that with a real interception test.
bfae928 to
8623f32
Compare
When htlc_intercepted rejects an LSPS2-owned HTLC, it fails the intercept and prunes the per-peer and global SCID mappings before returning an API error. Persist the peer state after releasing locks so the cleanup survives restarts. If persistence also fails, include the original HTLC interception error in the returned diagnostic instead of hiding it.
8623f32 to
83cea8e
Compare
tnull
left a comment
There was a problem hiding this comment.
Hmm, here's a lot going on here, also seemingly unrelated refactorings and the scope of the PR is now different from the original issue, but a rather substantial change now (~300 LoC). Can't say I'm the biggest fan of slipping that in last-minute, but if we still intend to:
a) We should check we actually cover all the cases mentioned in #3468 (and link that issue, given we now seem to address it)
b) Would be good to keep the individual commits less invasive, especially if there's no need to rewrite the pre-existing patterns that are also present in other methods of the service (not saying it's the most beautiful code, but consistency is important).
| peer_state.remove_outbound_channel(intercept_scid); | ||
| let err = e.err; | ||
| drop(peer_state); | ||
| drop(outer_state_lock); |
There was a problem hiding this comment.
Why do we need these explicit drops?
| if let Some(jit_channel) = | ||
| peer_state.outbound_channels_by_intercept_scid.get_mut(&intercept_scid) | ||
| { | ||
| should_persist = Some(*counterparty_node_id); |
There was a problem hiding this comment.
Why do we need all these changes from reference to copy suddenly?
| .remove(&intercept_scid); | ||
| // TODO: cleanup peer_by_intercept_scid | ||
| return Err(APIError::APIMisuseError { err: e.err }); | ||
| if let Err(e) = self.persist_peer_state(counterparty_node_id).await |
There was a problem hiding this comment.
Not sure we need to eagerly persist here?
| }, | ||
| } | ||
| } else { | ||
| // `peer_by_intercept_scid` is a separate index from the per-peer |
There was a problem hiding this comment.
What is the 'per-peer channel map'? In which case would pruning leave an entry in peer_by_intercept_scid?
| OutboundJITChannelState::PendingInitialPayment { .. } | ||
| | OutboundJITChannelState::PendingChannelOpen { .. } | ||
| ); | ||
| let intercepted_htlcs = match &mut jit_channel.state { |
There was a problem hiding this comment.
Why do we need all this seemingly unrelated refactoring?
When an LSPS2 JIT open is abandoned after the initial payment has already been intercepted, the service now unwinds both sides of the pending flow: it fails held HTLCs back through LDK and removes the stale intercept SCID lookup.
The tests exercise this through a real intercepted-payment path, so the abandon case now verifies the HTLC release, stale-SCID cleanup, and a fresh follow-up JIT attempt.