Concurrency hardening for BfTree parallel workloads#1158
Conversation
There was a problem hiding this comment.
Pull request overview
This PR hardens diskann-bftree for concurrent workloads by synchronizing neighbor-list mutations (preventing lost updates), correcting dual-store (full + quant) lifecycle behavior, and adding concurrent stress tests to validate the new behavior.
Changes:
- Add striped per-vertex
RwLocksynchronization toNeighborProvider(read locks for reads, write locks for mutations) and introduce concurrent stress tests. - Adjust hard-delete behavior to also delete quantized vectors when present via a small abstraction over the quant store.
- Reorder dual-store writes in
set_elementto write full-precision first, then quantized.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| diskann-bftree/src/neighbors.rs | Adds striped locking around neighbor-list read/modify/write and adds concurrent stress tests. |
| diskann-bftree/src/provider.rs | Adds quant-store cleanup on delete via DeleteQuant and reorders full/quant write sequencing in set_element. |
| diskann-bftree/src/quant.rs | Adds quant-store delete support used by hard-delete cleanup. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Then, write the compressed representation to the quant store. | ||
| if let Err(err) = self.quant_vectors.set_vector_sync(id.into_usize(), element) { | ||
| debug_assert!( | ||
| false, | ||
| "quant write failed after full-precision success: {err}" | ||
| ); | ||
| return std::future::ready(Err(err)); | ||
| } |
| // Each thread writes to a different set of vertices | ||
| for i in 0..appends_per_thread { | ||
| let vertex = (t * appends_per_thread + i) % num_vertices; | ||
| // Use a unique neighbor id per (thread, iteration) | ||
| let neighbor_id = t * 1000 + i; |
| /// Trait for types that may need cleanup during hard-delete. | ||
| /// | ||
| /// `QuantVectorProvider` implements this to delete its quantized embedding. | ||
| /// `NoStore` implements this as a no-op. | ||
| pub(crate) trait DeleteQuant { | ||
| fn delete_vector(&self, id: usize); | ||
| } |
|
Jordan, what is the perf impact of this change? Is ingestion speed the same as before? thanks |
aea01c1 to
04a1dd8
Compare
Add striped RwLock to NeighborProvider to eliminate the TOCTOU race in append_vector's read-modify-write cycle. Under 8-thread contention on the same vertex, the unprotected path loses 11-51% of edges. Striped locks (16384 stripes, Fibonacci multiply-shift hash): - Constant ~128 KB memory regardless of dataset size (vs ~8 GB at 1B vectors for per-vertex locks) - RwLock allows concurrent readers during search; only writers serialize - Read lock on get_neighbors, write lock on set/append/delete - Internal get_neighbors_unlocked avoids deadlock in append_vector Dual-store write ordering (SetElement for QuantVectorProvider): - Write full-precision (authoritative) before quant - If quant write fails, worst case is a vector missing from quantized search but still reachable in full-precision (benign) Hard-delete cleanup (Delete trait): - Added DeleteQuant trait; QuantVectorProvider deletes its entry, NoStore is a no-op - Single generic Delete impl with Q: DeleteQuant bound - Neighbor adjacency cleanup handled by upper DiskANNIndex layer Tests: - 3 concurrent stress tests proving the TOCTOU fix - Verified 10/10 failures without locks, 10/10 passes with locks Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
04a1dd8 to
c94cdcc
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1158 +/- ##
==========================================
+ Coverage 89.46% 89.49% +0.02%
==========================================
Files 487 487
Lines 92102 92376 +274
==========================================
+ Hits 82401 82668 +267
- Misses 9701 9708 +7
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
@harsha-simhadri I'm seeing roughly 10% additional index build time on my arm-based laptop in WSL building Wiki100k. Here's copilot's summary of my test results for your consideration: |
| } | ||
|
|
||
| impl StripedLocks { | ||
| const DEFAULT_STRIPE_COUNT: usize = 16384; |
There was a problem hiding this comment.
Why was 16,384 picked and how does this interact with tests? It seems like a rather hefty allocation .
There was a problem hiding this comment.
removed and switched to something based on current hardware.
| // deleted vertex's edge list and patching neighbors-of-neighbors) is | ||
| // handled by `DiskANNIndex::inplace_delete` → `drop_adj_list`. | ||
| self.full_vectors.delete_vector(id as usize); | ||
| self.quant_vectors.delete_vector(id as usize); |
There was a problem hiding this comment.
If we're thinking about race conditions: this one rases with a SetElement 😉
There was a problem hiding this comment.
kk this should be sorted now
Move StripedLocks from NeighborProvider to BfTreeProvider so that delete and set_element operations are synchronized under the same lock table as neighbor mutations. This eliminates a potential race between concurrent delete and set_element on the same vertex ID. - Extract StripedLocks into its own module (locks.rs) - BfTreeProvider owns Arc<StripedLocks>, shared with NeighborAccessor - Dynamic stripe count based on available_parallelism (4x cores, min 64) - delete() and set_element() now acquire write locks per vertex Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@copilot give it another review. |
I re-reviewed and found one remaining concurrency issue: beam expansion paths were reading neighbors without taking the striped read lock. I fixed both call sites in commit Validation run:
|
Problem
NeighborProvider::append_vector performs a read-modify-write cycle (read existing neighbors → append → write back) without synchronization. Under concurrent mutation, the last writer wins and silently drops the other's edges. Stress testing shows 11-51% edge loss under 8-thread contention on a single vertex.
Additionally, the dual-store SetElement wrote quant before full-precision, meaning if the full-precision write failed after quant succeeded, it would leave a quantized ghost with no backing data. And Delete did not clean up the quant store entry.
Solution
Striped RwLock on NeighborProvider
Why not per-vertex locks?
BfTree's value proposition is spilling to disk for datasets exceeding RAM. At 1B vectors, per-vertex RwLock<()> would consume 8 GB of in-memory locks — defeating the purpose. Striped locks keep memory constant at the cost of rare false contention (0.006% collision rate at 16K stripes).
Dual-store write ordering
Quant store cleanup on delete
Testing