Concurrency hardening for BfTree parallel workloads by JordanMaples · Pull Request #1158 · microsoft/DiskANN

JordanMaples · 2026-06-12T21:09:27Z

Problem

NeighborProvider::append_vector performs a read-modify-write cycle (read existing neighbors → append → write back) without synchronization. Under concurrent mutation, the last writer wins and silently drops the other's edges. Stress testing shows 11-51% edge loss under 8-thread contention on a single vertex.

Additionally, the dual-store SetElement wrote quant before full-precision, meaning if the full-precision write failed after quant succeeded, it would leave a quantized ghost with no backing data. And Delete did not clean up the quant store entry.

Solution

Striped RwLock on NeighborProvider

16,384 stripes with Fibonacci multiply-shift hashing (⌊2⁶⁴/φ⌋)
Constant ~128 KB memory regardless of dataset size — suitable for billion-scale workloads
RwLock allows concurrent readers (search) while only writers serialize
Read lock on get_neighbors, write lock on set_neighbors/append_vector/delete_vector

Why not per-vertex locks?

BfTree's value proposition is spilling to disk for datasets exceeding RAM. At 1B vectors, per-vertex RwLock<()> would consume 8 GB of in-memory locks — defeating the purpose. Striped locks keep memory constant at the cost of rare false contention (0.006% collision rate at 16K stripes).

Dual-store write ordering

Full-precision is now written first in SetElement
Failure mode is benign: vector reachable in full-precision but not yet in quantized search

Quant store cleanup on delete

DeleteQuant trait with real impl for QuantVectorProvider, no-op for NoStore
Single generic Delete impl; neighbor topology cleanup remains in the upper DiskANNIndex layer

Testing

3 new concurrent stress tests: - test_concurrent_append_no_lost_edges — 8 threads × 10 edges to same vertex
test_concurrent_append_independent_vertices — 8 threads to different vertices
test_concurrent_read_write_consistency — mixed readers + writers, validates no torn reads

Copilot

Pull request overview

This PR hardens diskann-bftree for concurrent workloads by synchronizing neighbor-list mutations (preventing lost updates), correcting dual-store (full + quant) lifecycle behavior, and adding concurrent stress tests to validate the new behavior.

Changes:

Add striped per-vertex RwLock synchronization to NeighborProvider (read locks for reads, write locks for mutations) and introduce concurrent stress tests.
Adjust hard-delete behavior to also delete quantized vectors when present via a small abstraction over the quant store.
Reorder dual-store writes in set_element to write full-precision first, then quantized.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
diskann-bftree/src/neighbors.rs	Adds striped locking around neighbor-list read/modify/write and adds concurrent stress tests.
diskann-bftree/src/provider.rs	Adds quant-store cleanup on delete via `DeleteQuant` and reorders full/quant write sequencing in `set_element`.
diskann-bftree/src/quant.rs	Adds quant-store delete support used by hard-delete cleanup.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+        // Then, write the compressed representation to the quant store.
+        if let Err(err) = self.quant_vectors.set_vector_sync(id.into_usize(), element) {
+            debug_assert!(
+                false,
+                "quant write failed after full-precision success: {err}"
+            );
            return std::future::ready(Err(err));
        }


+                // Each thread writes to a different set of vertices
+                for i in 0..appends_per_thread {
+                    let vertex = (t * appends_per_thread + i) % num_vertices;
+                    // Use a unique neighbor id per (thread, iteration)
+                    let neighbor_id = t * 1000 + i;


+/// Trait for types that may need cleanup during hard-delete.
+///
+/// `QuantVectorProvider` implements this to delete its quantized embedding.
+/// `NoStore` implements this as a no-op.
+pub(crate) trait DeleteQuant {
+    fn delete_vector(&self, id: usize);
+}


harsha-simhadri · 2026-06-13T01:06:49Z

Jordan, what is the perf impact of this change? Is ingestion speed the same as before? thanks

Add striped RwLock to NeighborProvider to eliminate the TOCTOU race in append_vector's read-modify-write cycle. Under 8-thread contention on the same vertex, the unprotected path loses 11-51% of edges. Striped locks (16384 stripes, Fibonacci multiply-shift hash): - Constant ~128 KB memory regardless of dataset size (vs ~8 GB at 1B vectors for per-vertex locks) - RwLock allows concurrent readers during search; only writers serialize - Read lock on get_neighbors, write lock on set/append/delete - Internal get_neighbors_unlocked avoids deadlock in append_vector Dual-store write ordering (SetElement for QuantVectorProvider): - Write full-precision (authoritative) before quant - If quant write fails, worst case is a vector missing from quantized search but still reachable in full-precision (benign) Hard-delete cleanup (Delete trait): - Added DeleteQuant trait; QuantVectorProvider deletes its entry, NoStore is a no-op - Single generic Delete impl with Q: DeleteQuant bound - Neighbor adjacency cleanup handled by upper DiskANNIndex layer Tests: - 3 concurrent stress tests proving the TOCTOU fix - Verified 10/10 failures without locks, 10/10 passes with locks Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

codecov-commenter · 2026-06-15T15:58:55Z

Codecov Report

❌ Patch coverage is 97.00855% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.49%. Comparing base (1bdf1a1) to head (d7a8bd1).
⚠️ Report is 4 commits behind head on main.

Files with missing lines	Patch %	Lines
diskann-bftree/src/neighbors.rs	97.74%	4 Missing ⚠️
diskann-bftree/src/provider.rs	88.46%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1158      +/-   ##
==========================================
+ Coverage   89.46%   89.49%   +0.02%     
==========================================
  Files         487      487              
  Lines       92102    92376     +274     
==========================================
+ Hits        82401    82668     +267     
- Misses       9701     9708       +7

Flag	Coverage Δ
miri	`89.49% <97.00%> (+0.02%)`	⬆️
unittests	`89.14% <97.00%> (+0.02%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
diskann-bftree/src/lib.rs	`47.36% <ø> (ø)`
diskann-bftree/src/locks.rs	`100.00% <100.00%> (ø)`
diskann-bftree/src/quant.rs	`89.69% <100.00%> (+0.17%)`	⬆️
diskann-bftree/src/provider.rs	`90.85% <88.46%> (-0.12%)`	⬇️
diskann-bftree/src/neighbors.rs	`95.34% <97.74%> (+1.39%)`	⬆️

... and 57 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

JordanMaples · 2026-06-15T20:06:34Z

Jordan, what is the perf impact of this change? Is ingestion speed the same as before? thanks

@harsha-simhadri I'm seeing roughly 10% additional index build time on my arm-based laptop in WSL building Wiki100k.

Here's copilot's summary of my test results for your consideration:

--------------------------------------------------------------------------------------------------

Performance A/B results on Wikipedia-100K (768D, float32, inner product):

Static build (multi-insert, batch_size=128, 8 parallel inserters):

┌───────────────────────────────┬────────────┬───────────────┐
│                               │ With locks │ Without locks │
├───────────────────────────────┼────────────┼───────────────┤
│ Build time                    │ 57.82s     │ 52.46s        │
├───────────────────────────────┼────────────┼───────────────┤
│ Insert latency (mean)         │ 73.8ms     │ 66.9ms        │
├───────────────────────────────┼────────────┼───────────────┤
│ Recall@100 (search_l=100)     │ 0.9022     │ 0.9022        │
├───────────────────────────────┼────────────┼───────────────┤
│ Recall@100 (search_l=200)     │ 0.9536     │ 0.9536        │
├───────────────────────────────┼────────────┼───────────────┤
│ Search QPS (8 threads, l=200) │ 519        │ 484           │
└───────────────────────────────┴────────────┴───────────────┘

Streaming (insert/delete/reinsert churn via BigANN runbook, 8 threads):

┌───────────────────────────┬────────────┬───────────────┐
│                           │ With locks │ Without locks │
├───────────────────────────┼────────────┼───────────────┤
│ Insert latency (mean)     │ 3739µs     │ 3965µs        │
├───────────────────────────┼────────────┼───────────────┤
│ Recall@100 (search_l=200) │ 0.9766     │ 0.9765        │
├───────────────────────────┼────────────┼───────────────┤
│ Search QPS (8 threads)    │ 813        │ 812           │
└───────────────────────────┴────────────┴───────────────┘

Summary: 
Locks add ~10% build overhead in the contended parallel-insert path. 
Search performance and recall are unaffected. In the streaming (sequential insert) path, overhead is within noise. 
The locks provide a correctness guarantee against edge loss under high contention — newly added stress unit tests show 11–51% edge loss without them when threads hammer the same vertices concurrently.

hildebrandmw · 2026-06-16T00:14:24Z

+}
+
+impl StripedLocks {
+    const DEFAULT_STRIPE_COUNT: usize = 16384;


Why was 16,384 picked and how does this interact with tests? It seems like a rather hefty allocation .

removed and switched to something based on current hardware.

hildebrandmw · 2026-06-16T00:15:31Z

+        // deleted vertex's edge list and patching neighbors-of-neighbors) is
+        // handled by `DiskANNIndex::inplace_delete` → `drop_adj_list`.
        self.full_vectors.delete_vector(id as usize);
+        self.quant_vectors.delete_vector(id as usize);


If we're thinking about race conditions: this one rases with a SetElement 😉

kk this should be sorted now

Move StripedLocks from NeighborProvider to BfTreeProvider so that delete and set_element operations are synchronized under the same lock table as neighbor mutations. This eliminates a potential race between concurrent delete and set_element on the same vertex ID. - Extract StripedLocks into its own module (locks.rs) - BfTreeProvider owns Arc<StripedLocks>, shared with NeighborAccessor - Dynamic stripe count based on available_parallelism (4x cores, min 64) - delete() and set_element() now acquire write locks per vertex Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

JordanMaples · 2026-06-16T20:24:37Z

@copilot give it another review.

Copilot · 2026-06-16T20:38:17Z

@copilot give it another review.

I re-reviewed and found one remaining concurrency issue: beam expansion paths were reading neighbors without taking the striped read lock. I fixed both call sites in commit 562c341 by acquiring self.provider.locks.read(n.into_usize()) before get_neighbors.

Validation run:

cargo test -p diskann-bftree --profile ci ✅
cargo clippy -p diskann-bftree --all-targets --config 'build.rustflags=["-Dwarnings"]' -- -D warnings ✅
Copilot code review ✅ (no additional comments)
CodeQL scan timed out in the validation tool (no rerun per tool guidance).

JordanMaples requested review from a team and Copilot June 12, 2026 21:09

Copilot started reviewing on behalf of JordanMaples June 12, 2026 21:09 View session

Copilot AI reviewed Jun 12, 2026

View reviewed changes

JordanMaples force-pushed the jordanmaples/bftree_parallelism branch from aea01c1 to 04a1dd8 Compare June 15, 2026 15:34

JordanMaples force-pushed the jordanmaples/bftree_parallelism branch from 04a1dd8 to c94cdcc Compare June 15, 2026 15:44

hildebrandmw reviewed Jun 16, 2026

View reviewed changes

Copilot started work on behalf of JordanMaples June 16, 2026 20:25 View session

fix(bftree): lock neighbor reads in beam expansion

562c341

Copilot finished work on behalf of JordanMaples June 16, 2026 20:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Concurrency hardening for BfTree parallel workloads#1158

Concurrency hardening for BfTree parallel workloads#1158
JordanMaples wants to merge 3 commits into
mainfrom
jordanmaples/bftree_parallelism

JordanMaples commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

harsha-simhadri commented Jun 13, 2026

Uh oh!

codecov-commenter commented Jun 15, 2026 •

edited

Loading

Uh oh!

JordanMaples commented Jun 15, 2026

Uh oh!

hildebrandmw Jun 16, 2026

Uh oh!

JordanMaples Jun 16, 2026

Uh oh!

hildebrandmw Jun 16, 2026

Uh oh!

JordanMaples Jun 16, 2026

Uh oh!

JordanMaples commented Jun 16, 2026

Uh oh!

Copilot AI commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

JordanMaples commented Jun 12, 2026

Problem

Solution

Striped RwLock on NeighborProvider

Why not per-vertex locks?

Dual-store write ordering

Quant store cleanup on delete

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

harsha-simhadri commented Jun 13, 2026

Uh oh!

codecov-commenter commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

JordanMaples commented Jun 15, 2026

Uh oh!

hildebrandmw Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

JordanMaples Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

hildebrandmw Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

JordanMaples Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

JordanMaples commented Jun 16, 2026

Uh oh!

Copilot AI commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

codecov-commenter commented Jun 15, 2026 •

edited

Loading