Skip to content

Concurrency hardening for BfTree parallel workloads#1158

Open
JordanMaples wants to merge 3 commits into
mainfrom
jordanmaples/bftree_parallelism
Open

Concurrency hardening for BfTree parallel workloads#1158
JordanMaples wants to merge 3 commits into
mainfrom
jordanmaples/bftree_parallelism

Conversation

@JordanMaples

Copy link
Copy Markdown
Contributor

Problem

NeighborProvider::append_vector performs a read-modify-write cycle (read existing neighbors → append → write back) without synchronization. Under concurrent mutation, the last writer wins and silently drops the other's edges. Stress testing shows 11-51% edge loss under 8-thread contention on a single vertex.

Additionally, the dual-store SetElement wrote quant before full-precision, meaning if the full-precision write failed after quant succeeded, it would leave a quantized ghost with no backing data. And Delete did not clean up the quant store entry.

Solution

Striped RwLock on NeighborProvider

  • 16,384 stripes with Fibonacci multiply-shift hashing (⌊2⁶⁴/φ⌋)
  • Constant ~128 KB memory regardless of dataset size — suitable for billion-scale workloads
  • RwLock allows concurrent readers (search) while only writers serialize
  • Read lock on get_neighbors, write lock on set_neighbors/append_vector/delete_vector

Why not per-vertex locks?

BfTree's value proposition is spilling to disk for datasets exceeding RAM. At 1B vectors, per-vertex RwLock<()> would consume 8 GB of in-memory locks — defeating the purpose. Striped locks keep memory constant at the cost of rare false contention (0.006% collision rate at 16K stripes).

Dual-store write ordering

  • Full-precision is now written first in SetElement
  • Failure mode is benign: vector reachable in full-precision but not yet in quantized search

Quant store cleanup on delete

  • DeleteQuant trait with real impl for QuantVectorProvider, no-op for NoStore
  • Single generic Delete impl; neighbor topology cleanup remains in the upper DiskANNIndex layer

Testing

  • 3 new concurrent stress tests: - test_concurrent_append_no_lost_edges — 8 threads × 10 edges to same vertex
  • test_concurrent_append_independent_vertices — 8 threads to different vertices
  • test_concurrent_read_write_consistency — mixed readers + writers, validates no torn reads

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens diskann-bftree for concurrent workloads by synchronizing neighbor-list mutations (preventing lost updates), correcting dual-store (full + quant) lifecycle behavior, and adding concurrent stress tests to validate the new behavior.

Changes:

  • Add striped per-vertex RwLock synchronization to NeighborProvider (read locks for reads, write locks for mutations) and introduce concurrent stress tests.
  • Adjust hard-delete behavior to also delete quantized vectors when present via a small abstraction over the quant store.
  • Reorder dual-store writes in set_element to write full-precision first, then quantized.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
diskann-bftree/src/neighbors.rs Adds striped locking around neighbor-list read/modify/write and adds concurrent stress tests.
diskann-bftree/src/provider.rs Adds quant-store cleanup on delete via DeleteQuant and reorders full/quant write sequencing in set_element.
diskann-bftree/src/quant.rs Adds quant-store delete support used by hard-delete cleanup.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +617 to 624
// Then, write the compressed representation to the quant store.
if let Err(err) = self.quant_vectors.set_vector_sync(id.into_usize(), element) {
debug_assert!(
false,
"quant write failed after full-precision success: {err}"
);
return std::future::ready(Err(err));
}
Comment thread diskann-bftree/src/neighbors.rs Outdated
Comment on lines +721 to +725
// Each thread writes to a different set of vertices
for i in 0..appends_per_thread {
let vertex = (t * appends_per_thread + i) % num_vertices;
// Use a unique neighbor id per (thread, iteration)
let neighbor_id = t * 1000 + i;
Comment thread diskann-bftree/src/provider.rs
Comment on lines +396 to +402
/// Trait for types that may need cleanup during hard-delete.
///
/// `QuantVectorProvider` implements this to delete its quantized embedding.
/// `NoStore` implements this as a no-op.
pub(crate) trait DeleteQuant {
fn delete_vector(&self, id: usize);
}
@harsha-simhadri

Copy link
Copy Markdown
Contributor

Jordan, what is the perf impact of this change? Is ingestion speed the same as before? thanks

@JordanMaples JordanMaples force-pushed the jordanmaples/bftree_parallelism branch from aea01c1 to 04a1dd8 Compare June 15, 2026 15:34
Add striped RwLock to NeighborProvider to eliminate the TOCTOU race in
append_vector's read-modify-write cycle. Under 8-thread contention on
the same vertex, the unprotected path loses 11-51% of edges.

Striped locks (16384 stripes, Fibonacci multiply-shift hash):
- Constant ~128 KB memory regardless of dataset size (vs ~8 GB at 1B
  vectors for per-vertex locks)
- RwLock allows concurrent readers during search; only writers serialize
- Read lock on get_neighbors, write lock on set/append/delete
- Internal get_neighbors_unlocked avoids deadlock in append_vector

Dual-store write ordering (SetElement for QuantVectorProvider):
- Write full-precision (authoritative) before quant
- If quant write fails, worst case is a vector missing from quantized
  search but still reachable in full-precision (benign)

Hard-delete cleanup (Delete trait):
- Added DeleteQuant trait; QuantVectorProvider deletes its entry,
  NoStore is a no-op
- Single generic Delete impl with Q: DeleteQuant bound
- Neighbor adjacency cleanup handled by upper DiskANNIndex layer

Tests:
- 3 concurrent stress tests proving the TOCTOU fix
- Verified 10/10 failures without locks, 10/10 passes with locks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@JordanMaples JordanMaples force-pushed the jordanmaples/bftree_parallelism branch from 04a1dd8 to c94cdcc Compare June 15, 2026 15:44
@codecov-commenter

codecov-commenter commented Jun 15, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 97.00855% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.49%. Comparing base (1bdf1a1) to head (d7a8bd1).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
diskann-bftree/src/neighbors.rs 97.74% 4 Missing ⚠️
diskann-bftree/src/provider.rs 88.46% 3 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1158      +/-   ##
==========================================
+ Coverage   89.46%   89.49%   +0.02%     
==========================================
  Files         487      487              
  Lines       92102    92376     +274     
==========================================
+ Hits        82401    82668     +267     
- Misses       9701     9708       +7     
Flag Coverage Δ
miri 89.49% <97.00%> (+0.02%) ⬆️
unittests 89.14% <97.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
diskann-bftree/src/lib.rs 47.36% <ø> (ø)
diskann-bftree/src/locks.rs 100.00% <100.00%> (ø)
diskann-bftree/src/quant.rs 89.69% <100.00%> (+0.17%) ⬆️
diskann-bftree/src/provider.rs 90.85% <88.46%> (-0.12%) ⬇️
diskann-bftree/src/neighbors.rs 95.34% <97.74%> (+1.39%) ⬆️

... and 57 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JordanMaples

Copy link
Copy Markdown
Contributor Author

Jordan, what is the perf impact of this change? Is ingestion speed the same as before? thanks

@harsha-simhadri I'm seeing roughly 10% additional index build time on my arm-based laptop in WSL building Wiki100k.

Here's copilot's summary of my test results for your consideration:

--------------------------------------------------------------------------------------------------

Performance A/B results on Wikipedia-100K (768D, float32, inner product):

Static build (multi-insert, batch_size=128, 8 parallel inserters):

┌───────────────────────────────┬────────────┬───────────────┐
│                               │ With locks │ Without locks │
├───────────────────────────────┼────────────┼───────────────┤
│ Build time                    │ 57.82s     │ 52.46s        │
├───────────────────────────────┼────────────┼───────────────┤
│ Insert latency (mean)         │ 73.8ms     │ 66.9ms        │
├───────────────────────────────┼────────────┼───────────────┤
│ Recall@100 (search_l=100)     │ 0.9022     │ 0.9022        │
├───────────────────────────────┼────────────┼───────────────┤
│ Recall@100 (search_l=200)     │ 0.9536     │ 0.9536        │
├───────────────────────────────┼────────────┼───────────────┤
│ Search QPS (8 threads, l=200) │ 519        │ 484           │
└───────────────────────────────┴────────────┴───────────────┘

Streaming (insert/delete/reinsert churn via BigANN runbook, 8 threads):

┌───────────────────────────┬────────────┬───────────────┐
│                           │ With locks │ Without locks │
├───────────────────────────┼────────────┼───────────────┤
│ Insert latency (mean)     │ 3739µs     │ 3965µs        │
├───────────────────────────┼────────────┼───────────────┤
│ Recall@100 (search_l=200) │ 0.9766     │ 0.9765        │
├───────────────────────────┼────────────┼───────────────┤
│ Search QPS (8 threads)    │ 813        │ 812           │
└───────────────────────────┴────────────┴───────────────┘

Summary: 
Locks add ~10% build overhead in the contended parallel-insert path. 
Search performance and recall are unaffected. In the streaming (sequential insert) path, overhead is within noise. 
The locks provide a correctness guarantee against edge loss under high contention — newly added stress unit tests show 11–51% edge loss without them when threads hammer the same vertices concurrently.

Comment thread diskann-bftree/src/neighbors.rs Outdated
}

impl StripedLocks {
const DEFAULT_STRIPE_COUNT: usize = 16384;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was 16,384 picked and how does this interact with tests? It seems like a rather hefty allocation .

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed and switched to something based on current hardware.

// deleted vertex's edge list and patching neighbors-of-neighbors) is
// handled by `DiskANNIndex::inplace_delete` → `drop_adj_list`.
self.full_vectors.delete_vector(id as usize);
self.quant_vectors.delete_vector(id as usize);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're thinking about race conditions: this one rases with a SetElement 😉

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk this should be sorted now

Move StripedLocks from NeighborProvider to BfTreeProvider so that
delete and set_element operations are synchronized under the same
lock table as neighbor mutations. This eliminates a potential race
between concurrent delete and set_element on the same vertex ID.

- Extract StripedLocks into its own module (locks.rs)
- BfTreeProvider owns Arc<StripedLocks>, shared with NeighborAccessor
- Dynamic stripe count based on available_parallelism (4x cores, min 64)
- delete() and set_element() now acquire write locks per vertex

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@JordanMaples

Copy link
Copy Markdown
Contributor Author

@copilot give it another review.

Copilot AI commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

@copilot give it another review.

I re-reviewed and found one remaining concurrency issue: beam expansion paths were reading neighbors without taking the striped read lock. I fixed both call sites in commit 562c341 by acquiring self.provider.locks.read(n.into_usize()) before get_neighbors.

Validation run:

  • cargo test -p diskann-bftree --profile ci
  • cargo clippy -p diskann-bftree --all-targets --config 'build.rustflags=["-Dwarnings"]' -- -D warnings
  • Copilot code review ✅ (no additional comments)
  • CodeQL scan timed out in the validation tool (no rerun per tool guidance).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants