feat(semantic): [alpha build] provider-aware typed embeddings, reranking, diagnostics, and eval harness #87
Add scripts, docs, Dockerfile, and package.json scripts for Docker-based Rust validation (fmt/check/clippy/test) so Windows users without MSVC Build Tools can still validate Rust code.
- scripts/docker-rust.ps1: PowerShell script supporting fmt/check/clippy/test/validate/shell tasks with persistent Docker volumes
- Dockerfile.rust: minimal Rust image with rustfmt + clippy pre-installed
- docs/docker-rust-validation.md: full usage and design documentation
- package.json: 6 new docker:rust:* convenience scripts
Design: Linux-target validation via rust:1-bookworm, persistent cargo volumes for caching, fail-fast sequential validation.
…rough, fingerprint upgrade
…or pruning, write-lock sync
…pgrade, invalidation tests
- SemanticFilePolicy config struct with include_code/include_docs/include_configs/binary_detection/generated_file_detection/globs
- parse_semantic_files_config handler in configure.rs
- File policy evaluation: should_index_file(), is_generated_file(), is_config_file(), is_docs_file()
- Docs chunker: collect_docs_chunks() with heading-based splitting for markdown, splitting by file for other doc types
- collect_chunks routes doc files through docs chunker, skips binary/generated/config files per policy
- SemanticIndexFingerprint extended with file_policy_hash and docs_chunker_version; diff() triggers rebuild on policy change
- build_with_progress/refresh_stale_files accept &SemanticFilePolicy
- compute_file_policy_hash() deterministic hash of policy fields
- Re-export SemanticFilePolicy from semantic_index module
- All test callers updated with &SemanticFilePolicy::default()
…iority ordering, backoff
- CancellationToken (Arc<AtomicU64> generation counter) for cooperative build cancellation on reconfigure
- Cancel old semantic index builds instead of detaching when config changes
- Priority file ordering: README/docs first, then core source, then tests, then rest
- Embedding backoff: exponential retry with jitter for remote provider rate limits
- SemanticIndexStatus::Partial variant with completeness percentage for partial builds
- Search reports partial index state during cold start
- Phase-boundary cancellation checks between model init, disk read, incremental refresh, and full rebuild
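A minimal sketch of the generation-counter idea named in this commit, assuming a shared `Arc<AtomicU64>`. The method names (`next_generation`, `is_cancelled`) and the `embed_batches` loop are hypothetical illustrations, not the crate's actual API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical token: a build is cancelled once the shared generation
/// has moved past the generation the token was issued for.
#[derive(Clone)]
pub struct CancellationToken {
    current: Arc<AtomicU64>, // shared, latest generation
    mine: u64,               // generation this token belongs to
}

impl CancellationToken {
    pub fn new() -> Self {
        Self { current: Arc::new(AtomicU64::new(0)), mine: 0 }
    }

    /// Bump the shared generation (cancelling every older token) and
    /// return a token for the new build.
    pub fn next_generation(&self) -> Self {
        let mine = self.current.fetch_add(1, Ordering::SeqCst) + 1;
        Self { current: Arc::clone(&self.current), mine }
    }

    pub fn is_cancelled(&self) -> bool {
        self.current.load(Ordering::SeqCst) != self.mine
    }
}

/// Cooperative loop: check the token before each batch and stop early.
/// Returns how many batches were processed.
pub fn embed_batches(token: &CancellationToken, batches: &[Vec<String>]) -> usize {
    let mut done = 0;
    for _batch in batches {
        if token.is_cancelled() {
            break;
        }
        // The embedding call for `_batch` would go here.
        done += 1;
    }
    done
}
```

Under this scheme a reconfigure only has to call `next_generation()`; the old build notices at its next check and exits without any explicit signalling.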
Add Perplexity backend with InputMode::DocumentChunks support for contextualized embedding where chunks carry document-level context.
- SemanticBackend::Perplexity variant with config, profile, engine
- DocumentChunks/PerDocumentChunks/DocumentEmbeddings structs
- embed_document_chunks() routes Perplexity to grouped embedding API
- build_with_progress_contextualized() groups chunks by document
- Wire configure.rs to branch on input_mode: DocumentChunks
- SemanticEmbeddingModel::input_mode() public accessor
- EmbeddingModelProfile with contextualized_supported guard
- Response validation: index continuity, missing documents, dimension
…to trait-backed module
Bead: aft-t6p.12
Extracts Vec<EmbeddingEntry> storage and search from SemanticIndexSnapshot into a VectorStore trait with FlatF32VectorStore implementation. This decouples the storage layer from the lifecycle logic and prepares for alternative backends (binary Hamming, approximate ANN).
Key changes:
- vector_store.rs: VectorStore trait + ScoredChunk/PruneStats types
- FlatF32VectorStore: flat scan with cosine similarity (preserves existing behaviour exactly)
- FlatBinaryHammingVectorStore: forward-looking Hamming-search impl
- SemanticIndexSnapshot delegates search/len/prune/entries to store
- Fixed dimension-sync bug where set_dimension updated the snapshot dimension but not the store dimension, causing search to return 0
- EmbeddingEntry and IndexedFileMetadata made pub for trait compatibility
On Windows, use copyFileSync for the binary replacement (which overwrites the target — renameSync fails with EEXIST). If it fails, the original binary at binaryPath is preserved. The temp file cleanup is now wrapped in its own try/catch so a cleanup failure does NOT propagate as a download failure — the binary was already successfully placed at binaryPath. Addresses PR cortexkit#69 cubic review finding P2.
Implement bead aft-t6p.24: file identity manifest + vector ownership records.
Changes:
- **FileRecord struct**: identity record with content_hash, size_bytes, mtime, language, document_kind, inclusion_policy_hash, indexed_at
- **file_manifest on SemanticIndexSnapshot**: HashMap<PathBuf, FileRecord> tracking which files produced which vectors, enabling precise stale-vector pruning when files are edited, deleted, or excluded
- **V8 serialization format**: extends V7 with per-entry chunk_hash (after each vector) and file manifest block (after all entry vectors). Full backward compatibility with V1-V7 reads.
- **chunk_hash on EmbeddingEntry**: deterministic hash of chunk content fields for tracing which version of a chunk produced a stored vector
- **compute_chunk_hash**: blake3-based deterministic hash
- **build_manifest_from_store helper**: populates file_manifest from store's file_metadata, called in all builder functions (build_from_chunks, build_with_progress_contextualized, refresh_stale_files) and from_bytes for V1-V7 cache migration
- **next_chunk_id, fingerprint_string**: forward-looking fields on snapshot for future unique ID assignment and fingerprint tracking
…rmalization, and model profiles
Adds aft-t6p.20 (Typed embedding vector representation + storage-strategy resolution):
- TypedVector (source-side) and StoredVector (persisted) enums with DenseF32, DenseInt8, BinaryPacked, and Quantized variants
- StorageStrategy (NativeF32, DecodeNormalizeF32, BinaryPacked)
- VectorKind enum for runtime type tagging
- DistanceMetric (Cosine, DotProduct, Euclidean, Hamming)
- NormalizationPolicy (AlreadyNormalized, NormalizeOnInsertQuery, NotApplicable)
- EmbeddingModelProfile fields: source_vector_kind, stored_vector_kind, metric, normalization
- convert_vector() / validate_compatible() on EmbeddingModelProfile
- blake3 dependency for chunk hashing
… + dummy base_url for Perplexity profile test
Two fixes for `fingerprint_invalidation_tests`:
- Mock HTTP server now lowercases header names before matching Content-Length (reqwest/hyper sends lowercase `content-length:`).
- `base64_int8_profile_from_config_selects_correctly` test provides a dummy `base_url` for the Perplexity backend (required by `from_config`).
Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
- Add StorageStrategy::BinaryPacked variant for packed-bit vector storage
- Add EmbeddingModelProfile::perplexity_binary() with BinaryPacked → Hamming path
- Wire from_config to select perplexity_binary profile when Base64Binary encoding
- Implement parse_embedding_value for Base64Binary (decode → 0.0/1.0 f32 vec)
- Implement into_stored for TypedVector::BinaryPacked (requires BinaryPacked strategy)
- Update validate_config and validate_compatible to accept Base64Binary+BinaryPacked
- Replace old "not yet supported" test with parse_embedding_value_base64_binary_succeeds
- 886/893 tests pass (7 pre-existing Docker failures)
Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
Add semantic_diagnostics module with SearchDiagnostics, SearchPipelineType, SearchWarning, SearchMetricsCollector, PhaseTimer, score_statistics, top1_margin. Instrument handle_semantic_search with per-phase timing and warning collection. Wire SearchMetricsCollector into AppContext. 17 new tests, 902/910 lib tests pass (8 pre-existing Docker failures). Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
- Add SemanticDiagnosticsLogger with file append, rotation (50 MB), and retention cleanup (file-deletion based on mtime)
- Add SearchDiagnosticsEvent struct for JSONL serialization with raw_query redaction (opt-in via include_raw_queries) and snippet placeholder (include_snippets)
- Add config fields: jsonl_logging, jsonl_path, include_raw_queries, include_snippets, retention_days to SemanticBackendConfig
- Add lazy-init diagnostics_logger on AppContext with resolve_diagnostics_log_path helper (env var → project root → ~/.cache)
- Wire JSONL record into handle_semantic_search diagnostics block
- 4 new tests: raw query redaction, raw query inclusion, disk write verification, missing-file recovery
- 907/914 lib tests pass (7 pre-existing Docker failures)
Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
…rch output Add DiagnosticsOutputMode enum (Off/Minimal/Verbose) and output_mode field to SemanticBackendConfig. Implement format_diagnostics_prefix() for Minimal (warnings only) and Verbose (scores + latency + warnings) output modes. Wire into handle_semantic_search response text. 4 new tests, 25 diagnostics tests total. 910/918 lib tests pass (8 pre-existing Docker failures). Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
Add optional reranking via OpenAI-compatible chat endpoint. When enabled, aft_search overfetches candidates, sends them to a reranker model, and re-sorts by relevance. Falls back gracefully on any error.
- Add RerankConfig fields to SemanticBackendConfig (rerank_enabled, rerank_model, rerank_base_url, rerank_api_key_env, rerank_timeout_ms, rerank_max_candidates)
- Create semantic_rerank.rs with RerankerClient, RerankOutcome enum, and rerank_candidates function
- Add RerankerFailure warning variant to SearchWarning
- Wire reranking into handle_semantic_search (overfetch → rerank → re-sort)
- Add rerank_latency_ms to SearchDiagnostics and SearchDiagnosticsEvent
- Include rerank latency in verbose diagnostics output
- 6 unit tests for reranker parsing, skip conditions, and failure handling
All 25 diagnostics + 6 reranker tests pass. 917/924 total tests pass (7 pre-existing Docker infrastructure failures).
Add 40+ unit tests to fingerprint_invalidation_tests covering:
- SemanticBackendConfig deserialization (minimal, all-fields, defaults)
- EmbeddingModelProfile validation for all encoding types
- TypedVector conversion and StoredVector roundtrip
- convert_vector and validate_compatible rejection paths
- Distance metric auto-resolution for f32/int8/binary
- base64_int8 signed int8 decode correctness
- Template hashing, enum roundtrips, resolve helpers
Minor: add #[derive(Debug)] to StoredVector for test ergonomics.
Closes aft-t6p.6.1
Add 6 new tests to fingerprint_invalidation_tests covering:
- file_policy_hash mismatch triggers rebuild
- docs_chunker_version mismatch triggers rebuild
- multi-field changes still trigger rebuild
- rebuild+query_prompt: rebuild wins
- only query_prompt change: ClearQueryCache
- non-fingerprint field changes: NoChange
Total: 22 fingerprint tests.
Closes aft-t6p.6.2
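The precedence these tests pin down (a rebuild outranks a query-cache clear, which outranks no change) could be sketched roughly as follows. The `Fingerprint` struct here is a hypothetical four-field reduction; the real `SemanticIndexFingerprint` tracks many more fields, and its `diff()` signature is not shown in this PR page.

```rust
/// Hypothetical, reduced fingerprint used only for illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Fingerprint {
    pub model: String,
    pub dimension: usize,
    pub file_policy_hash: u64,
    pub query_prompt_hash: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum FingerprintDiff {
    Rebuild,
    ClearQueryCache,
    NoChange,
}

pub fn diff(old: &Fingerprint, new: &Fingerprint) -> FingerprintDiff {
    // Structural fields are checked first: any mismatch means the stored
    // vectors no longer match the configuration, so rebuild wins.
    if old.model != new.model
        || old.dimension != new.dimension
        || old.file_policy_hash != new.file_policy_hash
    {
        return FingerprintDiff::Rebuild;
    }
    // A query-prompt change leaves stored vectors valid and only
    // invalidates cached query embeddings.
    if old.query_prompt_hash != new.query_prompt_hash {
        return FingerprintDiff::ClearQueryCache;
    }
    FingerprintDiff::NoChange
}
```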
Add 29 tests covering:
- is_generated_file: protobuf, minified, dist, build, generated, dart
- is_doc_extension and is_config_extension validation
- classify_semantic_file for code/doc/config
- collect_docs_chunks markdown heading splitting
- SemanticFilePolicy defaults and builtin globs
- FileRecord field population
- build_manifest_from_store construction and cleanup
Closes aft-t6p.6.3
… tests
Add 23 tests covering:
- FlatF32VectorStore: search, empty, dimension mismatch, CRUD, prune, stats
- FlatBinaryHammingVectorStore: search, ranking, prune, delete, stats
- hamming_distance and popcount64 correctness
- Binary decode: byte-aligned, non-byte-aligned, padding, error
Closes aft-t6p.6.4
Add 8 tests covering:
- SemanticIndexLifecycle: cold start, set/get, failed+error, all variants
- SemanticIndexSnapshot: search ranking, immutability after clone
- VectorStore: prune_stale_vectors, prune_orphans
Closes aft-t6p.6.5
Add 10 tests covering:
- HybridRerank pipeline type display
- Metrics collector: window size 1, cache hit rate, zero result rate, low confidence rate, latency percentiles
- Diagnostics output mode defaults
- Warning formatting: minimal (all variants, verifies suppressed), verbose (all 9 variants)
- SearchWarning serde roundtrip for all 8 variants
Closes aft-t6p.6.6
Add 4 tests covering:
- Concurrent snapshot clones produce independent results
- Concurrent read threads see identical data via Arc
- Mutex contention across 10 threads does not deadlock
- Arc strong_count tracks clone/drop correctly
Closes aft-t6p.6.7
Add 6 tests covering:
- Trust file atomic write (no tmp files left behind)
- Multiple projects trusted independently
- Untrust is idempotent
- Trust state survives reload (serde roundtrip)
- Nonexistent project path is untrusted (fail-closed)
Closes aft-t6p.6.8
The validate_compatible_rejects_binary_stored_with_cosine_metric test was missing source_vector_kind: BinaryPacked, causing the first match block to fail with 'unsupported source→stored vector conversion' instead of reaching the metric compatibility check.
Adds 2 tests for aft-t6p.2.2:
1. rerank_max_candidate_chars_default_is_2500: verifies default value
2. rerank_max_candidate_chars_custom_value_is_accepted: verifies custom value is accepted through struct construction
Config flow verified:
- Rust config.rs: field exists with default 2500
- TypeScript config.ts: field in schema (z.number().int().positive().optional())
- Trust boundary: field is NOT stripped (not in getStrippedSemanticKeys)
- semantic_rerank.rs: uses config.rerank_max_candidate_chars at line 59
…eck script
semantic_rerank.rs:
- Add MAX_RERANK_BODY_BYTES (2 MiB) constant
- Add read_response_body_bounded() helper that rejects oversized responses via Content-Length fast path and byte-count verification
- Replace response.text() with bounded reader; oversized responses return RerankOutcome::Failed with safe fallback to original order
- Add tests: body size limit constant, unreachable endpoint failure
scripts/zir-aft-check.sh:
- Add --verbose flag (default off) to show full output
- Filter passing test lines (PASS, N passed) from nextest and other test runners by default, keeping failures and summaries visible
- Add _filter_passing() helper for output deduplication
Closes aft-t6p.25
semantic_index.rs:
- prompt_template_hash() normalizes empty/whitespace-only templates to
match None, so fingerprint hashes are stable and avoid unnecessary
index rebuilds
- Added 10 edge-case tests: empty templates, wrong placeholders, literal
{query} in query text, hash equality for None vs empty vs whitespace,
different templates produce different hashes
configure.rs:
- Empty/whitespace-only query/document prompt templates are normalized
to None before storage (not stored as Some())
- Warn via slog_warn when template doesn't contain expected placeholder
({query} for query template, {text} for document template)
Closes aft-t6p.29
…Diagnostics
semantic_index.rs:
- embed_query_cached() now returns (Vec<f32>, bool) where the bool indicates cache hit (true) or miss (false)
semantic_search.rs:
- embed_query() returns (Vec<f32>, bool) propagated from model
- handle_semantic_search captures cache hit status and passes it to SearchDiagnostics instead of hardcoding false
- query_cache_hit now accurately reflects per-query cache behavior
Closes aft-t6p.30
semantic_eval.rs (command handler):
- Replaced stub that returned empty results for every case with real semantic search calls via handle_semantic_search
- Each eval case's query is executed through the search pipeline
- Results are parsed from response JSON into EvalCaseRetrievedHit format
- Search failures (disabled, not ready) gracefully return empty results
semantic_eval.rs (scoring):
- Removed no-op truncation (retrieved[..len.min(len())] was always len())
- Hits beyond k still count toward hit_anywhere and expectations_matched per existing doc contract; only hit_in_top_k and first_hit_rank use k
Closes aft-t6p.31
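To make the scoring contract in this commit concrete, here is a small sketch of per-case scoring in which only `hit_in_top_k` and `first_hit_rank` are bounded by k. The struct layout, the function names, and the choice to compute MRR from the top-k rank are assumptions for illustration; `expectations_matched` is omitted.

```rust
/// Hypothetical per-case score; field names follow the commit message.
#[derive(Debug, PartialEq)]
pub struct CaseScore {
    pub hit_in_top_k: bool,
    pub first_hit_rank: Option<usize>, // 1-based, searched only within the top k
    pub hit_anywhere: bool,            // considers every retrieved result
}

pub fn score_case(retrieved: &[&str], expected: &[&str], k: usize) -> CaseScore {
    let is_hit = |path: &&str| expected.contains(path);
    // Rank of the first expected path among the first k results.
    let first_hit_rank = retrieved.iter().take(k).position(is_hit).map(|i| i + 1);
    CaseScore {
        hit_in_top_k: first_hit_rank.is_some(),
        first_hit_rank,
        // No truncation here: hits beyond k still count.
        hit_anywhere: retrieved.iter().any(|p| expected.contains(p)),
    }
}

/// Mean reciprocal rank over cases; a case without a top-k hit contributes 0.
pub fn mean_reciprocal_rank(scores: &[CaseScore]) -> f64 {
    if scores.is_empty() {
        return 0.0;
    }
    let sum: f64 = scores
        .iter()
        .map(|s| s.first_hit_rank.map_or(0.0, |r| 1.0 / r as f64))
        .sum();
    sum / scores.len() as f64
}
```

For example, with k = 2 an expected file that appears at rank 3 yields `hit_anywhere = true` but `hit_in_top_k = false`, which is the distinction the commit preserves.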
…ostic
- Add max_results_per_file to SemanticBackendConfig (default: 2)
- Wire configurable cap into fuse_hybrid_results and sort_cap_and_truncate
- Add DistanceMetricChanged warning for score staleness detection
- Update TypeScript config schema with max_results_per_file field
- Add tests for custom cap values (1, 3) and multi-file scenarios
- All checks pass (fmt, check, clippy)
- Add WarningDedup struct with 60s time window to semantic_diagnostics.rs
- First occurrence of each warning visible; repeated occurrences suppressed
- Full warnings preserved in diagnostics recording (SearchDiagnostics + JSONL)
- Wired into AppContext and handle_semantic_search
- Added 8 tests covering dedup behavior and edge cases
- All checks pass (fmt, check, clippy)
- Add config deserialization tests for rerank/diagnostics/max_results_per_file fields
- Add warning dedup key stability test across all 12 warning kinds
- Add sort_cap_and_truncate max_results_per_file enforcement test
- Fix stale eval stub comment (now wired to real search pipeline)
- Make warning_dedup_key pub(crate) for cross-module test access
- All checks pass (fmt, check, clippy)
8 issues found across 14 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/aft/src/lsp/child_registry.rs">
<violation number="1" location="crates/aft/src/lsp/child_registry.rs:223">
P2: Test child process performs non-fork-safe Rust operations after raw `fork()`, risking deadlocks and flaky CI failures under multithreaded test execution.</violation>
</file>
<file name="packages/opencode-plugin/src/config.ts">
<violation number="1" location="packages/opencode-plugin/src/config.ts:40">
P2: Semantic enum literals were renamed without backward-compatibility aliases or migration, breaking existing configs that use old values.</violation>
</file>
<file name="packages/aft-bridge/src/paths.ts">
<violation number="1" location="packages/aft-bridge/src/paths.ts:13">
P1: Failed root→harness migration silently drops existing version state and suppresses upgrade announcements</violation>
</file>
<file name="crates/aft/tests/integration/helpers.rs">
<violation number="1" location="crates/aft/tests/integration/helpers.rs:13">
P2: `panic!` will fail the test under nextest/Rust test harness semantics, so this helper does not actually skip root-only tests.</violation>
</file>
<file name="crates/aft/src/commands/semantic_search.rs">
<violation number="1" location="crates/aft/src/commands/semantic_search.rs:458">
P1: Lexical-only results can lose their original relevance ordering because sorting uses clamped display scores and falls back to alphabetical file/name tie-breaking instead of preserving the incoming lexical rank.</violation>
</file>
<file name="scripts/zir-aft-check.sh">
<violation number="1" location="scripts/zir-aft-check.sh:680">
P1: Checks now depend on host `git` and silently mutate repo-local git config (`core.autocrlf`), which can abort runs or cause unexpected persistent side effects</violation>
</file>
- Fix /proc/pid/stat parsing: skip leading space before state byte (cubic P1: grandchild state check always passed incorrectly)
- Remove diagnostics_enabled gate on reranker failure warnings (greptile: silent fallback even in verbose mode)
- Quote $CARGO_FEATURES in shell interpolation to prevent word splitting (cubic P1: --features injection without escaping)
- Use streaming reads with size limit in reranker body check (cubic P1: response.bytes() buffers entire body before size check)
- Truncate prompt templates in warning logs to 80 chars (cubic P2: full template exposure in log files)
- Use per-case top_k in eval harness search requests (cubic P2: per-case top_k override was ignored)
- Remove dead branch in WarningDedup filter_for_output (cubic P2: count==0 branch unreachable)
- Extract test counts in --verbose mode (cubic P2: verbose bypassed test-count extraction)
- Add trap for temp-file cleanup in run_step (cubic P2: temp files leaked on interruption)
- Handle final line without trailing newline in _filter_passing (cubic P2: last line dropped without trailing newline)
    }
    (false, false) => {
        warnings.push(SearchWarning::EmptyResults);
        SearchPipelineType::Semantic
    }
};

let max_results_per_file = ctx.config().semantic.max_results_per_file;
let fusion_timer = PhaseTimer::start();
let results = fuse_hybrid_results(
    semantic_results,
    lexical_files,
    &shape,
    params.top_k.min(MAX_TOP_K),
Out-of-bounds reranker indices silently dropped when `diagnostics_enabled: false`
The guard if oob_count > 0 && diagnostics_enabled means that when the reranker returns indices beyond the result window (e.g. [0, 99, 1] against 5 candidates), the OOB indices are silently dropped and the affected results are appended in their original order — with no visible indication to the agent. Since diagnostics_enabled defaults to false, this is the common case. The hard-failure path immediately below carries an explicit comment "Always surface reranker failures — regardless of diagnostics_enabled," making the inconsistency clear: the OOB case should follow the same unconditional pattern.
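One way to keep the out-of-bounds case visible is to have the reordering step return the OOB count, so the caller can push a warning unconditionally. The function below is a hypothetical sketch under that assumption; `apply_rerank_order` and its signature are not taken from `semantic_rerank.rs`.

```rust
/// Reorder `candidates` by reranker-supplied indices.
/// Out-of-bounds and duplicate indices are skipped; candidates the reranker
/// did not mention are appended in their original order. The second return
/// value is the number of out-of-bounds indices that were encountered.
pub fn apply_rerank_order<T: Clone>(candidates: &[T], order: &[usize]) -> (Vec<T>, usize) {
    let mut used = vec![false; candidates.len()];
    let mut out = Vec::with_capacity(candidates.len());
    let mut oob_count = 0;
    for &i in order {
        match candidates.get(i) {
            Some(c) if !used[i] => {
                used[i] = true;
                out.push(c.clone());
            }
            Some(_) => {} // duplicate index: ignore
            None => oob_count += 1,
        }
    }
    // Append anything the reranker left out, preserving original order.
    for (i, c) in candidates.iter().enumerate() {
        if !used[i] {
            out.push(c.clone());
        }
    }
    (out, oob_count)
}
```

A caller could then emit the reranker warning whenever `oob_count > 0`, without consulting `diagnostics_enabled`, matching the unconditional hard-failure path.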
Add local adaptation of MinishLab/semble benchmarks for evaluating AFT semantic search quality. Covers 50 queries across 5 repos (axum, express, pydantic, serde, gin) in 4 languages.
New artifacts:
- schema.json: JSON Schema for benchmark fixtures
- repos-pilot.json: 5-repo pilot with pinned commits
- import.ts: Semble annotation importer with filter options
- corpus.ts: Repo clone/cache tooling (sync, check, status, clean)
- baseline-rg.ts: Ripgrep lexical baseline (recall@k, MRR, NDCG)
- speed.ts: Cold-start index + query latency measurement
- pilot.ts: Multi-mode pilot runner
- token-efficiency.ts: Recall@token_budget curves
- ablation.ts: Mode comparison (lexical/semantic/hybrid)
- ci.ts: CI regression detection with configurable threshold
- README.md: Full documentation with reproducibility instructions
Baseline results: recall@5=12.0%, MRR=0.068, NDCG=0.301
Symbol queries: 33.3% recall (exact text match)
Architecture/semantic: 0% (expected for pure lexical)
…ADME
- Add reranking configuration and behavior
- Add diagnostics and JSONL logging
- Add semantic doctor and eval commands
- Add contextualized embeddings limitations
- Add prompt templates
- Add troubleshooting section
feat(aft-t6p.38.3): add full 63-repo Semble lockfile
- Fetch upstream Semble benchmarks/repos.json
- 63 repositories across 19 languages
- Pinned to exact revisions as specified by Semble maintainers
- Ready for local/manual benchmark runs
…omparison
aft-t6p.bench.full-ci.1: Full corpus CI workflow spike (388-line report)
- Estimates: 2-15GB disk, 1-2GB network, 10-40min indexing, 6-17min queries
- Recommendation: manual workflow_dispatch, NOT required PR CI
- Critical finding: only 5/63 repos have human annotations
- Feasible with shallow clones on GitHub Actions 14GB runner
aft-t6p.lex.fts5.1: FTS5 lexical backend spike
- FTS5 is already compiled via rusqlite bundled, so there is zero dependency cost
- Created experimental module behind semantic-fts5 feature gate
- 14 unit tests for tokenize, escape, query build, and index round-trip
- Recommendation: adopt as optional experimental backend for benchmark comparison
- NOT for production use in current PR per aft-t6p.scope.1
Files:
- benchmarks/semble/FULL-CORPUS-CI-SPIKE.md (new)
- crates/aft/src/FTS5-SPIKE.md (new)
- crates/aft/src/fts5_experimental.rs (new, 14 tests)
- crates/aft/Cargo.toml (added semantic-fts5 feature)
- crates/aft/src/lib.rs (added cfg-gated module)
- semantic-benchmark-dev-readme.md: comprehensive guide for building, configuring, and testing semantic search across multiple backends (fastembed, model2vec, OASIS endpoint) with reranking
- .github/workflows/build-aft.yml: manually-triggered workflow that builds AFT with all feature flags (semantic-model2vec, semantic-fts5) for Windows x64, Linux x64, and macOS ARM64. Use when local MSVC Build Tools are not available.
GitHub only indexes workflows from the default branch. The build-aft.yml must exist on main for 'gh workflow run' to find it. The workflow accepts a branch input to build any ref.
Accept a comma-separated 'targets' input (windows, linux, darwin, all) so users can build only the platforms they need instead of all three. Each job is gated by an 'if:' condition. Usage: gh workflow run build-aft.yml -f targets=windows -f branch=...
Resolve merge conflicts from upstream main (v0.36.x) while preserving PR cortexkit#87 semantic search work: typed vectors, reranking, diagnostics, doctor/eval, model2vec deps, and Semble benchmark tooling. Upstream wins for platform code (backup v4, checkpoint locking, inspect/callgraph, local_embed/ort stack, semantic refresh worker, search pipeline fixes). PR semantic layer retained with LocalEmbedder migration, transient embedding markers, extended configure parsing, and merged config surface (max_files + inspect + semantic_files). Also: union gitignore, merged package.json scripts, upstream docs/README, regenerated Cargo.lock, opencode JSONC strip fix, zir-aft-check filter keeps failures visible.
Add deferred-file refresh worker support on the VectorStore snapshot architecture, wire semantic_files policy through configure refresh paths, and strip rerank trust-boundary keys.
Port cortexkit/main semantic search routing, fallback responses, and contract fields onto the snapshot VectorStore architecture. Wire diagnostics, rerank, Partial index status, and max_files serde default.
SearchWarning::EmbeddingFailure { .. } => None,
SearchWarning::LexicalFailure { .. } => None,
SearchWarning::DimensionMismatch { .. } => None,
SearchWarning::RerankerFailure { .. } => None,
RerankerFailure silently suppressed in the default Minimal output mode
format_warning_minimal returns None for RerankerFailure, so any reranker failure warning in diag_warnings — including the hard-failure case that is pushed unconditionally at semantic_search.rs:612 — produces no visible output in the default output_mode: minimal configuration. A user with rerank_enabled: true and every other setting at its default will receive silently reordered (or un-reordered) results with no indication that the reranker failed. Only users who explicitly set output_mode: "verbose" will see the warning text. RerankerFailure should produce at minimum a ⚠ reranker failed — results in original order string from format_warning_minimal, consistent with how StaleIndex and EmptyResults are treated.
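The change requested here could look roughly like the sketch below. The enum is a hypothetical four-variant subset of `SearchWarning`, the field names are invented, and the `StaleIndex` and `EmptyResults` strings are placeholders; only the `RerankerFailure` wording comes from the comment above.

```rust
/// Hypothetical subset of the warning enum; the real `SearchWarning`
/// has more variants and different payloads.
#[derive(Debug)]
pub enum SearchWarning {
    StaleIndex,
    EmptyResults,
    DimensionMismatch { expected: usize, actual: usize },
    RerankerFailure { reason: String },
}

/// Minimal mode: return text only for warnings that change how far the
/// user should trust the result order; everything else stays suppressed.
pub fn format_warning_minimal(w: &SearchWarning) -> Option<String> {
    match w {
        // Placeholder strings, not the crate's actual wording.
        SearchWarning::StaleIndex => Some("⚠ index is stale".to_string()),
        SearchWarning::EmptyResults => Some("⚠ no results".to_string()),
        // Wording suggested in the review comment.
        SearchWarning::RerankerFailure { .. } => {
            Some("⚠ reranker failed — results in original order".to_string())
        }
        // Still suppressed in minimal mode in this sketch.
        SearchWarning::DimensionMismatch { .. } => None,
    }
}
```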
…notations
- Add dual-mode support to profile d (pre-rerank + post-rerank passes)
- Add buildRelevantPaths helper to handle {path, start_line, end_line} annotation entries
- Update Annotation interface to accept object-format relevant/secondary paths
- Save profile d results showing +2.9pp recall and +0.057 MRR improvement from reranker
- Fix express 0% recall caused by normalizePath converting objects to '[object Object]'
- Fix sembleSearch: reorder --content all after positional args to avoid
argparse consuming query as content type; unwrap {results: [...]} wrapper
- Fix colgrepSearch: add --json flag for structured output, parse
[{unit: {file, ...}, score}] format, strip \\?\ prefix, regex fallback
for file:start_line-end_line text output
- Fix release.yml: add --features semantic-model2vec,semantic-fts5 to all
6 platform build jobs (darwin-arm64, darwin-x64, linux-arm64, linux-x64,
win32-x64, win32-arm64) — previously shipped binaries without features
be953d5 to 4224315
Summary
Semantic search in AFT moves from a minimal embedding-and-cosine prototype to a provider-capability-aware retrieval subsystem with typed vectors, optional reranking, background lifecycle management, diagnostics, and evaluation tooling. This is a public preview — the feature is functional and tested (~93 new tests) but expects iteration based on real-world feedback.
What changed
The upgrade touches the full semantic pipeline (config, indexing, retrieval, diagnostics, and observability) without breaking the default `fastembed` experience.
Typed vector representations
Vectors are no longer opaque f32 blobs. Every stored vector carries explicit type metadata (`DenseF32`, `Int8SourceDecoded`, `BinaryPacked`) and is paired with its source kind so the correct distance metric is selected automatically. Binary packed vectors use Hamming search (native bitwise XOR + popcount) instead of cosine, which is both faster and semantically correct for quantized embeddings. This unlocks Perplexity's `base64_binary` and `base64_int8` output modes alongside standard dense providers.
Provider capability profiles
Each embedding backend (fastembed, OpenAI-compatible, Ollama, Perplexity) declares what it supports: output encoding, distance metric, dimension range, max batch size. The config layer validates combinations at configure time — you cannot accidentally request binary vectors through a cosine-only provider. Profiles also carry fingerprint fields so switching providers triggers a clean index rebuild rather than silent corruption.
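As a rough illustration of metric selection by vector kind and configure-time rejection of incompatible combinations, consider the sketch below. All names (`StoredKind`, `Metric`, `default_metric`, `validate`, `hamming`) are hypothetical simplifications and do not mirror the crate's `EmbeddingModelProfile` API.

```rust
/// Hypothetical, simplified stand-ins for the stored-vector kind and metric.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StoredKind {
    DenseF32,
    BinaryPacked,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Metric {
    Cosine,
    Hamming,
}

/// Pick the metric implied by how the vector is stored.
pub fn default_metric(kind: StoredKind) -> Metric {
    match kind {
        StoredKind::DenseF32 => Metric::Cosine,
        StoredKind::BinaryPacked => Metric::Hamming,
    }
}

/// Configure-time check: reject a kind/metric pair that cannot be searched
/// correctly (for example, packed bits with cosine).
pub fn validate(kind: StoredKind, metric: Metric) -> Result<(), String> {
    if default_metric(kind) == metric {
        Ok(())
    } else {
        Err(format!("{kind:?} vectors cannot be searched with {metric:?}"))
    }
}

/// Hamming distance over packed bits: XOR each byte pair, then count set bits.
/// Returns None when the two vectors have different lengths.
pub fn hamming(a: &[u8], b: &[u8]) -> Option<u32> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum())
}
```

Doing the check once at configure time means the search path never has to decide what to do with a mismatched pair.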
Fingerprint-driven index lifecycle
A `SemanticIndexFingerprint` captures every dimension that affects index correctness: backend, model, base_url, dimension, chunking_version, output_encoding, storage_strategy, vector kinds, normalization, and prompt hashes. `diff()` classifies changes as `Rebuild` (structural: re-embed everything), `ClearQueryCache` (query prompts changed: invalidate cached results only), or `None`. This replaces the previous "delete and hope" invalidation with precise, explainable rebuild decisions.
Non-blocking cold start
Index builds run in a background thread with cooperative cancellation (`SemanticCancellationToken` via `AtomicU64` generation counter). The build checks the generation before each embedding batch and exits early when a reconfigure arrives. Priority ordering ensures high-value files (recently edited, high PageRank) get embedded first. Exponential backoff handles transient provider failures without blocking the session.
Stale-vector pruning
When files are edited, deleted, moved, excluded, or re-included, the index tracks which vectors are stale and prunes them during the next refresh cycle. Every vector record carries file/chunk ownership metadata (file path, version, chunk hash, index fingerprint) so pruning is traceable and deterministic.
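A minimal sketch of ownership-based pruning, assuming each vector records the file and content hash that produced it. `VectorRecord` and `prune_stale` are hypothetical names, and a `u64` stands in for the blake3 hash mentioned elsewhere in this PR.

```rust
use std::collections::HashMap;

/// Hypothetical, pared-down ownership record: which file and content
/// version produced a stored vector.
#[derive(Debug, Clone, PartialEq)]
pub struct VectorRecord {
    pub file: String,
    pub content_hash: u64,
    pub vector: Vec<f32>,
}

/// Keep only vectors whose owning file still exists with the same content
/// hash. `current` maps each file that should be indexed to its current
/// hash, so deleted or excluded files are simply absent from it.
/// Returns the number of pruned vectors.
pub fn prune_stale(records: &mut Vec<VectorRecord>, current: &HashMap<String, u64>) -> usize {
    let before = records.len();
    records.retain(|r| current.get(&r.file) == Some(&r.content_hash));
    before - records.len()
}
```

Because the decision depends only on the record and the current manifest, the same inputs always prune the same vectors.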
File policy and docs chunking
A configurable file policy controls which files enter the index (include globs, exclude globs, max file size, max chunk count). The docs chunker splits Markdown and documentation files into semantic sections before embedding, improving recall for documentation-shaped queries.
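Heading-based splitting of Markdown can be sketched as follows. This is a simplified illustration, not the crate's `collect_docs_chunks()`: `DocChunk` and `split_by_headings` are hypothetical names, and any line starting with `#` is treated as a heading, including lines inside fenced code blocks.

```rust
/// Hypothetical chunk type: one Markdown section with its heading text.
#[derive(Debug, PartialEq)]
pub struct DocChunk {
    pub heading: String,
    pub body: String,
}

/// Split Markdown into sections at lines starting with '#'.
/// Text before the first heading becomes a chunk with an empty heading.
pub fn split_by_headings(markdown: &str) -> Vec<DocChunk> {
    let mut chunks = Vec::new();
    let mut heading = String::new();
    let mut body = String::new();
    for line in markdown.lines() {
        if line.starts_with('#') {
            // Flush the section collected so far, if it has any content.
            if !heading.is_empty() || !body.trim().is_empty() {
                chunks.push(DocChunk {
                    heading: heading.clone(),
                    body: body.trim().to_string(),
                });
            }
            heading = line.trim_start_matches('#').trim().to_string();
            body.clear();
        } else {
            body.push_str(line);
            body.push('\n');
        }
    }
    // Flush the final section.
    if !heading.is_empty() || !body.trim().is_empty() {
        chunks.push(DocChunk { heading, body: body.trim().to_string() });
    }
    chunks
}
```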
Reranking pipeline
Optional reranking via any OpenAI-compatible `/v1/rerank` or chat-completion endpoint. The pipeline sends initial retrieval candidates to a reranker, parses the response (supporting multiple JSON shapes), and reorders results with safe fallback: if the reranker fails, the original cosine-similarity order is returned unchanged. Config fields: `rerank.enabled`, `rerank.model`, `rerank.base_url`, `rerank.api_key_env`, `rerank.max_candidates`.
Search pipeline metrics and diagnostics
Every `aft_search` call records timing, cache hits/misses, result counts, and reranker fallback events. Metrics are exposed through the `status` command and through JSONL diagnostic logs for offline analysis. The `DiagnosticsOutputMode` config controls verbosity in tool output (`minimal` | `verbose` | `off`).
Semantic doctor
`semantic_doctor` is a health-check command that reports config summary, index summary, metrics summary, provider summary, and actionable suggestions. Use it to verify that the index is healthy, the provider is reachable, and the configuration is consistent.

Semantic eval harness
`semantic_eval` runs a JSONL-defined evaluation suite against the semantic index. Each case specifies a query, expected paths, expected symbols, and top-k. The harness computes recall@k and MRR (Mean Reciprocal Rank) for quantifying retrieval quality across config changes.

Status integration
The `status` command now includes semantic health metrics: lifecycle state, entry count, dimension, total queries, cache hit ratio, average query time, and provider info. The OpenCode TUI sidebar surfaces these alongside the existing index state.

Config trust boundary
`backend`, `base_url`, and `api_key_env` are user-only fields: project-level `aft.jsonc` cannot inject these. A hostile repository cannot redirect embeddings at an attacker-controlled endpoint or exfiltrate API keys. The plugin logs a warning when it strips a project-level setting.

Contextualized document-chunk embedding (partial)
Initial support for Perplexity-style document/chunk grouped embedding — chunks from the same source document are batched together rather than flattened. Oversized document handling and retry logic are still in progress (see roadmap).
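The difference between grouped and flattened input can be sketched as follows. This is only an illustration of the batching shape, assuming a hypothetical `Chunk` type and `group_by_document` helper; it says nothing about the provider's actual request format or the PR's oversized-document handling.

```rust
use std::collections::HashMap;

/// Hypothetical chunk tagged with the document it came from.
struct Chunk<'a> {
    doc: &'a str,
    text: &'a str,
}

/// Groups chunks by source document, preserving first-seen document order
/// and the chunk order within each document.
fn group_by_document<'a>(chunks: &[Chunk<'a>]) -> Vec<(&'a str, Vec<&'a str>)> {
    let mut groups: Vec<(&'a str, Vec<&'a str>)> = Vec::new();
    let mut slot: HashMap<&'a str, usize> = HashMap::new();
    for c in chunks {
        // Look up the group for this document, creating it on first sight.
        let i = *slot.entry(c.doc).or_insert_with(|| {
            groups.push((c.doc, Vec::new()));
            groups.len() - 1
        });
        groups[i].1.push(c.text);
    }
    groups
}
```

A flattened pipeline would send every `text` as an independent input; the grouped form keeps sibling chunks together so the embedding request can carry document context.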
How to test
Default fastembed (zero-config)
Verify: results appear with `source: semantic` or `source: hybrid` tags. Status shows `[index: ready]` after build completes.

Provider switching
Verify: index rebuilds automatically on next session start. Status shows new provider/model.
Reranking
```json
{
  "semantic_search": true,
  "semantic": {
    "backend": "openai_compatible",
    "model": "text-embedding-3-small",
    "base_url": "https://api.openai.com/v1",
    "api_key_env": "OPENAI_API_KEY"
  },
  "rerank": {
    "enabled": true,
    "model": "rerank-english-v3.0",
    "base_url": "https://api.cohere.com",
    "api_key_env": "COHERE_API_KEY"
  }
}
```

Verify: search results show reranker-sorted order. Disable the reranker and results fall back to cosine order.
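The fallback behaviour being verified here can be sketched in a few lines. This is a hedged illustration rather than the code in `semantic_rerank.rs`: `apply_rerank` is a hypothetical helper, the reranker response is modelled as a plain list of candidate indices, and out-of-range or duplicate indices are handled the way the review flowchart describes (filter, deduplicate, append missing).

```rust
use std::collections::HashSet;

/// Reorders `candidates` by the indices a reranker returned.
/// On failure the original (cosine-similarity) order is returned unchanged,
/// together with a flag so callers can emit a warning.
fn apply_rerank<T: Clone>(
    candidates: &[T],
    reranked: Result<Vec<usize>, String>,
) -> (Vec<T>, bool) {
    let order = match reranked {
        Ok(order) => order,
        Err(_) => return (candidates.to_vec(), true), // safe fallback
    };
    let mut seen = HashSet::new();
    let mut out = Vec::with_capacity(candidates.len());
    // Keep in-bounds indices, first occurrence only.
    for i in order {
        if i < candidates.len() && seen.insert(i) {
            out.push(candidates[i].clone());
        }
    }
    // Append anything the reranker omitted, in original order.
    for (i, c) in candidates.iter().enumerate() {
        if !seen.contains(&i) {
            out.push(c.clone());
        }
    }
    (out, false)
}
```

With this shape a misbehaving reranker can reorder results but never drop or duplicate them, which is the property the "disable reranker" check above relies on.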
Semantic doctor
```
aft_search({ "query": "test" })  # trigger index build if cold
# Then check health via status command or semantic_doctor
```

Verify: health report shows ConfigSummary, IndexSummary, MetricsSummary, ProviderSummary.
Eval harness
Verify: returns recall@k and MRR scores.
Test coverage
~93 tests across 8 test sub-tasks covering:
Roadmap
Still in progress or planned for follow-up:
Architecture notes
Key new modules:
- `crates/aft/src/semantic_rerank.rs`: reranking pipeline with safe fallback
- `crates/aft/src/semantic_diagnostics.rs`: JSONL diagnostic logging
- `crates/aft/src/semantic_doctor.rs`: health-check report generation
- `crates/aft/src/semantic_eval.rs`: evaluation harness (JSONL parser, scoring)
- `crates/aft/src/vector_store.rs`: VectorStore trait with DenseF32 and BinaryPacked implementations
- `crates/aft/src/commands/semantic_doctor.rs`: doctor command handler
- `crates/aft/src/commands/semantic_eval.rs`: eval command handler

Modified significantly:

- `crates/aft/src/semantic_index.rs`: lifecycle management, fingerprint-driven invalidation, non-blocking build, stale pruning, typed vectors
- `crates/aft/src/config.rs`: provider profiles, rerank config, trust boundary fields
- `crates/aft/src/commands/status.rs`: semantic health metrics
- `crates/aft/src/commands/semantic_search.rs`: reranking integration, diagnostics output mode

Summary by cubic
Provider-aware semantic search with typed embeddings, completed contextualized chunking, optional local `model2vec`, reranking, diagnostics (with JSONL), a doctor and eval harness, and an expanded Semble benchmark suite. Now rebased on `main`, with Docker-based Rust validation for Windows, a manual multi-target build workflow, hardened reranker behavior, stricter config trust boundaries, a fixed deferred-file refresh on the snapshot/vector store, and search UX aligned with upstream (fallbacks, contract, and `Partial` index status).

Bug Fixes
- … `rerank_base_url`.
- … `semantic_files` during incremental updates.
- … `semantic-model2vec` and `semantic-fts5` features across all platforms.

New Features
- … `semantic_doctor` health report and extended `status` with semantic metrics (latency percentiles, zero-result/low-confidence rates, rerank state).
- … `model2vec-rs` behind the `semantic-model2vec` feature; experimental FTS5 lexical backend behind `semantic-fts5`.
- … `max_results_per_file` cap; provider capability profiles and typed vectors include binary-packed + Hamming search.

Written for commit 4224315.
Greptile Summary
This alpha PR graduates the semantic search subsystem from a minimal embedding prototype to a provider-capability-aware retrieval pipeline with typed vectors, reranking, background lifecycle management, JSONL diagnostics, a health-check doctor, and an evaluation harness. The core mechanics (fingerprint-driven index invalidation, binary-packed Hamming search, reranker safe-fallback, trust boundary enforcement) are structurally sound, but the new `semantic_eval` harness contains three correctness bugs that make its output misleading.

- `EvalCaseResult::index` is set from 0-based `enumerate()` but documented as 1-based; `parse_jsonl` claims trailing commas are tolerated but `serde_json::from_str` rejects them; `score_case` includes hits beyond the top-k window in `first_hit_rank`, inflating MRR for configurations where relevant results sit just outside the recall window.
- `RerankerFailure` minimal-mode visibility gaps flagged in previous review rounds remain open.
- `backend`, `base_url`, `api_key_env`, and `model_path` are user-only fields enforced in the TypeScript config layer; the new `compress/trust.rs` uses atomic rename writes and fail-closed semantics.

Confidence Score: 3/5
Safe to merge for the core search pipeline; the new eval harness ships with incorrect output that should be fixed before it is used to compare configurations.
The three bugs in semantic_eval.rs — 0-based index reported as 1-based, trailing-comma tolerance documented but not implemented, and MRR inflated by hits outside the recall window — collectively make the eval harness unreliable. Anyone who runs eval suites during the alpha period to compare providers or tuning knobs will get skewed MRR numbers and confusing case references. The core retrieval, reranking, and lifecycle code is structurally sound.
crates/aft/src/semantic_eval.rs has three correctness issues that affect the reliability of all eval measurements.
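For reference, the scoring behaviour the review asks for (hits outside the top-k window contribute to neither recall@k nor the reciprocal rank) can be sketched as below. This is an illustrative stand-in, not the contents of `semantic_eval.rs`: `score_case_top_k` and `mean_reciprocal_rank` are hypothetical names, and matching is done on exact path strings only.

```rust
/// Scores one eval case. `results` are ranked paths (best first),
/// `expected` the relevant paths, `k` the cut-off.
/// Returns (recall@k, reciprocal rank); only hits inside the top-k
/// window count toward either number.
fn score_case_top_k(results: &[&str], expected: &[&str], k: usize) -> (f64, f64) {
    if expected.is_empty() {
        return (0.0, 0.0);
    }
    let window = &results[..results.len().min(k)];
    let found = expected.iter().filter(|e| window.contains(*e)).count();
    let recall = found as f64 / expected.len() as f64;
    // 1-based rank of the first relevant result within the window.
    let first_hit_rank = window
        .iter()
        .position(|r| expected.contains(r))
        .map(|i| i + 1);
    let rr = first_hit_rank.map_or(0.0, |rank| 1.0 / rank as f64);
    (recall, rr)
}

/// Mean of per-case reciprocal ranks; 0.0 for an empty suite.
fn mean_reciprocal_rank(rrs: &[f64]) -> f64 {
    if rrs.is_empty() {
        0.0
    } else {
        rrs.iter().sum::<f64>() / rrs.len() as f64
    }
}
```

With `k = 2` and the first relevant path at rank 3, this version reports a reciprocal rank of 0 rather than 1/3, so MRR and recall@k describe the same window.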
Important Files Changed
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Q["aft_search(query)"] --> CM[choose_mode]
    CM -->|regex/literal| GS[handle_grep_search]
    CM -->|semantic/hybrid| HS[handle_semantic_or_hybrid_search]
    HS --> IDX{SemanticIndexStatus}
    IDX -->|Disabled/Failed| FB[fallback response]
    IDX -->|Building| BF[lexical-only fallback]
    IDX -->|Partial| PW[push PartialIndex warning]
    IDX -->|Ready| EQ[embed_query]
    PW --> EQ
    EQ --> VK[vector k-NN search]
    VK --> FH[fuse_hybrid_results max_results_per_file cap]
    FH --> RR{rerank_candidates}
    RR -->|Skipped| SC[score / format results]
    RR -->|ReRanked| OOB[filter OOB indices deduplicate append missing]
    RR -->|Failed| WARN[push RerankerFailure warning]
    OOB --> SC
    WARN --> SC
    SC --> DP[format_diagnostics_prefix output_mode: off/minimal/verbose]
    DP --> DIAG{diagnostics_enabled?}
    DIAG -->|yes| LOG[SearchMetricsCollector + JSONL logger]
    DIAG -->|no| RESP[search_response]
    LOG --> RESP
```

Comments Outside Diff (1)
packages/opencode-plugin/src/config.ts, lines 37-54

Several new enum schemas use values that don't align with the Rust serde representation:

- `SemanticOutputEncodingEnum` allows `"binary"`, `"ubinary"`, `"int8"`, `"uint8"` but Rust `OutputEncoding` deserializes from `"base64_binary"` and `"base64_int8"`.
- `SemanticStorageStrategyEnum` allows `"flat"` and `"binary_pack"` but Rust `StorageStrategy` expects `"native_f32"` and `"binary_packed"`.
- `SemanticInputModeEnum` includes `"chunk_extracts"` and `"contextualized"` but Rust `InputMode` only has `"flat_texts"` and `"document_chunks"`.
- `SemanticDistanceMetricEnum` uses `"dot"` but Rust `DistanceMetric` expects `"dot_product"`.
- `SemanticBackendEnum` is missing the new `"perplexity"` variant added to Rust.

A user who follows the TypeScript autocomplete and picks `output_encoding: "int8"` will pass TypeScript validation but receive a deserialization error (or silent fallback to default) from the Rust binary at runtime.

Reviews (19): Last reviewed commit: "fix: benchmark script CLI parsing and re..."