feat: Center Mappings on VRS Alleles by bencap · Pull Request #773 · VariantEffect/mavedb-api

bencap · 2026-06-17T23:29:46Z

No description provided.

- add shared seqrepo and seqrepo data proxy providers in services - reuse shared seqrepo provider in deps to remove duplicate config - align vrs sequence proxy behavior with dcd-mapping expectations

- add shared helpers to translate hgvs, normalize alleles, and compute ids - clear cached merkle digests before identification so mutated alleles do not keep stale ga4gh identifiers - document the invariant that all allele identification must route through the helper to prevent digest correctness regressions

Introduce a reusable declarative mixin implementing transaction-time SCD Type 2 versioning for rows that change over time, used by either versioned entities or link/association rows. - A row is live while valid_to is NULL; current/as_of express the half-open [valid_from, valid_to) point-in-time predicate so call sites never hand-roll it - supersede_with (single row) and supersede_live_where (bulk) stamp the retired valid_to and successor valid_from with one timestamp for a gap-free handoff regardless of transaction boundaries - retire/retire_live_where are withdrawal primitives; retire cascades to live child links named in __retire_cascade__ - bulk supersede refuses to run on cascade-bearing classes since it cannot fire the cascade - consumers must add a partial unique index over their natural key WHERE valid_to IS NULL as a backstop against duplicate live rows Cover the mixin with unit tests against purpose-built models, using explicit timestamps far from the transaction clock to prove the gap-free handoff and verify the partial unique index backstop.

Add a lib module with reusable helpers for parsing and rewriting HGVS strings ahead of VRS translation. - extract_accession returns the reference accession (substring before the first colon), tolerant of whitespace and missing separators - split_cis_phased_hgvs expands a bracketed multivariant expression into fully-qualified components carrying the original accession and coordinate prefix, since ga4gh's AlleleTranslator only yields a single Allele per call and each component must translate alone - join_cis_phased_hgvs is the inverse, recombining components into one bracketed block and returning None when they do not share a single accession and coordinate prefix Cover the helpers with unit tests including the split/join round trip and the mixed-accession and mixed-prefix rejection cases.

Add VRS translation support for cis-phased multivariant HGVS, building on the new hgvs split helpers. - translate_hgvs_to_variation translates each component HGVS to an Allele independently, returning a bare Allele for a single component and wrapping two or more in a CisPhasedBlock, mirroring dcd_mapping's vrs_map._construct_vrs_allele; the reverse-translation job emits bracketed genomic forms the AlleleTranslator cannot translate directly - identify_variation generalizes identify_allele to blocks, clearing every member and location digest plus the block's own before identifying so a stale member digest never propagates into the block id; the block digest is order-independent so a set dedups to one row Cover both with unit tests, including the order-independent block digest and the stale-digest clearing.

get_hgvs_from_post_mapped can now join the members of a multi-variant block (Haplotype/CisPhasedBlock) into a single bracketed cis-phased expression via the new combine_cis flag, replacing the previous behavior of returning None for any multi-variant block. - combine_cis defaults off because some consumers cannot yet handle a bracketed expression — notably ClinGen submission, which has no single CAID for a multi-variant cis block (#764) - the CSV export fallback opts in, so g./p. HGVS columns are populated from cis-phased post-mapped output instead of left empty - drop the stale commented-out error branches and the resolved TODO

Introduce parallel mapping tables for the Better Reverse Translation epic (#746). The existing mapped_variants table is left untouched (frozen serving) while the new schema is built out separately. - add alleles, mapping_records, and mapping_record_alleles tables with their ORM models; alleles are content-addressed by vrs_digest and shared across mapping records via the link table - mapping_record_alleles.is_authoritative distinguishes the assay's actual measurement from translator-derived links, since the same VRS allele can be authoritative for one record and derived for another - put mapping_records and mapping_record_alleles on valid-time versioning (ValidTime mixin): a re-map retires the prior live row by closing valid_to instead of deleting, so history is retained and point-in-time queries are a single predicate; partial unique indexes promote "one live row per key" to the database - derive Allele.transcript and MappingRecord.transcript as hybrid properties from the HGVS columns rather than storing them, so they cannot drift; drop the stored alleles.transcript column - add the cross_level_translation annotation type, written once per variant to record whether filling unmapped levels succeeded, was skipped, or failed - annotate Variant and TargetGeneMapping with mapping_records relationships and typed primary keys

Add type stubs for the VRS and hgvs APIs used by the reverse translation work so callers can be type-checked without casts. - ga4gh.core: type ga4gh_identify and re-export PrevVrsVersion - ga4gh.vrs: stub the data proxy, AlleleTranslator, and normalize; translate_from returns Any so callers annotate the concrete VRS subtype they expect without a redundant cast - hgvs.assemblymapper: stub AssemblyMapper with the g_to_t/c_to_g/c_to_p conversions used for cross-level translation

Wire the variant-annotation package into the server extra as a local editable (PEP 660) dependency so the API can drive the reverse translation pipeline. - declare variant-annotation as a develop path dependency in pyproject.toml and the server extra; lock pulls in its transitive deps (variant-translation, biopython, openpyxl, et-xmlfile) - mount ../variant-annotation into the dev and worker containers and prepend it to PYTHONPATH so the editable install resolves - pass --no-directory to poetry install in the Dockerfile so the build does not try to install the not-yet-copied editable sibling - ignore_missing_imports for variant_annotation.* in mypy: its editable import hook is unfollowable and it ships no py.typed, so treat it as untyped rather than maintaining drift-prone local stubs - add a Makefile with dev/test/lint/format targets

Rework the variant mapping job to persist its results into the new MappingRecord / Allele / MappingRecordAllele schema instead of MappedVariant, building on the closure tables for the Better Reverse Translation epic. - create one MappingRecord per variant (including failed variants, with null VRS data); successfully post-mapped variants additionally get-or-create an authoritative Allele and link it via an is_authoritative MappingRecordAllele - supersede a variant's prior live record through ValidTime supersede_with — retiring it and its allele links and inserting the new record under one timestamp — instead of flipping a current flag, so the old/new handoff is gap-free - require a TargetGeneMapping for every mapped score, raising on a miss rather than tolerating a null link, since dcd-mapping guarantees one - combine cis-phased members when deriving hgvs_assay_level - update the mapping job tests to assert against MappingRecord and the authoritative-allele link, including the gap-free supersession and cascade-retire of the prior link TODO#765: mapping is not idempotent, so each run always creates a new authoritative link.

Add a seqrepo_data_proxy to the worker context in both the standalone context and the on_job_start hook, so jobs performing VRS translation have a SeqRepo-backed data proxy available alongside the cdot HGVS data provider. Mirror it in the mock worker context used by tests.

Add a reverse_translate_variants_for_score_set worker job that builds the cross-level HGVS equivalence class for every mapped variant in a score set, persisting the candidates as non-authoritative alleles. - for each current authoritative MappingRecord, collapse the assay-level HGVS to its protein consequence and expand to all coding/genomic candidates via a single batched construct_equivalent_variants call from the variant-annotation library, run off-loop in a thread - resolve each record's coding (NM_) transcript from its target gene's cdna alignment, falling back to a batched UTA NP_→NM_ lookup for protein-level mappings; records with no coding transcript are skipped - translate each candidate to VRS and write it as a get-or-created Allele linked non-authoritatively to the record, deduping by vrs_digest and never relinking the record's authoritative allele - supersede the prior live derived links with the new set in one gap-free operation via ValidTime.supersede_live_where - record per-variant cross_level_translation annotation status (success / failed / skipped), retaining per-candidate translation errors as metadata; the job fails only when every variant fails - add WorkerCoordinateTranslator and NullTranscriptSource adapters for the library's translation ports, deferring AssemblyMapper init so its network calls do not fire under mocked unit tests - get_or_create_allele helper, job registration, and a pipeline dependency placing reverse translation after mapping and before CAR submission TODO#765: a re-run supersedes the whole derived set wholesale because re-mapping re-mints the records; idempotent records would let unchanged derived links stay live.

…n date Previously the cdna-transcript lookup was keyed by target_gene_id alone, which meant a re-mapped score set could bind a stale NM_ transcript from an earlier run rather than the one the current run emitted. - Key cdna_transcript_by_run on (target_gene_id, mapped_date); within a key the highest-id row wins, so a same-run replacement takes precedence. - Carry mapped_date through the MappingRecord query so each record anchors to its own run's cdna row. - Add _TranscriptResolutionSkipReason to distinguish recoverable skips (protein-coding target, transcript unresolved) from correct skips (non-coding/regulatory target, no protein consequence). Emit skip_category in annotation_metadata. - Add target_gene_id to _TranscriptResolution so skip classification can look up the target's TargetCategory. - Add Mapped[] type annotations to TargetGene.id and .category columns. - New tests: genomic-accession coding target RT, latest-cdna-row-within- run selection, stale-cdna-row isolation, skip classification (coding recoverable vs. regulatory correct), and cdna TGM persistence in the mapping job. Refs mavedb-api#763

Forward translation emits predicted protein consequences in parentheses (e.g. p.(Ala222Val)). The parens denote inference, not a distinct variant form, so normalize to the bare p. form before storing. Strings without prediction parens are returned unchanged. Includes parametrized unit tests.

…st collisions - Re-identify each component via normalize_and_identify after translate_from so the stored vrs_digest reflects current content, not a stale value cached by the reused AlleleTranslator's Merkle tree. Without this, distinct biological variants at the same position (e.g. A>C and A>T) share one digest and are merged by the digest-keyed get_or_create_allele. - Use ga4gh_identify with in_place="always" in identify_allele and identify_variation so an allele that already carries a translator-stamped id has it recomputed, not retained. - Coerce ReferenceLengthExpression -> LiteralSequenceExpression in normalize_and_identify, mirroring dcd_mapping's _rle_to_lse, so RT-built alleles hash identically to the mapper's authoritative alleles and dedup correctly across sources. - Add regression tests: stale-id overwrite, distinct-alt digest isolation, cis-phased ordering canonicalization, and RLE->LSE coercion.

…tighten comments - Emit the protein consequence (result.hgvs_p) as a protein-level member of the equivalence set alongside the coding/genomic candidates. Prediction parens (p.(Ala222Val)) are stripped via strip_protein_prediction_parens before translation and storage. Protein-assay inputs are excluded (the protein is already the authoritative allele). - Trim all inline and docstring comments to essential why; remove redundant prose throughout reverse_translation.py. - Add tests: protein allele persistence, skip-classification for all three skip categories (no_assay_level_hgvs, transcript_unresolved, no_coding_transcript).

…rom failures Previously, variant mapping success was determined solely by the presence of pre/post-mapped alleles. This conflated genuinely failed mappings with benign absences (intronic variants, no-protein-consequence variants) that legitimately produce no allele. - Add `MappingOutcome` enum to `lib/mapping/schema.py` mirroring `dcd_mapping.schemas.MappingOutcome`, with an `is_benign_absence` helper to classify INTRONIC and NO_PROTEIN_CONSEQUENCE outcomes - Rewrite the per-record outcome logic in `map_variants_for_score_set` to branch on the typed outcome rather than allele presence: MAPPED -> SUCCESS, benign absence -> SKIPPED, FAILED -> FAILED - Replace the `successful_mapped_variants` scalar with a typed `Counter[MappingOutcome]` that feeds `mapped_count`, `failed_count`, and `skipped_count` tallies in logs and final state decision - Derive `mapping_state` from genuine failures only; all-benign result sets are treated as complete, not failed - Raise `NonexistentMappingResultsError` when a score annotation has no `outcome` field (malformed/older payload) - Expand test coverage for intronic and no-protein-consequence scenarios, verifying correct status and failure-category assignment

… UTA-backed source WtCodonMode.ALL reads the reference codon via TranscriptSource.codon_at, which requires a real UTA connection. The previous NullTranscriptSource always returned None, silently breaking WT-codon resolution. - Extract `uta_transcript_source()` context manager into `translation_ports.py`; removes the ad-hoc UTA connection setup that was duplicated in the job and the now-deleted NullTranscriptSource - Scope the live UTA client around the full `run_in_executor` call so the connection outlives the synchronous executor block - Remove NullTranscriptSource; callers that only need transcript_for_protein (already resolved by the job) also benefit from the real client without extra cost - Add a TODO noting that non-substitution consequences (del/ins/delins/ fs/ext) are miscounted as FAILED rather than SKIPPED (#767)

bencap added 20 commits June 4, 2026 11:05

feat(data-providers): add seqrepo-backed vrs data proxy

a2d0509

- add shared seqrepo and seqrepo data proxy providers in services - reuse shared seqrepo provider in deps to remove duplicate config - align vrs sequence proxy behavior with dcd-mapping expectations

fixup into cisphased hgvs commit

0b1e19f

fixup into mapper changes

ca2a08c

bencap linked an issue Jun 18, 2026 that may be closed by this pull request

feat: Persist MappingRecord and Allele rows from mapping pipeline results #740

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Center Mappings on VRS Alleles#773

feat: Center Mappings on VRS Alleles#773
bencap wants to merge 20 commits into
release-2026.3.0from
feature/bencap/739/allele-data-model-schema

bencap commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bencap commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant