Skip to content

feat: Center Mappings on VRS Alleles#773

Draft
bencap wants to merge 20 commits into
release-2026.3.0from
feature/bencap/739/allele-data-model-schema
Draft

feat: Center Mappings on VRS Alleles#773
bencap wants to merge 20 commits into
release-2026.3.0from
feature/bencap/739/allele-data-model-schema

Conversation

@bencap

@bencap bencap commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

No description provided.

bencap added 20 commits June 4, 2026 11:05
- add shared seqrepo and seqrepo data proxy providers in services
- reuse shared seqrepo provider in deps to remove duplicate config
- align vrs sequence proxy behavior with dcd-mapping expectations
- add shared helpers to translate hgvs, normalize alleles, and compute ids
- clear cached merkle digests before identification so mutated alleles do
  not keep stale ga4gh identifiers
- document the invariant that all allele identification must route through
  the helper to prevent digest correctness regressions
Introduce a reusable declarative mixin implementing transaction-time
SCD Type 2 versioning for rows that change over time, used by either
versioned entities or link/association rows.

- A row is live while valid_to is NULL; current/as_of express the
  half-open [valid_from, valid_to) point-in-time predicate so call
  sites never hand-roll it
- supersede_with (single row) and supersede_live_where (bulk) stamp the
  retired valid_to and successor valid_from with one timestamp for a
  gap-free handoff regardless of transaction boundaries
- retire/retire_live_where are withdrawal primitives; retire cascades
  to live child links named in __retire_cascade__
- bulk supersede refuses to run on cascade-bearing classes since it
  cannot fire the cascade
- consumers must add a partial unique index over their natural key
  WHERE valid_to IS NULL as a backstop against duplicate live rows

Cover the mixin with unit tests against purpose-built models, using
explicit timestamps far from the transaction clock to prove the
gap-free handoff and verify the partial unique index backstop.
Add a lib module with reusable helpers for parsing and rewriting HGVS
strings ahead of VRS translation.

- extract_accession returns the reference accession (substring before
  the first colon), tolerant of whitespace and missing separators
- split_cis_phased_hgvs expands a bracketed multivariant expression
  into fully-qualified components carrying the original accession and
  coordinate prefix, since ga4gh's AlleleTranslator only yields a
  single Allele per call and each component must translate alone
- join_cis_phased_hgvs is the inverse, recombining components into one
  bracketed block and returning None when they do not share a single
  accession and coordinate prefix

Cover the helpers with unit tests including the split/join round trip
and the mixed-accession and mixed-prefix rejection cases.
Add VRS translation support for cis-phased multivariant HGVS, building
on the new hgvs split helpers.

- translate_hgvs_to_variation translates each component HGVS to an
  Allele independently, returning a bare Allele for a single component
  and wrapping two or more in a CisPhasedBlock, mirroring dcd_mapping's
  vrs_map._construct_vrs_allele; the reverse-translation job emits
  bracketed genomic forms the AlleleTranslator cannot translate directly
- identify_variation generalizes identify_allele to blocks, clearing
  every member and location digest plus the block's own before
  identifying so a stale member digest never propagates into the block
  id; the block digest is order-independent so a set dedups to one row

Cover both with unit tests, including the order-independent block
digest and the stale-digest clearing.
get_hgvs_from_post_mapped can now join the members of a multi-variant
block (Haplotype/CisPhasedBlock) into a single bracketed cis-phased
expression via the new combine_cis flag, replacing the previous
behavior of returning None for any multi-variant block.

- combine_cis defaults off because some consumers cannot yet handle a
  bracketed expression — notably ClinGen submission, which has no
  single CAID for a multi-variant cis block (#764)
- the CSV export fallback opts in, so g./p. HGVS columns are populated
  from cis-phased post-mapped output instead of left empty
- drop the stale commented-out error branches and the resolved TODO
Introduce parallel mapping tables for the Better Reverse Translation
epic (#746). The existing mapped_variants table is left untouched
(frozen serving) while the new schema is built out separately.

- add alleles, mapping_records, and mapping_record_alleles tables with
  their ORM models; alleles are content-addressed by vrs_digest and
  shared across mapping records via the link table
- mapping_record_alleles.is_authoritative distinguishes the assay's
  actual measurement from translator-derived links, since the same VRS
  allele can be authoritative for one record and derived for another
- put mapping_records and mapping_record_alleles on valid-time
  versioning (ValidTime mixin): a re-map retires the prior live row by
  closing valid_to instead of deleting, so history is retained and
  point-in-time queries are a single predicate; partial unique indexes
  promote "one live row per key" to the database
- derive Allele.transcript and MappingRecord.transcript as hybrid
  properties from the HGVS columns rather than storing them, so they
  cannot drift; drop the stored alleles.transcript column
- add the cross_level_translation annotation type, written once per
  variant to record whether filling unmapped levels succeeded, was
  skipped, or failed
- annotate Variant and TargetGeneMapping with mapping_records
  relationships and typed primary keys
Add type stubs for the VRS and hgvs APIs used by the reverse
translation work so callers can be type-checked without casts.

- ga4gh.core: type ga4gh_identify and re-export PrevVrsVersion
- ga4gh.vrs: stub the data proxy, AlleleTranslator, and normalize;
  translate_from returns Any so callers annotate the concrete VRS
  subtype they expect without a redundant cast
- hgvs.assemblymapper: stub AssemblyMapper with the g_to_t/c_to_g/c_to_p
  conversions used for cross-level translation
Wire the variant-annotation package into the server extra as a local
editable (PEP 660) dependency so the API can drive the reverse
translation pipeline.

- declare variant-annotation as a develop path dependency in
  pyproject.toml and the server extra; lock pulls in its transitive
  deps (variant-translation, biopython, openpyxl, et-xmlfile)
- mount ../variant-annotation into the dev and worker containers and
  prepend it to PYTHONPATH so the editable install resolves
- pass --no-directory to poetry install in the Dockerfile so the build
  does not try to install the not-yet-copied editable sibling
- ignore_missing_imports for variant_annotation.* in mypy: its editable
  import hook is unfollowable and it ships no py.typed, so treat it as
  untyped rather than maintaining drift-prone local stubs
- add a Makefile with dev/test/lint/format targets
Rework the variant mapping job to persist its results into the new
MappingRecord / Allele / MappingRecordAllele schema instead of
MappedVariant, building on the closure tables for the Better Reverse
Translation epic.

- create one MappingRecord per variant (including failed variants, with
  null VRS data); successfully post-mapped variants additionally
  get-or-create an authoritative Allele and link it via an
  is_authoritative MappingRecordAllele
- supersede a variant's prior live record through ValidTime
  supersede_with — retiring it and its allele links and inserting the
  new record under one timestamp — instead of flipping a current flag,
  so the old/new handoff is gap-free
- require a TargetGeneMapping for every mapped score, raising on a miss
  rather than tolerating a null link, since dcd-mapping guarantees one
- combine cis-phased members when deriving hgvs_assay_level
- update the mapping job tests to assert against MappingRecord and the
  authoritative-allele link, including the gap-free supersession and
  cascade-retire of the prior link

TODO#765: mapping is not idempotent, so each run always creates a new
authoritative link.
Add a seqrepo_data_proxy to the worker context in both the standalone
context and the on_job_start hook, so jobs performing VRS translation
have a SeqRepo-backed data proxy available alongside the cdot HGVS data
provider. Mirror it in the mock worker context used by tests.
Add a reverse_translate_variants_for_score_set worker job that builds
the cross-level HGVS equivalence class for every mapped variant in a
score set, persisting the candidates as non-authoritative alleles.

- for each current authoritative MappingRecord, collapse the assay-level
  HGVS to its protein consequence and expand to all coding/genomic
  candidates via a single batched construct_equivalent_variants call
  from the variant-annotation library, run off-loop in a thread
- resolve each record's coding (NM_) transcript from its target gene's
  cdna alignment, falling back to a batched UTA NP_→NM_ lookup for
  protein-level mappings; records with no coding transcript are skipped
- translate each candidate to VRS and write it as a get-or-created
  Allele linked non-authoritatively to the record, deduping by
  vrs_digest and never relinking the record's authoritative allele
- supersede the prior live derived links with the new set in one
  gap-free operation via ValidTime.supersede_live_where
- record per-variant cross_level_translation annotation status
  (success / failed / skipped), retaining per-candidate translation
  errors as metadata; the job fails only when every variant fails
- add WorkerCoordinateTranslator and NullTranscriptSource adapters for
  the library's translation ports, deferring AssemblyMapper init so its
  network calls do not fire under mocked unit tests
- get_or_create_allele helper, job registration, and a pipeline
  dependency placing reverse translation after mapping and before CAR
  submission

TODO#765: a re-run supersedes the whole derived set wholesale because
re-mapping re-mints the records; idempotent records would let unchanged
derived links stay live.
…n date

Previously the cdna-transcript lookup was keyed by target_gene_id alone,
which meant a re-mapped score set could bind a stale NM_ transcript from
an earlier run rather than the one the current run emitted.

- Key cdna_transcript_by_run on (target_gene_id, mapped_date); within a
  key the highest-id row wins, so a same-run replacement takes precedence.
- Carry mapped_date through the MappingRecord query so each record anchors
  to its own run's cdna row.
- Add _TranscriptResolutionSkipReason to distinguish recoverable skips
  (protein-coding target, transcript unresolved) from correct skips
  (non-coding/regulatory target, no protein consequence). Emit
  skip_category in annotation_metadata.
- Add target_gene_id to _TranscriptResolution so skip classification can
  look up the target's TargetCategory.
- Add Mapped[] type annotations to TargetGene.id and .category columns.
- New tests: genomic-accession coding target RT, latest-cdna-row-within-
  run selection, stale-cdna-row isolation, skip classification (coding
  recoverable vs. regulatory correct), and cdna TGM persistence in the
  mapping job.

Refs mavedb-api#763
Forward translation emits predicted protein consequences in parentheses
(e.g. p.(Ala222Val)). The parens denote inference, not a distinct
variant form, so normalize to the bare p. form before storing. Strings
without prediction parens are returned unchanged. Includes parametrized
unit tests.
…st collisions

- Re-identify each component via normalize_and_identify after translate_from
  so the stored vrs_digest reflects current content, not a stale value
  cached by the reused AlleleTranslator's Merkle tree. Without this,
  distinct biological variants at the same position (e.g. A>C and A>T)
  share one digest and are merged by the digest-keyed get_or_create_allele.
- Use ga4gh_identify with in_place="always" in identify_allele and
  identify_variation so an allele that already carries a translator-stamped
  id has it recomputed, not retained.
- Coerce ReferenceLengthExpression -> LiteralSequenceExpression in
  normalize_and_identify, mirroring dcd_mapping's _rle_to_lse, so RT-built
  alleles hash identically to the mapper's authoritative alleles and dedup
  correctly across sources.
- Add regression tests: stale-id overwrite, distinct-alt digest isolation,
  cis-phased ordering canonicalization, and RLE->LSE coercion.
…tighten comments

- Emit the protein consequence (result.hgvs_p) as a protein-level member
  of the equivalence set alongside the coding/genomic candidates. Prediction
  parens (p.(Ala222Val)) are stripped via strip_protein_prediction_parens
  before translation and storage. Protein-assay inputs are excluded (the
  protein is already the authoritative allele).
- Trim all inline and docstring comments to essential why; remove redundant
  prose throughout reverse_translation.py.
- Add tests: protein allele persistence, skip-classification for all three
  skip categories (no_assay_level_hgvs, transcript_unresolved,
  no_coding_transcript).
…rom failures

Previously, variant mapping success was determined solely by the
presence of pre/post-mapped alleles. This conflated genuinely failed
mappings with benign absences (intronic variants,
no-protein-consequence variants) that legitimately produce no allele.

- Add `MappingOutcome` enum to `lib/mapping/schema.py` mirroring
  `dcd_mapping.schemas.MappingOutcome`, with an `is_benign_absence`
  helper to classify INTRONIC and NO_PROTEIN_CONSEQUENCE outcomes
- Rewrite the per-record outcome logic in `map_variants_for_score_set`
  to branch on the typed outcome rather than allele presence:
  MAPPED -> SUCCESS, benign absence -> SKIPPED, FAILED -> FAILED
- Replace the `successful_mapped_variants` scalar with a typed
  `Counter[MappingOutcome]` that feeds `mapped_count`, `failed_count`,
  and `skipped_count` tallies in logs and final state decision
- Derive `mapping_state` from genuine failures only; all-benign
  result sets are treated as complete, not failed
- Raise `NonexistentMappingResultsError` when a score annotation has
  no `outcome` field (malformed/older payload)
- Expand test coverage for intronic and no-protein-consequence
  scenarios, verifying correct status and failure-category assignment
… UTA-backed source

WtCodonMode.ALL reads the reference codon via TranscriptSource.codon_at,
which requires a real UTA connection. The previous NullTranscriptSource
always returned None, silently breaking WT-codon resolution.

- Extract `uta_transcript_source()` context manager into
  `translation_ports.py`; removes the ad-hoc UTA connection setup
  that was duplicated in the job and the now-deleted NullTranscriptSource
- Scope the live UTA client around the full `run_in_executor` call so
  the connection outlives the synchronous executor block
- Remove NullTranscriptSource; callers that only need
  transcript_for_protein (already resolved by the job) also benefit
  from the real client without extra cost
- Add a TODO noting that non-substitution consequences (del/ins/delins/
  fs/ext) are miscounted as FAILED rather than SKIPPED (#767)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Persist MappingRecord and Allele rows from mapping pipeline results

1 participant