feat: Center Mappings on VRS Alleles#773
Draft
bencap wants to merge 20 commits into
Draft
Conversation
- add shared seqrepo and seqrepo data proxy providers in services - reuse shared seqrepo provider in deps to remove duplicate config - align vrs sequence proxy behavior with dcd-mapping expectations
- add shared helpers to translate hgvs, normalize alleles, and compute ids - clear cached merkle digests before identification so mutated alleles do not keep stale ga4gh identifiers - document the invariant that all allele identification must route through the helper to prevent digest correctness regressions
Introduce a reusable declarative mixin implementing transaction-time SCD Type 2 versioning for rows that change over time, used by either versioned entities or link/association rows. - A row is live while valid_to is NULL; current/as_of express the half-open [valid_from, valid_to) point-in-time predicate so call sites never hand-roll it - supersede_with (single row) and supersede_live_where (bulk) stamp the retired valid_to and successor valid_from with one timestamp for a gap-free handoff regardless of transaction boundaries - retire/retire_live_where are withdrawal primitives; retire cascades to live child links named in __retire_cascade__ - bulk supersede refuses to run on cascade-bearing classes since it cannot fire the cascade - consumers must add a partial unique index over their natural key WHERE valid_to IS NULL as a backstop against duplicate live rows Cover the mixin with unit tests against purpose-built models, using explicit timestamps far from the transaction clock to prove the gap-free handoff and verify the partial unique index backstop.
Add a lib module with reusable helpers for parsing and rewriting HGVS strings ahead of VRS translation. - extract_accession returns the reference accession (substring before the first colon), tolerant of whitespace and missing separators - split_cis_phased_hgvs expands a bracketed multivariant expression into fully-qualified components carrying the original accession and coordinate prefix, since ga4gh's AlleleTranslator only yields a single Allele per call and each component must translate alone - join_cis_phased_hgvs is the inverse, recombining components into one bracketed block and returning None when they do not share a single accession and coordinate prefix Cover the helpers with unit tests including the split/join round trip and the mixed-accession and mixed-prefix rejection cases.
Add VRS translation support for cis-phased multivariant HGVS, building on the new hgvs split helpers. - translate_hgvs_to_variation translates each component HGVS to an Allele independently, returning a bare Allele for a single component and wrapping two or more in a CisPhasedBlock, mirroring dcd_mapping's vrs_map._construct_vrs_allele; the reverse-translation job emits bracketed genomic forms the AlleleTranslator cannot translate directly - identify_variation generalizes identify_allele to blocks, clearing every member and location digest plus the block's own before identifying so a stale member digest never propagates into the block id; the block digest is order-independent so a set dedups to one row Cover both with unit tests, including the order-independent block digest and the stale-digest clearing.
get_hgvs_from_post_mapped can now join the members of a multi-variant block (Haplotype/CisPhasedBlock) into a single bracketed cis-phased expression via the new combine_cis flag, replacing the previous behavior of returning None for any multi-variant block. - combine_cis defaults off because some consumers cannot yet handle a bracketed expression — notably ClinGen submission, which has no single CAID for a multi-variant cis block (#764) - the CSV export fallback opts in, so g./p. HGVS columns are populated from cis-phased post-mapped output instead of left empty - drop the stale commented-out error branches and the resolved TODO
Introduce parallel mapping tables for the Better Reverse Translation epic (#746). The existing mapped_variants table is left untouched (frozen serving) while the new schema is built out separately. - add alleles, mapping_records, and mapping_record_alleles tables with their ORM models; alleles are content-addressed by vrs_digest and shared across mapping records via the link table - mapping_record_alleles.is_authoritative distinguishes the assay's actual measurement from translator-derived links, since the same VRS allele can be authoritative for one record and derived for another - put mapping_records and mapping_record_alleles on valid-time versioning (ValidTime mixin): a re-map retires the prior live row by closing valid_to instead of deleting, so history is retained and point-in-time queries are a single predicate; partial unique indexes promote "one live row per key" to the database - derive Allele.transcript and MappingRecord.transcript as hybrid properties from the HGVS columns rather than storing them, so they cannot drift; drop the stored alleles.transcript column - add the cross_level_translation annotation type, written once per variant to record whether filling unmapped levels succeeded, was skipped, or failed - annotate Variant and TargetGeneMapping with mapping_records relationships and typed primary keys
Add type stubs for the VRS and hgvs APIs used by the reverse translation work so callers can be type-checked without casts. - ga4gh.core: type ga4gh_identify and re-export PrevVrsVersion - ga4gh.vrs: stub the data proxy, AlleleTranslator, and normalize; translate_from returns Any so callers annotate the concrete VRS subtype they expect without a redundant cast - hgvs.assemblymapper: stub AssemblyMapper with the g_to_t/c_to_g/c_to_p conversions used for cross-level translation
Wire the variant-annotation package into the server extra as a local editable (PEP 660) dependency so the API can drive the reverse translation pipeline. - declare variant-annotation as a develop path dependency in pyproject.toml and the server extra; lock pulls in its transitive deps (variant-translation, biopython, openpyxl, et-xmlfile) - mount ../variant-annotation into the dev and worker containers and prepend it to PYTHONPATH so the editable install resolves - pass --no-directory to poetry install in the Dockerfile so the build does not try to install the not-yet-copied editable sibling - ignore_missing_imports for variant_annotation.* in mypy: its editable import hook is unfollowable and it ships no py.typed, so treat it as untyped rather than maintaining drift-prone local stubs - add a Makefile with dev/test/lint/format targets
Rework the variant mapping job to persist its results into the new MappingRecord / Allele / MappingRecordAllele schema instead of MappedVariant, building on the closure tables for the Better Reverse Translation epic. - create one MappingRecord per variant (including failed variants, with null VRS data); successfully post-mapped variants additionally get-or-create an authoritative Allele and link it via an is_authoritative MappingRecordAllele - supersede a variant's prior live record through ValidTime supersede_with — retiring it and its allele links and inserting the new record under one timestamp — instead of flipping a current flag, so the old/new handoff is gap-free - require a TargetGeneMapping for every mapped score, raising on a miss rather than tolerating a null link, since dcd-mapping guarantees one - combine cis-phased members when deriving hgvs_assay_level - update the mapping job tests to assert against MappingRecord and the authoritative-allele link, including the gap-free supersession and cascade-retire of the prior link TODO#765: mapping is not idempotent, so each run always creates a new authoritative link.
Add a seqrepo_data_proxy to the worker context in both the standalone context and the on_job_start hook, so jobs performing VRS translation have a SeqRepo-backed data proxy available alongside the cdot HGVS data provider. Mirror it in the mock worker context used by tests.
Add a reverse_translate_variants_for_score_set worker job that builds the cross-level HGVS equivalence class for every mapped variant in a score set, persisting the candidates as non-authoritative alleles. - for each current authoritative MappingRecord, collapse the assay-level HGVS to its protein consequence and expand to all coding/genomic candidates via a single batched construct_equivalent_variants call from the variant-annotation library, run off-loop in a thread - resolve each record's coding (NM_) transcript from its target gene's cdna alignment, falling back to a batched UTA NP_→NM_ lookup for protein-level mappings; records with no coding transcript are skipped - translate each candidate to VRS and write it as a get-or-created Allele linked non-authoritatively to the record, deduping by vrs_digest and never relinking the record's authoritative allele - supersede the prior live derived links with the new set in one gap-free operation via ValidTime.supersede_live_where - record per-variant cross_level_translation annotation status (success / failed / skipped), retaining per-candidate translation errors as metadata; the job fails only when every variant fails - add WorkerCoordinateTranslator and NullTranscriptSource adapters for the library's translation ports, deferring AssemblyMapper init so its network calls do not fire under mocked unit tests - get_or_create_allele helper, job registration, and a pipeline dependency placing reverse translation after mapping and before CAR submission TODO#765: a re-run supersedes the whole derived set wholesale because re-mapping re-mints the records; idempotent records would let unchanged derived links stay live.
…n date Previously the cdna-transcript lookup was keyed by target_gene_id alone, which meant a re-mapped score set could bind a stale NM_ transcript from an earlier run rather than the one the current run emitted. - Key cdna_transcript_by_run on (target_gene_id, mapped_date); within a key the highest-id row wins, so a same-run replacement takes precedence. - Carry mapped_date through the MappingRecord query so each record anchors to its own run's cdna row. - Add _TranscriptResolutionSkipReason to distinguish recoverable skips (protein-coding target, transcript unresolved) from correct skips (non-coding/regulatory target, no protein consequence). Emit skip_category in annotation_metadata. - Add target_gene_id to _TranscriptResolution so skip classification can look up the target's TargetCategory. - Add Mapped[] type annotations to TargetGene.id and .category columns. - New tests: genomic-accession coding target RT, latest-cdna-row-within- run selection, stale-cdna-row isolation, skip classification (coding recoverable vs. regulatory correct), and cdna TGM persistence in the mapping job. Refs mavedb-api#763
Forward translation emits predicted protein consequences in parentheses (e.g. p.(Ala222Val)). The parens denote inference, not a distinct variant form, so normalize to the bare p. form before storing. Strings without prediction parens are returned unchanged. Includes parametrized unit tests.
…st collisions - Re-identify each component via normalize_and_identify after translate_from so the stored vrs_digest reflects current content, not a stale value cached by the reused AlleleTranslator's Merkle tree. Without this, distinct biological variants at the same position (e.g. A>C and A>T) share one digest and are merged by the digest-keyed get_or_create_allele. - Use ga4gh_identify with in_place="always" in identify_allele and identify_variation so an allele that already carries a translator-stamped id has it recomputed, not retained. - Coerce ReferenceLengthExpression -> LiteralSequenceExpression in normalize_and_identify, mirroring dcd_mapping's _rle_to_lse, so RT-built alleles hash identically to the mapper's authoritative alleles and dedup correctly across sources. - Add regression tests: stale-id overwrite, distinct-alt digest isolation, cis-phased ordering canonicalization, and RLE->LSE coercion.
…tighten comments - Emit the protein consequence (result.hgvs_p) as a protein-level member of the equivalence set alongside the coding/genomic candidates. Prediction parens (p.(Ala222Val)) are stripped via strip_protein_prediction_parens before translation and storage. Protein-assay inputs are excluded (the protein is already the authoritative allele). - Trim all inline and docstring comments to essential why; remove redundant prose throughout reverse_translation.py. - Add tests: protein allele persistence, skip-classification for all three skip categories (no_assay_level_hgvs, transcript_unresolved, no_coding_transcript).
…rom failures Previously, variant mapping success was determined solely by the presence of pre/post-mapped alleles. This conflated genuinely failed mappings with benign absences (intronic variants, no-protein-consequence variants) that legitimately produce no allele. - Add `MappingOutcome` enum to `lib/mapping/schema.py` mirroring `dcd_mapping.schemas.MappingOutcome`, with an `is_benign_absence` helper to classify INTRONIC and NO_PROTEIN_CONSEQUENCE outcomes - Rewrite the per-record outcome logic in `map_variants_for_score_set` to branch on the typed outcome rather than allele presence: MAPPED -> SUCCESS, benign absence -> SKIPPED, FAILED -> FAILED - Replace the `successful_mapped_variants` scalar with a typed `Counter[MappingOutcome]` that feeds `mapped_count`, `failed_count`, and `skipped_count` tallies in logs and final state decision - Derive `mapping_state` from genuine failures only; all-benign result sets are treated as complete, not failed - Raise `NonexistentMappingResultsError` when a score annotation has no `outcome` field (malformed/older payload) - Expand test coverage for intronic and no-protein-consequence scenarios, verifying correct status and failure-category assignment
… UTA-backed source WtCodonMode.ALL reads the reference codon via TranscriptSource.codon_at, which requires a real UTA connection. The previous NullTranscriptSource always returned None, silently breaking WT-codon resolution. - Extract `uta_transcript_source()` context manager into `translation_ports.py`; removes the ad-hoc UTA connection setup that was duplicated in the job and the now-deleted NullTranscriptSource - Scope the live UTA client around the full `run_in_executor` call so the connection outlives the synchronous executor block - Remove NullTranscriptSource; callers that only need transcript_for_protein (already resolved by the job) also benefit from the real client without extra cost - Add a TODO noting that non-substitution consequences (del/ins/delins/ fs/ext) are miscounted as FAILED rather than SKIPPED (#767)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.