Skip to content

feat: normalize intronic variants#257

Open
tedil wants to merge 5 commits intomainfrom
feat/normalize-intronic-variants
Open

feat: normalize intronic variants#257
tedil wants to merge 5 commits intomainfrom
feat/normalize-intronic-variants

Conversation

@tedil
Copy link
Contributor

@tedil tedil commented Jan 30, 2026

… via roundtrip to genomic loc and back

Summary by CodeRabbit

  • New Features

    • Intronic-aware normalization for variants spanning introns with automatic fallback to standard normalization.
    • Bounded normalization option to normalize variants within an explicit boundary.
  • Improvements

    • Error messages now include more detailed diagnostic information for data access issues.
  • Documentation

    • Updated intron-related documentation to describe the new intronic normalization path.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Jan 30, 2026

📝 Walkthrough

Walkthrough

Implements intronic-aware normalization in Mapper (new private normalize_intronic used by maybe_normalize), adds Normalizer::normalize_bounded to normalize with explicit boundaries, and updates DataError formatting to include the inner error message; also bumps rust toolchain in Cargo.toml.

Changes

Cohort / File(s) Summary
Intronic-aware normalization
src/mapper/assembly.rs
Added intronic-aware flow in Mapper::maybe_normalize and a private normalize_intronic helper that converts transcript variants to genomic, computes exon/intron bounds, calls bounded normalization, and maps results back to transcript variants.
Bounded normalizer & usage
src/normalizer.rs
Added public Normalizer::normalize_bounded(&self, var: &HgvsVariant, boundary: Range<i32>) -> Result<HgvsVariant, Error> that uses an explicit boundary for normalization and preserves pre-checks and cds->tx mapping.
Error message formatting
src/mapper/error.rs
Changed Error::DataError error string to include the underlying error: "problem accessing data: {0}".
Toolchain metadata
Cargo.toml
Bumped rust-version from 1.80.0 to 1.82.0.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Mapper
    participant Genomic as GenomicVariant
    participant Normalizer as Normalizer<br/>(bounded)
    participant Transcript as TranscriptVariant

    Client->>Mapper: call maybe_normalize(var, normalize=true)
    alt variant spans intron
        Mapper->>Mapper: normalize_intronic(var)
        Mapper->>Genomic: convert transcript -> genomic variant
        Mapper->>Genomic: determine exon/strand and intron bounds
        Mapper->>Normalizer: normalize_bounded(genomic_var, bounds)
        Normalizer->>Normalizer: check_and_guard -> normalize_alleles(boundary)
        Normalizer-->>Mapper: normalized genomic variant
        Mapper->>Transcript: map genomic result -> transcript (g_to_c/g_to_n)
        Transcript-->>Client: normalized transcript variant
    else non-intronic or fallback
        Mapper->>Normalizer: normalize(var) (standard path)
        Normalizer-->>Client: normalized or original variant
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • stolpeo

Poem

🐰 Hopping through exons, I bound and I spin,
I turn genomic whispers back into kin.
With edges well-marked and errors now clear,
I nudge variants closer — a rabbitly cheer!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: normalize intronic variants' clearly and concisely summarizes the main change: adding normalization support for intronic variants through a new intronic normalization flow.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/normalize-intronic-variants

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link

codecov bot commented Jan 30, 2026

Codecov Report

❌ Patch coverage is 8.45070% with 65 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.81%. Comparing base (eb7a77f) to head (d2b05e6).
⚠️ Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
src/mapper/assembly.rs 11.11% 48 Missing ⚠️
src/normalizer.rs 0.00% 17 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #257      +/-   ##
==========================================
- Coverage   90.38%   89.81%   -0.57%     
==========================================
  Files          18       18              
  Lines        9606     9527      -79     
==========================================
- Hits         8682     8557     -125     
- Misses        924      970      +46     
Files with missing lines Coverage Δ
src/normalizer.rs 88.11% <0.00%> (-1.86%) ⬇️
src/mapper/assembly.rs 63.30% <11.11%> (-8.45%) ⬇️

... and 7 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/mapper/assembly.rs`:
- Around line 330-343: The exons returned by self.provider.get_tx_exons must be
explicitly sorted by alt_start_i before using them to derive strand/intron
bounds; modify the handling of the exons vector (the local exons variable) so
that right after retrieval and before the strand = exons.first() call you sort
the collection by exon.alt_start_i (ascending), then proceed to check for empty
and read .first(); this ensures downstream logic that computes intron boundaries
and uses alt_strand relies on properly ordered exons from both UTA and cdot/json
providers.
🧹 Nitpick comments (2)
src/mapper/assembly.rs (2)

345-362: Boundary defaults to 0..i32::MAX if variant doesn't fall within any intron gap.

If the loop at lines 353-362 doesn't find an intron containing the variant (e.g., if the variant spans an exon-intron boundary or the exon data is incomplete), boundary remains 0..i32::MAX. This would allow the variant to shuffle without restriction, which may not be the intended behavior for intronic normalization.

Consider returning an error or logging when no appropriate intron boundary is found:

♻️ Suggested improvement for boundary validation
         // find intron "containing" the variant to set boundaries
         let mut boundary = 0..i32::MAX;
+        let mut found_intron = false;
         // assumes exons are sorted (by alt_start_i)
         for i in 0..exons.len().saturating_sub(1) {
             let prev_exon_end = exons[i].alt_end_i;
             let next_exon_start = exons[i + 1].alt_start_i;

             // check if variant is roughly within this gap
             if var_start >= prev_exon_end && var_end <= next_exon_start {
                 boundary = prev_exon_end..next_exon_start;
+                found_intron = true;
                 break;
             }
         }
+        if !found_intron {
+            tracing::debug!(
+                "Could not find intron boundary for variant {} (range {}..{})",
+                var, var_start, var_end
+            );
+        }

346-346: Consider using a more specific error instead of Error::General.

Error::General provides minimal context. A more descriptive error would aid debugging when the genomic variant lacks a location range.

♻️ Suggested improvement
-        let var_range = var_g.loc_range().ok_or(Error::General)?;
+        let var_range = var_g
+            .loc_range()
+            .ok_or_else(|| Error::MissingGenomeIntervalPosition(format!("{}", var_g)))?;

@tedil tedil requested a review from holtgrewe January 30, 2026 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant