End-to-end NGS variant calling and functional annotation pipeline for Limulus polyphemus using GATK and SnpEff, with biological interpretation of high-impact immune and metabolic variants.

sivananth-m/Variant-Calling-and-Functional-Annotation-of-the-Limulus-polyphemus-Atlantic-Horseshoe-Crab-Genome

Variant Calling and Functional Annotation of the Limulus polyphemus (Atlantic Horseshoe Crab) Genome

🧬 Project Overview

This project implements an end-to-end Next-Generation Sequencing (NGS) pipeline to identify and interpret genetic variants in the Atlantic Horseshoe Crab (Limulus polyphemus). The species is a "living fossil" whose unique innate immune system is the basis of the Limulus Amebocyte Lysate (LAL) endotoxin test, so characterizing its genetic variation is critical for understanding its evolutionary resilience.

Key Findings:

  • Variant Discovery: Processed 9,877 variants from raw SRA sequencing data.
  • Functional Impact: Filtered down to 5 High-Confidence Loss-of-Function (LoF) candidates and 38 Missense variants.
  • Biological Insights:
    • Validated High-Impact variants in Serine Protease genes (central to the clotting immune response) and Metabolic Kinases.
    • Identified missense variation in Branched-chain Amino Acid metabolism and Glycan degradation pathways.

🛠️ Pipeline Workflow

The analysis followed GATK best practices and standard bioinformatics protocols:

  1. Data Acquisition: Raw paired-end reads retrieved from ENA/EMBL (Accession: SRR610297).
  2. QC & Trimming: fastp used for quality filtering and adapter removal.
  3. Alignment: Reads aligned to the NCBI RefSeq Limulus genome (GCF_000517525.1) using BWA-MEM.
  4. Post-Alignment Processing:
    • Read Group addition and Indexing.
    • Duplicate Removal: Picard MarkDuplicates used to mitigate PCR bias (Duplicate rate: 62.98%).
    • Alignment Stats: QC metrics generated via Samtools Flagstat.
  5. Variant Calling: Haplotype calling via GATK HaplotypeCaller (gVCF mode) followed by hard filtering.
  6. Annotation: Custom database built for SnpEff to predict functional effects (High/Moderate/Low impact).
  7. Downstream Analysis:
    • Functional characterization using DAVID (GO Terms & KEGG Pathways).
    • Manual curation of High-Impact candidates due to limited pathway annotation for non-model organisms.
  8. Validation: Visual inspection of read pileups using IGV to rule out sequencing artifacts.
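
The command-line portion of the workflow above (steps 2 to 5) can be sketched as a list of command invocations. This is a minimal illustration, not the repository's actual script: the file names, thread count, and read-group values are hypothetical, the reference is assumed to be already indexed, and the read group is attached during alignment rather than in a separate step for brevity. The `GenotypeGVCFs` step is an assumption as well; the README does not name it, but a gVCF has to be genotyped before site-level filtering can be applied.

```python
def build_pipeline_commands(sample, ref, r1, r2, threads=4):
    """Return the pipeline as a list of argv lists, in execution order.

    All output file names are hypothetical and derived from `sample`.
    """
    t1 = f"{sample}_R1.trim.fastq.gz"
    t2 = f"{sample}_R2.trim.fastq.gz"
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    metrics = f"{sample}.markdup_metrics.txt"
    gvcf = f"{sample}.g.vcf.gz"
    vcf = f"{sample}.vcf.gz"
    # bwa expects literal "\t" separators inside the read-group string.
    read_group = rf"@RG\tID:{sample}\tSM:{sample}\tPL:ILLUMINA"

    return [
        # Step 2: quality filtering and adapter removal on paired reads.
        ["fastp", "-i", r1, "-I", r2, "-o", t1, "-O", t2],
        # Step 3: alignment with BWA-MEM, read group attached via -R.
        ["bwa", "mem", "-t", str(threads), "-R", read_group,
         "-o", sam, ref, t1, t2],
        # Step 4: coordinate sort, duplicate marking, indexing.
        ["samtools", "sort", "-o", sorted_bam, sam],
        ["gatk", "MarkDuplicates", "-I", sorted_bam,
         "-O", dedup_bam, "-M", metrics],
        ["samtools", "index", dedup_bam],
        # Step 5: per-sample gVCF, then genotyping (assumed step).
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", dedup_bam,
         "-O", gvcf, "-ERC", "GVCF"],
        ["gatk", "GenotypeGVCFs", "-R", ref, "-V", gvcf, "-O", vcf],
    ]


if __name__ == "__main__":
    for cmd in build_pipeline_commands(
        "SRR610297", "reference.fa",
        "SRR610297_1.fastq.gz", "SRR610297_2.fastq.gz",
    ):
        print(" ".join(cmd))
```

Keeping each step as an argument list means the same structure can be printed for review or passed to `subprocess.run(cmd, check=True)` once the tools are installed.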

📊 Quality Control & Metrics

Rigorous QC was performed to ensure the validity of variant calls.

| Metric | Result | Interpretation |
| --- | --- | --- |
| Sequencing Quality | Phred > 35 | Excellent quality across full read length (see Fig 1). |
| Alignment Rate | 92.6% | High mapping efficiency to the reference genome. |
| Duplication Rate | 62.98% | Managed via Picard removal steps to prevent false positive calls. |
| Variant Density | 1 per 92 kb | Indicative of low heterozygosity or stringent filtering. |
| Ts/Tv Ratio | 1.60 | Consistent with transition/transversion rates in invertebrate sequencing. |
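
For reference, the Ts/Tv ratio reported above is the count of transitions (A<->G, C<->T) divided by the count of transversions (all other single-base changes). The sketch below shows that calculation on a list of (REF, ALT) pairs. The function names and the example input are illustrative only and are not taken from the repository.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref, alt}
    return pair == PURINES or pair == PYRIMIDINES


def ts_tv_ratio(alleles):
    """Compute Ts/Tv from (REF, ALT) pairs, ignoring anything that is
    not a single-base substitution between A, C, G and T."""
    ts = tv = 0
    for ref, alt in alleles:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # skips indels, ambiguous bases, and no-ops
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio undefined")
    return ts / tv


if __name__ == "__main__":
    # Made-up example: 3 transitions, 2 transversions, 1 deletion (skipped).
    example = [("A", "G"), ("C", "T"), ("G", "A"),
               ("A", "C"), ("G", "T"), ("AT", "A")]
    print(ts_tv_ratio(example))
```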

Figure 1: Per-base sequence quality for Read 1 and Read 2.


🔬 Biological Interpretation

Due to the small number of high-impact variants (n=5) and the non-model status of Limulus, broad pathway enrichment was limited. Analysis focused on gene-centric characterization.

1. Loss-of-Function Candidates (High Impact)

Variants predicted to disrupt protein function (Stop Gain, Frameshift) were identified in critical physiological regulators:

  • Immune Defense: Serine protease nudel-like (LOC106462125).
    • Significance: The Limulus immune coagulation cascade relies on serine proteases. A LoF variant here may alter the organism's response to endotoxins.
  • Energy Homeostasis: Phosphorylase b kinase gamma (LOC106478142).
    • Significance: Key enzyme for glycogen mobilization. Disruption suggests potential impacts on rapid energy release during stress.
  • Mitochondrial Quality: Metalloendopeptidase OMA1.
    • Significance: Involved in mitochondrial stress response and protein quality control.
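
Selecting candidates like these amounts to reading the `ANN` field that SnpEff writes into the VCF INFO column and keeping the entries whose impact is `HIGH`. The sketch below assumes the standard pipe-delimited ANN layout (allele, effect, impact, gene name, ...). The helper names and the example record are hypothetical and do not reproduce any variant from this project.

```python
def parse_ann(info):
    """Yield one dict per SnpEff annotation found in a VCF INFO string."""
    for field in info.split(";"):
        if not field.startswith("ANN="):
            continue
        for entry in field[len("ANN="):].split(","):
            parts = entry.split("|")
            if len(parts) < 4:
                continue  # malformed entry, skip it
            yield {
                "allele": parts[0],
                "effects": parts[1].split("&"),  # combined effects use '&'
                "impact": parts[2],
                "gene": parts[3],
            }


def genes_with_impact(info_fields, impact="HIGH"):
    """Return the sorted gene names that carry at least one annotation
    with the requested impact level."""
    genes = set()
    for info in info_fields:
        for ann in parse_ann(info):
            if ann["impact"] == impact:
                genes.add(ann["gene"])
    return sorted(genes)


if __name__ == "__main__":
    # Hypothetical INFO strings, for illustration only.
    records = [
        "AC=1;ANN=T|stop_gained|HIGH|GENE_A|GENE_A|transcript|TX1|"
        "protein_coding|3/10|c.100C>T|p.Gln34*|||||",
        "AC=2;ANN=G|missense_variant|MODERATE|GENE_B|GENE_B|transcript|TX2|"
        "protein_coding|1/4|c.5A>G|p.Lys2Arg|||||",
    ]
    print(genes_with_impact(records, "HIGH"))
    print(genes_with_impact(records, "MODERATE"))
```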

2. Missense Variants & Adaptation

Missense variants (amino acid changes) were mapped to functional pathways using KEGG and GO terms:

  • Metabolic Adaptation (KEGG):
    • Valine, leucine and isoleucine degradation: Isovaleryl-CoA dehydrogenase.
    • Glycan/Sphingolipid Metabolism: Beta-hexosaminidase subunit alpha.
    • Significance: Variation in these metabolic enzymes suggests evolutionary fine-tuning of nutrient processing and lysosomal recycling.
  • Structural Adaptation (GO Terms):
    • Cuticle protein 16.8-like: Variants in chitin-matrix proteins imply adaptation of the exoskeleton properties (hardness/flexibility).

📸 Visual Validation

Figure 2: IGV Snapshot of Serine Protease Variant (LOC106462125)

Analyst Note: The pileup shows the alternate allele as a clean vertical column at the variant position, supported by reads on both forward and reverse strands. This pattern supports a true biological variant rather than a sequencing artifact.


📂 Repository Contents

  • /01_qc: Quality control reports (Fastp) and quality score plots.
  • /02_alignment: Alignment statistics (Flagstat, MarkDup metrics).
  • /03_variants: SnpEff summary HTML reports and variant distribution plots.
  • /04_analysis:
    • functional_tables/: DAVID output tables for GO Terms and KEGG pathways.
    • gene_lists/: Filtered lists of candidate genes.

Methodological Rationale

  • Hard filtering was used in place of GATK's VQSR, which needs large training call sets that a single-sample, non-model-organism design cannot supply.
  • SnpEff chosen over VEP due to custom genome support and successful database build.
  • DAVID used as it supports Entrez IDs from NCBI annotations.
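
To make the hard-filtering choice concrete, the sketch below applies annotation thresholds to one variant's INFO values. The README does not state which thresholds were used, so the values here are an assumption: they are GATK's published generic starting points for SNPs. The function name is illustrative.

```python
# Assumed thresholds: GATK's generic SNP hard-filter recommendations.
# The project's actual cutoffs are not given in this README.
SNP_HARD_FILTERS = {
    "QD": lambda v: v < 2.0,               # low quality by depth
    "FS": lambda v: v > 60.0,              # strand bias (Fisher)
    "MQ": lambda v: v < 40.0,              # low mapping quality
    "SOR": lambda v: v > 3.0,              # strand bias (odds ratio)
    "MQRankSum": lambda v: v < -12.5,
    "ReadPosRankSum": lambda v: v < -8.0,
}


def failed_filters(info):
    """Return the names of the filters a variant fails.

    `info` maps annotation names to floats. Annotations missing from
    the record are skipped rather than treated as failures.
    """
    return [
        name
        for name, fails in SNP_HARD_FILTERS.items()
        if name in info and fails(info[name])
    ]


if __name__ == "__main__":
    # Made-up annotation values for two variants.
    print(failed_filters({"QD": 25.1, "FS": 1.2, "MQ": 60.0, "SOR": 0.7}))
    print(failed_filters({"QD": 1.5, "FS": 72.0, "MQ": 58.0}))
```

A variant with an empty result passes; any other result lists the reasons it would be flagged, which mirrors how named filters appear in the VCF FILTER column.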

📝 Conclusion

This project identified high-confidence genetic variants in Limulus polyphemus. While pathway enrichment was limited by the small number of high-impact candidates, gene-level analysis revealed predicted loss-of-function and missense variants in immune coagulation (Serine Protease) and metabolic regulation genes. These findings provide specific gene candidates for future studies investigating the immunological evolution of this ancient species.