Variant Calling and Functional Annotation of the Limulus polyphemus (Atlantic Horseshoe Crab) Genome
This project implements an end-to-end Next-Generation Sequencing (NGS) pipeline to identify and interpret genetic variants in the Atlantic Horseshoe Crab (Limulus polyphemus). The species is often described as a "living fossil", and its unique innate immune system is the source of Limulus Amebocyte Lysate (LAL), so characterizing its genetic variation is critical for understanding its evolutionary resilience.
Key Findings:
- Variant Discovery: Processed 9,877 variants from raw SRA sequencing data.
- Functional Impact: Filtered down to 5 High-Confidence Loss-of-Function (LoF) candidates and 38 Missense variants.
- Biological Insights:
  - Validated High-Impact variants in Serine Protease genes (central to the clotting immune response) and Metabolic Kinases.
  - Identified missense variation in Branched-chain Amino Acid metabolism and Glycan degradation pathways.
The analysis followed GATK best practices and standard bioinformatics protocols:
- Data Acquisition: Raw paired-end reads retrieved from ENA/EMBL (Accession: SRR610297).
- QC & Trimming: `fastp` used for quality filtering and adapter removal.
- Alignment: Reads aligned to the NCBI RefSeq Limulus genome (`GCF_000517525.1`) using `BWA-MEM`.
- Post-Alignment Processing:
  - Read Group addition and Indexing.
  - Duplicate Removal: `Picard MarkDuplicates` used to mitigate PCR bias (Duplicate rate: 62.9%).
  - Alignment Stats: QC metrics generated via `Samtools Flagstat`.
- Variant Calling: Haplotype calling via `GATK HaplotypeCaller` (gVCF mode) followed by hard filtering.
- Annotation: Custom database built for `SnpEff` to predict functional effects (High/Moderate/Low impact).
- Downstream Analysis:
  - Functional characterization using DAVID (GO Terms & KEGG Pathways).
  - Manual curation of High-Impact candidates due to limited pathway annotation for non-model organisms.
- Validation: Visual inspection of read pileups using IGV to rule out sequencing artifacts.
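
The steps above can be sketched as an ordered list of command lines. This is an illustration under stated assumptions, not the exact commands used in this project: the file names, thread count, and read-group values are hypothetical, the read group is set through `bwa mem -R` rather than a separate tool, and MarkDuplicates is invoked through the GATK4 wrapper that bundles Picard. The sketch only builds the command strings; it does not run them.

```python
from typing import List


def build_pipeline(sample: str, ref: str, threads: int = 4) -> List[str]:
    """Return shell command strings for one paired-end sample, in run order."""
    # Hypothetical file names; adapt to the real project layout.
    r1, r2 = f"{sample}_1.fastq.gz", f"{sample}_2.fastq.gz"
    t1, t2 = f"{sample}_1.trim.fastq.gz", f"{sample}_2.trim.fastq.gz"
    # Read group passed to BWA; the PL value is an assumption.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    gvcf = f"{sample}.g.vcf.gz"
    return [
        # QC and trimming.
        f"fastp -i {r1} -I {r2} -o {t1} -O {t2} "
        f"--html {sample}.fastp.html --json {sample}.fastp.json",
        # Reference indexes needed by BWA and GATK.
        f"bwa index {ref}",
        f"samtools faidx {ref}",
        f"gatk CreateSequenceDictionary -R {ref}",
        # Alignment piped into a coordinate sort.
        f"bwa mem -t {threads} -R '{read_group}' {ref} {t1} {t2} "
        f"| samtools sort -@ {threads} -o {sorted_bam} -",
        # Duplicate removal, then indexing and alignment statistics.
        f"gatk MarkDuplicates -I {sorted_bam} -O {dedup_bam} "
        f"-M {sample}.markdup.metrics.txt --REMOVE_DUPLICATES true",
        f"samtools index {dedup_bam}",
        f"samtools flagstat {dedup_bam} > {sample}.flagstat.txt",
        # Per-sample gVCF, then genotyping to a plain VCF.
        f"gatk HaplotypeCaller -R {ref} -I {dedup_bam} -O {gvcf} -ERC GVCF",
        f"gatk GenotypeGVCFs -R {ref} -V {gvcf} -O {sample}.vcf.gz",
    ]


if __name__ == "__main__":
    for command in build_pipeline("SRR610297", "reference.fa"):
        print(command)
```

Keeping the commands as data makes it easy to log them or hand them to a workflow manager; hard filtering and SnpEff annotation would follow the last command.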
Rigorous QC was performed to ensure the validity of variant calls.
| Metric | Result | Interpretation |
|---|---|---|
| Sequencing Quality | Phred > 35 | Excellent quality across full read length (see Fig 1). |
| Alignment Rate | 92.6% | High mapping efficiency to the reference genome. |
| Duplication Rate | 62.98% | Managed via Picard removal steps to prevent false positive calls. |
| Variant Density | 1 variant per 92 kb | Indicative of low heterozygosity or stringent filtering. |
| Ts/Tv Ratio | 1.60 | Consistent with transition/transversion rates in invertebrate sequencing. |
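
As an illustration of how the Ts/Tv value in the table can be derived, the sketch below counts transitions (A<->G, C<->T) and transversions among biallelic SNPs in VCF text lines. It is a minimal example, not the tool used to produce the reported 1.60; the function names and the toy records are hypothetical.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref: str, alt: str) -> bool:
    """True when both bases are purines or both are pyrimidines."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(vcf_lines: Iterable[str]) -> Tuple[int, int, float]:
    """Return (transitions, transversions, ratio) for biallelic SNPs."""
    ts = tv = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF headers
        fields = line.rstrip("\n").split("\t")
        ref, alt = fields[3].upper(), fields[4].upper()
        # Indels and multiallelic sites (ALT like "A,T") are ignored.
        if ref not in BASES or alt not in BASES:
            continue
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    ratio = ts / tv if tv else float("nan")
    return ts, tv, ratio


# Toy records (invented for illustration only).
example = [
    "##fileformat=VCFv4.2",
    "chr1\t10\t.\tA\tG\t50\tPASS\t.",    # transition
    "chr1\t20\t.\tC\tT\t50\tPASS\t.",    # transition
    "chr1\t30\t.\tA\tC\t50\tPASS\t.",    # transversion
    "chr1\t40\t.\tAT\tA\t50\tPASS\t.",   # indel, skipped
    "chr1\t50\t.\tG\tA,T\t50\tPASS\t.",  # multiallelic, skipped
]
print(ts_tv_ratio(example))  # (2, 1, 2.0)
```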
Figure 1: Per-base Sequence Quality (Read 1 & 2)

Due to the small number of high-impact variants (n=5) and the non-model status of Limulus, broad pathway enrichment was limited. Analysis focused on gene-centric characterization.
Variants predicted to disrupt protein function (Stop Gain, Frameshift) were identified in critical physiological regulators:
- Immune Defense: Serine protease nudel-like (LOC106462125).
  - Significance: The Limulus immune coagulation cascade relies on serine proteases. A LoF variant here may alter the organism's response to endotoxins.
- Energy Homeostasis: Phosphorylase b kinase gamma (LOC106478142).
  - Significance: Key enzyme for glycogen mobilization. Disruption suggests potential impacts on rapid energy release during stress.
- Mitochondrial Quality: Metalloendopeptidase OMA1.
  - Significance: Involved in mitochondrial stress response and protein quality control.
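
Candidates like those above are typically pulled from the `ANN` field that SnpEff writes into the VCF INFO column, where each annotation is a `|`-separated record whose third sub-field is the impact (HIGH, MODERATE, LOW, MODIFIER). The sketch below shows one way to extract HIGH-impact genes; it is not the script used in this project, and the transcript ID and HGVS strings in the example are invented placeholders.

```python
from typing import Dict, List

# Leading ANN sub-fields in SnpEff's documented order (key names are ours).
ANN_KEYS = [
    "allele", "effect", "impact", "gene_name", "gene_id",
    "feature_type", "feature_id", "biotype", "rank", "hgvs_c", "hgvs_p",
]


def parse_ann(info: str) -> List[Dict[str, str]]:
    """Split the ANN entry of a VCF INFO string into one dict per annotation."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            return [
                dict(zip(ANN_KEYS, raw.split("|")))
                for raw in entry[len("ANN="):].split(",")
            ]
    return []  # record carries no SnpEff annotation


def high_impact_genes(info: str) -> List[str]:
    """Return the sorted gene names that have at least one HIGH-impact effect."""
    return sorted(
        {a["gene_name"] for a in parse_ann(info) if a.get("impact") == "HIGH"}
    )


# Hypothetical INFO string; only the LOC symbol comes from this report.
info = (
    "DP=30;ANN="
    "T|stop_gained|HIGH|LOC106462125|LOC106462125|transcript|XM_EXAMPLE.1|"
    "protein_coding|3/10|c.100C>T|p.Gln34*,"
    "T|upstream_gene_variant|MODIFIER|GENE_B|GENE_B|transcript|XM_EXAMPLE.2|"
    "protein_coding||c.-50C>T|"
)
print(high_impact_genes(info))  # ['LOC106462125']
```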
Missense variants (amino acid changes) were mapped to functional pathways using KEGG and GO terms:
- Metabolic Adaptation (KEGG):
  - Valine, leucine and isoleucine degradation: Isovaleryl-CoA dehydrogenase.
  - Glycan/Sphingolipid Metabolism: Beta-hexosaminidase subunit alpha.
  - Significance: Variation in these metabolic enzymes suggests evolutionary fine-tuning of nutrient processing and lysosomal recycling.
- Structural Adaptation (GO Terms):
  - Cuticle protein 16.8-like: Variants in chitin-matrix proteins imply adaptation of the exoskeleton properties (hardness/flexibility).
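
Mapping genes to KEGG and GO terms through DAVID requires Entrez Gene IDs. For NCBI placeholder symbols of the form `LOC` + GeneID (such as LOC106462125), the ID can be read directly from the symbol. The helper below is a small sketch of that conversion; the function names are hypothetical, and named genes (the `OMA1` entry in the example is a stand-in) would still need a lookup against the NCBI annotation.

```python
import re
from typing import Iterable, List, Optional, Tuple

# NCBI placeholder symbols are "LOC" followed by the numeric GeneID.
LOC_PATTERN = re.compile(r"^LOC(\d+)$")


def loc_to_entrez(symbol: str) -> Optional[str]:
    """Return the Entrez Gene ID encoded in a LOC symbol, or None."""
    match = LOC_PATTERN.match(symbol.strip())
    return match.group(1) if match else None


def split_for_david(symbols: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate symbols into (Entrez IDs, symbols needing manual lookup)."""
    entrez_ids: List[str] = []
    unresolved: List[str] = []
    for symbol in symbols:
        gene_id = loc_to_entrez(symbol)
        if gene_id is None:
            unresolved.append(symbol)
        else:
            entrez_ids.append(gene_id)
    return entrez_ids, unresolved


ids, todo = split_for_david(["LOC106462125", "LOC106478142", "OMA1"])
print(ids)   # ['106462125', '106478142']
print(todo)  # ['OMA1']
```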
Figure 2: IGV Snapshot of Serine Protease Variant (LOC106462125)

Analyst Note: The pileup shows a clean vertical column of variant-supporting reads on both forward and reverse strands, which is consistent with a true biological variant rather than a sequencing artifact.
- `/01_qc`: Quality control reports (Fastp) and quality score plots.
- `/02_alignment`: Alignment statistics (Flagstat, MarkDup metrics).
- `/03_variants`: SnpEff summary HTML reports and variant distribution plots.
- `/04_analysis`:
  - `functional_tables/`: DAVID output tables for GO Terms and KEGG pathways.
  - `gene_lists/`: Filtered lists of candidate genes.
- Hard filtering was used due to single-sample design.
- SnpEff chosen over VEP due to custom genome support and successful database build.
- DAVID used as it supports Entrez IDs from NCBI annotations.
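
To make the hard-filtering choice concrete, the sketch below applies threshold tests to a SNP's INFO annotations. The thresholds actually used in this project are not stated here; the values shown are GATK's generic documented starting points for SNPs and are an assumption, as are the filter names.

```python
import operator
from typing import Callable, Dict, List, Tuple

# (filter name, INFO key, failing comparison, threshold).
# Assumed values: GATK's generic hard-filter suggestions for SNPs.
SNP_HARD_FILTERS: List[Tuple[str, str, Callable[[float, float], bool], float]] = [
    ("QD2", "QD", operator.lt, 2.0),
    ("FS60", "FS", operator.gt, 60.0),
    ("SOR3", "SOR", operator.gt, 3.0),
    ("MQ40", "MQ", operator.lt, 40.0),
    ("MQRankSum-12.5", "MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum-8", "ReadPosRankSum", operator.lt, -8.0),
]


def failed_filters(annotations: Dict[str, float]) -> List[str]:
    """Return the names of filters a SNP fails; an empty list means PASS."""
    failed = []
    for name, key, fails, threshold in SNP_HARD_FILTERS:
        value = annotations.get(key)
        # A missing annotation is not treated as a failure.
        if value is not None and fails(value, threshold):
            failed.append(name)
    return failed


print(failed_filters({"QD": 1.5, "FS": 10.0, "MQ": 60.0}))   # ['QD2']
print(failed_filters({"QD": 20.0, "FS": 70.0, "SOR": 4.0}))  # ['FS60', 'SOR3']
print(failed_filters({"QD": 25.0, "MQ": 60.0}))              # []
```

Fixed thresholds like these need no training set of known variants, which is why they suit a single-sample, non-model design.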
This project identified high-confidence genetic variants in Limulus polyphemus. While pathway enrichment was limited by the small number of high-impact candidates, gene-level analysis revealed predicted loss-of-function variants in genes for immune coagulation (Serine Protease) and metabolic regulation. These findings provide specific gene candidates for future studies of the immunological evolution of this ancient species.