Skip to content

Fast and efficient conversion of Nanopore modBAM from guppy basecaller to BED files for differential methylation and machine learning predictions

License

Notifications You must be signed in to change notification settings

markusdrag/NanoporeToBED-Pipeline

Repository files navigation

NanoporeToBED Pipeline Logo

Nanopore Methylation Processing & BED Generation Pipeline

Part of the MethylSense Package

Version License: MIT DOI

A comprehensive pipeline for processing Oxford Nanopore Technologies sequencing data with base modifications (5mC) to generate methylation BED files and quality metrics.

Overview

This pipeline processes modified BAM (modBAM) files from Oxford Nanopore Technologies sequencing runs with methylation calling. It takes the output from Guppy basecaller with methylation awareness and produces:

  • Merged and aligned BAM files
  • CpG methylation BED files
  • Quality control reports

The output BED files are directly ready for use with the MethylSense package, which provides powerful tools for differential methylation discovery and machine learning modelling of differentially methylated regions (DMRs). Once your BED files are generated, head over to the MethylSense repository to:

  • Identify differentially methylated regions between conditions
  • Build predictive models for diagnostic and prognostic testing
  • Perform biomarker discovery using methylation patterns
  • Create clinical classifiers based on cfDNA methylation signatures

MethylSense seamlessly integrates with this pipeline's output for comprehensive methylation analysis from raw Nanopore data to clinical insights.

Prerequisites

Input Data Structure

The pipeline expects data from a Guppy basecaller run with methylation calling:

guppy_basecaller --disable_pings --compress_fastq \
  -c dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_hac.cfg \
  --num_callers 4 \
  -i pod5_skip \
  -s fastq_gpu_hac_mod \
  -x 'auto' --bam_out --recursive --min_qscore 7 \
  --barcode_kits 'SQK-NBD114-24'

Expected Input Directory Structure

fastq_gpu_hac_mod/
├── pass/                           # Passed QC reads (processed by default)
│   ├── barcode01/                  # Sample directory (any naming)
│   │   └── *.bam                   # Modified BAM files with MM/ML tags
│   ├── barcode02/
│   │   └── *.bam
│   └── barcode03/
│       └── *.bam
├── fail/                           # Failed QC reads (optional, use --include-fail)
│   ├── barcode01/
│   │   └── *.bam
│   └── ...
└── logs/                           # Ignored

Directory structure notes:

  • The pipeline looks for pass/ subdirectory by default
  • Sample directories can have any name (barcode01, sample_name, etc.)
  • Barcode information is extracted from BAM file headers, not directory names
  • Output sample names are formatted as: <input_dir_name>_b## (e.g., sample_A_b01)

Installation

Quick Automated Setup (Recommended for HPC)

# Clone the repository
git clone https://github.com/markusdrag/NanoporeToBED-Pipeline.git
cd NanoporeToBED-Pipeline

# Run the setup script
bash setup.sh

# Or specify a custom installation directory
bash setup.sh /path/to/install/location

The setup script will:

  • Create all necessary directories
  • Install the pipeline script
  • Create the conda environment automatically
  • Set up example files and documentation

Manual Installation

1. Download the Pipeline

# Clone the repository
git clone https://github.com/markusdrag/NanoporeToBED-Pipeline.git
cd NanoporeToBED-Pipeline

# Make the script executable
chmod +x NanoporeToBED.sh

Alternatively, download just the script:

wget https://raw.githubusercontent.com/markusdrag/NanoporeToBED-Pipeline/main/NanoporeToBED.sh
chmod +x NanoporeToBED.sh

2. Create Conda Environment

The setup script creates this automatically, but for manual setup:

name: nanopore_methylation
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python>=3.8
  - samtools>=1.17
  - minimap2>=2.26
  - bioconda::ont-modkit>=0.3.0
  - qualimap>=2.3
  - pigz
  - parallel

Install the environment:

# If you cloned the repository
conda env create -f environment.yml
conda activate nanopore_methylation

# Or if downloading script only, create environment directly
conda create -n nanopore_methylation -c conda-forge -c bioconda \
  samtools>=1.17 minimap2>=2.26 ont-modkit>=0.3.0 qualimap>=2.3 \
  pigz parallel python>=3.8
conda activate nanopore_methylation

3. Alternative: Using Micromamba

micromamba create -n nanopore_methylation -c conda-forge -c bioconda \
  samtools minimap2 ont-modkit qualimap pigz parallel python>=3.8
micromamba activate nanopore_methylation

Quick Start on HPC

For the fastest setup on your HPC, use the automated setup:

# Clone and setup
git clone https://github.com/markusdrag/NanoporeToBED-Pipeline.git
cd NanoporeToBED-Pipeline
bash setup.sh

# Then activate and run
conda activate nanopore_methylation
sbatch NanoporeToBED.sh \
  -i /data/nanopore/fastq_gpu_hac_mod \
  -o /data/nanopore/methylation_results \
  -ref /data/references/genome.fna \
  -t 32

For manual setup:

# 1. Get the pipeline
git clone https://github.com/markusdrag/NanoporeToBED-Pipeline.git
cd NanoporeToBED-Pipeline

# 2. Load your HPC's module system (if available)
module load conda  # or module load miniconda3

# 3. Create and activate environment
conda env create -f environment.yml
conda activate nanopore_methylation

# 4. Test the script
./NanoporeToBED.sh -h

# 5. Run on your data
sbatch NanoporeToBED.sh \
  -i /data/nanopore/fastq_gpu_hac_mod \
  -o /data/nanopore/methylation_results \
  -ref /data/references/genome.fna \
  -t 32

Usage

Single-Species Mode

For processing samples from a single organism:

bash NanoporeToBED.sh \
  -i /data/nanopore/fastq_gpu_hac_mod \
  -o /data/nanopore/methylation_results \
  -ref /data/references/genome.fna \
  -t 40

Where the input directory should contain your pass/ folder from Guppy output:

/data/nanopore/fastq_gpu_hac_mod/
├── pass/
│   ├── barcode01/
│   ├── barcode02/
│   └── ...
└── fail/           # Optional, use --include-fail to process

Multi-Species Mode

For processing mixed-species sequencing runs where different barcodes correspond to different organisms, use the multi-species flags:

bash NanoporeToBED.sh \
  -i /data/nanopore/mixed_species \
  -o /data/nanopore/methylation_results \
  --multi-mapping "1:11,12:16" \
  --multi-refs "/refs/pig.fna,/refs/penguin.fna" \
  -t 40

This example processes barcodes 01-11 as pig samples and barcodes 12-16 as penguin samples, aligning each to the appropriate reference genome.

Multi-mapping formats:

  • Range: 1:11 (barcodes 01 through 11)
  • Single: 5 (barcode 05 only)
  • Mixed: 1:5,10,15:20 (barcodes 01-05, 10, and 15-20)

The pipeline will display the barcode-to-reference mapping at startup and provide a summary of which samples used which reference genome at completion.

Parameters

Single-Species Mode

Flag Long Form Description Required Default
-i --input Input directory containing pass/fail folders with barcoded samples Yes -
-o --output Output directory for processed data Yes -
-ref --reference Path to reference genome FASTA file (.fna, .fa, or .fasta) Yes -
-t --threads Number of threads to use for processing No 40
--dry-run Run in test mode without processing No false
--include-fail Also process samples from fail/ directory No false
--include-empty-barcodes Include barcode## directories (default: skip, only process named samples) No false
--expanded-plots Generate extended analysis plots (distribution, QC, comparative) No false
-h --help Show help message No -

Multi-Species Mode

Flag Long Form Description Required Default
-i --input Input directory containing pass/fail folders with barcoded samples Yes -
-o --output Output directory for processed data Yes -
--multi-mapping Comma-separated barcode ranges (e.g., "1:11,12:16") Yes -
--multi-refs Comma-separated reference genome paths (e.g., "pig.fna,penguin.fna") Yes -
-t --threads Number of threads to use for processing No 40
--dry-run Run in test mode without processing No false
--include-fail Also process samples from fail/ directory No false
--include-empty-barcodes Include barcode## directories (default: skip, only process named samples) No false
--expanded-plots Generate extended analysis plots (distribution, QC, comparative) No false
-h --help Show help message No -

Note: Cannot mix single-species (-ref) and multi-species (--multi-mapping/--multi-refs) modes. Choose one mode per run.

Reference Genome Format

The reference genome should be a single FASTA file:

/path/to/reference/
└── genome.fna    # Or genome.fa, genome.fasta

Output Structure

output_dir/
├── sample_A_b01/                         # Sample directory (name_barcode)
│   ├── sample_A_b01.merged.bam           # Merged BAM with methylation tags
│   ├── sample_A_b01.merged.bam.bai
│   ├── sample_A_b01.minimap.bam          # Aligned BAM
│   ├── sample_A_b01.minimap.bam.bai
│   ├── sample_A_b01.CpG.bed              # Methylation calls
│   ├── qualimap/                              # QC reports
│   │   ├── qualimapReport.html
│   │   ├── genome_results.txt
│   │   └── raw_data_qualimapReport/
│   └── bam_list.txt                           # Processing manifest
├── sample_B_b02/
│   └── ...
└── logs/
    ├── pipeline_master_log_YYYYMMDD_HHMMSS.txt
    ├── sample_A_b01.log
    └── sample_B_b02.log

Output File Descriptions

  • .merged.bam: Concatenated BAM files from all sequencing chunks with methylation tags preserved
  • .minimap.bam: Re-aligned BAM files to reference genome
  • .CpG.bed: BED file with CpG methylation frequencies
    • Format: chromosome, start, end, modification_frequency, coverage, strand
    • Compatible with MethylSense pipeline (see separate repository) and other methylation analysis tools
  • qualimap/: HTML quality reports with coverage statistics and alignment metrics

Pipeline Steps

  1. BAM Merging: Combines multiple BAM files whilst preserving MM/ML methylation tags
  2. Alignment: Re-aligns reads to reference genome using minimap2 with tag preservation
  3. Methylation Calling: Extracts CpG methylation frequencies using modkit
  4. Quality Control: Generates comprehensive QC reports using Qualimap
  5. Summary Report: Generates statistics and visualisation plots (automatic, or standalone)

Summary Report (generate_summary.R)

The generate_summary.R script runs automatically at the end of the pipeline, or can be used standalone to regenerate plots or analyse existing output directories.

Standalone Usage

# Basic plots only
Rscript generate_summary.R /path/to/output_dir

# With expanded analysis plots
Rscript generate_summary.R /path/to/output_dir --expanded-plots

Output Structure

output_dir/
├── pipeline_summary.csv           # Summary statistics table
└── plots/
    ├── basic/                     # Always generated
    │   ├── cpg_sites_per_sample.png/pdf
    │   ├── mean_methylation_per_sample.png/pdf
    │   ├── mean_coverage_per_sample.png/pdf
    │   ├── total_reads_per_sample.png/pdf
    │   └── summary_overview.png/pdf
    ├── distribution/              # With --expanded-plots
    │   ├── methylation_distribution.png/pdf
    │   └── coverage_distribution.png/pdf
    ├── qc/                        # With --expanded-plots
    │   ├── low_coverage_cpg_percent.png/pdf
    │   └── strand_bias.png/pdf
    ├── biological/                # With --expanded-plots
    │   ├── hyper_hypo_methylated_counts.png/pdf
    │   └── methylation_by_chromosome.png/pdf
    └── comparative/               # With --expanded-plots
        ├── sample_correlation_heatmap.png/pdf
        └── pca_plot.png/pdf

Example Output Plots

CpG Sites per Sample

CpG Sites

Mean Methylation per Sample

Mean Methylation

Mean Coverage per Sample

Mean Coverage

Methylation Distribution

Methylation Distribution

Strand Bias QC

Strand Bias

Sample Correlation Heatmap

Correlation Heatmap

Running Tips

SLURM Configuration

Adjust SLURM parameters in the script header based on your data:

#SBATCH -c 40          # CPU cores (adjust based on availability)
#SBATCH --mem 192g     # Memory (scale with data size)
#SBATCH --time=72:00:00 # Time limit (depends on dataset size)
#SBATCH --account YourAccount # Your HPC account

Performance Optimisation

  • Thread usage: The -t parameter sets thread count (default: 40). Will automatically reduce if exceeds SLURM allocation
  • Memory: ~4-8 GB per thread is recommended
  • Storage: Ensure 3-5x input data size for temporary files
  • Large datasets: Consider processing in batches or increasing time allocation

Troubleshooting

  1. Insufficient memory: Reduce thread count or increase memory allocation
  2. Corrupted BAM files: Script automatically skips problematic BAMs with warnings
  3. Missing reference: Verify reference genome path and file exists
  4. Timeout issues: Extend time limit or process fewer samples
  5. No samples found: Check input directory structure matches expected pattern (see error message for searched patterns)

Monitoring Progress

# Check job status
squeue -u $USER

# Monitor SLURM output log
tail -f NanoporeToBED.out

# Check master log for overall progress
tail -f output_dir/logs/pipeline_master_log_*.txt

# Check individual sample logs
tail -f output_dir/logs/SRR*/*/sample_name.log

Quality Control Metrics

The pipeline generates several QC checkpoints:

  • File size validation (>100 MB threshold for merged files)
  • BAM header integrity checks
  • Alignment statistics via Qualimap
  • Methylation coverage in BED files

Key Metrics to Review

  • Coverage depth: Check in Qualimap reports
  • Mapping rate: Verify alignment efficiency
  • Methylation sites: Number of CpG sites covered
  • Read length distribution: Assess data quality

Advanced Configuration

Custom Reference Genomes

The pipeline works with any reference genome in FASTA format:

# Index your reference (optional, minimap2 will do this automatically)
minimap2 -d reference.mmi reference.fna

# Use in pipeline
sbatch NanoporeToBED.sh -ref /path/to/reference.fna ...

Batch Processing

Single-Species Batch Processing

For multiple libraries, create a wrapper script:

#!/bin/bash
# Process multiple Guppy output directories
for lib in SRR00000{1..5}; do
  sbatch NanoporeToBED.sh \
    -i /data/nanopore/fastq_gpu_hac_mod \
    -o /data/nanopore/methylation_results/$lib \
    -ref /data/references/genome.fna \
    -t 40
done

Multi-Species Batch Processing

For processing multiple mixed-species runs:

#!/bin/bash
# Process multiple mixed-species sequencing runs
for run_id in Run001 Run002 Run003; do
  sbatch NanoporeToBED.sh \
    -i /data/nanopore/${run_id}/fastq_gpu_hac_mod \
    -o /data/nanopore/methylation_results/${run_id} \
    --multi-mapping "1:11,12:16" \
    --multi-refs "/refs/pig.fna,/refs/penguin.fna" \
    -t 40
done

Citation

If you use this pipeline, please cite:

Our methodology paper:

  • Drag, M.H., Hvilsom, C., Poulsen, L.L., Jensen, H.E., Tahas, S.A., Leineweber, C., Cray, C., Bertelsen, M.F., Bojesen, A.M. (2025). New high accuracy diagnostics for avian Aspergillus fumigatus infection using Nanopore methylation sequencing of host cell-free DNA and machine learning prediction. bioRxiv 2025.04.11.648151. https://doi.org/10.1101/2025.04.11.648151

Software tools:

  • Oxford Nanopore Technologies modkit
  • Minimap2: Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18), 3094-3100.
  • Samtools: Danecek, P., et al. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008.
  • Qualimap: Okonechnikov, K., et al. (2016). Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics, 32(2), 292-294.

Licence

MIT Licence (see LICENCE file)

Contact

For questions, bug reports, or feature requests, please: