GitHub - markusdrag/NanoporeToBED-Pipeline: Fast and efficient conversion of Nanopore modBAM from guppy basecaller to BED files for differential methylation and machine learning predictions

Nanopore Methylation Processing & BED Generation Pipeline

Part of the MethylSense Package

A comprehensive pipeline for processing Oxford Nanopore Technologies sequencing data with base modifications (5mC) to generate methylation BED files and quality metrics.

Overview

This pipeline processes modified BAM (modBAM) files from Oxford Nanopore Technologies sequencing runs with methylation calling. It takes the output from Guppy basecaller with methylation awareness and produces:

Merged and aligned BAM files
CpG methylation BED files
Quality control reports

The output BED files are directly ready for use with the MethylSense package, which provides powerful tools for differential methylation discovery and machine learning modelling of differentially methylated regions (DMRs). Once your BED files are generated, head over to the MethylSense repository to:

Identify differentially methylated regions between conditions
Build predictive models for diagnostic and prognostic testing
Perform biomarker discovery using methylation patterns
Create clinical classifiers based on cfDNA methylation signatures

MethylSense seamlessly integrates with this pipeline's output for comprehensive methylation analysis from raw Nanopore data to clinical insights.

Prerequisites

Input Data Structure

The pipeline expects data from a Guppy basecaller run with methylation calling:

guppy_basecaller --disable_pings --compress_fastq \
  -c dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_hac.cfg \
  --num_callers 4 \
  -i pod5_skip \
  -s fastq_gpu_hac_mod \
  -x 'auto' --bam_out --recursive --min_qscore 7 \
  --barcode_kits 'SQK-NBD114-24'

Expected Input Directory Structure

fastq_gpu_hac_mod/
├── pass/                           # Passed QC reads (processed by default)
│   ├── barcode01/                  # Sample directory (any naming)
│   │   └── *.bam                   # Modified BAM files with MM/ML tags
│   ├── barcode02/
│   │   └── *.bam
│   └── barcode03/
│       └── *.bam
├── fail/                           # Failed QC reads (optional, use --include-fail)
│   ├── barcode01/
│   │   └── *.bam
│   └── ...
└── logs/                           # Ignored

Directory structure notes:

The pipeline looks for pass/ subdirectory by default
Sample directories can have any name (barcode01, sample_name, etc.)
Barcode information is extracted from BAM file headers, not directory names
Output sample names are formatted as: <input_dir_name>_b## (e.g., sample_A_b01)

Installation

Quick Automated Setup (Recommended for HPC)

# Clone the repository
git clone https://github.com/markusdrag/NanoporeToBED-Pipeline.git
cd NanoporeToBED-Pipeline

# Run the setup script
bash setup.sh

# Or specify a custom installation directory
bash setup.sh /path/to/install/location

The setup script will:

Create all necessary directories
Install the pipeline script
Create the conda environment automatically
Set up example files and documentation

Manual Installation

1. Download the Pipeline

# Clone the repository
git clone https://github.com/markusdrag/NanoporeToBED-Pipeline.git
cd NanoporeToBED-Pipeline

# Make the script executable
chmod +x NanoporeToBED.sh

Alternatively, download just the script:

wget https://raw.githubusercontent.com/markusdrag/NanoporeToBED-Pipeline/main/NanoporeToBED.sh
chmod +x NanoporeToBED.sh

2. Create Conda Environment

The setup script creates this automatically, but for manual setup:

name: nanopore_methylation
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python>=3.8
  - samtools>=1.17
  - minimap2>=2.26
  - bioconda::ont-modkit>=0.3.0
  - qualimap>=2.3
  - pigz
  - parallel

Install the environment:

# If you cloned the repository
conda env create -f environment.yml
conda activate nanopore_methylation

# Or if downloading script only, create environment directly
conda create -n nanopore_methylation -c conda-forge -c bioconda \
  samtools>=1.17 minimap2>=2.26 ont-modkit>=0.3.0 qualimap>=2.3 \
  pigz parallel python>=3.8
conda activate nanopore_methylation

3. Alternative: Using Micromamba

micromamba create -n nanopore_methylation -c conda-forge -c bioconda \
  samtools minimap2 ont-modkit qualimap pigz parallel python>=3.8
micromamba activate nanopore_methylation

Quick Start on HPC

For the fastest setup on your HPC, use the automated setup:

# Clone and setup
git clone https://github.com/markusdrag/NanoporeToBED-Pipeline.git
cd NanoporeToBED-Pipeline
bash setup.sh

# Then activate and run
conda activate nanopore_methylation
sbatch NanoporeToBED.sh \
  -i /data/nanopore/fastq_gpu_hac_mod \
  -o /data/nanopore/methylation_results \
  -ref /data/references/genome.fna \
  -t 32

For manual setup:

# 1. Get the pipeline
git clone https://github.com/markusdrag/NanoporeToBED-Pipeline.git
cd NanoporeToBED-Pipeline

# 2. Load your HPC's module system (if available)
module load conda  # or module load miniconda3

# 3. Create and activate environment
conda env create -f environment.yml
conda activate nanopore_methylation

# 4. Test the script
./NanoporeToBED.sh -h

# 5. Run on your data
sbatch NanoporeToBED.sh \
  -i /data/nanopore/fastq_gpu_hac_mod \
  -o /data/nanopore/methylation_results \
  -ref /data/references/genome.fna \
  -t 32

Usage

Single-Species Mode

For processing samples from a single organism:

bash NanoporeToBED.sh \
  -i /data/nanopore/fastq_gpu_hac_mod \
  -o /data/nanopore/methylation_results \
  -ref /data/references/genome.fna \
  -t 40

Where the input directory should contain your pass/ folder from Guppy output:

/data/nanopore/fastq_gpu_hac_mod/
├── pass/
│   ├── barcode01/
│   ├── barcode02/
│   └── ...
└── fail/           # Optional, use --include-fail to process

Multi-Species Mode

For processing mixed-species sequencing runs where different barcodes correspond to different organisms, use the multi-species flags:

bash NanoporeToBED.sh \
  -i /data/nanopore/mixed_species \
  -o /data/nanopore/methylation_results \
  --multi-mapping "1:11,12:16" \
  --multi-refs "/refs/pig.fna,/refs/penguin.fna" \
  -t 40

This example processes barcodes 01-11 as pig samples and barcodes 12-16 as penguin samples, aligning each to the appropriate reference genome.

Multi-mapping formats:

Range: 1:11 (barcodes 01 through 11)
Single: 5 (barcode 05 only)
Mixed: 1:5,10,15:20 (barcodes 01-05, 10, and 15-20)

The pipeline will display the barcode-to-reference mapping at startup and provide a summary of which samples used which reference genome at completion.

Parameters

Single-Species Mode

Flag	Long Form	Description	Required	Default
`-i`	`--input`	Input directory containing pass/fail folders with barcoded samples	Yes	-
`-o`	`--output`	Output directory for processed data	Yes	-
`-ref`	`--reference`	Path to reference genome FASTA file (.fna, .fa, or .fasta)	Yes	-
`-t`	`--threads`	Number of threads to use for processing	No	40
	`--dry-run`	Run in test mode without processing	No	false
	`--include-fail`	Also process samples from fail/ directory	No	false
	`--include-empty-barcodes`	Include barcode## directories (default: skip, only process named samples)	No	false
	`--expanded-plots`	Generate extended analysis plots (distribution, QC, comparative)	No	false
`-h`	`--help`	Show help message	No	-

Multi-Species Mode

Flag	Long Form	Description	Required	Default
`-i`	`--input`	Input directory containing pass/fail folders with barcoded samples	Yes	-
`-o`	`--output`	Output directory for processed data	Yes	-
	`--multi-mapping`	Comma-separated barcode ranges (e.g., "1:11,12:16")	Yes	-
	`--multi-refs`	Comma-separated reference genome paths (e.g., "pig.fna,penguin.fna")	Yes	-
`-t`	`--threads`	Number of threads to use for processing	No	40
	`--dry-run`	Run in test mode without processing	No	false
	`--include-fail`	Also process samples from fail/ directory	No	false
	`--include-empty-barcodes`	Include barcode## directories (default: skip, only process named samples)	No	false
	`--expanded-plots`	Generate extended analysis plots (distribution, QC, comparative)	No	false
`-h`	`--help`	Show help message	No	-

Note: Cannot mix single-species (-ref) and multi-species (--multi-mapping/--multi-refs) modes. Choose one mode per run.

Reference Genome Format

The reference genome should be a single FASTA file:

/path/to/reference/
└── genome.fna    # Or genome.fa, genome.fasta

Output Structure

output_dir/
├── sample_A_b01/                         # Sample directory (name_barcode)
│   ├── sample_A_b01.merged.bam           # Merged BAM with methylation tags
│   ├── sample_A_b01.merged.bam.bai
│   ├── sample_A_b01.minimap.bam          # Aligned BAM
│   ├── sample_A_b01.minimap.bam.bai
│   ├── sample_A_b01.CpG.bed              # Methylation calls
│   ├── qualimap/                              # QC reports
│   │   ├── qualimapReport.html
│   │   ├── genome_results.txt
│   │   └── raw_data_qualimapReport/
│   └── bam_list.txt                           # Processing manifest
├── sample_B_b02/
│   └── ...
└── logs/
    ├── pipeline_master_log_YYYYMMDD_HHMMSS.txt
    ├── sample_A_b01.log
    └── sample_B_b02.log

Output File Descriptions

.merged.bam: Concatenated BAM files from all sequencing chunks with methylation tags preserved
.minimap.bam: Re-aligned BAM files to reference genome
.CpG.bed: BED file with CpG methylation frequencies
- Format: chromosome, start, end, modification_frequency, coverage, strand
- Compatible with MethylSense pipeline (see separate repository) and other methylation analysis tools
qualimap/: HTML quality reports with coverage statistics and alignment metrics

Pipeline Steps

BAM Merging: Combines multiple BAM files whilst preserving MM/ML methylation tags
Alignment: Re-aligns reads to reference genome using minimap2 with tag preservation
Methylation Calling: Extracts CpG methylation frequencies using modkit
Quality Control: Generates comprehensive QC reports using Qualimap
Summary Report: Generates statistics and visualisation plots (automatic, or standalone)

Summary Report (generate_summary.R)

The generate_summary.R script runs automatically at the end of the pipeline, or can be used standalone to regenerate plots or analyse existing output directories.

Standalone Usage

# Basic plots only
Rscript generate_summary.R /path/to/output_dir

# With expanded analysis plots
Rscript generate_summary.R /path/to/output_dir --expanded-plots

Output Structure

output_dir/
├── pipeline_summary.csv           # Summary statistics table
└── plots/
    ├── basic/                     # Always generated
    │   ├── cpg_sites_per_sample.png/pdf
    │   ├── mean_methylation_per_sample.png/pdf
    │   ├── mean_coverage_per_sample.png/pdf
    │   ├── total_reads_per_sample.png/pdf
    │   └── summary_overview.png/pdf
    ├── distribution/              # With --expanded-plots
    │   ├── methylation_distribution.png/pdf
    │   └── coverage_distribution.png/pdf
    ├── qc/                        # With --expanded-plots
    │   ├── low_coverage_cpg_percent.png/pdf
    │   └── strand_bias.png/pdf
    ├── biological/                # With --expanded-plots
    │   ├── hyper_hypo_methylated_counts.png/pdf
    │   └── methylation_by_chromosome.png/pdf
    └── comparative/               # With --expanded-plots
        ├── sample_correlation_heatmap.png/pdf
        └── pca_plot.png/pdf

Example Output Plots

CpG Sites per Sample	Mean Methylation per Sample
Mean Coverage per Sample	Methylation Distribution
Strand Bias QC	Sample Correlation Heatmap

Running Tips

SLURM Configuration

Adjust SLURM parameters in the script header based on your data:

#SBATCH -c 40          # CPU cores (adjust based on availability)
#SBATCH --mem 192g     # Memory (scale with data size)
#SBATCH --time=72:00:00 # Time limit (depends on dataset size)
#SBATCH --account YourAccount # Your HPC account

Performance Optimisation

Thread usage: The -t parameter sets thread count (default: 40). Will automatically reduce if exceeds SLURM allocation
Memory: ~4-8 GB per thread is recommended
Storage: Ensure 3-5x input data size for temporary files
Large datasets: Consider processing in batches or increasing time allocation

Troubleshooting

Insufficient memory: Reduce thread count or increase memory allocation
Corrupted BAM files: Script automatically skips problematic BAMs with warnings
Missing reference: Verify reference genome path and file exists
Timeout issues: Extend time limit or process fewer samples
No samples found: Check input directory structure matches expected pattern (see error message for searched patterns)

Monitoring Progress

# Check job status
squeue -u $USER

# Monitor SLURM output log
tail -f NanoporeToBED.out

# Check master log for overall progress
tail -f output_dir/logs/pipeline_master_log_*.txt

# Check individual sample logs
tail -f output_dir/logs/SRR*/*/sample_name.log

Quality Control Metrics

The pipeline generates several QC checkpoints:

File size validation (>100 MB threshold for merged files)
BAM header integrity checks
Alignment statistics via Qualimap
Methylation coverage in BED files

Key Metrics to Review

Coverage depth: Check in Qualimap reports
Mapping rate: Verify alignment efficiency
Methylation sites: Number of CpG sites covered
Read length distribution: Assess data quality

Advanced Configuration

Custom Reference Genomes

The pipeline works with any reference genome in FASTA format:

# Index your reference (optional, minimap2 will do this automatically)
minimap2 -d reference.mmi reference.fna

# Use in pipeline
sbatch NanoporeToBED.sh -ref /path/to/reference.fna ...

Batch Processing

Single-Species Batch Processing

For multiple libraries, create a wrapper script:

#!/bin/bash
# Process multiple Guppy output directories
for lib in SRR00000{1..5}; do
  sbatch NanoporeToBED.sh \
    -i /data/nanopore/fastq_gpu_hac_mod \
    -o /data/nanopore/methylation_results/$lib \
    -ref /data/references/genome.fna \
    -t 40
done

Multi-Species Batch Processing

For processing multiple mixed-species runs:

#!/bin/bash
# Process multiple mixed-species sequencing runs
for run_id in Run001 Run002 Run003; do
  sbatch NanoporeToBED.sh \
    -i /data/nanopore/${run_id}/fastq_gpu_hac_mod \
    -o /data/nanopore/methylation_results/${run_id} \
    --multi-mapping "1:11,12:16" \
    --multi-refs "/refs/pig.fna,/refs/penguin.fna" \
    -t 40
done

Citation

If you use this pipeline, please cite:

Our methodology paper:

Drag, M.H., Hvilsom, C., Poulsen, L.L., Jensen, H.E., Tahas, S.A., Leineweber, C., Cray, C., Bertelsen, M.F., Bojesen, A.M. (2025). New high accuracy diagnostics for avian Aspergillus fumigatus infection using Nanopore methylation sequencing of host cell-free DNA and machine learning prediction. bioRxiv 2025.04.11.648151. https://doi.org/10.1101/2025.04.11.648151

Software tools:

Oxford Nanopore Technologies modkit
Minimap2: Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18), 3094-3100.
Samtools: Danecek, P., et al. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008.
Qualimap: Okonechnikov, K., et al. (2016). Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics, 32(2), 292-294.

Licence

MIT Licence (see LICENCE file)

Contact

Lead Developer: Markus Hodal Drag
Email: markus.drag@sund.ku.dk
Institution: University of Copenhagen
GitHub: https://github.com/markusdrag
ORCID: https://orcid.org/0000-0002-7412-6402

For questions, bug reports, or feature requests, please:

Open an issue on GitHub: https://github.com/markusdrag/NanoporeToBED-Pipeline/issues
Or contact via email for collaboration enquiries

Name		Name	Last commit message	Last commit date
Latest commit History 49 Commits
archived_logos		archived_logos
examples		examples
logos_for_inspiration		logos_for_inspiration
screenshots		screenshots
.DS_Store		.DS_Store
CHANGELOG_v1.5.0.md		CHANGELOG_v1.5.0.md
LICENSE		LICENSE
NanoporeToBED.sh		NanoporeToBED.sh
README.md		README.md
environment.yml		environment.yml
generate_demo_beds.R		generate_demo_beds.R
generate_summary.R		generate_summary.R
logo_dark_bg.png		logo_dark_bg.png
logo_white_bg.png		logo_white_bg.png
setup.sh		setup.sh

License

markusdrag/NanoporeToBED-Pipeline

Folders and files

Latest commit

History

Repository files navigation

Overview

Prerequisites

Input Data Structure

Expected Input Directory Structure

Installation

Quick Automated Setup (Recommended for HPC)

Manual Installation

1. Download the Pipeline

2. Create Conda Environment

3. Alternative: Using Micromamba

Quick Start on HPC

Usage

Single-Species Mode

Multi-Species Mode

Parameters

Single-Species Mode

Multi-Species Mode

Reference Genome Format

Output Structure

Output File Descriptions

Pipeline Steps

Summary Report (generate_summary.R)

Standalone Usage

Output Structure

Example Output Plots

Running Tips

SLURM Configuration

Performance Optimisation

Troubleshooting

Monitoring Progress

Quality Control Metrics

Key Metrics to Review

Advanced Configuration

Custom Reference Genomes

Batch Processing

Single-Species Batch Processing

Multi-Species Batch Processing

Citation

Licence

Contact

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 3

Languages