A comprehensive pipeline for processing Oxford Nanopore Technologies sequencing data with base modifications (5mC) to generate methylation BED files and quality metrics.
This pipeline processes modified BAM (modBAM) files from Oxford Nanopore Technologies sequencing runs with methylation calling. It takes the output of the Guppy basecaller run with a methylation-aware (5mC) model and produces:
- Merged and aligned BAM files
- CpG methylation BED files
- Quality control reports
The output BED files can be used directly with the MethylSense package, which provides tools for differential methylation discovery and machine learning modelling of differentially methylated regions (DMRs). Once your BED files are generated, head over to the MethylSense repository to:
- Identify differentially methylated regions between conditions
- Build predictive models for diagnostic and prognostic testing
- Perform biomarker discovery using methylation patterns
- Create clinical classifiers based on cfDNA methylation signatures
MethylSense works directly with this pipeline's output, so the analysis can run from raw Nanopore data through to clinical insights.
The pipeline expects data from a Guppy basecaller run with methylation calling:
guppy_basecaller --disable_pings --compress_fastq \
-c dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_hac.cfg \
--num_callers 4 \
-i pod5_skip \
-s fastq_gpu_hac_mod \
-x 'auto' --bam_out --recursive --min_qscore 7 \
--barcode_kits 'SQK-NBD114-24'

Expected directory structure:

fastq_gpu_hac_mod/
├── pass/ # Passed QC reads (processed by default)
│ ├── barcode01/ # Sample directory (any naming)
│ │ └── *.bam # Modified BAM files with MM/ML tags
│ ├── barcode02/
│ │ └── *.bam
│ └── barcode03/
│ └── *.bam
├── fail/ # Failed QC reads (optional, use --include-fail)
│ ├── barcode01/
│ │ └── *.bam
│ └── ...
└── logs/ # Ignored
Directory structure notes:
- The pipeline looks for the `pass/` subdirectory by default
- Sample directories can have any name (barcode01, sample_name, etc.)
- Barcode information is extracted from BAM file headers, not directory names
- Output sample names are formatted as `<input_dir_name>_b##` (e.g., `sample_A_b01`)
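The layout and naming rules above can be sketched in a few lines of bash. This is a minimal illustration, not code taken from NanoporeToBED.sh: `list_sample_dirs` and `format_sample_name` are hypothetical helper names, and the barcode number is passed in by hand here, whereas the pipeline reads it from the BAM header.

```bash
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical and are not part of NanoporeToBED.sh.

# Print every sample directory under <input>/pass (and <input>/fail when the
# second argument is "yes") that contains at least one .bam file.
list_sample_dirs() {
    local input_dir=$1 include_fail=${2:-no}
    local qc_dirs=("$input_dir/pass")
    [ "$include_fail" = "yes" ] && qc_dirs+=("$input_dir/fail")

    local qc_dir sample_dir
    for qc_dir in "${qc_dirs[@]}"; do
        [ -d "$qc_dir" ] || continue
        for sample_dir in "$qc_dir"/*/; do
            # compgen -G succeeds only if the glob matches at least one file
            if compgen -G "${sample_dir}*.bam" > /dev/null; then
                printf '%s\n' "${sample_dir%/}"
            fi
        done
    done
}

# Build an output name of the form <input_dir_name>_b## from a directory name
# and a barcode number (forced to base 10 so that "08" is not read as octal).
format_sample_name() {
    local dir_name=$1 barcode=$2
    printf '%s_b%02d\n' "$dir_name" "$((10#$barcode))"
}
```

For example, `format_sample_name sample_A 1` prints `sample_A_b01`, and `list_sample_dirs /data/nanopore/fastq_gpu_hac_mod yes` would also list sample directories under `fail/`.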
# Clone the repository
git clone https://github.com/markusdrag/NanoporeToBED-Pipeline.git
cd NanoporeToBED-Pipeline
# Run the setup script
bash setup.sh
# Or specify a custom installation directory
bash setup.sh /path/to/install/location

The setup script will:
- Create all necessary directories
- Install the pipeline script
- Create the conda environment automatically
- Set up example files and documentation
# Clone the repository
git clone https://github.com/markusdrag/NanoporeToBED-Pipeline.git
cd NanoporeToBED-Pipeline
# Make the script executable
chmod +x NanoporeToBED.sh

Alternatively, download just the script:
wget https://raw.githubusercontent.com/markusdrag/NanoporeToBED-Pipeline/main/NanoporeToBED.sh
chmod +x NanoporeToBED.sh

The setup script creates the conda environment automatically, but for manual setup use the following environment.yml:
name: nanopore_methylation
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- python>=3.8
- samtools>=1.17
- minimap2>=2.26
- bioconda::ont-modkit>=0.3.0
- qualimap>=2.3
- pigz
- parallel

Install the environment:
# If you cloned the repository
conda env create -f environment.yml
conda activate nanopore_methylation
# Or if downloading script only, create environment directly
# (version specifiers are quoted so the shell does not treat ">" as a redirect)
conda create -n nanopore_methylation -c conda-forge -c bioconda \
"samtools>=1.17" "minimap2>=2.26" "ont-modkit>=0.3.0" "qualimap>=2.3" \
pigz parallel "python>=3.8"
conda activate nanopore_methylation

Or, using micromamba:

micromamba create -n nanopore_methylation -c conda-forge -c bioconda \
samtools minimap2 ont-modkit qualimap pigz parallel "python>=3.8"
micromamba activate nanopore_methylation

For the fastest setup on your HPC, use the automated setup:
# Clone and setup
git clone https://github.com/markusdrag/NanoporeToBED-Pipeline.git
cd NanoporeToBED-Pipeline
bash setup.sh
# Then activate and run
conda activate nanopore_methylation
sbatch NanoporeToBED.sh \
-i /data/nanopore/fastq_gpu_hac_mod \
-o /data/nanopore/methylation_results \
-ref /data/references/genome.fna \
-t 32

For manual setup:
# 1. Get the pipeline
git clone https://github.com/markusdrag/NanoporeToBED-Pipeline.git
cd NanoporeToBED-Pipeline
# 2. Load your HPC's module system (if available)
module load conda # or module load miniconda3
# 3. Create and activate environment
conda env create -f environment.yml
conda activate nanopore_methylation
# 4. Test the script
./NanoporeToBED.sh -h
# 5. Run on your data
sbatch NanoporeToBED.sh \
-i /data/nanopore/fastq_gpu_hac_mod \
-o /data/nanopore/methylation_results \
-ref /data/references/genome.fna \
-t 32

For processing samples from a single organism:
bash NanoporeToBED.sh \
-i /data/nanopore/fastq_gpu_hac_mod \
-o /data/nanopore/methylation_results \
-ref /data/references/genome.fna \
-t 40

The input directory should contain the pass/ folder from the Guppy output:
/data/nanopore/fastq_gpu_hac_mod/
├── pass/
│ ├── barcode01/
│ ├── barcode02/
│ └── ...
└── fail/ # Optional, use --include-fail to process
For processing mixed-species sequencing runs where different barcodes correspond to different organisms, use the multi-species flags:
bash NanoporeToBED.sh \
-i /data/nanopore/mixed_species \
-o /data/nanopore/methylation_results \
--multi-mapping "1:11,12:16" \
--multi-refs "/refs/pig.fna,/refs/penguin.fna" \
-t 40

This example processes barcodes 01-11 as pig samples and barcodes 12-16 as penguin samples, aligning each to the appropriate reference genome.
Multi-mapping formats:
- Range: `1:11` (barcodes 01 through 11)
- Single: `5` (barcode 05 only)
- Mixed: `1:5,10,15:20` (barcodes 01-05, 10, and 15-20)
The pipeline will display the barcode-to-reference mapping at startup and provide a summary of which samples used which reference genome at completion.
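As a rough illustration of how a mapping string can be resolved to a reference, the sketch below looks up the reference for one barcode. It assumes that the i-th comma-separated entry of `--multi-mapping` pairs with the i-th entry of `--multi-refs`, which matches the pig/penguin example but is not confirmed as the pipeline's exact parsing logic. `reference_for_barcode` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch only: reference_for_barcode is a hypothetical helper, and the
# positional pairing of mapping entries and references is an assumption.

# Print the reference paired with a barcode number.
# Returns 1 if no entry covers the barcode, 2 if the two lists differ in length.
reference_for_barcode() {
    local mapping=$1 refs=$2 barcode=$((10#$3))
    local -a groups ref_list
    IFS=',' read -r -a groups <<< "$mapping"
    IFS=',' read -r -a ref_list <<< "$refs"

    if [ "${#groups[@]}" -ne "${#ref_list[@]}" ]; then
        echo "mapping and reference counts differ" >&2
        return 2
    fi

    local i start end
    for i in "${!groups[@]}"; do
        start=${groups[i]%%:*}   # "5" stays "5"; "1:11" becomes "1"
        end=${groups[i]##*:}     # "5" stays "5"; "1:11" becomes "11"
        if (( barcode >= 10#$start && barcode <= 10#$end )); then
            echo "${ref_list[i]}"
            return 0
        fi
    done
    return 1
}
```

With the example above, `reference_for_barcode "1:11,12:16" "/refs/pig.fna,/refs/penguin.fna" 12` prints `/refs/penguin.fna`.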
Single-species mode:

| Flag | Long Form | Description | Required | Default |
|---|---|---|---|---|
| `-i` | `--input` | Input directory containing pass/fail folders with barcoded samples | Yes | - |
| `-o` | `--output` | Output directory for processed data | Yes | - |
| `-ref` | `--reference` | Path to reference genome FASTA file (.fna, .fa, or .fasta) | Yes | - |
| `-t` | `--threads` | Number of threads to use for processing | No | 40 |
| | `--dry-run` | Run in test mode without processing | No | false |
| | `--include-fail` | Also process samples from fail/ directory | No | false |
| | `--include-empty-barcodes` | Include barcode## directories (default: skip, only process named samples) | No | false |
| | `--expanded-plots` | Generate extended analysis plots (distribution, QC, comparative) | No | false |
| `-h` | `--help` | Show help message | No | - |

Multi-species mode:

| Flag | Long Form | Description | Required | Default |
|---|---|---|---|---|
| `-i` | `--input` | Input directory containing pass/fail folders with barcoded samples | Yes | - |
| `-o` | `--output` | Output directory for processed data | Yes | - |
| | `--multi-mapping` | Comma-separated barcode ranges (e.g., "1:11,12:16") | Yes | - |
| | `--multi-refs` | Comma-separated reference genome paths (e.g., "pig.fna,penguin.fna") | Yes | - |
| `-t` | `--threads` | Number of threads to use for processing | No | 40 |
| | `--dry-run` | Run in test mode without processing | No | false |
| | `--include-fail` | Also process samples from fail/ directory | No | false |
| | `--include-empty-barcodes` | Include barcode## directories (default: skip, only process named samples) | No | false |
| | `--expanded-plots` | Generate extended analysis plots (distribution, QC, comparative) | No | false |
| `-h` | `--help` | Show help message | No | - |

Note: Cannot mix single-species (-ref) and multi-species (--multi-mapping/--multi-refs) modes. Choose one mode per run.
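The mode rule in the note can be expressed as a small pre-flight check. The sketch below is illustrative only: `check_mode` is a hypothetical helper that takes the three option values as plain arguments, and the error messages are invented for the example rather than copied from the pipeline.

```bash
#!/usr/bin/env bash
# Sketch only: check_mode is a hypothetical helper, not part of NanoporeToBED.sh.

# Arguments: value of -ref, value of --multi-mapping, value of --multi-refs
# (pass an empty string for an option that was not given).
# Prints "single" or "multi" on success; returns 1 on an invalid combination.
check_mode() {
    local ref=$1 multi_mapping=$2 multi_refs=$3

    if [ -n "$ref" ] && { [ -n "$multi_mapping" ] || [ -n "$multi_refs" ]; }; then
        echo "error: -ref cannot be combined with --multi-mapping/--multi-refs" >&2
        return 1
    fi
    if [ -n "$ref" ]; then
        echo "single"
        return 0
    fi
    if [ -n "$multi_mapping" ] && [ -n "$multi_refs" ]; then
        echo "multi"
        return 0
    fi
    echo "error: give -ref, or both --multi-mapping and --multi-refs" >&2
    return 1
}
```

Calling `check_mode "" "1:11,12:16" "/refs/pig.fna,/refs/penguin.fna"` prints `multi`, while passing a reference together with either multi-species option returns a non-zero status.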
The reference genome should be a single FASTA file:
/path/to/reference/
└── genome.fna # Or genome.fa, genome.fasta

Output directory structure:

output_dir/
├── sample_A_b01/ # Sample directory (name_barcode)
│ ├── sample_A_b01.merged.bam # Merged BAM with methylation tags
│ ├── sample_A_b01.merged.bam.bai
│ ├── sample_A_b01.minimap.bam # Aligned BAM
│ ├── sample_A_b01.minimap.bam.bai
│ ├── sample_A_b01.CpG.bed # Methylation calls
│ ├── qualimap/ # QC reports
│ │ ├── qualimapReport.html
│ │ ├── genome_results.txt
│ │ └── raw_data_qualimapReport/
│ └── bam_list.txt # Processing manifest
├── sample_B_b02/
│ └── ...
└── logs/
├── pipeline_master_log_YYYYMMDD_HHMMSS.txt
├── sample_A_b01.log
└── sample_B_b02.log
- `.merged.bam`: Concatenated BAM files from all sequencing chunks, with methylation tags preserved
- `.minimap.bam`: BAM files re-aligned to the reference genome
- `.CpG.bed`: BED file with CpG methylation frequencies
  - Format: chromosome, start, end, modification_frequency, coverage, strand
  - Compatible with the MethylSense pipeline (see separate repository) and other methylation analysis tools
- `qualimap/`: HTML quality reports with coverage statistics and alignment metrics
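A quick way to sanity-check a `.CpG.bed` file is to count well-covered sites and average their methylation. The sketch below assumes the six-column, tab-separated layout listed above, with modification_frequency in column 4 and coverage in column 5; check the column positions in your own file before relying on it. `summarise_cpg_bed` is a hypothetical helper name, and the default coverage cut-off of 5 is an arbitrary choice for the example.

```bash
#!/usr/bin/env bash
# Sketch only: summarise_cpg_bed is a hypothetical helper. It assumes the
# column order chromosome, start, end, modification_frequency, coverage, strand.

# Print the number of CpG sites with coverage >= min_cov and their mean
# modification frequency (in whatever units column 4 uses), tab-separated.
summarise_cpg_bed() {
    local bed=$1 min_cov=${2:-5}
    awk -F'\t' -v min="$min_cov" '
        $1 ~ /^#/ { next }                    # skip header or comment lines
        $5 + 0 >= min { n++; sum += $4 }      # keep sites with enough coverage
        END {
            if (n > 0) printf "%d\t%.2f\n", n, sum / n
            else       printf "0\tNA\n"
        }' "$bed"
}
```

For example, `summarise_cpg_bed output_dir/sample_A_b01/sample_A_b01.CpG.bed 10` reports only sites covered by at least 10 reads.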
- BAM Merging: Combines multiple BAM files whilst preserving MM/ML methylation tags
- Alignment: Re-aligns reads to reference genome using minimap2 with tag preservation
- Methylation Calling: Extracts CpG methylation frequencies using modkit
- Quality Control: Generates comprehensive QC reports using Qualimap
- Summary Report: Generates statistics and visualisation plots (automatic, or standalone)
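The first four steps could look roughly like the command sequence printed by the sketch below. It is not an extract from NanoporeToBED.sh: `print_sample_plan` is a hypothetical helper that only prints commands, the choice of `samtools cat` for merging and of `map-ont` as the minimap2 preset are assumptions, and the real script may use different options. The tools and flags shown do exist in samtools, minimap2, modkit and Qualimap.

```bash
#!/usr/bin/env bash
# Sketch only: print_sample_plan is a hypothetical helper that prints one
# plausible command sequence for a sample. Nothing is executed.
print_sample_plan() {
    local sample=$1 ref=$2 threads=${3:-40}
    cat <<EOF
# 1. Concatenate the per-chunk BAMs listed in bam_list.txt
samtools cat -b bam_list.txt -o $sample.merged.bam
samtools index $sample.merged.bam
# 2. Re-align; -T MM,ML writes the tags into FASTQ comments and -y copies them back
samtools fastq -T MM,ML $sample.merged.bam | minimap2 -ax map-ont -y -t $threads $ref - | samtools sort -@ $threads -o $sample.minimap.bam
samtools index $sample.minimap.bam
# 3. Summarise CpG methylation per site
modkit pileup $sample.minimap.bam $sample.CpG.bed --cpg --ref $ref --threads $threads
# 4. Alignment QC
qualimap bamqc -bam $sample.minimap.bam -outdir qualimap -nt $threads
EOF
}
```

Running `print_sample_plan sample_A_b01 /data/references/genome.fna 32` prints the sequence for one sample, which can be compared against the per-sample log.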
The generate_summary.R script runs automatically at the end of the pipeline, or can be used standalone to regenerate plots or analyse existing output directories.
# Basic plots only
Rscript generate_summary.R /path/to/output_dir
# With expanded analysis plots
Rscript generate_summary.R /path/to/output_dir --expanded-plots

output_dir/
├── pipeline_summary.csv # Summary statistics table
└── plots/
├── basic/ # Always generated
│ ├── cpg_sites_per_sample.png/pdf
│ ├── mean_methylation_per_sample.png/pdf
│ ├── mean_coverage_per_sample.png/pdf
│ ├── total_reads_per_sample.png/pdf
│ └── summary_overview.png/pdf
├── distribution/ # With --expanded-plots
│ ├── methylation_distribution.png/pdf
│ └── coverage_distribution.png/pdf
├── qc/ # With --expanded-plots
│ ├── low_coverage_cpg_percent.png/pdf
│ └── strand_bias.png/pdf
├── biological/ # With --expanded-plots
│ ├── hyper_hypo_methylated_counts.png/pdf
│ └── methylation_by_chromosome.png/pdf
└── comparative/ # With --expanded-plots
├── sample_correlation_heatmap.png/pdf
└── pca_plot.png/pdf
Example plots: CpG Sites per Sample, Mean Methylation per Sample, Mean Coverage per Sample, Methylation Distribution, Strand Bias QC, Sample Correlation Heatmap.

Adjust SLURM parameters in the script header based on your data:
#SBATCH -c 40 # CPU cores (adjust based on availability)
#SBATCH --mem 192g # Memory (scale with data size)
#SBATCH --time=72:00:00 # Time limit (depends on dataset size)
#SBATCH --account YourAccount # Your HPC account

- Thread usage: The `-t` parameter sets the thread count (default: 40). It is automatically reduced if it exceeds the SLURM allocation
- Memory: ~4-8 GB per thread is recommended
- Storage: Ensure 3-5x input data size for temporary files
- Large datasets: Consider processing in batches or increasing time allocation
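The thread and memory guidance above can be checked before submitting a job. The sketch below is an assumption about how such a cap might work, not the pipeline's own code: it reads `SLURM_CPUS_PER_TASK`, which Slurm sets when `-c`/`--cpus-per-task` is given, and both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: effective_threads and memory_range_gb are hypothetical helpers.

# Print the requested thread count, capped at the SLURM CPU allocation when
# SLURM_CPUS_PER_TASK is set to a number.
effective_threads() {
    local requested=$1 allocated=${SLURM_CPUS_PER_TASK:-}
    if [[ $allocated =~ ^[0-9]+$ ]] && (( requested > allocated )); then
        echo "$allocated"
    else
        echo "$requested"
    fi
}

# Print the suggested memory range in GB for a thread count,
# using the 4-8 GB per thread guideline.
memory_range_gb() {
    local threads=$1
    echo "$((threads * 4))-$((threads * 8))"
}
```

For the default header (`-c 40`, `--mem 192g`), `memory_range_gb 40` prints `160-320`, so 192 GB (4.8 GB per thread) sits near the lower end of the recommended range.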
- Insufficient memory: Reduce thread count or increase memory allocation
- Corrupted BAM files: Script automatically skips problematic BAMs with warnings
- Missing reference: Verify reference genome path and file exists
- Timeout issues: Extend time limit or process fewer samples
- No samples found: Check input directory structure matches expected pattern (see error message for searched patterns)
# Check job status
squeue -u $USER
# Monitor SLURM output log
tail -f NanoporeToBED.out
# Check master log for overall progress
tail -f output_dir/logs/pipeline_master_log_*.txt
# Check individual sample logs
tail -f output_dir/logs/sample_name.log

The pipeline generates several QC checkpoints:
- File size validation (>100 MB threshold for merged files)
- BAM header integrity checks
- Alignment statistics via Qualimap
- Methylation coverage in BED files
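The file size checkpoint can be reproduced by hand when investigating a suspicious sample. The sketch below is a minimal version under stated assumptions: `file_at_least_mb` is a hypothetical helper, and it treats 1 MB as 1024 x 1024 bytes, which may differ from the definition the pipeline uses.

```bash
#!/usr/bin/env bash
# Sketch only: file_at_least_mb is a hypothetical helper, and the MB definition
# (1024 * 1024 bytes) is an assumption.

# Succeed if the file exists and is at least min_mb megabytes in size
# (default threshold: 100 MB).
file_at_least_mb() {
    local file=$1 min_mb=${2:-100}
    [ -f "$file" ] || return 1
    local bytes
    bytes=$(wc -c < "$file")
    (( bytes >= min_mb * 1024 * 1024 ))
}
```

A typical use is `file_at_least_mb sample_A_b01.merged.bam || echo "merged BAM is smaller than expected"`. For BAM integrity, `samtools quickcheck` is a common complementary check, although whether the pipeline uses it is not stated here.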
- Coverage depth: Check in Qualimap reports
- Mapping rate: Verify alignment efficiency
- Methylation sites: Number of CpG sites covered
- Read length distribution: Assess data quality
The pipeline works with any reference genome in FASTA format:
# Index your reference (optional, minimap2 will do this automatically)
minimap2 -d reference.mmi reference.fna
# Use in pipeline
sbatch NanoporeToBED.sh -ref /path/to/reference.fna ...

For multiple libraries, create a wrapper script:
#!/bin/bash
# Process multiple Guppy output directories
for lib in SRR00000{1..5}; do
sbatch NanoporeToBED.sh \
-i /data/nanopore/$lib/fastq_gpu_hac_mod \
-o /data/nanopore/methylation_results/$lib \
-ref /data/references/genome.fna \
-t 40
done

For processing multiple mixed-species runs:
#!/bin/bash
# Process multiple mixed-species sequencing runs
for run_id in Run001 Run002 Run003; do
sbatch NanoporeToBED.sh \
-i /data/nanopore/${run_id}/fastq_gpu_hac_mod \
-o /data/nanopore/methylation_results/${run_id} \
--multi-mapping "1:11,12:16" \
--multi-refs "/refs/pig.fna,/refs/penguin.fna" \
-t 40
done

If you use this pipeline, please cite:
Our methodology paper:
- Drag, M.H., Hvilsom, C., Poulsen, L.L., Jensen, H.E., Tahas, S.A., Leineweber, C., Cray, C., Bertelsen, M.F., Bojesen, A.M. (2025). New high accuracy diagnostics for avian Aspergillus fumigatus infection using Nanopore methylation sequencing of host cell-free DNA and machine learning prediction. bioRxiv 2025.04.11.648151. https://doi.org/10.1101/2025.04.11.648151
Software tools:
- Oxford Nanopore Technologies modkit
- Minimap2: Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18), 3094-3100.
- Samtools: Danecek, P., et al. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008.
- Qualimap: Okonechnikov, K., et al. (2016). Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics, 32(2), 292-294.
MIT Licence (see LICENCE file)
- Lead Developer: Markus Hodal Drag
- Email: markus.drag@sund.ku.dk
- Institution: University of Copenhagen
- GitHub: https://github.com/markusdrag
- ORCID: https://orcid.org/0000-0002-7412-6402
For questions, bug reports, or feature requests, please:
- Open an issue on GitHub: https://github.com/markusdrag/NanoporeToBED-Pipeline/issues
- Or contact via email for collaboration enquiries





