Juno processes Illumina paired-end sequencing data for Oropouche virus (OROV) genome assembly. It supports reference-based and de novo assembly modes, with comprehensive QC, taxonomic classification, and assembly evaluation.
Disclaimer: Results of this pipeline are intended for research use only and were obtained by procedures that were not CLIA validated.
$ nextflow run juno.nf -profile singularity -params-file params.yaml
$ sbatch ./juno.sh
- Nextflow 25.04.0+
- Singularity or Docker
- Python 3.10+
- Slurm (only required when running on HiPerGator)
Allocate at least 16 CPU cores and 64 GB RAM when using the viral Kraken2 database, or 16 CPU cores and 200 GB RAM (or more) when using larger Kraken2 databases.
$ git clone https://github.com/BPHL-Molecular/Juno.git
$ cd Juno
$ mkdir fastq
# move or copy your FASTQ files into this directory
Note: FASTQ files must follow the Illumina naming format: *_L001_R{1,2}_*.fastq.gz (e.g., sample_name_L001_R1_001.fastq.gz and sample_name_L001_R2_001.fastq.gz)
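Because files that do not match the naming format may not be picked up, it can help to check the fastq directory before launching a run. The sketch below is one way to do that in Python; the regular expression is an interpretation of the `*_L001_R{1,2}_*.fastq.gz` pattern, and the function name `pair_fastq_names` is hypothetical, not part of Juno.

```python
import re
from pathlib import Path

# Assumed reading of the naming format above:
# <sample>_L001_R1_<suffix>.fastq.gz and <sample>_L001_R2_<suffix>.fastq.gz
ILLUMINA_NAME = re.compile(
    r"^(?P<sample>.+)_L001_R(?P<read>[12])_(?P<suffix>.+)\.fastq\.gz$"
)


def pair_fastq_names(names):
    """Group file names by sample and report names or samples that look wrong."""
    pairs, rejected = {}, []
    for name in sorted(names):
        match = ILLUMINA_NAME.match(name)
        if match is None:
            rejected.append(name)  # does not follow the naming format
            continue
        pairs.setdefault(match["sample"], {})["R" + match["read"]] = name
    # A sample is usable only when both R1 and R2 are present.
    incomplete = sorted(s for s, reads in pairs.items() if set(reads) != {"R1", "R2"})
    return pairs, rejected, incomplete


if __name__ == "__main__":
    fastq_dir = Path("fastq")  # the directory created with mkdir above
    names = [p.name for p in fastq_dir.glob("*.fastq.gz")] if fastq_dir.is_dir() else []
    pairs, rejected, incomplete = pair_fastq_names(names)
    print(f"{len(pairs)} sample(s) found")
    print(f"names not matching the format: {rejected}")
    print(f"samples missing a mate: {incomplete}")
```

Run from the Juno directory, the script lists any file names that would need renaming and any samples with only one read file.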
3. (Optional) Conda environment installation (Nextflow and Docker/Singularity/Apptainer must be installed on your system)
# Create conda environment
$ git clone https://github.com/BPHL-Molecular/Juno.git
$ cd Juno
$ conda create -n juno -c conda-forge python=3.10
# Activate the environment and run the pipeline with your preferred profile
$ conda activate juno
$ nextflow run juno.nf -profile apptainer -params-file params.yaml
Important: All pipeline parameters must be set in the params.yaml file. Edit this file to provide the correct paths and values before running the pipeline.
# Input/Output paths
input_dir: "/path/to/fastq"
output_dir: "/path/to/juno_output"
# Assembly mode: 'reference' or 'denovo'
assembly_mode: "denovo"
# Kraken2 database path
kraken2_db: "/path/to/kraken2/database"
# Human read removal using NCBI's SRA human read removal tool (HRRT)
skip_hrrt: false
# Assembly polishing with Pilon (only used in de novo mode)
polish_contigs: true
For read classification, you will need to download the Kraken2/Bracken viral database from Ben Langmead's Index Zone.
Please be aware that using larger databases (e.g., Standard, PlusPF) will require significantly more memory resources. Ensure your system has sufficient memory allocated or adjust resource parameters in nextflow.config or juno.sh (if using Slurm/HiPerGator) accordingly.
The pipeline includes a step for removing human reads using NCBI's SRA Human Read Removal Tool (HRRT). This step is enabled by default, but please note that it significantly increases runtime due to the large container size and the intensive I/O involved in decompressing input files and recompressing cleaned outputs.
To skip this step, set: skip_hrrt: true
Note: Skipping HRRT may be appropriate for:
- Non-human samples
- Pre-cleaned datasets
- Testing/development workflows
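Since every setting lives in params.yaml, a quick sanity check before launching can catch leftover placeholders or a mistyped mode. The following is a minimal sketch, not part of Juno: the function name `check_params` is hypothetical, and treating the four path/mode keys as required is an assumption based on the example file above.

```python
VALID_MODES = {"reference", "denovo"}
# Assumption: these keys from the example params.yaml are all required.
REQUIRED_KEYS = ("input_dir", "output_dir", "assembly_mode", "kraken2_db")


def check_params(params):
    """Return a list of human-readable problems found in a params mapping."""
    problems = []
    for key in REQUIRED_KEYS:
        value = params.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key} is missing or empty")
        elif value.startswith("/path/to/"):
            problems.append(f"{key} still holds the placeholder {value!r}")
    mode = params.get("assembly_mode")
    if isinstance(mode, str) and mode.strip() and mode not in VALID_MODES:
        problems.append(f"assembly_mode must be 'reference' or 'denovo', got {mode!r}")
    for flag in ("skip_hrrt", "polish_contigs"):
        if flag in params and not isinstance(params[flag], bool):
            problems.append(f"{flag} must be true or false")
    # polish_contigs is documented as only used in de novo mode.
    if mode == "reference" and params.get("polish_contigs") is True:
        problems.append("note: polish_contigs has no effect in reference mode")
    return problems
```

The mapping can come from any YAML loader; for example, if PyYAML happens to be installed, `yaml.safe_load(open("params.yaml"))` returns a dictionary that can be passed straight to `check_params`.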
flowchart TD
Start([FASTQ Files]) --> FastqScan[FASTQ-SCAN<br/>Raw Read Stats]
FastqScan --> Trimmomatic[TRIMMOMATIC<br/>Quality Trimming]
Trimmomatic --> BBDukAdapters[BBDUK<br/>Adapter Removal]
BBDukAdapters --> BBDukPhiX[BBDUK<br/>PhiX Removal]
BBDukPhiX --> HRRTCheck{Skip HRRT?}
HRRTCheck -->|No| HRRT[HRRT<br/>Human Read Removal]
HRRTCheck -->|Yes| CleanReads[Clean Reads]
HRRT --> CleanReads
CleanReads --> Fastp[FASTP<br/>Final QC Report]
CleanReads --> Kraken2[KRAKEN2<br/>Taxonomic Classification]
Kraken2 --> KrakenTools[KRAKENTOOLS<br/>Extract OROV Reads]
KrakenTools --> ModeCheck{Assembly Mode?}
ModeCheck -->|Reference| BWA[BWA<br/>Align to Reference]
BWA --> Samtools[SAMTOOLS<br/>BAM Processing]
Samtools --> IvarVariants[IVAR<br/>Variant Calling]
Samtools --> IvarConsensus[IVAR<br/>Consensus Generation]
IvarConsensus --> QuastRef[QUAST<br/>Assembly Evaluation]
QuastRef --> SummaryRef[Summary Report<br/>Reference Mode]
ModeCheck -->|De Novo| Spades[SPADES<br/>De Novo Assembly]
Spades --> BwaValidate[BWA<br/>Validate Assembly]
BwaValidate --> SamtoolsDenovo[SAMTOOLS<br/>Validation Stats]
SamtoolsDenovo --> PolishCheck{Polish Contigs?}
PolishCheck -->|Yes| Pilon[PILON<br/>Polish Assembly]
Pilon --> TrimTerminals[Trim Low-Coverage<br/>Terminal Regions]
TrimTerminals --> Blast[BLAST<br/>Classify Contigs]
PolishCheck -->|No| Blast
Blast --> ClassifyContigs[CLASSIFY_CONTIGS<br/>Segment Assignment]
ClassifyContigs --> QuastDenovo[QUAST<br/>Per-Segment Evaluation]
QuastDenovo --> SummaryDenovo[Summary Report<br/>De Novo Mode]
SummaryRef --> MultiQC[MULTIQC<br/>Aggregate QC Report]
SummaryDenovo --> MultiQC
MultiQC --> End([Pipeline Complete])
style Start fill:#e1f5e1
style End fill:#ffe1e1
style ModeCheck fill:#fff4e1
style HRRTCheck fill:#fff4e1
style PolishCheck fill:#fff4e1
style MultiQC fill:#e1e5ff
The pipeline runs in one assembly mode at a time, set via the assembly_mode parameter in params.yaml. Choose the mode appropriate for your data and objectives:
Best for:
- Targeted amplicon sequencing data
- Samples with known, closely related reference genomes
- Detecting specific variants and generating consensus sequences
- Standard surveillance and outbreak investigations
- When a high-quality reference genome is available
Outputs:
- Per-segment consensus sequences aligned to reference coordinates
- Variant calls (SNVs/indels)
Best for:
- Untargeted or metagenomic sequencing data
- Samples with divergent or unknown variants
- Discovery of novel sequences or reassortants
- When the reference genome may not represent sample diversity
- Exploratory analysis of viral populations
Outputs:
- Assembled contigs
- Polished contigs (if contig polishing was enabled)
- Contigs classified by genome segment (L, M, S)
output_dir/
├── fastq_scan/             # Raw read statistics
├── dehosted/               # Cleaned reads (if HRRT enabled)
├── trimmomatic/            # Trimmed reads
├── bbduk/                  # Adapter and PhiX removal statistics
│   ├── bbduk_adapters/     # Adapter removal outputs
│   └── bbduk_phix/         # PhiX removal outputs
├── fastp/                  # Final QC reports
├── kraken2/                # Classification results
├── krakentools/            # Filtered OROV reads
├── alignments/             # SAM/BAM files & indices
├── stats/                  # Alignment statistics (coverage, depth, flagstat, markdup)
├── variants/               # Variant calls (iVar)
├── consensus/              # Consensus sequences (iVar)
├── quast/                  # Assembly metrics
├── multiqc/                # Combined QC report
└── summary_report.tsv      # Summary report for all samples
output_dir/
├── fastq_scan/             # Raw read statistics
├── dehosted/               # Cleaned reads (if HRRT enabled)
├── trimmomatic/            # Trimmed reads
├── bbduk/                  # Adapter and PhiX removal statistics
│   ├── bbduk_adapters/     # Adapter removal outputs
│   └── bbduk_phix/         # PhiX removal outputs
├── fastp/                  # Final QC reports
├── kraken2/                # Classification results
├── krakentools/            # Filtered OROV reads
├── spades/                 # SPAdes assembly outputs
├── blast/                  # BLAST database and results
│   ├── blast_db/           # BLAST reference database files
│   └── sample_id/          # Per-sample BLAST results
├── pilon/                  # Polished assemblies (if polish_contigs enabled)
│   └── sample_id/          # Per-sample Pilon outputs
├── trimmed_contigs/        # Terminal-trimmed contigs (if polish_contigs enabled)
├── samtools/               # Validation alignment statistics
│   └── sample_id/          # Per-sample validation BAM files and stats
├── assemblies/             # Classified contigs by genome segment
│   └── sample_id/          # Per-sample directories
│       ├── sample_L.fasta
│       ├── sample_M.fasta
│       ├── sample_S.fasta
│       ├── sample_unassigned.fasta
│       └── sample_classification_summary.txt
├── quast/                  # Assembly metrics (per segment)
├── multiqc/                # Combined QC report
└── summary_report.tsv      # Summary report for all samples
- PASS: Coverage ≥90% AND depth ≥15x AND "N" bases ≤5%
- PASS_W_HIGH_N_BASES: Coverage ≥90% AND depth ≥15x BUT "N" bases >5%
- FAIL: Coverage <90% OR depth <15x
- ASSEMBLED: Contigs were assembled and the largest contig is within 90-150% of the reference length
- FRAGMENTED: Contigs exist but the largest contig is outside 90-150% of the reference length
- NO_ASSEMBLY: No contigs assembled or classified for segment
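The status rules for both modes can be written as two small functions, which makes the boundary cases explicit. This is an illustrative sketch of the thresholds listed above, not Juno's own code: the function and argument names are hypothetical, the 90% and 150% bounds are assumed to be inclusive, and the length threshold is taken to be the reference length of the segment.

```python
def reference_status(coverage_pct, depth, n_pct):
    """Reference mode: apply the coverage, depth, and N-content thresholds."""
    if coverage_pct < 90 or depth < 15:
        return "FAIL"
    return "PASS" if n_pct <= 5 else "PASS_W_HIGH_N_BASES"


def denovo_status(contig_lengths, reference_length):
    """De novo mode: compare the largest contig with the reference length."""
    if not contig_lengths:
        return "NO_ASSEMBLY"
    largest = max(contig_lengths)
    # Integer arithmetic keeps the 90% and 150% boundaries exact for integer lengths.
    in_window = 90 * reference_length <= 100 * largest <= 150 * reference_length
    return "ASSEMBLED" if in_window else "FRAGMENTED"
```

For instance, `reference_status(99.2, 250, 7.5)` gives `PASS_W_HIGH_N_BASES`, and with a made-up reference length of 1000 bases, `denovo_status([400, 350], 1000)` gives `FRAGMENTED`.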
Juno is made possible thanks to these bioinformatics tools:
- fastq-scan - raw read statistics
- trimmomatic - quality trimming
- bbduk - adapter and PhiX removal
- sra-human-scrubber (HRRT) - human read removal
- fastp - final QC report
- kraken2 - taxonomic classification
- krakentools - extract classified reads
- bwa - read alignment
- samtools - SAM/BAM processing
- ivar - variant calling & consensus generation
- spades - de novo genome assembly
- pilon - assembly polishing
- blast - contig classification
- quast - assembly evaluation
- multiqc - aggregate QC reporting
Pipeline Errors: Check Nextflow execution logs in .nextflow.log
Low Coverage Regions (Reference Mode): Regions with low coverage (<10x) will be filled with 'N' in consensus sequences.
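To make the masking rule concrete, the sketch below applies it to a toy sequence with made-up per-base depths. It only illustrates the <10x rule stated above; the consensus in the pipeline is produced by iVar, and the function name `mask_low_coverage` is hypothetical.

```python
def mask_low_coverage(sequence, depths, min_depth=10):
    """Replace every base whose depth is below min_depth with 'N'."""
    if len(sequence) != len(depths):
        raise ValueError("sequence and depths must have the same length")
    return "".join(
        base if depth >= min_depth else "N"
        for base, depth in zip(sequence, depths)
    )


# Toy example: positions with depth 9, 0, and 3 are masked.
print(mask_low_coverage("ACGTAC", [12, 10, 9, 0, 30, 3]))  # ACNNAN
```

Long runs of N in a consensus therefore point to regions where too few reads aligned, which also raises the "N" base percentage used for the PASS_W_HIGH_N_BASES status.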
De Novo Assembly Issues:
- Low contig counts may indicate insufficient OROV reads
- Check classification summary for unassigned contigs
- BLAST identity/coverage thresholds: ≥85% identity, ≥70% coverage
- FRAGMENTED status indicates assembly exists but quality criteria not met (largest contig outside 90-150% of reference length)
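When a contig ends up in the unassigned file, it helps to see how the BLAST thresholds act on individual hits. The following sketch shows one plausible way such a filter could work; it is not Juno's CLASSIFY_CONTIGS implementation. The `BlastHit` fields, the function name, and the tie-breaking rule (highest identity, then highest coverage) are all assumptions made for illustration.

```python
from typing import NamedTuple


class BlastHit(NamedTuple):
    contig: str          # contig name
    segment: str         # segment of the matched reference: "L", "M", or "S"
    identity_pct: float  # percent identity of the hit
    coverage_pct: float  # percent coverage of the hit


def assign_segments(hits, min_identity=85.0, min_coverage=70.0):
    """Map each contig to the segment of its best passing hit, else 'unassigned'."""
    assignments = {}
    best_score = {}
    for hit in hits:
        assignments.setdefault(hit.contig, "unassigned")
        if hit.identity_pct < min_identity or hit.coverage_pct < min_coverage:
            continue  # hit fails the documented thresholds
        score = (hit.identity_pct, hit.coverage_pct)
        if score > best_score.get(hit.contig, (-1.0, -1.0)):
            best_score[hit.contig] = score
            assignments[hit.contig] = hit.segment
    return assignments
```

Under these assumptions, a contig with 84.9% identity stays unassigned even at 99% coverage, which is the kind of case worth checking in the classification summary.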
We welcome contributions to make Juno better! Feel free to open issues or submit pull requests to suggest any additional features or enhancements!
Email: bphl-sebioinformatics@flhealth.gov
Juno is licensed under the MIT License.