RAGNAROK: RApid GeNe Annotation (ROcKs)

RAGNAROK is a Nextflow-implemented pipeline for rapid genome annotation using multiple lines of evidence.

At its core, Ragnarok aligns RNA evidence in the form of Illumina short reads, long reads (PacBio/ONT), or a combination of the two. Protein alignments are then performed against likely coding sequences. Helixer-predicted genes are combined with all RNA and protein-based models (as well as any user-supplied existing annotations) and selectively filtered by Mikado for the best transcript models at overlapping loci.

For more information and benchmarking, see the RAGNAROK pre-print.




requirements

software prerequisites

nextflow (22.10.4+)
apptainer (1.1.8+)

Ragnarok is built for Linux-based systems and relies on Nextflow and Apptainer (formerly known as Singularity). Both must be installed and available in your working environment prior to use.

See the wiki on Installing Nextflow and Apptainer for more information.
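As a quick pre-flight check, the minimum versions listed above can be compared against what is installed. The sketch below is not part of Ragnarok: the helper names `version_ge` and `check_tool` are hypothetical, it assumes GNU `sort -V` is available, and it assumes the first dotted number printed by each tool is its version.

```shell
#!/usr/bin/env bash
# Minimum versions taken from the prerequisites list above.
MIN_NEXTFLOW="22.10.4"
MIN_APPTAINER="1.1.8"

# version_ge A B: succeed when version A >= version B (GNU sort -V ordering).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# check_tool NAME MINIMUM VERSION_FLAG: report whether NAME is installed and new enough.
check_tool() {
    local name="$1" minimum="$2" flag="$3" found
    if ! command -v "$name" >/dev/null 2>&1; then
        echo "$name: not found on PATH"
        return 1
    fi
    # Assumption: the first dotted number in the output is the version.
    found="$("$name" "$flag" 2>&1 | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)"
    if [ -z "$found" ]; then
        echo "$name: could not parse a version number"
        return 1
    fi
    if version_ge "$found" "$minimum"; then
        echo "$name: $found (ok, need >= $minimum)"
    else
        echo "$name: $found (too old, need >= $minimum)"
        return 1
    fi
}

check_tool nextflow  "$MIN_NEXTFLOW"  -version  || true
check_tool apptainer "$MIN_APPTAINER" --version || true
```

The `|| true` keeps the script going so that both tools are always reported, even when the first one is missing.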

Ragnarok is publicly available through its GitHub page and can be cloned anywhere you'd like.

mkdir -p ~/nextflow && cd ~/nextflow # optional location, matches examples below
git clone https://github.com/ryandkuster/ragnarok

required files

| parameter | type | description |
| --- | --- | --- |
| --genome | .fna | Provide your assembly (ideally with simple names if using EDTA masking option). |
| --ill | string | Required if pb or ont not used. Path to a directory containing paired end fastq files. Can end with directory name (Ex: path/to/files/) or a specific prefix of the paired files (Ex: path/to/files/reads_P1). Using specific prefix name will search for the paired end read files ending in R1.[fq |
| --pb | string | Required if ill not used. Path to a directory containing PacBio long read fastq files. Can end with directory name (Ex: path/to/files/) or a specific prefix of the read files (Ex: path/to/files/reads_P1). Using specific prefix name will search for the read files ending in [fq |
| --ont | string | Required if ill not used. Path to a directory containing ONT long read fastq files. Can end with directory name (Ex: path/to/files/) or a specific prefix of the read files (Ex: path/to/files/reads_P1). Using specific prefix name will search for the read files ending in [fq |
| --protein | .faa | Protein file for miniprot (e.g., closest ref species). |
| --homology | .faa | Mikado protein homology file (e.g., uniprot 33090 for viridiplantae). |
| --scoring | .yaml | Mikado scoring file (e.g., plant.yaml). |
| --design | .tsv | Mikado configuration table file (see Mikado documentation and below). |

The mikado2 stage of the pipeline requires a configuration table that weights the input gene models and sets model priority. For this --design input, Ragnarok expects the mandatory aliases hx, st, mp, and tr for the Helixer, StringTie, miniprot, and TransDecoder models produced along the way. Leave the file field blank for any model set that the pipeline will generate itself; all other fields must be present.

The fields in this file are file location, alias, strand-specific, sample-score, reference, and exclude redundant models. Mikado documentation provides further information on the choice of values in this table.

Example tsv configuration (assets/mikado_conf.tsv):

	hx	True		False	False
	st	True	1	False	True
	tr	False	-0.5	False	False
	mp	True	1	False	False

Note

The first field is intentionally missing as Ragnarok will produce these outputs.
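Because the first field is empty and the separators must be real tabs, a design table is easy to break when edited by hand. The sketch below rebuilds the example table with `printf` and checks its shape with `awk`. The helper name `check_design` is hypothetical and not part of Ragnarok, and the rules it enforces (exactly six tab-separated fields, non-empty alias) are assumptions drawn from the field list above.

```shell
#!/usr/bin/env bash
# check_design FILE: succeed only if every row has exactly six tab-separated
# fields and a non-empty alias (field 2). An empty field 1 is allowed.
check_design() {
    awk -F '\t' '
        NF != 6  { printf "line %d: expected 6 fields, found %d\n", NR, NF; bad = 1 }
        $2 == "" { printf "line %d: alias (field 2) is empty\n", NR; bad = 1 }
        END      { exit bad + 0 }
    ' "$1"
}

# Write the example table with printf so the separators are real tabs.
design="$(mktemp)"
printf '\thx\tTrue\t\tFalse\tFalse\n'      >  "$design"
printf '\tst\tTrue\t1\tFalse\tTrue\n'      >> "$design"
printf '\ttr\tFalse\t-0.5\tFalse\tFalse\n' >> "$design"
printf '\tmp\tTrue\t1\tFalse\tFalse\n'     >> "$design"

check_design "$design" && echo "design table looks well-formed"
```

A table pasted with spaces instead of tabs fails this check, since each row then counts as a single field.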

Ragnarok also accepts any number of existing annotations (gff3) as additional models for the mikado2 stage of processing.

Hypothetical tsv configuration combining Ragnarok-generated models (empty file fields) with existing models:

	hx	True		False	False
	st	True	1	False	True
	tr	False	-0.5	False	False
	mp	True	1	False	False
cufflinks.gtf	cuff	True		False	False
trinity.gff3	tr	False	-0.5	False	False
reference.gff3	at	True	5	True	False

Note

The filepath to existing gffs will need to be provided.
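A missing annotation file is cheaper to catch before launching the pipeline. The sketch below lists every non-empty path in the first field of a design table that does not exist on disk. The helper name `missing_design_files` and the file names in the demonstration are hypothetical. Using absolute paths is an assumption made here, since how relative paths are resolved is not stated above.

```shell
#!/usr/bin/env bash
# missing_design_files FILE: print every non-empty path in field 1 that does not exist.
missing_design_files() {
    awk -F '\t' '$1 != "" { print $1 }' "$1" | while IFS= read -r path; do
        [ -e "$path" ] || echo "$path"
    done
}

# Demonstration in a throwaway directory with hypothetical file names.
workdir="$(mktemp -d)"
touch "$workdir/reference.gff3"   # present on disk; cufflinks.gtf is not created
design="$workdir/mikado_conf.tsv"
printf '\thx\tTrue\t\tFalse\tFalse\n'                               >  "$design"
printf '%s\tat\tTrue\t5\tTrue\tFalse\n'   "$workdir/reference.gff3" >> "$design"
printf '%s\tcuff\tTrue\t\tFalse\tFalse\n' "$workdir/cufflinks.gtf"  >> "$design"

# Prints only the cufflinks.gtf path, the one file that is absent.
missing_design_files "$design"
```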

optional files

| parameter | type | description |
| --- | --- | --- |
| --lo_genome | .fna | Reference genome to use for liftover, requires corresponding --lo_gff |
| --lo_gff | .gff | Reference annotations to use for liftover, requires corresponding --lo_genome |
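The pairing rules in the tables above (at least one of --ill, --pb, or --ont; --lo_genome and --lo_gff only together) can be checked in a wrapper script before any job is submitted. This is a minimal sketch with a hypothetical helper name, `check_inputs`. Whether Ragnarok performs the same checks itself is not stated here.

```shell
#!/usr/bin/env bash
# check_inputs ILL PB ONT LO_GENOME LO_GFF: pass an empty string for each unused option.
check_inputs() {
    local ill="$1" pb="$2" ont="$3" lo_genome="$4" lo_gff="$5" status=0
    # At least one source of read evidence is needed.
    if [ -z "$ill" ] && [ -z "$pb" ] && [ -z "$ont" ]; then
        echo "error: provide at least one of --ill, --pb, --ont"
        status=1
    fi
    # The liftover inputs only make sense as a pair.
    if { [ -n "$lo_genome" ] && [ -z "$lo_gff" ]; } ||
       { [ -z "$lo_genome" ] && [ -n "$lo_gff" ]; }; then
        echo "error: --lo_genome and --lo_gff must be given together"
        status=1
    fi
    return "$status"
}

# Hypothetical example: Illumina reads only, no liftover.
check_inputs "reads/illumina/" "" "" "" "" && echo "inputs look consistent"
```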

additional parameters

| parameter | type | description | default |
| --- | --- | --- | --- |
| --lineage | url | URL path to helixer model https://zenodo.org/records/10836346. | "https://zenodo.org/records/10836346/files/land_plant_v0.3_a_0080.h5" |
| --subseq_len | int | Helixer subseq length. | 64152 (plants) |
| --skip_qc | bool | Perform fastqc/multiqc on raw read data. | true |
| --skip_trim | bool | Perform adapter trimming on raw read data. | true |
| --minimum_length | int | Use with --skip_trim, minimum length read to keep when adapter trimming. | 50 |
| --skip_st | bool | Requires st, tr, mp gff files locally (in --design file), bypass Stringtie steps. | false |
| --skip_hx | bool | Requires hx file locally (in --design file), bypass Helixer step. | false |
| --entap_db | directory | Entap pre-configured database | false |
| --entap_run | .params file | Entap config defining databases and contaminants. | assets/template_entap_config.ini (plants) |
| --busco_db | str | Desired BUSCO dataset from BUSCO v5.8.1 and above. | "embryophyta_odb12" |
| --final_prefix | str | File prefix for final files in RAGNAROK publish directory | "ragnarok" |
| --max_intron | int | Maximum intron length used for STAR alignIntronMax. | 10000 |
getting started

running on a local server

First, confirm that you have working versions of Nextflow and Apptainer:

nextflow -version
apptainer --version

Set a scratch directory for singularity (no space after the `=`).

SCRATCHDIR=<path>

Below is a sample script to run the pipeline. You'll need to replace the <> values with those that make sense for your use case.

nextflow run ~/nextflow/ragnarok/main.nf \
    --publish_dir     < path to results location > \
    --genome          < path to genome in fasta > \
    --cds             < path to cds fasta file > \
    --protein         < path to protein (aa) fasta file > \
    --ill             < path to directory that immediately contains all R1/R2 fastqs > \
    --pb              < path to directory that immediately contains all long read fastqs > \
    --perform_masking < bool > \
    --skip_qc         < bool > \
    --skip_trim       < bool > \
    --nlrs            < bool > \
    --ipscan          < path to interproscan directory > \
    --genemark        < path to prepared genemark directory (see `plant NLR annotation` section) > \
    --design          < tsv file with expected gff files and weights for mikado > \
    --scoring         < path to mikado scoring file (e.g., plant.yaml) > \
    --homology        < path to homology fasta file (e.g., uniprot for your phylum) > \
    -profile local,four \
    -resume

Note that the profile here (-profile local,four) is set up for use on a local server and will likely require modification for your job. The four profile is set up to use approximately 4 CPUs at most. Other presets exist in conf/local.conf, and you can create your own by copying those examples.
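Long command lines like the one above can also be replaced by a parameters file, using Nextflow's standard `-params-file` option, which reads pipeline parameters from YAML or JSON. The sketch below writes such a file and prints the resulting command rather than running it. Every path and the file name `ragnarok_params.yaml` are hypothetical placeholders. The keys mirror the `--` parameters documented above, on the assumption that Ragnarok reads them as ordinary Nextflow params.

```shell
#!/usr/bin/env bash
params_file="ragnarok_params.yaml"

# Keys match the documented --parameters with the leading dashes removed.
cat > "$params_file" <<'EOF'
# Hypothetical paths: replace every value with your own files.
publish_dir: /data/results/ragnarok_out
genome: /data/genome/assembly.fna
protein: /data/evidence/reference_proteins.faa
ill: /data/reads/illumina/
design: /data/config/mikado_conf.tsv
scoring: /data/config/plant.yaml
homology: /data/evidence/uniprot_viridiplantae.faa
EOF

# Build the command as an array; it is printed here rather than executed.
cmd=(nextflow run "$HOME/nextflow/ragnarok/main.nf"
     -params-file "$params_file"
     -profile local,four
     -resume)
printf '%s ' "${cmd[@]}"; echo
```

Keeping the parameters in a file makes a run easier to repeat and to keep under version control next to the design table. To launch the run, replace the final `printf` line with `"${cmd[@]}"`.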

running on a slurm server

First, confirm that you have working versions of Nextflow and Apptainer:

nextflow -version
apptainer --version

Below is a sample sbatch script to run the pipeline. You'll need to replace the <> values with those that make sense for your use case.

#!/bin/bash
#SBATCH -J ragnarok
#SBATCH -A acf-utk0032
#SBATCH --partition=long
#SBATCH --qos=long
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --time=5-00:00:00
#SBATCH --error=job.e%J
#SBATCH --output=job.o%J

SCRATCHDIR=< path to a directory to store singularity cache >

export NXF_OPTS="-Xms500M -Xmx2G"
export NXF_ANSI_LOG=false

nextflow run ~/nextflow/ragnarok/main.nf \
    --publish_dir     < path to results location > \
    --genome          < path to genome in fasta > \
    --cds             < path to cds fasta file > \
    --protein         < path to protein (aa) fasta file > \
    --ill             < path to directory that immediately contains all R1/R2 fastqs > \
    --pb              < path to directory that immediately contains all long read fastqs > \
    --perform_masking < bool > \
    --skip_qc         < bool > \
    --skip_trim       < bool > \
    --nlrs            < bool > \
    --ipscan          < path to interproscan directory > \
    --genemark        < path to prepared genemark directory (see `plant NLR annotation` section) > \
    --design          < tsv file with expected gff files and weights for mikado > \
    --scoring         < path to mikado scoring file (e.g., plant.yaml) > \
    --homology        < path to homology fasta file (e.g., uniprot for your phylum) > \
    -profile slurm,custom \
    -resume

Based on your qos/partition, you may want to modify the conf/slurm.config and conf/slurm_custom.config files to handle your dataset.

a few notes on slurm qos and partitions

This pipeline currently relies on four qos/partition configurations on the UTK ISAAC-NG system.

  • short : maximum 3 hours, 12 jobs submitted
  • campus : maximum 1 day, 94 jobs submitted
  • long : maximum 6 days, 14 jobs submitted
  • gpu : maximum 4 hours, 1 job submitted

For each of the three labels found in slurm.config, maxForks can be adjusted to match the qos limits on other systems, and the imported clusterOptions for each label can be updated with the specific qos/partition/account information in slurm_custom.config.

To check the limits for a given qos on your system, replace short with any qos you have access to:

sacctmgr show qos where name=short

...or

scontrol show partition short

test run

If you want to first see whether the pipeline will work on your system, some smaller test files have been included in the test directory. On a local server (or within an interactive slurm session), navigate to test/src/tiny_maize and run:

bash local_ragnarok_tiny_maize.sh

Note

This may take approximately 45 minutes depending on your resources as EnTAP will use the full reference databases.

Assuming you have followed the nextflow/apptainer installation steps above, this should run.

If you're running on a slurm-based system, you may modify the conf/slurm.config and conf/slurm_custom.config files to meet your system's qos and job submission limits and try running:

sbatch slurm_ragnarok_tiny_maize.sbatch

output files

output overview

Upon completion of the pipeline, the defined publish directory (--publish_dir) will contain the following subdirectories:

publish
├── RAGNAROK           (final output files and associated sequences/summaries)
├── alignments
│   ├── sorted_bam
│   └── star
├── design             (relevant metadata from user input)
├── entap
│   └── entap_outfiles (EnTAP output directory with functional annotations)
├── mikado
│   ├── mikado_in      (gff files and configuration/scoring used as input)
│   └── mikado_out     (the raw output of mikado)
└── summary            (nextflow reports on run)

detailed output example

publish
├── RAGNAROK/
│   ├── ragnarok.entap_filtered.busco_embryophyta_odb12.txt
│   ├── ragnarok.entap_filtered.compleasm_embryophyta_odb12.txt
│   ├── ragnarok.entap_filtered.gff3
│   ├── ragnarok.entap_filtered.proteins.faa
│   ├── ragnarok.entap_filtered.transcripts.fna
│   └── ragnarok.entap_no_annotation.gff3
├── alignments/
│   ├── sorted_bam/
│   │   └── short_sorted_merged.bam
│   └── star/
│       ├── CB_N_B_T1Aligned.out.bam -> <symlink>
│       ├── CB_N_B_T1Log.final.out -> <symlink>
│       ├── CB_N_B_T1Log.out -> <symlink>
│       ├── CB_N_B_T1Log.progress.out -> <symlink>
│       └── star_CB_N_B_T1.out -> <symlink>
├── cutadapt
├── design/
│   ├── gff_paths.csv
│   └── mikado.tsv
├── edta_masking
├── entap/
│   └── entap_outfiles/
│       └── final_results/
├── mikado/
│   ├── mikado_in/
│   │   ├── configuration.yaml
│   │   ├── genome_unzip.fna.fai
│   │   ├── mikado_prepared.fasta
│   │   ├── mikado_prepared.gtf
│   │   ├── pre_mikado_10kIntron_stringtie.gff
│   │   ├── pre_mikado_aa_miniprot.gff
│   │   ├── pre_mikado_helixer.gff3
│   │   └── pre_mikado_transcripts.fasta.transdecoder.genome.gff3
│   └── mikado_out/
│       ├── mikado.loci_out.gff3
│       └── mikado.subloci.gff3
├── qc/
│   ├── raw/
│   └── trimmed/
└── summary/
    ├── 2025-08-14_15-15_dag.html
    ├── 2025-08-14_15-15_report.html
    ├── 2025-08-14_15-15_timeline.html
    └── 2025-08-14_15-15_trace.html
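After a run, the tree above can serve as a checklist. The sketch below reports which of the final annotation files are absent from the RAGNAROK subdirectory. The helper name `list_missing_outputs` is hypothetical, and the file names are taken from the detailed example above. The BUSCO and compleasm summaries are left out because their names depend on --busco_db. The demonstration runs against a mock directory, not a real result.

```shell
#!/usr/bin/env bash
# list_missing_outputs PUBLISH_DIR [PREFIX]: print expected final files that are absent.
# PREFIX defaults to "ragnarok", the documented default of --final_prefix.
list_missing_outputs() {
    local publish_dir="$1" prefix="${2:-ragnarok}" suffix
    for suffix in entap_filtered.gff3 \
                  entap_filtered.proteins.faa \
                  entap_filtered.transcripts.fna \
                  entap_no_annotation.gff3; do
        [ -e "$publish_dir/RAGNAROK/$prefix.$suffix" ] || echo "RAGNAROK/$prefix.$suffix"
    done
}

# Demonstration on a mock publish directory that holds only one of the files.
publish="$(mktemp -d)"
mkdir -p "$publish/RAGNAROK"
touch "$publish/RAGNAROK/ragnarok.entap_filtered.gff3"

# Lists the three files that the mock directory lacks.
list_missing_outputs "$publish"
```

For a real run, pass the directory given to --publish_dir and, if it was changed, the value of --final_prefix as the second argument.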

experimental features

The following features are under development and work on many (but not all) systems.

EDTA masking

| parameter | type | description |
| --- | --- | --- |
| --perform_masking | bool | Run EDTA to mask input genome (recommended). |
| --masking_threshold | int | Use with perform_masking to custom hard-mask TEanno models >= this length. |
| --cds | .fna | CDS file for your species for use with --perform_masking true (used by EDTA) |

Note

If EDTA does not work well with your server, consider running the conda version standalone and input the hard-masked genome as the --genome parameter before running Ragnarok.

plant NLR annotation

| parameter | type | description |
| --- | --- | --- |
| --nlrs | bool | Run FindPlantNLRs. |
| --ipscan | directory | Locally stored interproscan for use with --nlrs true (64-bit download) |
| --genemark | directory | Genemark with key configured for use with --nlrs true (see assets/genemark_setup.sh) |

Note

Using FindPlantNLRs with a pre-masked genome is not recommended.

Example tsv configuration for --nlrs true (assets/mikado_nlr_conf.tsv):

	hx	True		False	False
	st	True	1	False	True
	tr	False	-0.5	False	False
	mp	True	1	False	False
	nlr	True	1	False	False
flowchart TB
    subgraph " "
    subgraph params
    v58["lineage"]
    v76["homology"]
    v13["pb"]
    v82["entap_conf"]
    v19["minimum_length"]
    v0["design"]
    v53["protein"]
    v60["subseq_len"]
    v83["entap_run"]
    v24["perform_masking"]
    v39["max_intron"]
    v90["busco_db"]
    v75["scoring"]
    v11["ill"]
    v8["genome"]
    v25["cds"]
    end
    v3("gff files")
    v5("masked genome")
    v6([PARSE_INPUT])
    v16([FASTQC_RAW])
    v17([MULTIQC_RAW])
    v20([FASTP_ADAPTERS])
    v22([FASTQC_TRIM])
    v23([MULTIQC_TRIM])
    v27([EDTA])
    v37([STAR_INDEX_NA])
    v40([STAR_MAP])
    v42([SAM_SORT])
    v43([MINIMAP2])
    v44([SAM_SORT_LONG])
    v45([STRINGTIE_MIX])
    v47([STRINGTIE])
    v51([GFFREAD])
    v52([TRANSDECODER])
    v54([MINIPROT])
    v59([HELIXER_DB])
    v61([HELIXER])
    v77([MIKADO_CONF])
    v78([TRANSDECODER_ORF])
    v79([DIAMOND])
    v80([THE_GRANDMASTER])
    v81([GFFREAD_MIKADO])
    v84([ENTAP_INI])
    v85([PROT_FIX])
    v86([ENTAP_RUN])
    v88([AGAT_SUBSET])
    v89([GFFREAD_ENTAP])
    v91([BUSCO])
    v92([COMPLEASM_DB])
    v93([COMPLEASM])
    v0 --> v6
    v11 --> v16
    v16 --> v17
    v19 --> v20
    v11 --> v20
    v20 --> v22
    v22 --> v23
    v8 --> v27
    v24 --> v27
    v25 --> v27
    v8 --> v37
    v20 --> v40
    v37 --> v40
    v39 --> v40
    v40 --> v42
    v8 --> v43
    v13 --> v43
    v43 --> v44
    v42 --> v45
    v44 --> v45
    v42 --> v47
    v27 --> v43
    v27 --> v37
    v27 --> v61
    v44 --> v47
    v45 --> v51
    v47 --> v51
    v8 --> v51
    v47 --> v52
    v8 --> v52
    v52 --> v54
    v52 --> v3
    v54 --> v3
    v51 --> v3
    v61 --> v3
    v53 --> v54
    v8 --> v54
    v58 --> v59
    v8 --> v61
    v59 --> v61
    v60 --> v61
    v3 --> v77
    v6 --> v77
    v8 --> v77
    v75 --> v77
    v76 --> v77
    v77 --> v78
    v76 --> v79
    v77 --> v79
    v8 --> v80
    v76 --> v80
    v77 --> v80
    v78 --> v80
    v79 --> v80
    v80 --> v81
    v8 --> v81
    v82 --> v84
    v83 --> v84
    v81 --> v85
    v84 --> v86
    v85 --> v86
    v80 --> v88
    v86 --> v88
    v88 --> v89
    v8 --> v89
    v89 --> v91
    v90 --> v91
    v90 --> v92
    v89 --> v93
    v90 --> v93
    v92 --> v93
    v88 --> v100
    v89 --> v101
    v89 --> v102
    v91 --> v103
    v93 --> v103
    v17 --> v104
    v23 --> v105
    subgraph publish
    v100["entap filtered gff"]
    v101["entap filtered cds"]
    v102["entap filtered protein"]
    v103["BUSCO metrics"]
    v104["raw qc"]
    v105["trimmed qc"]
    end
    end

tools used in ragnarok

tool images

  • agat:quay.io/biocontainers/agat:1.4.2--pl5321hdfd78af_0
  • bedtools:quay.io/biocontainers/bedtools:2.31.1--h13024bc_3
  • busco:quay.io/biocontainers/busco:5.8.2--pyhdfd78af_0
  • compleasm:quay.io/biocontainers/compleasm:0.2.7--pyh7e72e81_0
  • diamond:quay.io/biocontainers/diamond:2.1.11--h5ca1c30_1
  • edta:quay.io/biocontainers/edta:2.2.2--hdfd78af_1
  • entap:docker://plantgenomics/entap:2.2.0
  • fastp:quay.io/biocontainers/fastp:0.23.4--h125f33a_5
  • fastqc:quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0
  • findplantnlrs:docker://ryandk/findplantnlrs:latest
  • gffread:quay.io/biocontainers/gffread:0.12.7--h077b44d_6
  • helixer:docker://gglyptodon/helixer-docker:helixer_v0.3.4_cuda_12.2.2-cudnn8
  • liftoff:docker://quay.io/biocontainers/liftoff:1.6.3--pyhdfd78af_1
  • mikado2:docker://gemygk/mikado:v2.3.5rc2
  • minimap2:quay.io/biocontainers/minimap2:2.28--h577a1d6_4
  • miniprot:quay.io/biocontainers/miniprot:0.13--h577a1d6_2
  • multiqc:quay.io/biocontainers/multiqc:1.24.1--pyhdfd78af_0
  • pandas:quay.io/biocontainers/pandas:1.5.2
  • samtools:quay.io/biocontainers/samtools:1.20--h50ea8bc_1
  • star:quay.io/biocontainers/star:2.7.11a--h0033a41_0
  • stringtie:quay.io/biocontainers/stringtie:3.0.0--h29c0135_0
  • transdecoder:quay.io/biocontainers/transdecoder:5.7.1--pl5321hdfd78af_0

See conf/containers.config for most current versions.

license

MIT license
