Main tool: BBTools
Code repository: https://sourceforge.net/projects/bbmap/
Additional tools:
- samtools: 1.21
- htslib: 1.21
- sambamba: 1.0.1
Basic information on how to use this tool:
- executable: *.sh
- help: Program descriptions and options are shown when running the shell scripts with no parameters.
- version: --version
- description:
BBTools is a suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA sequence data. BBTools can handle common sequencing file formats such as fastq, fasta, sam, scarf, fasta+qual, compressed or raw, with autodetection of quality encoding and interleaving.
Additional information:
| Script | Purpose | Comment |
|---|---|---|
| bbcms.sh | Performs error correction using a Count-Min Sketch | Intended for metagenome assembly assembly |
| bbcountunique.sh | Counts unique kmers in reads | |
| bbduk.sh | Trims, filters or masks reads using kmers | |
| bbmap.sh | Splice-aware aligner for short reads | |
| bbmapskimmer.sh | BBMap version designed for high levels of multimapping | |
| bbmask.sh | Masks references based on various things, such as sequence complexity | |
| bbmerge.sh | Merges overlapping paired reads | |
| bbmerge-auto.sh | Same as bbmerge, but tries to allocate all memory on the node | Use this version for kmer operations like extend |
| bbnorm.sh | Normalizes reads based on coverage | Mainly for use prior to single-cell assembly |
| bbsplit.sh | BBMap version that maps to multiple references simultaneously | Intended for decontamination; similar to Seal |
| bbversion.sh | Prints the version of BBTools | |
| bbwrap.sh | Wraps BBMap to process many files using same reference | Saves time by loading the index only once |
| calctruequality.sh | Allows recalibration of quality scores from mapped reads | This generates the correction matrix; BBDuk does the recalibration |
| callgenes.sh | Fast prokaryotic gene caller | Integrated into BBSketch |
| callvariants.sh | Fast variant caller | |
| callvariants2.sh | Same as callvariants.sh with the "multisample" flag | |
| clumpify.sh | Shrinks compressed fastq files, and can remove duplicate reads | Also supports error correction |
| comparesketch.sh | Compares sketches locally, without using a sketch server | |
| crossblock.sh | Alias for decontaminate.sh | |
| cutgff.sh | Cuts out features defined by gff file | E.g, generates one fasta entry per gene from a gff and an assembly |
| cutprimers.sh | Cuts out subregions of ribosomes | Mainly for 16S analysis |
| decontaminate.sh | Pool-level decontamination for single-cell MDA-amplified genomes | |
| dedupe.sh | Removes duplicate and fully-contained sequences | Can also be used to cluster 16S sequences |
| dedupe2.sh | Version of dedupe that supports more hash keys for greater sensitivity | |
| dedupebymapping.sh | Deduplicates reads based on mapping coordinates | |
| demuxbyname.sh | Demultiplexes based on sequences headers | |
| filterbyname.sh | Filters based on sequence headers | |
| filterbytaxa.sh | Filters sequences based on taxonomic classification | Used with NCBI datasets |
| filterbytile.sh | Removes reads that are in low quality areas on flowcell | |
| filterqc.sh | Part of JGI's fastq filtering pipeline | |
| filtersam.sh | Filters sam files to remove reads with multiple unsupported mismatches | Designed for NovaSeq |
| gitable.sh | Used to process NCBI taxonomy data | |
| khist.sh | Alias for bbnorm.sh with flags for making a kmer frequency histogram | |
| kmercountexact.sh | Counts kmers and produces a histogram | Uses more memory than BBNorm but allows exact counts |
| kmercountmulti.sh | Cardinality estimation over multiple kmer lengths | Uses LogLog; does not produce a histogram |
| mapPacBio.sh | BBMap version designed for PacBio or Nanopore reads | Reads longer than 5kbp get broken into 5kbp shreds |
| mergesketch.sh | Allows multiple sketches to be combined | |
| msa.sh | Alignment tool | Used with cutprimers.sh to cut subsections out of 16s |
| mutate.sh | Generates synthetic genomes by randomly mutating the input | |
| muxbyname.sh | Multiplex multiple files, renaming sequences based on input file name | Opposite of demuxbyname.sh |
| partition.sh | Splits a sequence file into multiple files | |
| pileup.sh | Calculates coverage from sam files | |
| plotflowcell.sh | Produces statistics about flowcell positions | |
| processhi-c.sh | Custom trimming for hi-C reads | In development |
| randomreads.sh | Generates synthetic data from real genome reference | Highly customizable |
| readqc.sh | Short read quality report | Alternative to fastqc |
| reformat.sh | Converts sequence files to another format | Has many additional options, includes subsampling |
| rename.sh | Renames sequences in various ways, such as adding a prefix | |
| repair.sh | Fixes broken pairing in fastq files | |
| representative.sh | Makes a smaller subset of a reference dataset by eliminating redundancy | Designed for use with BBSketch output |
| rqcfilter2.sh | Filtering pipeline used at JGI | portal.nersc.gov/dna/microbial/assembly/bushnell/RQCFilterData.tar |
| seal.sh | Counts kmer matches between query and reference sequences | |
| sendsketch.sh | Fast taxonomic classifier using webservers at JGI | |
| shred.sh | Breaks sequences into shorter, fixed-length pieces | |
| shuffle.sh | Randomly reorders input file | Crashes if input doesn't fit in memory |
| shuffle2.sh | Randomly reorders input file | Supports larger files, but output might be less random |
| sketch.sh | Makes reference sketches on a per-TaxID basis | |
| sketchblacklist.sh | Makes sketch blacklists of common kmers | |
| sortbyname.sh | Sorts sequences by name, length, quality, taxa, and other things | |
| summarizequast.sh | Generates box plots for multiple quast reports | |
| tadpipe.sh | Preprocessing and assembly pipeline using tadpole | |
| tadpole.sh | Fast short read assembler | |
| tadwrapper.sh | Runs Tadpole with multiple kmer lengths to select the best assembly | |
| taxserver.sh | Starts taxonomy and sketch servers | |
| testformat.sh | Determines if file is fasta, fastq, interleaved, etc. by reading first few lines | |
| testformat2.sh | Generates extensive statistics by reading the full file | |
| translate6frames.sh | Translates nucleotide sequence into amino acid sequence in all frames | |
| vcf2gff.sh | Converts vcf format to gff format |
Full documentation: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/
(adapted from /opt/bbmap/pipelines/covid/processCorona.sh)
Interleave a pair of FASTQ files for downstream processing:
reformat.sh \
in1=${SAMPLE}_R1.fastq.gz \
in2=${SAMPLE}_R2.fastq.gz \
out=${SAMPLE}.fastq.gz
Split into SARS-CoV-2 and non-SARS-CoV-2 reads:
bbduk.sh ow -Xmx1g \
in=${SAMPLE}.fq.gz \
ref=REFERENCE.fasta \
outm=${SAMPLE}_viral.fq.gz \
outu=${SAMPLE}_nonviral.fq.gz \
k=25