C++ implementation of lulu, an R package for post-clustering curation of metabarcoding data

frederic-mahe/mumu

mumu

fast and robust C++ implementation of lulu, an R package for post-clustering curation of metabarcoding data

In short, mumu is a general clustering method that uses both pairwise similarity and co-abundance patterns.
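The combination of similarity and co-abundance can be illustrated with a small sketch. It is a simplified illustration in the spirit of lulu's criteria, not mumu's actual code: the function name coabundance_check, the tab-separated table layout, and the thresholds (co-occurrence of at least 0.95, abundance ratio above 1) are assumptions made for this example. The pairwise similarity test is left out; only the co-abundance part is shown, for a candidate "daughter" OTU and a more abundant "parent" OTU.

```shell
# Hypothetical helper, not part of mumu.
# Usage: coabundance_check TABLE PARENT DAUGHTER
# TABLE is assumed to be tab-separated, with one header line.
# Prints: co-occurrence, minimum parent/daughter ratio, verdict.
coabundance_check() {
    awk -F '\t' -v parent="$2" -v daughter="$3" '
        $1 == parent   { for (i = 2; i <= NF; i++) p[i] = $i; n = NF }
        $1 == daughter { for (i = 2; i <= NF; i++) d[i] = $i; n = NF }
        END {
            present = 0; shared = 0; min_ratio = -1
            for (i = 2; i <= n; i++) {
                if (d[i] + 0 == 0) continue      # daughter absent: sample ignored
                present++
                if (p[i] + 0 > 0) shared++       # parent also present
                ratio = p[i] / d[i]
                if (min_ratio < 0 || ratio < min_ratio) min_ratio = ratio
            }
            if (present == 0) { print "undefined"; exit }
            cooccurrence = shared / present
            # assumed thresholds, chosen for illustration only
            verdict = (cooccurrence >= 0.95 && min_ratio > 1) ? "merge" : "keep"
            printf "%.2f\t%.2f\t%s\n", cooccurrence, min_ratio, verdict
        }' "$1"
}
```

With the two-OTU example table from the getting started section (tab-separated), `coabundance_check OTU.table A B` reports a co-occurrence of 1.00 and a minimum ratio of 4.00: A is present wherever B is, and always more abundant, so B is a merge candidate under these assumed thresholds.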

about

mumu is not a strict lulu clone. There is a bug in lulu that prevents some merging from happening. Additionally, mumu filters and sorts input data differently. When combined, these differences result in slightly more merging with mumu (by a few percent). Use the --legacy option if you need to reproduce lulu's results exactly.

mumu is fully tested, with 175 carefully crafted individual black-box tests, covering 100% of the application-specific C++ code. Tests are written using common Unix/Linux shell utilities. Some C++ internal tests are also used (assertions), but these are only active at compile-time, or at runtime when compiling with the debug flag.

mumu uses C++20 features to make the code simpler, easier to maintain, and easier to port to other operating systems. Please note that mumu has been tested on GNU/Linux. Compilation on other operating systems, such as macOS, BSD, or Windows, should be possible but remains untested. Compiling mumu requires a C++20-compliant compiler (GCC 10 or more recent, clang 17 or more recent). If your system only provides an older compiler, a recipe for a Singularity/Apptainer/Docker image is available (see the advanced users section).

About the name of the project: m is simply the next letter after l, hence mumu. Any similarity to actual words is purely coincidental.

install

clone or download a copy of the repository:

git clone https://github.com/frederic-mahe/mumu.git
cd ./mumu/
make
make check
make install  # as root or sudo

dependencies are minimal:

  • a 64-bit operating system,
  • make (version 4 or more recent),
  • GCC 10 (2020) or more recent, or clang 17 (2023) or more recent,
  • GNU Awk and other GNU tools for testing
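The compiler requirement can be checked before running make. The sketch below is only an illustration: major_at_least is a hypothetical helper name, and it assumes the version string starts with the major version number, as printed for instance by `g++ -dumpversion`.

```shell
# Hypothetical helper, not part of mumu.
# Usage: major_at_least VERSION MINIMUM
# Succeeds if the major number of VERSION (e.g. "11.2.0") is >= MINIMUM.
major_at_least() {
    major=${1%%.*}            # keep everything before the first dot
    [ "$major" -ge "$2" ]
}

# Example (assumes g++ is on the PATH):
# major_at_least "$(g++ -dumpversion)" 10 && echo "GCC is recent enough"
```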

getting started

simply run:

mumu \
    --otu_table OTU.table \
    --match_list matches.list \
    --log /dev/null \
    --new_otu_table new_OTU.table

where the input OTU.table is formatted as such:

OTUs sample1 sample2 sample3
A 12 9 24
B 3 0 6

and the input matches.list is formatted as such:

B A 95.6
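A quick pre-flight check of the OTU table can catch layout problems before running mumu. This is a sketch under stated assumptions, not something mumu provides: check_otu_table is a hypothetical name, columns are assumed to be tab-separated, and abundances are assumed to be non-negative integers.

```shell
# Hypothetical pre-flight check, not part of mumu.
# Usage: check_otu_table TABLE
# Assumes a tab-separated table with one header line.
# Returns 0 if every row matches the header width and holds integers only.
check_otu_table() {
    awk -F '\t' '
        NR == 1 { columns = NF; next }
        NF != columns {
            printf "line %d: expected %d columns, found %d\n", NR, columns, NF
            bad = 1
        }
        {
            for (i = 2; i <= NF; i++)
                if ($i !~ /^[0-9]+$/) {
                    printf "line %d: \"%s\" is not a non-negative integer\n", NR, $i
                    bad = 1
                }
        }
        END { exit bad + 0 }' "$1"
}
```

For example, `check_otu_table OTU.table && echo "table looks fine"` prints the message only when no problem is found. Quoted values, as produced by some exporters, are reported as non-integers.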

Given a fasta file input.fasta, a correct list of matches can be produced with vsearch for all OTU pairs with at least 84% similarity (--id 0.84, see mumu --help and man mumu for more details):

vsearch \
    --usearch_global input.fasta \
    --db input.fasta \
    --self \
    --id 0.84 \
    --iddef 1 \
    --userfields query+target+id \
    --maxaccepts 0 \
    --query_cov 0.9 \
    --maxhits 10 \
    --userout matches.list
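Once matches.list exists, it can be useful to confirm that every OTU it mentions is also present in the OTU table. The sketch below is illustrative only: unknown_otus is a hypothetical helper, both files are assumed to be tab-separated, and whether mumu itself reports such entries is not covered here.

```shell
# Hypothetical consistency check, not part of mumu.
# Usage: unknown_otus TABLE MATCHES
# Prints the OTU names found in MATCHES (columns 1 and 2)
# that have no row in TABLE (header line skipped).
unknown_otus() {
    awk -F '\t' '
        NR == FNR { if (FNR > 1) known[$1] = 1; next }   # first file: OTU table
        { for (i = 1; i <= 2; i++) if (!($i in known)) print $i }
    ' "$1" "$2" | sort -u
}
```

An empty output from `unknown_otus OTU.table matches.list` means the two inputs refer to the same set of OTU names.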

wrapper

Adrien Taudière (@adrientaudiere) published mumu_pq, a wrapper that makes it possible to use mumu on phyloseq objects (R).

advanced users

build an Apptainer (ex-singularity) image for operating systems with older compilers:

# build image with singularity 3.8.5
# (Alpine edge with GCC 11.2 [2022-02-25])
singularity \
    build \
    --fakeroot \
    --force mumu-alpine.sif \
    mumu-alpine.recipe

# test (image is approx. 4 MB)
singularity run mumu-alpine.sif --help

road-map

mumu is currently feature-complete (nothing is missing), but refactoring will continue as more C++ features (C++20 modules, C++23 ranges, C++26 contracts, etc.) are standardized and supported by compilers.

  • replicate lulu's results,
  • fix lulu's bug,
  • allow chained merges,
  • high software quality score (softwipe),
  • allow empty input files,
  • allow process substitutions (input/output),
  • compile without warnings with GCC 10 and 11,
  • compile without warnings with GCC 12.2,
  • compile without warnings with GCC 12.3,
  • compile without warnings with GCC 13, 14, and 15,
  • compile with clang 17 to 22 (std::ranges is not supported in clang-16),
  • investigate the five minor failed tests when running on Alpine (as root),
  • add a row of column headers to the log file? (see issue #4),
  • silently strip quote symbols from input table? Exporters often quote strings, tripping some users (see issue #7),
  • allow named pipes (input/output),
  • test performance on ARM64 GNU/Linux (Raspberry Pi 3B+),
  • test performance on RISC-V GNU/Linux (Banana Pi BPI-F3),
  • support for sparse contingency tables,
  • faster input parsing through data buffers,
  • faster output with std::format (in 2026?),
  • native compilation on Windows (issue with getopt.h),
  • native compilation on BSD (issue with the Makefile),
  • native compilation on macOS

mumu releases follow the Semantic Versioning 2.0.0 rules.
