fast and robust C++ implementation of lulu, an R package for post-clustering curation of metabarcoding data
In short, mumu is a general clustering method that uses both pairwise similarity and co-abundance patterns.
mumu is not a strict lulu clone. There is a bug in
lulu that prevents some
merging from happening. Additionally, mumu filters and sorts input
data differently. When combined, these differences result in slightly
more merging with mumu (by a few percent). Use the --legacy option
if you need to reproduce lulu's results exactly.
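To make the idea of combining pairwise similarity with co-abundance patterns more concrete, here is a minimal sketch of the kind of decision involved. It is not mumu's actual code. The thresholds (84% similarity, abundance ratio above 1, relative co-occurrence of at least 0.95) are assumed to mirror lulu's documented defaults, and the variable names and toy abundances are hypothetical:

```shell
# toy data: potential parent A and candidate child B in three samples,
# with 95.6% pairwise similarity (all values are illustrative)
result=$(awk -v similarity=95.6 \
             -v min_match=84 -v min_ratio=1 -v min_cooccurrence=0.95 '
BEGIN {
    split("12 9 24", parent, " ")
    split("3 0 6", child, " ")
    lowest = -1
    for (i = 1; i <= 3; i++) {
        if (child[i] == 0) continue        # only samples where the child occurs
        spread++                           # number of samples with the child
        if (parent[i] > 0) shared++        # parent is also present
        ratio = parent[i] / child[i]
        if (lowest < 0 || ratio < lowest) lowest = ratio
    }
    cooccurrence = shared / spread
    # assumed rule: similar enough, parent nearly always present, and
    # parent more abundant than child in every shared sample
    if (similarity >= min_match && cooccurrence >= min_cooccurrence && lowest > min_ratio)
        verdict = "merge"
    else
        verdict = "keep"
    printf "co-occurrence=%.2f min_ratio=%.1f verdict=%s\n", cooccurrence, lowest, verdict
}')
echo "$result"
```

With these toy values, B occurs in two samples, A is present in both, and A is four times more abundant in each, so the sketch reports that B would be merged into A.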
mumu is fully tested, with 175 carefully crafted individual
black-box tests, covering 100% of the application-specific C++
code. Tests are written using common Unix/Linux shell utilities. Some
C++ internal tests are also used (assertions), but these are only
active at compile-time, or at runtime when compiling with the debug
flag.
mumu uses C++20 features to make the code simpler, easier to maintain and to port to other operating systems. Please note that mumu has only been tested on GNU/Linux. Compilation on other operating systems, such as macOS, BSD, or Windows, should be possible but remains untested. Compiling mumu requires a compliant C++ compiler (GCC 10 or more recent, or clang 17 or more recent). If your system only provides an older compiler, a recipe for a Singularity/Apptainer/Docker image is available (see the section for advanced users).
About the name of the project: m is simply the next letter after l, hence mumu. Any similarity to actual words is purely coincidental.
clone or download a copy of the repository:
git clone https://github.com/frederic-mahe/mumu.git
cd ./mumu/
make
make check
make install # as root or sudo
dependencies are minimal:
- a 64-bit operating system,
- make (version 4 or more recent),
- GCC 10 (2020) or more recent, or clang 17 (2023) or more recent,
- GNU Awk and other GNU tools for testing
simply run:
mumu \
--otu_table OTU.table \
--match_list matches.list \
--log /dev/null \
--new_otu_table new_OTU.table
where the input OTU.table is formatted as such:
| OTUs | sample1 | sample2 | sample3 |
|---|---|---|---|
| A | 12 | 9 | 24 |
| B | 3 | 0 | 6 |
and the input matches.list is formatted as such:
| B | A | 95.6 |
|---|---|---|
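The two toy inputs shown above can be written to disk as follows. This is a minimal sketch that assumes both files are tab-separated (the tables above only show the logical layout). The working directory and the column-count check are illustrative additions, and mumu is only called if it is found in the PATH:

```shell
# work in a temporary directory to avoid overwriting existing files
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# toy OTU table (assumption: tab-separated, one header row)
printf 'OTUs\tsample1\tsample2\tsample3\n' >  OTU.table
printf 'A\t12\t9\t24\n'                    >> OTU.table
printf 'B\t3\t0\t6\n'                      >> OTU.table

# toy match list: query, target, similarity percentage
printf 'B\tA\t95.6\n' > matches.list

# sanity check: every row has as many fields as the header
awk -F '\t' 'NR == 1 {n = NF} NF != n {bad = 1} END {exit bad + 0}' OTU.table \
    && echo "OTU.table: consistent column count"

# run mumu only if it is installed
if command -v mumu > /dev/null 2>&1 ; then
    mumu \
        --otu_table OTU.table \
        --match_list matches.list \
        --log /dev/null \
        --new_otu_table new_OTU.table
fi
```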
Given a fasta file input.fasta, a correct list of matches can be
produced with vsearch for all
OTU pairs with at least 84% similarity (--id 0.84, see mumu --help
and man mumu for more details):
vsearch \
--usearch_global input.fasta \
--db input.fasta \
--self \
--id 0.84 \
--iddef 1 \
--userfields query+target+id \
--maxaccepts 0 \
--query_cov 0.9 \
--maxhits 10 \
--userout matches.list
Adrien Taudière (@adrientaudiere) published
mumu_pq,
a wrapper that makes it possible to use mumu on
phyloseq objects (R).
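Before passing a match list to mumu, a quick check can catch common formatting problems. The sketch below is an illustration, not part of mumu: it assumes a tab-separated list with three columns (query, target, similarity), uses hypothetical toy values, and flags lines with a wrong field count, a self-hit, or a similarity below the 84% threshold mentioned above:

```shell
# toy match list (hypothetical values): query, target, similarity
matches=$(mktemp)
printf 'B\tA\t95.6\n' >  "$matches"
printf 'C\tA\t84.0\n' >> "$matches"

# list suspicious lines, then print a one-line summary
problems=$(awk -F '\t' '
    NF != 3 || $1 == $2 || $3 + 0 < 84 {
        printf "line %d: %s\n", NR, $0
        n++
    }
    END { printf "%d problem(s)\n", n }' "$matches")
echo "$problems"
rm -f "$matches"
```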
build an Apptainer (ex-singularity) image for operating systems with older compilers:
# build image with singularity 3.8.5
# (Alpine edge with GCC 11.2 [2022-02-25])
singularity \
build \
--fakeroot \
--force mumu-alpine.sif \
mumu-alpine.recipe
# test (image is appr. 4 MB)
singularity run mumu-alpine.sif --help
mumu is currently feature-complete (nothing is missing), but refactoring will continue as more C++ features (C++20 modules, C++23 ranges, C++26 contracts, etc.) are standardized and supported by compilers.
- replicate lulu's results,
- fix lulu's bug,
- allow chained merges,
- high software quality score (softwipe),
- allow empty input files,
- allow process substitutions (input/output),
- compile without warnings with GCC 10 and 11,
- compile without warnings with GCC 12.2,
- compile without warnings with GCC 12.3,
- compile without warnings with GCC 13, 14, and 15,
- compile with clang 17 to 22 (std::ranges is not supported in clang-16),
- investigate the five minor failed tests when running on Alpine (as root),
- add a header row to the log file? (see issue #4),
- silently strip quote symbols from input table? Exporters often quote strings, tripping some users (see issue #7),
- allow named pipes (input/output),
- test performance on ARM64 GNU/Linux (Raspberry Pi 3B+),
- test performance on RISC-V GNU/Linux (Banana Pi BPI-F3),
- support for sparse contingency tables,
- faster input parsing through data buffers,
- faster output with std::format (in 2026?),
- native compilation on Windows (issue with getopt.h),
- native compilation on BSD (issue with the Makefile),
- native compilation on macOS
mumu releases follow the Semantic Versioning 2.0.0 rules.