An R CLI tool to take in summarised read information and output a variety of duplex metrics.
Individual metrics (selectable individually)
- frac_singletons
- efficiency
- drop_out_rate
Grouped metrics
GC metrics
- gc_single
- gc_both
- gc_deviation
Family stats
- total_families
- family_mean
- family_median
- family_max
- families_gt1
- single_families
- paired_families
- paired_and_gt1
- The CLI entrypoint is main.R
- Argument parsing and validation are handled in cli.R
- Metric execution logic is in calculate.R
- Core metric implementations are defined in R/calculate_nanoseq_functions.R
Metric selection is resolved before computation.
Only the requested individual metrics and/or metric groups are evaluated.
- GC metrics are computed only when a reference genome object (.fasta) is provided.
- --metrics defaults to all.
  - If --ref_fasta is not provided, GC metrics are skipped and a message is printed to the console.
  - If --ref_fasta is provided, GC metrics are computed (may return NA if insufficient data).
- If GC metrics are explicitly requested (e.g. --metrics gc) but no reference FASTA is supplied, the program exits with an error.
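The selection rules above can be sketched as a small shell function. This is only an illustration of the documented behaviour, not the tool's actual code (which lives in the R sources). The function name resolve_metrics, the variable names, the message wording, and the error on unknown metric names are all assumptions.

```shell
# Hypothetical sketch of metric resolution; not taken from cli.R or calculate.R.
INDIVIDUAL="frac_singletons efficiency drop_out_rate"
GC_GROUP="gc_single gc_both gc_deviation"
FAMILY_GROUP="total_families family_mean family_median family_max families_gt1 single_families paired_families paired_and_gt1"

# resolve_metrics REQUESTED REF_FASTA
#   REQUESTED: comma-separated metrics/groups, or "all"
#   REF_FASTA: path to a reference FASTA, or an empty string
resolve_metrics() {
  requested=$1
  ref_fasta=$2
  resolved=""
  # Turn commas into spaces so the shell splits the list into items.
  for item in $(printf '%s' "$requested" | tr ',' ' '); do
    case $item in
      all)
        resolved="$resolved $INDIVIDUAL"
        if [ -n "$ref_fasta" ]; then
          resolved="$resolved $GC_GROUP"
        else
          # Default run without a reference: skip GC metrics with a message.
          echo "No reference FASTA provided; skipping GC metrics." >&2
        fi
        resolved="$resolved $FAMILY_GROUP"
        ;;
      gc)
        # GC explicitly requested without a reference: fail.
        if [ -z "$ref_fasta" ]; then
          echo "Error: GC metrics requested but no reference FASTA supplied." >&2
          return 1
        fi
        resolved="$resolved $GC_GROUP"
        ;;
      family)
        resolved="$resolved $FAMILY_GROUP"
        ;;
      frac_singletons|efficiency|drop_out_rate)
        resolved="$resolved $item"
        ;;
      *)
        # Assumed behaviour: reject names that are not documented metrics.
        echo "Error: unknown metric '$item'." >&2
        return 1
        ;;
    esac
  done
  # Print one resolved metric name per line.
  printf '%s\n' $resolved
}

resolve_metrics "efficiency,family" ""
```

Resolving the list up front means a missing reference is detected before any input file is read, which matches the statement that selection happens before computation.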
This method packages the script and all its dependencies into a self-contained environment. It is the most reliable way to run the analysis, as it guarantees that the exact same software versions are used every time.
- Docker: You must have Docker installed and the Docker daemon running. You can download it from the Docker website.
- Clone the repository:
    git clone https://github.com/WEHIGenomicsRnD/calculate-duplex-metrics.git
- Navigate to the project directory:
    cd calculate-duplex-metrics
- Build the Docker image: This command reads the Dockerfile and builds a container image named calculate-duplex-metrics. This may take several minutes the first time you run it.
    docker build -t calculate-duplex-metrics .
- (Optional) Verify the image: You can check that the image was built successfully by listing your local Docker images.
    docker images
  You should see calculate-duplex-metrics in the list.
To run the tool, you use the docker run command. The -v flags are essential for allowing the Docker container to access files on your local machine.
docker run --rm \
-v "$(pwd)/data:/app/data" \
-v "$(pwd)/out:/app/out" \
calculate-duplex-metrics \
Rscript main.R \
--input data/test.rinfo \
--output out/default.csv
- -v "$(pwd)/data:/app/data": This mounts your local data directory into the /app/data directory inside the container, so the script can find the input file.
- -v "$(pwd)/out:/app/out": This mounts your local out directory into the /app/out directory inside the container, so the script can write the output file back to your machine.
This method uses the renv package to recreate the exact development environment, using the specific package versions defined in the renv.lock file.
- R: R version 4.4.1
- Clone the repository and navigate into it.
- Open an R console in the project's root directory.
- Restore the environment: This command will install renv if needed, then install all the packages listed in renv.lock with their exact versions.
    if (!require("renv")) install.packages("renv", repos = "https://cloud.r-project.org")
    renv::restore()
After restoring the environment, you can run the script directly from your terminal within the cloned repository directory.
Rscript main.R \
--input data/test.rinfo \
--output out/default.csv
This method installs the required R packages directly onto your system. It is more flexible if you have a different version of R, but it is less reproducible as it will use the latest available package versions.
- R: Any modern version of R.
- devtools R package: This is used to install packages from GitHub.
- Open an R console.
- Install devtools and Bioconductor dependencies:
    # Install devtools from CRAN
    install.packages("devtools")
    # Install BiocManager and required Bioconductor packages
    if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
    BiocManager::install(c("Rsamtools", "GenomicRanges", "IRanges", "Biostrings"))
- Install the package from GitHub:
    devtools::install_github('WEHIGenomicsRnD/calculate-duplex-metrics')
After installing the dependencies, you can run the script directly from your terminal within the cloned repository directory.
Rscript main.R \
--input data/test.rinfo \
--output out/default.csv
Note: The reference genome FASTA is user-provided and not included in this repository. Any compatible reference genome may be used.
Rscript main.R \
--input data/NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001.txt \
--output out/default_with_gc.csv \
--ref_fasta ref/Escherichia_coli_ATCC_10798.fasta
Rscript main.R \
--input data/NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001.txt \
--output out/test_selected_metrics.csv \
--metrics efficiency,drop_out_rate
Note: when listing multiple metrics, either omit spaces (efficiency,drop_out_rate) or quote the argument ("efficiency, drop_out_rate").
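The reason for this note is ordinary shell word splitting: an unquoted space ends the argument, so the second metric would reach the script as a separate, unexpected argument. A minimal demonstration, using a hypothetical helper count_args that is not part of this repository:

```shell
# count_args is a made-up helper that prints how many arguments it received.
count_args() { echo "$#"; }

# No spaces: the flag plus one value.
count_args --metrics efficiency,drop_out_rate      # prints 2

# Quoted: the space stays inside a single value.
count_args --metrics "efficiency, drop_out_rate"   # prints 2

# Unquoted space: the value is split into two arguments.
count_args --metrics efficiency, drop_out_rate     # prints 3
```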
Rscript main.R \
--input data/NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001.txt \
--output out/test_family_metrics.csv \
--metrics family
Rscript main.R \
--input data/NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001.txt \
--output out/test_mixed_metrics.csv \
--metrics efficiency,family
Rscript main.R \
--input data/a.txt data/b.txt \
--output out/all_samples_metrics.csv \
--metrics family \
--cores 2
Required:
-i, --input One or more input rinfo files OR a directory containing rinfo files (.txt or .txt.gz)
Note: when --input is a directory, the tool selects matching files using --pattern (default: \.txt(\.gz)?$);
when --input is a list of files, --pattern is ignored
-o, --output Output CSV path (long format)
Optional:
-s, --sample Optional sample name(s). For multiple input files, provide
comma-separated names matching the number of files.
Note: if --input is a directory, --sample is not allowed.
--pattern Regex pattern used to select files when --input is a directory
(default: \.txt(\.gz)?$)
--rlen Read length (default: 151)
--skips Trimmed / ignored bases per read (NanoSeq = 5, xGen = 8)
--ref_fasta Reference genome FASTA (required for GC metrics)
--metrics Comma-separated list of metrics and/or metric groups
- Individual: frac_singletons, efficiency, drop_out_rate
- Groups: gc, family
(default: all)
--cores Number of CPU cores for parallel processing (default: 1)
Note: when listing multiple metrics, either omit spaces (efficiency,drop_out_rate) or quote the argument ("efficiency, drop_out_rate").
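To preview which files the default --pattern listed in the options above would pick up from a directory, the same expression can be tried with grep -E. This is a rough check only: the tool applies the pattern in R, and while this particular expression behaves the same in both regex flavours, that is not guaranteed for arbitrary patterns. The helper name select_rinfo and the file names are made up for the example.

```shell
# select_rinfo is a hypothetical helper that keeps only names ending in .txt or .txt.gz.
select_rinfo() { grep -E '\.txt(\.gz)?$'; }

# Made-up file names: the first two should be kept, the rest dropped.
printf '%s\n' a.txt b.txt.gz notes.txt.bak c.csv sample_txt | select_rinfo
```

In a real directory, `ls data | select_rinfo` would give a quick preview of the candidate inputs.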
Output is written in long format, one row per sample and metric, with the columns:
sample,metric,value
Example output:
sample,metric,value
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,frac_singletons,0.0418706803079419
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,efficiency,0.0490258329591602
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,drop_out_rate,0.320805646128878
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,total_families,23825702
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,family_mean,6.748161712309
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,family_median,5
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,family_max,50
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,families_gt1,16771629
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,single_families,6731955
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,paired_families,9994045
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,paired_and_gt1,8152302
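Because every row has the same three columns, a single metric can be pulled out with standard tools. The sketch below copies three rows from the example above into a temporary file and filters them with awk. The helper name get_metric is an assumption, and the approach assumes sample names contain no commas.

```shell
# Build a small file in the documented sample,metric,value layout,
# using rows copied from the example output.
csv=$(mktemp)
cat > "$csv" <<'EOF'
sample,metric,value
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,frac_singletons,0.0418706803079419
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,efficiency,0.0490258329591602
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,family_median,5
EOF

# get_metric FILE METRIC (hypothetical helper):
# skip the header, then print "sample value" for each row of that metric.
get_metric() {
  awk -F',' -v m="$2" 'NR > 1 && $2 == m { print $1, $3 }' "$1"
}

get_metric "$csv" efficiency
```

With several samples in one output file, the same call lists that metric for each sample, one per line.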
Rscript main.R --help
The tests check that functions return valid numeric values, that edge cases (NA, zero reads, invalid inputs) are handled correctly, and that the expected metric names are present.
Packages: testthat
From the project root run:
Rscript tests/testthat.R