Calculate Duplex Metrics

An R command-line tool that reads summarised read information (rinfo files) and outputs a set of duplex sequencing metrics.

Available metrics

Individual metrics (selectable individually)

frac_singletons
efficiency
drop_out_rate

Grouped metrics

GC metrics

gc_single
gc_both
gc_deviation

Family stats

total_families
family_mean
family_median
family_max
families_gt1
single_families
paired_families
paired_and_gt1

Implementation overview

The CLI entrypoint is main.R
Argument parsing and validation are handled in cli.R
Metric execution logic is in calculate.R
Core metric implementations are defined in R/calculate_nanoseq_functions.R

Metric selection is resolved before computation.
Only the requested individual metrics and/or metric groups are evaluated.
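
As a rough illustration of what resolving a selection involves, the sketch below expands a --metrics value into the individual metric names listed above. It is a shell approximation and not the logic in calculate.R. The function name expand_metrics is hypothetical, and treating all as every individual metric plus both groups is an assumption.

```shell
# Hypothetical helper: expand a --metrics value into individual metric names.
# Runs in a subshell so IFS and variables do not leak into the caller.
expand_metrics() (
  spec="${1:-all}"
  # Assumption: "all" means every individual metric plus both groups.
  if [ "$spec" = "all" ]; then
    spec="frac_singletons,efficiency,drop_out_rate,gc,family"
  fi
  IFS=','
  for item in $spec; do
    item="$(printf '%s' "$item" | tr -d ' ')"   # tolerate "a, b"
    case "$item" in
      gc)
        printf '%s\n' gc_single gc_both gc_deviation ;;
      family)
        printf '%s\n' total_families family_mean family_median family_max \
          families_gt1 single_families paired_families paired_and_gt1 ;;
      frac_singletons|efficiency|drop_out_rate)
        printf '%s\n' "$item" ;;
      *)
        echo "unknown metric or group: $item" >&2
        exit 1 ;;
    esac
  done
)
```

For example, expand_metrics efficiency,family prints efficiency followed by the eight family metric names, one per line.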

GC metric behaviour

GC metrics are computed only when a reference genome FASTA is provided via --ref_fasta.
With the default selection (--metrics all):
- If --ref_fasta is not provided, GC metrics are skipped and a message is printed to the console.
- If --ref_fasta is provided, GC metrics are computed (values may be NA if there is insufficient data).
If GC metrics are explicitly requested (e.g. --metrics gc) but no reference FASTA is supplied, the program exits with an error.
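
The rules above can be summarised as a small decision function. This is a sketch for reasoning about the behaviour, for instance in a wrapper script that validates arguments before launching a long run. It is not taken from cli.R; the name gc_action and the returned keywords are hypothetical.

```shell
# Hypothetical helper: report what happens to GC metrics for a given
# --metrics value ($1, default "all") and --ref_fasta value ($2, may be empty).
gc_action() (
  metrics="${1:-all}"
  ref="${2:-}"
  # Wrap in commas so "gc" only matches as a whole list item.
  case ",$(printf '%s' "$metrics" | tr -d ' ')," in
    *,gc,*) requested=explicit ;;
    ,all,)  requested=default ;;
    *)      requested=no ;;
  esac
  if [ "$requested" = "no" ]; then
    echo not_requested            # GC metrics are not part of the selection
  elif [ -n "$ref" ]; then
    echo compute                  # reference supplied: GC metrics are computed
  elif [ "$requested" = "default" ]; then
    echo skip                     # default selection without reference: skipped
  else
    echo "error: GC metrics requested but no --ref_fasta supplied" >&2
    exit 1
  fi
)
```

For example, gc_action all "" prints skip, while gc_action gc "" writes an error to stderr and returns a non-zero status.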

Installation and Usage

Option A: Using Docker (Recommended)

This method packages the script and all its dependencies into a self-contained environment. It is the most reliable way to run the analysis, as it guarantees that the exact same software versions are used every time.

Requirements

Docker: You must have Docker installed and the Docker daemon running. You can download it from the Docker website.

Installation Steps

Clone the repository:

git clone https://github.com/WEHIGenomicsRnD/calculate-duplex-metrics.git

Navigate to the project directory:
```
cd calculate-duplex-metrics
```
Build the Docker image: This command reads the Dockerfile and builds a container image named calculate-duplex-metrics. This may take several minutes the first time you run it.
```
docker build -t calculate-duplex-metrics .
```
(Optional) Verify the image: You can check that the image was built successfully by listing your local Docker images.
```
docker images
```
You should see calculate-duplex-metrics in the list.

Default Usage Example

To run the tool, use the docker run command. The -v flags are required so that the Docker container can access files on your local machine.

docker run --rm \
  -v "$(pwd)/data:/app/data" \
  -v "$(pwd)/out:/app/out" \
  calculate-duplex-metrics \
  Rscript main.R \
  --input data/test.rinfo \
  --output out/default.csv

-v "$(pwd)/data:/app/data": This "mounts" your local data directory into the /app/data directory inside the container, so the script can find the input file.
-v "$(pwd)/out:/app/out": This mounts your local out directory into the /app/out directory inside the container, so the script can write the output file back to your machine.

Option B: Local Installation with `renv`

This method uses the renv package to recreate the exact development environment, using the specific package versions defined in the renv.lock file.

Requirements

R: R version 4.4.1

Installation Steps

Clone the repository and navigate into it.
Open an R console in the project's root directory.
Restore the environment: This command will install renv if needed, then install all the packages listed in renv.lock with their exact versions.
```
if (!require("renv")) install.packages("renv", repos = "https://cloud.r-project.org")
renv::restore()
```

Default Usage Example

After restoring the environment, you can run the script directly from your terminal within the cloned repository directory.

Rscript main.R \
  --input data/test.rinfo \
  --output out/default.csv

Option C: Local Installation with `devtools`

This method installs the required R packages directly onto your system. It is more flexible if you have a different version of R, but it is less reproducible as it will use the latest available package versions.

Requirements

R: Any modern version of R.
devtools R package: This is used to install packages from GitHub.

Installation Steps

Open an R console.

Install devtools and Bioconductor dependencies:

# Install devtools from CRAN
install.packages("devtools")

# Install BiocManager and required Bioconductor packages
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("Rsamtools", "GenomicRanges", "IRanges", "Biostrings"))

Install the package from GitHub:

devtools::install_github('WEHIGenomicsRnD/calculate-duplex-metrics')

Default Usage Example

After installing the dependencies, clone the repository (as in Option A) and run the script from your terminal within the cloned repository directory.

Rscript main.R \
  --input data/test.rinfo \
  --output out/default.csv

Additional Usage Examples

Example: default mode with GC enabled (requires reference genome)

Note: The reference genome FASTA is user-provided and not included in this repository. Any compatible reference genome may be used.

Rscript main.R \
  --input data/NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001.txt \
  --output out/default_with_gc.csv \
  --ref_fasta ref/Escherichia_coli_ATCC_10798.fasta

Example: select individual metrics only

Rscript main.R \
  --input data/NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001.txt \
  --output out/test_selected_metrics.csv \
  --metrics efficiency,drop_out_rate

Note: when listing multiple metrics, either omit spaces (efficiency,drop_out_rate) or quote the argument ("efficiency, drop_out_rate").

Example: select metric groups

Rscript main.R \
  --input data/NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001.txt \
  --output out/test_family_metrics.csv \
  --metrics family

Example: mixed selection (individual + group)

Rscript main.R \
  --input data/NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001.txt \
  --output out/test_mixed_metrics.csv \
  --metrics efficiency,family

Example: multiple input files

Rscript main.R \
  --input data/a.txt data/b.txt \
  --output out/all_samples_metrics.csv \
  --metrics family \
  --cores 2

CLI flags

Required:
  -i, --input        One or more input rinfo files OR a directory containing rinfo files (.txt or .txt.gz)
                     Note: when --input is a directory, the tool selects matching files using --pattern (default: \.txt(\.gz)?$);
                           when --input is a list of files, --pattern is ignored
  -o, --output       Output CSV path (long format)

Optional:
  -s, --sample       Optional sample name(s). For multiple input files, provide
                     comma-separated names matching the number of files.
                     Note: if --input is a directory, --sample is not allowed.

      --pattern      Regex pattern used to select files when --input is a directory
                     (default: \.txt(\.gz)?$)

      --rlen         Read length (default: 151)
      --skips        Trimmed / ignored bases per read (NanoSeq = 5, xGen = 8)

      --ref_fasta    Reference genome FASTA (required for GC metrics)

      --metrics      Comma-separated list of metrics and/or metric groups
                     - Individual: frac_singletons, efficiency, drop_out_rate
                     - Groups: gc, family
                     (default: all)

      --cores        Number of CPU cores for parallel processing (default: 1)

Note: when listing multiple metrics, either omit spaces (efficiency,drop_out_rate) or quote the argument ("efficiency, drop_out_rate").
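
Because --sample expects one comma-separated name per input file, it can help to build that list from the file names. The sketch below strips the directory and the .txt or .txt.gz extension. The function name sample_names is hypothetical, and this is only one possible naming scheme, not necessarily what the tool does when --sample is omitted.

```shell
# Hypothetical helper: derive comma-separated sample names from input files
# by removing the directory and a trailing .txt or .txt.gz extension.
sample_names() (
  names=""
  for f in "$@"; do
    base="$(basename "$f" | sed -E 's/\.txt(\.gz)?$//')"
    names="${names:+$names,}$base"   # join with commas, no leading comma
  done
  printf '%s\n' "$names"
)
```

The result can then be passed on the command line, e.g. --sample "$(sample_names data/a.txt data/b.txt)" alongside --input data/a.txt data/b.txt.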

Outputs

Output is written in long format:

sample,metric,value

Example:

sample,metric,value
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,frac_singletons,0.0418706803079419
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,efficiency,0.0490258329591602
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,drop_out_rate,0.320805646128878
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,total_families,23825702
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,family_mean,6.748161712309
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,family_median,5
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,family_max,50
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,families_gt1,16771629
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,single_families,6731955
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,paired_families,9994045
NanoMB1Rep1_HJK2GDSX3_CGGCTAAT-CTCGTTCT_L001,paired_and_gt1,8152302
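
Long format is convenient for appending samples, but a one-row-per-sample table is often easier to read. The following sketch pivots the output to wide format with awk. It assumes the three-column layout shown above, no quoted fields, and no commas inside sample names; the function name long_to_wide is hypothetical.

```shell
# Hypothetical helper: pivot sample,metric,value rows into one row per sample.
# Reads the CSV from the given file(s), or from stdin if none are given.
long_to_wide() {
  awk -F',' '
    FNR == 1 { next }   # skip the sample,metric,value header of each file
    {
      # Remember samples and metrics in order of first appearance.
      if (!($1 in seen_s)) { seen_s[$1] = 1; samples[++ns] = $1 }
      if (!($2 in seen_m)) { seen_m[$2] = 1; metrics[++nm] = $2 }
      value[$1, $2] = $3
    }
    END {
      printf "sample"
      for (j = 1; j <= nm; j++) printf ",%s", metrics[j]
      printf "\n"
      for (i = 1; i <= ns; i++) {
        printf "%s", samples[i]
        for (j = 1; j <= nm; j++) {
          key = samples[i] SUBSEP metrics[j]
          # Metrics missing for a sample are written as NA.
          printf ",%s", ((key in value) ? value[key] : "NA")
        }
        printf "\n"
      }
    }
  ' "$@"
}
```

For example, long_to_wide out/default.csv > out/default_wide.csv writes a header row of metric names followed by one row per sample.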

Sanity check the CLI

Rscript main.R --help

Testing

The tests check that functions return valid numeric values, handle edge cases correctly (NA, zero reads, invalid inputs), and return the expected metric names.

Requirements

Packages: testthat

To run all tests

From the project root run:

Rscript tests/testthat.R
