A pan-transcriptome indexing and analysis toolkit for long-read sequencing data. Atroplex builds a spatial index with graph structure from multiple annotation sources, tracks features across samples, and provides detailed analysis of exon/segment sharing, splicing complexity, and isoform diversity.
docker pull <dockerhub-username>/atroplex:latest
docker run --rm -v $(pwd):/data atroplex build -b /data/annotation.gtf --stats -o /data/output/Or build the image locally:
docker build -t atroplex .
docker run --rm -v $(pwd):/data atroplex --helpRequires CMake 3.14+, a C++20 compiler, and htslib installed via pkg-config.
cmake -B build -S .
cmake --build build
./build/atroplex --help- cxxopts (v3.2.1): Command-line argument parsing (fetched automatically)
- genogrove (v0.14.0): Genomic interval data structures and graph structure (fetched automatically)
- htslib: Reading BAM/SAM/FASTQ files (system dependency via pkg-config)
- zlib: Compression support
# Build index from a single annotation file with summary statistics
atroplex build -b annotation.gtf --stats
# Build pan-transcriptome from multiple sources via manifest
atroplex build -m manifest.tsv --stats
# Run full per-sample analysis (sharing, splicing hubs, diversity)
atroplex analyze -m manifest.tsv -o results/
# Discover novel transcripts from long-read data
atroplex discover -i reads.bam -m manifest.tsvBuilds a genogrove index from one or more annotation files. Each input file is treated as a separate entry in the pan-transcriptome, with features deduplicated across files and sample/source provenance tracked on every exon and segment.
# From manifest (full metadata per sample)
atroplex build -m ENCODE/manifest.tsv --stats -o results/
# From annotation files directly (metadata parsed from GFF headers)
atroplex build -b gencode.gtf -b sample1.gtf -b sample2.gtf --stats
# Both combined
atroplex build -m manifest.tsv -b extra_annotation.gtf --statsUse --stats to write a compact index summary to the output directory.
Performs detailed per-sample analysis including Jaccard isoform diversity, exon/segment sharing statistics, and splicing hub detection. This is more expensive than --stats and produces multiple output files organized in subfolders.
# Full analysis from manifest
atroplex analyze -m ENCODE/manifest.tsv -o results/
# Full analysis from annotation files
atroplex analyze -b gencode.gtf -b sample1.gtf -o results/Clusters aligned long reads by splice junction signature and matches them against the index to classify transcripts as known, compatible, or novel.
atroplex discover -i reads.bam -m manifest.tsv -o results/(Under development)
| Option | Description |
|---|---|
-o, --output-dir |
Output directory (default: input file directory) |
-t, --threads |
Number of threads (default: 1) |
-s, --stats |
Write compact index summary |
--progress |
Show progress output |
-m, --manifest |
Sample manifest file (TSV) |
-b, --build-from |
Build from GFF/GTF file(s) |
-g, --genogrove |
Load pre-built genogrove index (.gg) |
-k, --order |
Genogrove tree order (default: 3) |
A tab-separated file specifying input files with metadata. This is the recommended way to build a pan-transcriptome with full sample tracking.
file id type assay biosample condition species platform pipeline description
gencode.v49.gtf GENCODE_v49 annotation . . . Homo sapiens . GENCODE Reference annotation
sample1.gtf HL60_M1_rep1 sample RNA-seq HL-60 M1 macrophage Homo sapiens PacBio Sequel II TALON HL-60 M1 replicate 1
sample2.gtf HL60_M1_rep2 sample RNA-seq HL-60 M1 macrophage Homo sapiens PacBio Sequel II TALON HL-60 M1 replicate 2- Tab-separated,
"."for empty values (VCF convention) filecolumn is required; all others optionaltype:"annotation"for reference annotations,"sample"for experimental data (default:"sample")- Column names are case-insensitive
- Relative paths resolved from manifest directory
idauto-generated from filename if not provided
The type field matters: only entries with type = "sample" count toward "conserved" thresholds (features present in ALL samples). Reference annotations participate in structural analysis but don't inflate sample counts.
When using --build-from without a manifest, metadata is parsed from GFF/GTF headers:
##id: GENCODE_v49_GRCh38
##description: Evidence-based annotation of the human genome
##species: Homo sapiens
##annotation_source: GENCODE
##annotation_version: v49
##type: annotation
Each exon feature must have gene_id and transcript_id in column 9.
Expression values are automatically parsed from transcript-level GFF entries. Priority order: counts > TPM > FPKM > RPKM > cov. Detected values are:
- Stored on segments (transcript-level)
- Propagated to exons (accumulated across transcripts sharing the exon)
- The expression type (counts, TPM, etc.) is recorded on the sample and appears in output column headers
Use scripts/inject_expression.py to add expression counts from TALON quantification TSV files into GTF files before building.
Atroplex normalizes chromosome names to UCSC/GENCODE style automatically:
| Input | Normalized |
|---|---|
1, 2, ... 22 |
chr1, chr2, ... chr22 |
X, Y |
chrX, chrY |
MT |
chrM |
chr1 (already prefixed) |
chr1 (unchanged) |
This allows mixing annotations from different sources (e.g., GENCODE + Ensembl).
Written directly to the output directory:
{basename}.index_stats.txt
A compact text summary with:
- Overview (chromosomes, genes, transcripts, segments, exons, edges)
- Input entries listed by name and type
- Genes by biotype
- Transcripts per gene / exons per segment distributions
- Per-chromosome breakdown
All output goes to {output-dir}/analysis/ with organized subfolders:
analysis/
{basename}.analysis.txt Full text report
{basename}.sample_stats.csv Per-sample metrics (CSV)
{basename}.source_stats.csv Per-source metrics (CSV)
sharing/
{basename}.exon_sharing.tsv Exon sharing summary
{basename}.segment_sharing.tsv Segment sharing summary
{basename}.conserved_exons.tsv Conserved exon detail
splicing_hubs/
{basename}.splicing_hubs.tsv Splicing hub exons
{basename}.branch_details.tsv Per-branch detail
Human-readable report with all sections: overview, biotype breakdown, distributions, exon sharing (constitutive/alternative), segment sharing (conserved/shared/sample-specific), graph structure, isoform diversity (Jaccard), per-sample comparison table, per-source comparison table, and top splicing hubs.
Samples as columns, metrics as rows. Includes metadata rows (source_file, assay, biosample, etc.) followed by statistics: segments, exclusive segments, exons, exclusive exons, genes, transcripts, isoform diversity, deduplication ratio, mean expression, expressed segments.
GFF sources (column 2: HAVANA, ENSEMBL, TALON, etc.) as columns, metrics as rows: segments, exclusive segments, exons, exclusive exons, genes.
Metrics as rows, samples as columns, plus a total column with global values:
| Metric | Description |
|---|---|
total |
Total exons in this sample |
exclusive |
Exons only in this sample |
shared |
Exons in 2+ but not all samples |
conserved |
Exons in ALL samples |
constitutive |
Exons in all transcripts of their gene |
alternative |
Exons in only some transcripts of their gene |
Same format as exon sharing:
| Metric | Description |
|---|---|
total |
Total segments in this sample |
exclusive |
Segments only in this sample |
shared |
Segments in 2+ but not all samples |
conserved |
Segments in ALL samples |
One row per exon that is present in all samples. Provides per-sample transcript counts and expression values for cross-sample comparison of core exons.
| Column | Description |
|---|---|
exon_id |
Exon identifier |
gene_name |
Gene symbol |
gene_id |
Gene identifier |
chromosome |
Chromosome |
coordinate |
Genomic coordinate (chr:strand:start-end) |
n_transcripts |
Total transcripts using this exon |
constitutive |
yes if present in all transcripts of the gene, no if alternative |
{sample}.transcripts |
Number of transcripts using this exon in the sample |
{sample}.{expr_type} |
Expression value (e.g., .counts, .TPM); only for sample-type entries |
- Expression columns only appear for entries with
type = "sample"(not annotations) - Missing expression shown as
.(present but not quantified) - Sorted by chromosome then coordinate
Exons with more than 10 unique downstream exon targets, indicating complex alternative splicing decision points. One row per hub exon.
| Column | Description |
|---|---|
gene_name |
Gene symbol |
gene_id |
Gene identifier |
exon_id |
Hub exon identifier |
coordinate |
Genomic coordinate |
exon_number |
1-based position in a representative transcript chain |
total_exons |
Total exons in the representative segment |
total_branches |
Total unique downstream targets across all samples |
total_transcripts |
Total transcripts using this hub exon |
Per-sample columns (one set per entry):
| Column | Description |
|---|---|
{sample}.branches |
Unique downstream targets in this sample |
{sample}.shared |
Branches also present in at least one other sample |
{sample}.unique |
Branches only in this sample |
{sample}.transcripts |
Transcripts using this hub exon in this sample |
{sample}.entropy |
Shannon entropy of branch usage: H = -Sigma(f_i * log2(f_i)) |
{sample}.psi |
Traditional PSI (Percent Spliced In): hub transcripts / gene transcripts |
{sample}.{expr_type} |
Expression at hub exon (sample-type entries only) |
- Entropy measures splicing complexity at the junction. Higher entropy means more evenly distributed branch usage; 0 means all transcripts take the same path.
- PSI measures exon inclusion rate. PSI = 1.0 means all transcripts of the gene include this exon.
- Missing values shown as
.(hub exon not present in that sample).
One row per (hub exon, downstream target) pair. Shows how transcript flow is distributed across individual branch targets.
| Column | Description |
|---|---|
hub_gene_name |
Gene symbol |
hub_gene_id |
Gene identifier |
hub_exon_id |
Hub exon identifier |
hub_coordinate |
Hub genomic coordinate |
target_exon_id |
Downstream target exon identifier |
target_coordinate |
Target genomic coordinate |
{sample}.fraction |
Branch usage fraction: target transcripts / hub transcripts |
{sample}.{expr_type} |
Expression at target exon (sample-type entries only) |
- Fraction values sum to ~1.0 across all targets for a given hub and sample
.indicates the target is not present in that sample
{input}.atroplex.tsv Per-cluster match results
{input}.atroplex.summary.txt Match statistics
Atroplex builds a combined index from multiple annotation sources. Each input file (reference annotation or sample assembly) is registered with metadata, and every feature (exon, segment) tracks which samples and sources it came from. Features at identical coordinates are deduplicated:
- Exons: Deduplicated by genomic coordinates (chromosome + strand + start + end)
- Segments: Deduplicated by exon structure (ordered list of exon coordinates within a transcript)
-
Segments are spatially indexed transcript paths used for coarse queries. A segment represents one or more transcripts that share the same exon structure. Segments participate in interval tree queries.
-
Exons are graph-only external keys used for fine-grained verification. They are linked into chains via edges and are not spatially indexed.
The graph connects segments to their first exon (SEGMENT_TO_EXON edge), and exons to subsequent exons (EXON_TO_EXON edges).
Expression values can come from two sources:
- Transcript-level (current): Auto-detected from GFF transcript entries (counts, TPM, FPKM, RPKM, cov). Stored on segments and propagated to exons by accumulation across transcripts.
- Exon-level (future): Direct per-exon quantification (e.g., from short-read counting). Stored directly on exon features.
When transcript-level expression is propagated to exons, it is accumulated — if an exon appears in 3 transcripts with counts 100, 50, and 200, the exon's expression for that sample will be 350. For constitutive exons (present in all transcripts of a gene), this equals total gene expression.
Entries are classified as "annotation" (reference catalogs like GENCODE) or "sample" (experimental assemblies). This distinction affects:
- Conserved/sharing statistics: Only sample-type entries count toward "all samples" thresholds
- Expression output: Expression columns only appear for sample-type entries in TSV output files
- Column headers: Expression columns include the detected type (e.g.,
SAMPLE1.counts,SAMPLE2.TPM)
Injects expression counts from TALON quantification TSV into GTF files:
python scripts/inject_expression.py \
input.gtf counts.tsv output.expr.gtf \
--header id=SAMPLE_001 \
--header species="Homo sapiens"Matches on talon_transcript attribute in GTF to transcript_ID in TSV. Adds counts "N" attribute to transcript lines.
Generates multi-panel visualization from analysis output:
Rscript scripts/visualize_stats.R <stats_dir> [output_prefix]Requires: ggplot2, tidyr, dplyr, scales, patchwork. Produces PDF and PNG.
GPLv3. See LICENSE for details.