Benchmarks runtime vs compile-time polymorphism for compute functions in C++
- This project benchmarks Runtime Polymorphism vs. Compile-Time Polymorphism (CRTP & C++20 Concepts) across multiple compiler optimization levels (-O0, -O1, -O2, -O3).
- Without compiler optimizations (-O0), Runtime Polymorphism is the fastest due to lower template instantiation overhead.
- However, CRTP and Concepts benefit massively from compiler optimizations, while Runtime Polymorphism sees only minor improvementsβespecially for complex computations.
- With full optimization (-O3), Runtime Polymorphism has:
- 17Γ longer execution time
- 27Γ more branches
- 50Γ more instructions
- 5 orders of magnitude more memory accesses
- C++20 Concepts perform similarly to CRTP at -O3, but miss out on CRTPβs -O2 performance gains.
Important
Polymorphism performance is highly use-case dependent. While these benchmarks highlight significant trends, your mileage may vary depending on factors like CPU architecture, compiler version, optimization settings, and workload characteristics. Always profile your own use case before making big design decisions!
When discussing runtime vs compile-time polymorphism in C++, itβs often stated that compile-time polymorphism can be faster due to factors like function inlining, elimination of virtual table lookups, and better compiler optimizations. However, these discussions are frequently followed by a caveat:
Performance differences are highly use-case dependent, and profiling is required to determine the impact in a given scenario.
As I explored this, I expected to find many benchmark results from developers evaluating whether compile-time polymorphism would benefit their specific use cases. However, after searching the web, GitHub, and academic literature, I realized that hard numbers were surprisingly difficult to find β most discussions acknowledge potential differences, but few provide concrete performance data.
- Runtime polymorphism via virtual function calls.
- Compile-time polymorphism using the Curiously Recurring Template Pattern (CRTP).
- C++20 Concepts which enable compile-time type enforcement and selection, achieving behavior similar to polymorphism without inheritance β and offer much clearer compiler error messages than other uses of template programming (including CRTP).
Simple Computation β Fused Multiply-Add (FMA):
inline double ComputeFMA(double x) { return x * 1.414 + 2.718; }Complex Computation β Sin, Log and a Square Root:
inline double ComputeExpensive(double x) {
return std::sin(x) * std::log(x + 1) + std::sqrt(x);
}- O0
- O1
- O2
- O3
This project is designed with modularity and ease of modification in mind, making it a useful starting point for benchmarking compile-time vs. runtime polymorphism with other compute functions. The codebase provides a structured comparison framework that can be easily adapted to test additional operations.
- C++ compiler compatible with C++20 standard
- git
- cmake
- Miniconda or Conda (optional, )
- A Linux system with
perfinstalled and a user withsudoprivileges - Conda or Miniconda (optional, but recommended for easy creation of Python environment for running automated tests)
Clone this repository and change working directory:
https://github.com/duanegoodner/dispatch-model-benchmarks
cd dispatch-model-benchmarksCreate the build directory:
mkdir buildThen generate the build system and compile:
cmake -B build
cmake --build buildThe default compiler flags specified by CMakeLists.txt are -O3 -march=native. The following CMake confiuration options are also available.
| -D Option | Compiler Flags |
|---|---|
ENABLE_DEBUG |
-O0 -g |
ENABLE_NO_INLINE |
-fno-inline |
ENABLE_LOW_OPT |
-O1 |
ENABLE_PROFILING |
-pg |
RESET_DEFAULTS |
-O3 -march=native |
In the build command sequence, replacing this:
cmake -B buildwith this:
cmake -B build -DENABLE_PROFILING=ONwill cause the compiler flags to be -O3 -march=native -pg.
Before running benchmarks and profiling, we can verify the functionality of all core components using automated tests built with GoogleTest. The test source code is located in ./test/core/.
To run all tests, use the provided script:
./test/core/run_tests.shThe benchmark executable can be run directly from the command line.
./build/bin/benchmark --helpOutput:
Usage: ./build/bin/benchmark [polymorphism_category] [computation] [-n iterations] [-s]
- No arguments: Runs all tests with the default iteration count.
- With two arguments: Runs a specific test with the default iteration count.
- With '-n iterations': Runs all tests with a custom iteration count.
- With '-s': Saves execution time data.
Valid arguments:
------------------------
Polymorphism Categories:
------------------------
- concepts
- crtp
- runtime
Compute Functions:
------------------
- expensive
- fma
Other Options:
--help Show this help message
-n [iterations] Specify a custom iteration count
-s Save execution time data
To run all possible combinations of polymorphism type and compute function, and save the execution time data, run:
./build/bin/benchmark -sOutput:
Running: polymorphism_tests::TestRuntime FMA
Iteration Count: 1000000000
FMA Computation: Runtime Polymorphism Time = 1.52441 seconds
Running: polymorphism_tests::TestRuntime Expensive
Iteration Count: 1000000000
Expensive Computation: Runtime Polymorphism Time = 6.38158 seconds
Running: polymorphism_tests::TestCRTP FMA
Iteration Count: 1000000000
FMA Computation: CRTP Polymorphism Time = 0.378618 seconds
Running: polymorphism_tests::TestCRTP Expensive
Iteration Count: 1000000000
Expensive Computation: CRTP Polymorphism Time = 0.378513 seconds
Running: polymorphism_tests::TestConcepts FMA
Iteration Count: 1000000000
FMA Computation: C++20 Concepts Polymorphism Time = 0.378408 seconds
Running: polymorphism_tests::TestConcepts Expensive
Iteration Count: 1000000000
Expensive Computation: C++20 Concepts Polymorphism Time = 0.389357 seconds
Test results saved to: data/run_all_tests_results/2025-02-27-21-06-49-524.txtTo run a single test, we need to provide arguments identifying the polymorphism type and the compute function. For example:
./build/bin/benchmark crtp expensiveOutput:
Running: polymorphism_tests::TestCRTP Expensive
Iteration Count: 1000000000
Expensive Computation: CRTP Polymorphism Time = 0.391168 seconds./build/bin/benchmark runtime fma -n 500000Output:
Running: polymorphism_tests::TestRuntime FMA
Iteration Count: 500000
FMA Computation: Runtime Polymorphism Time = 0.00343413 secondsWe can use the Linux tool perf to gain more insight into differences among various forms of polymorphism and compute functions.
Python module ./test/profiling/multi_build_per_tester can be used to run the benchmarks and save results as both raw .txt files and as cleaned .feather files that can be easily read into a Pandas DataFrame. All Python dependencies needed to run this module are specified in environment.yml.
From the project root, run:
conda env create -f environment.yml
conda activate dispatch-model-benchmarks-envOnce our Conda environment is activated, can get command line help by running:
python test/profiling/multi_build_perf_tester.py --helpOutput:
usage: multi_build_perf_tester.py [-h] [-o OPTIMIZATION_LEVELS [OPTIMIZATION_LEVELS ...]] [-p POLYMORPHISM_TYPES [POLYMORPHISM_TYPES ...]]
[-c COMPUTE_FUNCTIONS [COMPUTE_FUNCTIONS ...]] [-r NUM_RUNS_PER_CONDITION] [-i NUM_ITERATIONS_PER_RUN]
Run performance tests for multiple build configurations.
options:
-h, --help show this help message and exit
-o OPTIMIZATION_LEVELS [OPTIMIZATION_LEVELS ...], --optimization_levels OPTIMIZATION_LEVELS [OPTIMIZATION_LEVELS ...]
List of optimization levels to test (default = O0, O1, O2, O3).
-p POLYMORPHISM_TYPES [POLYMORPHISM_TYPES ...], --polymorphism_types POLYMORPHISM_TYPES [POLYMORPHISM_TYPES ...]
List of polymorphism types to test (default = crtp, concepts, runtime).
-c COMPUTE_FUNCTIONS [COMPUTE_FUNCTIONS ...], --compute_functions COMPUTE_FUNCTIONS [COMPUTE_FUNCTIONS ...]
List of compute functions to test (default = fma, expensive).
-r NUM_RUNS_PER_CONDITION, --num_runs_per_condition NUM_RUNS_PER_CONDITION
Number of runs per condition (default = 5).
-i NUM_ITERATIONS_PER_RUN, --num_iterations_per_run NUM_ITERATIONS_PER_RUN
Number of iterations per run (default = 1000000000).
To profile our FMA compute function using C++20 Concepts and aggressive (-O3) compiler optimization, we can run:
python -o 03 -p concepts -c fma
This will initiate a fresh build of the project binary (./build/bin/benchmark) prior to running the test. The terminal output should look similar to
...
β
Build completed for -O3
Binary size for optimization levelO3: 79936 bytes (78.06 KB)
------------------------------------------------
Running perf for: Polymorphism Type = concepts
Compute Function = fma
Number of Runs = 5
Number of Iterations per Run = 1000000000
Running perf to collect detailed event data...
Results saved to: /home/duane/dproj/dispatch-model-benchmarks/data/perf/2025-03-19_15-37-38_O3/1_concepts_fma.txt
Re-running perf to collect summary data...
Results saved to: /home/duane/dproj/dispatch-model-benchmarks/data/perf/2025-03-19_15-37-38_O3/1_concepts_fma_summary.txt
Building Dataframe from detailed perf output...
DataFrame saved to /home/duane/dproj/dispatch-model-benchmarks/data/perf/2025-03-19_15-37-38_O3/perf_detailed_runs.feather
Building Dataframe from summary perf output...
DataFrame saved to /home/duane/dproj/dispatch-model-benchmarks/data/perf/2025-03-19_15-37-38_O3/perf_summary_runs.feather
Note that there are two sets of runs for each condition: one for detailed perf output and one for summary perf output (perf does not support collecting summary info and custom events in the same run).
To run all possible combinations of polymorphism type and compute function, we can run the test module without passing any:
python test/profiling/multi_build_perf_tester.pyThis will result in builds at four different optimization levels, and will all possible combinations of Polymorphism Type and Compute Function with each resulting binary.
Output from the profiling runs will be saved in a timestamped directory under ./data/perf/. The directory will include raw .txt data produced by perf as well as cleaned data in two .feather files: perf_detailed_runs.feather and perf_summary_runs.feather. Each of these can be imported into Python as a Pandas Dataframe. For example usage, see the code in./test/profiling/view_dfs.py.
Tip
The compiler optimizaton level for a particular set of runs is indicated in the output directory name and is also included in .txt files located in the output directory.
Markdown-formatted tables with some of the more interesting perf events data can be generated using Python module test/profiling/view_dfs.py. Get command line help for this simple module with:
python test/profiling/view_dfs.py --help
Output:
usage: view_dfs.py [-h] [-d DATA_DIR]
Analyze and display performance data.
options:
-h, --help show this help message and exit
-d DATA_DIR, --data_dir DATA_DIR
Name of data-containing directory under ./data/perf/ (default =
view_dfs.DEFAULT_DATA_DIR_NAME)
Note
Data in the Results section below were generated using the view_dfs module.
The specific perf events collected by ./test/profiling/perf_test.py are specified in ./tests/profiling/perf_events.json. The content of this file can be customized to collect additional perf events.
This section describes test conditions and results of perf profiling performed for all possible combinations:
- Polymorphism Type: Runtime, CRTP, C++20 Concepts
- Compute Function: FMA, Expensive
- Compiler Optimization Levels: -O0, -O1, -O2, -O3
- CPU = 13th Gen Intel(R) Core(TM) i7-13700K
- OS = Debian 12.9
- Linux Kernel = 6.1.0-31-amd64
- Compiler = GCC 12.2.0
Note
The data in the results section below is from fresh test runs including all optimization levels (-O0, -O1, -O2, -O3).
A detailed analysis and visualization of trends will be added soon.
- Runtime Polymorphism (
virtualfunctions) is consistently the slowest across all optimization levels, with the gap widening significantly for complex computations. - CRTP is consistently the fastest approach, outperforming both Runtime Polymorphism and C++20 Concepts, particularly at
-O2. - C++20 Concepts behave similarly to CRTP at
-O3but struggle at-O2, showing significantly slower execution times. - Compiler Optimization Level Matters:
- At
-O1and-O3, Concepts and CRTP perform nearly identically. - At
-O2, CRTP remains fast, but Concepts slow down dramatically.
- At
- Branch Mispredictions & Cache Misses are significantly worse for Runtime Polymorphism, making it less efficient than compile-time alternatives.
- Binary size shrinks from
-O0to-O3, with-O3generating the smallest executables while also delivering the best performance. - Loop Unrolling and Function Inlining seem to play a major role in CRTPβs advantage, particularly at
-O2where Concepts do not benefit from the same optimizations.
- Investigate assembly differences (
objdump) between Concepts and CRTP at-O2. - Examine additional compiler flags to see if Concepts can be optimized further.
- Compare binary size trends for all approaches across optimization levels.
| Test Name | Time (s) | Branches | Instructions | Micro Ops |
|---|---|---|---|---|
| Runtime FMA | 2.95447 | 6.98418e+09 | 5.48726e+10 | 6.23562e+10 |
| Runtime Expensive | 10.0609 | 3.00041e+10 | 2.021e+11 | 2.19918e+11 |
| Concepts FMA | 3.33434 | 7.00953e+09 | 5.10138e+10 | 5.94976e+10 |
| Concepts Expensive | 9.9803 | 2.99839e+10 | 1.97852e+11 | 2.14807e+11 |
| CRTP FMA | 4.5475 | 9.00312e+09 | 6.50036e+10 | 7.57248e+10 |
| CRTP Expensive | 11.8889 | 3.19839e+10 | 2.11878e+11 | 2.31614e+11 |
| Test Name | Time (s) | Branches | Instructions | Micro Ops |
|---|---|---|---|---|
| Runtime FMA | 1.51398 | 4.00473e+09 | 1.30049e+10 | 1.40098e+10 |
| Runtime Expensive | 6.7236 | 2.29037e+10 | 1.57141e+11 | 1.62151e+11 |
| Concepts FMA | 0.37788 | 1.00527e+09 | 4.00984e+09 | 3.01393e+09 |
| Concepts Expensive | 6.17581 | 2.20039e+10 | 1.55921e+11 | 1.59944e+11 |
| CRTP FMA | 0.38335 | 1.00169e+09 | 3.99376e+09 | 3.00774e+09 |
| CRTP Expensive | 0.379375 | 1.00166e+09 | 3.99497e+09 | 3.00839e+09 |
| Test Name | Time (s) | Branches | Instructions | Micro Ops |
|---|---|---|---|---|
| Runtime FMA | 1.51137 | 3.00754e+09 | 1.30168e+10 | 1.40358e+10 |
| Runtime Expensive | 6.4254 | 2.19883e+10 | 1.5375e+11 | 1.58809e+11 |
| Concepts FMA | 1.51909 | 3.00082e+09 | 1.20043e+10 | 1.20152e+10 |
| Concepts Expensive | 6.4359 | 2.20507e+10 | 1.53393e+11 | 1.57767e+11 |
| CRTP FMA | 0.38022 | 1.00184e+09 | 3.99637e+09 | 3.00877e+09 |
| CRTP Expensive | 0.378914 | 9.9946e+08 | 3.99163e+09 | 3.00997e+09 |
| Test Name | Time (s) | Branches | Instructions | Micro Ops |
|---|---|---|---|---|
| Runtime FMA | 1.51615 | 3.00083e+09 | 1.30127e+10 | 1.4032e+10 |
| Runtime Expensive | 6.4408 | 2.19913e+10 | 1.53851e+11 | 1.58795e+11 |
| Concepts FMA | 0.378842 | 1.00299e+09 | 3.99507e+09 | 3.00684e+09 |
| Concepts Expensive | 0.379948 | 1.00221e+09 | 3.99772e+09 | 3.00826e+09 |
| CRTP FMA | 0.38215 | 1.00139e+09 | 3.99694e+09 | 3.01604e+09 |
| CRTP Expensive | 0.38097 | 1.00037e+09 | 3.99236e+09 | 3.00623e+09 |
| Test Name | Branches | Branch Misses | Miss Fraction |
|---|---|---|---|
| Runtime FMA | 6.98418e+09 | 7656 | 1.09619e-06 |
| Runtime Expensive | 3.00041e+10 | 32902 | 1.09658e-06 |
| Concepts FMA | 7.00953e+09 | 4.38924e+06 | 0.000626182 |
| Concepts Expensive | 2.99839e+10 | 29983 | 9.9997e-07 |
| CRTP FMA | 9.00312e+09 | 175674 | 1.95126e-05 |
| CRTP Expensive | 3.19839e+10 | 47488 | 1.48475e-06 |
| Test Name | Branches | Branch Misses | Miss Fraction |
|---|---|---|---|
| Runtime FMA | 4.00473e+09 | 5640 | 1.40834e-06 |
| Runtime Expensive | 2.29037e+10 | 21518 | 9.39501e-07 |
| Concepts FMA | 1.00527e+09 | 956 | 9.50986e-07 |
| Concepts Expensive | 2.20039e+10 | 25431 | 1.15575e-06 |
| CRTP FMA | 1.00169e+09 | 1866 | 1.86286e-06 |
| CRTP Expensive | 1.00166e+09 | 2317 | 2.31317e-06 |
| Test Name | Branches | Branch Misses | Miss Fraction |
|---|---|---|---|
| Runtime FMA | 3.00754e+09 | 6469 | 2.15093e-06 |
| Runtime Expensive | 2.19883e+10 | 24512 | 1.11477e-06 |
| Concepts FMA | 3.00082e+09 | 3955 | 1.31797e-06 |
| Concepts Expensive | 2.20507e+10 | 18777 | 8.51538e-07 |
| CRTP FMA | 1.00184e+09 | 1249 | 1.24671e-06 |
| CRTP Expensive | 9.9946e+08 | 956 | 9.56516e-07 |
| Test Name | Branches | Branch Misses | Miss Fraction |
|---|---|---|---|
| Runtime FMA | 3.00083e+09 | 6426 | 2.1414e-06 |
| Runtime Expensive | 2.19913e+10 | 28481 | 1.2951e-06 |
| Concepts FMA | 1.00299e+09 | 2031 | 2.02495e-06 |
| Concepts Expensive | 1.00221e+09 | 3443 | 3.43542e-06 |
| CRTP FMA | 1.00139e+09 | 1766 | 1.76355e-06 |
| CRTP Expensive | 1.00037e+09 | 1886 | 1.88531e-06 |
| Test Name | mem_inst_retired.all_stores | mem_inst_retired.all_loads | cache-misses | cache-Miss Fraction |
|---|---|---|---|---|
| Runtime FMA | 1.30384e+10 | 2.30473e+10 | 75973 | 3.2964e-06 |
| Runtime Expensive | 2.19943e+10 | 6.29821e+10 | 186677 | 2.96397e-06 |
| Concepts FMA | 1.30167e+10 | 1.90293e+10 | 81142 | 4.26405e-06 |
| Concepts Expensive | 2.20129e+10 | 5.90384e+10 | 121468 | 2.05744e-06 |
| CRTP FMA | 1.70017e+10 | 2.30038e+10 | 103357 | 4.49304e-06 |
| CRTP Expensive | 2.60163e+10 | 6.30247e+10 | 126241 | 2.00304e-06 |
| Test Name | mem_inst_retired.all_stores | mem_inst_retired.all_loads | cache-misses | cache-Miss Fraction |
|---|---|---|---|---|
| Runtime FMA | 2.00067e+09 | 7.00035e+09 | 120655 | 1.72356e-05 |
| Runtime Expensive | 1.10183e+10 | 4.51358e+10 | 136076 | 3.01481e-06 |
| Concepts FMA | 113367 | 207912 | 104863 | 0.504362 |
| Concepts Expensive | 1.09808e+10 | 4.28297e+10 | 149766 | 3.49678e-06 |
| CRTP FMA | 103279 | 178090 | 100107 | 0.562115 |
| CRTP Expensive | 98230 | 181020 | 101927 | 0.56307 |
| Test Name | mem_inst_retired.all_stores | mem_inst_retired.all_loads | cache-misses | cache-Miss Fraction |
|---|---|---|---|---|
| Runtime FMA | 2.00467e+09 | 7.01407e+09 | 126230 | 1.79967e-05 |
| Runtime Expensive | 1.00014e+10 | 4.40329e+10 | 100436 | 2.28093e-06 |
| Concepts FMA | 2.00108e+09 | 5.00331e+09 | 104256 | 2.08374e-05 |
| Concepts Expensive | 9.91248e+09 | 4.16685e+10 | 120469 | 2.89113e-06 |
| CRTP FMA | 116233 | 212866 | 99111 | 0.465603 |
| CRTP Expensive | 105460 | 202577 | 89321 | 0.440924 |
| Test Name | mem_inst_retired.all_stores | mem_inst_retired.all_loads | cache-misses | cache-Miss Fraction |
|---|---|---|---|---|
| Runtime FMA | 2.00139e+09 | 7.00331e+09 | 134751 | 1.9241e-05 |
| Runtime Expensive | 9.97602e+09 | 4.39428e+10 | 129370 | 2.94406e-06 |
| Concepts FMA | 96455 | 180536 | 124091 | 0.687348 |
| Concepts Expensive | 94656 | 181633 | 131838 | 0.725848 |
| CRTP FMA | 109071 | 215144 | 112663 | 0.523663 |
| CRTP Expensive | 107460 | 201802 | 118616 | 0.587784 |