A high-performance C++20 functional simulator for the Stillwater KPU - a specialized hardware accelerator for knowledge processing and AI workloads. Features Python bindings for easy testing and education.
# Ubuntu/Debian
sudo apt install build-essential cmake ninja-build libomp-dev python3-dev python3-pip
# macOS
brew install cmake ninja libomp python3
xcode-select --install
# Windows
# Install Visual Studio 2022 with C++ support
# Install CMake and Python from official websites

The project uses CMake presets for streamlined configuration. Choose the preset for your platform:
# Linux (default: GCC with Ninja)
cmake --preset=release
cmake --build --preset=release
# Linux with Clang
cmake --preset=linux-clang
cmake --build build
# Windows (Visual Studio)
cmake --preset=windows-msvc
cmake --build build --config Release
# macOS (Xcode)
cmake --preset=macos
cmake --build build --config Release
# Debug build (with sanitizers)
cmake --preset=debug
cmake --build --preset=debug

Note: Most presets use the Ninja build system. If you don't have Ninja installed:
- Ubuntu/Debian: `sudo apt install ninja-build`
- macOS: `brew install ninja`
- Windows: Download from Ninja releases or use Visual Studio preset

Alternative without Ninja:
# Use Unix Makefiles or Visual Studio instead
cmake -B build -DCMAKE_BUILD_TYPE=Release -G "Unix Makefiles"
cmake --build build

| Preset | Description | Use Case |
|---|---|---|
| `release` | Optimized release build | Production use |
| `debug` | Debug build with sanitizers | Development & debugging |
| `minimal` | Core components only | Minimal installation |
| `full` | All features enabled (CUDA, OpenCL, docs) | Full-featured build |
| `linux-gcc` | Linux with GCC | Linux development |
| `linux-clang` | Linux with Clang | Linux with Clang toolchain |
| `windows-msvc` | Windows with MSVC | Windows development |
| `macos` | macOS with Xcode | macOS development |

CMake presets automatically configure sensible defaults. To customize:
# With domain_flow integration (local installation)
cmake --preset=release -DKPU_DOMAIN_FLOW_LOCAL_PATH=~/dev/domain_flow
cmake --build --preset=release
# Minimal build (no tests, examples, or Python)
cmake --preset=minimal
cmake --build build
# Full build with all features
cmake --preset=full
cmake --build --preset=full
# Custom configuration
cmake -B build -DCMAKE_BUILD_TYPE=Release \
-DKPU_BUILD_PYTHON_BINDINGS=ON \
-DKPU_ENABLE_OPENMP=ON \
-GNinja
cmake --build build

| Option | Default | Description |
|---|---|---|
| `KPU_BUILD_TESTS` | ON | Build test suite |
| `KPU_BUILD_EXAMPLES` | ON | Build examples |
| `KPU_BUILD_TOOLS` | ON | Build development tools |
| `KPU_BUILD_PYTHON_BINDINGS` | ON | Build Python bindings |
| `KPU_BUILD_BENCHMARKS` | ON | Build benchmark suite |
| `KPU_BUILD_MODELS` | ON | Build architecture models |
| `KPU_BUILD_DOCS` | OFF | Build documentation |
| `KPU_ENABLE_OPENMP` | ON | Enable OpenMP parallelization |
| `KPU_ENABLE_CUDA` | OFF | Enable CUDA support |
| `KPU_ENABLE_OPENCL` | OFF | Enable OpenCL support |
| `KPU_ENABLE_PROFILING` | OFF | Enable profiling support |
| `KPU_ENABLE_SANITIZERS` | OFF | Enable sanitizers (debug builds) |

After building, install the CLI tools to a local ./bin directory for easy access:
# Build and install all tools
./scripts/install-tools.sh
# Or install only DFG tools
./scripts/install-tools.sh dfg
# Clean installed tools
./scripts/install-tools.sh clean

Installed tools:

| Tool | Description |
|---|---|
| `kpu-dfg-gen` | Generate Data Flow Graphs from templates |
| `kpu-dfg-sched` | Schedule DFG nodes using ASAP/ALAP/LIST algorithms |
| `kpu-dfg-compile` | Compile scheduled DFG to BlockMover programs |
| `kpu-dfg-viz` | Export to DOT, Chrome Trace, Mermaid formats |
| `kpu-dfg-analyze` | Statistics, critical path analysis, validation |
| `kpu-runner` | Run compiled models |
| `kpu-config` | Configuration management |
| `kpu-kpubin-disasm` | Disassemble .kpubin files |

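
To give a feel for what ASAP ("as soon as possible") scheduling means for a DFG, here is a minimal Python sketch. It is not the `kpu-dfg-sched` implementation: the dictionary-based graph format, the node names, and the unit latency per node are all assumptions made for illustration.

```python
from collections import deque

def asap_schedule(deps):
    """Return {node: earliest_step} for a DAG given as {node: [predecessors]}.

    Assumes every node takes one step (hypothetical unit latency).
    """
    # Build successor lists and in-degree counts (Kahn's algorithm).
    succs = {n: [] for n in deps}
    indeg = {n: len(preds) for n, preds in deps.items()}
    for node, preds in deps.items():
        for p in preds:
            succs[p].append(node)

    step = {n: 0 for n in deps}
    ready = deque(n for n, d in indeg.items() if d == 0)
    visited = 0
    while ready:
        n = ready.popleft()
        visited += 1
        for s in succs[n]:
            # A node can start one step after its latest predecessor.
            step[s] = max(step[s], step[n] + 1)
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    if visited != len(deps):
        raise ValueError("dependency graph contains a cycle")
    return step

# Hypothetical tiny DFG: two loads feed a matmul, which feeds a store.
example = {
    "load_a": [],
    "load_b": [],
    "matmul": ["load_a", "load_b"],
    "store_c": ["matmul"],
}
schedule = asap_schedule(example)
```

ALAP scheduling would run the same traversal from the sinks backwards, and list scheduling would additionally respect a limit on how many nodes may run per step.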
Three ways to use installed tools:
# 1. Source environment (interactive sessions)
source scripts/env.sh
kpu-dfg-gen --help
# 2. Include in pipeline scripts
#!/bin/bash
source "$(dirname "$0")/../../scripts/kpu-env.sh" || exit 1
kpu-dfg-gen --template matmul -M 1024 -N 1024 -K 1024 -o dfg.json
# 3. Direct path
./bin/kpu-dfg-gen --help

Example DFG pipeline:
# Generate → Schedule → Compile → Analyze
kpu-dfg-gen --template matmul -M 1024 -N 1024 -K 1024 --tiles 4x4x4 -o dfg.json
kpu-dfg-sched -i dfg.json -o scheduled.json --algorithm ASAP
kpu-dfg-compile -i scheduled.json -o programs.json
kpu-dfg-analyze -i dfg.json --stats --critical-path
# Generate timeline for Perfetto visualization
kpu-dfg-viz -i scheduled.json -o timeline.json --format chrome-trace

See docs/dfg-toolchain.md for complete documentation.

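
For readers unfamiliar with the Chrome Trace Event Format that Perfetto can load, the sketch below builds a tiny timeline by hand. The fields written by `kpu-dfg-viz` are not documented in this README, so the op names, the `lane` key, and the timings here are hypothetical; only the generic format (`traceEvents`, `ph`, `ts`, `dur`, `pid`, `tid`) is standard.

```python
import json

def to_chrome_trace(ops):
    """Convert scheduled ops into a Chrome Trace Event Format document."""
    events = []
    for op in ops:
        events.append({
            "name": op["name"],
            "ph": "X",             # "X" marks a complete event (start + duration)
            "ts": op["start_us"],  # timestamps are in microseconds
            "dur": op["dur_us"],
            "pid": 0,              # a single simulated process
            "tid": op["lane"],     # one viewer row per resource
        })
    return {"traceEvents": events}

# Hypothetical schedule: a load on lane 0 followed by a matmul on lane 1.
ops = [
    {"name": "load_a", "start_us": 0, "dur_us": 5, "lane": 0},
    {"name": "matmul", "start_us": 5, "dur_us": 20, "lane": 1},
]
trace = to_chrome_trace(ops)
trace_json = json.dumps(trace, indent=2)
```

Writing `trace_json` to a file produces something that trace viewers accepting this format should be able to open.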
# After building with Python bindings enabled
pip3 install -e .
# Or use the Python module directly from build directory
export PYTHONPATH=$PWD/build:$PYTHONPATH

The simulator integrates with the domain_flow intermediate representation for computational graphs.
# Option 1: Local installation (recommended for development)
cmake --preset=release -DKPU_DOMAIN_FLOW_LOCAL_PATH=~/dev/domain_flow
cmake --build --preset=release
# Option 2: FetchContent (automatic download - requires CMake 3.28+)
cmake --preset=release # Automatically fetches domain_flow from GitHub
cmake --build --preset=release
# Option 3: JSON-only mode (no domain_flow dependency)
cmake --preset=release -DKPU_USE_DOMAIN_FLOW=OFF
cmake --build --preset=release

#include <iostream>
#include <sw/compiler/graph_loader.hpp>

// Load from domain_flow native format (.dfg)
auto graph = sw::kpu::compiler::load_graph("models/mobilenet_v1.dfg");
// Or load from JSON format (alternative; use instead of the call above)
// auto graph = sw::kpu::compiler::load_graph("models/simple_matmul.json");
// Inspect graph
std::cout << "Graph: " << graph->name << "\n";
std::cout << "Operators: " << graph->operators.size() << "\n";
std::cout << "Tensors: " << graph->tensors.size() << "\n";
// Validate graph structure
if (graph->validate()) {
    std::cout << "✓ Graph is valid\n";
}

For more details, see domain_flow integration guide.

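
As a rough idea of what a structural validation pass over a graph might check, the following Python sketch verifies that every operator only references declared tensors. The JSON layout used here (`tensors` as a list of names, `operators` with `inputs` and `outputs`) is a hypothetical schema invented for the example; it is not the domain_flow format, and the checks performed by `graph->validate()` may differ.

```python
def validate_graph(graph):
    """Return a list of problems; an empty list means the graph passes this sketch's checks."""
    problems = []
    tensors = set(graph.get("tensors", []))
    for op in graph.get("operators", []):
        # Every tensor an operator reads or writes must be declared.
        for t in op.get("inputs", []) + op.get("outputs", []):
            if t not in tensors:
                problems.append(f"{op['name']}: unknown tensor {t!r}")
    return problems

# Hypothetical graphs: one consistent, one referencing an undeclared tensor.
good = {
    "name": "simple_matmul",
    "tensors": ["A", "B", "C"],
    "operators": [{"name": "matmul0", "inputs": ["A", "B"], "outputs": ["C"]}],
}
bad = {
    "name": "broken_matmul",
    "tensors": ["A", "B", "C"],
    "operators": [{"name": "matmul0", "inputs": ["A", "B"], "outputs": ["D"]}],
}
good_problems = validate_graph(good)
bad_problems = validate_graph(bad)
```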
The KPU simulator models a specialized hardware accelerator with:
- Memory Hierarchy: External memory (1GB), L3 tile, L2 banks, L1 buffers, scratchpad
- Data Movement Engines:
- DMA engine for asynchronous transfers
- BlockMover for block-level data movement
- Streamer for stream-based data movement
- Compute Engines:
- ComputeFabric for general-purpose compute
- SystolicArray (tau111_s001) for matrix multiplication
- Compiler Infrastructure: Graph loader, operator mapping, schedule generation (WIP)
- Configuration System: JSON-based system configuration
- Trace Logger: Performance tracing and analysis
- Modern C++20: RAII, smart pointers, concepts, ranges
- Thread-Safe Design: All components support concurrent access
- Comprehensive Testing: 30/30 tests passing with CTest integration
- Python Integration: NumPy-compatible Python API via pybind11
- Cross-Platform: Windows, Linux, macOS support
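
To illustrate how a systolic array computes a matrix product, here is a small cycle-level Python sketch of a generic output-stationary array with one processing element (PE) per output element. This README does not define what `tau111_s001` denotes, so the sketch makes no claim about that configuration or about the simulator's `SystolicArray` model; the input skew and the cycle count are properties of this simplified model only.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a @ b.

    PE(i, j) accumulates c[i][j]. Inputs are skewed so that operand pair
    (a[i][p], b[p][j]) reaches PE(i, j) at cycle t = p + i + j.
    Returns the product and the number of cycles simulated.
    """
    m, k, n = len(a), len(a[0]), len(b[0])
    c = [[0] * n for _ in range(m)]
    # The last PE (m-1, n-1) consumes its final operand at cycle index
    # m + n + k - 3, so m + n + k - 2 cycles are needed in total.
    cycles = m + n + k - 2
    for t in range(cycles):
        for i in range(m):
            for j in range(n):
                p = t - i - j  # which operand pair arrives this cycle
                if 0 <= p < k:
                    c[i][j] += a[i][p] * b[p][j]
    return c, cycles

product, cycles_used = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

For the 2x2 example the model needs 4 cycles, compared with 8 multiply-accumulate steps on a single sequential unit.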
#include <sw/system/toplevel.hpp>
#include <sw/kpu/kpu_simulator.hpp>
// Option 1: Create with default configuration
sw::sim::SystemSimulator system;
system.initialize();
// Option 2 (alternative to Option 1): Load from JSON configuration file
// sw::sim::SystemSimulator system("configs/default_kpu.json");
// system.initialize();
// Get KPU instance
auto* kpu = system.get_kpu(0);

#include <iostream>
#include <sw/kpu/kpu_simulator.hpp>
// Create KPU with custom configuration
sw::kpu::KPUSimulator::Config config(
2, // 2 memory banks
1024, // 1GB each
100, // 100 GB/s bandwidth
2, // 2 scratchpads
64, // 64KB each
2, // 2 compute tiles
2 // 2 DMA engines
);
sw::kpu::KPUSimulator kpu(config);
// Check configuration
std::cout << "Using systolic arrays: " << kpu.is_using_systolic_arrays() << "\n";
std::cout << "Systolic array size: "
<< kpu.get_systolic_array_rows() << "x"
<< kpu.get_systolic_array_cols() << "\n";

See examples/basic/matrix_multiply.cpp for a complete example.
See examples/basic/data_movement_pipeline.cpp for DMA and data orchestration examples.
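
The capacities implied by the custom `Config` values shown above can be checked with a little arithmetic. The Python dataclass below only mirrors the field names of the C++ `Config` struct for illustration; it is not part of the `stillwater_kpu` module, and it assumes that "MB" and "KB" mean binary units (1024-based).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KpuConfigSketch:
    """Local stand-in mirroring the fields of KPUSimulator::Config."""
    memory_bank_count: int
    memory_bank_size_mb: int
    memory_bandwidth_gbps: float
    scratchpad_count: int
    scratchpad_size_kb: int
    compute_tile_count: int
    dma_engine_count: int

    def total_memory_bytes(self):
        # Assumes 1 MB = 1024 * 1024 bytes.
        return self.memory_bank_count * self.memory_bank_size_mb * 1024 * 1024

    def total_scratchpad_bytes(self):
        # Assumes 1 KB = 1024 bytes.
        return self.scratchpad_count * self.scratchpad_size_kb * 1024

# Same values as the C++ example: 2 banks of 1024 MB, 2 scratchpads of 64 KB.
cfg = KpuConfigSketch(2, 1024, 100.0, 2, 64, 2, 2)
total_memory = cfg.total_memory_bytes()
total_scratchpad = cfg.total_scratchpad_bytes()
```

Under these assumptions the example configuration provides 2 GiB of banked memory and 128 KiB of scratchpad in total.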
import stillwater_kpu as kpu
import numpy as np
# Create simulator with context manager
with kpu.Simulator() as sim:
    print(f"Main memory: {sim.main_memory_size // (1024**3)} GB")
    print(f"Scratchpad: {sim.scratchpad_size // (1024**2)} MB")
    # Matrix multiplication
    A = np.random.randn(100, 200).astype(np.float32)
    B = np.random.randn(200, 150).astype(np.float32)
    # KPU computation
    C = sim.matmul(A, B)
    # Verify against NumPy
    C_numpy = A @ B
    assert np.allclose(C, C_numpy), "Results don't match!"
    print("✓ Results match NumPy reference")

import stillwater_kpu as kpu
with kpu.Simulator() as sim:
    # Benchmark matrix multiplication
    results = sim.benchmark_matmul(
        M=256, N=256, K=256,
        iterations=10
    )
    print(f"Matrix size: {results['matrix_size']}")
    print(f"KPU time: {results['kpu_time_ms']:.2f} ms")
    print(f"NumPy time: {results['numpy_time_ms']:.2f} ms")
    print(f"KPU GFLOPS: {results['kpu_gflops']:.2f}")

See the examples/python/ directory for:
- Neural network layer computation
- Performance scaling analysis
- Matrix chain optimization
- Educational demonstrations
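
When reading the GFLOPS figure reported by the benchmark, it helps to know the conventional way such a number is derived: a dense M x K by K x N product costs 2·M·N·K floating-point operations (one multiply and one add per inner-loop step). The helper below applies that convention; whether `benchmark_matmul` uses exactly this formula is an assumption, and `matmul_gflops` is a hypothetical name.

```python
def matmul_gflops(m, n, k, elapsed_ms):
    """Conventional GFLOPS estimate for a dense matmul that took elapsed_ms."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    flops = 2 * m * n * k            # one multiply + one add per (i, j, p)
    seconds = elapsed_ms * 1e-3
    return flops / seconds / 1e9

# A 256x256x256 product performs 33,554,432 floating-point operations.
flop_count = 2 * 256 * 256 * 256
rate_at_1ms = matmul_gflops(256, 256, 256, elapsed_ms=1.0)
```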
# Run all KPU simulator tests (excludes external domain_flow tests)
cd build
ctest --output-on-failure
# Or use the helper script
./scripts/run_tests.sh
# Run specific test
ctest -R graph_loader -V
# Run with specific preset
ctest --preset=unit # Unit tests only
ctest --preset=integration # Integration tests only
ctest --preset=performance # Performance tests only

The test suite includes 30 comprehensive tests:

| Category | Tests | Coverage |
|---|---|---|
| Memory | 4 | Allocation, sparse memory, memory maps |
| DMA | 6 | Basic transfers, performance, tensor movement, tracing |
| Data Movement | 4 | BlockMover, Streamer operations |
| Compute | 3 | ComputeFabric, SystolicArray |
| Storage Scheduler | 5 | EDDO/IDDO workflows, performance |
| System | 3 | Configuration, formatting |
| Integration | 3 | End-to-end, multi-component, Python bindings |
| Compiler | 1 | Graph loader |
Status: 30/30 PASSING ✅
The build system includes tests from external dependencies (domain_flow). To run only KPU tests:
# Recommended: Exclude external tests
ctest --test-dir build -E "^(dsp_|nla_|dfa_|dnn_|ctl_|cnn_)" --output-on-failure

KPU-simulator/
├── include/sw/            # Public C++ headers
│   ├── system/            # System simulator (toplevel, config)
│   ├── kpu/               # KPU components
│   ├── memory/            # Memory hierarchy
│   ├── compute/           # Compute engines
│   ├── datamovement/      # Data movement (DMA)
│   ├── compiler/          # Graph loader & compiler
│   ├── driver/            # Memory manager
│   └── trace/             # Tracing infrastructure
├── src/                   # Implementation
│   ├── system/            # System implementation
│   ├── components/        # Component implementations
│   ├── compiler/          # Graph loader implementation
│   ├── bindings/          # C and Python bindings
│   ├── simulator/         # Core simulator
│   └── driver/            # Driver implementation
├── tools/                 # CLI tools
│   ├── dfg/               # DFG toolchain (gen, sched, compile, viz, analyze)
│   ├── runner/            # Model runner
│   ├── configuration/     # Configuration tools
│   └── analysis/          # Analysis tools (disassembler)
├── tests/                 # Test suite (30 tests)
├── examples/              # C++ and Python examples
│   ├── basic/             # Basic C++ examples
│   ├── pipelines/         # Pipeline scripts using CLI tools
│   └── python/            # Python examples
├── bin/                   # Installed CLI tools (created by install-tools.sh)
├── docs/                  # Documentation (50+ files)
├── cmake/                 # CMake modules
├── scripts/               # Build and utility scripts
│   ├── install-tools.sh   # Install CLI tools to ./bin
│   ├── env.sh             # Environment setup (interactive)
│   └── kpu-env.sh         # Environment setup (for scripts)
├── configs/               # Configuration files
├── test_graphs/           # Test computational graphs
└── CMakePresets.json      # CMake presets configuration

- Algorithm: Standard GEMM (General Matrix Multiply) with systolic array support
- Parallelization: OpenMP for matrices > 1024 elements
- Precision: Single-precision floating-point (float32)
- Memory: Optimized cache-friendly access patterns
- External Memory: 1GB (configurable)
- L3 Tile: Main working memory
- L2 Banks: Mid-level cache
- L1 Buffers: Fast scratch memory
- Scratchpad: Software-managed (1MB default)
- DMA Engine: Asynchronous transfers with address-based API
- BlockMover: Efficient block-level data movement
- Streamer: Stream-based data orchestration
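
As an illustration of what "cache-friendly access patterns" can mean for a standard GEMM, the sketch below uses the i-k-j loop order, which walks rows of both `b` and `c` contiguously in a row-major layout. It is a pure-Python teaching example with a hypothetical function name, not the simulator's kernel, and the loop order the simulator actually uses is not stated in this README.

```python
def gemm_ikj(a, b):
    """Compute a @ b for row-major nested lists using i-k-j loop order."""
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions do not match")
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        row_c = c[i]
        for p in range(k):
            a_ip = a[i][p]   # reused across the whole inner loop
            row_b = b[p]     # row p of b is read sequentially
            for j in range(n):
                row_c[j] += a_ip * row_b[j]
    return c

result = gemm_ikj([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Compared with the textbook i-j-k order, the innermost loop here never strides down a column of `b`, which is the access pattern that tends to miss in cache for row-major data.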
Top-level system simulator that manages all components.
Methods:
- `SystemSimulator()` - Create with default configuration
- `SystemSimulator(const SystemConfig& config)` - Create with specific configuration
- `SystemSimulator(const std::filesystem::path& config_file)` - Load from JSON
- `bool initialize()` - Initialize simulator
- `bool is_initialized() const` - Check initialization status
- `sw::kpu::KPUSimulator* get_kpu(size_t index)` - Get KPU by index
- `void print_config() const` - Print configuration summary
- `void shutdown()` - Cleanup resources

KPU accelerator simulator.
Configuration:
struct Config {
size_t memory_bank_count;
size_t memory_bank_size_mb;
double memory_bandwidth_gbps;
size_t scratchpad_count;
size_t scratchpad_size_kb;
size_t compute_tile_count;
size_t dma_engine_count;
};

Methods:
- `KPUSimulator(const Config& config)` - Create with configuration
- `bool is_using_systolic_arrays() const` - Check systolic array usage
- `size_t get_systolic_array_rows() const` - Get systolic array row count
- `size_t get_systolic_array_cols() const` - Get systolic array column count

Computational graph loader.
Functions:
std::unique_ptr<ComputationalGraph> load_graph(const std::string& path)- Load graph from .dfg or .json
Python wrapper for KPU simulator.
Methods:
- `__init__(main_memory_size=1<<30, scratchpad_size=1<<20)` - Create simulator
- `matmul(A, B)` - Matrix multiplication (NumPy arrays)
- `benchmark_matmul(M, N, K, iterations=10)` - Performance benchmark
- `__enter__() / __exit__()` - Context manager support

Properties:
- `main_memory_size` - Main memory size in bytes
- `scratchpad_size` - Scratchpad size in bytes

- `std::out_of_range` - Memory access violations
- `std::invalid_argument` - Invalid parameters
- `std::runtime_error` - Resource allocation failures

- `KPUMemoryError` - Memory access errors
- `KPUDimensionError` - Matrix dimension mismatches
- `KPUError` - General simulator errors

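
The sketch below shows one way such exception types are commonly organized and caught in Python. The classes are local stand-ins defined for the example; that `KPUMemoryError` and `KPUDimensionError` derive from `KPUError` is an assumption, as is the hypothetical `matmul_result_shape` helper. Check the actual `stillwater_kpu` module before relying on this hierarchy.

```python
class KPUError(Exception):
    """General simulator error (local stand-in for this sketch)."""

class KPUMemoryError(KPUError):
    """Memory access error (assumed to derive from KPUError)."""

class KPUDimensionError(KPUError):
    """Matrix dimension mismatch (assumed to derive from KPUError)."""

def matmul_result_shape(a_shape, b_shape):
    """Return the shape of a @ b, or raise KPUDimensionError."""
    if len(a_shape) != 2 or len(b_shape) != 2:
        raise KPUDimensionError("both operands must be 2-D")
    if a_shape[1] != b_shape[0]:
        raise KPUDimensionError(
            f"inner dimensions differ: {a_shape[1]} vs {b_shape[0]}"
        )
    return (a_shape[0], b_shape[1])

# Catching the base class handles every more specific error as well.
try:
    matmul_result_shape((100, 200), (150, 200))
    caught = None
except KPUError as err:
    caught = type(err).__name__

ok_shape = matmul_result_shape((100, 200), (200, 150))
```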
Current Version: 0.1.0 (Beta)
- Core simulator architecture
- Memory hierarchy implementation
- DMA and data movement engines
- Compute fabric and systolic arrays
- Python bindings with NumPy integration
- Comprehensive test suite (30/30 passing)
- domain_flow graph loading
- Configuration system
- Trace logging
- Schedule generation from computational graphs
- Tensor metadata extraction
- Framework importers (ONNX, PyTorch, JAX)
- Optimization passes
- Additional data types (int8, int16, bfloat16)
- Advanced operations (convolution, activation functions)
- Real-time performance monitoring UI
- Multi-KPU distributed simulation
- Developer Setup Guide - Development environment setup
- Quick Start Guide - domain_flow integration quick start
- Architecture Specification - Detailed KPU architecture
- DFG Toolchain - CLI tools for DFG generation, scheduling, compilation
- domain_flow Integration - Computational graph integration
- Configuration Guide - System configuration
- Data Orchestration - Data movement details
- Tracing System - Performance tracing
We welcome contributions! Please follow these guidelines:
- Code Style: Follow modern C++20 best practices
- Thread Safety: Maintain thread-safe design for shared components
- Testing: Add comprehensive tests for new features (use CTest)
- Documentation: Update documentation for API changes
- Performance: Benchmark performance impact of modifications
# Create feature branch
git checkout -b feature/your-feature
# Build with debug preset
cmake --preset=debug
cmake --build --preset=debug
# Run tests
ctest --preset=default --output-on-failure
# Submit pull request

Exclude domain_flow's own tests which may fail independently:
# From project root
ctest --test-dir build -E "^(dsp_|nla_|dfa_|dnn_|ctl_|cnn_)" --output-on-failure
# Or from build directory
cd build
ctest -E "^(dsp_|nla_|dfa_|dnn_|ctl_|cnn_)" --output-on-failure

Result: 30/30 tests pass ✅

ctest --test-dir build --output-on-failure

Note: This will include 12 domain_flow tests (tests #1-12) which may fail. These failures are from the external domain_flow library and do not affect KPU-simulator functionality.
# Memory tests only
ctest --test-dir build -R "memory" -V
# DMA tests only
ctest --test-dir build -R "dma" -V
# Graph loader tests
ctest --test-dir build -R "graph_loader" -V
# Storage scheduler tests
ctest --test-dir build -R "storage" -V
# Integration tests
ctest --test-dir build -R "integration" -V

ctest --test-dir build -R "test_name" -V

- System Tests: Configuration, formatting
- Memory Tests: Allocation, sparse memory, memory map
- DMA Tests: Basic, performance, tensor movement, tracing
- Block Mover Tests: Basic operations, tracing
- Streamer Tests: Basic operations, tracing
- Compute Tests: Basic fabric operations, systolic array
- Storage Scheduler Tests: IDDO, EDDO workflows, performance
- Integration Tests: End-to-end, multi-component, Python bindings
These are from the domain_flow library dependency:
- Tests #1-12: dsp_, nla_, dfa_, dnn_, ctl_, cnn_
Note: These tests may fail or not run properly. They test domain_flow functionality, not KPU-simulator.
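
The exclusion pattern passed to `ctest -E` is a regular expression anchored at the start of the test name, so only names that begin with one of the listed prefixes are dropped. The Python sketch below applies the same pattern to a few names to show the effect. CTest uses its own regex engine, but for a simple anchored alternation like this one Python's `re` behaves the same way. Apart from `dma_performance_test` and `end_to_end`, the test names are hypothetical.

```python
import re

# Same pattern as in the ctest -E invocations in this document.
EXCLUDE = re.compile(r"^(dsp_|nla_|dfa_|dnn_|ctl_|cnn_)")

def kept_tests(names):
    """Return the names that would survive the exclusion pattern."""
    return [n for n in names if not EXCLUDE.search(n)]

names = [
    "dsp_fir",               # hypothetical domain_flow test, excluded
    "cnn_conv",              # hypothetical domain_flow test, excluded
    "dma_performance_test",  # kept
    "end_to_end",            # kept
    "memory_dnn_check",      # hypothetical; kept because "dnn_" is not at the start
]
kept = kept_tests(names)
```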
Add to .github/workflows/*.yml:
- name: Run tests
  run: |
    ctest --test-dir build -E "^(dsp_|nla_|dfa_|dnn_|ctl_|cnn_)" --output-on-failure

Or use the exclude pattern file:
- name: Run tests
  run: |
    ctest --test-dir build -E "$(cat .github/workflows/test-exclude-pattern.txt)" --output-on-failure

# Expected results when excluding domain_flow tests:
100% tests passed, 0 tests failed out of 30
Total Test time (real) = ~15 sec

- Check build succeeded: `cmake --build build`
- Verify working directory: run from project root or use `--test-dir build`

- Run: `scripts/copy_domain_flow_graphs.sh`
- This copies .dfg test files from domain_flow dependency

- Verify Python bindings built: check for `stillwater_kpu.*.so` in build output
- Check Python environment matches build (Python 3.12 expected)

- May need larger system memory
- Some tests validate sparse memory allocation

Some tests include performance benchmarks:
- `dma_performance_test`: DMA throughput
- `storage_scheduler_performance_test`: EDDO command processing
- `end_to_end`: Full system performance

Run with:
ctest --test-dir build -R "performance" -V

This project is released under the MIT License. See LICENSE file for details.

Stillwater Computing, Inc. Accelerating Innovation (TM)
Version: 0.1.0
Build System: CMake 3.20+ with presets
Language: C++20
Python Support: Python 3.8-3.12