Stillwater Knowledge Processing Unit (KPU) Simulator

A high-performance C++20 functional simulator for the Stillwater KPU, a specialized hardware accelerator for knowledge processing and AI workloads. It includes Python bindings for easy testing and education.

Quick Start

Prerequisites

# Ubuntu/Debian
sudo apt install build-essential cmake ninja-build libomp-dev python3-dev python3-pip

# macOS
brew install cmake ninja libomp python3
xcode-select --install

# Windows
# Install Visual Studio 2022 with C++ support
# Install CMake and Python from official websites

Build with CMake Presets (Recommended)

The project uses CMake presets for streamlined configuration. Choose the preset for your platform:

# Linux (default: GCC with Ninja)
cmake --preset=release
cmake --build --preset=release

# Linux with Clang
cmake --preset=linux-clang
cmake --build build

# Windows (Visual Studio)
cmake --preset=windows-msvc
cmake --build build --config Release

# macOS (Xcode)
cmake --preset=macos
cmake --build build --config Release

# Debug build (with sanitizers)
cmake --preset=debug
cmake --build --preset=debug

Note: Most presets use the Ninja build system. If you don't have Ninja installed:

  • Ubuntu/Debian: sudo apt install ninja-build
  • macOS: brew install ninja
  • Windows: Download from Ninja releases or use Visual Studio preset

Alternative without Ninja:

# Use Unix Makefiles or Visual Studio instead
cmake -B build -DCMAKE_BUILD_TYPE=Release -G "Unix Makefiles"
cmake --build build

Available Configure Presets

| Preset | Description | Use Case |
| --- | --- | --- |
| release | Optimized release build | Production use |
| debug | Debug build with sanitizers | Development & debugging |
| minimal | Core components only | Minimal installation |
| full | All features enabled (CUDA, OpenCL, docs) | Full-featured build |
| linux-gcc | Linux with GCC | Linux development |
| linux-clang | Linux with Clang | Linux with Clang toolchain |
| windows-msvc | Windows with MSVC | Windows development |
| macos | macOS with Xcode | macOS development |

Advanced Build Options

CMake presets automatically configure sensible defaults. To customize:

# With domain_flow integration (local installation)
cmake --preset=release -DKPU_DOMAIN_FLOW_LOCAL_PATH=~/dev/domain_flow
cmake --build --preset=release

# Minimal build (no tests, examples, or Python)
cmake --preset=minimal
cmake --build build

# Full build with all features
cmake --preset=full
cmake --build --preset=full

# Custom configuration
cmake -B build -DCMAKE_BUILD_TYPE=Release \
  -DKPU_BUILD_PYTHON_BINDINGS=ON \
  -DKPU_ENABLE_OPENMP=ON \
  -GNinja
cmake --build build

Configuration Options

| Option | Default | Description |
| --- | --- | --- |
| KPU_BUILD_TESTS | ON | Build test suite |
| KPU_BUILD_EXAMPLES | ON | Build examples |
| KPU_BUILD_TOOLS | ON | Build development tools |
| KPU_BUILD_PYTHON_BINDINGS | ON | Build Python bindings |
| KPU_BUILD_BENCHMARKS | ON | Build benchmark suite |
| KPU_BUILD_MODELS | ON | Build architecture models |
| KPU_BUILD_DOCS | OFF | Build documentation |
| KPU_ENABLE_OPENMP | ON | Enable OpenMP parallelization |
| KPU_ENABLE_CUDA | OFF | Enable CUDA support |
| KPU_ENABLE_OPENCL | OFF | Enable OpenCL support |
| KPU_ENABLE_PROFILING | OFF | Enable profiling support |
| KPU_ENABLE_SANITIZERS | OFF | Enable sanitizers (debug builds) |

Install CLI Tools

After building, install the CLI tools to a local ./bin directory for easy access:

# Build and install all tools
./scripts/install-tools.sh

# Or install only DFG tools
./scripts/install-tools.sh dfg

# Clean installed tools
./scripts/install-tools.sh clean

Installed tools:

| Tool | Description |
| --- | --- |
| kpu-dfg-gen | Generate Data Flow Graphs from templates |
| kpu-dfg-sched | Schedule DFG nodes using ASAP/ALAP/LIST algorithms |
| kpu-dfg-compile | Compile scheduled DFG to BlockMover programs |
| kpu-dfg-viz | Export to DOT, Chrome Trace, Mermaid formats |
| kpu-dfg-analyze | Statistics, critical path analysis, validation |
| kpu-runner | Run compiled models |
| kpu-config | Configuration management |
| kpu-kpubin-disasm | Disassemble .kpubin files |
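
For readers unfamiliar with the ASAP and ALAP strategies named for kpu-dfg-sched, the following is a minimal textbook sketch on a toy graph. It assumes unit-latency nodes and uses hypothetical node names; it is not the scheduler's actual implementation, and its data structures are illustrative only.

```python
from collections import defaultdict

def asap_alap(edges, nodes):
    """Return (asap, alap) start steps for a DAG with unit-latency nodes."""
    preds = defaultdict(list)
    succs = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)
        succs[src].append(dst)

    # Topological order via Kahn's algorithm.
    indegree = {n: len(preds[n]) for n in nodes}
    ready = [n for n in nodes if indegree[n] == 0]
    order = []
    while ready:
        n = ready.pop()
        order.append(n)
        for s in succs[n]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")

    # ASAP: start as soon as all predecessors have finished.
    asap = {}
    for n in order:
        asap[n] = max((asap[p] + 1 for p in preds[n]), default=0)

    # ALAP: start as late as possible without lengthening the schedule.
    last_step = max(asap.values())
    alap = {}
    for n in reversed(order):
        alap[n] = min((alap[s] - 1 for s in succs[n]), default=last_step)
    return asap, alap

# Toy graph (hypothetical): two loads feed a multiply; a third load joins at the add.
nodes = ["load_a", "load_b", "load_c", "mul", "add"]
edges = [("load_a", "mul"), ("load_b", "mul"), ("mul", "add"), ("load_c", "add")]
asap, alap = asap_alap(edges, nodes)
slack = {n: alap[n] - asap[n] for n in nodes}
print(slack)
```

Nodes with zero slack lie on the critical path; here only load_c can be delayed by one step without lengthening the schedule.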

Three ways to use installed tools:

# 1. Source environment (interactive sessions)
source scripts/env.sh
kpu-dfg-gen --help

# 2. Include in pipeline scripts
#!/bin/bash
source "$(dirname "$0")/../../scripts/kpu-env.sh" || exit 1
kpu-dfg-gen --template matmul -M 1024 -N 1024 -K 1024 -o dfg.json

# 3. Direct path
./bin/kpu-dfg-gen --help

Example DFG pipeline:

# Generate → Schedule → Compile → Analyze
kpu-dfg-gen --template matmul -M 1024 -N 1024 -K 1024 --tiles 4x4x4 -o dfg.json
kpu-dfg-sched -i dfg.json -o scheduled.json --algorithm ASAP
kpu-dfg-compile -i scheduled.json -o programs.json
kpu-dfg-analyze -i dfg.json --stats --critical-path

# Generate timeline for Perfetto visualization
kpu-dfg-viz -i scheduled.json -o timeline.json --format chrome-trace

See docs/dfg-toolchain.md for complete documentation.
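
When the pipeline is driven from Python rather than a shell script, the same steps can be assembled as argument lists. This is a small sketch that only builds and prints the commands; the helper name dfg_pipeline_commands is hypothetical, and the flags are copied from the shell example above rather than from the tools' full option set.

```python
import shlex

def dfg_pipeline_commands(m, n, k, tiles="4x4x4", algorithm="ASAP"):
    """Build argv lists for the generate, schedule, compile, and analyze steps."""
    return [
        ["kpu-dfg-gen", "--template", "matmul",
         "-M", str(m), "-N", str(n), "-K", str(k),
         "--tiles", tiles, "-o", "dfg.json"],
        ["kpu-dfg-sched", "-i", "dfg.json", "-o", "scheduled.json",
         "--algorithm", algorithm],
        ["kpu-dfg-compile", "-i", "scheduled.json", "-o", "programs.json"],
        ["kpu-dfg-analyze", "-i", "dfg.json", "--stats", "--critical-path"],
    ]

commands = dfg_pipeline_commands(512, 512, 512)
for argv in commands:
    # To execute a step, pass argv to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Keeping each step as a list avoids shell quoting issues and makes it easy to stop the pipeline at the first failing tool.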

Install Python Package

# After building with Python bindings enabled
pip3 install -e .

# Or use the Python module directly from build directory
export PYTHONPATH=$PWD/build:$PYTHONPATH
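
A quick way to confirm that either installation route worked is to ask Python whether it can locate the module. The sketch below uses only the standard library; the helper name kpu_bindings_available is hypothetical, and the module name stillwater_kpu is taken from the Python examples in this README.

```python
import importlib.util

def kpu_bindings_available(module_name="stillwater_kpu"):
    """Return True if the module can be located on the current import path."""
    return importlib.util.find_spec(module_name) is not None

if kpu_bindings_available():
    print("stillwater_kpu found")
else:
    print("stillwater_kpu not found; check the pip install or PYTHONPATH")
```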

domain_flow Integration

The simulator integrates with the domain_flow intermediate representation for computational graphs.

Build with domain_flow

# Option 1: Local installation (recommended for development)
cmake --preset=release -DKPU_DOMAIN_FLOW_LOCAL_PATH=~/dev/domain_flow
cmake --build --preset=release

# Option 2: FetchContent (automatic download - requires CMake 3.28+)
cmake --preset=release  # Automatically fetches domain_flow from GitHub
cmake --build --preset=release

# Option 3: JSON-only mode (no domain_flow dependency)
cmake --preset=release -DKPU_USE_DOMAIN_FLOW=OFF
cmake --build --preset=release

Load Computational Graphs

#include <iostream>
#include <sw/compiler/graph_loader.hpp>

// Load from domain_flow native format (.dfg)
auto graph = sw::kpu::compiler::load_graph("models/mobilenet_v1.dfg");

// Or load from JSON format (alternative to the .dfg call above)
// auto graph = sw::kpu::compiler::load_graph("models/simple_matmul.json");

// Inspect graph
std::cout << "Graph: " << graph->name << "\n";
std::cout << "Operators: " << graph->operators.size() << "\n";
std::cout << "Tensors: " << graph->tensors.size() << "\n";

// Validate graph structure
if (graph->validate()) {
    std::cout << "✓ Graph is valid\n";
}

For more details, see domain_flow integration guide.

Architecture Overview

The KPU simulator models a specialized hardware accelerator with:

Core Components

  • Memory Hierarchy: External memory (1GB), L3 tile, L2 banks, L1 buffers, scratchpad
  • Data Movement Engines:
    • DMA engine for asynchronous transfers
    • BlockMover for block-level data movement
    • Streamer for stream-based data movement
  • Compute Engines:
    • ComputeFabric for general-purpose compute
    • SystolicArray (tau111_s001) for matrix multiplication
  • Compiler Infrastructure: Graph loader, operator mapping, schedule generation (WIP)
  • Configuration System: JSON-based system configuration
  • Trace Logger: Performance tracing and analysis

Key Features

  • Modern C++20: RAII, smart pointers, concepts, ranges
  • Thread-Safe Design: All components support concurrent access
  • Comprehensive Testing: 30/30 tests passing with CTest integration
  • Python Integration: NumPy-compatible Python API via pybind11
  • Cross-Platform: Windows, Linux, macOS support

Usage Examples

C++ API

Basic System Setup

#include <sw/system/toplevel.hpp>
#include <sw/kpu/kpu_simulator.hpp>

// Option 1: Create with default configuration
sw::sim::SystemSimulator system;
system.initialize();

// Option 2 (alternative to Option 1): Load from JSON configuration file
// sw::sim::SystemSimulator system("configs/default_kpu.json");
// system.initialize();

// Get KPU instance
auto* kpu = system.get_kpu(0);

Simple KPU Configuration

#include <sw/kpu/kpu_simulator.hpp>

// Create KPU with custom configuration
sw::kpu::KPUSimulator::Config config(
    2,      // 2 memory banks
    1024,   // 1GB each
    100,    // 100 GB/s bandwidth
    2,      // 2 scratchpads
    64,     // 64KB each
    2,      // 2 compute tiles
    2       // 2 DMA engines
);

sw::kpu::KPUSimulator kpu(config);

// Check configuration
std::cout << "Using systolic arrays: " << kpu.is_using_systolic_arrays() << "\n";
std::cout << "Systolic array size: "
          << kpu.get_systolic_array_rows() << "x"
          << kpu.get_systolic_array_cols() << "\n";

Matrix Multiplication

See examples/basic/matrix_multiply.cpp for a complete example.

Data Movement Pipeline

See examples/basic/data_movement_pipeline.cpp for DMA and data orchestration examples.

Python API

Basic Usage

import stillwater_kpu as kpu
import numpy as np

# Create simulator with context manager
with kpu.Simulator() as sim:
    print(f"Main memory: {sim.main_memory_size // (1024**3)} GB")
    print(f"Scratchpad: {sim.scratchpad_size // (1024**2)} MB")

    # Matrix multiplication
    A = np.random.randn(100, 200).astype(np.float32)
    B = np.random.randn(200, 150).astype(np.float32)

    # KPU computation
    C = sim.matmul(A, B)

    # Verify against NumPy
    C_numpy = A @ B
    assert np.allclose(C, C_numpy), "Results don't match!"
    print("✓ Results match NumPy reference")

Performance Benchmarking

import stillwater_kpu as kpu

with kpu.Simulator() as sim:
    # Benchmark matrix multiplication
    results = sim.benchmark_matmul(
        M=256, N=256, K=256,
        iterations=10
    )

    print(f"Matrix size: {results['matrix_size']}")
    print(f"KPU time: {results['kpu_time_ms']:.2f} ms")
    print(f"NumPy time: {results['numpy_time_ms']:.2f} ms")
    print(f"KPU GFLOPS: {results['kpu_gflops']:.2f}")
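
To sanity-check a reported GFLOPS figure by hand, the usual convention counts 2*M*N*K floating-point operations for an M x K by K x N product. The sketch below applies that convention; whether benchmark_matmul uses exactly this formula is an assumption, and matmul_gflops is a hypothetical helper, not part of the stillwater_kpu API.

```python
def matmul_gflops(m, n, k, time_ms):
    """GFLOPS for an M x K by K x N product, counting 2*M*N*K operations."""
    if time_ms <= 0:
        raise ValueError("time_ms must be positive")
    flops = 2.0 * m * n * k  # one multiply and one add per inner-product term
    return flops / (time_ms * 1e-3) / 1e9

# A 256 x 256 x 256 product that takes 1 ms corresponds to about 33.55 GFLOPS.
print(f"{matmul_gflops(256, 256, 256, time_ms=1.0):.2f} GFLOPS")
```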

Advanced Examples

See the examples/python/ directory for:

  • Neural network layer computation
  • Performance scaling analysis
  • Matrix chain optimization
  • Educational demonstrations

Testing

Run Tests with CTest

# Run all KPU simulator tests (excludes external domain_flow tests)
cd build
ctest --output-on-failure

# Or use the helper script
./scripts/run_tests.sh

# Run specific test
ctest -R graph_loader -V

# Run with specific preset
ctest --preset=unit          # Unit tests only
ctest --preset=integration   # Integration tests only
ctest --preset=performance   # Performance tests only

Test Categories

The test suite includes 30 comprehensive tests:

| Category | Tests | Coverage |
| --- | --- | --- |
| Memory | 4 | Allocation, sparse memory, memory maps |
| DMA | 6 | Basic transfers, performance, tensor movement, tracing |
| Data Movement | 4 | BlockMover, Streamer operations |
| Compute | 3 | ComputeFabric, SystolicArray |
| Storage Scheduler | 5 | EDDO/IDDO workflows, performance |
| System | 3 | Configuration, formatting |
| Integration | 3 | End-to-end, multi-component, Python bindings |
| Compiler | 1 | Graph loader |

Status: 30/30 PASSING ✅

Excluding External Tests

The build system includes tests from external dependencies (domain_flow). To run only KPU tests:

# Recommended: Exclude external tests
ctest --test-dir build -E "^(dsp_|nla_|dfa_|dnn_|ctl_|cnn_)" --output-on-failure
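
The -E argument is a regular expression matched against test names, and the leading ^ restricts the match to name prefixes. The following sketch reproduces that filter in Python to show which names would be kept; the test names are hypothetical and used only for illustration.

```python
import re

# Same pattern as the ctest -E argument above.
EXCLUDE = re.compile(r"^(dsp_|nla_|dfa_|dnn_|ctl_|cnn_)")

def kpu_only(test_names):
    """Keep the names that the exclude pattern does not match."""
    return [name for name in test_names if not EXCLUDE.search(name)]

# Hypothetical test names, for illustration only.
names = ["dsp_fir_filter", "cnn_conv2d", "memory_allocation", "dma_basic", "graph_loader"]
print(kpu_only(names))
```

Because the pattern is anchored, a name that merely contains one of the prefixes later in the string is not excluded.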

Project Structure

KPU-simulator/
├── include/sw/              # Public C++ headers
│   ├── system/              # System simulator (toplevel, config)
│   ├── kpu/                 # KPU components
│   ├── memory/              # Memory hierarchy
│   ├── compute/             # Compute engines
│   ├── datamovement/        # Data movement (DMA)
│   ├── compiler/            # Graph loader & compiler
│   ├── driver/              # Memory manager
│   └── trace/               # Tracing infrastructure
├── src/                     # Implementation
│   ├── system/              # System implementation
│   ├── components/          # Component implementations
│   ├── compiler/            # Graph loader implementation
│   ├── bindings/            # C and Python bindings
│   ├── simulator/           # Core simulator
│   └── driver/              # Driver implementation
├── tools/                   # CLI tools
│   ├── dfg/                 # DFG toolchain (gen, sched, compile, viz, analyze)
│   ├── runner/              # Model runner
│   ├── configuration/       # Configuration tools
│   └── analysis/            # Analysis tools (disassembler)
├── tests/                   # Test suite (30 tests)
├── examples/                # C++ and Python examples
│   ├── basic/               # Basic C++ examples
│   ├── pipelines/           # Pipeline scripts using CLI tools
│   └── python/              # Python examples
├── bin/                     # Installed CLI tools (created by install-tools.sh)
├── docs/                    # Documentation (50+ files)
├── cmake/                   # CMake modules
├── scripts/                 # Build and utility scripts
│   ├── install-tools.sh     # Install CLI tools to ./bin
│   ├── env.sh               # Environment setup (interactive)
│   └── kpu-env.sh           # Environment setup (for scripts)
├── configs/                 # Configuration files
├── test_graphs/             # Test computational graphs
└── CMakePresets.json        # CMake presets configuration

Performance Characteristics

Compute Performance

  • Algorithm: Standard GEMM (General Matrix Multiply) with systolic array support
  • Parallelization: OpenMP for matrices > 1024 elements
  • Precision: Single-precision floating-point (float32)
  • Memory: Optimized cache-friendly access patterns
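
As an illustration of what a cache-friendly GEMM access pattern means, the sketch below multiplies two small matrices one tile at a time, so each tile of the operands is reused while it is still hot. It is a plain-Python teaching example with a hypothetical function name, not the simulator's kernel, and it ignores OpenMP and float32 details.

```python
def blocked_matmul(a, b, block=2):
    """Multiply a (M x K) by b (K x N), given as lists of rows, one tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            for k0 in range(0, k, block):
                # Accumulate the contribution of one K-tile into one C-tile.
                for i in range(i0, min(i0 + block, m)):
                    for j in range(j0, min(j0 + block, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + block, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
print(blocked_matmul(a, b))  # [[58.0, 64.0], [139.0, 154.0]]
```

The tile size is the tuning knob: it should be chosen so that the tiles in flight fit in the fastest level of memory available.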

Memory Hierarchy

  • External Memory: 1GB (configurable)
  • L3 Tile: Main working memory
  • L2 Banks: Mid-level cache
  • L1 Buffers: Fast scratch memory
  • Scratchpad: Software-managed (1MB default)

Data Movement

  • DMA Engine: Asynchronous transfers with address-based API
  • BlockMover: Efficient block-level data movement
  • Streamer: Stream-based data orchestration

API Reference

C++ Classes

sw::sim::SystemSimulator

Top-level system simulator that manages all components.

Methods:

  • SystemSimulator() - Create with default configuration
  • SystemSimulator(const SystemConfig& config) - Create with specific configuration
  • SystemSimulator(const std::filesystem::path& config_file) - Load from JSON
  • bool initialize() - Initialize simulator
  • bool is_initialized() const - Check initialization status
  • sw::kpu::KPUSimulator* get_kpu(size_t index) - Get KPU by index
  • void print_config() const - Print configuration summary
  • void shutdown() - Cleanup resources

sw::kpu::KPUSimulator

KPU accelerator simulator.

Configuration:

struct Config {
    size_t memory_bank_count;
    size_t memory_bank_size_mb;
    double memory_bandwidth_gbps;
    size_t scratchpad_count;
    size_t scratchpad_size_kb;
    size_t compute_tile_count;
    size_t dma_engine_count;
};
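
Because the fields mix megabytes, kilobytes, and GB/s, it can help to work out the totals a given configuration implies. The sketch below mirrors the field names in a hypothetical Python dataclass (KpuConfigSketch is not part of the library) and assumes binary units, that is 1 MB = 1024 * 1024 bytes and 1 KB = 1024 bytes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KpuConfigSketch:
    """Hypothetical mirror of the C++ Config fields, for unit bookkeeping only."""
    memory_bank_count: int
    memory_bank_size_mb: int
    memory_bandwidth_gbps: float
    scratchpad_count: int
    scratchpad_size_kb: int
    compute_tile_count: int
    dma_engine_count: int

    @property
    def total_memory_bytes(self) -> int:
        # Assumes 1 MB = 1024 * 1024 bytes.
        return self.memory_bank_count * self.memory_bank_size_mb * 1024 * 1024

    @property
    def total_scratchpad_bytes(self) -> int:
        # Assumes 1 KB = 1024 bytes.
        return self.scratchpad_count * self.scratchpad_size_kb * 1024

# Same values as the "Simple KPU Configuration" example.
cfg = KpuConfigSketch(2, 1024, 100.0, 2, 64, 2, 2)
print(cfg.total_memory_bytes // 1024**3, "GiB of bank memory")
print(cfg.total_scratchpad_bytes // 1024, "KiB of scratchpad")
```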

Methods:

  • KPUSimulator(const Config& config) - Create with configuration
  • bool is_using_systolic_arrays() const - Check systolic array usage
  • size_t get_systolic_array_rows() const - Get the number of systolic array rows
  • size_t get_systolic_array_cols() const - Get the number of systolic array columns

sw::kpu::compiler::GraphLoader

Computational graph loader.

Functions:

  • std::unique_ptr<ComputationalGraph> load_graph(const std::string& path) - Load graph from .dfg or .json

Python Classes

stillwater_kpu.Simulator

Python wrapper for KPU simulator.

Methods:

  • __init__(main_memory_size=1<<30, scratchpad_size=1<<20) - Create simulator
  • matmul(A, B) - Matrix multiplication (NumPy arrays)
  • benchmark_matmul(M, N, K, iterations=10) - Performance benchmark
  • __enter__() / __exit__() - Context manager support

Properties:

  • main_memory_size - Main memory size in bytes
  • scratchpad_size - Scratchpad size in bytes

Error Handling

C++ Exceptions

  • std::out_of_range - Memory access violations
  • std::invalid_argument - Invalid parameters
  • std::runtime_error - Resource allocation failures

Python Exceptions

  • KPUMemoryError - Memory access errors
  • KPUDimensionError - Matrix dimension mismatches
  • KPUError - General simulator errors
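
The sketch below shows one way such exceptions are typically handled. It defines local stand-in classes so that it runs without the bindings installed; the assumption that KPUDimensionError derives from KPUError is not confirmed by this README, and check_matmul_shapes is a hypothetical helper.

```python
class KPUError(Exception):
    """Stand-in for the general simulator error (not the real class)."""

class KPUDimensionError(KPUError):
    """Stand-in for a matrix dimension mismatch (subclassing is an assumption)."""

def check_matmul_shapes(shape_a, shape_b):
    """Return the result shape, or raise if the inner dimensions differ."""
    (m, k_a), (k_b, n) = shape_a, shape_b
    if k_a != k_b:
        raise KPUDimensionError(f"cannot multiply {shape_a} by {shape_b}")
    return (m, n)

try:
    check_matmul_shapes((100, 200), (150, 200))
except KPUError as exc:  # catching the base class also catches the subclass
    print(f"caught {type(exc).__name__}: {exc}")
```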

Development Status

Current Version: 0.1.0 (Beta)

Completed ✅

  • Core simulator architecture
  • Memory hierarchy implementation
  • DMA and data movement engines
  • Compute fabric and systolic arrays
  • Python bindings with NumPy integration
  • Comprehensive test suite (30/30 passing)
  • domain_flow graph loading
  • Configuration system
  • Trace logging

In Progress 🚧

  • Schedule generation from computational graphs
  • Tensor metadata extraction
  • Framework importers (ONNX, PyTorch, JAX)
  • Optimization passes

Planned 📋

  • Additional data types (int8, int16, bfloat16)
  • Advanced operations (convolution, activation functions)
  • Real-time performance monitoring UI
  • Multi-KPU distributed simulation

Documentation

Contributing

We welcome contributions! Please follow these guidelines:

  1. Code Style: Follow modern C++20 best practices
  2. Thread Safety: Maintain thread-safe design for shared components
  3. Testing: Add comprehensive tests for new features (use CTest)
  4. Documentation: Update documentation for API changes
  5. Performance: Benchmark performance impact of modifications

Development Workflow

# Create feature branch
git checkout -b feature/your-feature

# Build with debug preset
cmake --preset=debug
cmake --build --preset=debug

# Run tests
ctest --preset=default --output-on-failure

# Submit pull request

Testing Guide

Running Tests

Run All KPU-Simulator Tests (Recommended)

Exclude domain_flow's own tests, which may fail independently of the simulator:

# From project root
ctest --test-dir build -E "^(dsp_|nla_|dfa_|dnn_|ctl_|cnn_)" --output-on-failure

# Or from build directory
cd build
ctest -E "^(dsp_|nla_|dfa_|dnn_|ctl_|cnn_)" --output-on-failure

Result: 30/30 tests pass ✅

Run All Tests (Including domain_flow)

ctest --test-dir build --output-on-failure

Note: This will include 12 domain_flow tests (tests #1-12) which may fail. These failures are from the external domain_flow library and do not affect KPU-simulator functionality.

Run Specific Test Suites

# Memory tests only
ctest --test-dir build -R "memory" -V

# DMA tests only
ctest --test-dir build -R "dma" -V

# Graph loader tests
ctest --test-dir build -R "graph_loader" -V

# Storage scheduler tests
ctest --test-dir build -R "storage" -V

# Integration tests
ctest --test-dir build -R "integration" -V

Run Single Test

ctest --test-dir build -R "test_name" -V

Test Categories

KPU-Simulator Tests (30 tests)

  • System Tests: Configuration, formatting
  • Memory Tests: Allocation, sparse memory, memory map
  • DMA Tests: Basic, performance, tensor movement, tracing
  • Block Mover Tests: Basic operations, tracing
  • Streamer Tests: Basic operations, tracing
  • Compute Tests: Basic fabric operations, systolic array
  • Storage Scheduler Tests: IDDO, EDDO workflows, performance
  • Integration Tests: End-to-end, multi-component, Python bindings

Domain Flow Tests (12 tests - external)

These are from the domain_flow library dependency:

  • Tests #1-12, with name prefixes dsp_, nla_, dfa_, dnn_, ctl_, cnn_

Note: These tests may fail or not run properly. They test domain_flow functionality, not KPU-simulator.

CI/CD Integration

GitHub Actions

Add to .github/workflows/*.yml:

- name: Run tests
  run: |
    ctest --test-dir build -E "^(dsp_|nla_|dfa_|dnn_|ctl_|cnn_)" --output-on-failure

Or use the exclude pattern file:

- name: Run tests
  run: |
    ctest --test-dir build -E "$(cat .github/workflows/test-exclude-pattern.txt)" --output-on-failure

Test Results Summary

# Expected results when excluding domain_flow tests:
100% tests passed, 0 tests failed out of 30

Total Test time (real) =  ~15 sec

Troubleshooting

All Tests Fail

  • Check build succeeded: cmake --build build
  • Verify working directory: run from project root or use --test-dir build

Graph Loader Tests Skip

  • Run: scripts/copy_domain_flow_graphs.sh
  • This copies .dfg test files from domain_flow dependency

Python Tests Fail

  • Verify Python bindings built: check for stillwater_kpu.*.so in build output
  • Check Python environment matches build (Python 3.12 expected)

Memory Tests Fail

  • May need larger system memory
  • Some tests validate sparse memory allocation

Performance Benchmarks

Some tests include performance benchmarks:

  • dma_performance_test: DMA throughput
  • storage_scheduler_performance_test: EDDO command processing
  • end_to_end: Full system performance

Run with:

ctest --test-dir build -R "performance" -V

License

This project is released under the MIT License. See LICENSE file for details.


Stillwater Computing, Inc. Accelerating Innovation (TM)

Version: 0.1.0
Build System: CMake 3.20+ with presets
Language: C++20
Python Support: Python 3.8-3.12
