High Performance Silicon Whispering Computer Mathematics
STAΛKER is a high-performance, header-only C++17 linear algebra library. It combines data-level (SIMD) and thread-level parallelism with modern template metaprogramming techniques to obtain excellent performance, especially in memory-bound data operations. It can perform similarly to, or significantly better than, the STL, OpenBLAS, and Eigen across a broad range of operations and sizes. Users are free to tune and experiment with almost every aspect of the performance optimizations at compile time, according to their needs and hardware.
The motivation behind STAΛKER is to navigate The Zone of low-level, high-performance computing and foundational mathematics. It aims for the maximum possible single-core CPU performance using a high-level language, without resorting to assembly, chip-specific kernels, or total madness. Considerable effort has gone into eliminating runtime dispatch and overhead and into offering extensive configurability, almost exclusively at compile time. With the right settings, these design principles offer portable and predictable performance from expensive modern CPUs to embedded systems.
It is a personal playground and learning space for easy experimentation with HPC concepts.
- SIMD (Single Instruction, Multiple Data): Direct hardware-level parallelism using 256-bit (AVX2) and 512-bit (AVX-512) vector registers, which enables arithmetic on multiple elements per instruction. Instruction set extensions, store policies (cached/streamed stores), load policies (aligned/unaligned loads) and prefetch hints can be configured at compile time. Supported data types: `double`, `float`, `int`, `unsigned int`, `short`.
- Template Metaprogramming: Metaprogramming techniques and templating (CRTP, template specialization, `std::index_sequence`, etc.) enable zero-overhead compile-time dispatching and static polymorphism (no vtable lookups) with type-safe, almost branchless code. Users can enable and configure compile-time unrolling, which fully unrolls small loops and reduces loop-check overhead in large ones, and can have arithmetic operations evaluated entirely by the compiler.
- Hardware Concurrency: Dual multithreading backend support for operations on contiguous memory blocks:
  - `std::thread`: Cross-platform and simple, but only the number of threads can be configured.
  - `pthread`: Linux-only backend wrapping POSIX threads. Highly configurable based on the exact topology of the machine. Users can enable Simultaneous Multi-Threading (SMT) and set the thread count and thread affinity with custom CPU pools that share resources.

  By their nature (thread creation and launching, safety checks, synchronization), parallel operations increase branching and overhead, and they cannot be evaluated at compile time.
- MemoryOperations API: `copy`, `setValue`, `setZero`:
  - Sequential and parallel variants
  - Vectorized with AVX2 and AVX512 instruction sets (with / without unroll)
  - Loops (with / without unroll)
  - Standard Template Library algorithms
- Allocators: Dynamically allocate and deallocate data with custom alignment: raw pointers, `std::vector`, `std::shared_ptr`, `std::unique_ptr`

All operations marked with [constexpr] can be evaluated at compile-time.
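As a rough illustration of what allocation with custom alignment involves, the sketch below builds an aligned `std::unique_ptr` on top of the C++17 aligned `operator new`. It is not Stalker's allocator: `AlignedDeleter`, `AlignedArray`, `makeAlignedArray` and `isAligned` are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>
#include <type_traits>

// Hypothetical helper: true if p sits on an `alignment`-byte boundary.
inline bool isAligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical deleter: releases memory through the aligned operator delete
// that matches the aligned operator new used in makeAlignedArray.
template <std::size_t Alignment>
struct AlignedDeleter {
    template <typename T>
    void operator()(T* p) const noexcept {
        ::operator delete(p, static_cast<std::align_val_t>(Alignment));
    }
};

template <typename T, std::size_t Alignment>
using AlignedArray = std::unique_ptr<T[], AlignedDeleter<Alignment>>;

// Allocates `count` value-initialized elements on an Alignment-byte boundary.
template <typename T, std::size_t Alignment>
AlignedArray<T, Alignment> makeAlignedArray(std::size_t count) {
    static_assert(Alignment != 0 && (Alignment & (Alignment - 1)) == 0,
                  "Alignment must be a power of two");
    static_assert(Alignment >= alignof(T), "Alignment is too small for T");
    static_assert(std::is_trivially_destructible_v<T>,
                  "the deleter frees memory without running destructors");
    void* raw = ::operator new(count * sizeof(T),
                               static_cast<std::align_val_t>(Alignment));
    assert(isAligned(raw, Alignment));
    T* data = static_cast<T*>(raw);
    for (std::size_t i = 0; i < count; ++i) {
        ::new (static_cast<void*>(data + i)) T{};  // start lifetime, zero-initialize
    }
    return AlignedArray<T, Alignment>(data);
}
```

The point of the custom deleter is that memory obtained from an aligned allocation has to be released by the matching aligned deallocation function; a plain `delete` on such a pointer is not a valid substitute.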
- VectorMath: `add`, `axpy`, `subtract`, `multiply`, `divide`, `scale`, `addConstant`, `dot`
  - Sequential and parallel variants
  - Weighted and unweighted variants [where applicable]
  - Vectorized with AVX2 and AVX512 instruction sets (with / without unroll) [except for divide]
  - Loops (with / without unroll)
- MatrixMath: `add`, `subtract`, `multiply`, `scale`, `addConstant`, `dot`, `matrixVectorMultiply`, `vectorMatrixMultiply`
  - Support for full, row and column major matrices and subblocks.
  - Vectorized with AVX2 and AVX512 instruction sets (with / without unroll)
  - Sequential and parallel variants
  - Weighted and unweighted variants [where applicable]
  - Loops (with / without unroll)
- Differentiation [constexpr]: Numerical calculation of 1st and 2nd order derivatives (Finite Difference Method). Forward, backward and central FD schemes with error order up to $\Delta x^6$.
- Integration [constexpr]: Numerical integration algorithms: Trapezoidal, Simpson1, Simpson2.
- Meta Mathematics [constexpr]: Compile-time evaluated fundamental mathematical operations: `power`, `sumOfPower`, `factorial`, `fibonacci`.
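To make the [constexpr] idea concrete, here is a minimal standalone sketch of compile-time mathematics in C++17. It does not use Stalker's headers: the `demo` namespace and the signatures of `power`, `factorial` and `centralDifference` are assumptions made for illustration, and the scheme shown is only the second-order central difference $f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}$.

```cpp
namespace demo {

// Integer power by repeated multiplication, usable in constant expressions.
constexpr double power(double base, unsigned exponent) {
    double result = 1.0;
    for (unsigned i = 0; i < exponent; ++i) {
        result *= base;
    }
    return result;
}

// Recursive factorial, usable in constant expressions.
constexpr unsigned long long factorial(unsigned n) {
    return n <= 1 ? 1ULL : n * factorial(n - 1);
}

// Second-order central difference of f at x with step h.
template <typename F>
constexpr double centralDifference(F f, double x, double h) {
    return (f(x + h) - f(x - h)) / (2.0 * h);
}

constexpr double cube(double x) { return power(x, 3); }

// Everything below is evaluated by the compiler; a wrong value fails the build.
static_assert(factorial(5) == 120);
static_assert(power(2.0, 10) == 1024.0);

// For f(x) = x^3 the central difference equals 3x^2 + h^2, so at x = 2 with
// h = 0.5 the result is 12.25 (all intermediate values are exact in binary).
constexpr double slope = centralDifference(cube, 2.0, 0.5);
static_assert(slope == 12.25);

}  // namespace demo
```

The leftover $h^2$ term in the last check is the truncation error of the second-order scheme, which is what higher-order schemes reduce.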
- Modern C++ compiler supporting C++17: GCC 7+, Clang 7+, or MSVC 19.12+ (VS2017 15.5) or newer.
- CMake (3.18 or later)
- [Optional] CPU supporting SIMD instructions (AVX2, AVX512).

      # Check for AVX2 and FMA support (prints the flags if supported)
      grep -q "avx2" /proc/cpuinfo && grep -q "fma" /proc/cpuinfo && echo "-mavx2 -mfma"
      # Check for AVX512 support (prints the flags if supported)
      grep -q "avx512f" /proc/cpuinfo && grep -q "avx512dq" /proc/cpuinfo && grep -q "avx512bw" /proc/cpuinfo && echo "-mavx512f -mavx512dq -mavx512bw -mfma"
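The shell check above reports what the CPU supports. A complementary check, sketched below, reports which instruction set the compiler was actually told to target, using the predefined macros `__AVX2__` and `__AVX512F__`. The function name `compiledSimdLevel` is hypothetical and not part of Stalker.

```cpp
#include <string_view>

// Reports the SIMD level this translation unit was compiled for.
// GCC and Clang define these macros when -mavx2 / -mavx512f (or a matching
// -march) is passed; MSVC defines them under /arch:AVX2 and /arch:AVX512.
constexpr std::string_view compiledSimdLevel() {
#if defined(__AVX512F__)
    return "avx512";
#elif defined(__AVX2__)
    return "avx2";
#else
    return "none";
#endif
}
```

If this returns `"none"` on a machine whose `/proc/cpuinfo` lists `avx2`, the flags printed by the shell check have most likely not reached the compiler.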
- Go to the installation directory:

      cd /path/to/your/desired/location

- Get the source code.

  HTTPS:

      git clone https://github.com/orestisPPS/Stalker.git

  SSH:

      git clone git@github.com:orestisPPS/Stalker.git

- [Optional] If you later add Stalker as a submodule and need its nested content:

      git submodule update --init --recursive

This project uses CMake as its primary build system and as the recommended tool for configuration and integration. It offers extensive performance tuning, almost all of which happens at configure time or compile time. Data operations are configured with templated execution traits. Their default values are set by the CMake configuration, but they can be overridden.
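One common way to get build-configured defaults that can still be overridden per call is to feed compile definitions into default template arguments. The sketch below shows that pattern only; the macro names `DEMO_UNROLL_FACTOR` and `DEMO_ALIGNMENT` and the type `DemoExecutionTrait` are hypothetical and are not Stalker's actual macros or traits.

```cpp
#include <cstddef>

// Assumption: the build system passes these as compile definitions,
// e.g. -DDEMO_UNROLL_FACTOR=2. The fallbacks apply when it does not.
#ifndef DEMO_UNROLL_FACTOR
#define DEMO_UNROLL_FACTOR 1
#endif
#ifndef DEMO_ALIGNMENT
#define DEMO_ALIGNMENT 64
#endif

// Hypothetical trait: defaults come from the build, overrides from the caller.
template <std::size_t Unroll = DEMO_UNROLL_FACTOR,
          std::size_t Alignment = DEMO_ALIGNMENT>
struct DemoExecutionTrait {
    static_assert(Unroll >= 1, "unroll factor must be at least 1");
    static constexpr std::size_t unrollFactor = Unroll;
    static constexpr std::size_t alignment = Alignment;
};

using ConfiguredTrait = DemoExecutionTrait<>;   // whatever the build configured
using OverriddenTrait = DemoExecutionTrait<8>;  // call-site override of the unroll factor

static_assert(ConfiguredTrait::unrollFactor == DEMO_UNROLL_FACTOR);
static_assert(OverriddenTrait::unrollFactor == 8);
static_assert(OverriddenTrait::alignment == ConfiguredTrait::alignment);
```

Because the values are template arguments, each combination is a distinct type, so the choice is resolved entirely at compile time with no runtime branch.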
The following CMake cache variables are available to configure the library:
| Variable | Description | Default |
|---|---|---|
| `STALKER_BUILD_TESTS` | Build tests target | `OFF` |
| `STALKER_BUILD_BENCHMARKS` | Build benchmarks target | `OFF` |

| Variable | Description | Default |
|---|---|---|
| `STALKER_ALIGNMENT` | Byte boundary for aligned allocations | `64` |
| `STALKER_UNROLL_FACTOR` | Loop iterations unrolled per block | `1` |

| Variable | Description | Default |
|---|---|---|
| `STALKER_SIMD_ENABLE` | Enable SIMD vectorization | `ON` |
| `STALKER_SIMD_INSTRUCTION_SET` | SIMD ISA: `auto`, `avx2`, `avx512`, `none` (`auto` prefers AVX2) | `auto` |
| `STALKER_SIMD_STORE_POLICY` | SIMD store policy: `stream` (non-temporal) or `cache` | `stream` |
| `STALKER_SIMD_PREFETCH_HINT` | Prefetch hint: `HintNone`, `HintT0`, `HintT1`, `HintT2`, `HintNTA` | `HintNone` |

| Variable | Description | Default |
|---|---|---|
| `STALKER_THREADING_ENABLE` | Enable CPU parallelization | `ON` |
| `STALKER_THREADING_MAX_THREADS_ENABLE` | Use maximum hardware threads (overrides manual thread count) | `OFF` |
| `STALKER_THREADING_NUM_THREADS` | Thread count (0 = auto-detect) | `0` |
| `STALKER_THREADING_STD_ENABLE` | Use `std::thread` backend | `ON` |
| `STALKER_THREADING_POSIX_ENABLE` | Use `pthread` backend (Linux/POSIX only) | `OFF` |
| `STALKER_THREADING_POSIX_SMT_ENABLE` | Enable Simultaneous Multithreading (only with `pthread` backend on Linux) | `OFF` |

| Variable | Description | Default |
|---|---|---|
| `STALKER_BUILD_PROFILE` | Unified build profile selector: `debug`, `release`, `perf`, `relwithdebinfo`, `custom`. | `release` |
| `STALKER_BUILD_CUSTOM_FLAGS` | Semicolon-separated list of custom compiler flags (active only when profile=custom). | (empty) |

Profile semantics (GNU/Clang):
| Profile | Flags | Notes |
|---|---|---|
| `debug` | `-O0 -g` | Full symbols; no optimization. |
| `release` | `-O3 -DNDEBUG` | Standard optimized build. |
| `perf` | `-O3 -DNDEBUG -ffast-math -funroll-loops -march=native -flto -fomit-frame-pointer -falign-functions=32 -falign-loops=32 -fno-math-errno` | Maximum speed; sacrifices portability & strict IEEE semantics. |
| `relwithdebinfo` | `-O2 -g -DNDEBUG` | Mirrors CMake built-in variant. |
| `custom` | (user supplied) | Use `STALKER_BUILD_CUSTOM_FLAGS`. No automatic safety or perf flags added. |

MSVC equivalents map similarly (/Od /Zi, /O2 /DNDEBUG, perf adds /Ox /GL /Ot /fp:fast).
Example configuration:

- Loop unrolling factor = 2
- 64-byte aligned allocations (alignment = 64)
- SIMD is enabled and configured to use AVX2 instructions with aggressive (L1, L2) prefetching (`HintT0`). Data will be streamed to memory after the operation.
- Multithreading is enabled and the thread backend uses `pthread`. Jobs will be distributed among 4 threads that share 2 physical cores, since SMT is enabled.

      cmake -S . -B build \
        -DSTALKER_BUILD_PROFILE=release \
        -DSTALKER_ALIGNMENT=64 \
        -DSTALKER_UNROLL_FACTOR=2 \
        -DSTALKER_SIMD_ENABLE=ON \
        -DSTALKER_SIMD_INSTRUCTION_SET=avx2 \
        -DSTALKER_SIMD_PREFETCH_HINT=HintT0 \
        -DSTALKER_SIMD_STORE_POLICY=stream \
        -DSTALKER_THREADING_ENABLE=ON \
        -DSTALKER_THREADING_NUM_THREADS=4 \
        -DSTALKER_THREADING_POSIX_ENABLE=ON \
        -DSTALKER_THREADING_POSIX_SMT_ENABLE=ON

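To give a feel for what distributing a job among a fixed number of threads means for a contiguous memory block, the sketch below splits an array into near-equal chunks and scales each chunk on its own `std::thread`. It is an illustration under stated assumptions, not Stalker's scheduler: `parallelScale` is a hypothetical name, and thread affinity and SMT pinning (which the `pthread` backend is described as supporting) are not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Multiplies data[0..size) by factor, using at most numThreads worker threads.
// Each worker owns one contiguous, non-overlapping chunk, so no locking is needed.
inline void parallelScale(std::size_t size, double* data, double factor,
                          std::size_t numThreads) {
    assert(data != nullptr || size == 0);
    // Never launch more workers than elements, and always at least one.
    numThreads = std::max<std::size_t>(1, std::min(numThreads, size));

    const std::size_t base = size / numThreads;       // minimum chunk length
    const std::size_t remainder = size % numThreads;  // first chunks get one extra

    std::vector<std::thread> workers;
    workers.reserve(numThreads);

    std::size_t begin = 0;
    for (std::size_t t = 0; t < numThreads; ++t) {
        const std::size_t end = begin + base + (t < remainder ? 1 : 0);
        workers.emplace_back([=] {
            for (std::size_t i = begin; i < end; ++i) {
                data[i] *= factor;
            }
        });
        begin = end;
    }
    for (auto& worker : workers) {
        worker.join();  // synchronization point: wait for every chunk
    }
}
```

Thread creation and the final join are fixed costs, which is why a parallel variant only pays off once the block is large enough to amortize them.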
Stalker is header-only and there are two ways to integrate it: installing it as a CMake package, or vendoring it as a submodule.
Installing it produces a CMake package so that downstream projects can use `find_package(Stalker)`.
- Configure
  - Set the values of the parameters in `stalker-build.json`, then configure using the Python helper:

        python3 Tools/build.py configure

  - CMake (equivalent):

        cmake -S . -B build -DSTALKER_BUILD_PROFILE=release -DSTALKER_UNROLL_FACTOR=2 -DSTALKER_SIMD_ENABLE=ON -DSTALKER_SIMD_INSTRUCTION_SET=avx2

- Install (header-only; optional)
  - Use the Python helper:

        python3 Tools/build.py install --install-prefix /opt/stalker

  - CMake:

        cmake --install build --prefix /opt/stalker

- Use in your CMake project

      find_package(Stalker REQUIRED)
      target_link_libraries(my_target PRIVATE Stalker::Stalker)

NOTE: If you installed to a non-standard location, point CMake to the prefix:

    export CMAKE_PREFIX_PATH=/opt/stalker:$CMAKE_PREFIX_PATH

The configuration chosen at install time (SIMD, alignment, threading flags) is exported through the INTERFACE target and propagated to dependents. To package multiple variants, install them to separate prefixes.

    # Variant 1
    cmake -S . -B build-avx2 -DSTALKER_BUILD_PROFILE=release -DSTALKER_SIMD_ENABLE=ON -DSTALKER_SIMD_INSTRUCTION_SET=avx2
    cmake --install build-avx2 --prefix /opt/stalker-v1
    # Variant 2
    cmake -S . -B build-avx512 -DSTALKER_BUILD_PROFILE=release -DSTALKER_SIMD_ENABLE=ON -DSTALKER_SIMD_INSTRUCTION_SET=avx512
    cmake --install build-avx512 --prefix /opt/stalker-v2

Consumer projects can then point to the desired variant:

    export CMAKE_PREFIX_PATH=/opt/stalker-v2:$CMAKE_PREFIX_PATH

Alternatively, vendor Stalker in your source tree as a submodule and configure it as part of your project.
- Add the source

      git submodule add https://github.com/orestisPPS/Stalker external/Stalker
      git submodule update --init --recursive

- In your CMakeLists.txt

      add_subdirectory(external/Stalker)
      target_link_libraries(my_target PRIVATE Stalker::Stalker)

- Configure (Optional)

  Override Stalker options from your project (if desired):

      cmake -S . -B build -DSTALKER_SIMD_ENABLE=ON -DSTALKER_SIMD_INSTRUCTION_SET=avx2  # add further options as needed

This section provides copy-pastable C++ snippets that show how to configure and perform basic memory and math operations. For more information and detailed examples with buildable, executable code, see Stalker/Examples and EXAMPLES.md.
Almost all operations can be performed with one of three backends (SIMD, Unrolled, Loops/STL). Their parameters can be passed as template arguments at compile time through `ExecutionTraits<>`, or left at the defaults defined by the CMake configuration. Every aspect of a single-threaded operation can be configured this way.
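The general mechanism of picking a backend from a trait type can be sketched in plain C++17 with `if constexpr` and `std::index_sequence`. The code below is a standalone illustration, not Stalker's implementation: `LoopTrait`, `UnrolledTrait`, `IsUnrolled`, `dotBlock` and this `dot` are hypothetical names, and only the loop and unrolled backends are shown.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical backend tags.
struct LoopTrait {};

template <std::size_t N>
struct UnrolledTrait {
    static_assert(N >= 1, "unroll factor must be at least 1");
    static constexpr std::size_t factor = N;
};

// Detects whether a trait is some UnrolledTrait<N>.
template <typename Trait>
struct IsUnrolled : std::false_type {};
template <std::size_t N>
struct IsUnrolled<UnrolledTrait<N>> : std::true_type {};

// One unrolled block: the fold expression expands to
// a[0]*b[0] + a[1]*b[1] + ... with no loop counter or bounds check.
template <typename T, std::size_t... I>
constexpr T dotBlock(const T* a, const T* b, std::index_sequence<I...>) {
    return ((a[I] * b[I]) + ...);
}

// Dot product whose backend is selected at compile time by Trait.
template <typename T, typename Trait>
constexpr T dot(std::size_t size, const T* a, const T* b) {
    T result{};
    std::size_t i = 0;
    if constexpr (IsUnrolled<Trait>::value) {
        constexpr std::size_t N = Trait::factor;
        for (; i + N <= size; i += N) {  // one bounds check per block of N
            result += dotBlock(a + i, b + i, std::make_index_sequence<N>{});
        }
    }
    for (; i < size; ++i) {  // tail elements, or the whole range for LoopTrait
        result += a[i] * b[i];
    }
    return result;
}

constexpr double x[5] = {1, 2, 3, 4, 5};
constexpr double y[5] = {1, 1, 1, 1, 1};
static_assert(dot<double, UnrolledTrait<2>>(5, x, y) == 15.0);
static_assert(dot<double, LoopTrait>(5, x, y) == 15.0);
```

Since the branch on the trait is an `if constexpr`, the unused backend is discarded during compilation, which is what makes the dispatch free at runtime.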
Example 1: Vectorized and unrolled copy of a 32-byte aligned std::vector with AVX2 instructions and streamed stores.

    #include <Stalker/Memory/MemoryOperations.h>
    #include <Stalker/Memory/Allocators.h>
    #include <Stalker/Mathematics/Random.h>

    int main() {
        // Create a SIMD execution trait from the instruction set, alignment flag,
        // store policy, unroll factor and prefetch hint.
        constexpr bool IsAligned = true;
        constexpr Stalker::Core::Config::T_SIMD Instructions = Stalker::Core::Config::T_SIMD::AVX2;
        constexpr Stalker::Core::Config::T_SIMDStore StorePolicy = Stalker::Core::Config::T_SIMDStore::Streamed;
        constexpr Stalker::Core::Config::T_PrefetchHints PrefetchHint = Stalker::Core::Config::T_PrefetchHints::HintT0;
        constexpr size_t Unroll = 2;
        using SimdTrait = Stalker::Core::ExecutionTraitSIMD<Instructions, IsAligned, StorePolicy, Unroll, PrefetchHint>;

        // Create 2 std::vector<double> aligned to 32 bytes and fill the source with random numbers in [0, 10].
        size_t size = 20;
        constexpr size_t Alignment = 32;
        auto src = Stalker::Memory::createAlignedVector<double, Alignment>(size);
        Stalker::Mathematics::Random::uniform<double>(size, src.data(), 0, 10);
        auto dst = Stalker::Memory::createAlignedVector<double, Alignment>(size);

        // Call copy from the MemoryOperations API.
        Stalker::Memory::MemoryOperations::copy<double, SimdTrait>(size, dst.data(), src.data());
        return 0;
    }

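For readers curious what an aligned load paired with a streamed store looks like at the intrinsic level, here is a minimal standalone sketch. It is not the code Stalker generates: `copyStreamed` is a hypothetical function, unrolling and prefetching are left out, and the vector path is compiled only when the compiler targets AVX2 (otherwise a scalar loop does the whole copy).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Copies n doubles from src to dst.
// Precondition for the vector path: both pointers are 32-byte aligned,
// as required by _mm256_load_pd and _mm256_stream_pd.
inline void copyStreamed(std::size_t n, double* dst, const double* src) {
    std::size_t i = 0;
#if defined(__AVX2__)
    assert(reinterpret_cast<std::uintptr_t>(dst) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 32 == 0);
    for (; i + 4 <= n; i += 4) {
        const __m256d v = _mm256_load_pd(src + i);  // aligned 256-bit load (4 doubles)
        _mm256_stream_pd(dst + i, v);               // non-temporal store, bypasses the cache
    }
    _mm_sfence();  // order the streamed stores before any later stores
#endif
    for (; i < n; ++i) {  // scalar tail, or the whole copy without AVX2
        dst[i] = src[i];
    }
}
```

Streamed stores avoid filling the cache with data that will not be read again soon, which tends to help for large, write-once buffers and can hurt for small ones that are reused immediately.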
Example 2: Unrolled dot product of two raw-pointer arrays with an unroll factor of 8.

    #include <Stalker/Mathematics/Vector/VectorMath.h>
    #include <Stalker/Memory/Allocators.h>
    #include <Stalker/Mathematics/Random.h>

    int main() {
        using Trait = Stalker::Core::ExecutionTraitUnrolled<8>;

        // Create 2 raw pointers with default alignment and fill them with random numbers in [0, 10].
        size_t size = 20;
        auto a = Stalker::Memory::createAlignedPtrRaw<double>(size);
        Stalker::Mathematics::Random::uniform<double>(size, a, 0, 10);
        auto b = Stalker::Memory::createAlignedPtrRaw<double>(size);
        Stalker::Mathematics::Random::uniform<double>(size, b, 0, 10);

        // Call dot from the VectorMath API.
        // 2 unrolled blocks of size 8 are calculated in 2 loop iterations and
        // the tail (4 elements) is handled by a C-style loop.
        auto res = Stalker::Mathematics::VectorMath::dot<double, Trait>(size, a, b);

        // Release the buffers. NOTE (assumption): if createAlignedPtrRaw uses an aligned
        // allocation, release with the matching Stalker deallocator instead of plain delete.
        delete a;
        delete b;
        return 0;
    }

Example 3: Compile-time sum of a std::array.

    #include <Stalker/Mathematics/Vector/VectorMath.h>
    #include <array>

    int main() {
        constexpr auto Size = 5;
        using Trait = Stalker::Core::ExecutionTraitUnrolled<Size>;

        // Create a std::array<double, 5> filled with ones.
        constexpr auto array = std::array<double, Size> {1, 1, 1, 1, 1};

        // Call sum from the VectorMath API.
        // Everything is calculated by the compiler!
        constexpr auto res = Stalker::Mathematics::VectorMath::sum<double, Trait>(Size, array.data());

        // This will not build if the result is not 5.
        static_assert(res == 5);
        return 0;
    }

For detailed performance results and benchmark instructions visit Benchmarks/README.MD.
This project is licensed under the Apache License, Version 2.0. See the LICENCE.md file for details.
The core library (Stalker/Core, Memory, Mathematics) is header-only and standard C++17 compliant with no external dependencies.
The Benchmarks module optionally links against:
- Eigen (MPL2)
- OpenBLAS (BSD-3-Clause)
- PAPI (BSD-Style)
These are only required for building the benchmark executable and are not part of the library distribution.