MSc EE/IT at ETH Zürich · AI Research Intern at Huawei Research Center Switzerland
I work on low-level ML systems, focusing on kernel development, benchmarking, and hardware-aware performance optimization for specialized accelerators.
- Custom kernel development for Ascend NPUs
- Benchmarking and performance analysis for ML workloads
- Quantization, fused operators, and efficient inference
- Embedded and resource-constrained ML systems
Public contributions to huawei-csl/pto-kernels:
- PR #62 — Fast Hadamard fused with dynamic quantization to int4
- PR #49 — Fast Hadamard fused with fp16 → int8 dynamic quantization
- PR #26 — PTO-ISA matmul with L2 cache locality optimization
- pto-kernels: Active development fork for Ascend NPU kernel work, experiments, benchmarking, and upstream contribution preparation.
- pto-kernels-plots: Benchmark plots and performance analysis for kernel development and PR evaluation.
- health-metrics: Self-hostable Streamlit app for tracking personal health metrics with local SQLite storage and authenticated editing.
- MLonMCU: Embedded ML / microcontroller-related coursework and project work.
ML systems · performance engineering · kernel optimization · compilers · hardware-aware ML
- GitHub: @Mocchibird


