A complete implementation of a multi-layer neural network with backpropagation for the MNIST handwritten digit classification task. Built entirely from scratch without any external ML libraries.
## Table of Contents
- Overview
- Project Structure
- Features
- Building the Project
- Running the Project
- Network Architecture
- How It Works
- Activation Functions
- Loss Functions
- Data Format
- Customization
- Performance Notes
## Overview
This project implements a fully-connected neural network (also called a multi-layer perceptron or MLP) trained using backpropagation to classify handwritten digits from the MNIST dataset.
The implementation includes:
- Custom matrix operations
- Dense (fully-connected) layers with learnable weights and biases
- Multiple activation functions
- Cross-entropy and MSE loss functions
- Stochastic Gradient Descent (SGD) optimizer
## Project Structure

```
backpropagation/
├── include/
│   ├── activations.hpp
│   ├── layer.hpp
│   ├── matrix.hpp
│   ├── mnist_loader.hpp
│   └── neural_network.hpp
├── src/
│   ├── activations.cpp
│   ├── layer.cpp
│   ├── main.cpp
│   ├── matrix.cpp
│   ├── mnist_loader.cpp
│   └── neural_network.cpp
└── README.md
```
## Features
- Matrix Operations: Custom Matrix class with dot product, element-wise operations, transposition
- Dense Layers: Fully-connected layers with configurable input/output sizes
- Activation Functions: Sigmoid, ReLU, Leaky ReLU, Softmax, Tanh, Linear
- Loss Functions: Cross-entropy (recommended for classification) and Mean Squared Error
- Backpropagation: Full gradient computation through all layers
- Stochastic Gradient Descent (SGD): Mini-batch training with configurable batch size
- MNIST Loading: Native support for IDX format MNIST files
## Building the Project

Prerequisites:
- C++17 compatible compiler
- CMake 3.10 or higher
```sh
mkdir -p build && cd build
cmake ..
make -j$(nproc)
```

## Running the Project

```sh
./neural_network <data_directory> <epochs> <learning_rate> <batch_size>

# Example: 20 epochs, learning rate 0.005, batch size 64
./neural_network ./data/mnist 20 0.005 64
```

| Parameter | Default Value | Description |
|---|---|---|
| Data Directory | ./data/mnist | Path to MNIST files |
| Epochs | 10 | Number of training passes |
| Learning Rate | 0.01 | Step size for weight updates |
| Batch Size | 32 | Samples per training batch |
## Network Architecture

```mermaid
graph TD
    subgraph Input_Layer
        I[Input: 784 neurons<br/>28x28 pixels]
    end
    subgraph Hidden_Layer_1
        H1[128 neurons<br/>ReLU activation]
    end
    subgraph Hidden_Layer_2
        H2[64 neurons<br/>ReLU activation]
    end
    subgraph Output_Layer
        O[10 neurons<br/>Softmax activation<br/>Digits 0-9]
    end
    I --> H1
    H1 --> H2
    H2 --> O
```
| Layer | Input | Output | Parameters | Activation |
|---|---|---|---|---|
| Layer 1 | 784 | 128 | 100,480 | ReLU |
| Layer 2 | 128 | 64 | 8,256 | ReLU |
| Output | 64 | 10 | 650 | Softmax |
| Total | | | 109,386 | |
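Each parameter count is (inputs × outputs) weights plus one bias per output neuron; a quick check (hypothetical helper, not part of the project):

```cpp
#include <cassert>

// Parameter count for a dense layer: one weight per input-output pair
// plus one bias per output neuron. (Illustrative helper, not project code.)
int dense_params(int inputs, int outputs) {
    return inputs * outputs + outputs;
}
```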
## How It Works

### Forward Pass

```mermaid
flowchart LR
    subgraph "Forward Pass"
        Ap[A_prev] --> W[W]
        Ap --> b[b]
        W --> Z[Z]
        b --> Z
        Z --> act[activation]
        act --> A[A]
    end
```
Forward propagation passes the input through each layer:
- Linear transformation: Z = X * W + b
- Activation: A = activation(Z)
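The two steps above can be sketched for a single sample as follows (illustrative names; the project's Layer class operates on whole mini-batch matrices rather than single vectors):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Sketch of one dense layer's forward pass on a single sample:
// z_j = sum_i x_i * W[i][j] + b[j], then a_j = activation(z_j).
std::vector<double> dense_forward(const std::vector<double>& x,
                                  const std::vector<std::vector<double>>& W,
                                  const std::vector<double>& b,
                                  bool relu) {
    std::vector<double> a(b.size());
    for (std::size_t j = 0; j < b.size(); ++j) {
        double z = b[j];                          // start from the bias
        for (std::size_t i = 0; i < x.size(); ++i)
            z += x[i] * W[i][j];                  // linear transformation
        a[j] = relu ? std::max(0.0, z)            // ReLU for hidden layers
                    : 1.0 / (1.0 + std::exp(-z)); // sigmoid otherwise
    }
    return a;
}
```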
### Backpropagation

```mermaid
flowchart TD
    Start[Start] --> Output[Output Layer]
    Output --> ActDerive[Compute dL/dZ]
    ActDerive --> WeightGrad[Compute dL/dW]
    ActDerive --> BiasGrad[Compute dL/db]
    ActDerive --> PrevGrad[Compute dL/dA_prev]
    PrevGrad --> Hidden1{Hidden Layer?}
    Hidden1 -->|Yes| Process1[Process this layer]
    Process1 --> Hidden1
    Hidden1 -->|No| Update[Update Weights]
    WeightGrad --> Update
    BiasGrad --> Update
    Update --> NextBatch[Next Batch]
```
Backpropagation computes gradients recursively through each layer:

- Start with the gradient from the layer above
- For each layer (from output to input):
  - Apply the activation derivative
  - Compute the weight gradient
  - Compute the bias gradient
  - Pass the gradient to the previous layer
- Update the weights
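The per-layer gradient computations above might look like this for a single sample (an illustrative sketch, not the project's actual Layer code):

```cpp
#include <vector>

// Given dL/dZ for this layer (dz), the cached layer input a_prev, and the
// weights W, compute the weight gradient dW, the bias gradient db, and
// the gradient dL/dA_prev handed to the previous layer.
struct LayerGrads {
    std::vector<std::vector<double>> dW; // same shape as W: inputs x outputs
    std::vector<double> db;              // one entry per output neuron
    std::vector<double> da_prev;         // gradient w.r.t. the layer input
};

LayerGrads dense_backward(const std::vector<double>& a_prev,
                          const std::vector<std::vector<double>>& W,
                          const std::vector<double>& dz) {
    LayerGrads g;
    g.dW.assign(a_prev.size(), std::vector<double>(dz.size(), 0.0));
    g.db = dz;                               // dL/db = dL/dZ
    g.da_prev.assign(a_prev.size(), 0.0);
    for (std::size_t i = 0; i < a_prev.size(); ++i)
        for (std::size_t j = 0; j < dz.size(); ++j) {
            g.dW[i][j] = a_prev[i] * dz[j];  // dL/dW = A_prev^T * dZ
            g.da_prev[i] += W[i][j] * dz[j]; // dL/dA_prev = dZ * W^T
        }
    return g;
}
```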
### Training Loop

```mermaid
flowchart TD
    A([Start]) --> B[Initialize Network]
    B --> C{For each epoch}
    C --> D[Shuffle Training Data]
    D --> E{For each batch}
    E --> F[Forward Pass]
    F --> G[Compute Loss]
    G --> H[Backward Pass]
    H --> I[Update Weights]
    I --> E
    E -->|All batches| J[Evaluate on Test Set]
    J --> C
    C -->|All epochs| K([Done])
```
The training uses mini-batch Stochastic Gradient Descent:
- Loop through epochs
- Shuffle training data each epoch
- For each batch: forward pass → compute loss → backward pass → update weights
- Evaluate on test set after each epoch
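The weight-update step can be sketched as plain SGD over a flat parameter vector (hypothetical helper, not the project's optimizer API):

```cpp
#include <vector>

// One SGD step: each parameter moves against its gradient,
// scaled by the learning rate (w <- w - lr * dL/dw).
void sgd_step(std::vector<double>& params,
              const std::vector<double>& grads,
              double learning_rate) {
    for (std::size_t i = 0; i < params.size(); ++i)
        params[i] -= learning_rate * grads[i];
}
```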
## Activation Functions

| Function | Formula | Use Case |
|---|---|---|
| Sigmoid | 1/(1+e^(-x)) | Binary classification |
| ReLU | max(0, x) | Hidden layers |
| Leaky ReLU | x > 0 ? x : 0.01x | Dying ReLU fix |
| Softmax | e^(x_i) / Σ e^(x_j) | Multi-class output |
| Tanh | tanh(x) | Hidden layers |
| Linear | x | Regression |
### Derivatives

| Function | Derivative |
|---|---|
| Sigmoid | σ(x) × (1 - σ(x)) |
| ReLU | 1 if x > 0, 0 otherwise |
| Leaky ReLU | 1 if x > 0, α otherwise |
| Softmax | Jacobian-based |
| Tanh | 1 - tanh²(x) |
| Linear | 1 |
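As a concrete example, the softmax above is usually implemented in a numerically stable form (an illustrative sketch, not necessarily the project's version): subtracting the row maximum before exponentiating leaves the result unchanged but prevents overflow for large inputs.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Numerically stable softmax: softmax(z)_i = e^(z_i - max) / sum_j e^(z_j - max).
std::vector<double> softmax(const std::vector<double>& z) {
    double m = *std::max_element(z.begin(), z.end());
    std::vector<double> out(z.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < z.size(); ++i) {
        out[i] = std::exp(z[i] - m);
        sum += out[i];
    }
    for (double& v : out) v /= sum; // normalize so outputs sum to 1
    return out;
}
```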
## Loss Functions

### Cross-Entropy

L = -Σ y_true × log(y_pred)

Used with softmax for multi-class classification.
### Mean Squared Error

L = (1/n) × Σ (y_true - y_pred)²
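Both loss formulas can be sketched for a single sample as follows (illustrative helpers; the small epsilon guarding log(0) is an assumption, not confirmed project behavior):

```cpp
#include <cmath>
#include <vector>

// Cross-entropy for a one-hot target: L = -sum_i y_true_i * log(y_pred_i).
// The epsilon avoids taking log(0) when a predicted probability is zero.
double cross_entropy(const std::vector<double>& y_pred,
                     const std::vector<double>& y_true) {
    double loss = 0.0;
    for (std::size_t i = 0; i < y_pred.size(); ++i)
        loss -= y_true[i] * std::log(y_pred[i] + 1e-12);
    return loss;
}

// Mean squared error: L = (1/n) * sum_i (y_true_i - y_pred_i)^2.
double mse(const std::vector<double>& y_pred,
           const std::vector<double>& y_true) {
    double sum = 0.0;
    for (std::size_t i = 0; i < y_pred.size(); ++i) {
        double d = y_true[i] - y_pred[i];
        sum += d * d;
    }
    return sum / static_cast<double>(y_pred.size());
}
```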
## Data Format

- Images: 28×28 → flattened to 784 values
- Pixel values: normalized from [0, 255] to [0, 1]
- Labels: one-hot encoded (10 classes)
- Predictions: probability distribution over the 10 classes
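The preprocessing above might be sketched as follows (illustrative helpers, not the project's mnist_loader API):

```cpp
#include <vector>

// Scale raw [0, 255] pixel bytes into [0, 1] doubles.
std::vector<double> normalize_pixels(const std::vector<unsigned char>& raw) {
    std::vector<double> out(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i)
        out[i] = raw[i] / 255.0;
    return out;
}

// One-hot encode a digit label: all zeros except a 1 at the true class.
std::vector<double> one_hot(int label, int num_classes) {
    std::vector<double> v(num_classes, 0.0);
    v[label] = 1.0;
    return v;
}
```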
## Customization

Add a layer:

```cpp
Layer newLayer(inputSize, outputSize, "activation_name");
nn.addLayer(newLayer);
```

Switch the loss function:

```cpp
nn.setLossFunction("mse");
nn.setLossFunction("cross_entropy");
```

## Performance Notes

- Training accuracy: ~95%+ after 10 epochs
- Test accuracy: ~94% after 10 epochs
Limitations:

- Basic SGD (no momentum/Adam)
- No regularization
- No batch normalization
- CPU only