nanograd
A lightweight automatic differentiation and neural network framework for CHICKEN Scheme, featuring BLAS-accelerated operations and YASOS-based object abstractions.
Description
NanoGrad provides a complete framework for building and training neural networks with automatic differentiation. It features:
- Reverse-mode automatic differentiation with gradient computation
- BLAS-accelerated linear algebra operations
- YASOS-based polymorphic object system
- Support for both 32-bit and 64-bit floating-point precision
- Common neural network layers (Dense, Convolutional, Batch Normalization)
- Common optimization algorithms (SGD, Adam, RMSprop)
- Standard activation functions and loss functions
- Tensor manipulation with reduction operations and slicing
- Training/evaluation mode support for layers
Requirements
Modules
nanograd-autograd
Core automatic differentiation engine with tensor operations.
Tensor Constructors
[procedure] (make-tensor32 data shape #!key (requires-grad? #t)) -> tensor
Creates a 32-bit floating-point tensor with automatic differentiation support.
- data
- f32vector containing the tensor data
- shape
- list of dimensions, e.g., '(2 3) for a 2x3 matrix
- requires-grad?
- whether to track gradients (default #t)
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3) requires-grad?: #t))
[procedure] (make-tensor64 data shape #!key (requires-grad? #t)) -> tensor
Creates a 64-bit floating-point tensor with automatic differentiation support.
(define x (make-tensor64 (f64vector 1.0 2.0 3.0 4.0) '(2 2)))
Tensor Predicates
[procedure] (tensor? obj) -> boolean
[procedure] (tensor32? obj) -> boolean
[procedure] (tensor64? obj) -> boolean
Type predicates for tensors.
Tensor Accessors
[procedure] (tensor-data tensor) -> vector
Returns the underlying f32vector or f64vector containing the tensor's data.
[procedure] (tensor-grad tensor) -> vector or #f
Returns the gradient vector if gradients are enabled, #f otherwise.
[procedure] (tensor-shape tensor) -> list
Returns the shape as a list of dimensions.
[procedure] (tensor-dtype tensor) -> symbol
Returns the data type: 'f32 or 'f64.
[procedure] (tensor-requires-grad? tensor) -> boolean
Returns #t if the tensor tracks gradients.
Arithmetic Operations
[procedure] (add a b) -> tensor
Element-wise addition of tensors a and b. Both tensors must have the same shape and dtype.
(define z (add x y)) ; z = x + y
Gradient: dL/da = dL/dz, dL/db = dL/dz
[procedure] (sub a b) -> tensor
Element-wise subtraction: a - b.
Gradient: dL/da = dL/dz, dL/db = -dL/dz
[procedure] (mul a b) -> tensor
Element-wise multiplication (Hadamard product).
Gradient: dL/da = dL/dz ⊙ b, dL/db = dL/dz ⊙ a
[procedure] (div a b) -> tensor
Element-wise division: a / b.
Gradient: dL/da = dL/dz / b, dL/db = -dL/dz · (a / b²)
[procedure] (safe-div a b #!key (epsilon 1e-8)) -> tensor
Safe element-wise division: a / (b + epsilon) to avoid division by zero.
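A minimal sketch of these element-wise operations in use; the gradient values in the comments follow the formulas above:

(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define y (make-tensor32 (f32vector 4.0 5.0 6.0) '(3)))

(define d (sub x y))                       ; [-3.0 -3.0 -3.0]
(define q (safe-div x y epsilon: 1e-8))    ; approx [0.25 0.4 0.5]

(backward! (sum-tensor q))
(tensor-grad x)   ; dL/dx = 1 / (y + epsilon), approx [0.25 0.2 0.167]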
Linear Algebra Operations
[procedure] (matmul-op a b) -> tensor
Matrix multiplication using BLAS GEMM/GEMV operations. Supports:
- Matrix × Matrix
- Matrix × Vector
- Vector × Matrix
- Vector × Vector (dot product)
(define A (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(define b (make-tensor32 (f32vector 5.0 6.0) '(2)))
(define c (matmul-op A b))   ; 2×2 matrix times 2×1 vector = 2×1 vector
Gradient: dL/dA = dL/dC · B^T, dL/dB = A^T · dL/dC
[procedure] (dot-op a b) -> tensor
Dot product (inner product) of two 1D vectors using BLAS DOT.
(define result (dot-op x y)) ; scalar result
Gradient: dL/da = (dL/dresult) · b, dL/db = (dL/dresult) · a
[procedure] (scale-op tensor scalar) -> tensor
Scalar multiplication using BLAS SCAL.
Gradient: dL/dtensor = scalar · dL/dresult
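For example, scaling a tensor and backpropagating through a sum (a small sketch using the operations documented above):

(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define y (scale-op x 2.5))      ; [2.5 5.0 7.5]
(backward! (sum-tensor y))
(tensor-grad x)                  ; each element receives 2.5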
Reduction Operations
[procedure] (reduce-tensor tensor reducer #!key (compute-gradient #f)) -> tensor
Generic reduction operation that maintains gradient flow. The reducer function is applied to each element in the forward pass. An optional compute-gradient function specifies how gradients are distributed in the backward pass.
- tensor
- input tensor to reduce
- reducer
- function (element accumulator) -> new-accumulator
- compute-gradient
- optional function (grad-out index value all-values) -> grad-in
- If not provided, assumes uniform distribution (like sum)
Returns a scalar tensor with the reduced value.
;; Sum all elements (uniform gradient distribution)
(define total (reduce-tensor x +))

;; Product of all elements (gradient uses product rule)
(define prod
  (reduce-tensor x *
    compute-gradient:
    (lambda (grad-out idx val all-values)
      ;; d(prod)/dx_i = prod / x_i
      (let ((prod (fold * 1.0 all-values)))
        (if (> val 0.0)
            (* grad-out (/ prod val))
            0.0)))))

;; Custom maximum with gradient flowing only to max element
(define max-val
  (reduce-tensor x max
    compute-gradient:
    (lambda (grad-out idx val all-values)
      (if (= val (apply max all-values)) grad-out 0.0))))
[procedure] (sum-tensor tensor) -> tensor
Sums all elements in the tensor. Gradient is distributed uniformly to all elements.
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define total (sum-tensor x))   ; Returns scalar tensor with value 6.0
(backward! total)
(tensor-grad x)                 ; Each element receives gradient of 1.0
[procedure] (product-tensor tensor) -> tensor
Computes the product of all elements. Gradient uses the product rule: d(prod)/dx_i = prod / x_i.
(define x (make-tensor32 (f32vector 2.0 3.0 4.0) '(3)))
(define prod (product-tensor x))   ; Returns 24.0
(backward! prod)
(tensor-grad x)                    ; Gradients: [12.0, 8.0, 6.0]
[procedure] (mean-tensor tensor) -> tensor
Computes the mean (average) of all elements. Equivalent to (sum-tensor tensor) / n.
(define x (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(4)))
(define avg (mean-tensor x))   ; Returns 2.5
(backward! avg)
(tensor-grad x)                ; Each element receives gradient of 0.25
Tensor Manipulation Operations
[procedure] (slice-tensor tensor start length) -> tensor
Extracts a slice of a tensor along the first dimension. Gradients flow back correctly to the original tensor positions.
- tensor
- input tensor with shape (n, ...)
- start
- starting index (0-based)
- length
- number of elements to extract
- Returns
- tensor with shape (length, ...)
;; Slice a batch of data
(define batch-data (make-tensor32 (make-f32vector 100) '(10 10)))
(define mini-batch (slice-tensor batch-data 2 5))   ; Shape: (5, 10)

;; Gradients flow back to original positions
(backward! (sum-tensor mini-batch))
(tensor-grad batch-data)   ; Only indices 2-6 have non-zero gradients
Example: Mini-batch training
(define dataset (make-tensor32 training-data '(1000 784)))

(do ((i 0 (+ i batch-size)))
    ((>= i 1000))
  (let ((batch (slice-tensor dataset i batch-size)))
    ;; Process batch
    (let ((output (forward model batch)))
      (backward! output)
      (step! optimizer))))
[procedure] (reshape tensor new-shape) -> tensor
Reshapes the tensor. Total number of elements must be preserved. Creates a new tensor with separate gradient buffer but shared underlying data.
(define x (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(define x-flat (reshape x '(4)))          ; Flatten to 1D
(define x-back (reshape x-flat '(2 2)))   ; Reshape back
[procedure] (flatten-tensor tensor) -> tensor
Flattens a multi-dimensional tensor to 1D. Equivalent to (reshape tensor (list total-size)).
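For example, flattening a 2x2 tensor (a small sketch; names are illustrative):

(define img (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(define v (flatten-tensor img))
(tensor-shape v)   ; => (4)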
Activation Functions
[procedure] (relu tensor) -> tensor
Rectified Linear Unit: max(0, x).
Gradient: 1 if x > 0, else 0
[procedure] (tanh-op tensor) -> tensor
Hyperbolic tangent activation.
Gradient: 1 - tanh^2(x)
[procedure] (sigmoid tensor) -> tensor
Sigmoid (logistic) activation: σ(x) = 1 / (1 + e^(-x)).
Gradient: σ(x) · (1 - σ(x))
[procedure] (sigmoid-stable tensor) -> tensor
Numerically stable sigmoid implementation for large negative values.
[procedure] (softmax x #!key (dim #f)) -> tensor
Softmax normalization with numerical stability (subtracts max before exp).
(define probs (softmax logits))   ; Converts logits to probabilities
[procedure] (log-softmax x #!key (dim #f)) -> tensor
Log-softmax: more numerically stable than log(softmax(x)).
[procedure] (leaky-relu tensor #!key (alpha 0.01)) -> tensor
Leaky ReLU: max(alpha * x, x).
[procedure] (softplus tensor #!key (beta 1.0)) -> tensor
Softplus activation: log(1 + e^(beta * x)) / beta.
[procedure] (gelu tensor) -> tensor
Gaussian Error Linear Unit activation using tanh approximation.
[procedure] (silu tensor) -> tensor
SiLU (Sigmoid Linear Unit) activation, also known as Swish: x * σ(x).
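A brief sketch applying several of these activations to one input; the values in the comments are approximate and follow the definitions above:

(define x (make-tensor32 (f32vector -1.0 0.0 2.0) '(3)))

(define r  (relu x))                    ; [0.0 0.0 2.0]
(define lr (leaky-relu x alpha: 0.1))   ; [-0.1 0.0 2.0]
(define s  (sigmoid x))                 ; approx [0.269 0.5 0.881]
(define p  (softmax x))                 ; probabilities summing to 1

(backward! (sum-tensor r))
(tensor-grad x)                         ; ReLU gradient: [0.0 0.0 1.0]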
Loss Functions
[procedure] (mse-loss pred target) -> tensor
Mean Squared Error loss: L = (1/n) ∑(pred - target)².
(define loss (mse-loss predictions targets))
[procedure] (cross-entropy-loss pred target) -> tensor
Cross-entropy loss: L = -∑(target · log(pred)).
Note: Assumes pred is already normalized (e.g., via softmax).
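A minimal sketch combining softmax with cross-entropy-loss on a one-hot target, per the note above:

(define logits (make-tensor32 (f32vector 2.0 0.5 -1.0) '(3)))
(define target (make-tensor32 (f32vector 1.0 0.0 0.0) '(3)))

(define probs (softmax logits))               ; normalize first
(define loss  (cross-entropy-loss probs target))
(backward! loss)
(tensor-grad logits)                          ; gradients w.r.t. the logits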
Gradient Operations
[procedure] (zero-grad! tensor) -> void
Sets all gradient values to zero.
[procedure] (backward! tensor) -> void
Computes gradients via reverse-mode automatic differentiation. Performs topological sort and executes backward functions in correct order. Detects cycles and raises an error if found.
(define x (make-tensor32 (f32vector 1.0 2.0) '(2)))
(define y (make-tensor32 (f32vector 3.0 4.0) '(2)))
(define z (add x y))
(define loss (dot-op z z))
(backward! loss)
(print-tensor (tensor-grad x))
[procedure] (add-to-grad! tensor delta) -> void
Accumulates delta into the tensor's gradient using BLAS AXPY.
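For illustration, a small sketch of manual gradient accumulation; it assumes delta is a plain SRFI-4 vector matching the tensor's data (as in the cycle example under Troubleshooting):

(define x (make-tensor32 (f32vector 1.0 2.0) '(2)))
(zero-grad! x)
(add-to-grad! x (f32vector 0.5 0.5))   ; gradient is now [0.5 0.5]
(add-to-grad! x (f32vector 0.5 0.5))   ; accumulates to [1.0 1.0]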
Convolution Operations
[procedure] (conv2d input weight bias #!key (stride 1) (padding 0)) -> tensor
2D convolution using the im2col + GEMM algorithm.
- input
- tensor of shape (C_in, H, W)
- weight
- tensor of shape (C_out, C_in, KH, KW)
- bias
- tensor of shape (C_out) or #f
- stride
- stride for convolution (default 1)
- padding
- zero-padding (default 0)
(define output (conv2d input weights bias stride: 2 padding: 1))
Normalization Operations
[procedure] (rmsnorm x weight #!key (epsilon 1e-5)) -> tensor
Root Mean Square Layer Normalization.
[procedure] (l2-normalize tensor #!key (epsilon 1e-8)) -> tensor
L2 normalization: x / ||x||₂.
[procedure] (cosine-similarity a b) -> tensor
Cosine similarity: (a · b) / (||a|| · ||b||).
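A small sketch of these helpers; it assumes the rmsnorm weight argument is a tensor of per-element scales with the same shape as x:

(define x (make-tensor32 (f32vector 3.0 4.0) '(2)))
(define y (make-tensor32 (f32vector 6.0 8.0) '(2)))

(define xn (l2-normalize x))          ; [0.6 0.8], unit L2 norm
(define cs (cosine-similarity x y))   ; 1.0 (same direction)

(define w (make-tensor32 (f32vector 1.0 1.0) '(2)))
(define rn (rmsnorm x w))             ; x scaled by 1/RMS(x), then by w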
Utility Functions
[procedure] (tensor->list tensor) -> list
Converts tensor data to a list.
[procedure] (print-tensor tensor) -> void
Pretty-prints tensor information including shape, dtype, data, and gradients.
[procedure] (vector-length-for-dtype vec dtype) -> integer
Returns the length of a vector based on its dtype.
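For example, inspecting a tensor's contents with these utilities:

(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(tensor->list x)                                 ; => (1.0 2.0 3.0)
(vector-length-for-dtype (tensor-data x) 'f32)   ; => 3
(print-tensor x)                                 ; shape, dtype, data, gradients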
nanograd-layer
Neural network layer abstractions and containers.
Layer Predicates
[procedure] (layer? obj) -> boolean
[procedure] (dense-layer? obj) -> boolean
[procedure] (conv2d-layer? obj) -> boolean
[procedure] (batch-norm-2d? obj) -> boolean
[procedure] (sequential? obj) -> boolean
Dense Layer
[procedure] (make-dense-layer input-size output-size #!key (activation (make-identity)) (dtype 'f32) (name "Dense")) -> layer
Creates a fully-connected (dense) layer with Xavier/Glorot initialization.
- input-size
- number of input features
- output-size
- number of output features
- activation
- activation function object (default identity)
- dtype
- 'f32 or 'f64 (default 'f32)
- name
- layer name for debugging
(define layer (make-dense-layer 784 128
activation: (make-relu)
name: "Hidden1"))
Convolutional Layer
[procedure] (make-conv2d-layer in-channels out-channels kernel-size #!key (stride 1) (padding 0) (activation (make-identity)) (dtype 'f32) (name "Conv2D")) -> layer
Creates a 2D convolutional layer with He initialization.
(define conv (make-conv2d-layer 3 32 3
stride: 1
padding: 1
activation: (make-relu)))
Batch Normalization Layer
[procedure] (make-batch-norm-2d num-features #!key (epsilon 1e-5) (momentum 0.1) (dtype 'f32) (name "BatchNorm2d")) -> layer
Creates a 2D batch normalization layer. Normalizes activations across the batch dimension:
y = γ * (x - μ) / √(σ² + ε) + β
where μ and σ² are computed from the batch (training mode) or from running statistics (evaluation mode).
- num-features
- number of channels (C)
- epsilon
- small constant for numerical stability (default 1e-5)
- momentum
- momentum for updating running statistics (default 0.1)
- dtype
- 'f32 or 'f64 (default 'f32)
- name
- layer name
;; Create batch norm for 64 channels
(define bn (make-batch-norm-2d 64 epsilon: 1e-5 momentum: 0.1))

;; Training mode: uses batch statistics
(set-training-mode! bn #t)
(define normalized (forward bn input))   ; Input shape: (64, H, W)

;; Evaluation mode: uses running statistics
(set-eval-mode! bn)
(define normalized (forward bn input))   ; Deterministic output
Batch normalization improves training stability and convergence by:
- Reducing internal covariate shift
- Allowing higher learning rates
- Acting as a form of regularization
- Making networks less sensitive to initialization
Key features:
- Learnable scale (gamma) and shift (beta) parameters
- Running mean and variance maintained for evaluation
- Automatic mode switching between training and evaluation
- Numerical stability with epsilon parameter
Example: ResNet-style block with batch normalization
(define (make-resnet-block in-channels out-channels)
(make-sequential
(list
(make-conv2d-layer in-channels out-channels 3
padding: 1 activation: (make-identity))
(make-batch-norm-2d out-channels)
;; Apply ReLU activation here
(make-conv2d-layer out-channels out-channels 3
padding: 1 activation: (make-identity))
(make-batch-norm-2d out-channels))
name: "ResNetBlock"))
Global Average Pooling
[procedure] (global-avg-pool2d input) -> tensor
Global average pooling over spatial dimensions. Reduces spatial dimensions to 1x1 by averaging.
- Input shape
- (C, H, W)
- Output shape
- (C,)
Gradient: Distributed uniformly over all spatial positions for each channel.
;; Input: 128 channels, 8x8 spatial dimensions
(define feature-maps (make-tensor32 (make-f32vector (* 128 8 8)) '(128 8 8)))

;; Output: 128-dimensional feature vector
(define pooled (global-avg-pool2d feature-maps))   ; Shape: (128,)

;; Use in classification network
(define logits (forward fc-layer pooled))
Global average pooling is commonly used to replace large fully-connected layers at the end of CNNs:
- Reduces number of parameters dramatically
- Improves generalization
- Makes networks translation-invariant
- Standard in modern architectures (ResNet, MobileNet, EfficientNet)
Example: Replacing FC layers with global pooling
;; Traditional approach: flatten + dense (many parameters)
(define old-cnn
  (make-sequential
   (list
    (make-conv2d-layer 64 128 3)
    ;; Must flatten: (128, 8, 8) -> (8192,)
    (make-dense-layer 8192 10))))   ; 81,920 parameters!

;; Modern approach: global pooling + dense (fewer parameters)
(define new-cnn
  (make-sequential
   (list
    (make-conv2d-layer 64 128 3)
    ;; Global pooling: (128, 8, 8) -> (128,)
    (make-dense-layer 128 10))))    ; Only 1,280 parameters!
Sequential Container
[procedure] (make-sequential layers #!key (name "Sequential")) -> layer
Creates a sequential container that chains multiple layers.
(define model
(make-sequential
(list
(make-dense-layer 784 128 activation: (make-relu))
(make-dense-layer 128 64 activation: (make-relu))
(make-dense-layer 64 10 activation: (make-identity)))
name: "MLP"))
Layer Operations
[procedure] (forward layer input) -> tensor
Performs a forward pass through the layer.
[procedure] (parameters layer) -> list
Returns a list of all trainable parameter tensors.
[procedure] (zero-grad-layer! layer) -> void
Zeros gradients for all parameters in the layer.
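A brief sketch of the basic layer workflow using these operations:

(define layer (make-dense-layer 4 2 activation: (make-relu)))
(define x (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(4)))

(define out (forward layer x))     ; output tensor of shape (2)
(parameters layer)                 ; list of weight and bias tensors
(backward! (sum-tensor out))
(zero-grad-layer! layer)           ; clear gradients before the next step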
[procedure] (set-training-mode! layer training?) -> void
Sets the training mode for the layer. When training? is #t, the layer uses training-specific behavior (e.g., batch statistics for batch norm). When #f, uses evaluation behavior.
;; Set model to training mode
(set-training-mode! model #t)

;; Set model to evaluation mode
(set-training-mode! model #f)
[procedure] (set-eval-mode! layer) -> void
Shorthand for (set-training-mode! layer #f). Sets the layer to evaluation mode.
;; Evaluation mode (shorthand)
(set-eval-mode! model)
Training vs Evaluation Mode:
Training Mode ((set-training-mode! layer #t)):
- Batch normalization uses batch statistics (mean and variance computed from current batch)
- Dropout is active (if implemented)
- Stochastic behavior enabled
- Running statistics updated
Evaluation Mode ((set-eval-mode! layer)):
- Batch normalization uses running statistics (accumulated during training)
- Dropout is disabled
- Deterministic behavior
- Running statistics frozen
;; Complete training/evaluation workflow
(define (train-epoch model optimizer train-data)
  ;; Enable training mode
  (set-training-mode! model #t)
  (for-each
   (lambda (batch)
     (let* ((x (car batch))
            (y (cdr batch))
            (pred (forward model x))
            (loss (cross-entropy-loss pred y)))
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! model)))
   train-data))

(define (evaluate-epoch model test-data)
  ;; Enable evaluation mode
  (set-eval-mode! model)
  (let ((total-correct 0))
    (for-each
     (lambda (batch)
       (let* ((x (car batch))
              (y (cdr batch))
              (pred (forward model x)))
         ;; Count correct predictions
         (when (= (argmax pred) (argmax y))
           (set! total-correct (+ total-correct 1)))))
     test-data)
    ;; Return the fraction of correct predictions
    (/ total-correct (length test-data))))

;; Main loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  (train-epoch model optimizer train-data)
  (let ((accuracy (evaluate-epoch model test-data)))
    (printf "Epoch ~A: Test Accuracy = ~A%\n" epoch (* 100 accuracy))))
[procedure] (layer-input-size layer) -> integer
[procedure] (layer-output-size layer) -> integer
[procedure] (layer-activation layer) -> activation
[procedure] (layer-name layer) -> string
Accessor functions for layer properties.
Activation Function Objects
[procedure] (make-relu) -> activation
[procedure] (make-tanh) -> activation
[procedure] (make-sigmoid) -> activation
[procedure] (make-gelu) -> activation
[procedure] (make-silu) -> activation
[procedure] (make-identity) -> activation
Creates activation function objects for use in layers.
[procedure] (activation? obj) -> boolean
[procedure] (activation-forward act x) -> tensor
[procedure] (activation-name act) -> string
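Activation objects can also be applied directly, outside of a layer; a small sketch (the exact name string returned is implementation-defined):

(define act (make-relu))
(activation-name act)   ; name string, e.g. something like "ReLU"
(activation-forward act (make-tensor32 (f32vector -1.0 2.0) '(2)))   ; [0.0 2.0]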
Utility Functions
[procedure] (print-layer layer #!optional (indent 0)) -> void
Prints layer information with optional indentation.
[procedure] (summary model) -> void
Prints a model summary including all layers and parameter counts.
(summary model)
; === Model Summary ===
; Model: MLP
; Input size: 784
; Output size: 10
;
; Total parameters: 101770
nanograd-optimizer
Optimization algorithms for neural network training.
Optimizer Predicates
[procedure] (optimizer? obj) -> boolean
[procedure] (sgd? obj) -> boolean
[procedure] (adam? obj) -> boolean
[procedure] (rmsprop? obj) -> boolean
SGD Optimizer
[procedure] (make-sgd parameters #!key (learning-rate 0.01) (momentum 0.0) (weight-decay 0.0) (nesterov #f)) -> optimizer
Stochastic Gradient Descent optimizer with optional momentum and Nesterov acceleration.
- parameters
- list of parameter tensors to optimize
- learning-rate
- step size (default 0.01)
- momentum
- momentum factor (default 0.0, no momentum)
- weight-decay
- L2 regularization factor (default 0.0)
- nesterov
- use Nesterov momentum (default #f)
(define opt (make-sgd (parameters model)
learning-rate: 0.01
momentum: 0.9))
Adam Optimizer
[procedure] (make-adam parameters #!key (learning-rate 0.001) (beta1 0.9) (beta2 0.999) (epsilon 1e-8) (weight-decay 0.0)) -> optimizer
Adam (Adaptive Moment Estimation) optimizer with bias correction.
- beta1
- exponential decay rate for first moment (default 0.9)
- beta2
- exponential decay rate for second moment (default 0.999)
- epsilon
- numerical stability constant (default 1e-8)
(define opt (make-adam (parameters model) learning-rate: 0.001))
RMSprop Optimizer
[procedure] (make-rmsprop parameters #!key (learning-rate 0.01) (alpha 0.99) (epsilon 1e-8) (weight-decay 0.0) (momentum 0.0)) -> optimizer
RMSprop optimizer with optional momentum.
- alpha
- smoothing constant (default 0.99)
(define opt (make-rmsprop (parameters model)
learning-rate: 0.01
alpha: 0.99))
Optimizer Operations
[procedure] (step! optimizer) -> void
Applies parameter updates based on accumulated gradients.
[procedure] (get-learning-rate optimizer) -> number
Returns the current learning rate.
[procedure] (set-learning-rate! optimizer lr) -> void
Updates the learning rate (useful for learning rate scheduling).
; Learning rate decay
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  (set-learning-rate! opt (/ 0.1 (+ 1.0 (* 0.01 epoch))))
  ; ... training code ...
  )
[procedure] (optimizer-state optimizer) -> alist
Returns an association list of optimizer configuration parameters.
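For example, inspecting an optimizer's configuration (a sketch; the exact keys of the returned alist depend on the optimizer type):

(define opt (make-adam (parameters model) learning-rate: 0.001))
(optimizer-state opt)     ; association list of configuration parameters
(get-learning-rate opt)   ; => 0.001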
Examples
Basic Tensor Operations
(import nanograd-autograd)

; Create tensors
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define y (make-tensor32 (f32vector 4.0 5.0 6.0) '(3)))

; Operations
(define z (add x y))
(define w (mul x y))

; Compute gradients
(backward! w)
(print-tensor (tensor-grad x))
Reduction Operations
(import nanograd-autograd)

;; Sum all elements
(define x (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(4)))
(define total (sum-tensor x))      ; 10.0
(backward! total)
(print-tensor (tensor-grad x))     ; Each element: 1.0

;; Mean of elements (zero the gradient first; backward! accumulates)
(zero-grad! x)
(define avg (mean-tensor x))       ; 2.5
(backward! avg)
(print-tensor (tensor-grad x))     ; Each element: 0.25

;; Product of elements
(zero-grad! x)
(define prod (product-tensor x))   ; 24.0
(backward! prod)
(print-tensor (tensor-grad x))     ; [24.0, 12.0, 8.0, 6.0]
Tensor Slicing for Mini-Batch Training
(import nanograd-autograd)

;; Create dataset tensor
(define dataset (make-tensor32 training-data '(1000 784)))

;; Process in mini-batches
(define batch-size 32)

(do ((i 0 (+ i batch-size)))
    ((>= i 1000))
  ;; Extract batch
  (let* ((batch (slice-tensor dataset i batch-size))
         (output (forward model batch))
         (loss (mse-loss output targets)))
    ;; Backprop and optimize
    (backward! loss)
    (step! optimizer)
    (zero-grad-layer! model)))
Training a Neural Network
(import nanograd-autograd nanograd-layer nanograd-optimizer)

; Define model
(define model
  (make-sequential
   (list
    (make-dense-layer 2 8 activation: (make-relu))
    (make-dense-layer 8 1 activation: (make-identity)))
   name: "Regression"))

; Create optimizer
(define optimizer (make-adam (parameters model) learning-rate: 0.01))

; Training loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  (for-each
   (lambda (sample)
     (let* ((x (make-tensor32 (car sample) '(2)))
            (target (make-tensor32 (f32vector (cdr sample)) '(1)))
            (pred (forward model x))
            (loss (mse-loss pred target)))
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! model)))
   training-data))
Convolutional Neural Network with Batch Normalization
(import nanograd-autograd nanograd-layer nanograd-optimizer)

;; Modern CNN architecture with batch normalization
(define cnn
  (make-sequential
   (list
    ;; Convolutional block 1
    (make-conv2d-layer 3 32 3 stride: 1 padding: 1
                       activation: (make-identity))
    (make-batch-norm-2d 32)
    ;; Convolutional block 2
    (make-conv2d-layer 32 64 3 stride: 1 padding: 1
                       activation: (make-identity))
    (make-batch-norm-2d 64)
    ;; Global average pooling instead of flatten
    ;; (64, H, W) -> (64,)
    (make-dense-layer 64 128 activation: (make-relu))
    (make-dense-layer 128 10 activation: (make-identity)))
   name: "CNN"))

;; Training with proper mode switching
(define optimizer (make-adam (parameters cnn) learning-rate: 0.001))

(define (train-one-epoch)
  ;; Set training mode for batch norm
  (set-training-mode! cnn #t)
  (for-each
   (lambda (batch)
     (let* ((images (car batch))   ; Shape: (batch, 3, 32, 32)
            (labels (cdr batch))
            ;; Process each image in batch
            (predictions (map (lambda (img) (forward cnn img)) images))
            (loss (compute-loss predictions labels)))
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! cnn)))
   train-batches))

(define (evaluate)
  ;; Set evaluation mode for batch norm
  (set-eval-mode! cnn)
  (let ((correct 0)
        (total 0))
    (for-each
     (lambda (batch)
       (let* ((images (car batch))
              (labels (cdr batch)))
         (for-each
          (lambda (img label)
            (let ((pred (forward cnn img)))
              (when (= (argmax (tensor->list pred))
                       (argmax (tensor->list label)))
                (set! correct (+ correct 1)))
              (set! total (+ total 1))))
          images labels)))
     test-batches)
    (/ correct total)))

;; Main training loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 50))
  (train-one-epoch)
  (printf "Epoch ~A: Test Accuracy = ~A%\n"
          epoch (* 100 (evaluate))))
ResNet-Style Architecture
;; ResNet block with batch normalization
(define (make-resnet-block in-channels out-channels stride)
  (make-sequential
   (list
    (make-conv2d-layer in-channels out-channels 3
                       stride: stride padding: 1
                       activation: (make-identity))
    (make-batch-norm-2d out-channels)
    ;; ReLU activation
    (make-conv2d-layer out-channels out-channels 3
                       stride: 1 padding: 1
                       activation: (make-identity))
    (make-batch-norm-2d out-channels))
   name: "ResBlock"))

;; Full ResNet-18 style model
(define resnet
  (make-sequential
   (list
    ;; Initial convolution
    (make-conv2d-layer 3 64 7 stride: 2 padding: 3)
    (make-batch-norm-2d 64)
    ;; Residual blocks
    (make-resnet-block 64 64 1)
    (make-resnet-block 64 128 2)
    (make-resnet-block 128 256 2)
    (make-resnet-block 256 512 2)
    ;; Global average pooling: (512, H, W) -> (512,)
    (make-dense-layer 512 1000))
   name: "ResNet18"))
Performance Notes
- NanoGrad uses BLAS for matrix operations
- Use f32 (32-bit) tensors for better performance when 64-bit precision is not required
- The framework detects computation graph cycles and prevents infinite loops during backpropagation
- Memory is managed manually; call zero-grad-layer! after each optimization step
- Batch normalization adds minimal computational overhead but significantly improves training
- Global average pooling reduces parameters without sacrificing performance
Limitations
- CPU-only (no GPU support)
- No automatic batching
- Limited built-in layer types (dense, convolutional, batch norm)
- Single-threaded execution
- Batch normalization requires proper training/eval mode switching
Advanced Usage
Custom Reduction Operations
;; L-infinity norm (maximum absolute value)
(define (l-inf-norm tensor)
  (reduce-tensor (abs tensor) max
    compute-gradient:
    (lambda (grad-out idx val all-values)
      (let ((max-val (apply max all-values)))
        (if (= val max-val) grad-out 0.0)))))

;; Weighted sum
(define (weighted-sum tensor weights)
  (let ((weighted (mul tensor weights)))
    (sum-tensor weighted)))

;; Geometric mean
(define (geometric-mean tensor)
  (let* ((n (apply * (tensor-shape tensor)))
         (log-vals (log-tensor tensor))
         (sum (sum-tensor log-vals))
         (mean-log (scale-op sum (/ 1.0 n))))
    (exp mean-log)))
Gradient Clipping
; Clip gradients by norm
(define (clip-grad-norm! parameters max-norm)
  (let ((total-norm 0.0))
    ; Compute total norm
    (for-each
     (lambda (param)
       (let ((grad (tensor-grad param)))
         (when grad
           (let* ((dtype (tensor-dtype param))
                  (n (vector-length-for-dtype grad dtype)))
             (case dtype
               ((f32)
                (do ((i 0 (+ i 1))) ((= i n))
                  (let ((g (f32vector-ref grad i)))
                    (set! total-norm (+ total-norm (* g g))))))
               ((f64)
                (do ((i 0 (+ i 1))) ((= i n))
                  (let ((g (f64vector-ref grad i)))
                    (set! total-norm (+ total-norm (* g g)))))))))))
     parameters)
    (let ((total-norm (sqrt total-norm)))
      (when (> total-norm max-norm)
        (let ((scale (/ max-norm total-norm)))
          ; Scale all gradients
          (for-each
           (lambda (param)
             (let ((grad (tensor-grad param)))
               (when grad
                 (let ((n (vector-length-for-dtype grad (tensor-dtype param))))
                   (case (tensor-dtype param)
                     ((f32) (sscal! n scale grad))
                     ((f64) (dscal! n scale grad)))))))
           parameters))))))

; Usage
(backward! loss)
(clip-grad-norm! (parameters model) 1.0)
(step! optimizer)
Learning Rate Scheduling
; Step decay
(define (step-decay base-lr epoch drop-every drop-rate)
  (* base-lr (expt drop-rate (floor (/ epoch drop-every)))))

; Exponential decay
(define (exp-decay base-lr epoch decay-rate)
  (* base-lr (exp (- (* decay-rate epoch)))))

; Cosine annealing
(define (cosine-annealing base-lr epoch total-epochs)
  (* 0.5 base-lr (+ 1.0 (cos (* 3.14159 (/ epoch total-epochs))))))

; Usage in training loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  (set-learning-rate! optimizer (step-decay 0.1 epoch 30 0.5))
  ; ... training code ...
  )
Troubleshooting
Common Errors
Shape mismatch errors
Ensure tensor shapes are compatible for operations:
; Matrix multiplication requires compatible dimensions
(define A (make-tensor32 (make-f32vector 6) '(2 3)))
(define B (make-tensor32 (make-f32vector 6) '(3 2)))
(define C (matmul-op A B))   ; OK: (2,3) × (3,2) = (2,2)

(define D (make-tensor32 (make-f32vector 4) '(2 2)))
(matmul-op A D)              ; Error: incompatible dimensions
Gradient computation cycles
Avoid creating cycles in the computation graph:
; Bad: creates a cycle
(define x (make-tensor32 (f32vector 1.0) '(1)))
(define y (add x x))
(set-backward-fn! x
                  (lambda () (add-to-grad! x (tensor-grad y)))
                  (list y))
(backward! y)   ; Error: computation graph contains cycles
Division by zero
Use safe-div when dividing by potentially zero values:
; Instead of (div a b), use:
(define result (safe-div a b epsilon: 1e-8))
Batch normalization not switching modes
Always set training/eval mode explicitly:
; Training
(set-training-mode! model #t)
(train-epoch model)

; Evaluation
(set-eval-mode! model)
(evaluate model)
Author
Repository
https://github.com/iraikov/nanograd
Version History
- 1.2
- Recent additions
: * Reduction operations (sum-tensor, mean-tensor, product-tensor, reduce-tensor)
: * Tensor slicing (slice-tensor)
: * Batch normalization (make-batch-norm-2d)
: * Global average pooling (global-avg-pool2d)
: * Training/evaluation mode control (set-training-mode!, set-eval-mode!)
- 1.1
- Bug fix in mul layer operation
- 1.0
- Initial release
: * Core autograd engine
: * Dense and convolutional layers
: * SGD, Adam, and RMSprop optimizers
: * Basic activation and loss functions
See Also
License
LGPL-3
References
- PyTorch: Dynamic computation graphs and autograd
- micrograd: Minimalist autograd engine by Andrej Karpathy
- "Automatic Differentiation in Machine Learning: a Survey" (Baydin et al., 2018)
- "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (Ioffe & Szegedy, 2015)
- "BLAS (Basic Linear Algebra Subprograms)" documentation