nanograd

A lightweight automatic differentiation and neural network framework for CHICKEN Scheme, featuring BLAS-accelerated operations and YASOS-based object abstractions.

Description

NanoGrad provides a complete framework for building and training neural networks with automatic differentiation. It features:

* reverse-mode automatic differentiation over 32- and 64-bit floating-point tensors
* BLAS-accelerated linear algebra (GEMM, GEMV, DOT, SCAL, AXPY)
* dense, convolutional, and batch normalization layers with a sequential container
* SGD, Adam, and RMSprop optimizers
* YASOS-based object abstractions for layers, activations, and optimizers

Requirements

* blas
* yasos

Modules

nanograd-autograd

Core automatic differentiation engine with tensor operations.

Tensor Constructors
[procedure] (make-tensor32 data shape #!key (requires-grad? #t)) -> tensor

Creates a 32-bit floating-point tensor with automatic differentiation support.

data
f32vector containing the tensor data
shape
list of dimensions, e.g., '(2 3) for a 2x3 matrix
requires-grad?
whether to track gradients (default #t)
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3) requires-grad?: #t))
[procedure] (make-tensor64 data shape #!key (requires-grad? #t)) -> tensor

Creates a 64-bit floating-point tensor with automatic differentiation support.

(define x (make-tensor64 (f64vector 1.0 2.0 3.0 4.0) '(2 2)))
Tensor Predicates
[procedure] (tensor? obj) -> boolean
[procedure] (tensor32? obj) -> boolean
[procedure] (tensor64? obj) -> boolean

Type predicates for tensors.
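
A quick illustration:

(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(tensor? x)    ; => #t
(tensor32? x)  ; => #t
(tensor64? x)  ; => #f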

Tensor Accessors
[procedure] (tensor-data tensor) -> vector

Returns the underlying f32vector or f64vector containing the tensor's data.

[procedure] (tensor-grad tensor) -> vector or #f

Returns the gradient vector if gradients are enabled, #f otherwise.

[procedure] (tensor-shape tensor) -> list

Returns the shape as a list of dimensions.

[procedure] (tensor-dtype tensor) -> symbol

Returns the data type: 'f32 or 'f64.

[procedure] (tensor-requires-grad? tensor) -> boolean

Returns #t if the tensor tracks gradients.
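
For example:

(define x (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(tensor-shape x)           ; => (2 2)
(tensor-dtype x)           ; => f32
(tensor-requires-grad? x)  ; => #t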

Arithmetic Operations
[procedure] (add a b) -> tensor

Element-wise addition of tensors a and b. Both tensors must have the same shape and dtype.

(define z (add x y))  ; z = x + y

Gradient: dL/da = dL/dz, dL/db = dL/dz

[procedure] (sub a b) -> tensor

Element-wise subtraction: a - b.

Gradient: dL/da = dL/dz, dL/db = -dL/dz

[procedure] (mul a b) -> tensor

Element-wise multiplication (Hadamard product).

Gradient: dL/da = dL/dz ⊙ b, dL/db = dL/dz ⊙ a
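
The product rule can be checked directly by reducing z to a scalar with sum-tensor (documented under Reduction Operations below):

(define x (make-tensor32 (f32vector 2.0 3.0) '(2)))
(define y (make-tensor32 (f32vector 5.0 7.0) '(2)))
(define z (mul x y))
(backward! (sum-tensor z))
(tensor-grad x)  ; => 5.0, 7.0 (the values of y)
(tensor-grad y)  ; => 2.0, 3.0 (the values of x)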

[procedure] (div a b) -> tensor

Element-wise division: a / b.

Gradient: dL/da = dL/dz / b, dL/db = -dL/dz · (a / b²)

[procedure] (safe-div a b #!key (epsilon 1e-8)) -> tensor

Safe element-wise division: a / (b + epsilon) to avoid division by zero.

Linear Algebra Operations
[procedure] (matmul-op a b) -> tensor

Matrix multiplication using BLAS GEMM/GEMV operations. Supports:

* matrix × matrix products via GEMM: (m, k) × (k, n) → (m, n)
* matrix × vector products via GEMV: (m, k) × (k) → (m)

(define A (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(define b (make-tensor32 (f32vector 5.0 6.0) '(2)))
(define c (matmul-op A b))  ; 2×2 matrix times 2×1 vector = 2×1 vector

Gradient: dL/dA = dL/dC · B^T, dL/dB = A^T · dL/dC

[procedure] (dot-op a b) -> tensor

Dot product (inner product) of two 1D vectors using BLAS DOT.

(define result (dot-op x y))  ; scalar result

Gradient: dL/da = (dL/dresult) · b, dL/db = (dL/dresult) · a

[procedure] (scale-op tensor scalar) -> tensor

Scalar multiplication using BLAS SCAL.

Gradient: dL/dtensor = scalar · dL/dresult
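
For example:

(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define y (scale-op x 2.0))    ; y = [2.0, 4.0, 6.0]
(backward! (sum-tensor y))
(tensor-grad x)                ; each element receives gradient 2.0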

Reduction Operations
[procedure] (reduce-tensor tensor reducer #!key (compute-gradient #f)) -> tensor

Generic reduction operation that maintains gradient flow. The reducer function is applied to each element in the forward pass. An optional compute-gradient function specifies how gradients are distributed in the backward pass.

tensor
input tensor to reduce
reducer
function (element accumulator) -> new-accumulator
compute-gradient
optional function (grad-out index value all-values) -> grad-in
If not provided, assumes uniform distribution (like sum)

Returns a scalar tensor with the reduced value.

;; Sum all elements (uniform gradient distribution)
(define total (reduce-tensor x +))

;; Product of all elements (gradient uses product rule)
(define prod (reduce-tensor x *
  compute-gradient: (lambda (grad-out idx val all-values)
                     ;; d(prod)/dx_i = prod / x_i  (fold is from srfi-1)
                     (let ((prod (fold * 1.0 all-values)))
                       (if (zero? val)
                           0.0
                           (* grad-out (/ prod val)))))))

;; Custom maximum with gradient flowing only to max element
(define max-val (reduce-tensor x max
  compute-gradient: (lambda (grad-out idx val all-values)
                     (if (= val (apply max all-values))
                         grad-out
                         0.0))))
[procedure] (sum-tensor tensor) -> tensor

Sums all elements in the tensor. Gradient is distributed uniformly to all elements.

(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define total (sum-tensor x))  ; Returns scalar tensor with value 6.0

(backward! total)
(tensor-grad x)  ; Each element receives gradient of 1.0
[procedure] (product-tensor tensor) -> tensor

Computes the product of all elements. Gradient uses the product rule: d(prod)/dx_i = prod / x_i.

(define x (make-tensor32 (f32vector 2.0 3.0 4.0) '(3)))
(define prod (product-tensor x))  ; Returns 24.0

(backward! prod)
(tensor-grad x)  ; Gradients: [12.0, 8.0, 6.0]
[procedure] (mean-tensor tensor) -> tensor

Computes the mean (average) of all elements. Equivalent to (sum-tensor tensor) / n.

(define x (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(4)))
(define avg (mean-tensor x))  ; Returns 2.5

(backward! avg)
(tensor-grad x)  ; Each element receives gradient of 0.25
Tensor Manipulation Operations
[procedure] (slice-tensor tensor start length) -> tensor

Extracts a slice of a tensor along the first dimension. Gradients flow back correctly to the original tensor positions.

tensor
input tensor with shape (n, ...)
start
starting index (0-based)
length
number of elements to extract
Returns
tensor with shape (length, ...)
;; Slice a batch of data
(define batch-data (make-tensor32 (make-f32vector 100) '(10 10)))
(define mini-batch (slice-tensor batch-data 2 5))  ; Shape: (5, 10)

;; Gradients flow back to original positions
(backward! (sum-tensor mini-batch))
(tensor-grad batch-data)  ; Only indices 2-6 have non-zero gradients

Example: Mini-batch training

(define dataset (make-tensor32 training-data '(1000 784)))
(define batch-size 32)

(do ((i 0 (+ i batch-size)))
    ((>= i 1000))
  (let* ((batch (slice-tensor dataset i batch-size))
         (output (forward model batch))
         (loss (mse-loss output targets)))
    (backward! loss)
    (step! optimizer)
    (zero-grad-layer! model)))
[procedure] (reshape tensor new-shape) -> tensor

Reshapes the tensor. Total number of elements must be preserved. Creates a new tensor with separate gradient buffer but shared underlying data.

(define x (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(define x-flat (reshape x '(4)))  ; Flatten to 1D
(define x-back (reshape x-flat '(2 2)))  ; Reshape back
[procedure] (flatten-tensor tensor) -> tensor

Flattens a multi-dimensional tensor to 1D. Equivalent to (reshape tensor (list total-size)).
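
For example:

(define img (make-tensor32 (make-f32vector 12) '(3 2 2)))
(define v (flatten-tensor img))  ; Shape: (12)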

Activation Functions
[procedure] (relu tensor) -> tensor

Rectified Linear Unit: max(0, x).

Gradient: 1 if x > 0, else 0

[procedure] (tanh-op tensor) -> tensor

Hyperbolic tangent activation.

Gradient: 1 - tanh²(x)

[procedure] (sigmoid tensor) -> tensor

Sigmoid (logistic) activation: σ(x) = 1 / (1 + e^(-x)).

Gradient: σ(x) · (1 - σ(x))

[procedure] (sigmoid-stable tensor) -> tensor

Numerically stable sigmoid implementation for large negative values.

[procedure] (softmax x #!key (dim #f)) -> tensor

Softmax normalization with numerical stability (subtracts max before exp).

(define probs (softmax logits))  ; Converts logits to probabilities
[procedure] (log-softmax x #!key (dim #f)) -> tensor

Log-softmax: more numerically stable than log(softmax(x)).

[procedure] (leaky-relu tensor #!key (alpha 0.01)) -> tensor

Leaky ReLU: max(alpha * x, x).

[procedure] (softplus tensor #!key (beta 1.0)) -> tensor

Softplus activation: log(1 + e^(beta * x)) / beta.

[procedure] (gelu tensor) -> tensor

Gaussian Error Linear Unit activation using tanh approximation.

[procedure] (silu tensor) -> tensor

SiLU (Sigmoid Linear Unit) activation, also known as Swish: x * σ(x).
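
A side-by-side sketch of several activations applied to the same input (values approximate):

(define a (make-tensor32 (f32vector -2.0 0.0 2.0) '(3)))
(tensor->list (relu a))     ; => (0.0 0.0 2.0)
(tensor->list (sigmoid a))  ; => (0.119 0.5 0.881), approximately
(tensor->list (silu a))     ; => (-0.238 0.0 1.762), approximately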

Loss Functions
[procedure] (mse-loss pred target) -> tensor

Mean Squared Error loss: L = (1/n) ∑(pred - target)².

(define loss (mse-loss predictions targets))
[procedure] (cross-entropy-loss pred target) -> tensor

Cross-entropy loss: L = -∑(target · log(pred)).

Note: Assumes pred is already normalized (e.g., via softmax).
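
In practice, softmax and cross-entropy are used together; a minimal sketch (model, x, and target assumed defined):

(define probs (softmax (forward model x)))       ; normalize logits first
(define loss (cross-entropy-loss probs target))  ; target: one-hot tensor
(backward! loss)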

Gradient Operations
[procedure] (zero-grad! tensor) -> void

Sets all gradient values to zero.

[procedure] (backward! tensor) -> void

Computes gradients via reverse-mode automatic differentiation. Performs topological sort and executes backward functions in correct order. Detects cycles and raises an error if found.

(define x (make-tensor32 (f32vector 1.0 2.0) '(2)))
(define y (make-tensor32 (f32vector 3.0 4.0) '(2)))
(define z (add x y))
(define loss (dot-op z z))

(backward! loss)
(print-tensor (tensor-grad x))
[procedure] (add-to-grad! tensor delta) -> void

Accumulates delta into the tensor's gradient using BLAS AXPY.

Convolution Operations
[procedure] (conv2d input weight bias #!key (stride 1) (padding 0)) -> tensor

2D convolution using im2col + GEMM algorithm.

input
tensor of shape (C_in, H, W)
weight
tensor of shape (C_out, C_in, KH, KW)
bias
tensor of shape (C_out) or #f
stride
stride for convolution (default 1)
padding
zero-padding (default 0)
(define output (conv2d input weights bias stride: 2 padding: 1))
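
The output spatial size should follow the usual convolution arithmetic, H_out = floor((H + 2·padding - KH) / stride) + 1 (and likewise for W); a sketch under that assumption:

;; (3, 32, 32) input, 16 filters of size 3x3, stride 2, padding 1:
;; floor((32 + 2*1 - 3) / 2) + 1 = 16
(define input  (make-tensor32 (make-f32vector (* 3 32 32)) '(3 32 32)))
(define weight (make-tensor32 (make-f32vector (* 16 3 3 3)) '(16 3 3 3)))
(define bias   (make-tensor32 (make-f32vector 16) '(16)))
(define out (conv2d input weight bias stride: 2 padding: 1))  ; Shape: (16, 16, 16)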
Normalization Operations
[procedure] (rmsnorm x weight #!key (epsilon 1e-5)) -> tensor

Root Mean Square Layer Normalization.
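
RMSNorm computes y = x / √(mean(x²) + ε) · weight. A minimal usage sketch, assuming weight is a learnable scale tensor with the same shape as x:

(define gamma (make-tensor32 (f32vector 1.0 1.0 1.0) '(3)))
(define normed (rmsnorm x gamma epsilon: 1e-5))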

[procedure] (l2-normalize tensor #!key (epsilon 1e-8)) -> tensor

L2 normalization: x / ||x||₂.

[procedure] (cosine-similarity a b) -> tensor

Cosine similarity: (a · b) / (||a|| · ||b||).
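
Both are common for comparing embedding vectors; for example, with a and b being 1D tensors:

(define a-hat (l2-normalize a))       ; unit-length version of a
(define sim (cosine-similarity a b))  ; scalar tensor in [-1, 1]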

Utility Functions
[procedure] (tensor->list tensor) -> list

Converts tensor data to a list.

[procedure] (print-tensor tensor) -> void

Pretty-prints tensor information including shape, dtype, data, and gradients.

[procedure] (vector-length-for-dtype vec dtype) -> integer

Returns the length of a vector based on its dtype.

nanograd-layer

Neural network layer abstractions and containers.

Layer Predicates
[procedure] (layer? obj) -> boolean
[procedure] (dense-layer? obj) -> boolean
[procedure] (conv2d-layer? obj) -> boolean
[procedure] (batch-norm-2d? obj) -> boolean
[procedure] (sequential? obj) -> boolean
Dense Layer
[procedure] (make-dense-layer input-size output-size #!key (activation (make-identity)) (dtype 'f32) (name "Dense")) -> layer

Creates a fully-connected (dense) layer with Xavier/Glorot initialization.

input-size
number of input features
output-size
number of output features
activation
activation function object (default identity)
dtype
'f32 or 'f64 (default 'f32)
name
layer name for debugging
(define layer (make-dense-layer 784 128 
                                activation: (make-relu)
                                name: "Hidden1"))
Convolutional Layer
[procedure] (make-conv2d-layer in-channels out-channels kernel-size #!key (stride 1) (padding 0) (activation (make-identity)) (dtype 'f32) (name "Conv2D")) -> layer

Creates a 2D convolutional layer with He initialization.

(define conv (make-conv2d-layer 3 32 3 
                                stride: 1 
                                padding: 1
                                activation: (make-relu)))
Batch Normalization Layer
[procedure] (make-batch-norm-2d num-features #!key (epsilon 1e-5) (momentum 0.1) (dtype 'f32) (name "BatchNorm2d")) -> layer

Creates a 2D batch normalization layer. Normalizes activations across the batch dimension:

y = γ * (x - μ) / √(σ² + ε) + β

where μ and σ² are computed from the batch (training mode) or from running statistics (evaluation mode).

num-features
number of channels (C)
epsilon
small constant for numerical stability (default 1e-5)
momentum
momentum for updating running statistics (default 0.1)
dtype
'f32 or 'f64 (default 'f32)
name
layer name
;; Create batch norm for 64 channels
(define bn (make-batch-norm-2d 64 epsilon: 1e-5 momentum: 0.1))

;; Training mode: uses batch statistics
(set-training-mode! bn #t)
(define normalized (forward bn input))  ; Input shape: (64, H, W)

;; Evaluation mode: uses running statistics
(set-eval-mode! bn)
(define normalized (forward bn input))  ; Deterministic output

Batch normalization improves training stability and convergence by:

* reducing sensitivity to weight initialization
* allowing higher learning rates
* acting as a mild regularizer

Key features:

* learnable per-channel scale (γ) and shift (β) parameters
* running mean/variance statistics updated with the given momentum during training
* separate training and evaluation behavior (see set-training-mode! and set-eval-mode!)

Example: ResNet-style block with batch normalization

(define (make-resnet-block in-channels out-channels)
  (make-sequential
   (list
    (make-conv2d-layer in-channels out-channels 3 
                       padding: 1 activation: (make-identity))
    (make-batch-norm-2d out-channels)
    ;; Apply ReLU activation here
    (make-conv2d-layer out-channels out-channels 3
                       padding: 1 activation: (make-identity))
    (make-batch-norm-2d out-channels))
   name: "ResNetBlock"))
Global Average Pooling
[procedure] (global-avg-pool2d input) -> tensor

Global average pooling over spatial dimensions. Reduces spatial dimensions to 1x1 by averaging.

Input shape
(C, H, W)
Output shape
(C,)

Gradient: Distributed uniformly over all spatial positions for each channel.

;; Input: 128 channels, 8x8 spatial dimensions
(define feature-maps (make-tensor32 (make-f32vector (* 128 8 8)) '(128 8 8)))

;; Output: 128-dimensional feature vector
(define pooled (global-avg-pool2d feature-maps))  ; Shape: (128,)

;; Use in classification network
(define logits (forward fc-layer pooled))

Global average pooling is commonly used to replace large fully-connected layers at the end of CNNs:

Example: Replacing FC layers with global pooling

;; Traditional approach: flatten + dense (many parameters)
(define old-cnn
  (make-sequential
   (list
    (make-conv2d-layer 64 128 3)
    ;; Must flatten: (128, 8, 8) -> (8192,)
    (make-dense-layer 8192 10))))  ; 81,920 weights!

;; Modern approach: global pooling + dense (fewer parameters)
(define new-cnn
  (make-sequential
   (list
    (make-conv2d-layer 64 128 3)
    ;; Global pooling: (128, 8, 8) -> (128,)
    (make-dense-layer 128 10))))  ; only 1,280 weights!
Sequential Container
[procedure] (make-sequential layers #!key (name "Sequential")) -> layer

Creates a sequential container that chains multiple layers.

(define model
  (make-sequential
   (list
    (make-dense-layer 784 128 activation: (make-relu))
    (make-dense-layer 128 64 activation: (make-relu))
    (make-dense-layer 64 10 activation: (make-identity)))
   name: "MLP"))
Layer Operations
[procedure] (forward layer input) -> tensor

Performs a forward pass through the layer.

[procedure] (parameters layer) -> list

Returns a list of all trainable parameter tensors.

[procedure] (zero-grad-layer! layer) -> void

Zeros gradients for all parameters in the layer.

[procedure] (set-training-mode! layer training?) -> void

Sets the training mode for the layer. When training? is #t, the layer uses training-specific behavior (e.g., batch statistics for batch norm). When #f, uses evaluation behavior.

;; Set model to training mode
(set-training-mode! model #t)

;; Set model to evaluation mode
(set-training-mode! model #f)
[procedure] (set-eval-mode! layer) -> void

Shorthand for (set-training-mode! layer #f). Sets the layer to evaluation mode.

;; Evaluation mode (shorthand)
(set-eval-mode! model)

Training vs Evaluation Mode:

Training Mode ((set-training-mode! layer #t)):

* batch normalization computes μ and σ² from the current batch and updates its running statistics with the configured momentum

Evaluation Mode ((set-eval-mode! layer)):

* batch normalization uses the stored running statistics, giving deterministic output

;; Complete training/evaluation workflow
(define (train-epoch model optimizer train-data)
  ;; Enable training mode
  (set-training-mode! model #t)
  
  (for-each
   (lambda (batch)
     (let* ((x (car batch))
            (y (cdr batch))
            (pred (forward model x))
            (loss (cross-entropy-loss pred y)))
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! model)))
   train-data))

(define (evaluate-epoch model test-data)
  ;; Enable evaluation mode
  (set-eval-mode! model)
  
  (let ((total-correct 0))
    (for-each
     (lambda (batch)
       (let* ((x (car batch))
              (y (cdr batch))
              (pred (forward model x)))
         ;; Count correct predictions (argmax is a user-supplied helper)
         (when (= (argmax pred) (argmax y))
           (set! total-correct (+ total-correct 1)))))
     test-data)
    ;; Return accuracy as a fraction of the test set
    (/ total-correct (length test-data))))

;; Main loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  (train-epoch model optimizer train-data)
  (let ((accuracy (evaluate-epoch model test-data)))
    (printf "Epoch ~A: Test Accuracy = ~A%\n" 
            epoch (* 100 accuracy))))
[procedure] (layer-input-size layer) -> integer
[procedure] (layer-output-size layer) -> integer
[procedure] (layer-activation layer) -> activation
[procedure] (layer-name layer) -> string

Accessor functions for layer properties.
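
For the dense layer defined under Dense Layer above:

(layer-input-size layer)   ; => 784
(layer-output-size layer)  ; => 128
(layer-name layer)         ; => "Hidden1"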

Activation Function Objects
[procedure] (make-relu) -> activation
[procedure] (make-tanh) -> activation
[procedure] (make-sigmoid) -> activation
[procedure] (make-gelu) -> activation
[procedure] (make-silu) -> activation
[procedure] (make-identity) -> activation

Creates activation function objects for use in layers.

[procedure] (activation? obj) -> boolean
[procedure] (activation-forward act x) -> tensor
[procedure] (activation-name act) -> string
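
These operate on activation objects directly; activation-forward applies the wrapped function to a tensor. A short sketch (the exact name strings are implementation-defined):

(define act (make-relu))
(activation? act)           ; => #t
(activation-forward act x)  ; equivalent to (relu x)
(activation-name act)       ; => the activation's display name, e.g. "ReLU"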
Utility Functions
[procedure] (print-layer layer #!optional (indent 0)) -> void

Prints layer information with optional indentation.

[procedure] (summary model) -> void

Prints a model summary including all layers and parameter counts.

(summary model)
; === Model Summary ===
; Model: MLP
; Input size: 784
; Output size: 10
; 
; Total parameters: 109386

nanograd-optimizer

Optimization algorithms for neural network training.

Optimizer Predicates
[procedure] (optimizer? obj) -> boolean
[procedure] (sgd? obj) -> boolean
[procedure] (adam? obj) -> boolean
[procedure] (rmsprop? obj) -> boolean
SGD Optimizer
[procedure] (make-sgd parameters #!key (learning-rate 0.01) (momentum 0.0) (weight-decay 0.0) (nesterov #f)) -> optimizer

Stochastic Gradient Descent optimizer with optional momentum and Nesterov acceleration.

parameters
list of parameter tensors to optimize
learning-rate
step size (default 0.01)
momentum
momentum factor (default 0.0, no momentum)
weight-decay
L2 regularization factor (default 0.0)
nesterov
use Nesterov momentum (default #f)
(define opt (make-sgd (parameters model) 
                      learning-rate: 0.01
                      momentum: 0.9))
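
For reference, the conventional momentum update (which this optimizer is described as providing), with gradient g, momentum μ, and weight decay λ:

v ← μ·v + g + λ·θ
θ ← θ - learning-rate · v

Nesterov momentum instead evaluates the gradient at a look-ahead point before applying the velocity.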
Adam Optimizer
[procedure] (make-adam parameters #!key (learning-rate 0.001) (beta1 0.9) (beta2 0.999) (epsilon 1e-8) (weight-decay 0.0)) -> optimizer

Adam (Adaptive Moment Estimation) optimizer with bias correction.

beta1
exponential decay rate for first moment (default 0.9)
beta2
exponential decay rate for second moment (default 0.999)
epsilon
numerical stability constant (default 1e-8)
(define opt (make-adam (parameters model) learning-rate: 0.001))
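
For reference, the standard Adam update with bias correction at step t:

m ← β₁·m + (1-β₁)·g
v ← β₂·v + (1-β₂)·g²
m̂ = m / (1-β₁ᵗ),  v̂ = v / (1-β₂ᵗ)
θ ← θ - learning-rate · m̂ / (√v̂ + ε)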
RMSprop Optimizer
[procedure] (make-rmsprop parameters #!key (learning-rate 0.01) (alpha 0.99) (epsilon 1e-8) (weight-decay 0.0) (momentum 0.0)) -> optimizer

RMSprop optimizer with optional momentum.

alpha
smoothing constant (default 0.99)
(define opt (make-rmsprop (parameters model) 
                          learning-rate: 0.01
                          alpha: 0.99))
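
For reference, the standard RMSprop update with smoothing constant α:

s ← α·s + (1-α)·g²
θ ← θ - learning-rate · g / (√s + ε)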
Optimizer Operations
[procedure] (step! optimizer) -> void

Applies parameter updates based on accumulated gradients.

[procedure] (get-learning-rate optimizer) -> number

Returns the current learning rate.

[procedure] (set-learning-rate! optimizer lr) -> void

Updates the learning rate (useful for learning rate scheduling).

; Learning rate decay
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  (set-learning-rate! opt (/ 0.1 (+ 1.0 (* 0.01 epoch))))
  ; ... training code ...
  )
[procedure] (optimizer-state optimizer) -> alist

Returns an association list of optimizer configuration parameters.
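
A sketch of what to expect (the exact keys are implementation-defined):

(optimizer-state opt)
; => e.g. ((learning-rate . 0.001) (beta1 . 0.9) (beta2 . 0.999) (epsilon . 1e-8) ...)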

Examples

Basic Tensor Operations

(import nanograd-autograd)

; Create tensors
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define y (make-tensor32 (f32vector 4.0 5.0 6.0) '(3)))

; Operations
(define z (add x y))
(define w (mul x y))

; Compute gradients (reduce to a scalar first)
(define loss (sum-tensor w))
(backward! loss)
(print-tensor (tensor-grad x))

Reduction Operations

(import nanograd-autograd)

;; Sum all elements
(define x (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(4)))
(define total (sum-tensor x))  ; 10.0
(backward! total)
(print-tensor (tensor-grad x))  ; Each element: 1.0

;; Mean of elements
(define avg (mean-tensor x))  ; 2.5
(backward! avg)
(print-tensor (tensor-grad x))  ; Each element: 0.25

;; Product of elements
(define prod (product-tensor x))  ; 24.0
(backward! prod)
(print-tensor (tensor-grad x))  ; [24.0, 12.0, 8.0, 6.0]

Tensor Slicing for Mini-Batch Training

(import nanograd-autograd)

;; Create dataset tensor
(define dataset (make-tensor32 training-data '(1000 784)))

;; Process in mini-batches
(define batch-size 32)

(do ((i 0 (+ i batch-size)))
    ((>= i 1000))
  ;; Extract batch
  (let* ((batch (slice-tensor dataset i batch-size))
         (output (forward model batch))
         (loss (mse-loss output targets)))
    
    ;; Backprop and optimize
    (backward! loss)
    (step! optimizer)
    (zero-grad-layer! model)))

Training a Neural Network

(import nanograd-autograd nanograd-layer nanograd-optimizer)

; Define model
(define model
  (make-sequential
   (list
    (make-dense-layer 2 8 activation: (make-relu))
    (make-dense-layer 8 1 activation: (make-identity)))
   name: "Regression"))

; Create optimizer
(define optimizer (make-adam (parameters model) learning-rate: 0.01))

; Training loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  
  (for-each
   (lambda (sample)
     (let* ((x (make-tensor32 (car sample) '(2)))
            (target (make-tensor32 (f32vector (cdr sample)) '(1)))
            (pred (forward model x))
            (loss (mse-loss pred target)))
       
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! model)))
   training-data))

Convolutional Neural Network with Batch Normalization

(import nanograd-autograd nanograd-layer nanograd-optimizer)

;; Modern CNN architecture with batch normalization
(define cnn
  (make-sequential
   (list
    ;; Convolutional block 1
    (make-conv2d-layer 3 32 3 stride: 1 padding: 1 
                       activation: (make-identity))
    (make-batch-norm-2d 32)
    ;; Convolutional block 2
    (make-conv2d-layer 32 64 3 stride: 1 padding: 1 
                       activation: (make-identity))
    (make-batch-norm-2d 64)
    ;; Global average pooling instead of flatten
    ;; (64, H, W) -> (64,)
    (make-dense-layer 64 128 activation: (make-relu))
    (make-dense-layer 128 10 activation: (make-identity)))
   name: "CNN"))

;; Training with proper mode switching
(define optimizer (make-adam (parameters cnn) learning-rate: 0.001))

(define (train-one-epoch)
  ;; Set training mode for batch norm
  (set-training-mode! cnn #t)
  
  (for-each
   (lambda (batch)
     (let* ((images (car batch))  ; Shape: (batch, 3, 32, 32)
            (labels (cdr batch))
            ;; Process each image in batch
            (predictions (map (lambda (img)
                               (forward cnn img))
                             images))
            ;; compute-loss is a user-supplied helper that reduces the
            ;; per-example losses to a single scalar tensor
            (loss (compute-loss predictions labels)))
       
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! cnn)))
   train-batches))

(define (evaluate)
  ;; Set evaluation mode for batch norm
  (set-eval-mode! cnn)
  
  (let ((correct 0)
        (total 0))
    (for-each
     (lambda (batch)
       (let* ((images (car batch))
              (labels (cdr batch)))
         (for-each
          (lambda (img label)
            (let ((pred (forward cnn img)))
              (when (= (argmax (tensor->list pred))
                      (argmax (tensor->list label)))
                (set! correct (+ correct 1)))
              (set! total (+ total 1))))
          images labels)))
     test-batches)
    
    (/ correct total)))

;; Main training loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 50))
  (train-one-epoch)
  (printf "Epoch ~A: Test Accuracy = ~A%\n" 
          epoch (* 100 (evaluate))))

ResNet-Style Architecture

;; ResNet block with batch normalization
(define (make-resnet-block in-channels out-channels stride)
  (make-sequential
   (list
    (make-conv2d-layer in-channels out-channels 3 
                       stride: stride padding: 1
                       activation: (make-identity))
    (make-batch-norm-2d out-channels)
    ;; ReLU activation
    (make-conv2d-layer out-channels out-channels 3
                       stride: 1 padding: 1
                       activation: (make-identity))
    (make-batch-norm-2d out-channels))
   name: "ResBlock"))

;; Full ResNet-18 style model
(define resnet
  (make-sequential
   (list
    ;; Initial convolution
    (make-conv2d-layer 3 64 7 stride: 2 padding: 3)
    (make-batch-norm-2d 64)
    
    ;; Residual blocks
    (make-resnet-block 64 64 1)
    (make-resnet-block 64 128 2)
    (make-resnet-block 128 256 2)
    (make-resnet-block 256 512 2)
    
    ;; Global average pooling: (512, H, W) -> (512,)
    (make-dense-layer 512 1000))
   name: "ResNet18"))

Performance Notes

Core numeric kernels are dispatched to BLAS: matrix multiplication uses GEMM/GEMV, dot products use DOT, scalar multiplication uses SCAL, and gradient accumulation uses AXPY. Convolution is implemented as im2col followed by GEMM.

Limitations

Advanced Usage

Custom Reduction Operations

;; L-infinity norm (maximum absolute value)
;; Note: assumes an element-wise absolute-value operation on tensors,
;; which is not part of the core API documented above
(define (l-inf-norm tensor)
  (reduce-tensor (abs tensor) max
    compute-gradient: (lambda (grad-out idx val all-values)
                       (let ((max-val (apply max all-values)))
                         (if (= val max-val) grad-out 0.0)))))

;; Weighted sum
(define (weighted-sum tensor weights)
  (let ((weighted (mul tensor weights)))
    (sum-tensor weighted)))

;; Geometric mean
;; Note: log-tensor and exp here denote element-wise log/exp tensor
;; operations, which are assumed rather than part of the core API
(define (geometric-mean tensor)
  (let* ((n (apply * (tensor-shape tensor)))
         (log-vals (log-tensor tensor))
         (sum (sum-tensor log-vals))
         (mean-log (scale-op sum (/ 1.0 n))))
    (exp mean-log)))

Gradient Clipping

; Clip gradients by norm (sscal!/dscal! come from the blas egg)
(define (clip-grad-norm! parameters max-norm)
  (let ((total-norm 0.0))
    ; Compute total norm
    (for-each
     (lambda (param)
       (let ((grad (tensor-grad param)))
         (when grad
           (let ((dtype (tensor-dtype param))
                 (n (vector-length-for-dtype grad dtype)))
             (case dtype
               ((f32)
                (do ((i 0 (+ i 1)))
                    ((= i n))
                  (let ((g (f32vector-ref grad i)))
                    (set! total-norm (+ total-norm (* g g))))))
               ((f64)
                (do ((i 0 (+ i 1)))
                    ((= i n))
                  (let ((g (f64vector-ref grad i)))
                    (set! total-norm (+ total-norm (* g g)))))))))))
     parameters)
    
    (let ((total-norm (sqrt total-norm)))
      (when (> total-norm max-norm)
        (let ((scale (/ max-norm total-norm)))
          ; Scale all gradients
          (for-each
           (lambda (param)
             (let ((grad (tensor-grad param)))
               (when grad
                 (let ((n (vector-length-for-dtype 
                          grad 
                          (tensor-dtype param))))
                   (case (tensor-dtype param)
                     ((f32) (sscal! n scale grad))
                     ((f64) (dscal! n scale grad)))))))
           parameters))))))

; Usage
(backward! loss)
(clip-grad-norm! (parameters model) 1.0)
(step! optimizer)

Learning Rate Scheduling

; Step decay
(define (step-decay base-lr epoch drop-every drop-rate)
  (* base-lr (expt drop-rate (floor (/ epoch drop-every)))))

; Exponential decay
(define (exp-decay base-lr epoch decay-rate)
  (* base-lr (exp (- (* decay-rate epoch)))))

; Cosine annealing
(define (cosine-annealing base-lr epoch total-epochs)
  (* 0.5 base-lr (+ 1.0 (cos (* 3.14159 (/ epoch total-epochs))))))

; Usage in training loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  (set-learning-rate! optimizer (step-decay 0.1 epoch 30 0.5))
  ; ... training code ...
  )

Troubleshooting

Common Errors

Shape mismatch errors

Ensure tensor shapes are compatible for operations:

; Matrix multiplication requires compatible dimensions
(define A (make-tensor32 (make-f32vector 6) '(2 3)))
(define B (make-tensor32 (make-f32vector 6) '(3 2)))
(define C (matmul-op A B))  ; OK: (2,3) × (3,2) = (2,2)

(define D (make-tensor32 (make-f32vector 4) '(2 2)))
(matmul-op A D)  ; Error: incompatible dimensions

Gradient computation cycles

Avoid creating cycles in the computation graph:

; Bad: creates a cycle
(define x (make-tensor32 (f32vector 1.0) '(1)))
(define y (add x x))
(set-backward-fn! x (lambda () (add-to-grad! x (tensor-grad y))) (list y))
(backward! y)  ; Error: computation graph contains cycles

Division by zero

Use safe-div when dividing by potentially zero values:

; Instead of (div a b), use:
(define result (safe-div a b epsilon: 1e-8))

Batch normalization not switching modes

Always set training/eval mode explicitly:

; Training
(set-training-mode! model #t)
(train-epoch model)

; Evaluation
(set-eval-mode! model)
(evaluate model)

Author

Ivan Raikov

Repository

https://github.com/iraikov/nanograd

Version History

1.2
Recent additions:

* Reduction operations (sum-tensor, mean-tensor, product-tensor, reduce-tensor)
* Tensor slicing (slice-tensor)
* Batch normalization (make-batch-norm-2d)
* Global average pooling (global-avg-pool2d)
* Training/evaluation mode control (set-training-mode!, set-eval-mode!)

1.1
Bug fix in mul layer operation
1.0
Initial release

* Core autograd engine
* Dense and convolutional layers
* SGD, Adam, and RMSprop optimizers
* Basic activation and loss functions

See Also

License

LGPL-3
