nanograd

A lightweight automatic differentiation and neural network framework for CHICKEN Scheme, featuring BLAS-accelerated operations and YASOS-based object abstractions.

Description

NanoGrad provides a complete framework for building and training neural networks with automatic differentiation. It features:

* reverse-mode automatic differentiation over 32- and 64-bit floating-point tensors
* BLAS-accelerated linear algebra (GEMM, GEMV, DOT, SCAL, AXPY)
* dense, convolutional, and batch normalization layers with a sequential container
* SGD, Adam, and RMSprop optimizers
* YASOS-based object abstractions for layers, activations, and optimizers

Requirements

* blas
* yasos

Modules

nanograd-autograd

Core automatic differentiation engine with tensor operations.

Tensor Constructors
[procedure] (make-tensor32 data shape #!key (requires-grad? #t)) -> tensor

Creates a 32-bit floating-point tensor with automatic differentiation support.

data
f32vector containing the tensor data
shape
list of dimensions, e.g., '(2 3) for a 2x3 matrix
requires-grad?
whether to track gradients (default #t)
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3) requires-grad?: #t))
[procedure] (make-tensor64 data shape #!key (requires-grad? #t)) -> tensor

Creates a 64-bit floating-point tensor with automatic differentiation support.

(define x (make-tensor64 (f64vector 1.0 2.0 3.0 4.0) '(2 2)))
Tensor Predicates
[procedure] (tensor? obj) -> boolean
[procedure] (tensor32? obj) -> boolean
[procedure] (tensor64? obj) -> boolean

Type predicates for tensors.
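
A quick illustration:

(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(tensor? x)    ; => #t
(tensor32? x)  ; => #t
(tensor64? x)  ; => #f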

Tensor Accessors
[procedure] (tensor-data tensor) -> vector

Returns the underlying f32vector or f64vector containing the tensor's data.

[procedure] (tensor-grad tensor) -> vector or #f

Returns the gradient vector if gradients are enabled, #f otherwise.

[procedure] (tensor-shape tensor) -> list

Returns the shape as a list of dimensions.

[procedure] (tensor-dtype tensor) -> symbol

Returns the data type: 'f32 or 'f64.

[procedure] (tensor-requires-grad? tensor) -> boolean

Returns #t if the tensor tracks gradients.
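
For example:

(define x (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(tensor-shape x)           ; => (2 2)
(tensor-dtype x)           ; => f32
(tensor-requires-grad? x)  ; => #t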

Arithmetic Operations
[procedure] (add a b) -> tensor

Element-wise addition of tensors a and b. Both tensors must have the same shape and dtype.

(define z (add x y))  ; z = x + y

Gradient: dL/da = dL/dz, dL/db = dL/dz

[procedure] (sub a b) -> tensor

Element-wise subtraction: a - b.

Gradient: dL/da = dL/dz, dL/db = -dL/dz

[procedure] (mul a b) -> tensor

Element-wise multiplication (Hadamard product).

Gradient: dL/da = dL/dz ⊙ b, dL/db = dL/dz ⊙ a
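
The product rule can be checked directly by reducing z to a scalar with sum-tensor (documented under Reduction Operations below):

(define x (make-tensor32 (f32vector 2.0 3.0) '(2)))
(define y (make-tensor32 (f32vector 5.0 7.0) '(2)))
(define z (mul x y))
(backward! (sum-tensor z))
(tensor-grad x)  ; => 5.0, 7.0 (the values of y)
(tensor-grad y)  ; => 2.0, 3.0 (the values of x)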

[procedure] (div a b) -> tensor

Element-wise division: a / b.

Gradient: dL/da = dL/dz / b, dL/db = -dL/dz · (a / b²)

[procedure] (safe-div a b #!key (epsilon 1e-8)) -> tensor

Safe element-wise division: a / (b + epsilon) to avoid division by zero.

Linear Algebra Operations
[procedure] (matmul-op a b) -> tensor

Matrix multiplication using BLAS GEMM/GEMV operations. Supports:

* matrix × matrix products via GEMM: (m, k) × (k, n) → (m, n)
* matrix × vector products via GEMV: (m, k) × (k) → (m)

(define A (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(define b (make-tensor32 (f32vector 5.0 6.0) '(2)))
(define c (matmul-op A b))  ; 2×2 matrix times 2×1 vector = 2×1 vector

Gradient: dL/dA = dL/dC · B^T, dL/dB = A^T · dL/dC

[procedure] (dot-op a b) -> tensor

Dot product (inner product) of two 1D vectors using BLAS DOT.

(define result (dot-op x y))  ; scalar result

Gradient: dL/da = (dL/dresult) · b, dL/db = (dL/dresult) · a

[procedure] (scale-op tensor scalar) -> tensor

Scalar multiplication using BLAS SCAL.

Gradient: dL/dtensor = scalar · dL/dresult
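
For example:

(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define y (scale-op x 2.0))    ; y = [2.0, 4.0, 6.0]
(backward! (sum-tensor y))
(tensor-grad x)                ; each element receives gradient 2.0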

Reduction Operations
[procedure] (reduce-tensor tensor reducer #!key (compute-gradient #f)) -> tensor

Generic reduction operation that maintains gradient flow. The reducer function is applied to each element in the forward pass. An optional compute-gradient function specifies how gradients are distributed in the backward pass.

tensor
input tensor to reduce
reducer
function (element accumulator) -> new-accumulator
compute-gradient
optional function (grad-out index value all-values) -> grad-in
If not provided, assumes uniform distribution (like sum)

Returns a scalar tensor with the reduced value.

;; Sum all elements (uniform gradient distribution)
(define total (reduce-tensor x +))

;; Product of all elements (gradient uses product rule)
(define prod (reduce-tensor x *
  compute-gradient: (lambda (grad-out idx val all-values)
                     ;; d(prod)/dx_i = prod / x_i  (fold is from srfi-1)
                     (let ((prod (fold * 1.0 all-values)))
                       (if (zero? val)
                           0.0
                           (* grad-out (/ prod val)))))))

;; Custom maximum with gradient flowing only to max element
(define max-val (reduce-tensor x max
  compute-gradient: (lambda (grad-out idx val all-values)
                     (if (= val (apply max all-values))
                         grad-out
                         0.0))))
[procedure] (sum-tensor tensor) -> tensor

Sums all elements in the tensor. Gradient is distributed uniformly to all elements.

(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define total (sum-tensor x))  ; Returns scalar tensor with value 6.0

(backward! total)
(tensor-grad x)  ; Each element receives gradient of 1.0
[procedure] (product-tensor tensor) -> tensor

Computes the product of all elements. Gradient uses the product rule: d(prod)/dx_i = prod / x_i.

(define x (make-tensor32 (f32vector 2.0 3.0 4.0) '(3)))
(define prod (product-tensor x))  ; Returns 24.0

(backward! prod)
(tensor-grad x)  ; Gradients: [12.0, 8.0, 6.0]
[procedure] (mean-tensor tensor) -> tensor

Computes the mean (average) of all elements. Equivalent to (sum-tensor tensor) / n.

(define x (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(4)))
(define avg (mean-tensor x))  ; Returns 2.5

(backward! avg)
(tensor-grad x)  ; Each element receives gradient of 0.25
Tensor Manipulation Operations
[procedure] (slice-tensor tensor start length) -> tensor

Extracts a slice of a tensor along the first dimension. Gradients flow back correctly to the original tensor positions.

tensor
input tensor with shape (n, ...)
start
starting index (0-based)
length
number of elements to extract
Returns
tensor with shape (length, ...)
;; Slice a batch of data
(define batch-data (make-tensor32 (make-f32vector 100) '(10 10)))
(define mini-batch (slice-tensor batch-data 2 5))  ; Shape: (5, 10)

;; Gradients flow back to original positions
(backward! (sum-tensor mini-batch))
(tensor-grad batch-data)  ; Only indices 2-6 have non-zero gradients

Example: Mini-batch training

(define dataset (make-tensor32 training-data '(1000 784)))
(define batch-size 32)

(do ((i 0 (+ i batch-size)))
    ((>= i 1000))
  (let* ((batch (slice-tensor dataset i batch-size))
         (output (forward model batch))
         (loss (mse-loss output targets)))
    (backward! loss)
    (step! optimizer)
    (zero-grad-layer! model)))
[procedure] (reshape tensor new-shape) -> tensor

Reshapes the tensor. Total number of elements must be preserved. Creates a new tensor with separate gradient buffer but shared underlying data.

(define x (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(define x-flat (reshape x '(4)))  ; Flatten to 1D
(define x-back (reshape x-flat '(2 2)))  ; Reshape back
[procedure] (flatten-tensor tensor) -> tensor

Flattens a multi-dimensional tensor to 1D. Equivalent to (reshape tensor (list total-size)).
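
For example:

(define img (make-tensor32 (make-f32vector 12) '(3 2 2)))
(define v (flatten-tensor img))  ; Shape: (12)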

Activation Functions
[procedure] (relu tensor) -> tensor

Rectified Linear Unit: max(0, x).

Gradient: 1 if x > 0, else 0

[procedure] (tanh-op tensor) -> tensor

Hyperbolic tangent activation.

Gradient: 1 - tanh²(x)

[procedure] (sigmoid tensor) -> tensor

Sigmoid (logistic) activation: σ(x) = 1 / (1 + e^(-x)).

Gradient: σ(x) · (1 - σ(x))

[procedure] (sigmoid-stable tensor) -> tensor

Numerically stable sigmoid implementation for large negative values.

[procedure] (softmax x #!key (dim #f)) -> tensor

Softmax normalization with numerical stability (subtracts max before exp).

(define probs (softmax logits))  ; Converts logits to probabilities
[procedure] (log-softmax x #!key (dim #f)) -> tensor

Log-softmax: more numerically stable than log(softmax(x)).

[procedure] (leaky-relu tensor #!key (alpha 0.01)) -> tensor

Leaky ReLU: max(alpha * x, x).

[procedure] (softplus tensor #!key (beta 1.0)) -> tensor

Softplus activation: log(1 + e^(beta * x)) / beta.

[procedure] (gelu tensor) -> tensor

Gaussian Error Linear Unit activation using tanh approximation.

[procedure] (silu tensor) -> tensor

SiLU (Sigmoid Linear Unit) activation, also known as Swish: x * σ(x).
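
A side-by-side sketch of several activations applied to the same input (values approximate):

(define a (make-tensor32 (f32vector -2.0 0.0 2.0) '(3)))
(tensor->list (relu a))     ; => (0.0 0.0 2.0)
(tensor->list (sigmoid a))  ; => (0.119 0.5 0.881), approximately
(tensor->list (silu a))     ; => (-0.238 0.0 1.762), approximately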

Loss Functions
[procedure] (mse-loss pred target) -> tensor

Mean Squared Error loss: L = (1/n) ∑(pred - target)².

(define loss (mse-loss predictions targets))
[procedure] (cross-entropy-loss pred target) -> tensor

Cross-entropy loss: L = -∑(target · log(pred)).

Note: Assumes pred is already normalized (e.g., via softmax).
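
In practice, softmax and cross-entropy are used together; a minimal sketch (model, x, and target assumed defined):

(define probs (softmax (forward model x)))       ; normalize logits first
(define loss (cross-entropy-loss probs target))  ; target: one-hot tensor
(backward! loss)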

Gradient Operations
[procedure] (zero-grad! tensor) -> void

Sets all gradient values to zero.

[procedure] (backward! tensor) -> void

Computes gradients via reverse-mode automatic differentiation. Performs topological sort and executes backward functions in correct order. Detects cycles and raises an error if found.

(define x (make-tensor32 (f32vector 1.0 2.0) '(2)))
(define y (make-tensor32 (f32vector 3.0 4.0) '(2)))
(define z (add x y))
(define loss (dot-op z z))

(backward! loss)
(print-tensor (tensor-grad x))
[procedure] (add-to-grad! tensor delta) -> void

Accumulates delta into the tensor's gradient using BLAS AXPY.

Convolution Operations
[procedure] (conv2d input weight bias #!key (stride 1) (padding 0)) -> tensor

2D convolution using im2col + GEMM algorithm.

input
tensor of shape (C_in, H, W)
weight
tensor of shape (C_out, C_in, KH, KW)
bias
tensor of shape (C_out) or #f
stride
stride for convolution (default 1)
padding
zero-padding (default 0)
(define output (conv2d input weights bias stride: 2 padding: 1))
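
The output spatial size should follow the usual convolution arithmetic, H_out = floor((H + 2·padding - KH) / stride) + 1 (and likewise for W); a sketch under that assumption:

;; (3, 32, 32) input, 16 filters of size 3x3, stride 2, padding 1:
;; floor((32 + 2*1 - 3) / 2) + 1 = 16
(define input  (make-tensor32 (make-f32vector (* 3 32 32)) '(3 32 32)))
(define weight (make-tensor32 (make-f32vector (* 16 3 3 3)) '(16 3 3 3)))
(define bias   (make-tensor32 (make-f32vector 16) '(16)))
(define out (conv2d input weight bias stride: 2 padding: 1))  ; Shape: (16, 16, 16)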
Normalization Operations
[procedure] (rmsnorm x weight #!key (epsilon 1e-5)) -> tensor

Root Mean Square Layer Normalization.
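
RMSNorm computes y = x / √(mean(x²) + ε) · weight. A minimal usage sketch, assuming weight is a learnable scale tensor with the same shape as x:

(define gamma (make-tensor32 (f32vector 1.0 1.0 1.0) '(3)))
(define normed (rmsnorm x gamma epsilon: 1e-5))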

[procedure] (l2-normalize tensor #!key (epsilon 1e-8)) -> tensor

L2 normalization: x / ||x||₂.

[procedure] (cosine-similarity a b) -> tensor

Cosine similarity: (a · b) / (||a|| · ||b||).
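
Both are common for comparing embedding vectors; for example, with a and b being 1D tensors:

(define a-hat (l2-normalize a))       ; unit-length version of a
(define sim (cosine-similarity a b))  ; scalar tensor in [-1, 1]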

Utility Functions
[procedure] (tensor->list tensor) -> list

Converts tensor data to a list.

[procedure] (print-tensor tensor) -> void

Pretty-prints tensor information including shape, dtype, data, and gradients.

[procedure] (vector-length-for-dtype vec dtype) -> integer

Returns the length of a vector based on its dtype.

nanograd-layer

Neural network layer abstractions and containers.

Layer Predicates
[procedure] (layer? obj) -> boolean
[procedure] (dense-layer? obj) -> boolean
[procedure] (conv2d-layer? obj) -> boolean
[procedure] (batch-norm-2d? obj) -> boolean
[procedure] (sequential? obj) -> boolean
Dense Layer
[procedure] (make-dense-layer input-size output-size #!key (activation (make-identity)) (dtype 'f32) (name "Dense")) -> layer

Creates a fully-connected (dense) layer with Xavier/Glorot initialization.

input-size
number of input features
output-size
number of output features
activation
activation function object (default identity)
dtype
'f32 or 'f64 (default 'f32)
name
layer name for debugging
(define layer (make-dense-layer 784 128 
                                activation: (make-relu)
                                name: "Hidden1"))
Convolutional Layer
[procedure] (make-conv2d-layer in-channels out-channels kernel-size #!key (stride 1) (padding 0) (activation (make-identity)) (dtype 'f32) (name "Conv2D")) -> layer

Creates a 2D convolutional layer with He initialization.

(define conv (make-conv2d-layer 3 32 3 
                                stride: 1 
                                padding: 1
                                activation: (make-relu)))
Batch Normalization Layer
[procedure] (make-batch-norm-2d num-features #!key (epsilon 1e-5) (momentum 0.1) (dtype 'f32) (name "BatchNorm2d")) -> layer

Creates a 2D batch normalization layer. Normalizes activations across the batch dimension:

y = γ * (x - μ) / √(σ² + ε) + β

where μ and σ² are computed from the batch (training mode) or from running statistics (evaluation mode).

num-features
number of channels (C)
epsilon
small constant for numerical stability (default 1e-5)
momentum
momentum for updating running statistics (default 0.1)
dtype
'f32 or 'f64 (default 'f32)
name
layer name
;; Create batch norm for 64 channels
(define bn (make-batch-norm-2d 64 epsilon: 1e-5 momentum: 0.1))

;; Training mode: uses batch statistics
(set-training-mode! bn #t)
(define normalized (forward bn input))  ; Input shape: (64, H, W)

;; Evaluation mode: uses running statistics
(set-eval-mode! bn)
(define normalized (forward bn input))  ; Deterministic output

Batch normalization improves training stability and convergence by:

* reducing sensitivity to weight initialization
* allowing higher learning rates
* acting as a mild regularizer

Key features:

* learnable per-channel scale (γ) and shift (β) parameters
* running mean/variance statistics updated with the given momentum during training
* separate training and evaluation behavior (see set-training-mode! and set-eval-mode!)

Example: ResNet-style block with batch normalization

(define (make-resnet-block in-channels out-channels)
  (make-sequential
   (list
    (make-conv2d-layer in-channels out-channels 3 
                       padding: 1 activation: (make-identity))
    (make-batch-norm-2d out-channels)
    ;; Apply ReLU activation here
    (make-conv2d-layer out-channels out-channels 3
                       padding: 1 activation: (make-identity))
    (make-batch-norm-2d out-channels))
   name: "ResNetBlock"))
Global Average Pooling
[procedure] (global-avg-pool2d input) -> tensor

Global average pooling over spatial dimensions. Reduces spatial dimensions to 1x1 by averaging.

Input shape
(C, H, W)
Output shape
(C,)

Gradient: Distributed uniformly over all spatial positions for each channel.

;; Input: 128 channels, 8x8 spatial dimensions
(define feature-maps (make-tensor32 (make-f32vector (* 128 8 8)) '(128 8 8)))

;; Output: 128-dimensional feature vector
(define pooled (global-avg-pool2d feature-maps))  ; Shape: (128,)

;; Use in classification network
(define logits (forward fc-layer pooled))

Global average pooling is commonly used to replace large fully-connected layers at the end of CNNs:

Example: Replacing FC layers with global pooling

;; Traditional approach: flatten + dense (many parameters)
(define old-cnn
  (make-sequential
   (list
    (make-conv2d-layer 64 128 3)
    ;; Must flatten: (128, 8, 8) -> (8192,)
    (make-dense-layer 8192 10))))  ; 81,920 weights!

;; Modern approach: global pooling + dense (fewer parameters)
(define new-cnn
  (make-sequential
   (list
    (make-conv2d-layer 64 128 3)
    ;; Global pooling: (128, 8, 8) -> (128,)
    (make-dense-layer 128 10))))  ; only 1,280 weights!
Sequential Container
[procedure] (make-sequential layers #!key (name "Sequential")) -> layer

Creates a sequential container that chains multiple layers.

(define model
  (make-sequential
   (list
    (make-dense-layer 784 128 activation: (make-relu))
    (make-dense-layer 128 64 activation: (make-relu))
    (make-dense-layer 64 10 activation: (make-identity)))
   name: "MLP"))
Layer Operations
[procedure] (forward layer input) -> tensor

Performs a forward pass through the layer.

[procedure] (parameters layer) -> list

Returns a list of all trainable parameter tensors.

[procedure] (zero-grad-layer! layer) -> void

Zeros gradients for all parameters in the layer.

[procedure] (set-training-mode! layer training?) -> void

Sets the training mode for the layer. When training? is #t, the layer uses training-specific behavior (e.g., batch statistics for batch norm). When #f, uses evaluation behavior.

;; Set model to training mode
(set-training-mode! model #t)

;; Set model to evaluation mode
(set-training-mode! model #f)
[procedure] (set-eval-mode! layer) -> void

Shorthand for (set-training-mode! layer #f). Sets the layer to evaluation mode.

;; Evaluation mode (shorthand)
(set-eval-mode! model)

Training vs Evaluation Mode:

Training Mode ((set-training-mode! layer #t)):

* batch normalization computes μ and σ² from the current batch and updates its running statistics with the configured momentum

Evaluation Mode ((set-eval-mode! layer)):

* batch normalization uses the stored running statistics, giving deterministic output

;; Complete training/evaluation workflow
(define (train-epoch model optimizer train-data)
  ;; Enable training mode
  (set-training-mode! model #t)
  
  (for-each
   (lambda (batch)
     (let* ((x (car batch))
            (y (cdr batch))
            (pred (forward model x))
            (loss (cross-entropy-loss pred y)))
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! model)))
   train-data))

(define (evaluate-epoch model test-data)
  ;; Enable evaluation mode
  (set-eval-mode! model)
  
  (let ((total-correct 0))
    (for-each
     (lambda (batch)
       (let* ((x (car batch))
              (y (cdr batch))
              (pred (forward model x)))
         ;; Count correct predictions (argmax is a user-supplied helper)
         (when (= (argmax pred) (argmax y))
           (set! total-correct (+ total-correct 1)))))
     test-data)
    ;; Return accuracy as a fraction of the test set
    (/ total-correct (length test-data))))

;; Main loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  (train-epoch model optimizer train-data)
  (let ((accuracy (evaluate-epoch model test-data)))
    (printf "Epoch ~A: Test Accuracy = ~A%\n" 
            epoch (* 100 accuracy))))
[procedure] (layer-input-size layer) -> integer
[procedure] (layer-output-size layer) -> integer
[procedure] (layer-activation layer) -> activation
[procedure] (layer-name layer) -> string

Accessor functions for layer properties.
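
For the dense layer defined under Dense Layer above:

(layer-input-size layer)   ; => 784
(layer-output-size layer)  ; => 128
(layer-name layer)         ; => "Hidden1"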

Activation Function Objects
[procedure] (make-relu) -> activation
[procedure] (make-tanh) -> activation
[procedure] (make-sigmoid) -> activation
[procedure] (make-gelu) -> activation
[procedure] (make-silu) -> activation
[procedure] (make-identity) -> activation

Creates activation function objects for use in layers.

[procedure] (activation? obj) -> boolean
[procedure] (activation-forward act x) -> tensor
[procedure] (activation-name act) -> string
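
These operate on activation objects directly; activation-forward applies the wrapped function to a tensor. A short sketch (the exact name strings are implementation-defined):

(define act (make-relu))
(activation? act)           ; => #t
(activation-forward act x)  ; equivalent to (relu x)
(activation-name act)       ; => the activation's display name, e.g. "ReLU"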
Utility Functions
[procedure] (print-layer layer #!optional (indent 0)) -> void

Prints layer information with optional indentation.

[procedure] (summary model) -> void

Prints a model summary including all layers and parameter counts.

(summary model)
; === Model Summary ===
; Model: MLP
; Input size: 784
; Output size: 10
; 
; Total parameters: 109386

nanograd-optimizer

Optimization algorithms for neural network training.

Optimizer Predicates
[procedure] (optimizer? obj) -> boolean
[procedure] (sgd? obj) -> boolean
[procedure] (adam? obj) -> boolean
[procedure] (rmsprop? obj) -> boolean
SGD Optimizer
[procedure] (make-sgd parameters #!key (learning-rate 0.01) (momentum 0.0) (weight-decay 0.0) (nesterov #f)) -> optimizer

Stochastic Gradient Descent optimizer with optional momentum and Nesterov acceleration.

parameters
list of parameter tensors to optimize
learning-rate
step size (default 0.01)
momentum
momentum factor (default 0.0, no momentum)
weight-decay
L2 regularization factor (default 0.0)
nesterov
use Nesterov momentum (default #f)
(define opt (make-sgd (parameters model) 
                      learning-rate: 0.01
                      momentum: 0.9))
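
For reference, the conventional momentum update (which this optimizer is described as providing), with gradient g, momentum μ, and weight decay λ:

v ← μ·v + g + λ·θ
θ ← θ - learning-rate · v

Nesterov momentum instead evaluates the gradient at a look-ahead point before applying the velocity.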
Adam Optimizer
[procedure] (make-adam parameters #!key (learning-rate 0.001) (beta1 0.9) (beta2 0.999) (epsilon 1e-8) (weight-decay 0.0)) -> optimizer

Adam (Adaptive Moment Estimation) optimizer with bias correction.

beta1
exponential decay rate for first moment (default 0.9)
beta2
exponential decay rate for second moment (default 0.999)
epsilon
numerical stability constant (default 1e-8)
(define opt (make-adam (parameters model) learning-rate: 0.001))
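
For reference, the standard Adam update with bias correction at step t:

m ← β₁·m + (1-β₁)·g
v ← β₂·v + (1-β₂)·g²
m̂ = m / (1-β₁ᵗ),  v̂ = v / (1-β₂ᵗ)
θ ← θ - learning-rate · m̂ / (√v̂ + ε)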
RMSprop Optimizer
[procedure] (make-rmsprop parameters #!key (learning-rate 0.01) (alpha 0.99) (epsilon 1e-8) (weight-decay 0.0) (momentum 0.0)) -> optimizer

RMSprop optimizer with optional momentum.

alpha
smoothing constant (default 0.99)
(define opt (make-rmsprop (parameters model) 
                          learning-rate: 0.01
                          alpha: 0.99))
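
For reference, the standard RMSprop update with smoothing constant α:

s ← α·s + (1-α)·g²
θ ← θ - learning-rate · g / (√s + ε)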
Optimizer Operations
[procedure] (step! optimizer) -> void

Applies parameter updates based on accumulated gradients.

[procedure] (get-learning-rate optimizer) -> number

Returns the current learning rate.

[procedure] (set-learning-rate! optimizer lr) -> void

Updates the learning rate (useful for learning rate scheduling).

; Learning rate decay
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  (set-learning-rate! opt (/ 0.1 (+ 1.0 (* 0.01 epoch))))
  ; ... training code ...
  )
[procedure] (optimizer-state optimizer) -> alist

Returns an association list of optimizer configuration parameters.
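
A sketch of what to expect (the exact keys are implementation-defined):

(optimizer-state opt)
; => e.g. ((learning-rate . 0.001) (beta1 . 0.9) (beta2 . 0.999) (epsilon . 1e-8) ...)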

Examples

Basic Tensor Operations

(import nanograd-autograd)

; Create tensors
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define y (make-tensor32 (f32vector 4.0 5.0 6.0) '(3)))

; Operations
(define z (add x y))
(define w (mul x y))

; Compute gradients (reduce to a scalar first)
(define loss (sum-tensor w))
(backward! loss)
(print-tensor (tensor-grad x))

Reduction Operations

(import nanograd-autograd)

;; Sum all elements
(define x (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(4)))
(define total (sum-tensor x))  ; 10.0
(backward! total)
(print-tensor (tensor-grad x))  ; Each element: 1.0

;; Mean of elements
(define avg (mean-tensor x))  ; 2.5
(backward! avg)
(print-tensor (tensor-grad x))  ; Each element: 0.25

;; Product of elements
(define prod (product-tensor x))  ; 24.0
(backward! prod)
(print-tensor (tensor-grad x))  ; [24.0, 12.0, 8.0, 6.0]

Tensor Slicing for Mini-Batch Training

(import nanograd-autograd)

;; Create dataset tensor
(define dataset (make-tensor32 training-data '(1000 784)))

;; Process in mini-batches
(define batch-size 32)

(do ((i 0 (+ i batch-size)))
    ((>= i 1000))
  ;; Extract batch
  (let* ((batch (slice-tensor dataset i batch-size))
         (output (forward model batch))
         (loss (mse-loss output targets)))
    
    ;; Backprop and optimize
    (backward! loss)
    (step! optimizer)
    (zero-grad-layer! model)))

Training a Neural Network

(import nanograd-autograd nanograd-layer nanograd-optimizer)

; Define model
(define model
  (make-sequential
   (list
    (make-dense-layer 2 8 activation: (make-relu))
    (make-dense-layer 8 1 activation: (make-identity)))
   name: "Regression"))

; Create optimizer
(define optimizer (make-adam (parameters model) learning-rate: 0.01))

; Training loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  
  (for-each
   (lambda (sample)
     (let* ((x (make-tensor32 (car sample) '(2)))
            (target (make-tensor32 (f32vector (cdr sample)) '(1)))
            (pred (forward model x))
            (loss (mse-loss pred target)))
       
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! model)))
   training-data))

Convolutional Neural Network with Batch Normalization

(import nanograd-autograd nanograd-layer nanograd-optimizer)

;; Modern CNN architecture with batch normalization
(define cnn
  (make-sequential
   (list
    ;; Convolutional block 1
    (make-conv2d-layer 3 32 3 stride: 1 padding: 1 
                       activation: (make-identity))
    (make-batch-norm-2d 32)
    ;; Convolutional block 2
    (make-conv2d-layer 32 64 3 stride: 1 padding: 1 
                       activation: (make-identity))
    (make-batch-norm-2d 64)
    ;; Global average pooling instead of flatten
    ;; (64, H, W) -> (64,)
    (make-dense-layer 64 128 activation: (make-relu))
    (make-dense-layer 128 10 activation: (make-identity)))
   name: "CNN"))

;; Training with proper mode switching
(define optimizer (make-adam (parameters cnn) learning-rate: 0.001))

(define (train-one-epoch)
  ;; Set training mode for batch norm
  (set-training-mode! cnn #t)
  
  (for-each
   (lambda (batch)
     (let* ((images (car batch))  ; Shape: (batch, 3, 32, 32)
            (labels (cdr batch))
            ;; Process each image in batch
            (predictions (map (lambda (img)
                               (forward cnn img))
                             images))
            ;; compute-loss is a user-supplied helper that reduces the
            ;; per-example losses to a single scalar tensor
            (loss (compute-loss predictions labels)))
       
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! cnn)))
   train-batches))

(define (evaluate)
  ;; Set evaluation mode for batch norm
  (set-eval-mode! cnn)
  
  (let ((correct 0)
        (total 0))
    (for-each
     (lambda (batch)
       (let* ((images (car batch))
              (labels (cdr batch)))
         (for-each
          (lambda (img label)
            (let ((pred (forward cnn img)))
              (when (= (argmax (tensor->list pred))
                      (argmax (tensor->list label)))
                (set! correct (+ correct 1)))
              (set! total (+ total 1))))
          images labels)))
     test-batches)
    
    (/ correct total)))

;; Main training loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 50))
  (train-one-epoch)
  (printf "Epoch ~A: Test Accuracy = ~A%\n" 
          epoch (* 100 (evaluate))))

ResNet-Style Architecture

;; ResNet block with batch normalization
(define (make-resnet-block in-channels out-channels stride)
  (make-sequential
   (list
    (make-conv2d-layer in-channels out-channels 3 
                       stride: stride padding: 1
                       activation: (make-identity))
    (make-batch-norm-2d out-channels)
    ;; ReLU activation
    (make-conv2d-layer out-channels out-channels 3
                       stride: 1 padding: 1
                       activation: (make-identity))
    (make-batch-norm-2d out-channels))
   name: "ResBlock"))

;; Full ResNet-18 style model
(define resnet
  (make-sequential
   (list
    ;; Initial convolution
    (make-conv2d-layer 3 64 7 stride: 2 padding: 3)
    (make-batch-norm-2d 64)
    
    ;; Residual blocks
    (make-resnet-block 64 64 1)
    (make-resnet-block 64 128 2)
    (make-resnet-block 128 256 2)
    (make-resnet-block 256 512 2)
    
    ;; Global average pooling: (512, H, W) -> (512,)
    (make-dense-layer 512 1000))
   name: "ResNet18"))

Performance Notes

Core numeric kernels are dispatched to BLAS: matrix multiplication uses GEMM/GEMV, dot products use DOT, scalar multiplication uses SCAL, and gradient accumulation uses AXPY. Convolution is implemented as im2col followed by GEMM.

Limitations

Advanced Usage

Custom Reduction Operations

;; L-infinity norm (maximum absolute value)
;; Note: assumes an element-wise absolute-value operation on tensors,
;; which is not part of the core API documented above
(define (l-inf-norm tensor)
  (reduce-tensor (abs tensor) max
    compute-gradient: (lambda (grad-out idx val all-values)
                       (let ((max-val (apply max all-values)))
                         (if (= val max-val) grad-out 0.0)))))

;; Weighted sum
(define (weighted-sum tensor weights)
  (let ((weighted (mul tensor weights)))
    (sum-tensor weighted)))

;; Geometric mean
;; Note: log-tensor and exp here denote element-wise log/exp tensor
;; operations, which are assumed rather than part of the core API
(define (geometric-mean tensor)
  (let* ((n (apply * (tensor-shape tensor)))
         (log-vals (log-tensor tensor))
         (sum (sum-tensor log-vals))
         (mean-log (scale-op sum (/ 1.0 n))))
    (exp mean-log)))

Gradient Clipping

; Clip gradients by norm (sscal!/dscal! come from the blas egg)
(define (clip-grad-norm! parameters max-norm)
  (let ((total-norm 0.0))
    ; Compute total norm
    (for-each
     (lambda (param)
       (let ((grad (tensor-grad param)))
         (when grad
           (let ((dtype (tensor-dtype param))
                 (n (vector-length-for-dtype grad dtype)))
             (case dtype
               ((f32)
                (do ((i 0 (+ i 1)))
                    ((= i n))
                  (let ((g (f32vector-ref grad i)))
                    (set! total-norm (+ total-norm (* g g))))))
               ((f64)
                (do ((i 0 (+ i 1)))
                    ((= i n))
                  (let ((g (f64vector-ref grad i)))
                    (set! total-norm (+ total-norm (* g g)))))))))))
     parameters)
    
    (let ((total-norm (sqrt total-norm)))
      (when (> total-norm max-norm)
        (let ((scale (/ max-norm total-norm)))
          ; Scale all gradients
          (for-each
           (lambda (param)
             (let ((grad (tensor-grad param)))
               (when grad
                 (let ((n (vector-length-for-dtype 
                          grad 
                          (tensor-dtype param))))
                   (case (tensor-dtype param)
                     ((f32) (sscal! n scale grad))
                     ((f64) (dscal! n scale grad)))))))
           parameters))))))

; Usage
(backward! loss)
(clip-grad-norm! (parameters model) 1.0)
(step! optimizer)

Learning Rate Scheduling

; Step decay
(define (step-decay base-lr epoch drop-every drop-rate)
  (* base-lr (expt drop-rate (floor (/ epoch drop-every)))))

; Exponential decay
(define (exp-decay base-lr epoch decay-rate)
  (* base-lr (exp (- (* decay-rate epoch)))))

; Cosine annealing
(define (cosine-annealing base-lr epoch total-epochs)
  (* 0.5 base-lr (+ 1.0 (cos (* 3.14159 (/ epoch total-epochs))))))

; Usage in training loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  (set-learning-rate! optimizer (step-decay 0.1 epoch 30 0.5))
  ; ... training code ...
  )

Troubleshooting

Common Errors

Shape mismatch errors

Ensure tensor shapes are compatible for operations:

; Matrix multiplication requires compatible dimensions
(define A (make-tensor32 (make-f32vector 6) '(2 3)))
(define B (make-tensor32 (make-f32vector 6) '(3 2)))
(define C (matmul-op A B))  ; OK: (2,3) × (3,2) = (2,2)

(define D (make-tensor32 (make-f32vector 4) '(2 2)))
(matmul-op A D)  ; Error: incompatible dimensions

Gradient computation cycles

Avoid creating cycles in the computation graph:

; Bad: creates a cycle
(define x (make-tensor32 (f32vector 1.0) '(1)))
(define y (add x x))
(set-backward-fn! x (lambda () (add-to-grad! x (tensor-grad y))) (list y))
(backward! y)  ; Error: computation graph contains cycles

Division by zero

Use safe-div when dividing by potentially zero values:

; Instead of (div a b), use:
(define result (safe-div a b epsilon: 1e-8))

Batch normalization not switching modes

Always set training/eval mode explicitly:

; Training
(set-training-mode! model #t)
(train-epoch model)

; Evaluation
(set-eval-mode! model)
(evaluate model)

Author

Ivan Raikov

Repository

https://github.com/iraikov/nanograd

Version History

1.2
Recent additions:

* Reduction operations (sum-tensor, mean-tensor, product-tensor, reduce-tensor)
* Tensor slicing (slice-tensor)
* Batch normalization (make-batch-norm-2d)
* Global average pooling (global-avg-pool2d)
* Training/evaluation mode control (set-training-mode!, set-eval-mode!)

1.1
Bug fix in mul layer operation
1.0
Initial release

* Core autograd engine
* Dense and convolutional layers
* SGD, Adam, and RMSprop optimizers
* Basic activation and loss functions

See Also

License

LGPL-3
