1. nanograd
    1. Description
    2. Requirements
    3. Modules
      1. nanograd-autograd
        1. Tensor Constructors
        2. Tensor Predicates
        3. Tensor Accessors
        4. Arithmetic Operations
        5. Linear Algebra Operations
        6. Reduction Operations
        7. Tensor Manipulation Operations
        8. Activation Functions
        9. Loss Functions
        10. Normalization Operations
        11. Convolution Operations
        12. Gradient Operations
        13. Utility Functions
      2. nanograd-layer
        1. Layer Predicates
        2. Dense Layer
        3. Convolutional Layer
        4. Batch Normalization Layer
        5. Flatten Layer
        6. Global Average Pooling
        7. Sequential Container
        8. Layer Operations
        9. Activation Function Objects
        10. Utility Functions
      3. nanograd-optimizer
        1. Optimizer Predicates
        2. SGD Optimizer
        3. Adam Optimizer
        4. RMSprop Optimizer
        5. Optimizer Operations
    4. Examples
      1. Batch Processing with Dense Layers
      2. Batched Softmax and Cross-Entropy
      3. Training with Batches
      4. Convolutional Network with Batch Normalization
      5. ResNet-Style Architecture
    5. Performance Notes
    6. Batch Processing Best Practices
    7. Limitations
    8. Troubleshooting
      1. Common Errors
    9. Author
    10. Repository
    11. Version History
    12. See Also
    13. License
    14. References

nanograd

A lightweight automatic differentiation and neural network framework for CHICKEN Scheme, featuring BLAS-accelerated operations, comprehensive batch processing support, and YASOS-based object abstractions.

Description

NanoGrad provides a complete framework for building and training neural networks with automatic differentiation. It features:

- Reverse-mode automatic differentiation over 32-bit and 64-bit floating-point tensors
- BLAS-accelerated linear algebra (GEMM/GEMV, DOT, SCAL, AXPY)
- Batch processing support throughout tensors, layers, and loss functions
- YASOS-based layer abstractions: dense, convolutional, batch normalization, flatten, and sequential containers
- SGD, Adam, and RMSprop optimizers

Requirements

Modules

nanograd-autograd

Core automatic differentiation engine with tensor operations and batch support.

Tensor Constructors
[procedure] (make-tensor32 data shape #!key (requires-grad? #t)) -> tensor

Creates a 32-bit floating-point tensor with automatic differentiation support.

data
f32vector containing the tensor data
shape
list of dimensions, e.g., '(2 3) for a 2x3 matrix or '(10 2 3) for batch of 10 matrices
requires-grad?
whether to track gradients (default #t)
; Single vector
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3) requires-grad?: #t))

; Batch of vectors
(define batch (make-tensor32 (make-f32vector 60) '(10 6) requires-grad?: #t))
[procedure] (make-tensor64 data shape #!key (requires-grad? #t)) -> tensor

Creates a 64-bit floating-point tensor with automatic differentiation support.
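
For example (a minimal sketch; the data is assumed to be supplied as an f64vector to match the dtype):

(define xd (make-tensor64 (f64vector 1.0 2.0 3.0 4.0) '(2 2)))
(tensor-dtype xd)  ; => f64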

Tensor Predicates
[procedure] (tensor? obj) -> boolean
[procedure] (tensor32? obj) -> boolean
[procedure] (tensor64? obj) -> boolean

Type predicates for tensors.

Tensor Accessors
[procedure] (tensor-data tensor) -> vector

Returns the underlying f32vector or f64vector containing the tensor's data.

[procedure] (tensor-grad tensor) -> vector or #f

Returns the gradient vector if gradients are enabled, #f otherwise.

[procedure] (tensor-shape tensor) -> list

Returns the shape as a list of dimensions.

[procedure] (tensor-dtype tensor) -> symbol

Returns the data type: 'f32 or 'f64.

[procedure] (tensor-requires-grad? tensor) -> boolean

Returns #t if the tensor tracks gradients.
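
For example:

(define t (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(tensor-shape t)            ; => (2 2)
(tensor-dtype t)            ; => f32
(tensor-requires-grad? t)   ; => #t
(tensor-grad t)             ; gradient vector (or #f if gradients are disabled)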

Arithmetic Operations
[procedure] (add a b) -> tensor

Element-wise addition of tensors a and b. Both tensors must have the same shape and dtype.

(define z (add x y))  ; z = x + y

Gradient: dL/da = dL/dz, dL/db = dL/dz

[procedure] (sub a b) -> tensor

Element-wise subtraction: a - b.

Gradient: dL/da = dL/dz, dL/db = -dL/dz

[procedure] (mul a b) -> tensor

Element-wise multiplication (Hadamard product).

Gradient: dL/da = dL/dz ⊙ b, dL/db = dL/dz ⊙ a

[procedure] (div a b) -> tensor

Element-wise division: a / b.

Gradient: dL/da = dL/dz / b, dL/db = -dL/dz · (a / b²)

[procedure] (safe-div a b #!key (epsilon 1e-8)) -> tensor

Safe element-wise division: a / (b + epsilon) to avoid division by zero.
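
A minimal sketch combining element-wise operations with the backward pass (sum-tensor is described under Reduction Operations; the gradient values follow from the formulas above):

(define a (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define b (make-tensor32 (f32vector 4.0 5.0 6.0) '(3)))
(define z (sum-tensor (mul a b)))   ; scalar: 1*4 + 2*5 + 3*6 = 32
(backward! z)
(tensor-grad a)   ; dL/da = b = (4 5 6)
(tensor-grad b)   ; dL/db = a = (1 2 3)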

Linear Algebra Operations
[procedure] (matmul-op a b) -> tensor

Matrix multiplication using BLAS GEMM/GEMV operations with batch support. Supports:

- Matrix × vector: (m, k) × (k,) → (m,)
- Matrix × matrix: (m, k) × (k, n) → (m, n)
- Batched input × shared weight: (N, m, k) × (k, n) → (N, m, n)

; Standard matrix-vector multiplication
(define A (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(define b (make-tensor32 (f32vector 5.0 6.0) '(2)))
(define c (matmul-op A b))  ; 2×2 matrix times 2×1 vector = 2×1 vector

; Batch matrix multiplication
(define batch-A (make-tensor32 (make-f32vector 80) '(10 2 4)))  ; 10 samples
(define W (make-tensor32 (make-f32vector 12) '(4 3)))
(define batch-result (matmul-op batch-A W))  ; Shape: (10, 2, 3)

Gradient: dL/dA = dL/dC · B^T, dL/dB = A^T · dL/dC

[procedure] (dot-op a b) -> tensor

Dot product (inner product) of two 1D vectors using BLAS DOT.

(define result (dot-op x y))  ; scalar result

Gradient: dL/da = (dL/dresult) · b, dL/db = (dL/dresult) · a

[procedure] (scale-op tensor scalar) -> tensor

Scalar multiplication using BLAS SCAL.

Gradient: dL/dtensor = scalar · dL/dresult

Reduction Operations
[procedure] (reduce-tensor tensor reducer #!key (compute-gradient #f)) -> tensor

Generic reduction operation that maintains gradient flow. The reducer function is applied to each element in the forward pass. An optional compute-gradient function specifies how gradients are distributed in the backward pass.

tensor
input tensor to reduce
reducer
function (element accumulator) -> new-accumulator
compute-gradient
optional function (grad-out index value all-values) -> grad-in
If not provided, assumes uniform distribution (like sum)

Returns a scalar tensor with the reduced value.

;; Sum all elements (uniform gradient distribution)
(define total (reduce-tensor x +))

;; Product of all elements (gradient uses product rule)
(define prod (reduce-tensor x *
  compute-gradient: (lambda (grad-out idx val all-values)
                      ;; d(prod)/dx_i = prod / x_i  (fold is from srfi-1)
                      (let ((prod (fold * 1.0 all-values)))
                        (if (zero? val)
                            0.0
                            (* grad-out (/ prod val)))))))
[procedure] (sum-tensor tensor) -> tensor

Sums all elements in the tensor. Gradient is distributed uniformly to all elements.

[procedure] (product-tensor tensor) -> tensor

Computes the product of all elements. Gradient uses the product rule: d(prod)/dx_i = prod / x_i.

[procedure] (mean-tensor tensor) -> tensor

Computes the mean (average) of all elements.
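
For example, sum and mean distribute gradients as follows (a small sketch):

(define x (make-tensor32 (f32vector 2.0 4.0 6.0 8.0) '(4)))

(define s (sum-tensor x))
(backward! s)
(tensor-grad x)   ; each element receives gradient 1.0

(zero-grad! x)
(define m (mean-tensor x))
(backward! m)
(tensor-grad x)   ; each element receives gradient 1/4 = 0.25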

Tensor Manipulation Operations
[procedure] (slice-tensor tensor start length) -> tensor

Extracts a slice of a tensor along the first dimension. Gradients flow back correctly to the original tensor positions.

tensor
input tensor with shape (n, ...)
start
starting index (0-based)
length
number of elements to extract
Returns
tensor with shape (length, ...)
;; Slice a batch of data
(define batch-data (make-tensor32 (make-f32vector 100) '(10 10)))
(define mini-batch (slice-tensor batch-data 2 5))  ; Shape: (5, 10)

;; Gradients flow back to original positions
(backward! (sum-tensor mini-batch))
(tensor-grad batch-data)  ; Only indices 2-6 have non-zero gradients
[procedure] (reshape tensor new-shape) -> tensor

Reshapes the tensor. Total number of elements must be preserved.

[procedure] (flatten-tensor tensor) -> tensor

Flattens a multi-dimensional tensor to 1D.
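
For example:

(define x (make-tensor32 (f32vector 1.0 2.0 3.0 4.0 5.0 6.0) '(2 3)))
(define y (reshape x '(3 2)))    ; same 6 elements, shape (3 2)
(define f (flatten-tensor x))    ; shape (6)
(tensor-shape f)                 ; => (6)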

Activation Functions
[procedure] (relu tensor) -> tensor

Rectified Linear Unit: max(0, x).

Gradient: 1 if x > 0, else 0

[procedure] (tanh-op tensor) -> tensor

Hyperbolic tangent activation.

Gradient: 1 - tanh^2(x)

[procedure] (sigmoid tensor) -> tensor

Sigmoid (logistic) activation: σ(x) = 1 / (1 + e^(-x)).

Gradient: σ(x) · (1 - σ(x))

[procedure] (sigmoid-stable tensor) -> tensor

Numerically stable sigmoid implementation for large negative values.
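
A small sketch applying these element-wise activations (values are approximate):

(define x (make-tensor32 (f32vector -1.0 0.0 2.0) '(3)))
(tensor->list (relu x))       ; => (0.0 0.0 2.0)
(tensor->list (sigmoid x))    ; => (0.269 0.5 0.881)
(tensor->list (tanh-op x))    ; => (-0.762 0.0 0.964)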

[procedure] (softmax x #!key (axis -1)) -> tensor

Softmax normalization with numerical stability and batch support.

Input shapes:

- 1D (n,): the whole vector is normalized
- 2D (batch_size, n): each row is normalized independently along the class axis

; Single sample
(define logits (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define probs (softmax logits))  ; Sums to 1.0

; Batch of samples
(define batch-logits (make-tensor32 (make-f32vector 60) '(20 3)))
(define batch-probs (softmax batch-logits axis: -1))  ; Each row sums to 1.0

Gradient: dL/dx = softmax(x) ⊙ (dL/dy - Σ(dL/dy ⊙ softmax(x)))

[procedure] (log-softmax x #!key (axis -1)) -> tensor

Log-softmax with batch support: more numerically stable than log(softmax(x)).

Input shapes: same as softmax; 1D (n,) or 2D (batch_size, n), normalized along the last axis.

Gradient: dL/dx = dL/dy - exp(log_softmax(x)) · Σ(dL/dy)

[procedure] (leaky-relu tensor #!key (alpha 0.01)) -> tensor

Leaky ReLU: max(alpha * x, x).

[procedure] (softplus tensor #!key (beta 1.0)) -> tensor

Softplus activation: log(1 + e^(beta * x)) / beta.

[procedure] (gelu tensor) -> tensor

Gaussian Error Linear Unit activation using tanh approximation.

[procedure] (silu tensor) -> tensor

SiLU (Sigmoid Linear Unit) activation, also known as Swish: x * σ(x).

Loss Functions
[procedure] (mse-loss pred target #!key (reduction 'mean)) -> tensor

Mean Squared Error loss with batch support.

pred
predictions tensor (any shape)
target
target tensor (same shape as pred)
reduction
'mean (average over all elements) or 'sum

For batched inputs (batch_size, ...), the loss is computed per sample and reduced according to the reduction parameter.

; Single sample
(define loss (mse-loss predictions targets))

; Batch of samples
(define batch-pred (make-tensor32 pred-data '(32 10)))
(define batch-target (make-tensor32 target-data '(32 10)))
(define batch-loss (mse-loss batch-pred batch-target reduction: 'mean))
[procedure] (cross-entropy-loss pred target #!key (reduction 'mean) (from-logits #f)) -> tensor

Cross-entropy loss with batch support.

pred
predictions tensor
If from-logits=#f
probabilities (softmax already applied)
If from-logits=#t
logits (raw scores, log-softmax applied internally)
target
target tensor
One-hot
same shape as pred
Class indices
(batch_size,) with integer class labels
reduction
'mean (average over batch) or 'sum
from-logits
if true, apply log-softmax to pred first

Input shapes:

- Single sample: pred (num_classes,), target (num_classes,) one-hot
- Batch with one-hot targets: pred (batch_size, num_classes), target (batch_size, num_classes)
- Batch with class indices: pred (batch_size, num_classes), target (batch_size,)

; Single sample with one-hot target
(define loss (cross-entropy-loss probs target))

; Batch with one-hot targets
(define batch-probs (softmax logits axis: -1))
(define batch-loss (cross-entropy-loss batch-probs targets reduction: 'mean))

; Batch with class indices (more memory efficient)
(define class-indices (make-tensor32 (f32vector 0.0 2.0 1.0) '(3)))
(define batch-loss (cross-entropy-loss logits class-indices 
                                       from-logits: #t 
                                       reduction: 'mean))
Normalization Operations
[procedure] (rmsnorm x weight #!key (epsilon 1e-5)) -> tensor

Root Mean Square Layer Normalization with batch support.

Input shapes:

- 1D (d,): normalized over the whole vector
- 2D (batch_size, d): each row normalized independently

weight must have shape (d,).

Formula: output[i] = (x[i] / RMS(x)) * weight[i] where RMS(x) = sqrt(mean(x^2) + epsilon)

; Single vector
(define x (make-tensor32 (make-f32vector 512) '(512)))
(define gamma (make-tensor32 (make-f32vector 512 1.0) '(512)))
(define normalized (rmsnorm x gamma))

; Batch of vectors
(define batch-x (make-tensor32 (make-f32vector (* 32 512)) '(32 512)))
(define batch-norm (rmsnorm batch-x gamma))  ; Normalized per batch element
[procedure] (l2-normalize tensor #!key (axis #f) (epsilon 1e-8)) -> tensor

L2 normalization with axis support.

axis
#f (normalize entire tensor) or integer (normalize along axis)

For 2D tensors, axis: 1 normalizes each row independently (e.g., each sample in a batch), while axis: #f normalizes over the entire tensor:

; Normalize entire tensor
(define normalized (l2-normalize x))

; Normalize each row of a batch
(define batch (make-tensor32 (make-f32vector 200) '(10 20)))
(define row-normalized (l2-normalize batch axis: 1))  ; Each row has ||·||₂ = 1
[procedure] (cosine-similarity a b) -> tensor

Cosine similarity: (a · b) / (||a|| · ||b||).
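
For example:

(define a (make-tensor32 (f32vector 1.0 0.0) '(2)))
(define b (make-tensor32 (f32vector 0.0 1.0) '(2)))
(define sim (cosine-similarity a b))   ; orthogonal vectors => 0.0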

Convolution Operations
[procedure] (conv2d input weight bias #!key (stride 1) (padding 0)) -> tensor

2D convolution using im2col + GEMM algorithm with batch support.

input
tensor of shape (C_in, H, W) or (N, C_in, H, W)
weight
tensor of shape (C_out, C_in, KH, KW)
bias
tensor of shape (C_out) or #f
stride
stride for convolution (default 1)
padding
zero-padding (default 0)

Input shapes:

- 3D (C_in, H, W): single image
- 4D (N, C_in, H, W): batch of N images

Output shapes:

- 3D input → (C_out, H_out, W_out); 4D input → (N, C_out, H_out, W_out), where H_out = (H + 2·padding − KH)/stride + 1 (and similarly for W_out)

; Single image
(define img (make-tensor32 (make-f32vector (* 3 32 32)) '(3 32 32)))
(define output (conv2d img weights bias stride: 2 padding: 1))

; Batch of images
(define batch-imgs (make-tensor32 (make-f32vector (* 16 3 32 32)) '(16 3 32 32)))
(define batch-output (conv2d batch-imgs weights bias))  ; Shape: (16, C_out, H_out, W_out)
Gradient Operations
[procedure] (zero-grad! tensor) -> void

Sets all gradient values to zero.

[procedure] (backward! tensor) -> void

Computes gradients via reverse-mode automatic differentiation. Performs topological sort and executes backward functions in correct order. Detects cycles and raises an error if found.

[procedure] (add-to-grad! tensor delta) -> void

Accumulates delta into the tensor's gradient using BLAS AXPY.
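
A typical manual gradient cycle using these operations (a sketch; the optimizers in nanograd-optimizer automate the parameter update):

(define w (make-tensor32 (f32vector 0.5 -0.5) '(2)))
(define x (make-tensor32 (f32vector 1.0 2.0) '(2) requires-grad?: #f))

(zero-grad! w)                        ; clear any stale gradients
(define loss (sum-tensor (mul w x)))  ; simple scalar objective
(backward! loss)                      ; populates (tensor-grad w)
(tensor-grad w)                       ; gradient of loss w.r.t. w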

Utility Functions
[procedure] (tensor->list tensor) -> list

Converts tensor data to a list.

[procedure] (print-tensor tensor) -> void

Pretty-prints tensor information including shape, dtype, data, and gradients.

[procedure] (vector-length-for-dtype vec dtype) -> integer

Returns the length of a vector based on its dtype.

nanograd-layer

Neural network layer abstractions and containers with batch processing support.

Layer Predicates
[procedure] (layer? obj) -> boolean
[procedure] (dense-layer? obj) -> boolean
[procedure] (conv2d-layer? obj) -> boolean
[procedure] (batch-norm-2d? obj) -> boolean
[procedure] (sequential? obj) -> boolean
[procedure] (flatten-layer? obj) -> boolean
Dense Layer
[procedure] (make-dense-layer input-size output-size #!key (activation (make-identity)) (use-bias #t) (dtype 'f32) (name "Dense")) -> layer

Creates a fully-connected (dense) layer with Xavier/Glorot initialization. Supports both single vectors and batches.

input-size
number of input features
output-size
number of output features
activation
activation function object (default identity)
use-bias
whether to include bias term (default #t)
dtype
'f32 or 'f64 (default 'f32)
name
layer name for debugging

Input shapes:

- 1D (input_size,) → output (output_size,)
- 2D (batch_size, input_size) → output (batch_size, output_size)

For 2D inputs, uses BLAS GEMM for efficient batch processing.

(define layer (make-dense-layer 784 128 
                                activation: (make-relu)
                                name: "Hidden1"))

; Single input
(define x (make-tensor32 (make-f32vector 784) '(784)))
(define output (forward layer x))  ; Shape: (128,)

; Batch input
(define batch-x (make-tensor32 (make-f32vector (* 32 784)) '(32 784)))
(define batch-output (forward layer batch-x))  ; Shape: (32, 128)
Convolutional Layer
[procedure] (make-conv2d-layer in-channels out-channels kernel-size #!key (stride 1) (padding 0) (activation (make-identity)) (dtype 'f32) (name "Conv2D")) -> layer

Creates a 2D convolutional layer with He initialization. Supports both single images and batches.

in-channels
number of input channels
out-channels
number of output channels
kernel-size
size of convolution kernel (square)
stride
convolution stride (default 1)
padding
zero-padding (default 0)
activation
activation function object
dtype
'f32 or 'f64
name
layer name

Input shapes:

- 3D (in_channels, H, W): single image
- 4D (N, in_channels, H, W): batch of images

Output shapes:

- 3D input → (out_channels, H_out, W_out); 4D input → (N, out_channels, H_out, W_out)

(define conv (make-conv2d-layer 3 32 3 
                                stride: 1 
                                padding: 1
                                activation: (make-relu)))

; Single image
(define img (make-tensor32 img-data '(3 32 32)))
(define features (forward conv img))  ; Shape: (32, 32, 32)

; Batch of images
(define batch (make-tensor32 batch-data '(16 3 32 32)))
(define batch-features (forward conv batch))  ; Shape: (16, 32, 32, 32)
Batch Normalization Layer
[procedure] (make-batch-norm-2d num-features #!key (epsilon 1e-5) (momentum 0.1) (dtype 'f32) (name "BatchNorm2d")) -> layer

Creates a 2D batch normalization layer. Normalizes activations across the batch dimension:

y = γ * (x - μ) / √(σ² + ε) + β

where μ and σ² are computed from the batch (training mode) or from running statistics (evaluation mode).

num-features
number of channels (C)
epsilon
small constant for numerical stability (default 1e-5)
momentum
momentum for updating running statistics (default 0.1)
dtype
'f32 or 'f64 (default 'f32)
name
layer name

Input shapes:

- 3D (C, H, W): single sample
- 4D (N, C, H, W): batch of samples

Output shapes: same as input

;; Create batch norm for 64 channels
(define bn (make-batch-norm-2d 64 epsilon: 1e-5 momentum: 0.1))

;; Training mode: uses batch statistics
(set-training-mode! bn #t)
(define normalized (forward bn input))  ; Input shape: (N, 64, H, W)

;; Evaluation mode: uses running statistics
(set-eval-mode! bn)
(define test-normalized (forward bn test-input))  ; Deterministic output

Batch normalization improves training stability and convergence by normalizing activations across the batch, which reduces sensitivity to weight initialization and permits higher learning rates.

Key features:

- Learnable per-channel scale (γ) and shift (β) parameters
- Running mean and variance, updated with the given momentum during training and used in evaluation mode
- Separate training and evaluation behavior via set-training-mode! / set-eval-mode!

Flatten Layer
[procedure] (make-flatten #!key (name "Flatten")) -> layer

Creates a flatten layer that converts multi-dimensional tensors to 1D or 2D.

Input shapes and outputs:

- Single sample (d1, d2, ...) → 1D (d1·d2·...,)
- Batch (N, d1, d2, ...) → 2D (N, d1·d2·...)

(define flatten (make-flatten name: "Flatten"))

; Flatten batch of feature maps
(define features (make-tensor32 data '(32 64 8 8)))
(define flattened (forward flatten features))  ; Shape: (32, 4096)
Global Average Pooling
[procedure] (global-avg-pool2d input) -> tensor

Global average pooling over spatial dimensions with batch support. Reduces spatial dimensions to 1x1 by averaging.

Input shapes:

- 3D (C, H, W) → output (C,)
- 4D (N, C, H, W) → output (N, C)

Gradient: Distributed uniformly over all spatial positions for each channel.

;; Single image
(define feature-maps (make-tensor32 (make-f32vector (* 128 8 8)) '(128 8 8)))
(define pooled (global-avg-pool2d feature-maps))  ; Shape: (128,)

;; Batch of images
(define batch-features (make-tensor32 (make-f32vector (* 32 128 8 8)) '(32 128 8 8)))
(define batch-pooled (global-avg-pool2d batch-features))  ; Shape: (32, 128)

;; Use in classification network
(define logits (forward fc-layer batch-pooled))  ; Shape: (32, num_classes)

Global average pooling is commonly used to replace large fully-connected layers: instead of flattening a (C, H, W) feature map into a long vector, the spatial dimensions are averaged away, so the final classifier needs only C inputs and the parameter count drops accordingly.

Sequential Container
[procedure] (make-sequential layers #!key (name "Sequential")) -> layer

Creates a sequential container that chains multiple layers. Automatically handles batch propagation through all layers.

(define model
  (make-sequential
   (list
    (make-dense-layer 784 128 activation: (make-relu))
    (make-dense-layer 128 64 activation: (make-relu))
    (make-dense-layer 64 10 activation: (make-identity)))
   name: "MLP"))

; Works with both single and batch inputs
(define single-output (forward model single-input))
(define batch-output (forward model batch-input))
Layer Operations
[procedure] (forward layer input) -> tensor

Performs a forward pass through the layer. Automatically handles both single samples and batches based on input shape.

[procedure] (parameters layer) -> list

Returns a list of all trainable parameter tensors.

[procedure] (zero-grad-layer! layer) -> void

Zeros gradients for all parameters in the layer.

[procedure] (set-training-mode! layer training?) -> void

Sets the training mode for the layer. When training? is #t, the layer uses training-specific behavior (e.g., batch statistics for batch norm). When #f, uses evaluation behavior.

;; Set model to training mode
(set-training-mode! model #t)

;; Set model to evaluation mode
(set-training-mode! model #f)
[procedure] (set-eval-mode! layer) -> void

Shorthand for (set-training-mode! layer #f). Sets the layer to evaluation mode.

Training vs Evaluation Mode:

Training Mode ((set-training-mode! layer #t)): batch normalization layers compute mean and variance from the current batch and update their running statistics.

Evaluation Mode ((set-eval-mode! layer)): batch normalization layers use the stored running statistics, so the output is deterministic and does not depend on batch composition.

[procedure] (layer-input-size layer) -> integer or #f
[procedure] (layer-output-size layer) -> integer or #f
[procedure] (layer-activation layer) -> activation
[procedure] (layer-name layer) -> string

Accessor functions for layer properties. Note: input/output sizes may be #f for layers with dynamic dimensions (e.g., flatten).

Activation Function Objects
[procedure] (make-relu) -> activation
[procedure] (make-tanh) -> activation
[procedure] (make-sigmoid) -> activation
[procedure] (make-gelu) -> activation
[procedure] (make-silu) -> activation
[procedure] (make-identity) -> activation

Creates activation function objects for use in layers.

[procedure] (activation? obj) -> boolean
[procedure] (activation-forward act x) -> tensor
[procedure] (activation-name act) -> string
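
For example, an activation object can be applied directly or passed to a layer:

(define act (make-relu))
(define x (make-tensor32 (f32vector -1.0 2.0) '(2)))
(activation-name act)                  ; activation name as a string
(define y (activation-forward act x))  ; equivalent to (relu x)
(define layer (make-dense-layer 2 4 activation: act))
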
Utility Functions
[procedure] (print-layer layer #!optional (indent 0)) -> void

Prints layer information with optional indentation.

[procedure] (summary model) -> void

Prints a model summary including all layers and parameter counts.

nanograd-optimizer

Optimization algorithms for neural network training.

Optimizer Predicates
[procedure] (optimizer? obj) -> boolean
[procedure] (sgd? obj) -> boolean
[procedure] (adam? obj) -> boolean
[procedure] (rmsprop? obj) -> boolean
SGD Optimizer
[procedure] (make-sgd parameters #!key (learning-rate 0.01) (momentum 0.0) (weight-decay 0.0) (nesterov #f)) -> optimizer

Stochastic Gradient Descent optimizer with optional momentum and Nesterov acceleration.

parameters
list of parameter tensors to optimize
learning-rate
step size (default 0.01)
momentum
momentum factor (default 0.0, no momentum)
weight-decay
L2 regularization factor (default 0.0)
nesterov
use Nesterov momentum (default #f)
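
A minimal usage sketch (the layer sizes and dummy data are illustrative):

(define layer (make-dense-layer 4 2))
(define opt (make-sgd (parameters layer)
                      learning-rate: 0.1
                      momentum: 0.9))

;; One update after computing a loss and calling backward!
(define x (make-tensor32 (make-f32vector 4 1.0) '(4)))
(define target (make-tensor32 (make-f32vector 2 0.0) '(2)))
(define loss (mse-loss (forward layer x) target))
(backward! loss)
(step! opt)               ; apply the SGD update
(zero-grad-layer! layer)
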
Adam Optimizer
[procedure] (make-adam parameters #!key (learning-rate 0.001) (beta1 0.9) (beta2 0.999) (epsilon 1e-8) (weight-decay 0.0)) -> optimizer

Adam (Adaptive Moment Estimation) optimizer with bias correction.

beta1
exponential decay rate for first moment (default 0.9)
beta2
exponential decay rate for second moment (default 0.999)
epsilon
numerical stability constant (default 1e-8)
RMSprop Optimizer
[procedure] (make-rmsprop parameters #!key (learning-rate 0.01) (alpha 0.99) (epsilon 1e-8) (weight-decay 0.0) (momentum 0.0)) -> optimizer

RMSprop optimizer with optional momentum.

alpha
smoothing constant (default 0.99)
Optimizer Operations
[procedure] (step! optimizer) -> void

Applies parameter updates based on accumulated gradients.

[procedure] (get-learning-rate optimizer) -> number

Returns the current learning rate.

[procedure] (set-learning-rate! optimizer lr) -> void

Updates the learning rate (useful for learning rate scheduling).

[procedure] (optimizer-state optimizer) -> alist

Returns an association list of optimizer configuration parameters.
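
For example, a simple step-decay learning-rate schedule built on these accessors (the decay factor and interval are illustrative):

;; Halve an optimizer's learning rate every 10 epochs
(define (maybe-decay-lr! opt epoch)
  (when (and (positive? epoch) (zero? (modulo epoch 10)))
    (set-learning-rate! opt (* 0.5 (get-learning-rate opt)))))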

Examples

Batch Processing with Dense Layers

(import nanograd-autograd nanograd-layer)

;; Create a batch of inputs
(define batch-size 32)
(define input-dim 784)
(define batch-data (make-f32vector (* batch-size input-dim)))

;; Fill with data...

(define batch-input (make-tensor32 batch-data (list batch-size input-dim)))

;; Dense layer automatically handles batches
(define layer (make-dense-layer input-dim 128 activation: (make-relu)))
(define output (forward layer batch-input))  ; Shape: (32, 128)

Batched Softmax and Cross-Entropy

;; Batch of logits
(define batch-size 32)
(define num-classes 10)
(define logits (make-tensor32 (make-f32vector (* batch-size num-classes)) 
                              (list batch-size num-classes)))
(define targets (make-tensor32 target-data (list batch-size num-classes)))

;; Softmax along class dimension
(define probs (softmax logits axis: -1))  ; Each row sums to 1

;; Cross-entropy with batches
(define loss (cross-entropy-loss probs targets reduction: 'mean))

;; Alternative: use from-logits for stability
(define loss-stable (cross-entropy-loss logits targets 
                                        from-logits: #t 
                                        reduction: 'mean))

Training with Batches

(import nanograd-autograd nanograd-layer nanograd-optimizer)

;; Define model
(define model
  (make-sequential
   (list
    (make-dense-layer 784 256 activation: (make-relu))
    (make-dense-layer 256 128 activation: (make-relu))
    (make-dense-layer 128 10 activation: (make-identity)))
   name: "MLP"))

(define optimizer (make-adam (parameters model) learning-rate: 0.001))

;; Training loop with batches
(define (train-epoch train-batches)
  (set-training-mode! model #t)
  
  (for-each
   (lambda (batch)
     (let* ((x (car batch))        ; Shape: (batch_size, 784)
            (y (cdr batch))        ; Shape: (batch_size, 10)
            (logits (forward model x))
            (loss (cross-entropy-loss logits y 
                                      from-logits: #t 
                                      reduction: 'mean)))
       
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! model)))
   train-batches))

;; Evaluation
(define (evaluate test-batches)
  (set-eval-mode! model)
  ;; ... evaluation code ...
  )

Convolutional Network with Batch Normalization

(import nanograd-autograd nanograd-layer nanograd-optimizer)

;; CNN with batch support
(define cnn
  (make-sequential
   (list
    (make-conv2d-layer 3 32 3 stride: 1 padding: 1 
                       activation: (make-identity))
    (make-batch-norm-2d 32)  ; Normalizes across batch
    (make-conv2d-layer 32 64 3 stride: 1 padding: 1 
                       activation: (make-identity))
    (make-batch-norm-2d 64)
    (make-flatten)
    (make-dense-layer (* 64 32 32) 128 activation: (make-relu))
    (make-dense-layer 128 10 activation: (make-identity)))
   name: "CNN"))

;; Process batch of images
(define batch-images (make-tensor32 image-data '(16 3 32 32)))  ; 16 RGB images
(set-training-mode! cnn #t)
(define predictions (forward cnn batch-images))  ; Shape: (16, 10)

ResNet-Style Architecture

;; Simplified ResNet-style block with batch normalization
;; (the residual skip connection is omitted for brevity)
(define (make-resnet-block in-channels out-channels stride)
  (make-sequential
   (list
    (make-conv2d-layer in-channels out-channels 3 
                       stride: stride padding: 1
                       activation: (make-identity))
    (make-batch-norm-2d out-channels)
    (make-conv2d-layer out-channels out-channels 3
                       stride: 1 padding: 1
                       activation: (make-identity))
    (make-batch-norm-2d out-channels))
   name: "ResBlock"))

;; Full model: convolutional trunk, then global average pooling
;; reduces the feature maps to 512 values per sample for the classifier
(define resnet-trunk
  (make-sequential
   (list
    (make-conv2d-layer 3 64 7 stride: 2 padding: 3)
    (make-batch-norm-2d 64)
    (make-resnet-block 64 64 1)
    (make-resnet-block 64 128 2)
    (make-resnet-block 128 256 2)
    (make-resnet-block 256 512 2))
   name: "ResNet"))

(define classifier (make-dense-layer 512 1000))

(define (resnet-forward x)
  (forward classifier (global-avg-pool2d (forward resnet-trunk x))))

Performance Notes

Batch Processing Best Practices

1. Always use batches during training for better performance and stable gradients (see the mini-batch sketch below)
2. Set appropriate batch sizes (typically 16-256, depending on memory)
3. Use batch normalization for deeper networks (more than ~10 layers)
4. Switch to eval mode during validation/testing to use running statistics
5. Prefer global average pooling over large fully-connected layers in CNNs
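
A minimal sketch of cutting mini-batches from a dataset tensor with slice-tensor (the dataset size, feature count, and get-batch helper are illustrative assumptions):

(define dataset-x (make-tensor32 (make-f32vector (* 960 784)) '(960 784)
                                 requires-grad?: #f))
(define batch-size 32)

;; Return the i-th mini-batch, shape (32 784)
(define (get-batch i)
  (slice-tensor dataset-x (* i batch-size) batch-size))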

Limitations

Troubleshooting

Common Errors

Shape mismatch errors

Ensure tensor shapes are compatible for operations. For batched operations, the batch dimension should match.

; Batch size mismatch
(define x (make-tensor32 (make-f32vector 200) '(10 20)))
(define y (make-tensor32 (make-f32vector 300) '(15 20)))
(add x y)  ; Error: shape mismatch

Batch normalization mode not set

Always explicitly set training/eval mode:

; Training
(set-training-mode! model #t)
(train-epoch model)

; Evaluation
(set-eval-mode! model)
(evaluate model)

Author

Ivan Raikov

Repository

https://github.com/iraikov/nanograd

Version History

2.0
Batch processing support

- Dense layers support 1D/2D inputs
- Conv2D supports 3D/4D inputs
- Batch normalization for 3D/4D inputs
- Softmax/log-softmax with batch and axis support
- Cross-entropy loss with batch reduction
- RMSNorm with 1D/2D support
- Global average pooling with 3D/4D support
- L2-normalize with axis parameter

1.2
Additional operations

- Reduction operations (sum-tensor, mean-tensor, product-tensor, reduce-tensor)
- Tensor slicing (slice-tensor)
- Batch normalization (make-batch-norm-2d)
- Global average pooling (global-avg-pool2d)
- Training/evaluation mode control

1.1
Bug fix in mul layer operation
1.0
Initial release

- Core autograd engine
- Dense and convolutional layers
- SGD, Adam, and RMSprop optimizers
- Basic activation and loss functions

See Also

License

LGPL-3

References