nanograd
A lightweight automatic differentiation and neural network framework for CHICKEN Scheme, featuring BLAS-accelerated operations and YASOS-based object abstractions.
Description
NanoGrad provides a complete framework for building and training neural networks with automatic differentiation. It features:
- Reverse-mode automatic differentiation with gradient computation
- BLAS-accelerated linear algebra operations
- YASOS-based polymorphic object system
- Support for both 32-bit and 64-bit floating-point precision
- Common neural network layers (Dense, Convolutional, Batch Normalization)
- Common optimization algorithms (SGD, Adam, RMSprop)
- Standard activation functions and loss functions
- Tensor manipulation with reduction operations and slicing
- Training/evaluation mode support for layers
Requirements
Modules
nanograd-autograd
Core automatic differentiation engine with tensor operations.
Tensor Constructors
[procedure] (make-tensor32 data shape #!key (requires-grad? #t)) -> tensor
Creates a 32-bit floating-point tensor with automatic differentiation support.
- data
- f32vector containing the tensor data
- shape
- list of dimensions, e.g., '(2 3) for a 2x3 matrix
- requires-grad?
- whether to track gradients (default #t)
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3) requires-grad?: #t))
[procedure] (make-tensor64 data shape #!key (requires-grad? #t)) -> tensor
Creates a 64-bit floating-point tensor with automatic differentiation support.
(define x (make-tensor64 (f64vector 1.0 2.0 3.0 4.0) '(2 2)))
Tensor Predicates
[procedure] (tensor? obj) -> boolean
[procedure] (tensor32? obj) -> boolean
[procedure] (tensor64? obj) -> boolean
Type predicates for tensors.
Tensor Accessors
[procedure] (tensor-data tensor) -> vector
Returns the underlying f32vector or f64vector containing the tensor's data.
[procedure] (tensor-grad tensor) -> vector or #f
Returns the gradient vector if gradients are enabled, #f otherwise.
[procedure] (tensor-shape tensor) -> list
Returns the shape as a list of dimensions.
[procedure] (tensor-dtype tensor) -> symbol
Returns the data type: 'f32 or 'f64.
[procedure] (tensor-requires-grad? tensor) -> boolean
Returns #t if the tensor tracks gradients.
Arithmetic Operations
[procedure] (add a b) -> tensor
Element-wise addition of tensors a and b. Both tensors must have the same shape and dtype.
(define z (add x y)) ; z = x + y
Gradient: dL/da = dL/dz, dL/db = dL/dz
[procedure] (sub a b) -> tensor
Element-wise subtraction: a - b.
Gradient: dL/da = dL/dz, dL/db = -dL/dz
[procedure] (mul a b) -> tensor
Element-wise multiplication (Hadamard product).
Gradient: dL/da = dL/dz ⊙ b, dL/db = dL/dz ⊙ a
[procedure] (div a b) -> tensor
Element-wise division: a / b.
Gradient: dL/da = dL/dz / b, dL/db = -dL/dz · (a / b²)
[procedure] (safe-div a b #!key (epsilon 1e-8)) -> tensor
Safe element-wise division: a / (b + epsilon) to avoid division by zero.
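A minimal sketch of these element-wise operations in use; the gradient values in the comments follow the formulas above:

(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define y (make-tensor32 (f32vector 4.0 5.0 6.0) '(3)))

(define d (sub x y))                       ; [-3.0 -3.0 -3.0]
(define q (safe-div x y epsilon: 1e-8))    ; approx [0.25 0.4 0.5]

(backward! (sum-tensor q))
(tensor-grad x)   ; dL/dx = 1 / (y + epsilon), approx [0.25 0.2 0.167]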
Linear Algebra Operations
[procedure] (matmul-op a b) -> tensor
Matrix multiplication using BLAS GEMM/GEMV operations. Supports:
- Matrix × Matrix
- Matrix × Vector
- Vector × Matrix
- Vector × Vector (dot product)
(define A (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(define b (make-tensor32 (f32vector 5.0 6.0) '(2)))
(define c (matmul-op A b))   ; 2×2 matrix times 2×1 vector = 2×1 vector
Gradient: dL/dA = dL/dC · B^T, dL/dB = A^T · dL/dC
[procedure] (dot-op a b) -> tensor
Dot product (inner product) of two 1D vectors using BLAS DOT.
(define result (dot-op x y)) ; scalar result
Gradient: dL/da = (dL/dresult) · b, dL/db = (dL/dresult) · a
[procedure] (scale-op tensor scalar) -> tensor
Scalar multiplication using BLAS SCAL.
Gradient: dL/dtensor = scalar · dL/dresult
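For example, scaling a tensor and backpropagating through a sum (a small sketch using the operations documented above):

(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define y (scale-op x 2.5))      ; [2.5 5.0 7.5]
(backward! (sum-tensor y))
(tensor-grad x)                  ; each element receives 2.5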
Reduction Operations
[procedure] (reduce-tensor tensor reducer #!key (compute-gradient #f)) -> tensor
Generic reduction operation that maintains gradient flow. The reducer function is applied to each element in the forward pass. An optional compute-gradient function specifies how gradients are distributed in the backward pass.
- tensor
- input tensor to reduce
- reducer
- function (element accumulator) -> new-accumulator
- compute-gradient
- optional function (grad-out index value all-values) -> grad-in
- If not provided, assumes uniform distribution (like sum)
Returns a scalar tensor with the reduced value.
;; Sum all elements (uniform gradient distribution)
(define total (reduce-tensor x +))

;; Product of all elements (gradient uses product rule)
(define prod
  (reduce-tensor x *
    compute-gradient:
    (lambda (grad-out idx val all-values)
      ;; d(prod)/dx_i = prod / x_i
      (let ((prod (fold * 1.0 all-values)))
        (if (> val 0.0)
            (* grad-out (/ prod val))
            0.0)))))

;; Custom maximum with gradient flowing only to max element
(define max-val
  (reduce-tensor x max
    compute-gradient:
    (lambda (grad-out idx val all-values)
      (if (= val (apply max all-values)) grad-out 0.0))))
[procedure] (sum-tensor tensor) -> tensor
Sums all elements in the tensor. Gradient is distributed uniformly to all elements.
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define total (sum-tensor x))   ; Returns scalar tensor with value 6.0
(backward! total)
(tensor-grad x)                 ; Each element receives gradient of 1.0
[procedure] (product-tensor tensor) -> tensor
Computes the product of all elements. Gradient uses the product rule: d(prod)/dx_i = prod / x_i.
(define x (make-tensor32 (f32vector 2.0 3.0 4.0) '(3)))
(define prod (product-tensor x))   ; Returns 24.0
(backward! prod)
(tensor-grad x)                    ; Gradients: [12.0, 8.0, 6.0]
[procedure] (mean-tensor tensor) -> tensor
Computes the mean (average) of all elements. Equivalent to (sum-tensor tensor) / n.
(define x (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(4)))
(define avg (mean-tensor x))   ; Returns 2.5
(backward! avg)
(tensor-grad x)                ; Each element receives gradient of 0.25
Tensor Manipulation Operations
[procedure] (slice-tensor tensor start length) -> tensor
Extracts a slice of a tensor along the first dimension. Gradients flow back correctly to the original tensor positions.
- tensor
- input tensor with shape (n, ...)
- start
- starting index (0-based)
- length
- number of elements to extract
- Returns
- tensor with shape (length, ...)
;; Slice a batch of data
(define batch-data (make-tensor32 (make-f32vector 100) '(10 10)))
(define mini-batch (slice-tensor batch-data 2 5))   ; Shape: (5, 10)

;; Gradients flow back to original positions
(backward! (sum-tensor mini-batch))
(tensor-grad batch-data)   ; Only indices 2-6 have non-zero gradients
Example: Mini-batch training
(define dataset (make-tensor32 training-data '(1000 784)))

(do ((i 0 (+ i batch-size)))
    ((>= i 1000))
  (let ((batch (slice-tensor dataset i batch-size)))
    ;; Process batch
    (let ((output (forward model batch)))
      (backward! output)
      (step! optimizer))))
[procedure] (reshape tensor new-shape) -> tensor
Reshapes the tensor. Total number of elements must be preserved. Creates a new tensor with separate gradient buffer but shared underlying data.
(define x (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(define x-flat (reshape x '(4)))          ; Flatten to 1D
(define x-back (reshape x-flat '(2 2)))   ; Reshape back
[procedure] (flatten-tensor tensor) -> tensor
Flattens a multi-dimensional tensor to 1D. Equivalent to (reshape tensor (list total-size)).
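For example, flattening a 2x2 tensor (a small sketch; names are illustrative):

(define img (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(define v (flatten-tensor img))
(tensor-shape v)   ; => (4)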
Activation Functions
[procedure] (relu tensor) -> tensor
Rectified Linear Unit: max(0, x).
Gradient: 1 if x > 0, else 0
[procedure] (tanh-op tensor) -> tensor
Hyperbolic tangent activation.
Gradient: 1 - tanh^2(x)
[procedure] (sigmoid tensor) -> tensor
Sigmoid (logistic) activation: σ(x) = 1 / (1 + e^(-x)).
Gradient: σ(x) · (1 - σ(x))
[procedure] (sigmoid-stable tensor) -> tensor
Numerically stable sigmoid implementation for large negative values.
[procedure] (softmax x #!key (dim #f)) -> tensor
Softmax normalization with numerical stability (subtracts max before exp).
(define probs (softmax logits))   ; Converts logits to probabilities
[procedure] (log-softmax x #!key (dim #f)) -> tensor
Log-softmax: more numerically stable than log(softmax(x)).
[procedure] (leaky-relu tensor #!key (alpha 0.01)) -> tensor
Leaky ReLU: max(alpha * x, x).
[procedure] (softplus tensor #!key (beta 1.0)) -> tensor
Softplus activation: log(1 + e^(beta * x)) / beta.
[procedure] (gelu tensor) -> tensor
Gaussian Error Linear Unit activation using tanh approximation.
[procedure] (silu tensor) -> tensor
SiLU (Sigmoid Linear Unit) activation, also known as Swish: x * σ(x).
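A brief sketch applying several of these activations to one input; the values in the comments are approximate and follow the definitions above:

(define x (make-tensor32 (f32vector -1.0 0.0 2.0) '(3)))

(define r  (relu x))                    ; [0.0 0.0 2.0]
(define lr (leaky-relu x alpha: 0.1))   ; [-0.1 0.0 2.0]
(define s  (sigmoid x))                 ; approx [0.269 0.5 0.881]
(define p  (softmax x))                 ; probabilities summing to 1

(backward! (sum-tensor r))
(tensor-grad x)                         ; ReLU gradient: [0.0 0.0 1.0]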
Loss Functions
[procedure] (mse-loss pred target) -> tensor
Mean Squared Error loss: L = (1/n) ∑(pred - target)².
(define loss (mse-loss predictions targets))
[procedure] (cross-entropy-loss pred target) -> tensor
Cross-entropy loss: L = -∑(target · log(pred)).
Note: Assumes pred is already normalized (e.g., via softmax).
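A minimal sketch combining softmax with cross-entropy-loss on a one-hot target, per the note above:

(define logits (make-tensor32 (f32vector 2.0 0.5 -1.0) '(3)))
(define target (make-tensor32 (f32vector 1.0 0.0 0.0) '(3)))

(define probs (softmax logits))               ; normalize first
(define loss  (cross-entropy-loss probs target))
(backward! loss)
(tensor-grad logits)                          ; gradients w.r.t. the logits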
Gradient Operations
[procedure] (zero-grad! tensor) -> void
Sets all gradient values to zero.
[procedure] (backward! tensor) -> void
Computes gradients via reverse-mode automatic differentiation. Performs topological sort and executes backward functions in correct order. Detects cycles and raises an error if found.
(define x (make-tensor32 (f32vector 1.0 2.0) '(2)))
(define y (make-tensor32 (f32vector 3.0 4.0) '(2)))
(define z (add x y))
(define loss (dot-op z z))
(backward! loss)
(print-tensor (tensor-grad x))
[procedure] (add-to-grad! tensor delta) -> void
Accumulates delta into the tensor's gradient using BLAS AXPY.
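For illustration, a small sketch of manual gradient accumulation; it assumes delta is a plain SRFI-4 vector matching the tensor's data (as in the cycle example under Troubleshooting):

(define x (make-tensor32 (f32vector 1.0 2.0) '(2)))
(zero-grad! x)
(add-to-grad! x (f32vector 0.5 0.5))   ; gradient is now [0.5 0.5]
(add-to-grad! x (f32vector 0.5 0.5))   ; accumulates to [1.0 1.0]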
Convolution Operations
[procedure] (conv2d input weight bias #!key (stride 1) (padding 0)) -> tensor
2D convolution using the im2col + GEMM algorithm.
- input
- tensor of shape (C_in, H, W)
- weight
- tensor of shape (C_out, C_in, KH, KW)
- bias
- tensor of shape (C_out) or #f
- stride
- stride for convolution (default 1)
- padding
- zero-padding (default 0)
(define output (conv2d input weights bias stride: 2 padding: 1))
Normalization Operations
[procedure] (rmsnorm x weight #!key (epsilon 1e-5)) -> tensor
Root Mean Square Layer Normalization.
[procedure] (l2-normalize tensor #!key (epsilon 1e-8)) -> tensor
L2 normalization: x / ||x||₂.
[procedure] (cosine-similarity a b) -> tensor
Cosine similarity: (a · b) / (||a|| · ||b||).
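A small sketch of these helpers; it assumes the rmsnorm weight argument is a tensor of per-element scales with the same shape as x:

(define x (make-tensor32 (f32vector 3.0 4.0) '(2)))
(define y (make-tensor32 (f32vector 6.0 8.0) '(2)))

(define xn (l2-normalize x))          ; [0.6 0.8], unit L2 norm
(define cs (cosine-similarity x y))   ; 1.0 (same direction)

(define w (make-tensor32 (f32vector 1.0 1.0) '(2)))
(define rn (rmsnorm x w))             ; x scaled by 1/RMS(x), then by w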
Utility Functions
[procedure] (tensor->list tensor) -> list
Converts tensor data to a list.
[procedure] (print-tensor tensor) -> void
Pretty-prints tensor information including shape, dtype, data, and gradients.
[procedure] (vector-length-for-dtype vec dtype) -> integer
Returns the length of a vector based on its dtype.
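For example, inspecting a tensor's contents with these utilities:

(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(tensor->list x)                                 ; => (1.0 2.0 3.0)
(vector-length-for-dtype (tensor-data x) 'f32)   ; => 3
(print-tensor x)                                 ; shape, dtype, data, gradients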
nanograd-layer
Neural network layer abstractions and containers.
Layer Predicates
[procedure] (layer? obj) -> boolean
[procedure] (dense-layer? obj) -> boolean
[procedure] (conv2d-layer? obj) -> boolean
[procedure] (batch-norm-2d? obj) -> boolean
[procedure] (sequential? obj) -> boolean
Dense Layer
[procedure] (make-dense-layer input-size output-size #!key (activation (make-identity)) (dtype 'f32) (name "Dense")) -> layer
Creates a fully-connected (dense) layer with Xavier/Glorot initialization.
- input-size
- number of input features
- output-size
- number of output features
- activation
- activation function object (default identity)
- dtype
- 'f32 or 'f64 (default 'f32)
- name
- layer name for debugging
(define layer (make-dense-layer 784 128
activation: (make-relu)
name: "Hidden1"))
Convolutional Layer
[procedure] (make-conv2d-layer in-channels out-channels kernel-size #!key (stride 1) (padding 0) (activation (make-identity)) (dtype 'f32) (name "Conv2D")) -> layer
Creates a 2D convolutional layer with He initialization.
(define conv (make-conv2d-layer 3 32 3
stride: 1
padding: 1
activation: (make-relu)))
Batch Normalization Layer
[procedure] (make-batch-norm-2d num-features #!key (epsilon 1e-5) (momentum 0.1) (dtype 'f32) (name "BatchNorm2d")) -> layer
Creates a 2D batch normalization layer. Normalizes activations across the batch dimension:
y = γ * (x - μ) / √(σ² + ε) + β
where μ and σ² are computed from the batch (training mode) or from running statistics (evaluation mode).
- num-features
- number of channels (C)
- epsilon
- small constant for numerical stability (default 1e-5)
- momentum
- momentum for updating running statistics (default 0.1)
- dtype
- 'f32 or 'f64 (default 'f32)
- name
- layer name
;; Create batch norm for 64 channels
(define bn (make-batch-norm-2d 64 epsilon: 1e-5 momentum: 0.1))

;; Training mode: uses batch statistics
(set-training-mode! bn #t)
(define normalized (forward bn input))   ; Input shape: (64, H, W)

;; Evaluation mode: uses running statistics
(set-eval-mode! bn)
(define normalized (forward bn input))   ; Deterministic output
Batch normalization improves training stability and convergence by:
- Reducing internal covariate shift
- Allowing higher learning rates
- Acting as a form of regularization
- Making networks less sensitive to initialization
Key features:
- Learnable scale (gamma) and shift (beta) parameters
- Running mean and variance maintained for evaluation
- Automatic mode switching between training and evaluation
- Numerical stability with epsilon parameter
Example: ResNet-style block with batch normalization
(define (make-resnet-block in-channels out-channels)
(make-sequential
(list
(make-conv2d-layer in-channels out-channels 3
padding: 1 activation: (make-identity))
(make-batch-norm-2d out-channels)
;; Apply ReLU activation here
(make-conv2d-layer out-channels out-channels 3
padding: 1 activation: (make-identity))
(make-batch-norm-2d out-channels))
name: "ResNetBlock"))
Global Average Pooling
[procedure] (global-avg-pool2d input) -> tensor
Global average pooling over spatial dimensions. Reduces spatial dimensions to 1x1 by averaging.
- Input shape
- (C, H, W)
- Output shape
- (C,)
Gradient: Distributed uniformly over all spatial positions for each channel.
;; Input: 128 channels, 8x8 spatial dimensions
(define feature-maps (make-tensor32 (make-f32vector (* 128 8 8)) '(128 8 8)))

;; Output: 128-dimensional feature vector
(define pooled (global-avg-pool2d feature-maps))   ; Shape: (128,)

;; Use in classification network
(define logits (forward fc-layer pooled))
Global average pooling is commonly used to replace large fully-connected layers at the end of CNNs:
- Reduces number of parameters dramatically
- Improves generalization
- Makes networks translation-invariant
- Standard in modern architectures (ResNet, MobileNet, EfficientNet)
Example: Replacing FC layers with global pooling
;; Traditional approach: flatten + dense (many parameters)
(define old-cnn
  (make-sequential
   (list
    (make-conv2d-layer 64 128 3)
    ;; Must flatten: (128, 8, 8) -> (8192,)
    (make-dense-layer 8192 10))))   ; 81,920 parameters!

;; Modern approach: global pooling + dense (fewer parameters)
(define new-cnn
  (make-sequential
   (list
    (make-conv2d-layer 64 128 3)
    ;; Global pooling: (128, 8, 8) -> (128,)
    (make-dense-layer 128 10))))    ; Only 1,280 parameters!
Sequential Container
[procedure] (make-sequential layers #!key (name "Sequential")) -> layer
Creates a sequential container that chains multiple layers.
(define model
(make-sequential
(list
(make-dense-layer 784 128 activation: (make-relu))
(make-dense-layer 128 64 activation: (make-relu))
(make-dense-layer 64 10 activation: (make-identity)))
name: "MLP"))
Layer Operations
[procedure] (forward layer input) -> tensor
Performs a forward pass through the layer.
[procedure] (parameters layer) -> list
Returns a list of all trainable parameter tensors.
[procedure] (zero-grad-layer! layer) -> void
Zeros gradients for all parameters in the layer.
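A brief sketch of the basic layer workflow using these operations:

(define layer (make-dense-layer 4 2 activation: (make-relu)))
(define x (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(4)))

(define out (forward layer x))     ; output tensor of shape (2)
(parameters layer)                 ; list of weight and bias tensors
(backward! (sum-tensor out))
(zero-grad-layer! layer)           ; clear gradients before the next step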
[procedure] (set-training-mode! layer training?) -> void
Sets the training mode for the layer. When training? is #t, the layer uses training-specific behavior (e.g., batch statistics for batch norm). When #f, uses evaluation behavior.
;; Set model to training mode
(set-training-mode! model #t)

;; Set model to evaluation mode
(set-training-mode! model #f)
[procedure] (set-eval-mode! layer) -> void
Shorthand for (set-training-mode! layer #f). Sets the layer to evaluation mode.
;; Evaluation mode (shorthand)
(set-eval-mode! model)
Training vs Evaluation Mode:
Training Mode ((set-training-mode! layer #t)):
- Batch normalization uses batch statistics (mean and variance computed from current batch)
- Dropout is active (if implemented)
- Stochastic behavior enabled
- Running statistics updated
Evaluation Mode ((set-eval-mode! layer)):
- Batch normalization uses running statistics (accumulated during training)
- Dropout is disabled
- Deterministic behavior
- Running statistics frozen
;; Complete training/evaluation workflow
(define (train-epoch model optimizer train-data)
  ;; Enable training mode
  (set-training-mode! model #t)
  (for-each
   (lambda (batch)
     (let* ((x (car batch))
            (y (cdr batch))
            (pred (forward model x))
            (loss (cross-entropy-loss pred y)))
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! model)))
   train-data))

(define (evaluate-epoch model test-data)
  ;; Enable evaluation mode
  (set-eval-mode! model)
  (let ((total-correct 0))
    (for-each
     (lambda (batch)
       (let* ((x (car batch))
              (y (cdr batch))
              (pred (forward model x)))
         ;; Count correct predictions
         (when (= (argmax pred) (argmax y))
           (set! total-correct (+ total-correct 1)))))
     test-data)
    ;; Return the fraction of correct predictions
    (/ total-correct (length test-data))))

;; Main loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  (train-epoch model optimizer train-data)
  (let ((accuracy (evaluate-epoch model test-data)))
    (printf "Epoch ~A: Test Accuracy = ~A%\n" epoch (* 100 accuracy))))
[procedure] (layer-input-size layer) -> integer
[procedure] (layer-output-size layer) -> integer
[procedure] (layer-activation layer) -> activation
[procedure] (layer-name layer) -> string
Accessor functions for layer properties.
Activation Function Objects
[procedure] (make-relu) -> activation
[procedure] (make-tanh) -> activation
[procedure] (make-sigmoid) -> activation
[procedure] (make-gelu) -> activation
[procedure] (make-silu) -> activation
[procedure] (make-identity) -> activation
Creates activation function objects for use in layers.
[procedure] (activation? obj) -> boolean
[procedure] (activation-forward act x) -> tensor
[procedure] (activation-name act) -> string
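Activation objects can also be applied directly, outside of a layer; a small sketch (the exact name string returned is implementation-defined):

(define act (make-relu))
(activation-name act)   ; name string, e.g. something like "ReLU"
(activation-forward act (make-tensor32 (f32vector -1.0 2.0) '(2)))   ; [0.0 2.0]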
Utility Functions
[procedure] (print-layer layer #!optional (indent 0)) -> void
Prints layer information with optional indentation.
[procedure] (summary model) -> void
Prints a model summary including all layers and parameter counts.
(summary model)
; === Model Summary ===
; Model: MLP
; Input size: 784
; Output size: 10
;
; Total parameters: 101770
nanograd-optimizer
Optimization algorithms for neural network training.
Optimizer Predicates
[procedure] (optimizer? obj) -> boolean
[procedure] (sgd? obj) -> boolean
[procedure] (adam? obj) -> boolean
[procedure] (rmsprop? obj) -> boolean
SGD Optimizer
[procedure] (make-sgd parameters #!key (learning-rate 0.01) (momentum 0.0) (weight-decay 0.0) (nesterov #f)) -> optimizer
Stochastic Gradient Descent optimizer with optional momentum and Nesterov acceleration.
- parameters
- list of parameter tensors to optimize
- learning-rate
- step size (default 0.01)
- momentum
- momentum factor (default 0.0, no momentum)
- weight-decay
- L2 regularization factor (default 0.0)
- nesterov
- use Nesterov momentum (default #f)
(define opt (make-sgd (parameters model)
learning-rate: 0.01
momentum: 0.9))
Adam Optimizer
[procedure] (make-adam parameters #!key (learning-rate 0.001) (beta1 0.9) (beta2 0.999) (epsilon 1e-8) (weight-decay 0.0)) -> optimizer
Adam (Adaptive Moment Estimation) optimizer with bias correction.
- beta1
- exponential decay rate for first moment (default 0.9)
- beta2
- exponential decay rate for second moment (default 0.999)
- epsilon
- numerical stability constant (default 1e-8)
(define opt (make-adam (parameters model) learning-rate: 0.001))
RMSprop Optimizer
[procedure] (make-rmsprop parameters #!key (learning-rate 0.01) (alpha 0.99) (epsilon 1e-8) (weight-decay 0.0) (momentum 0.0)) -> optimizer
RMSprop optimizer with optional momentum.
- alpha
- smoothing constant (default 0.99)
(define opt (make-rmsprop (parameters model)
learning-rate: 0.01
alpha: 0.99))
Optimizer Operations
[procedure] (step! optimizer) -> void
Applies parameter updates based on accumulated gradients.
[procedure] (get-learning-rate optimizer) -> number
Returns the current learning rate.
[procedure] (set-learning-rate! optimizer lr) -> void
Updates the learning rate (useful for learning rate scheduling).
; Learning rate decay
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  (set-learning-rate! opt (/ 0.1 (+ 1.0 (* 0.01 epoch))))
  ; ... training code ...
  )
[procedure] (optimizer-state optimizer) -> alist
Returns an association list of optimizer configuration parameters.
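For example, inspecting an optimizer's configuration (a sketch; the exact keys of the returned alist depend on the optimizer type):

(define opt (make-adam (parameters model) learning-rate: 0.001))
(optimizer-state opt)     ; association list of configuration parameters
(get-learning-rate opt)   ; => 0.001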
Examples
Basic Tensor Operations
(import nanograd-autograd)

; Create tensors
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define y (make-tensor32 (f32vector 4.0 5.0 6.0) '(3)))

; Operations
(define z (add x y))
(define w (mul x y))

; Compute gradients
(backward! w)
(print-tensor (tensor-grad x))
Reduction Operations
(import nanograd-autograd)

;; Sum all elements
(define x (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(4)))
(define total (sum-tensor x))      ; 10.0
(backward! total)
(print-tensor (tensor-grad x))     ; Each element: 1.0

;; Mean of elements (zero the gradient first; backward! accumulates)
(zero-grad! x)
(define avg (mean-tensor x))       ; 2.5
(backward! avg)
(print-tensor (tensor-grad x))     ; Each element: 0.25

;; Product of elements
(zero-grad! x)
(define prod (product-tensor x))   ; 24.0
(backward! prod)
(print-tensor (tensor-grad x))     ; [24.0, 12.0, 8.0, 6.0]
Tensor Slicing for Mini-Batch Training
(import nanograd-autograd)

;; Create dataset tensor
(define dataset (make-tensor32 training-data '(1000 784)))

;; Process in mini-batches
(define batch-size 32)

(do ((i 0 (+ i batch-size)))
    ((>= i 1000))
  ;; Extract batch
  (let* ((batch (slice-tensor dataset i batch-size))
         (output (forward model batch))
         (loss (mse-loss output targets)))
    ;; Backprop and optimize
    (backward! loss)
    (step! optimizer)
    (zero-grad-layer! model)))
Training a Neural Network
(import nanograd-autograd nanograd-layer nanograd-optimizer)

; Define model
(define model
  (make-sequential
   (list
    (make-dense-layer 2 8 activation: (make-relu))
    (make-dense-layer 8 1 activation: (make-identity)))
   name: "Regression"))

; Create optimizer
(define optimizer (make-adam (parameters model) learning-rate: 0.01))

; Training loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  (for-each
   (lambda (sample)
     (let* ((x (make-tensor32 (car sample) '(2)))
            (target (make-tensor32 (f32vector (cdr sample)) '(1)))
            (pred (forward model x))
            (loss (mse-loss pred target)))
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! model)))
   training-data))
Convolutional Neural Network with Batch Normalization
(import nanograd-autograd nanograd-layer nanograd-optimizer)

;; Modern CNN architecture with batch normalization
(define cnn
  (make-sequential
   (list
    ;; Convolutional block 1
    (make-conv2d-layer 3 32 3 stride: 1 padding: 1
                       activation: (make-identity))
    (make-batch-norm-2d 32)
    ;; Convolutional block 2
    (make-conv2d-layer 32 64 3 stride: 1 padding: 1
                       activation: (make-identity))
    (make-batch-norm-2d 64)
    ;; Global average pooling instead of flatten
    ;; (64, H, W) -> (64,)
    (make-dense-layer 64 128 activation: (make-relu))
    (make-dense-layer 128 10 activation: (make-identity)))
   name: "CNN"))

;; Training with proper mode switching
(define optimizer (make-adam (parameters cnn) learning-rate: 0.001))

(define (train-one-epoch)
  ;; Set training mode for batch norm
  (set-training-mode! cnn #t)
  (for-each
   (lambda (batch)
     (let* ((images (car batch))   ; Shape: (batch, 3, 32, 32)
            (labels (cdr batch))
            ;; Process each image in batch
            (predictions (map (lambda (img) (forward cnn img)) images))
            (loss (compute-loss predictions labels)))
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! cnn)))
   train-batches))

(define (evaluate)
  ;; Set evaluation mode for batch norm
  (set-eval-mode! cnn)
  (let ((correct 0)
        (total 0))
    (for-each
     (lambda (batch)
       (let* ((images (car batch))
              (labels (cdr batch)))
         (for-each
          (lambda (img label)
            (let ((pred (forward cnn img)))
              (when (= (argmax (tensor->list pred))
                       (argmax (tensor->list label)))
                (set! correct (+ correct 1)))
              (set! total (+ total 1))))
          images labels)))
     test-batches)
    (/ correct total)))

;; Main training loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 50))
  (train-one-epoch)
  (printf "Epoch ~A: Test Accuracy = ~A%\n"
          epoch (* 100 (evaluate))))
ResNet-Style Architecture
;; ResNet block with batch normalization
(define (make-resnet-block in-channels out-channels stride)
  (make-sequential
   (list
    (make-conv2d-layer in-channels out-channels 3
                       stride: stride padding: 1
                       activation: (make-identity))
    (make-batch-norm-2d out-channels)
    ;; ReLU activation
    (make-conv2d-layer out-channels out-channels 3
                       stride: 1 padding: 1
                       activation: (make-identity))
    (make-batch-norm-2d out-channels))
   name: "ResBlock"))

;; Full ResNet-18 style model
(define resnet
  (make-sequential
   (list
    ;; Initial convolution
    (make-conv2d-layer 3 64 7 stride: 2 padding: 3)
    (make-batch-norm-2d 64)
    ;; Residual blocks
    (make-resnet-block 64 64 1)
    (make-resnet-block 64 128 2)
    (make-resnet-block 128 256 2)
    (make-resnet-block 256 512 2)
    ;; Global average pooling: (512, H, W) -> (512,)
    (make-dense-layer 512 1000))
   name: "ResNet18"))
Performance Notes
- NanoGrad uses BLAS for matrix operations
- Use f32 (32-bit) tensors for better performance when 64-bit precision is not required
- The framework detects computation graph cycles and prevents infinite loops during backpropagation
- Memory is managed manually; call zero-grad-layer! after each optimization step
- Batch normalization adds minimal computational overhead but significantly improves training
- Global average pooling reduces parameters without sacrificing performance
Limitations
- CPU-only (no GPU support)
- No automatic batching
- Limited built-in layer types (dense, convolutional, batch norm)
- Single-threaded execution
- Batch normalization requires proper training/eval mode switching
Advanced Usage
Custom Reduction Operations
;; L-infinity norm (maximum absolute value)
(define (l-inf-norm tensor)
  (reduce-tensor (abs tensor) max
    compute-gradient:
    (lambda (grad-out idx val all-values)
      (let ((max-val (apply max all-values)))
        (if (= val max-val) grad-out 0.0)))))

;; Weighted sum
(define (weighted-sum tensor weights)
  (let ((weighted (mul tensor weights)))
    (sum-tensor weighted)))

;; Geometric mean
(define (geometric-mean tensor)
  (let* ((n (apply * (tensor-shape tensor)))
         (log-vals (log-tensor tensor))
         (sum (sum-tensor log-vals))
         (mean-log (scale-op sum (/ 1.0 n))))
    (exp mean-log)))
Gradient Clipping
; Clip gradients by norm
(define (clip-grad-norm! parameters max-norm)
  (let ((total-norm 0.0))
    ; Compute total norm
    (for-each
     (lambda (param)
       (let ((grad (tensor-grad param)))
         (when grad
           (let* ((dtype (tensor-dtype param))
                  (n (vector-length-for-dtype grad dtype)))
             (case dtype
               ((f32)
                (do ((i 0 (+ i 1))) ((= i n))
                  (let ((g (f32vector-ref grad i)))
                    (set! total-norm (+ total-norm (* g g))))))
               ((f64)
                (do ((i 0 (+ i 1))) ((= i n))
                  (let ((g (f64vector-ref grad i)))
                    (set! total-norm (+ total-norm (* g g)))))))))))
     parameters)
    (let ((total-norm (sqrt total-norm)))
      (when (> total-norm max-norm)
        (let ((scale (/ max-norm total-norm)))
          ; Scale all gradients
          (for-each
           (lambda (param)
             (let ((grad (tensor-grad param)))
               (when grad
                 (let ((n (vector-length-for-dtype grad (tensor-dtype param))))
                   (case (tensor-dtype param)
                     ((f32) (sscal! n scale grad))
                     ((f64) (dscal! n scale grad)))))))
           parameters))))))

; Usage
(backward! loss)
(clip-grad-norm! (parameters model) 1.0)
(step! optimizer)
Learning Rate Scheduling
; Step decay
(define (step-decay base-lr epoch drop-every drop-rate)
  (* base-lr (expt drop-rate (floor (/ epoch drop-every)))))

; Exponential decay
(define (exp-decay base-lr epoch decay-rate)
  (* base-lr (exp (- (* decay-rate epoch)))))

; Cosine annealing
(define (cosine-annealing base-lr epoch total-epochs)
  (* 0.5 base-lr (+ 1.0 (cos (* 3.14159 (/ epoch total-epochs))))))

; Usage in training loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  (set-learning-rate! optimizer (step-decay 0.1 epoch 30 0.5))
  ; ... training code ...
  )
Troubleshooting
Common Errors
Shape mismatch errors
Ensure tensor shapes are compatible for operations:
; Matrix multiplication requires compatible dimensions
(define A (make-tensor32 (make-f32vector 6) '(2 3)))
(define B (make-tensor32 (make-f32vector 6) '(3 2)))
(define C (matmul-op A B))   ; OK: (2,3) × (3,2) = (2,2)

(define D (make-tensor32 (make-f32vector 4) '(2 2)))
(matmul-op A D)              ; Error: incompatible dimensions
Gradient computation cycles
Avoid creating cycles in the computation graph:
; Bad: creates a cycle
(define x (make-tensor32 (f32vector 1.0) '(1)))
(define y (add x x))
(set-backward-fn! x
                  (lambda () (add-to-grad! x (tensor-grad y)))
                  (list y))
(backward! y)   ; Error: computation graph contains cycles
Division by zero
Use safe-div when dividing by potentially zero values:
; Instead of (div a b), use:
(define result (safe-div a b epsilon: 1e-8))
Batch normalization not switching modes
Always set training/eval mode explicitly:
; Training
(set-training-mode! model #t)
(train-epoch model)

; Evaluation
(set-eval-mode! model)
(evaluate model)
Author
Repository
https://github.com/iraikov/nanograd
Version History
- 1.2
- Recent additions
: * Reduction operations (sum-tensor, mean-tensor, product-tensor, reduce-tensor)
: * Tensor slicing (slice-tensor)
: * Batch normalization (make-batch-norm-2d)
: * Global average pooling (global-avg-pool2d)
: * Training/evaluation mode control (set-training-mode!, set-eval-mode!)
- 1.1
- Bug fix in mul layer operation
- 1.0
- Initial release
: * Core autograd engine
: * Dense and convolutional layers
: * SGD, Adam, and RMSprop optimizers
: * Basic activation and loss functions
See Also
License
LGPL-3
References
- PyTorch: Dynamic computation graphs and autograd
- micrograd: Minimalist autograd engine by Andrej Karpathy
- "Automatic Differentiation in Machine Learning: a Survey" (Baydin et al., 2018)
- "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (Ioffe & Szegedy, 2015)
- "BLAS (Basic Linear Algebra Subprograms)" documentation