nanograd
A lightweight automatic differentiation and neural network framework for CHICKEN Scheme, featuring BLAS-accelerated operations, comprehensive batch processing support, and YASOS-based object abstractions.
Description
NanoGrad provides a complete framework for building and training neural networks with automatic differentiation. It features:
- Reverse-mode automatic differentiation with gradient computation
- Native batch processing support throughout the stack
- BLAS-accelerated linear algebra operations with batched GEMM
- YASOS-based polymorphic object system
- Support for both 32-bit and 64-bit floating-point precision
- Common neural network layers with 1D/2D input support (Dense), 3D/4D support (Conv2D, BatchNorm2D)
- Common optimization algorithms (SGD, Adam, RMSprop)
- Batch-aware activation functions (Softmax, Log-Softmax) and loss functions
- Tensor manipulation with reduction operations and slicing
- Training/evaluation mode support for layers
Requirements
Modules
nanograd-autograd
Core automatic differentiation engine with tensor operations and batch support.
Tensor Constructors
[procedure] (make-tensor32 data shape #!key (requires-grad? #t)) -> tensor
Creates a 32-bit floating-point tensor with automatic differentiation support.
- data: f32vector containing the tensor data
- shape: list of dimensions, e.g., '(2 3) for a 2x3 matrix or '(10 2 3) for a batch of 10 matrices
- requires-grad?: whether to track gradients (default #t)

; Single vector
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3) requires-grad?: #t))

; Batch of vectors
(define batch (make-tensor32 (make-f32vector 60) '(10 6) requires-grad?: #t))

[procedure] (make-tensor64 data shape #!key (requires-grad? #t)) -> tensor
Creates a 64-bit floating-point tensor with automatic differentiation support.
Tensor Predicates
[procedure] (tensor? obj) -> boolean
[procedure] (tensor32? obj) -> boolean
[procedure] (tensor64? obj) -> boolean
Type predicates for tensors.
Tensor Accessors
[procedure] (tensor-data tensor) -> vector
Returns the underlying f32vector or f64vector containing the tensor's data.
[procedure] (tensor-grad tensor) -> vector or #f
Returns the gradient vector if gradients are enabled, #f otherwise.
[procedure] (tensor-shape tensor) -> list
Returns the shape as a list of dimensions.
[procedure] (tensor-dtype tensor) -> symbol
Returns the data type: 'f32 or 'f64.
[procedure] (tensor-requires-grad? tensor) -> boolean
Returns #t if the tensor tracks gradients.
Arithmetic Operations
[procedure] (add a b) -> tensor
Element-wise addition of tensors a and b. Both tensors must have the same shape and dtype.
(define z (add x y)) ; z = x + y
Gradient: dL/da = dL/dz, dL/db = dL/dz
[procedure] (sub a b) -> tensor
Element-wise subtraction: a - b.
Gradient: dL/da = dL/dz, dL/db = -dL/dz
[procedure] (mul a b) -> tensor
Element-wise multiplication (Hadamard product).
Gradient: dL/da = dL/dz ⊙ b, dL/db = dL/dz ⊙ a
[procedure] (div a b) -> tensor
Element-wise division: a / b.
Gradient: dL/da = dL/dz / b, dL/db = -dL/dz · (a / b²)
[procedure] (safe-div a b #!key (epsilon 1e-8)) -> tensor
Safe element-wise division: a / (b + epsilon), to avoid division by zero.
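These element-wise gradients compose through the computation graph. The following minimal sketch (assumed usage, not taken from the library's own examples) builds a small expression from the operations above and inspects the resulting gradients; it reduces the result to a scalar with sum-tensor (see Reduction Operations) before calling backward!.

;; Sketch: combining element-wise gradients (assumed usage)
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define y (make-tensor32 (f32vector 4.0 5.0 6.0) '(3)))
(define z (mul (add x y) y))   ; z = (x + y) * y
(backward! (sum-tensor z))     ; reduce to a scalar, then backpropagate
(tensor-grad x)                ; expected dL/dx = y        = (4 5 6)
(tensor-grad y)                ; expected dL/dy = x + 2y   = (9 12 15)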
Linear Algebra Operations
[procedure] (matmul-op a b) -> tensor
Matrix multiplication using BLAS GEMM/GEMV operations with batch support. Supports:
- Matrix × Matrix
- Matrix × Vector
- Vector × Matrix
- Vector × Vector (dot product)
- Batched operations (implicit batching over first dimension)
; Standard matrix-vector multiplication
(define A (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(define b (make-tensor32 (f32vector 5.0 6.0) '(2)))
(define c (matmul-op A b)) ; 2×2 matrix times 2×1 vector = 2×1 vector

; Batch matrix multiplication
(define batch-A (make-tensor32 (make-f32vector 80) '(10 2 4))) ; 10 samples
(define W (make-tensor32 (make-f32vector 12) '(4 3)))
(define batch-result (matmul-op batch-A W)) ; Shape: (10, 2, 3)
Gradient: dL/dA = dL/dC · B^T, dL/dB = A^T · dL/dC
[procedure] (dot-op a b) -> tensor
Dot product (inner product) of two 1D vectors using BLAS DOT.
(define result (dot-op x y)) ; scalar result
Gradient: dL/da = (dL/dresult) · b, dL/db = (dL/dresult) · a
[procedure] (scale-op tensor scalar) -> tensor
Scalar multiplication using BLAS SCAL.
Gradient: dL/dtensor = scalar · dL/dresult
Reduction Operations
[procedure] (reduce-tensor tensor reducer #!key (compute-gradient #f)) -> tensor
Generic reduction operation that maintains gradient flow. The reducer function is applied to each element in the forward pass. An optional compute-gradient function specifies how gradients are distributed in the backward pass.
- tensor: input tensor to reduce
- reducer: function (element accumulator) -> new-accumulator
- compute-gradient: optional function (grad-out index value all-values) -> grad-in; if not provided, gradients are distributed uniformly (as for sum)
Returns a scalar tensor with the reduced value.
;; Sum all elements (uniform gradient distribution)
(define total (reduce-tensor x +))

;; Product of all elements (gradient uses product rule)
(define prod
  (reduce-tensor x *
                 compute-gradient:
                 (lambda (grad-out idx val all-values)
                   ;; d(prod)/dx_i = prod / x_i
                   (let ((prod (fold * 1.0 all-values)))
                     (if (> val 0.0)
                         (* grad-out (/ prod val))
                         0.0)))))

[procedure] (sum-tensor tensor) -> tensor
Sums all elements in the tensor. Gradient is distributed uniformly to all elements.
[procedure] (product-tensor tensor) -> tensor
Computes the product of all elements. The gradient uses the product rule: d(prod)/dx_i = prod / x_i.
[procedure] (mean-tensor tensor) -> tensor
Computes the mean (average) of all elements.
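As a quick illustration of the uniform gradient rule, the sketch below (assumed usage, not from the upstream documentation) backpropagates through mean-tensor; each input element should receive a gradient of 1/n.

;; Sketch: gradient of the mean is 1/n per element
(define v (make-tensor32 (f32vector 2.0 4.0 6.0 8.0) '(4)))
(define m (mean-tensor v))   ; scalar tensor with value 5.0
(backward! m)
(tensor-grad v)              ; expected: 0.25 for each of the 4 elements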
Tensor Manipulation Operations
[procedure] (slice-tensor tensor start length) -> tensor
Extracts a slice of a tensor along the first dimension. Gradients flow back correctly to the original tensor positions.
- tensor: input tensor with shape (n, ...)
- start: starting index (0-based)
- length: number of elements to extract
- Returns: tensor with shape (length, ...)
;; Slice a batch of data
(define batch-data (make-tensor32 (make-f32vector 100) '(10 10)))
(define mini-batch (slice-tensor batch-data 2 5)) ; Shape: (5, 10)

;; Gradients flow back to original positions
(backward! (sum-tensor mini-batch))
(tensor-grad batch-data) ; Only indices 2-6 have non-zero gradients

[procedure] (reshape tensor new-shape) -> tensor
Reshapes the tensor. Total number of elements must be preserved.
[procedure] (flatten-tensor tensor) -> tensor
Flattens a multi-dimensional tensor to 1D.
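A small sketch of reshape and flatten-tensor (assumed usage): the element count is preserved, only the shape metadata changes.

;; Sketch: reshaping and flattening (24 elements throughout)
(define t (make-tensor32 (make-f32vector 24) '(2 3 4)))
(define r (reshape t '(6 4)))       ; shape (6 4)
(define flat (flatten-tensor t))    ; shape (24)
(tensor-shape flat)                 ; => (24)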
Activation Functions
[procedure] (relu tensor) -> tensor
Rectified Linear Unit: max(0, x).
Gradient: 1 if x > 0, else 0
[procedure] (tanh-op tensor) -> tensor
Hyperbolic tangent activation.
Gradient: 1 - tanh^2(x)
[procedure] (sigmoid tensor) -> tensor
Sigmoid (logistic) activation: σ(x) = 1 / (1 + e^(-x)).
Gradient: σ(x) · (1 - σ(x))
[procedure] (sigmoid-stable tensor) -> tensor
Numerically stable sigmoid implementation for large negative values.
[procedure] (softmax x #!key (axis -1)) -> tensor
Softmax normalization with numerical stability and batch support.
Input shapes:
- 1D: (n_classes,) - standard softmax
- 2D: (batch_size, n_classes) - softmax along axis (default: -1 for last axis)
; Single sample
(define logits (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define probs (softmax logits)) ; Sums to 1.0

; Batch of samples
(define batch-logits (make-tensor32 (make-f32vector 60) '(20 3)))
(define batch-probs (softmax batch-logits axis: -1)) ; Each row sums to 1.0
Gradient: dL/dx = softmax(x) ⊙ (dL/dy - Σ(dL/dy ⊙ softmax(x)))
[procedure] (log-softmax x #!key (axis -1)) -> tensor
Log-softmax with batch support; more numerically stable than log(softmax(x)).
Input shapes:
- 1D: (n_classes,)
- 2D: (batch_size, n_classes) - log-softmax along axis
Gradient: dL/dx = dL/dy - exp(log_softmax(x)) · Σ(dL/dy)
[procedure] (leaky-relu tensor #!key (alpha 0.01)) -> tensor
Leaky ReLU: max(alpha * x, x).
[procedure] (softplus tensor #!key (beta 1.0)) -> tensor
Softplus activation: log(1 + e^(beta * x)) / beta.
[procedure] (gelu tensor) -> tensor
Gaussian Error Linear Unit activation using tanh approximation.
[procedure] (silu tensor) -> tensor
SiLU (Sigmoid Linear Unit) activation, also known as Swish: x * σ(x).
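For comparison, the sketch below (assumed usage) applies a few of these activations to the same input; note how each handles the negative region differently.

;; Sketch: comparing activations on the same input
(define a (make-tensor32 (f32vector -2.0 -0.5 0.0 0.5 2.0) '(5)))
(relu a)                    ; negative entries clamped to 0
(leaky-relu a alpha: 0.1)   ; negative entries scaled by alpha
(silu a)                    ; smooth, slightly negative near zero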
Loss Functions
[procedure] (mse-loss pred target #!key (reduction 'mean)) -> tensor
Mean Squared Error loss with batch support.
- pred: predictions tensor (any shape)
- target: target tensor (same shape as pred)
- reduction: 'mean (average over all elements) or 'sum
For batched inputs (batch_size, ...), computes loss per sample and reduces according to reduction parameter.
; Single sample
(define loss (mse-loss predictions targets))

; Batch of samples
(define batch-pred (make-tensor32 pred-data '(32 10)))
(define batch-target (make-tensor32 target-data '(32 10)))
(define batch-loss (mse-loss batch-pred batch-target reduction: 'mean))

[procedure] (cross-entropy-loss pred target #!key (reduction 'mean) (from-logits #f)) -> tensor
Cross-entropy loss with batch support.
- pred: predictions tensor
  - if from-logits is #f: probabilities (softmax already applied)
  - if from-logits is #t: logits (raw scores; log-softmax applied internally)
- target: target tensor
  - one-hot: same shape as pred
  - class indices: (batch_size,) with integer class labels
- reduction: 'mean (average over batch) or 'sum
- from-logits: if true, apply log-softmax to pred first
Input shapes:
- 1D pred (n_classes,): single sample
- 2D pred (batch_size, n_classes): batch of samples
; Single sample with one-hot target
(define loss (cross-entropy-loss probs target))

; Batch with one-hot targets
(define batch-probs (softmax logits axis: -1))
(define batch-loss (cross-entropy-loss batch-probs targets reduction: 'mean))

; Batch with class indices (more memory efficient)
(define class-indices (make-tensor32 (f32vector 0.0 2.0 1.0) '(3)))
(define batch-loss (cross-entropy-loss logits class-indices from-logits: #t reduction: 'mean))
Normalization Operations
[procedure] (rmsnorm x weight #!key (epsilon 1e-5)) -> tensor
Root Mean Square Layer Normalization with batch support.
Input shapes:
- 1D: (d_model,) - standard RMSNorm
- 2D: (batch_size, d_model) - RMSNorm applied to each batch element independently
Formula: output[i] = (x[i] / RMS(x)) * weight[i] where RMS(x) = sqrt(mean(x^2) + epsilon)
; Single vector
(define x (make-tensor32 (make-f32vector 512) '(512)))
(define gamma (make-tensor32 (make-f32vector 512 1.0) '(512)))
(define normalized (rmsnorm x gamma))

; Batch of vectors
(define batch-x (make-tensor32 (make-f32vector (* 32 512)) '(32 512)))
(define batch-norm (rmsnorm batch-x gamma)) ; Normalized per batch element

[procedure] (l2-normalize tensor #!key (axis #f) (epsilon 1e-8)) -> tensor
L2 normalization with axis support.
- axis: #f (normalize entire tensor) or integer (normalize along axis)
For 2D tensors:
- axis=0: normalize along rows (each column becomes unit vector)
- axis=1: normalize along columns (each row becomes unit vector)
; Normalize entire tensor
(define normalized (l2-normalize x))

; Normalize each row of a batch
(define batch (make-tensor32 (make-f32vector 200) '(10 20)))
(define row-normalized (l2-normalize batch axis: 1)) ; Each row has ||·||₂ = 1

[procedure] (cosine-similarity a b) -> tensor
Cosine similarity: (a · b) / (||a|| · ||b||).
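A minimal sketch of cosine-similarity (assumed usage):

;; Sketch: cosine similarity of two 2-element vectors
(define u (make-tensor32 (f32vector 1.0 0.0) '(2)))
(define w (make-tensor32 (f32vector 1.0 1.0) '(2)))
(cosine-similarity u w)   ; scalar tensor, approximately 0.7071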
Convolution Operations
[procedure] (conv2d input weight bias #!key (stride 1) (padding 0)) -> tensor
2D convolution using the im2col + GEMM algorithm with batch support.
- input: tensor of shape (C_in, H, W) or (N, C_in, H, W)
- weight: tensor of shape (C_out, C_in, KH, KW)
- bias: tensor of shape (C_out) or #f
- stride: stride for convolution (default 1)
- padding: zero-padding (default 0)
Input shapes:
- 3D: (C_in, H, W) - single image
- 4D: (N, C_in, H, W) - batch of images
Output shapes:
- 3D: (C_out, H_out, W_out)
- 4D: (N, C_out, H_out, W_out)
; Single image
(define img (make-tensor32 (make-f32vector (* 3 32 32)) '(3 32 32)))
(define output (conv2d img weights bias stride: 2 padding: 1))

; Batch of images
(define batch-imgs (make-tensor32 (make-f32vector (* 16 3 32 32)) '(16 3 32 32)))
(define batch-output (conv2d batch-imgs weights bias)) ; Shape: (16, C_out, H_out, W_out)
Gradient Operations
[procedure] (zero-grad! tensor) -> void
Sets all gradient values to zero.
[procedure] (backward! tensor) -> void
Computes gradients via reverse-mode automatic differentiation. Performs a topological sort and executes backward functions in the correct order. Detects cycles and raises an error if found.
[procedure] (add-to-grad! tensor delta) -> void
Accumulates delta into the tensor's gradient using BLAS AXPY.
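Putting these together, here is a minimal sketch (assumed usage) of one manual gradient computation; in practice, parameter updates are handled by the optimizers in nanograd-optimizer.

;; Sketch: zero gradients, backpropagate, inspect
(define w (make-tensor32 (f32vector 0.5 -0.3) '(2) requires-grad?: #t))
(define x (make-tensor32 (f32vector 1.0 2.0) '(2) requires-grad?: #f))
(define loss (sum-tensor (mul w x)))
(zero-grad! w)       ; clear any stale gradients
(backward! loss)     ; dL/dw = x = (1.0 2.0)
(tensor-grad w)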
Utility Functions
[procedure] (tensor->list tensor) -> list
Converts tensor data to a list.
[procedure] (print-tensor tensor) -> void
Pretty-prints tensor information including shape, dtype, data, and gradients.
[procedure] (vector-length-for-dtype vec dtype) -> integer
Returns the length of a vector based on its dtype.
nanograd-layer
Neural network layer abstractions and containers with batch processing support.
Layer Predicates
[procedure] (layer? obj) -> boolean
[procedure] (dense-layer? obj) -> boolean
[procedure] (conv2d-layer? obj) -> boolean
[procedure] (batch-norm-2d? obj) -> boolean
[procedure] (sequential? obj) -> boolean
[procedure] (flatten-layer? obj) -> boolean
Dense Layer
[procedure] (make-dense-layer input-size output-size #!key (activation (make-identity)) (use-bias #t) (dtype 'f32) (name "Dense")) -> layer
Creates a fully-connected (dense) layer with Xavier/Glorot initialization. Supports both single vectors and batches.
- input-size: number of input features
- output-size: number of output features
- activation: activation function object (default identity)
- use-bias: whether to include a bias term (default #t)
- dtype: 'f32 or 'f64 (default 'f32)
- name: layer name for debugging
Input shapes:
- 1D: (input_size,) → output: (output_size,)
- 2D: (batch_size, input_size) → output: (batch_size, output_size)
For 2D inputs, uses BLAS GEMM for efficient batch processing.
(define layer (make-dense-layer 784 128 activation: (make-relu) name: "Hidden1"))

; Single input
(define x (make-tensor32 (make-f32vector 784) '(784)))
(define output (forward layer x)) ; Shape: (128,)

; Batch input
(define batch-x (make-tensor32 (make-f32vector (* 32 784)) '(32 784)))
(define batch-output (forward layer batch-x)) ; Shape: (32, 128)
Convolutional Layer
[procedure] (make-conv2d-layer in-channels out-channels kernel-size #!key (stride 1) (padding 0) (activation (make-identity)) (dtype 'f32) (name "Conv2D")) -> layer
Creates a 2D convolutional layer with He initialization. Supports both single images and batches.
- in-channels: number of input channels
- out-channels: number of output channels
- kernel-size: size of the (square) convolution kernel
- stride: convolution stride (default 1)
- padding: zero-padding (default 0)
- activation: activation function object
- dtype: 'f32 or 'f64
- name: layer name
Input shapes:
- 3D: (C_in, H, W) - single image
- 4D: (N, C_in, H, W) - batch of images
Output shapes:
- 3D: (C_out, H_out, W_out)
- 4D: (N, C_out, H_out, W_out)
(define conv (make-conv2d-layer 3 32 3 stride: 1 padding: 1 activation: (make-relu)))

; Single image
(define img (make-tensor32 img-data '(3 32 32)))
(define features (forward conv img)) ; Shape: (32, 32, 32)

; Batch of images
(define batch (make-tensor32 batch-data '(16 3 32 32)))
(define batch-features (forward conv batch)) ; Shape: (16, 32, 32, 32)
Batch Normalization Layer
[procedure] (make-batch-norm-2d num-features #!key (epsilon 1e-5) (momentum 0.1) (dtype 'f32) (name "BatchNorm2d")) -> layer
Creates a 2D batch normalization layer. Normalizes activations across the batch dimension:
y = γ * (x - μ) / √(σ² + ε) + β
where μ and σ² are computed from the batch (training mode) or from running statistics (evaluation mode).
- num-features: number of channels (C)
- epsilon: small constant for numerical stability (default 1e-5)
- momentum: momentum for updating running statistics (default 0.1)
- dtype: 'f32 or 'f64 (default 'f32)
- name: layer name
Input shapes:
- 3D: (C, H, W) - treated as batch of 1
- 4D: (N, C, H, W) - standard batch
Output shapes: same as input
;; Create batch norm for 64 channels
(define bn (make-batch-norm-2d 64 epsilon: 1e-5 momentum: 0.1))

;; Training mode: uses batch statistics
(set-training-mode! bn #t)
(define normalized (forward bn input)) ; Input shape: (N, 64, H, W)

;; Evaluation mode: uses running statistics
(set-eval-mode! bn)
(define test-normalized (forward bn test-input)) ; Deterministic output
Batch normalization improves training stability and convergence by:
- Reducing internal covariate shift
- Allowing higher learning rates
- Acting as a form of regularization
- Making networks less sensitive to initialization
Key features:
- Learnable scale (gamma) and shift (beta) parameters
- Running mean and variance maintained for evaluation
- Automatic mode switching between training and evaluation
- Numerical stability with epsilon parameter
Flatten Layer
[procedure] (make-flatten #!key (name "Flatten")) -> layer
Creates a flatten layer that converts multi-dimensional tensors to 1D or 2D.
Input shapes and outputs:
- 4D: (N, C, H, W) → (N, C*H*W)
- 3D: (C, H, W) → (C*H*W)
- 2D: (N, features) → (N, features) (no change)
- 1D: (features,) → (features,) (no change)
(define flatten (make-flatten name: "Flatten"))

; Flatten batch of feature maps
(define features (make-tensor32 data '(32 64 8 8)))
(define flattened (forward flatten features)) ; Shape: (32, 4096)
Global Average Pooling
[procedure] (global-avg-pool2d input) -> tensor
Global average pooling over spatial dimensions with batch support. Reduces the spatial dimensions to 1x1 by averaging.
Input and output shapes:
- 3D: (C, H, W) → output (C,)
- 4D: (N, C, H, W) → output (N, C)
Gradient: Distributed uniformly over all spatial positions for each channel.
;; Single image
(define feature-maps (make-tensor32 (make-f32vector (* 128 8 8)) '(128 8 8)))
(define pooled (global-avg-pool2d feature-maps)) ; Shape: (128,)

;; Batch of images
(define batch-features (make-tensor32 (make-f32vector (* 32 128 8 8)) '(32 128 8 8)))
(define batch-pooled (global-avg-pool2d batch-features)) ; Shape: (32, 128)

;; Use in classification network
(define logits (forward fc-layer batch-pooled)) ; Shape: (32, num_classes)
Global average pooling is commonly used to replace large fully-connected layers:
- Reduces number of parameters dramatically
- Improves generalization
- Makes networks translation-invariant
- Standard in modern architectures (ResNet, MobileNet, EfficientNet)
Sequential Container
[procedure] (make-sequential layers #!key (name "Sequential")) -> layer
Creates a sequential container that chains multiple layers. Automatically handles batch propagation through all layers.
(define model
  (make-sequential
   (list (make-dense-layer 784 128 activation: (make-relu))
         (make-dense-layer 128 64 activation: (make-relu))
         (make-dense-layer 64 10 activation: (make-identity)))
   name: "MLP"))

; Works with both single and batch inputs
(define single-output (forward model single-input))
(define batch-output (forward model batch-input))
Layer Operations
[procedure] (forward layer input) -> tensor
Performs a forward pass through the layer. Automatically handles both single samples and batches based on the input shape.
[procedure] (parameters layer) -> list
Returns a list of all trainable parameter tensors.
[procedure] (zero-grad-layer! layer) -> void
Zeros gradients for all parameters in the layer.
[procedure] (set-training-mode! layer training?) -> void
Sets the training mode for the layer. When training? is #t, the layer uses training-specific behavior (e.g., batch statistics for batch norm). When #f, it uses evaluation behavior.
;; Set model to training mode
(set-training-mode! model #t)

;; Set model to evaluation mode
(set-training-mode! model #f)

[procedure] (set-eval-mode! layer) -> void
Shorthand for (set-training-mode! layer #f). Sets the layer to evaluation mode.
Training vs Evaluation Mode:
Training Mode ((set-training-mode! layer #t)):
- Batch normalization uses batch statistics (mean and variance computed from current batch)
- Dropout is active (if implemented)
- Stochastic behavior enabled
- Running statistics updated
Evaluation Mode ((set-eval-mode! layer)):
- Batch normalization uses running statistics (accumulated during training)
- Dropout is disabled
- Deterministic behavior
- Running statistics frozen
[procedure] (layer-output-size layer) -> integer or #f
[procedure] (layer-activation layer) -> activation
[procedure] (layer-name layer) -> string
Accessor functions for layer properties. Note: input/output sizes may be #f for layers with dynamic dimensions (e.g., flatten).
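A short sketch of the accessors (assumed usage; the exact values depend on how the layer was constructed):

;; Sketch: inspecting a layer's properties
(define hidden (make-dense-layer 784 128 activation: (make-relu) name: "Hidden1"))
(layer-name hidden)          ; => "Hidden1"
(layer-output-size hidden)   ; => 128
(layer-activation hidden)    ; => the ReLU activation object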
Activation Function Objects
[procedure] (make-relu) -> activation
[procedure] (make-tanh) -> activation
[procedure] (make-sigmoid) -> activation
[procedure] (make-gelu) -> activation
[procedure] (make-silu) -> activation
[procedure] (make-identity) -> activation
Creates activation function objects for use in layers.
[procedure] (activation? obj) -> boolean
[procedure] (activation-forward act x) -> tensor
[procedure] (activation-name act) -> string
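A minimal sketch of using an activation object directly, outside a layer (assumed usage):

;; Sketch: applying an activation object to a tensor
(define act (make-relu))
(activation? act)   ; => #t
(define out (activation-forward act (make-tensor32 (f32vector -1.0 2.0) '(2))))
(tensor->list out)  ; => (0.0 2.0)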
Utility Functions
[procedure] (print-layer layer #!optional (indent 0)) -> void
Prints layer information with optional indentation.
[procedure] (summary model) -> void
Prints a model summary including all layers and parameter counts.
nanograd-optimizer
Optimization algorithms for neural network training.
Optimizer Predicates
[procedure] (optimizer? obj) -> boolean
[procedure] (sgd? obj) -> boolean
[procedure] (adam? obj) -> boolean
[procedure] (rmsprop? obj) -> boolean
SGD Optimizer
[procedure] (make-sgd parameters #!key (learning-rate 0.01) (momentum 0.0) (weight-decay 0.0) (nesterov #f)) -> optimizer
Stochastic Gradient Descent optimizer with optional momentum and Nesterov acceleration.
- parameters: list of parameter tensors to optimize
- learning-rate: step size (default 0.01)
- momentum: momentum factor (default 0.0, no momentum)
- weight-decay: L2 regularization factor (default 0.0)
- nesterov: use Nesterov momentum (default #f)
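A minimal sketch of constructing and using an SGD optimizer with the options above (assumed usage):

;; Sketch: SGD with momentum over a model's parameters
(define model (make-dense-layer 10 1))
(define opt (make-sgd (parameters model)
                      learning-rate: 0.01
                      momentum: 0.9
                      weight-decay: 1e-4))
;; per step: compute the loss, call (backward! loss), then
(step! opt)
(zero-grad-layer! model)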
Adam Optimizer
[procedure] (make-adam parameters #!key (learning-rate 0.001) (beta1 0.9) (beta2 0.999) (epsilon 1e-8) (weight-decay 0.0)) -> optimizer
Adam (Adaptive Moment Estimation) optimizer with bias correction.
- beta1: exponential decay rate for the first moment (default 0.9)
- beta2: exponential decay rate for the second moment (default 0.999)
- epsilon: numerical stability constant (default 1e-8)
RMSprop Optimizer
[procedure] (make-rmsprop parameters #!key (learning-rate 0.01) (alpha 0.99) (epsilon 1e-8) (weight-decay 0.0) (momentum 0.0)) -> optimizer
RMSprop optimizer with optional momentum.
- alpha: smoothing constant (default 0.99)
Optimizer Operations
[procedure] (step! optimizer) -> void
Applies parameter updates based on accumulated gradients.
[procedure] (get-learning-rate optimizer) -> number
Returns the current learning rate.
[procedure] (set-learning-rate! optimizer lr) -> void
Updates the learning rate (useful for learning-rate scheduling).
[procedure] (optimizer-state optimizer) -> alist
Returns an association list of optimizer configuration parameters.
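These accessors make simple learning-rate schedules easy to express. The following is a sketch (assumed usage) of a step decay that halves the learning rate every 10 epochs:

;; Sketch: step decay via get-learning-rate / set-learning-rate!
(define (decay-lr! opt epoch)
  (when (and (> epoch 0) (zero? (modulo epoch 10)))
    (set-learning-rate! opt (* 0.5 (get-learning-rate opt)))))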
Examples
Batch Processing with Dense Layers
(import nanograd-autograd nanograd-layer)

;; Create a batch of inputs
(define batch-size 32)
(define input-dim 784)
(define batch-data (make-f32vector (* batch-size input-dim)))
;; Fill with data...
(define batch-input (make-tensor32 batch-data (list batch-size input-dim)))

;; Dense layer automatically handles batches
(define layer (make-dense-layer input-dim 128 activation: (make-relu)))
(define output (forward layer batch-input)) ; Shape: (32, 128)
Batched Softmax and Cross-Entropy
;; Batch of logits
(define batch-size 32)
(define num-classes 10)
(define logits (make-tensor32 (make-f32vector (* batch-size num-classes))
                              (list batch-size num-classes)))
(define targets (make-tensor32 target-data (list batch-size num-classes)))

;; Softmax along class dimension
(define probs (softmax logits axis: -1)) ; Each row sums to 1

;; Cross-entropy with batches
(define loss (cross-entropy-loss probs targets reduction: 'mean))

;; Alternative: use from-logits for stability
(define loss-stable (cross-entropy-loss logits targets from-logits: #t reduction: 'mean))
Training with Batches
(import nanograd-autograd nanograd-layer nanograd-optimizer)

;; Define model
(define model
  (make-sequential
   (list (make-dense-layer 784 256 activation: (make-relu))
         (make-dense-layer 256 128 activation: (make-relu))
         (make-dense-layer 128 10 activation: (make-identity)))
   name: "MLP"))

(define optimizer (make-adam (parameters model) learning-rate: 0.001))

;; Training loop with batches
(define (train-epoch train-batches)
  (set-training-mode! model #t)
  (for-each
   (lambda (batch)
     (let* ((x (car batch))  ; Shape: (batch_size, 784)
            (y (cdr batch))  ; Shape: (batch_size, 10)
            (logits (forward model x))
            (loss (cross-entropy-loss logits y from-logits: #t reduction: 'mean)))
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! model)))
   train-batches))

;; Evaluation
(define (evaluate test-batches)
  (set-eval-mode! model)
  ;; ... evaluation code ...
  )
Convolutional Network with Batch Normalization
(import nanograd-autograd nanograd-layer nanograd-optimizer)

;; CNN with batch support
(define cnn
  (make-sequential
   (list (make-conv2d-layer 3 32 3 stride: 1 padding: 1 activation: (make-identity))
         (make-batch-norm-2d 32) ; Normalizes across batch
         (make-conv2d-layer 32 64 3 stride: 1 padding: 1 activation: (make-identity))
         (make-batch-norm-2d 64)
         (make-flatten)
         (make-dense-layer (* 64 32 32) 128 activation: (make-relu))
         (make-dense-layer 128 10 activation: (make-identity)))
   name: "CNN"))

;; Process batch of images
(define batch-images (make-tensor32 image-data '(16 3 32 32))) ; 16 RGB images
(set-training-mode! cnn #t)
(define predictions (forward cnn batch-images)) ; Shape: (16, 10)
ResNet-Style Architecture
;; ResNet block with batch normalization
(define (make-resnet-block in-channels out-channels stride)
  (make-sequential
   (list (make-conv2d-layer in-channels out-channels 3
                            stride: stride padding: 1 activation: (make-identity))
         (make-batch-norm-2d out-channels)
         (make-conv2d-layer out-channels out-channels 3
                            stride: 1 padding: 1 activation: (make-identity))
         (make-batch-norm-2d out-channels))
   name: "ResBlock"))

;; Full model
(define resnet
  (make-sequential
   (list (make-conv2d-layer 3 64 7 stride: 2 padding: 3)
         (make-batch-norm-2d 64)
         (make-resnet-block 64 64 1)
         (make-resnet-block 64 128 2)
         (make-resnet-block 128 256 2)
         (make-resnet-block 256 512 2)
         (make-dense-layer 512 1000))
   name: "ResNet"))
Performance Notes
- NanoGrad uses BLAS for matrix operations, including batched GEMM
- Batch operations are significantly more efficient than processing samples individually
- Use f32 (32-bit) tensors when 64-bit precision is not required
- The framework detects computation graph cycles
- Batch normalization adds minimal overhead and significantly improves training
- Global average pooling reduces parameters without sacrificing performance
Batch Processing Best Practices
1. Always use batches during training for better performance and stable gradients (a manual batching sketch follows this list)
2. Set appropriate batch sizes (typically 16-256, depending on memory)
3. Use batch normalization for deeper networks (>10 layers)
4. Switch to eval mode during validation/testing to use running statistics
5. Prefer global average pooling over large fully-connected layers in CNNs
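Since batching is manual (see Limitations), a simple way to build mini-batches is to slice a pre-shuffled data tensor along its first dimension. The helper below is a sketch under that assumption, not part of the library:

;; Sketch: splitting a (n, features) tensor into mini-batches
(define (make-batches data n batch-size)
  (let loop ((start 0) (acc '()))
    (if (>= start n)
        (reverse acc)
        (let ((len (min batch-size (- n start))))
          (loop (+ start len)
                (cons (slice-tensor data start len) acc))))))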
Limitations
- CPU-only (no GPU support)
- No automatic batching (must manually create batches)
- Limited built-in layer types (dense, convolutional, batch norm)
- Single-threaded execution
- Batch normalization requires proper training/eval mode switching
Troubleshooting
Common Errors
Shape mismatch errors
Ensure tensor shapes are compatible for operations. For batched operations, the batch dimension should match.
; Batch size mismatch
(define x (make-tensor32 (make-f32vector 200) '(10 20)))
(define y (make-tensor32 (make-f32vector 300) '(15 20)))
(add x y) ; Error: shape mismatch
Batch normalization mode not set
Always explicitly set training/eval mode:
; Training
(set-training-mode! model #t)
(train-epoch model)

; Evaluation
(set-eval-mode! model)
(evaluate model)
Author
Repository
https://github.com/iraikov/nanograd
Version History
- 2.0: Batch processing support
  - Dense layers support 1D/2D inputs
  - Conv2D supports 3D/4D inputs
  - Batch normalization for 3D/4D inputs
  - Softmax/log-softmax with batch and axis support
  - Cross-entropy loss with batch reduction
  - RMSNorm with 1D/2D support
  - Global average pooling with 3D/4D support
  - L2-normalize with axis parameter
- 1.2: Additional operations
  - Reduction operations (sum-tensor, mean-tensor, product-tensor, reduce-tensor)
  - Tensor slicing (slice-tensor)
  - Batch normalization (make-batch-norm-2d)
  - Global average pooling (global-avg-pool2d)
  - Training/evaluation mode control
- 1.1: Bug fix in mul layer operation
- 1.0: Initial release
  - Core autograd engine
  - Dense and convolutional layers
  - SGD, Adam, and RMSprop optimizers
  - Basic activation and loss functions
See Also
License
LGPL-3
References
- PyTorch: Dynamic computation graphs, autograd design, and batch-first conventions
- micrograd: Minimalist autograd engine by Andrej Karpathy
- "Automatic Differentiation in Machine Learning: a Survey" (Baydin et al., 2018)
- "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (Ioffe & Szegedy, 2015)
- "BLAS (Basic Linear Algebra Subprograms)" documentation