[[tags: egg math ai machine-learning]]
[[toc:]]

== nanograd

A lightweight automatic differentiation and neural network framework for CHICKEN Scheme, featuring BLAS-accelerated operations, comprehensive batch processing support, and YASOS-based object abstractions.

=== Description

NanoGrad provides a complete framework for building and training neural networks with automatic differentiation. It features:

* Reverse-mode automatic differentiation with gradient computation
* Native batch processing support throughout the stack
* BLAS-accelerated linear algebra operations with batched GEMM
* YASOS-based polymorphic object system
* Support for both 32-bit and 64-bit floating-point precision
* Common neural network layers with 1D/2D input support (Dense) and 3D/4D support (Conv2D, BatchNorm2D)
* Common optimization algorithms (SGD, Adam, RMSprop)
* Batch-aware activation functions (Softmax, Log-Softmax) and loss functions
* Tensor manipulation with reduction operations and slicing
* Training/evaluation mode support for layers

=== Requirements

* [[yasos]]
* [[blas]]
* [[mathh]]
* [[srfi-1]]
* [[srfi-4]]
* [[srfi-42]]
* [[srfi-69]]

=== Modules

==== nanograd-autograd

Core automatic differentiation engine with tensor operations and batch support.

===== Tensor Constructors

<procedure>(make-tensor32 data shape #!key (requires-grad? #t)) -> tensor</procedure>

Creates a 32-bit floating-point tensor with automatic differentiation support.

; data : f32vector containing the tensor data
; shape : list of dimensions, e.g., '(2 3) for a 2x3 matrix or '(10 2 3) for a batch of 10 matrices
; requires-grad? : whether to track gradients (default #t)

<enscript highlight="scheme">
; Single vector
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3) requires-grad?: #t))

; Batch of vectors
(define batch (make-tensor32 (make-f32vector 60) '(10 6) requires-grad?: #t))
</enscript>

<procedure>(make-tensor64 data shape #!key (requires-grad? #t)) -> tensor</procedure>

Creates a 64-bit floating-point tensor with automatic differentiation support.

===== Tensor Predicates

<procedure>(tensor? obj) -> boolean</procedure>
<procedure>(tensor32? obj) -> boolean</procedure>
<procedure>(tensor64? obj) -> boolean</procedure>

Type predicates for tensors.

===== Tensor Accessors

<procedure>(tensor-data tensor) -> vector</procedure>

Returns the underlying f32vector or f64vector containing the tensor's data.

<procedure>(tensor-grad tensor) -> vector or #f</procedure>

Returns the gradient vector if gradients are enabled, #f otherwise.

<procedure>(tensor-shape tensor) -> list</procedure>

Returns the shape as a list of dimensions.

<procedure>(tensor-dtype tensor) -> symbol</procedure>

Returns the data type: 'f32 or 'f64.

<procedure>(tensor-requires-grad? tensor) -> boolean</procedure>

Returns #t if the tensor tracks gradients.

===== Arithmetic Operations

<procedure>(add a b) -> tensor</procedure>

Element-wise addition of tensors a and b. Both tensors must have the same shape and dtype.

<enscript highlight="scheme">
(define z (add x y))  ; z = x + y
</enscript>

Gradient: dL/da = dL/dz, dL/db = dL/dz

<procedure>(sub a b) -> tensor</procedure>

Element-wise subtraction: a - b.

Gradient: dL/da = dL/dz, dL/db = -dL/dz

<procedure>(mul a b) -> tensor</procedure>

Element-wise multiplication (Hadamard product).

Gradient: dL/da = dL/dz ⊙ b, dL/db = dL/dz ⊙ a

<procedure>(div a b) -> tensor</procedure>

Element-wise division: a / b.

Gradient: dL/da = dL/dz / b, dL/db = -dL/dz · (a / b²)

<procedure>(safe-div a b #!key (epsilon 1e-8)) -> tensor</procedure>

Safe element-wise division: a / (b + epsilon) to avoid division by zero.
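The element-wise operations compose into a computation graph that {{backward!}} (documented later in this section) can traverse. The following is a minimal sketch of that workflow; the gradient values in the comments follow from the gradient formulas above, with {{sum-tensor}} used to reduce the result to a scalar loss.

<enscript highlight="scheme">
(import srfi-4 nanograd-autograd)

;; Minimal sketch: build z = (x + y) * x element-wise, then backpropagate.
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define y (make-tensor32 (f32vector 4.0 5.0 6.0) '(3)))

(define z (mul (add x y) x))

;; Reduce to a scalar so there is a single loss to differentiate.
(define loss (sum-tensor z))
(backward! loss)

(tensor-grad x)  ; dL/dx = (x + y) + x, expected: 6.0 9.0 12.0
(tensor-grad y)  ; dL/dy = x,           expected: 1.0 2.0 3.0
</enscript>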
===== Linear Algebra Operations

<procedure>(matmul-op a b) -> tensor</procedure>

Matrix multiplication using BLAS GEMM/GEMV operations with batch support. Supports:

* Matrix × Matrix
* Matrix × Vector
* Vector × Matrix
* Vector × Vector (dot product)
* Batched operations (implicit batching over first dimension)

<enscript highlight="scheme">
; Standard matrix-vector multiplication
(define A (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(define b (make-tensor32 (f32vector 5.0 6.0) '(2)))
(define c (matmul-op A b))  ; 2×2 matrix times 2×1 vector = 2×1 vector

; Batch matrix multiplication
(define batch-A (make-tensor32 (make-f32vector 80) '(10 2 4)))  ; 10 samples
(define W (make-tensor32 (make-f32vector 12) '(4 3)))
(define batch-result (matmul-op batch-A W))  ; Shape: (10, 2, 3)
</enscript>

Gradient: dL/dA = dL/dC · B^T, dL/dB = A^T · dL/dC

<procedure>(dot-op a b) -> tensor</procedure>

Dot product (inner product) of two 1D vectors using BLAS DOT.

<enscript highlight="scheme">
(define result (dot-op x y))  ; scalar result
</enscript>

Gradient: dL/da = (dL/dresult) · b, dL/db = (dL/dresult) · a

<procedure>(scale-op tensor scalar) -> tensor</procedure>

Scalar multiplication using BLAS SCAL.

Gradient: dL/dtensor = scalar · dL/dresult

===== Reduction Operations

<procedure>(reduce-tensor tensor reducer #!key (compute-gradient #f)) -> tensor</procedure>

Generic reduction operation that maintains gradient flow. The {{reducer}} function is applied to each element in the forward pass. An optional {{compute-gradient}} function specifies how gradients are distributed in the backward pass.

; tensor : input tensor to reduce
; reducer : function (element accumulator) -> new-accumulator
; compute-gradient : optional function (grad-out index value all-values) -> grad-in. If not provided, assumes uniform distribution (like sum)

Returns a scalar tensor with the reduced value.

<enscript highlight="scheme">
;; Sum all elements (uniform gradient distribution)
(define total (reduce-tensor x +))

;; Product of all elements (gradient uses product rule)
(define prod
  (reduce-tensor x *
    compute-gradient:
    (lambda (grad-out idx val all-values)
      ;; d(prod)/dx_i = prod / x_i
      (let ((prod (fold * 1.0 all-values)))
        (if (> val 0.0)
            (* grad-out (/ prod val))
            0.0)))))
</enscript>

<procedure>(sum-tensor tensor) -> tensor</procedure>

Sums all elements in the tensor. Gradient is distributed uniformly to all elements.

<procedure>(product-tensor tensor) -> tensor</procedure>

Computes the product of all elements. Gradient uses the product rule: d(prod)/dx_i = prod / x_i.

<procedure>(mean-tensor tensor) -> tensor</procedure>

Computes the mean (average) of all elements.

===== Tensor Manipulation Operations

<procedure>(slice-tensor tensor start length) -> tensor</procedure>

Extracts a slice of a tensor along the first dimension. Gradients flow back correctly to the original tensor positions.

; tensor : input tensor with shape (n, ...)
; start : starting index (0-based)
; length : number of elements to extract
; Returns : tensor with shape (length, ...)

<enscript highlight="scheme">
;; Slice a batch of data
(define batch-data (make-tensor32 (make-f32vector 100) '(10 10)))
(define mini-batch (slice-tensor batch-data 2 5))  ; Shape: (5, 10)

;; Gradients flow back to original positions
(backward! (sum-tensor mini-batch))
(tensor-grad batch-data)  ; Only indices 2-6 have non-zero gradients
</enscript>

<procedure>(reshape tensor new-shape) -> tensor</procedure>

Reshapes the tensor. Total number of elements must be preserved.

<procedure>(flatten-tensor tensor) -> tensor</procedure>

Flattens a multi-dimensional tensor to 1D.
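As a small illustration of the shape-manipulation procedures above, the following sketch reshapes a 2×3 matrix and then flattens it; only the shape metadata changes, since the total number of elements is preserved.

<enscript highlight="scheme">
(import srfi-4 nanograd-autograd)

;; A 2x3 matrix...
(define m (make-tensor32 (f32vector 1.0 2.0 3.0 4.0 5.0 6.0) '(2 3)))

;; ...viewed as a 3x2 matrix (same 6 elements, new shape)
(define m2 (reshape m '(3 2)))
(tensor-shape m2)   ; => (3 2)

;; ...and flattened back to a vector
(define v (flatten-tensor m))
(tensor-shape v)    ; => (6)
</enscript>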
===== Activation Functions

<procedure>(relu tensor) -> tensor</procedure>

Rectified Linear Unit: max(0, x).

Gradient: 1 if x > 0, else 0

<procedure>(tanh-op tensor) -> tensor</procedure>

Hyperbolic tangent activation.

Gradient: 1 - tanh^2(x)

<procedure>(sigmoid tensor) -> tensor</procedure>

Sigmoid (logistic) activation: σ(x) = 1 / (1 + e^(-x)).

Gradient: σ(x) · (1 - σ(x))

<procedure>(sigmoid-stable tensor) -> tensor</procedure>

Numerically stable sigmoid implementation for large negative values.

<procedure>(softmax x #!key (axis -1)) -> tensor</procedure>

Softmax normalization with numerical stability and batch support.

Input shapes:
* 1D: (n_classes,) - standard softmax
* 2D: (batch_size, n_classes) - softmax along axis (default: -1 for last axis)

<enscript highlight="scheme">
; Single sample
(define logits (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define probs (softmax logits))  ; Sums to 1.0

; Batch of samples
(define batch-logits (make-tensor32 (make-f32vector 60) '(20 3)))
(define batch-probs (softmax batch-logits axis: -1))  ; Each row sums to 1.0
</enscript>

Gradient: dL/dx = softmax(x) ⊙ (dL/dy - Σ(dL/dy ⊙ softmax(x)))

<procedure>(log-softmax x #!key (axis -1)) -> tensor</procedure>

Log-softmax with batch support: more numerically stable than log(softmax(x)).

Input shapes:
* 1D: (n_classes,)
* 2D: (batch_size, n_classes) - log-softmax along axis

Gradient: dL/dx = dL/dy - exp(log_softmax(x)) · Σ(dL/dy)

<procedure>(leaky-relu tensor #!key (alpha 0.01)) -> tensor</procedure>

Leaky ReLU: max(alpha * x, x).

<procedure>(softplus tensor #!key (beta 1.0)) -> tensor</procedure>

Softplus activation: log(1 + e^(beta * x)) / beta.

<procedure>(gelu tensor) -> tensor</procedure>

Gaussian Error Linear Unit activation using the tanh approximation.

<procedure>(silu tensor) -> tensor</procedure>

SiLU (Sigmoid Linear Unit) activation, also known as Swish: x * σ(x).

===== Loss Functions

<procedure>(mse-loss pred target #!key (reduction 'mean)) -> tensor</procedure>

Mean Squared Error loss with batch support.

; pred : predictions tensor (any shape)
; target : target tensor (same shape as pred)
; reduction : 'mean (average over all elements) or 'sum

For batched inputs (batch_size, ...), computes loss per sample and reduces according to the reduction parameter.

<enscript highlight="scheme">
; Single sample
(define loss (mse-loss predictions targets))

; Batch of samples
(define batch-pred (make-tensor32 pred-data '(32 10)))
(define batch-target (make-tensor32 target-data '(32 10)))
(define batch-loss (mse-loss batch-pred batch-target reduction: 'mean))
</enscript>
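To show how a loss value drives gradient computation end to end, here is a minimal regression-style sketch on raw tensors (no layers or optimizer). The expected gradient in the comment assumes, as described above, that {{'mean}} reduction averages over all elements.

<enscript highlight="scheme">
(import srfi-4 nanograd-autograd)

;; MSE between an element-wise prediction w*x and a target t.
(define w (make-tensor32 (f32vector 0.5 0.5) '(2)))                     ; parameter
(define x (make-tensor32 (f32vector 1.0 2.0) '(2) requires-grad?: #f))  ; input
(define t (make-tensor32 (f32vector 2.0 4.0) '(2) requires-grad?: #f))  ; target

(define pred (mul w x))
(define loss (mse-loss pred t))   ; mean((pred - t)^2)

(backward! loss)
(tensor-grad w)  ; dL/dw_i = 2*(pred_i - t_i)*x_i / N, expected: -1.5 -6.0
</enscript>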
<procedure>(cross-entropy-loss pred target #!key (reduction 'mean) (from-logits #f)) -> tensor</procedure>

Cross-entropy loss with batch support.

; pred : predictions tensor. If from-logits=#f: probabilities (softmax already applied); if from-logits=#t: logits (raw scores, log-softmax applied internally)
; target : target tensor. One-hot: same shape as pred; class indices: (batch_size,) with integer class labels
; reduction : 'mean (average over batch) or 'sum
; from-logits : if true, apply log-softmax to pred first

Input shapes:
* 1D pred (n_classes,): single sample
* 2D pred (batch_size, n_classes): batch of samples

<enscript highlight="scheme">
; Single sample with one-hot target
(define loss (cross-entropy-loss probs target))

; Batch with one-hot targets
(define batch-probs (softmax logits axis: -1))
(define batch-loss (cross-entropy-loss batch-probs targets reduction: 'mean))

; Batch with class indices (more memory efficient)
(define class-indices (make-tensor32 (f32vector 0.0 2.0 1.0) '(3)))
(define batch-loss (cross-entropy-loss logits class-indices
                                       from-logits: #t
                                       reduction: 'mean))
</enscript>

===== Normalization Operations

<procedure>(rmsnorm x weight #!key (epsilon 1e-5)) -> tensor</procedure>

Root Mean Square Layer Normalization with batch support.

Input shapes:
* 1D: (d_model,) - standard RMSNorm
* 2D: (batch_size, d_model) - RMSNorm applied to each batch element independently

Formula: output[i] = (x[i] / RMS(x)) * weight[i] where RMS(x) = sqrt(mean(x^2) + epsilon)

<enscript highlight="scheme">
; Single vector
(define x (make-tensor32 (make-f32vector 512) '(512)))
(define gamma (make-tensor32 (make-f32vector 512 1.0) '(512)))
(define normalized (rmsnorm x gamma))

; Batch of vectors
(define batch-x (make-tensor32 (make-f32vector (* 32 512)) '(32 512)))
(define batch-norm (rmsnorm batch-x gamma))  ; Normalized per batch element
</enscript>

<procedure>(l2-normalize tensor #!key (axis #f) (epsilon 1e-8)) -> tensor</procedure>

L2 normalization with axis support.

; axis : #f (normalize entire tensor) or integer (normalize along axis)

For 2D tensors:
* axis=0: normalize along rows (each column becomes unit vector)
* axis=1: normalize along columns (each row becomes unit vector)

<enscript highlight="scheme">
; Normalize entire tensor
(define normalized (l2-normalize x))

; Normalize each row of a batch
(define batch (make-tensor32 (make-f32vector 200) '(10 20)))
(define row-normalized (l2-normalize batch axis: 1))  ; Each row has ||·||₂ = 1
</enscript>

<procedure>(cosine-similarity a b) -> tensor</procedure>

Cosine similarity: (a · b) / (||a|| · ||b||).

===== Convolution Operations

<procedure>(conv2d input weight bias #!key (stride 1) (padding 0)) -> tensor</procedure>

2D convolution using the im2col + GEMM algorithm with batch support.

; input : tensor of shape (C_in, H, W) or (N, C_in, H, W)
; weight : tensor of shape (C_out, C_in, KH, KW)
; bias : tensor of shape (C_out) or #f
; stride : stride for convolution (default 1)
; padding : zero-padding (default 0)

Input shapes:
* 3D: (C_in, H, W) - single image
* 4D: (N, C_in, H, W) - batch of images

Output shapes:
* 3D: (C_out, H_out, W_out)
* 4D: (N, C_out, H_out, W_out)

<enscript highlight="scheme">
; Single image
(define img (make-tensor32 (make-f32vector (* 3 32 32)) '(3 32 32)))
(define output (conv2d img weights bias stride: 2 padding: 1))

; Batch of images
(define batch-imgs (make-tensor32 (make-f32vector (* 16 3 32 32)) '(16 3 32 32)))
(define batch-output (conv2d batch-imgs weights bias))  ; Shape: (16, C_out, H_out, W_out)
</enscript>
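The spatial output size follows standard convolution arithmetic, H_out = floor((H + 2·padding - KH) / stride) + 1. A small helper like the one below (hypothetical, not part of the egg) can pre-compute H_out and W_out when sizing downstream layers.

<enscript highlight="scheme">
;; Hypothetical helper: spatial output size of a convolution.
(define (conv-output-size size kernel #!key (stride 1) (padding 0))
  (+ 1 (quotient (- (+ size (* 2 padding)) kernel) stride)))

;; 32x32 input, 3x3 kernel, stride 1, no padding
(conv-output-size 32 3)                        ; => 30
;; The single-image example above: stride 2, padding 1
(conv-output-size 32 3 stride: 2 padding: 1)   ; => 16
</enscript>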
===== Gradient Operations

<procedure>(zero-grad! tensor) -> void</procedure>

Sets all gradient values to zero.

<procedure>(backward! tensor) -> void</procedure>

Computes gradients via reverse-mode automatic differentiation. Performs topological sort and executes backward functions in correct order. Detects cycles and raises an error if found.

<procedure>(add-to-grad! tensor delta) -> void</procedure>

Accumulates delta into the tensor's gradient using BLAS AXPY.

===== Utility Functions

<procedure>(tensor->list tensor) -> list</procedure>

Converts tensor data to a list.

<procedure>(print-tensor tensor) -> void</procedure>

Pretty-prints tensor information including shape, dtype, data, and gradients.

<procedure>(vector-length-for-dtype vec dtype) -> integer</procedure>

Returns the length of a vector based on its dtype.

==== nanograd-layer

Neural network layer abstractions and containers with batch processing support.

===== Layer Predicates

<procedure>(layer? obj) -> boolean</procedure>
<procedure>(dense-layer? obj) -> boolean</procedure>
<procedure>(conv2d-layer? obj) -> boolean</procedure>
<procedure>(batch-norm-2d? obj) -> boolean</procedure>
<procedure>(sequential? obj) -> boolean</procedure>
<procedure>(flatten-layer? obj) -> boolean</procedure>

===== Dense Layer

<procedure>(make-dense-layer input-size output-size #!key (activation (make-identity)) (use-bias #t) (dtype 'f32) (name "Dense")) -> layer</procedure>

Creates a fully-connected (dense) layer with Xavier/Glorot initialization. Supports both single vectors and batches.

; input-size : number of input features
; output-size : number of output features
; activation : activation function object (default identity)
; use-bias : whether to include bias term (default #t)
; dtype : 'f32 or 'f64 (default 'f32)
; name : layer name for debugging

Input shapes:
* 1D: (input_size,) → output: (output_size,)
* 2D: (batch_size, input_size) → output: (batch_size, output_size)

For 2D inputs, uses BLAS GEMM for efficient batch processing.

<enscript highlight="scheme">
(define layer (make-dense-layer 784 128
                                activation: (make-relu)
                                name: "Hidden1"))

; Single input
(define x (make-tensor32 (make-f32vector 784) '(784)))
(define output (forward layer x))  ; Shape: (128,)

; Batch input
(define batch-x (make-tensor32 (make-f32vector (* 32 784)) '(32 784)))
(define batch-output (forward layer batch-x))  ; Shape: (32, 128)
</enscript>
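Each layer exposes its trainable tensors through {{parameters}} (documented under Layer Operations below), which is how they are later handed to an optimizer. A small sketch, assuming only the documented accessors; the order and layout of the returned tensors is implementation-defined.

<enscript highlight="scheme">
(import nanograd-autograd nanograd-layer)

;; A 64-bit dense layer; dtype applies to its trainable tensors.
(define tiny (make-dense-layer 4 2 dtype: 'f64 name: "Tiny"))

;; Inspect the trainable tensors (weight and, with use-bias #t, bias).
(for-each
 (lambda (p)
   (print (tensor-dtype p) " " (tensor-shape p)))
 (parameters tiny))
</enscript>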
===== Convolutional Layer

<procedure>(make-conv2d-layer in-channels out-channels kernel-size #!key (stride 1) (padding 0) (activation (make-identity)) (dtype 'f32) (name "Conv2D")) -> layer</procedure>

Creates a 2D convolutional layer with He initialization. Supports both single images and batches.

; in-channels : number of input channels
; out-channels : number of output channels
; kernel-size : size of convolution kernel (square)
; stride : convolution stride (default 1)
; padding : zero-padding (default 0)
; activation : activation function object
; dtype : 'f32 or 'f64
; name : layer name

Input shapes:
* 3D: (C_in, H, W) - single image
* 4D: (N, C_in, H, W) - batch of images

Output shapes:
* 3D: (C_out, H_out, W_out)
* 4D: (N, C_out, H_out, W_out)

<enscript highlight="scheme">
(define conv (make-conv2d-layer 3 32 3
                                stride: 1
                                padding: 1
                                activation: (make-relu)))

; Single image
(define img (make-tensor32 img-data '(3 32 32)))
(define features (forward conv img))  ; Shape: (32, 32, 32)

; Batch of images
(define batch (make-tensor32 batch-data '(16 3 32 32)))
(define batch-features (forward conv batch))  ; Shape: (16, 32, 32, 32)
</enscript>

===== Batch Normalization Layer

<procedure>(make-batch-norm-2d num-features #!key (epsilon 1e-5) (momentum 0.1) (dtype 'f32) (name "BatchNorm2d")) -> layer</procedure>

Creates a 2D batch normalization layer. Normalizes activations across the batch dimension:

y = γ * (x - μ) / √(σ² + ε) + β

where μ and σ² are computed from the batch (training mode) or from running statistics (evaluation mode).

; num-features : number of channels (C)
; epsilon : small constant for numerical stability (default 1e-5)
; momentum : momentum for updating running statistics (default 0.1)
; dtype : 'f32 or 'f64 (default 'f32)
; name : layer name

Input shapes:
* 3D: (C, H, W) - treated as batch of 1
* 4D: (N, C, H, W) - standard batch

Output shapes: same as input

<enscript highlight="scheme">
;; Create batch norm for 64 channels
(define bn (make-batch-norm-2d 64 epsilon: 1e-5 momentum: 0.1))

;; Training mode: uses batch statistics
(set-training-mode! bn #t)
(define normalized (forward bn input))  ; Input shape: (N, 64, H, W)

;; Evaluation mode: uses running statistics
(set-eval-mode! bn)
(define test-normalized (forward bn test-input))  ; Deterministic output
</enscript>

Batch normalization improves training stability and convergence by:

* Reducing internal covariate shift
* Allowing higher learning rates
* Acting as a form of regularization
* Making networks less sensitive to initialization

Key features:

* Learnable scale (gamma) and shift (beta) parameters
* Running mean and variance maintained for evaluation
* Automatic mode switching between training and evaluation
* Numerical stability with epsilon parameter

===== Flatten Layer

<procedure>(make-flatten #!key (name "Flatten")) -> layer</procedure>

Creates a flatten layer that converts multi-dimensional tensors to 1D or 2D.

Input shapes and outputs:
* 4D: (N, C, H, W) → (N, C*H*W)
* 3D: (C, H, W) → (C*H*W)
* 2D: (N, features) → (N, features) (no change)
* 1D: (features,) → (features,) (no change)

<enscript highlight="scheme">
(define flatten (make-flatten name: "Flatten"))

; Flatten batch of feature maps
(define features (make-tensor32 data '(32 64 8 8)))
(define flattened (forward flatten features))  ; Shape: (32, 4096)
</enscript>
===== Global Average Pooling

<procedure>(global-avg-pool2d input) -> tensor</procedure>

Global average pooling over spatial dimensions with batch support. Reduces spatial dimensions to 1x1 by averaging.

Input shapes:
* 3D: (C, H, W) → Output: (C,)
* 4D: (N, C, H, W) → Output: (N, C)

Gradient: Distributed uniformly over all spatial positions for each channel.

<enscript highlight="scheme">
;; Single image
(define feature-maps (make-tensor32 (make-f32vector (* 128 8 8)) '(128 8 8)))
(define pooled (global-avg-pool2d feature-maps))  ; Shape: (128,)

;; Batch of images
(define batch-features (make-tensor32 (make-f32vector (* 32 128 8 8)) '(32 128 8 8)))
(define batch-pooled (global-avg-pool2d batch-features))  ; Shape: (32, 128)

;; Use in classification network
(define logits (forward fc-layer batch-pooled))  ; Shape: (32, num_classes)
</enscript>

Global average pooling is commonly used to replace large fully-connected layers:

* Reduces number of parameters dramatically
* Improves generalization
* Makes networks translation-invariant
* Standard in modern architectures (ResNet, MobileNet, EfficientNet)

===== Sequential Container

<procedure>(make-sequential layers #!key (name "Sequential")) -> layer</procedure>

Creates a sequential container that chains multiple layers. Automatically handles batch propagation through all layers.

<enscript highlight="scheme">
(define model
  (make-sequential
   (list (make-dense-layer 784 128 activation: (make-relu))
         (make-dense-layer 128 64 activation: (make-relu))
         (make-dense-layer 64 10 activation: (make-identity)))
   name: "MLP"))

; Works with both single and batch inputs
(define single-output (forward model single-input))
(define batch-output (forward model batch-input))
</enscript>

===== Layer Operations

<procedure>(forward layer input) -> tensor</procedure>

Performs a forward pass through the layer. Automatically handles both single samples and batches based on input shape.

<procedure>(parameters layer) -> list</procedure>

Returns a list of all trainable parameter tensors.

<procedure>(zero-grad-layer! layer) -> void</procedure>

Zeros gradients for all parameters in the layer.

<procedure>(set-training-mode! layer training?) -> void</procedure>

Sets the training mode for the layer. When {{training?}} is #t, the layer uses training-specific behavior (e.g., batch statistics for batch norm). When #f, uses evaluation behavior.

<enscript highlight="scheme">
;; Set model to training mode
(set-training-mode! model #t)

;; Set model to evaluation mode
(set-training-mode! model #f)
</enscript>

<procedure>(set-eval-mode! layer) -> void</procedure>

Shorthand for {{(set-training-mode! layer #f)}}. Sets the layer to evaluation mode.

Training vs Evaluation Mode:

'''Training Mode''' ({{(set-training-mode! layer #t)}}):
* Batch normalization uses batch statistics (mean and variance computed from current batch)
* Dropout is active (if implemented)
* Stochastic behavior enabled
* Running statistics updated

'''Evaluation Mode''' ({{(set-eval-mode! layer)}}):
* Batch normalization uses running statistics (accumulated during training)
* Dropout is disabled
* Deterministic behavior
* Running statistics frozen

<procedure>(layer-input-size layer) -> integer or #f</procedure>
<procedure>(layer-output-size layer) -> integer or #f</procedure>
<procedure>(layer-activation layer) -> activation</procedure>
<procedure>(layer-name layer) -> string</procedure>

Accessor functions for layer properties. Note: input/output sizes may be #f for layers with dynamic dimensions (e.g., flatten).
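A short sketch of these accessors in use; the values shown in the comments follow from how the layers are constructed, while the exact activation name string is implementation-defined.

<enscript highlight="scheme">
(import nanograd-layer)

(define hidden (make-dense-layer 784 128 activation: (make-relu) name: "Hidden1"))

(layer-name hidden)                          ; => "Hidden1"
(layer-input-size hidden)                    ; => 784
(layer-output-size hidden)                   ; => 128
(activation-name (layer-activation hidden))  ; e.g. "ReLU"

(layer-input-size (make-flatten))            ; => #f (dynamic dimensions)
</enscript>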
===== Activation Function Objects

<procedure>(make-relu) -> activation</procedure>
<procedure>(make-tanh) -> activation</procedure>
<procedure>(make-sigmoid) -> activation</procedure>
<procedure>(make-gelu) -> activation</procedure>
<procedure>(make-silu) -> activation</procedure>
<procedure>(make-identity) -> activation</procedure>

Creates activation function objects for use in layers.

<procedure>(activation? obj) -> boolean</procedure>
<procedure>(activation-forward act x) -> tensor</procedure>
<procedure>(activation-name act) -> string</procedure>

===== Utility Functions

<procedure>(print-layer layer #!optional (indent 0)) -> void</procedure>

Prints layer information with optional indentation.

<procedure>(summary model) -> void</procedure>

Prints a model summary including all layers and parameter counts.

==== nanograd-optimizer

Optimization algorithms for neural network training.

===== Optimizer Predicates

<procedure>(optimizer? obj) -> boolean</procedure>
<procedure>(sgd? obj) -> boolean</procedure>
<procedure>(adam? obj) -> boolean</procedure>
<procedure>(rmsprop? obj) -> boolean</procedure>

===== SGD Optimizer

<procedure>(make-sgd parameters #!key (learning-rate 0.01) (momentum 0.0) (weight-decay 0.0) (nesterov #f)) -> optimizer</procedure>

Stochastic Gradient Descent optimizer with optional momentum and Nesterov acceleration.

; parameters : list of parameter tensors to optimize
; learning-rate : step size (default 0.01)
; momentum : momentum factor (default 0.0, no momentum)
; weight-decay : L2 regularization factor (default 0.0)
; nesterov : use Nesterov momentum (default #f)

===== Adam Optimizer

<procedure>(make-adam parameters #!key (learning-rate 0.001) (beta1 0.9) (beta2 0.999) (epsilon 1e-8) (weight-decay 0.0)) -> optimizer</procedure>

Adam (Adaptive Moment Estimation) optimizer with bias correction.

; beta1 : exponential decay rate for first moment (default 0.9)
; beta2 : exponential decay rate for second moment (default 0.999)
; epsilon : numerical stability constant (default 1e-8)

===== RMSprop Optimizer

<procedure>(make-rmsprop parameters #!key (learning-rate 0.01) (alpha 0.99) (epsilon 1e-8) (weight-decay 0.0) (momentum 0.0)) -> optimizer</procedure>

RMSprop optimizer with optional momentum.

; alpha : smoothing constant (default 0.99)

===== Optimizer Operations

<procedure>(step! optimizer) -> void</procedure>

Applies parameter updates based on accumulated gradients.

<procedure>(get-learning-rate optimizer) -> number</procedure>

Returns the current learning rate.

<procedure>(set-learning-rate! optimizer lr) -> void</procedure>

Updates the learning rate (useful for learning rate scheduling).

<procedure>(optimizer-state optimizer) -> alist</procedure>

Returns an association list of optimizer configuration parameters.
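Since {{get-learning-rate}} and {{set-learning-rate!}} are exposed, simple schedules can be layered on top of any optimizer. A minimal step-decay sketch; the helper name, decay factor, and interval are illustrative and not part of the egg.

<enscript highlight="scheme">
;; Hypothetical schedule: halve the learning rate every `interval` epochs.
(define (step-decay! optimizer epoch #!key (interval 10) (factor 0.5))
  (when (and (> epoch 0) (zero? (modulo epoch interval)))
    (set-learning-rate! optimizer
                        (* factor (get-learning-rate optimizer)))))

;; Usage inside a training loop:
;; (step-decay! optimizer epoch interval: 10 factor: 0.5)
</enscript>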
=== Examples

==== Batch Processing with Dense Layers

<enscript highlight="scheme">
(import nanograd-autograd nanograd-layer)

;; Create a batch of inputs
(define batch-size 32)
(define input-dim 784)
(define batch-data (make-f32vector (* batch-size input-dim)))
;; Fill with data...
(define batch-input (make-tensor32 batch-data (list batch-size input-dim)))

;; Dense layer automatically handles batches
(define layer (make-dense-layer input-dim 128 activation: (make-relu)))
(define output (forward layer batch-input))  ; Shape: (32, 128)
</enscript>

==== Batched Softmax and Cross-Entropy

<enscript highlight="scheme">
;; Batch of logits
(define batch-size 32)
(define num-classes 10)
(define logits (make-tensor32 (make-f32vector (* batch-size num-classes))
                              (list batch-size num-classes)))
(define targets (make-tensor32 target-data (list batch-size num-classes)))

;; Softmax along class dimension
(define probs (softmax logits axis: -1))  ; Each row sums to 1

;; Cross-entropy with batches
(define loss (cross-entropy-loss probs targets reduction: 'mean))

;; Alternative: use from-logits for stability
(define loss-stable (cross-entropy-loss logits targets
                                        from-logits: #t
                                        reduction: 'mean))
</enscript>

==== Training with Batches

<enscript highlight="scheme">
(import nanograd-autograd nanograd-layer nanograd-optimizer)

;; Define model
(define model
  (make-sequential
   (list (make-dense-layer 784 256 activation: (make-relu))
         (make-dense-layer 256 128 activation: (make-relu))
         (make-dense-layer 128 10 activation: (make-identity)))
   name: "MLP"))

(define optimizer (make-adam (parameters model) learning-rate: 0.001))

;; Training loop with batches
(define (train-epoch train-batches)
  (set-training-mode! model #t)
  (for-each
   (lambda (batch)
     (let* ((x (car batch))   ; Shape: (batch_size, 784)
            (y (cdr batch))   ; Shape: (batch_size, 10)
            (logits (forward model x))
            (loss (cross-entropy-loss logits y
                                      from-logits: #t
                                      reduction: 'mean)))
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! model)))
   train-batches))

;; Evaluation
(define (evaluate test-batches)
  (set-eval-mode! model)
  ;; ... evaluation code ...
  )
</enscript>
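One possible shape for the evaluation step, as a sketch only: it reuses the {{model}} from the example above and assumes one-hot targets stored row-major, using {{chop}} from (chicken base) and {{count}} from srfi-1 to compute batch accuracy. The {{argmax}} helper is hypothetical.

<enscript highlight="scheme">
(import (chicken base) srfi-1 nanograd-autograd nanograd-layer)

;; Hypothetical helper: index of the largest element in a list.
(define (argmax lst)
  (let loop ((xs (cdr lst)) (i 1) (best 0) (best-val (car lst)))
    (cond ((null? xs) best)
          ((> (car xs) best-val) (loop (cdr xs) (+ i 1) i (car xs)))
          (else (loop (cdr xs) (+ i 1) best best-val)))))

;; Sketch of an accuracy-based evaluation over (input . one-hot-target) batches.
(define (evaluate-accuracy test-batches num-classes)
  (set-eval-mode! model)
  (let loop ((batches test-batches) (correct 0) (total 0))
    (if (null? batches)
        (exact->inexact (/ correct (max total 1)))
        (let* ((x (caar batches))
               (y (cdar batches))
               (pred-rows (chop (tensor->list (forward model x)) num-classes))
               (target-rows (chop (tensor->list y) num-classes)))
          (loop (cdr batches)
                (+ correct (count (lambda (p t) (= (argmax p) (argmax t)))
                                  pred-rows target-rows))
                (+ total (length pred-rows)))))))
</enscript>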
==== Convolutional Network with Batch Normalization

<enscript highlight="scheme">
(import nanograd-autograd nanograd-layer nanograd-optimizer)

;; CNN with batch support
(define cnn
  (make-sequential
   (list (make-conv2d-layer 3 32 3 stride: 1 padding: 1
                            activation: (make-identity))
         (make-batch-norm-2d 32)   ; Normalizes across batch
         (make-conv2d-layer 32 64 3 stride: 1 padding: 1
                            activation: (make-identity))
         (make-batch-norm-2d 64)
         (make-flatten)
         (make-dense-layer (* 64 32 32) 128 activation: (make-relu))
         (make-dense-layer 128 10 activation: (make-identity)))
   name: "CNN"))

;; Process batch of images
(define batch-images (make-tensor32 image-data '(16 3 32 32)))  ; 16 RGB images
(set-training-mode! cnn #t)
(define predictions (forward cnn batch-images))  ; Shape: (16, 10)
</enscript>

==== ResNet-Style Architecture

<enscript highlight="scheme">
;; ResNet block with batch normalization
(define (make-resnet-block in-channels out-channels stride)
  (make-sequential
   (list (make-conv2d-layer in-channels out-channels 3
                            stride: stride padding: 1
                            activation: (make-identity))
         (make-batch-norm-2d out-channels)
         (make-conv2d-layer out-channels out-channels 3
                            stride: 1 padding: 1
                            activation: (make-identity))
         (make-batch-norm-2d out-channels))
   name: "ResBlock"))

;; Full model
(define resnet
  (make-sequential
   (list (make-conv2d-layer 3 64 7 stride: 2 padding: 3)
         (make-batch-norm-2d 64)
         (make-resnet-block 64 64 1)
         (make-resnet-block 64 128 2)
         (make-resnet-block 128 256 2)
         (make-resnet-block 256 512 2)
         (make-dense-layer 512 1000))
   name: "ResNet"))
</enscript>

=== Performance Notes

* NanoGrad uses BLAS for matrix operations, including batched GEMM
* Batch operations are significantly more efficient than processing samples individually
* Use f32 (32-bit) tensors when 64-bit precision is not required
* The framework detects computation graph cycles
* Batch normalization adds minimal overhead and significantly improves training
* Global average pooling reduces parameters without sacrificing performance

=== Batch Processing Best Practices

1. '''Always use batches during training''' for better performance and stable gradients
2. '''Set appropriate batch sizes''' (typically 16-256 depending on memory)
3. '''Use batch normalization''' for deeper networks (>10 layers)
4. '''Switch to eval mode''' during validation/testing to use running statistics
5. '''Prefer global average pooling''' over large fully-connected layers in CNNs

=== Limitations

* CPU-only (no GPU support)
* No automatic batching (must manually create batches)
* Limited built-in layer types (dense, convolutional, batch norm)
* Single-threaded execution
* Batch normalization requires proper training/eval mode switching

=== Troubleshooting

==== Common Errors

'''Shape mismatch errors'''

Ensure tensor shapes are compatible for operations. For batched operations, the batch dimension should match.

<enscript highlight="scheme">
; Batch size mismatch
(define x (make-tensor32 (make-f32vector 200) '(10 20)))
(define y (make-tensor32 (make-f32vector 300) '(15 20)))
(add x y)  ; Error: shape mismatch
</enscript>

'''Batch normalization mode not set'''

Always explicitly set training/eval mode:

<enscript highlight="scheme">
; Training
(set-training-mode! model #t)
(train-epoch model)

; Evaluation
(set-eval-mode! model)
(evaluate model)
</enscript>
=== Author

[[https://github.com/iraikov|Ivan Raikov]]

=== Repository

[[https://github.com/iraikov/nanograd|https://github.com/iraikov/nanograd]]

=== Version History

; 2.0 : Batch processing support
* Dense layers support 1D/2D inputs
* Conv2D supports 3D/4D inputs
* Batch normalization for 3D/4D inputs
* Softmax/log-softmax with batch and axis support
* Cross-entropy loss with batch reduction
* RMSNorm with 1D/2D support
* Global average pooling with 3D/4D support
* L2-normalize with axis parameter

; 1.2 : Additional operations
* Reduction operations (sum-tensor, mean-tensor, product-tensor, reduce-tensor)
* Tensor slicing (slice-tensor)
* Batch normalization (make-batch-norm-2d)
* Global average pooling (global-avg-pool2d)
* Training/evaluation mode control

; 1.1 : Bug fix in mul layer operation

; 1.0 : Initial release
* Core autograd engine
* Dense and convolutional layers
* SGD, Adam, and RMSprop optimizers
* Basic activation and loss functions

=== See Also

* [[blas|BLAS bindings]]
* [[yasos|YASOS object system]]
* [[mathh|Extended math functions]]

=== License

LGPL-3

=== References

* PyTorch: dynamic computation graphs, autograd design, and batch-first conventions
* micrograd: minimalist autograd engine by Andrej Karpathy
* "Automatic Differentiation in Machine Learning: a Survey" (Baydin et al., 2018)
* "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (Ioffe & Szegedy, 2015)
* BLAS (Basic Linear Algebra Subprograms) documentation