[[tags: egg math ai machine-learning]]
[[toc:]]

== nanograd

A lightweight automatic differentiation and neural network framework for CHICKEN Scheme, featuring BLAS-accelerated operations, comprehensive batch processing support, and YASOS-based object abstractions.

=== Description

NanoGrad provides a complete framework for building and training neural networks with automatic differentiation. It features:

* Reverse-mode automatic differentiation with gradient computation
* Native batch processing support throughout the stack
* BLAS-accelerated linear algebra operations with batched GEMM
* YASOS-based polymorphic object system
* Support for both 32-bit and 64-bit floating-point precision
* Common neural network layers with 1D/2D input support (Dense) and 3D/4D support (Conv2D, BatchNorm2D)
* Common optimization algorithms (SGD, Adam, RMSprop)
* Batch-aware activation functions (Softmax, Log-Softmax) and loss functions
* Tensor manipulation with reduction operations and slicing
* Training/evaluation mode support for layers

=== Requirements

* [[yasos]]
* [[blas]]
* [[mathh]]
* [[srfi-1]]
* [[srfi-4]]
* [[srfi-42]]
* [[srfi-69]]

=== Modules

==== nanograd-autograd

Core automatic differentiation engine with tensor operations and batch support.

===== Tensor Constructors

<procedure>(make-tensor32 data shape #!key (requires-grad? #t)) -> tensor</procedure>

Creates a 32-bit floating-point tensor with automatic differentiation support.

; data : f32vector containing the tensor data
; shape : list of dimensions, e.g., '(2 3) for a 2x3 matrix or '(10 2 3) for a batch of 10 matrices
; requires-grad? : whether to track gradients (default #t)

<enscript highlight="scheme">
; Single vector
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3) requires-grad?: #t))

; Batch of vectors
(define batch (make-tensor32 (make-f32vector 60) '(10 6) requires-grad?: #t))
</enscript>

<procedure>(make-tensor64 data shape #!key (requires-grad? #t)) -> tensor</procedure>

Creates a 64-bit floating-point tensor with automatic differentiation support.

===== Tensor Predicates

<procedure>(tensor? obj) -> boolean</procedure>
<procedure>(tensor32? obj) -> boolean</procedure>
<procedure>(tensor64? obj) -> boolean</procedure>

Type predicates for tensors.

===== Tensor Accessors

<procedure>(tensor-data tensor) -> vector</procedure>

Returns the underlying f32vector or f64vector containing the tensor's data.

<procedure>(tensor-grad tensor) -> vector or #f</procedure>

Returns the gradient vector if gradients are enabled, #f otherwise.

<procedure>(tensor-shape tensor) -> list</procedure>

Returns the shape as a list of dimensions.

<procedure>(tensor-dtype tensor) -> symbol</procedure>

Returns the data type: 'f32 or 'f64.

<procedure>(tensor-requires-grad? tensor) -> boolean</procedure>

Returns #t if the tensor tracks gradients.

===== Arithmetic Operations

<procedure>(add a b) -> tensor</procedure>

Element-wise addition of tensors a and b. Both tensors must have the same shape and dtype.

<enscript highlight="scheme">
(define z (add x y))  ; z = x + y
</enscript>

Gradient: dL/da = dL/dz, dL/db = dL/dz

<procedure>(sub a b) -> tensor</procedure>

Element-wise subtraction: a - b.

Gradient: dL/da = dL/dz, dL/db = -dL/dz

<procedure>(mul a b) -> tensor</procedure>

Element-wise multiplication (Hadamard product).

Gradient: dL/da = dL/dz ⊙ b, dL/db = dL/dz ⊙ a

<procedure>(div a b) -> tensor</procedure>

Element-wise division: a / b.

Gradient: dL/da = dL/dz / b, dL/db = -dL/dz · (a / b²)

<procedure>(safe-div a b #!key (epsilon 1e-8)) -> tensor</procedure>

Safe element-wise division: a / (b + epsilon) to avoid division by zero.
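The element-wise operations compose into a computation graph that {{backward!}} (documented later in this section) can traverse. The following is a minimal sketch of that workflow; the gradient values in the comments follow from the gradient formulas above, with {{sum-tensor}} used to reduce the result to a scalar loss.

<enscript highlight="scheme">
(import srfi-4 nanograd-autograd)

;; Minimal sketch: build z = (x + y) * x element-wise, then backpropagate.
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define y (make-tensor32 (f32vector 4.0 5.0 6.0) '(3)))

(define z (mul (add x y) x))

;; Reduce to a scalar so there is a single loss to differentiate.
(define loss (sum-tensor z))
(backward! loss)

(tensor-grad x)  ; dL/dx = (x + y) + x, expected: 6.0 9.0 12.0
(tensor-grad y)  ; dL/dy = x,           expected: 1.0 2.0 3.0
</enscript>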
===== Linear Algebra Operations

<procedure>(matmul-op a b) -> tensor</procedure>

Matrix multiplication using BLAS GEMM/GEMV operations with batch support. Supports:

* Matrix × Matrix
* Matrix × Vector
* Vector × Matrix
* Vector × Vector (dot product)
* Batched operations (implicit batching over first dimension)

<enscript highlight="scheme">
; Standard matrix-vector multiplication
(define A (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(define b (make-tensor32 (f32vector 5.0 6.0) '(2)))
(define c (matmul-op A b))  ; 2×2 matrix times 2×1 vector = 2×1 vector

; Batch matrix multiplication
(define batch-A (make-tensor32 (make-f32vector 80) '(10 2 4)))  ; 10 samples
(define W (make-tensor32 (make-f32vector 12) '(4 3)))
(define batch-result (matmul-op batch-A W))  ; Shape: (10, 2, 3)
</enscript>

Gradient: dL/dA = dL/dC · B^T, dL/dB = A^T · dL/dC

<procedure>(dot-op a b) -> tensor</procedure>

Dot product (inner product) of two 1D vectors using BLAS DOT.

<enscript highlight="scheme">
(define result (dot-op x y))  ; scalar result
</enscript>

Gradient: dL/da = (dL/dresult) · b, dL/db = (dL/dresult) · a

<procedure>(scale-op tensor scalar) -> tensor</procedure>

Scalar multiplication using BLAS SCAL.

Gradient: dL/dtensor = scalar · dL/dresult

===== Reduction Operations

<procedure>(reduce-tensor tensor reducer #!key (compute-gradient #f)) -> tensor</procedure>

Generic reduction operation that maintains gradient flow. The {{reducer}} function is applied to each element in the forward pass. An optional {{compute-gradient}} function specifies how gradients are distributed in the backward pass.

; tensor : input tensor to reduce
; reducer : function (element accumulator) -> new-accumulator
; compute-gradient : optional function (grad-out index value all-values) -> grad-in. If not provided, assumes uniform distribution (like sum)

Returns a scalar tensor with the reduced value.

<enscript highlight="scheme">
;; Sum all elements (uniform gradient distribution)
(define total (reduce-tensor x +))

;; Product of all elements (gradient uses product rule)
(define prod
  (reduce-tensor x *
    compute-gradient:
    (lambda (grad-out idx val all-values)
      ;; d(prod)/dx_i = prod / x_i
      (let ((prod (fold * 1.0 all-values)))
        (if (> val 0.0)
            (* grad-out (/ prod val))
            0.0)))))
</enscript>

<procedure>(sum-tensor tensor) -> tensor</procedure>

Sums all elements in the tensor. Gradient is distributed uniformly to all elements.

<procedure>(product-tensor tensor) -> tensor</procedure>

Computes the product of all elements. Gradient uses the product rule: d(prod)/dx_i = prod / x_i.

<procedure>(mean-tensor tensor) -> tensor</procedure>

Computes the mean (average) of all elements.

===== Tensor Manipulation Operations

<procedure>(slice-tensor tensor start length) -> tensor</procedure>

Extracts a slice of a tensor along the first dimension. Gradients flow back correctly to the original tensor positions.

; tensor : input tensor with shape (n, ...)
; start : starting index (0-based)
; length : number of elements to extract
; Returns : tensor with shape (length, ...)

<enscript highlight="scheme">
;; Slice a batch of data
(define batch-data (make-tensor32 (make-f32vector 100) '(10 10)))
(define mini-batch (slice-tensor batch-data 2 5))  ; Shape: (5, 10)

;; Gradients flow back to original positions
(backward! (sum-tensor mini-batch))
(tensor-grad batch-data)  ; Only indices 2-6 have non-zero gradients
</enscript>

<procedure>(reshape tensor new-shape) -> tensor</procedure>

Reshapes the tensor. Total number of elements must be preserved.

<procedure>(flatten-tensor tensor) -> tensor</procedure>

Flattens a multi-dimensional tensor to 1D.
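As a small illustration of the shape-manipulation procedures above, the following sketch reshapes a 2×3 matrix and then flattens it; only the shape metadata changes, since the total number of elements is preserved.

<enscript highlight="scheme">
(import srfi-4 nanograd-autograd)

;; A 2x3 matrix...
(define m (make-tensor32 (f32vector 1.0 2.0 3.0 4.0 5.0 6.0) '(2 3)))

;; ...viewed as a 3x2 matrix (same 6 elements, new shape)
(define m2 (reshape m '(3 2)))
(tensor-shape m2)   ; => (3 2)

;; ...and flattened back to a vector
(define v (flatten-tensor m))
(tensor-shape v)    ; => (6)
</enscript>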
===== Activation Functions

<procedure>(relu tensor) -> tensor</procedure>

Rectified Linear Unit: max(0, x).

Gradient: 1 if x > 0, else 0

<procedure>(tanh-op tensor) -> tensor</procedure>

Hyperbolic tangent activation.

Gradient: 1 - tanh^2(x)

<procedure>(sigmoid tensor) -> tensor</procedure>

Sigmoid (logistic) activation: σ(x) = 1 / (1 + e^(-x)).

Gradient: σ(x) · (1 - σ(x))

<procedure>(sigmoid-stable tensor) -> tensor</procedure>

Numerically stable sigmoid implementation for large negative values.

<procedure>(softmax x #!key (axis -1)) -> tensor</procedure>

Softmax normalization with numerical stability and batch support.

Input shapes:
* 1D: (n_classes,) - standard softmax
* 2D: (batch_size, n_classes) - softmax along axis (default: -1 for last axis)

<enscript highlight="scheme">
; Single sample
(define logits (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define probs (softmax logits))  ; Sums to 1.0

; Batch of samples
(define batch-logits (make-tensor32 (make-f32vector 60) '(20 3)))
(define batch-probs (softmax batch-logits axis: -1))  ; Each row sums to 1.0
</enscript>

Gradient: dL/dx = softmax(x) ⊙ (dL/dy - Σ(dL/dy ⊙ softmax(x)))

<procedure>(log-softmax x #!key (axis -1)) -> tensor</procedure>

Log-softmax with batch support: more numerically stable than log(softmax(x)).

Input shapes:
* 1D: (n_classes,)
* 2D: (batch_size, n_classes) - log-softmax along axis

Gradient: dL/dx = dL/dy - exp(log_softmax(x)) · Σ(dL/dy)

<procedure>(leaky-relu tensor #!key (alpha 0.01)) -> tensor</procedure>

Leaky ReLU: max(alpha * x, x).

<procedure>(softplus tensor #!key (beta 1.0)) -> tensor</procedure>

Softplus activation: log(1 + e^(beta * x)) / beta.

<procedure>(gelu tensor) -> tensor</procedure>

Gaussian Error Linear Unit activation using the tanh approximation.

<procedure>(silu tensor) -> tensor</procedure>

SiLU (Sigmoid Linear Unit) activation, also known as Swish: x * σ(x).

===== Loss Functions

<procedure>(mse-loss pred target #!key (reduction 'mean)) -> tensor</procedure>

Mean Squared Error loss with batch support.

; pred : predictions tensor (any shape)
; target : target tensor (same shape as pred)
; reduction : 'mean (average over all elements) or 'sum

For batched inputs (batch_size, ...), computes loss per sample and reduces according to the reduction parameter.

<enscript highlight="scheme">
; Single sample
(define loss (mse-loss predictions targets))

; Batch of samples
(define batch-pred (make-tensor32 pred-data '(32 10)))
(define batch-target (make-tensor32 target-data '(32 10)))
(define batch-loss (mse-loss batch-pred batch-target reduction: 'mean))
</enscript>
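To show how a loss value drives gradient computation end to end, here is a minimal regression-style sketch on raw tensors (no layers or optimizer). The expected gradient in the comment assumes, as described above, that {{'mean}} reduction averages over all elements.

<enscript highlight="scheme">
(import srfi-4 nanograd-autograd)

;; MSE between an element-wise prediction w*x and a target t.
(define w (make-tensor32 (f32vector 0.5 0.5) '(2)))                     ; parameter
(define x (make-tensor32 (f32vector 1.0 2.0) '(2) requires-grad?: #f))  ; input
(define t (make-tensor32 (f32vector 2.0 4.0) '(2) requires-grad?: #f))  ; target

(define pred (mul w x))
(define loss (mse-loss pred t))   ; mean((pred - t)^2)

(backward! loss)
(tensor-grad w)  ; dL/dw_i = 2*(pred_i - t_i)*x_i / N, expected: -1.5 -6.0
</enscript>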
<procedure>(cross-entropy-loss pred target #!key (reduction 'mean) (from-logits #f)) -> tensor</procedure>

Cross-entropy loss with batch support.

; pred : predictions tensor. If from-logits=#f: probabilities (softmax already applied); if from-logits=#t: logits (raw scores, log-softmax applied internally)
; target : target tensor. One-hot: same shape as pred; class indices: (batch_size,) with integer class labels
; reduction : 'mean (average over batch) or 'sum
; from-logits : if true, apply log-softmax to pred first

Input shapes:
* 1D pred (n_classes,): single sample
* 2D pred (batch_size, n_classes): batch of samples

<enscript highlight="scheme">
; Single sample with one-hot target
(define loss (cross-entropy-loss probs target))

; Batch with one-hot targets
(define batch-probs (softmax logits axis: -1))
(define batch-loss (cross-entropy-loss batch-probs targets reduction: 'mean))

; Batch with class indices (more memory efficient)
(define class-indices (make-tensor32 (f32vector 0.0 2.0 1.0) '(3)))
(define batch-loss (cross-entropy-loss logits class-indices
                                       from-logits: #t
                                       reduction: 'mean))
</enscript>

===== Normalization Operations

<procedure>(rmsnorm x weight #!key (epsilon 1e-5)) -> tensor</procedure>

Root Mean Square Layer Normalization with batch support.

Input shapes:
* 1D: (d_model,) - standard RMSNorm
* 2D: (batch_size, d_model) - RMSNorm applied to each batch element independently

Formula: output[i] = (x[i] / RMS(x)) * weight[i] where RMS(x) = sqrt(mean(x^2) + epsilon)

<enscript highlight="scheme">
; Single vector
(define x (make-tensor32 (make-f32vector 512) '(512)))
(define gamma (make-tensor32 (make-f32vector 512 1.0) '(512)))
(define normalized (rmsnorm x gamma))

; Batch of vectors
(define batch-x (make-tensor32 (make-f32vector (* 32 512)) '(32 512)))
(define batch-norm (rmsnorm batch-x gamma))  ; Normalized per batch element
</enscript>

<procedure>(l2-normalize tensor #!key (axis #f) (epsilon 1e-8)) -> tensor</procedure>

L2 normalization with axis support.

; axis : #f (normalize entire tensor) or integer (normalize along axis)

For 2D tensors:
* axis=0: normalize along rows (each column becomes unit vector)
* axis=1: normalize along columns (each row becomes unit vector)

<enscript highlight="scheme">
; Normalize entire tensor
(define normalized (l2-normalize x))

; Normalize each row of a batch
(define batch (make-tensor32 (make-f32vector 200) '(10 20)))
(define row-normalized (l2-normalize batch axis: 1))  ; Each row has ||·||₂ = 1
</enscript>

<procedure>(cosine-similarity a b) -> tensor</procedure>

Cosine similarity: (a · b) / (||a|| · ||b||).

===== Convolution Operations

<procedure>(conv2d input weight bias #!key (stride 1) (padding 0)) -> tensor</procedure>

2D convolution using the im2col + GEMM algorithm with batch support.

; input : tensor of shape (C_in, H, W) or (N, C_in, H, W)
; weight : tensor of shape (C_out, C_in, KH, KW)
; bias : tensor of shape (C_out) or #f
; stride : stride for convolution (default 1)
; padding : zero-padding (default 0)

Input shapes:
* 3D: (C_in, H, W) - single image
* 4D: (N, C_in, H, W) - batch of images

Output shapes:
* 3D: (C_out, H_out, W_out)
* 4D: (N, C_out, H_out, W_out)

<enscript highlight="scheme">
; Single image
(define img (make-tensor32 (make-f32vector (* 3 32 32)) '(3 32 32)))
(define output (conv2d img weights bias stride: 2 padding: 1))

; Batch of images
(define batch-imgs (make-tensor32 (make-f32vector (* 16 3 32 32)) '(16 3 32 32)))
(define batch-output (conv2d batch-imgs weights bias))  ; Shape: (16, C_out, H_out, W_out)
</enscript>
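The spatial output size follows standard convolution arithmetic, H_out = floor((H + 2·padding - KH) / stride) + 1. A small helper like the one below (hypothetical, not part of the egg) can pre-compute H_out and W_out when sizing downstream layers.

<enscript highlight="scheme">
;; Hypothetical helper: spatial output size of a convolution.
(define (conv-output-size size kernel #!key (stride 1) (padding 0))
  (+ 1 (quotient (- (+ size (* 2 padding)) kernel) stride)))

;; 32x32 input, 3x3 kernel, stride 1, no padding
(conv-output-size 32 3)                        ; => 30
;; The single-image example above: stride 2, padding 1
(conv-output-size 32 3 stride: 2 padding: 1)   ; => 16
</enscript>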
===== Gradient Operations

<procedure>(zero-grad! tensor) -> void</procedure>

Sets all gradient values to zero.

<procedure>(backward! tensor) -> void</procedure>

Computes gradients via reverse-mode automatic differentiation. Performs topological sort and executes backward functions in correct order. Detects cycles and raises an error if found.

<procedure>(add-to-grad! tensor delta) -> void</procedure>

Accumulates delta into the tensor's gradient using BLAS AXPY.

===== Utility Functions

<procedure>(tensor->list tensor) -> list</procedure>

Converts tensor data to a list.

<procedure>(print-tensor tensor) -> void</procedure>

Pretty-prints tensor information including shape, dtype, data, and gradients.

<procedure>(vector-length-for-dtype vec dtype) -> integer</procedure>

Returns the length of a vector based on its dtype.

==== nanograd-layer

Neural network layer abstractions and containers with batch processing support.

===== Layer Predicates

<procedure>(layer? obj) -> boolean</procedure>
<procedure>(dense-layer? obj) -> boolean</procedure>
<procedure>(conv2d-layer? obj) -> boolean</procedure>
<procedure>(batch-norm-2d? obj) -> boolean</procedure>
<procedure>(sequential? obj) -> boolean</procedure>
<procedure>(flatten-layer? obj) -> boolean</procedure>

===== Dense Layer

<procedure>(make-dense-layer input-size output-size #!key (activation (make-identity)) (use-bias #t) (dtype 'f32) (name "Dense")) -> layer</procedure>

Creates a fully-connected (dense) layer with Xavier/Glorot initialization. Supports both single vectors and batches.

; input-size : number of input features
; output-size : number of output features
; activation : activation function object (default identity)
; use-bias : whether to include bias term (default #t)
; dtype : 'f32 or 'f64 (default 'f32)
; name : layer name for debugging

Input shapes:
* 1D: (input_size,) → output: (output_size,)
* 2D: (batch_size, input_size) → output: (batch_size, output_size)

For 2D inputs, uses BLAS GEMM for efficient batch processing.

<enscript highlight="scheme">
(define layer (make-dense-layer 784 128
                                activation: (make-relu)
                                name: "Hidden1"))

; Single input
(define x (make-tensor32 (make-f32vector 784) '(784)))
(define output (forward layer x))  ; Shape: (128,)

; Batch input
(define batch-x (make-tensor32 (make-f32vector (* 32 784)) '(32 784)))
(define batch-output (forward layer batch-x))  ; Shape: (32, 128)
</enscript>
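Each layer exposes its trainable tensors through {{parameters}} (documented under Layer Operations below), which is how they are later handed to an optimizer. A small sketch, assuming only the documented accessors; the order and layout of the returned tensors is implementation-defined.

<enscript highlight="scheme">
(import nanograd-autograd nanograd-layer)

;; A 64-bit dense layer; dtype applies to its trainable tensors.
(define tiny (make-dense-layer 4 2 dtype: 'f64 name: "Tiny"))

;; Inspect the trainable tensors (weight and, with use-bias #t, bias).
(for-each
 (lambda (p)
   (print (tensor-dtype p) " " (tensor-shape p)))
 (parameters tiny))
</enscript>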
===== Convolutional Layer

<procedure>(make-conv2d-layer in-channels out-channels kernel-size #!key (stride 1) (padding 0) (activation (make-identity)) (dtype 'f32) (name "Conv2D")) -> layer</procedure>

Creates a 2D convolutional layer with He initialization. Supports both single images and batches.

; in-channels : number of input channels
; out-channels : number of output channels
; kernel-size : size of convolution kernel (square)
; stride : convolution stride (default 1)
; padding : zero-padding (default 0)
; activation : activation function object
; dtype : 'f32 or 'f64
; name : layer name

Input shapes:
* 3D: (C_in, H, W) - single image
* 4D: (N, C_in, H, W) - batch of images

Output shapes:
* 3D: (C_out, H_out, W_out)
* 4D: (N, C_out, H_out, W_out)

<enscript highlight="scheme">
(define conv (make-conv2d-layer 3 32 3
                                stride: 1
                                padding: 1
                                activation: (make-relu)))

; Single image
(define img (make-tensor32 img-data '(3 32 32)))
(define features (forward conv img))  ; Shape: (32, 32, 32)

; Batch of images
(define batch (make-tensor32 batch-data '(16 3 32 32)))
(define batch-features (forward conv batch))  ; Shape: (16, 32, 32, 32)
</enscript>

===== Batch Normalization Layer

<procedure>(make-batch-norm-2d num-features #!key (epsilon 1e-5) (momentum 0.1) (dtype 'f32) (name "BatchNorm2d")) -> layer</procedure>

Creates a 2D batch normalization layer. Normalizes activations across the batch dimension:

y = γ * (x - μ) / √(σ² + ε) + β

where μ and σ² are computed from the batch (training mode) or from running statistics (evaluation mode).

; num-features : number of channels (C)
; epsilon : small constant for numerical stability (default 1e-5)
; momentum : momentum for updating running statistics (default 0.1)
; dtype : 'f32 or 'f64 (default 'f32)
; name : layer name

Input shapes:
* 3D: (C, H, W) - treated as batch of 1
* 4D: (N, C, H, W) - standard batch

Output shapes: same as input

<enscript highlight="scheme">
;; Create batch norm for 64 channels
(define bn (make-batch-norm-2d 64 epsilon: 1e-5 momentum: 0.1))

;; Training mode: uses batch statistics
(set-training-mode! bn #t)
(define normalized (forward bn input))  ; Input shape: (N, 64, H, W)

;; Evaluation mode: uses running statistics
(set-eval-mode! bn)
(define test-normalized (forward bn test-input))  ; Deterministic output
</enscript>

Batch normalization improves training stability and convergence by:

* Reducing internal covariate shift
* Allowing higher learning rates
* Acting as a form of regularization
* Making networks less sensitive to initialization

Key features:

* Learnable scale (gamma) and shift (beta) parameters
* Running mean and variance maintained for evaluation
* Automatic mode switching between training and evaluation
* Numerical stability with epsilon parameter

===== Flatten Layer

<procedure>(make-flatten #!key (name "Flatten")) -> layer</procedure>

Creates a flatten layer that converts multi-dimensional tensors to 1D or 2D.

Input shapes and outputs:
* 4D: (N, C, H, W) → (N, C*H*W)
* 3D: (C, H, W) → (C*H*W)
* 2D: (N, features) → (N, features) (no change)
* 1D: (features,) → (features,) (no change)

<enscript highlight="scheme">
(define flatten (make-flatten name: "Flatten"))

; Flatten batch of feature maps
(define features (make-tensor32 data '(32 64 8 8)))
(define flattened (forward flatten features))  ; Shape: (32, 4096)
</enscript>
===== Global Average Pooling

<procedure>(global-avg-pool2d input) -> tensor</procedure>

Global average pooling over spatial dimensions with batch support. Reduces spatial dimensions to 1x1 by averaging.

Input shapes:
* 3D: (C, H, W) → Output: (C,)
* 4D: (N, C, H, W) → Output: (N, C)

Gradient: Distributed uniformly over all spatial positions for each channel.

<enscript highlight="scheme">
;; Single image
(define feature-maps (make-tensor32 (make-f32vector (* 128 8 8)) '(128 8 8)))
(define pooled (global-avg-pool2d feature-maps))  ; Shape: (128,)

;; Batch of images
(define batch-features (make-tensor32 (make-f32vector (* 32 128 8 8)) '(32 128 8 8)))
(define batch-pooled (global-avg-pool2d batch-features))  ; Shape: (32, 128)

;; Use in classification network
(define logits (forward fc-layer batch-pooled))  ; Shape: (32, num_classes)
</enscript>

Global average pooling is commonly used to replace large fully-connected layers:

* Reduces number of parameters dramatically
* Improves generalization
* Makes networks translation-invariant
* Standard in modern architectures (ResNet, MobileNet, EfficientNet)

===== Sequential Container

<procedure>(make-sequential layers #!key (name "Sequential")) -> layer</procedure>

Creates a sequential container that chains multiple layers. Automatically handles batch propagation through all layers.

<enscript highlight="scheme">
(define model
  (make-sequential
   (list (make-dense-layer 784 128 activation: (make-relu))
         (make-dense-layer 128 64 activation: (make-relu))
         (make-dense-layer 64 10 activation: (make-identity)))
   name: "MLP"))

; Works with both single and batch inputs
(define single-output (forward model single-input))
(define batch-output (forward model batch-input))
</enscript>

===== Layer Operations

<procedure>(forward layer input) -> tensor</procedure>

Performs a forward pass through the layer. Automatically handles both single samples and batches based on input shape.

<procedure>(parameters layer) -> list</procedure>

Returns a list of all trainable parameter tensors.

<procedure>(zero-grad-layer! layer) -> void</procedure>

Zeros gradients for all parameters in the layer.

<procedure>(set-training-mode! layer training?) -> void</procedure>

Sets the training mode for the layer. When {{training?}} is #t, the layer uses training-specific behavior (e.g., batch statistics for batch norm). When #f, uses evaluation behavior.

<enscript highlight="scheme">
;; Set model to training mode
(set-training-mode! model #t)

;; Set model to evaluation mode
(set-training-mode! model #f)
</enscript>

<procedure>(set-eval-mode! layer) -> void</procedure>

Shorthand for {{(set-training-mode! layer #f)}}. Sets the layer to evaluation mode.

Training vs Evaluation Mode:

'''Training Mode''' ({{(set-training-mode! layer #t)}}):
* Batch normalization uses batch statistics (mean and variance computed from current batch)
* Dropout is active (if implemented)
* Stochastic behavior enabled
* Running statistics updated

'''Evaluation Mode''' ({{(set-eval-mode! layer)}}):
* Batch normalization uses running statistics (accumulated during training)
* Dropout is disabled
* Deterministic behavior
* Running statistics frozen

<procedure>(layer-input-size layer) -> integer or #f</procedure>
<procedure>(layer-output-size layer) -> integer or #f</procedure>
<procedure>(layer-activation layer) -> activation</procedure>
<procedure>(layer-name layer) -> string</procedure>

Accessor functions for layer properties. Note: input/output sizes may be #f for layers with dynamic dimensions (e.g., flatten).
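A short sketch of these accessors in use; the values shown in the comments follow from how the layers are constructed, while the exact activation name string is implementation-defined.

<enscript highlight="scheme">
(import nanograd-layer)

(define hidden (make-dense-layer 784 128 activation: (make-relu) name: "Hidden1"))

(layer-name hidden)                          ; => "Hidden1"
(layer-input-size hidden)                    ; => 784
(layer-output-size hidden)                   ; => 128
(activation-name (layer-activation hidden))  ; e.g. "ReLU"

(layer-input-size (make-flatten))            ; => #f (dynamic dimensions)
</enscript>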
===== Activation Function Objects

<procedure>(make-relu) -> activation</procedure>
<procedure>(make-tanh) -> activation</procedure>
<procedure>(make-sigmoid) -> activation</procedure>
<procedure>(make-gelu) -> activation</procedure>
<procedure>(make-silu) -> activation</procedure>
<procedure>(make-identity) -> activation</procedure>

Creates activation function objects for use in layers.

<procedure>(activation? obj) -> boolean</procedure>
<procedure>(activation-forward act x) -> tensor</procedure>
<procedure>(activation-name act) -> string</procedure>

===== Utility Functions

<procedure>(print-layer layer #!optional (indent 0)) -> void</procedure>

Prints layer information with optional indentation.

<procedure>(summary model) -> void</procedure>

Prints a model summary including all layers and parameter counts.

==== nanograd-optimizer

Optimization algorithms for neural network training.

===== Optimizer Predicates

<procedure>(optimizer? obj) -> boolean</procedure>
<procedure>(sgd? obj) -> boolean</procedure>
<procedure>(adam? obj) -> boolean</procedure>
<procedure>(rmsprop? obj) -> boolean</procedure>

===== SGD Optimizer

<procedure>(make-sgd parameters #!key (learning-rate 0.01) (momentum 0.0) (weight-decay 0.0) (nesterov #f)) -> optimizer</procedure>

Stochastic Gradient Descent optimizer with optional momentum and Nesterov acceleration.

; parameters : list of parameter tensors to optimize
; learning-rate : step size (default 0.01)
; momentum : momentum factor (default 0.0, no momentum)
; weight-decay : L2 regularization factor (default 0.0)
; nesterov : use Nesterov momentum (default #f)

===== Adam Optimizer

<procedure>(make-adam parameters #!key (learning-rate 0.001) (beta1 0.9) (beta2 0.999) (epsilon 1e-8) (weight-decay 0.0)) -> optimizer</procedure>

Adam (Adaptive Moment Estimation) optimizer with bias correction.

; beta1 : exponential decay rate for first moment (default 0.9)
; beta2 : exponential decay rate for second moment (default 0.999)
; epsilon : numerical stability constant (default 1e-8)

===== RMSprop Optimizer

<procedure>(make-rmsprop parameters #!key (learning-rate 0.01) (alpha 0.99) (epsilon 1e-8) (weight-decay 0.0) (momentum 0.0)) -> optimizer</procedure>

RMSprop optimizer with optional momentum.

; alpha : smoothing constant (default 0.99)

===== Optimizer Operations

<procedure>(step! optimizer) -> void</procedure>

Applies parameter updates based on accumulated gradients.

<procedure>(get-learning-rate optimizer) -> number</procedure>

Returns the current learning rate.

<procedure>(set-learning-rate! optimizer lr) -> void</procedure>

Updates the learning rate (useful for learning rate scheduling).

<procedure>(optimizer-state optimizer) -> alist</procedure>

Returns an association list of optimizer configuration parameters.
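Since {{get-learning-rate}} and {{set-learning-rate!}} are exposed, simple schedules can be layered on top of any optimizer. A minimal step-decay sketch; the helper name, decay factor, and interval are illustrative and not part of the egg.

<enscript highlight="scheme">
;; Hypothetical schedule: halve the learning rate every `interval` epochs.
(define (step-decay! optimizer epoch #!key (interval 10) (factor 0.5))
  (when (and (> epoch 0) (zero? (modulo epoch interval)))
    (set-learning-rate! optimizer
                        (* factor (get-learning-rate optimizer)))))

;; Usage inside a training loop:
;; (step-decay! optimizer epoch interval: 10 factor: 0.5)
</enscript>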
=== Examples

==== Batch Processing with Dense Layers

<enscript highlight="scheme">
(import nanograd-autograd nanograd-layer)

;; Create a batch of inputs
(define batch-size 32)
(define input-dim 784)
(define batch-data (make-f32vector (* batch-size input-dim)))
;; Fill with data...
(define batch-input (make-tensor32 batch-data (list batch-size input-dim)))

;; Dense layer automatically handles batches
(define layer (make-dense-layer input-dim 128 activation: (make-relu)))
(define output (forward layer batch-input))  ; Shape: (32, 128)
</enscript>

==== Batched Softmax and Cross-Entropy

<enscript highlight="scheme">
;; Batch of logits
(define batch-size 32)
(define num-classes 10)
(define logits (make-tensor32 (make-f32vector (* batch-size num-classes))
                              (list batch-size num-classes)))
(define targets (make-tensor32 target-data (list batch-size num-classes)))

;; Softmax along class dimension
(define probs (softmax logits axis: -1))  ; Each row sums to 1

;; Cross-entropy with batches
(define loss (cross-entropy-loss probs targets reduction: 'mean))

;; Alternative: use from-logits for stability
(define loss-stable (cross-entropy-loss logits targets
                                        from-logits: #t
                                        reduction: 'mean))
</enscript>

==== Training with Batches

<enscript highlight="scheme">
(import nanograd-autograd nanograd-layer nanograd-optimizer)

;; Define model
(define model
  (make-sequential
   (list (make-dense-layer 784 256 activation: (make-relu))
         (make-dense-layer 256 128 activation: (make-relu))
         (make-dense-layer 128 10 activation: (make-identity)))
   name: "MLP"))

(define optimizer (make-adam (parameters model) learning-rate: 0.001))

;; Training loop with batches
(define (train-epoch train-batches)
  (set-training-mode! model #t)
  (for-each
   (lambda (batch)
     (let* ((x (car batch))   ; Shape: (batch_size, 784)
            (y (cdr batch))   ; Shape: (batch_size, 10)
            (logits (forward model x))
            (loss (cross-entropy-loss logits y
                                      from-logits: #t
                                      reduction: 'mean)))
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! model)))
   train-batches))

;; Evaluation
(define (evaluate test-batches)
  (set-eval-mode! model)
  ;; ... evaluation code ...
  )
</enscript>
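One possible shape for the evaluation step, as a sketch only: it reuses the {{model}} from the example above and assumes one-hot targets stored row-major, using {{chop}} from (chicken base) and {{count}} from srfi-1 to compute batch accuracy. The {{argmax}} helper is hypothetical.

<enscript highlight="scheme">
(import (chicken base) srfi-1 nanograd-autograd nanograd-layer)

;; Hypothetical helper: index of the largest element in a list.
(define (argmax lst)
  (let loop ((xs (cdr lst)) (i 1) (best 0) (best-val (car lst)))
    (cond ((null? xs) best)
          ((> (car xs) best-val) (loop (cdr xs) (+ i 1) i (car xs)))
          (else (loop (cdr xs) (+ i 1) best best-val)))))

;; Sketch of an accuracy-based evaluation over (input . one-hot-target) batches.
(define (evaluate-accuracy test-batches num-classes)
  (set-eval-mode! model)
  (let loop ((batches test-batches) (correct 0) (total 0))
    (if (null? batches)
        (exact->inexact (/ correct (max total 1)))
        (let* ((x (caar batches))
               (y (cdar batches))
               (pred-rows (chop (tensor->list (forward model x)) num-classes))
               (target-rows (chop (tensor->list y) num-classes)))
          (loop (cdr batches)
                (+ correct (count (lambda (p t) (= (argmax p) (argmax t)))
                                  pred-rows target-rows))
                (+ total (length pred-rows)))))))
</enscript>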
==== Convolutional Network with Batch Normalization

<enscript highlight="scheme">
(import nanograd-autograd nanograd-layer nanograd-optimizer)

;; CNN with batch support
(define cnn
  (make-sequential
   (list (make-conv2d-layer 3 32 3 stride: 1 padding: 1
                            activation: (make-identity))
         (make-batch-norm-2d 32)   ; Normalizes across batch
         (make-conv2d-layer 32 64 3 stride: 1 padding: 1
                            activation: (make-identity))
         (make-batch-norm-2d 64)
         (make-flatten)
         (make-dense-layer (* 64 32 32) 128 activation: (make-relu))
         (make-dense-layer 128 10 activation: (make-identity)))
   name: "CNN"))

;; Process batch of images
(define batch-images (make-tensor32 image-data '(16 3 32 32)))  ; 16 RGB images
(set-training-mode! cnn #t)
(define predictions (forward cnn batch-images))  ; Shape: (16, 10)
</enscript>

==== ResNet-Style Architecture

<enscript highlight="scheme">
;; ResNet block with batch normalization
(define (make-resnet-block in-channels out-channels stride)
  (make-sequential
   (list (make-conv2d-layer in-channels out-channels 3
                            stride: stride padding: 1
                            activation: (make-identity))
         (make-batch-norm-2d out-channels)
         (make-conv2d-layer out-channels out-channels 3
                            stride: 1 padding: 1
                            activation: (make-identity))
         (make-batch-norm-2d out-channels))
   name: "ResBlock"))

;; Full model
(define resnet
  (make-sequential
   (list (make-conv2d-layer 3 64 7 stride: 2 padding: 3)
         (make-batch-norm-2d 64)
         (make-resnet-block 64 64 1)
         (make-resnet-block 64 128 2)
         (make-resnet-block 128 256 2)
         (make-resnet-block 256 512 2)
         (make-dense-layer 512 1000))
   name: "ResNet"))
</enscript>

=== Performance Notes

* NanoGrad uses BLAS for matrix operations, including batched GEMM
* Batch operations are significantly more efficient than processing samples individually
* Use f32 (32-bit) tensors when 64-bit precision is not required
* The framework detects computation graph cycles
* Batch normalization adds minimal overhead and significantly improves training
* Global average pooling reduces parameters without sacrificing performance

=== Batch Processing Best Practices

1. '''Always use batches during training''' for better performance and stable gradients
2. '''Set appropriate batch sizes''' (typically 16-256 depending on memory)
3. '''Use batch normalization''' for deeper networks (>10 layers)
4. '''Switch to eval mode''' during validation/testing to use running statistics
5. '''Prefer global average pooling''' over large fully-connected layers in CNNs

=== Limitations

* CPU-only (no GPU support)
* No automatic batching (must manually create batches)
* Limited built-in layer types (dense, convolutional, batch norm)
* Single-threaded execution
* Batch normalization requires proper training/eval mode switching

=== Troubleshooting

==== Common Errors

'''Shape mismatch errors'''

Ensure tensor shapes are compatible for operations. For batched operations, the batch dimension should match.

<enscript highlight="scheme">
; Batch size mismatch
(define x (make-tensor32 (make-f32vector 200) '(10 20)))
(define y (make-tensor32 (make-f32vector 300) '(15 20)))
(add x y)  ; Error: shape mismatch
</enscript>

'''Batch normalization mode not set'''

Always explicitly set training/eval mode:

<enscript highlight="scheme">
; Training
(set-training-mode! model #t)
(train-epoch model)

; Evaluation
(set-eval-mode! model)
(evaluate model)
</enscript>
=== Author

[[https://github.com/iraikov|Ivan Raikov]]

=== Repository

[[https://github.com/iraikov/nanograd|https://github.com/iraikov/nanograd]]

=== Version History

; 2.0 : Batch processing support
* Dense layers support 1D/2D inputs
* Conv2D supports 3D/4D inputs
* Batch normalization for 3D/4D inputs
* Softmax/log-softmax with batch and axis support
* Cross-entropy loss with batch reduction
* RMSNorm with 1D/2D support
* Global average pooling with 3D/4D support
* L2-normalize with axis parameter

; 1.2 : Additional operations
* Reduction operations (sum-tensor, mean-tensor, product-tensor, reduce-tensor)
* Tensor slicing (slice-tensor)
* Batch normalization (make-batch-norm-2d)
* Global average pooling (global-avg-pool2d)
* Training/evaluation mode control

; 1.1 : Bug fix in mul layer operation

; 1.0 : Initial release
* Core autograd engine
* Dense and convolutional layers
* SGD, Adam, and RMSprop optimizers
* Basic activation and loss functions

=== See Also

* [[blas|BLAS bindings]]
* [[yasos|YASOS object system]]
* [[mathh|Extended math functions]]

=== License

LGPL-3

=== References

* PyTorch: dynamic computation graphs, autograd design, and batch-first conventions
* micrograd: minimalist autograd engine by Andrej Karpathy
* "Automatic Differentiation in Machine Learning: a Survey" (Baydin et al., 2018)
* "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (Ioffe & Szegedy, 2015)
* BLAS (Basic Linear Algebra Subprograms) documentation