
nanograd

A lightweight automatic differentiation and neural network framework for CHICKEN Scheme, featuring BLAS-accelerated operations and YASOS-based object abstractions.

Description

NanoGrad provides a complete framework for building and training neural networks with automatic differentiation. It features:

* Reverse-mode automatic differentiation over 32-bit and 64-bit floating-point tensors
* BLAS-accelerated linear algebra (GEMM/GEMV, DOT, SCAL, AXPY)
* Dense and 2D convolutional layers with a sequential container
* SGD, Adam, and RMSprop optimizers
* Common activation and loss functions
* YASOS-based object abstractions for layers, activations, and optimizers

Requirements

* blas
* yasos

Modules

nanograd-autograd

Core automatic differentiation engine with tensor operations.

Tensor Constructors
[procedure] (make-tensor32 data shape #!key (requires-grad? #t)) -> tensor

Creates a 32-bit floating-point tensor with automatic differentiation support.

data
f32vector containing the tensor data
shape
list of dimensions, e.g., '(2 3) for a 2x3 matrix
requires-grad?
whether to track gradients (default #t)
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3) requires-grad?: #t))
[procedure] (make-tensor64 data shape #!key (requires-grad? #t)) -> tensor

Creates a 64-bit floating-point tensor with automatic differentiation support.

(define x (make-tensor64 (f64vector 1.0 2.0 3.0 4.0) '(2 2)))
Tensor Predicates
[procedure] (tensor? obj) -> boolean
[procedure] (tensor32? obj) -> boolean
[procedure] (tensor64? obj) -> boolean

Type predicates for tensors.
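For example, given the f32 tensor x created with make-tensor32 above:

(tensor? x)    ; => #t
(tensor32? x)  ; => #t
(tensor64? x)  ; => #f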

Tensor Accessors
[procedure] (tensor-data tensor) -> vector

Returns the underlying f32vector or f64vector containing the tensor's data.

[procedure] (tensor-grad tensor) -> vector or #f

Returns the gradient vector if gradients are enabled, #f otherwise.

[procedure] (tensor-shape tensor) -> list

Returns the shape as a list of dimensions.

[procedure] (tensor-dtype tensor) -> symbol

Returns the data type: 'f32 or 'f64.

[procedure] (tensor-requires-grad? tensor) -> boolean

Returns #t if the tensor tracks gradients.
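A quick sketch using the 3-element f32 tensor x from the constructor example (printed representations may vary):

(tensor-shape x)           ; => (3)
(tensor-dtype x)           ; => f32
(tensor-requires-grad? x)  ; => #t
(tensor-data x)            ; => #f32(1.0 2.0 3.0)
(tensor-grad x)            ; => gradient f32vector, or #f if gradients are disabled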

Arithmetic Operations
[procedure] (add a b) -> tensor

Element-wise addition of tensors a and b. Both tensors must have the same shape and dtype.

(define z (add x y))  ; z = x + y

Gradient: dL/da = dL/dz, dL/db = dL/dz

[procedure] (sub a b) -> tensor

Element-wise subtraction: a - b.

Gradient: dL/da = dL/dz, dL/db = -dL/dz

[procedure] (mul a b) -> tensor

Element-wise multiplication (Hadamard product).

Gradient: dL/da = dL/dz ⊙ b, dL/db = dL/dz ⊙ a

[procedure] (div a b) -> tensor

Element-wise division: a / b.

Gradient: dL/da = dL/dz / b, dL/db = -dL/dz ⊙ (a / b²)

[procedure] (safe-div a b #!key (epsilon 1e-8)) -> tensor

Safe element-wise division: a / (b + epsilon) to avoid division by zero.
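A minimal sketch of the remaining element-wise operations, assuming x and y are tensors of identical shape and dtype:

(define d (sub x y))                     ; element-wise x - y
(define p (mul x y))                     ; Hadamard product
(define q (div x y))                     ; element-wise x / y
(define r (safe-div x y epsilon: 1e-8))  ; x / (y + 1e-8)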

Linear Algebra Operations
[procedure] (matmul-op a b) -> tensor

Matrix multiplication using BLAS GEMM/GEMV operations. Supports matrix × matrix products (GEMM) as well as matrix × vector products (GEMV).

(define A (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(define b (make-tensor32 (f32vector 5.0 6.0) '(2)))
(define c (matmul-op A b))  ; 2×2 matrix times 2×1 vector = 2×1 vector

Gradient: dL/dA = dL/dC · B^T, dL/dB = A^T · dL/dC

[procedure] (dot-op a b) -> tensor

Dot product (inner product) of two 1D vectors using BLAS DOT.

(define result (dot-op x y))  ; scalar result

Gradient: dL/da = (dL/dresult) · b, dL/db = (dL/dresult) · a

[procedure] (scale-op tensor scalar) -> tensor

Scalar multiplication using BLAS SCAL.

Gradient: dL/dtensor = scalar · dL/dresult
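For example, doubling every element of x (result has the same shape as x):

(define x2 (scale-op x 2.0))  ; each element of x multiplied by 2.0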

Activation Functions
[procedure] (relu tensor) -> tensor

Rectified Linear Unit: max(0, x).

Gradient: 1 if x > 0, else 0

[procedure] (tanh-op tensor) -> tensor

Hyperbolic tangent activation.

Gradient: 1 - tanh^2(x)

[procedure] (sigmoid tensor) -> tensor

Sigmoid (logistic) activation: sigm(x) = 1 / (1 + e^(-x)).

Gradient: sigm(x) · (1 - sigm(x))

[procedure] (sigmoid-stable tensor) -> tensor

Numerically stable sigmoid implementation for large negative values.

[procedure] (softmax x #!key (dim #f)) -> tensor

Softmax normalization with numerical stability (subtracts max before exp).

(define probs (softmax logits))  ; Converts logits to probabilities
[procedure] (log-softmax x #!key (dim #f)) -> tensor

Log-softmax: more numerically stable than log(softmax(x)).

[procedure] (leaky-relu tensor #!key (alpha 0.01)) -> tensor

Leaky ReLU: max(alpha * x, x).

[procedure] (softplus tensor #!key (beta 1.0)) -> tensor

Softplus activation: log(1 + e^(beta * x)) / beta.
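A short sketch applying several activations to a tensor z of pre-activations (z is assumed; any tensor works):

(define h  (relu z))                   ; max(0, x) element-wise
(define t  (tanh-op z))
(define s  (sigmoid-stable z))         ; numerically stable sigmoid
(define lr (leaky-relu z alpha: 0.1))  ; slope 0.1 for negative inputs
(define sp (softplus z beta: 1.0))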

Loss Functions
[procedure] (mse-loss pred target) -> tensor

Mean Squared Error loss: L = (1/n) Σ (pred - target)².

(define loss (mse-loss predictions targets))
[procedure] (cross-entropy-loss pred target) -> tensor

Cross-entropy loss: L = -Σ target · log(pred).

Note: Assumes pred is already normalized (e.g., via softmax).
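A minimal sketch, assuming logits and targets are tensors of matching shape and targets is a (one-hot or soft) probability distribution:

(define probs (softmax logits))                    ; normalize first
(define loss  (cross-entropy-loss probs targets))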

Gradient Operations
[procedure] (zero-grad! tensor) -> void

Sets all gradient values to zero.

[procedure] (backward! tensor) -> void

Computes gradients via reverse-mode automatic differentiation. Performs topological sort and executes backward functions in correct order. Detects cycles and raises an error if found.

(define x (make-tensor32 (f32vector 1.0 2.0) '(2)))
(define y (make-tensor32 (f32vector 3.0 4.0) '(2)))
(define z (add x y))
(define loss (dot-op z z))

(backward! loss)
(print-tensor (tensor-grad x))
[procedure] (add-to-grad! tensor delta) -> void

Accumulates delta into the tensor's gradient using BLAS AXPY.
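For example, accumulating an external gradient into the 2-element tensor x from the backward! example and then clearing it (delta is a raw f32vector matching x's dtype and size, as in the custom backward function example below; x is assumed to track gradients):

(add-to-grad! x (f32vector 0.5 0.5))  ; grad(x) += delta
(zero-grad! x)                        ; reset the gradient before the next pass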

Convolution Operations
[procedure] (conv2d input weight bias #!key (stride 1) (padding 0)) -> tensor

2D convolution using im2col + GEMM algorithm.

input
tensor of shape (C_in, H, W)
weight
tensor of shape (C_out, C_in, KH, KW)
bias
tensor of shape (C_out) or #f
stride
stride for convolution (default 1)
padding
zero-padding (default 0)
(define output (conv2d input weights bias stride: 2 padding: 1))
Normalization Operations
[procedure] (rmsnorm x weight #!key (epsilon 1e-5)) -> tensor

Root Mean Square Layer Normalization.

[procedure] (l2-normalize tensor #!key (epsilon 1e-8)) -> tensor

L2 normalization: x / ||x||₂.

[procedure] (cosine-similarity a b) -> tensor

Cosine similarity: (a · b) / (||a|| · ||b||).
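A brief sketch, assuming x and y are 1D tensors of the same length and gain is a weight tensor shaped like x (gain is an assumed name):

(define xhat (l2-normalize x))         ; unit-norm version of x
(define sim  (cosine-similarity x y))  ; scalar tensor in [-1, 1]
(define rn   (rmsnorm x gain))         ; RMS-normalize x, then scale by gain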

Utility Functions
[procedure] (tensor->list tensor) -> list

Converts tensor data to a list.

[procedure] (print-tensor tensor) -> void

Pretty-prints tensor information including shape, dtype, data, and gradients.
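For example, with the 3-element tensor x from above:

(tensor->list x)   ; => (1.0 2.0 3.0)
(print-tensor x)   ; prints shape, dtype, data, and gradient (if any)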

[procedure] (vector-length-for-dtype vec dtype) -> integer

Returns the length of a vector based on its dtype.

nanograd-layer

Neural network layer abstractions and containers.

Layer Predicates
[procedure] (layer? obj) -> boolean
[procedure] (dense-layer? obj) -> boolean
[procedure] (conv2d-layer? obj) -> boolean
[procedure] (sequential? obj) -> boolean
Dense Layer
[procedure] (make-dense-layer input-size output-size #!key (activation (make-identity)) (dtype 'f32) (name "Dense")) -> layer

Creates a fully-connected (dense) layer with Xavier/Glorot initialization.

input-size
number of input features
output-size
number of output features
activation
activation function object (default identity)
dtype
'f32 or 'f64 (default 'f32)
name
layer name for debugging
(define layer (make-dense-layer 784 128 
                                activation: (make-relu)
                                name: "Hidden1"))
Convolutional Layer
[procedure] (make-conv2d-layer in-channels out-channels kernel-size #!key (stride 1) (padding 0) (activation (make-identity)) (dtype 'f32) (name "Conv2D")) -> layer

Creates a 2D convolutional layer with He initialization.

(define conv (make-conv2d-layer 3 32 3 
                                stride: 1 
                                padding: 1
                                activation: (make-relu)))
Sequential Container
[procedure] (make-sequential layers #!key (name "Sequential")) -> layer

Creates a sequential container that chains multiple layers.

(define model
  (make-sequential
   (list
    (make-dense-layer 784 128 activation: (make-relu))
    (make-dense-layer 128 64 activation: (make-relu))
    (make-dense-layer 64 10 activation: (make-identity)))
   name: "MLP"))
Layer Operations
[procedure] (forward layer input) -> tensor

Performs a forward pass through the layer.

[procedure] (parameters layer) -> list

Returns a list of all trainable parameter tensors.

[procedure] (zero-grad-layer! layer) -> void

Zeros gradients for all parameters in the layer.

[procedure] (layer-input-size layer) -> integer
[procedure] (layer-output-size layer) -> integer
[procedure] (layer-activation layer) -> activation
[procedure] (layer-name layer) -> string

Accessor functions for layer properties.
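A minimal sketch using the MLP model defined in the sequential container example and a flattened 784-element input tensor:

(define x   (make-tensor32 (make-f32vector 784 0.0) '(784)))
(define out (forward model x))   ; shape (10) output tensor
(define ps  (parameters model))  ; weight and bias tensors of every layer
(zero-grad-layer! model)         ; clear all parameter gradients
(layer-name model)               ; => "MLP"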

Activation Function Objects
[procedure] (make-relu) -> activation
[procedure] (make-tanh) -> activation
[procedure] (make-sigmoid) -> activation
[procedure] (make-identity) -> activation

Creates activation function objects for use in layers.

[procedure] (activation? obj) -> boolean
[procedure] (activation-forward act x) -> tensor
[procedure] (activation-name act) -> string
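A brief sketch (the exact string returned by activation-name is an assumption):

(define act (make-relu))
(activation? act)                      ; => #t
(activation-name act)                  ; => a descriptive name, e.g. "ReLU"
(define y (activation-forward act x))  ; apply the activation to tensor x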
Utility Functions
[procedure] (print-layer layer #!optional (indent 0)) -> void

Prints layer information with optional indentation.

[procedure] (summary model) -> void

Prints a model summary including all layers and parameter counts.

(summary model)
; === Model Summary ===
; Model: MLP
; Input size: 784
; Output size: 10
; 
; Total parameters: 109386

nanograd-optimizer

Optimization algorithms for neural network training.

Optimizer Predicates
[procedure] (optimizer? obj) -> boolean
[procedure] (sgd? obj) -> boolean
[procedure] (adam? obj) -> boolean
[procedure] (rmsprop? obj) -> boolean
SGD Optimizer
[procedure] (make-sgd parameters #!key (learning-rate 0.01) (momentum 0.0) (weight-decay 0.0) (nesterov #f)) -> optimizer

Stochastic Gradient Descent optimizer with optional momentum and Nesterov acceleration.

parameters
list of parameter tensors to optimize
learning-rate
step size (default 0.01)
momentum
momentum factor (default 0.0, no momentum)
weight-decay
L2 regularization factor (default 0.0)
nesterov
use Nesterov momentum (default #f)
(define opt (make-sgd (parameters model) 
                      learning-rate: 0.01
                      momentum: 0.9))
Adam Optimizer
[procedure] (make-adam parameters #!key (learning-rate 0.001) (beta1 0.9) (beta2 0.999) (epsilon 1e-8) (weight-decay 0.0)) -> optimizer

Adam (Adaptive Moment Estimation) optimizer with bias correction.

beta1
exponential decay rate for first moment (default 0.9)
beta2
exponential decay rate for second moment (default 0.999)
epsilon
numerical stability constant (default 1e-8)
(define opt (make-adam (parameters model) learning-rate: 0.001))
RMSprop Optimizer
[procedure] (make-rmsprop parameters #!key (learning-rate 0.01) (alpha 0.99) (epsilon 1e-8) (weight-decay 0.0) (momentum 0.0)) -> optimizer

RMSprop optimizer with optional momentum.

alpha
smoothing constant (default 0.99)
(define opt (make-rmsprop (parameters model) 
                          learning-rate: 0.01
                          alpha: 0.99))
Optimizer Operations
[procedure] (step! optimizer) -> void

Applies parameter updates based on accumulated gradients.
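A typical update step, as used in the training example below:

(backward! loss)          ; accumulate gradients
(step! optimizer)         ; apply parameter updates
(zero-grad-layer! model)  ; clear gradients before the next iteration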

[procedure] (get-learning-rate optimizer) -> number

Returns the current learning rate.

[procedure] (set-learning-rate! optimizer lr) -> void

Updates the learning rate (useful for learning rate scheduling).

; Learning rate decay
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  (set-learning-rate! opt (/ 0.1 (+ 1.0 (* 0.01 epoch))))
  ; ... training code ...
  )
[procedure] (optimizer-state optimizer) -> alist

Returns an association list of optimizer configuration parameters.
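For example, with the Adam optimizer opt defined above (the exact keys in the alist are an assumption):

(optimizer-state opt)
; => e.g. ((learning-rate . 0.001) (beta1 . 0.9) (beta2 . 0.999) ...)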

Examples

Basic Tensor Operations

(import nanograd-autograd)

; Create tensors
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define y (make-tensor32 (f32vector 4.0 5.0 6.0) '(3)))

; Operations
(define z (add x y))
(define w (mul x y))

; Compute gradients
(backward! w)
(print-tensor (tensor-grad x))

Training a Neural Network

(import nanograd-autograd nanograd-layer nanograd-optimizer)

; Define model
(define model
  (make-sequential
   (list
    (make-dense-layer 2 8 activation: (make-relu))
    (make-dense-layer 8 1 activation: (make-identity)))
   name: "Regression"))

; Create optimizer
(define optimizer (make-adam (parameters model) learning-rate: 0.01))

; Training loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  
  (for-each
   (lambda (sample)
     (let* ((x (make-tensor32 (car sample) '(2)))
            (target (make-tensor32 (f32vector (cdr sample)) '(1)))
            (pred (forward model x))
            (loss (mse-loss pred target)))
       
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! model)))
   training-data))

Convolutional Neural Network

(define cnn
  (make-sequential
   (list
    (make-conv2d-layer 3 32 3 stride: 1 padding: 1 
                       activation: (make-relu))
    (make-conv2d-layer 32 64 3 stride: 1 padding: 1 
                       activation: (make-relu))
    (make-dense-layer (* 64 8 8) 128 activation: (make-relu))
    (make-dense-layer 128 10 activation: (make-identity)))
   name: "CNN"))

; Forward pass with image tensor (3 channels, 32x32 pixels)
(define img (make-tensor32 (make-f32vector (* 3 32 32)) '(3 32 32)))
(define output (forward cnn img))

Performance Notes

Core linear-algebra kernels (matrix multiplication, dot products, scaling, and gradient accumulation) dispatch to BLAS routines (GEMM/GEMV, DOT, SCAL, AXPY), so throughput depends largely on the BLAS library the blas egg is linked against. Convolutions are implemented via im2col + GEMM, trading extra memory for BLAS-friendly matrix multiplication.

Limitations

Advanced Usage

Custom Backward Functions

; Define a custom operation
(define (my-operation x)
  (let* ((dtype (tensor-dtype x))
         (data (tensor-data x))
         (n (vector-length-for-dtype data dtype))
         (result-data (case dtype
                        ((f32) (make-f32vector n))
                        ((f64) (make-f64vector n)))))
    
    ; Forward computation
    (case dtype
      ((f32)
       (do ((i 0 (+ i 1)))
           ((= i n))
         (f32vector-set! result-data i 
                         (* 2.0 (f32vector-ref data i)))))
      ((f64)
       (do ((i 0 (+ i 1)))
           ((= i n))
         (f64vector-set! result-data i 
                         (* 2.0 (f64vector-ref data i))))))
    
    (let ((result (make-base-tensor result-data 
                                    (tensor-shape x) 
                                    dtype 
                                    (tensor-requires-grad? x))))
      
      ; Define backward function
      (when (tensor-requires-grad? x)
        (set-backward-fn! result
          (lambda ()
            (let ((grad-out (tensor-grad result))
                  (grad-in (case dtype
                            ((f32) (make-f32vector n))
                            ((f64) (make-f64vector n)))))
              ; Gradient: d(2x)/dx = 2
              (case dtype
                ((f32)
                 (do ((i 0 (+ i 1)))
                     ((= i n))
                   (f32vector-set! grad-in i
                                   (* 2.0 (f32vector-ref grad-out i)))))
                ((f64)
                 (do ((i 0 (+ i 1)))
                     ((= i n))
                   (f64vector-set! grad-in i
                                   (* 2.0 (f64vector-ref grad-out i))))))
              (add-to-grad! x grad-in)))
          (list x)))
      
      result)))

Learning Rate Scheduling

; Step decay
(define (step-decay base-lr epoch drop-every drop-rate)
  (* base-lr (expt drop-rate (floor (/ epoch drop-every)))))

; Exponential decay
(define (exp-decay base-lr epoch decay-rate)
  (* base-lr (exp (- (* decay-rate epoch)))))

; Cosine annealing
(define (cosine-annealing base-lr epoch total-epochs)
  (* 0.5 base-lr (+ 1.0 (cos (* 3.14159 (/ epoch total-epochs))))))

; Usage in training loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  (set-learning-rate! optimizer (step-decay 0.1 epoch 30 0.5))
  ; ... training code ...
  )

Gradient Clipping

; Clip gradients by norm (sscal!/dscal! come from the blas egg)
(define (clip-grad-norm! parameters max-norm)
  (let ((total-norm 0.0))
    ; Compute total norm
    (for-each
     (lambda (param)
       (let ((grad (tensor-grad param)))
         (when grad
           (let ((dtype (tensor-dtype param))
                 (n (vector-length-for-dtype grad dtype)))
             (case dtype
               ((f32)
                (do ((i 0 (+ i 1)))
                    ((= i n))
                  (let ((g (f32vector-ref grad i)))
                    (set! total-norm (+ total-norm (* g g))))))
               ((f64)
                (do ((i 0 (+ i 1)))
                    ((= i n))
                  (let ((g (f64vector-ref grad i)))
                    (set! total-norm (+ total-norm (* g g)))))))))))
     parameters)
    
    (let ((total-norm (sqrt total-norm)))
      (when (> total-norm max-norm)
        (let ((scale (/ max-norm total-norm)))
          ; Scale all gradients
          (for-each
           (lambda (param)
             (let ((grad (tensor-grad param)))
               (when grad
                 (let ((n (vector-length-for-dtype 
                          grad 
                          (tensor-dtype param))))
                   (case (tensor-dtype param)
                     ((f32) (sscal! n scale grad))
                     ((f64) (dscal! n scale grad)))))))
           parameters))))))

; Usage
(backward! loss)
(clip-grad-norm! (parameters model) 1.0)
(step! optimizer)

Model Evaluation Mode

; Create tensors without gradient tracking for inference
(define (predict model input-data)
  (let ((x (make-tensor32 input-data 
                          (list (f32vector-length input-data))
                          requires-grad?: #f)))
    (forward model x)))

; Batch prediction
(define (predict-batch model batch-data)
  (map (lambda (input) (predict model input))
       batch-data))

Troubleshooting

Common Errors

Shape mismatch errors

Ensure tensor shapes are compatible for operations:

; Matrix multiplication requires compatible dimensions
(define A (make-tensor32 (make-f32vector 6) '(2 3)))
(define B (make-tensor32 (make-f32vector 6) '(3 2)))
(define C (matmul-op A B))  ; OK: (2,3) × (3,2) = (2,2)

(define D (make-tensor32 (make-f32vector 4) '(2 2)))
(matmul-op A D)  ; Error: incompatible dimensions

Gradient computation cycles

Avoid creating cycles in the computation graph:

; Bad: creates a cycle
(define x (make-tensor32 (f32vector 1.0) '(1)))
(define y (add x x))
(set-backward-fn! x (lambda () (add-to-grad! x (tensor-grad y))) (list y))
(backward! y)  ; Error: computation graph contains cycles

Division by zero

Use safe-div when dividing by potentially zero values:

; Instead of (div a b), use:
(define result (safe-div a b epsilon: 1e-8))

Author

Ivan Raikov

Repository

https://github.com/iraikov/nanograd

Version History

1.0
Initial release

* Core autograd engine
* Dense and convolutional layers
* SGD, Adam, and RMSprop optimizers
* Basic activation and loss functions

See Also

License

GPL-3

References