
nanograd

A lightweight automatic differentiation and neural network framework for CHICKEN Scheme, featuring BLAS-accelerated operations and YASOS-based object abstractions.

Description

NanoGrad provides a complete framework for building and training neural networks with automatic differentiation. It features:

* Reverse-mode automatic differentiation over 32-bit and 64-bit floating-point tensors
* BLAS-accelerated linear algebra (GEMM/GEMV, DOT, SCAL, AXPY)
* Dense and 2D convolutional layers with a sequential container
* SGD, Adam, and RMSprop optimizers
* Common activation and loss functions
* YASOS-based object abstractions for layers, activations, and optimizers

Requirements

* blas
* yasos

Modules

nanograd-autograd

Core automatic differentiation engine with tensor operations.

Tensor Constructors
[procedure] (make-tensor32 data shape #!key (requires-grad? #t)) -> tensor

Creates a 32-bit floating-point tensor with automatic differentiation support.

data
f32vector containing the tensor data
shape
list of dimensions, e.g., '(2 3) for a 2x3 matrix
requires-grad?
whether to track gradients (default #t)
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3) requires-grad?: #t))
[procedure] (make-tensor64 data shape #!key (requires-grad? #t)) -> tensor

Creates a 64-bit floating-point tensor with automatic differentiation support.

(define x (make-tensor64 (f64vector 1.0 2.0 3.0 4.0) '(2 2)))
Tensor Predicates
[procedure] (tensor? obj) -> boolean
[procedure] (tensor32? obj) -> boolean
[procedure] (tensor64? obj) -> boolean

Type predicates for tensors.
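For example, given the f32 tensor x created with make-tensor32 above:

(tensor? x)    ; => #t
(tensor32? x)  ; => #t
(tensor64? x)  ; => #f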

Tensor Accessors
[procedure] (tensor-data tensor) -> vector

Returns the underlying f32vector or f64vector containing the tensor's data.

[procedure] (tensor-grad tensor) -> vector or #f

Returns the gradient vector if gradients are enabled, #f otherwise.

[procedure] (tensor-shape tensor) -> list

Returns the shape as a list of dimensions.

[procedure] (tensor-dtype tensor) -> symbol

Returns the data type: 'f32 or 'f64.

[procedure] (tensor-requires-grad? tensor) -> boolean

Returns #t if the tensor tracks gradients.
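A quick sketch using the 3-element f32 tensor x from the constructor example (printed representations may vary):

(tensor-shape x)           ; => (3)
(tensor-dtype x)           ; => f32
(tensor-requires-grad? x)  ; => #t
(tensor-data x)            ; => #f32(1.0 2.0 3.0)
(tensor-grad x)            ; => gradient f32vector, or #f if gradients are disabled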

Arithmetic Operations
[procedure] (add a b) -> tensor

Element-wise addition of tensors a and b. Both tensors must have the same shape and dtype.

(define z (add x y))  ; z = x + y

Gradient: dL/da = dL/dz, dL/db = dL/dz

[procedure] (sub a b) -> tensor

Element-wise subtraction: a - b.

Gradient: dL/da = dL/dz, dL/db = -dL/dz

[procedure] (mul a b) -> tensor

Element-wise multiplication (Hadamard product).

Gradient: dL/da = dL/dz ⊙ b, dL/db = dL/dz ⊙ a

[procedure] (div a b) -> tensor

Element-wise division: a / b.

Gradient: dL/da = dL/dz / b, dL/db = -dL/dz ⊙ (a / b²)

[procedure] (safe-div a b #!key (epsilon 1e-8)) -> tensor

Safe element-wise division: a / (b + epsilon) to avoid division by zero.
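A minimal sketch of the remaining element-wise operations, assuming x and y are tensors of identical shape and dtype:

(define d (sub x y))                     ; element-wise x - y
(define p (mul x y))                     ; Hadamard product
(define q (div x y))                     ; element-wise x / y
(define r (safe-div x y epsilon: 1e-8))  ; x / (y + 1e-8)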

Linear Algebra Operations
[procedure] (matmul-op a b) -> tensor

Matrix multiplication using BLAS GEMM/GEMV operations. Supports matrix × matrix products (GEMM) as well as matrix × vector products (GEMV).

(define A (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(define b (make-tensor32 (f32vector 5.0 6.0) '(2)))
(define c (matmul-op A b))  ; 2×2 matrix times 2×1 vector = 2×1 vector

Gradient: dL/dA = dL/dC · B^T, dL/dB = A^T · dL/dC

[procedure] (dot-op a b) -> tensor

Dot product (inner product) of two 1D vectors using BLAS DOT.

(define result (dot-op x y))  ; scalar result

Gradient: dL/da = (dL/dresult) · b, dL/db = (dL/dresult) · a

[procedure] (scale-op tensor scalar) -> tensor

Scalar multiplication using BLAS SCAL.

Gradient: dL/dtensor = scalar · dL/dresult
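For example, doubling every element of x (result has the same shape as x):

(define x2 (scale-op x 2.0))  ; each element of x multiplied by 2.0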

Activation Functions
[procedure] (relu tensor) -> tensor

Rectified Linear Unit: max(0, x).

Gradient: 1 if x > 0, else 0

[procedure] (tanh-op tensor) -> tensor

Hyperbolic tangent activation.

Gradient: 1 - tanh^2(x)

[procedure] (sigmoid tensor) -> tensor

Sigmoid (logistic) activation: sigm(x) = 1 / (1 + e^(-x)).

Gradient: sigm(x) · (1 - sigm(x))

[procedure] (sigmoid-stable tensor) -> tensor

Numerically stable sigmoid implementation for large negative values.

[procedure] (softmax x #!key (dim #f)) -> tensor

Softmax normalization with numerical stability (subtracts max before exp).

(define probs (softmax logits))  ; Converts logits to probabilities
[procedure] (log-softmax x #!key (dim #f)) -> tensor

Log-softmax: more numerically stable than log(softmax(x)).

[procedure] (leaky-relu tensor #!key (alpha 0.01)) -> tensor

Leaky ReLU: max(alpha * x, x).

[procedure] (softplus tensor #!key (beta 1.0)) -> tensor

Softplus activation: log(1 + e^(beta * x)) / beta.
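A short sketch applying several activations to a tensor z of pre-activations (z is assumed; any tensor works):

(define h  (relu z))                   ; max(0, x) element-wise
(define t  (tanh-op z))
(define s  (sigmoid-stable z))         ; numerically stable sigmoid
(define lr (leaky-relu z alpha: 0.1))  ; slope 0.1 for negative inputs
(define sp (softplus z beta: 1.0))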

Loss Functions
[procedure] (mse-loss pred target) -> tensor

Mean Squared Error loss: L = (1/n) Σ (pred - target)².

(define loss (mse-loss predictions targets))
[procedure] (cross-entropy-loss pred target) -> tensor

Cross-entropy loss: L = -Σ target · log(pred).

Note: Assumes pred is already normalized (e.g., via softmax).
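A minimal sketch, assuming logits and targets are tensors of matching shape and targets is a (one-hot or soft) probability distribution:

(define probs (softmax logits))                    ; normalize first
(define loss  (cross-entropy-loss probs targets))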

Gradient Operations
[procedure] (zero-grad! tensor) -> void

Sets all gradient values to zero.

[procedure] (backward! tensor) -> void

Computes gradients via reverse-mode automatic differentiation. Performs topological sort and executes backward functions in correct order. Detects cycles and raises an error if found.

(define x (make-tensor32 (f32vector 1.0 2.0) '(2)))
(define y (make-tensor32 (f32vector 3.0 4.0) '(2)))
(define z (add x y))
(define loss (dot-op z z))

(backward! loss)
(print-tensor (tensor-grad x))
[procedure] (add-to-grad! tensor delta) -> void

Accumulates delta into the tensor's gradient using BLAS AXPY.
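For example, accumulating an external gradient into the 2-element tensor x from the backward! example and then clearing it (delta is a raw f32vector matching x's dtype and size, as in the custom backward function example below; x is assumed to track gradients):

(add-to-grad! x (f32vector 0.5 0.5))  ; grad(x) += delta
(zero-grad! x)                        ; reset the gradient before the next pass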

Convolution Operations
[procedure] (conv2d input weight bias #!key (stride 1) (padding 0)) -> tensor

2D convolution using im2col + GEMM algorithm.

input
tensor of shape (C_in, H, W)
weight
tensor of shape (C_out, C_in, KH, KW)
bias
tensor of shape (C_out) or #f
stride
stride for convolution (default 1)
padding
zero-padding (default 0)
(define output (conv2d input weights bias stride: 2 padding: 1))
Normalization Operations
[procedure] (rmsnorm x weight #!key (epsilon 1e-5)) -> tensor

Root Mean Square Layer Normalization.

[procedure] (l2-normalize tensor #!key (epsilon 1e-8)) -> tensor

L2 normalization: x / ||x||₂.

[procedure] (cosine-similarity a b) -> tensor

Cosine similarity: (a · b) / (||a|| · ||b||).
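A brief sketch, assuming x and y are 1D tensors of the same length and gain is a weight tensor shaped like x (gain is an assumed name):

(define xhat (l2-normalize x))         ; unit-norm version of x
(define sim  (cosine-similarity x y))  ; scalar tensor in [-1, 1]
(define rn   (rmsnorm x gain))         ; RMS-normalize x, then scale by gain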

Utility Functions
[procedure] (tensor->list tensor) -> list

Converts tensor data to a list.

[procedure] (print-tensor tensor) -> void

Pretty-prints tensor information including shape, dtype, data, and gradients.
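For example, with the 3-element tensor x from above:

(tensor->list x)   ; => (1.0 2.0 3.0)
(print-tensor x)   ; prints shape, dtype, data, and gradient (if any)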

[procedure] (vector-length-for-dtype vec dtype) -> integer

Returns the length of a vector based on its dtype.

nanograd-layer

Neural network layer abstractions and containers.

Layer Predicates
[procedure] (layer? obj) -> boolean
[procedure] (dense-layer? obj) -> boolean
[procedure] (conv2d-layer? obj) -> boolean
[procedure] (sequential? obj) -> boolean
Dense Layer
[procedure] (make-dense-layer input-size output-size #!key (activation (make-identity)) (dtype 'f32) (name "Dense")) -> layer

Creates a fully-connected (dense) layer with Xavier/Glorot initialization.

input-size
number of input features
output-size
number of output features
activation
activation function object (default identity)
dtype
'f32 or 'f64 (default 'f32)
name
layer name for debugging
(define layer (make-dense-layer 784 128 
                                activation: (make-relu)
                                name: "Hidden1"))
Convolutional Layer
[procedure] (make-conv2d-layer in-channels out-channels kernel-size #!key (stride 1) (padding 0) (activation (make-identity)) (dtype 'f32) (name "Conv2D")) -> layer

Creates a 2D convolutional layer with He initialization.

(define conv (make-conv2d-layer 3 32 3 
                                stride: 1 
                                padding: 1
                                activation: (make-relu)))
Sequential Container
[procedure] (make-sequential layers #!key (name "Sequential")) -> layer

Creates a sequential container that chains multiple layers.

(define model
  (make-sequential
   (list
    (make-dense-layer 784 128 activation: (make-relu))
    (make-dense-layer 128 64 activation: (make-relu))
    (make-dense-layer 64 10 activation: (make-identity)))
   name: "MLP"))
Layer Operations
[procedure] (forward layer input) -> tensor

Performs a forward pass through the layer.

[procedure] (parameters layer) -> list

Returns a list of all trainable parameter tensors.

[procedure] (zero-grad-layer! layer) -> void

Zeros gradients for all parameters in the layer.

[procedure] (layer-input-size layer) -> integer
[procedure] (layer-output-size layer) -> integer
[procedure] (layer-activation layer) -> activation
[procedure] (layer-name layer) -> string

Accessor functions for layer properties.
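A minimal sketch using the MLP model defined in the sequential container example and a flattened 784-element input tensor:

(define x   (make-tensor32 (make-f32vector 784 0.0) '(784)))
(define out (forward model x))   ; shape (10) output tensor
(define ps  (parameters model))  ; weight and bias tensors of every layer
(zero-grad-layer! model)         ; clear all parameter gradients
(layer-name model)               ; => "MLP"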

Activation Function Objects
[procedure] (make-relu) -> activation
[procedure] (make-tanh) -> activation
[procedure] (make-sigmoid) -> activation
[procedure] (make-identity) -> activation

Creates activation function objects for use in layers.

[procedure] (activation? obj) -> boolean
[procedure] (activation-forward act x) -> tensor
[procedure] (activation-name act) -> string
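A brief sketch (the exact string returned by activation-name is an assumption):

(define act (make-relu))
(activation? act)                      ; => #t
(activation-name act)                  ; => a descriptive name, e.g. "ReLU"
(define y (activation-forward act x))  ; apply the activation to tensor x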
Utility Functions
[procedure] (print-layer layer #!optional (indent 0)) -> void

Prints layer information with optional indentation.

[procedure] (summary model) -> void

Prints a model summary including all layers and parameter counts.

(summary model)
; === Model Summary ===
; Model: MLP
; Input size: 784
; Output size: 10
; 
; Total parameters: 109386

nanograd-optimizer

Optimization algorithms for neural network training.

Optimizer Predicates
[procedure] (optimizer? obj) -> boolean
[procedure] (sgd? obj) -> boolean
[procedure] (adam? obj) -> boolean
[procedure] (rmsprop? obj) -> boolean
SGD Optimizer
[procedure] (make-sgd parameters #!key (learning-rate 0.01) (momentum 0.0) (weight-decay 0.0) (nesterov #f)) -> optimizer

Stochastic Gradient Descent optimizer with optional momentum and Nesterov acceleration.

parameters
list of parameter tensors to optimize
learning-rate
step size (default 0.01)
momentum
momentum factor (default 0.0, no momentum)
weight-decay
L2 regularization factor (default 0.0)
nesterov
use Nesterov momentum (default #f)
(define opt (make-sgd (parameters model) 
                      learning-rate: 0.01
                      momentum: 0.9))
Adam Optimizer
[procedure] (make-adam parameters #!key (learning-rate 0.001) (beta1 0.9) (beta2 0.999) (epsilon 1e-8) (weight-decay 0.0)) -> optimizer

Adam (Adaptive Moment Estimation) optimizer with bias correction.

beta1
exponential decay rate for first moment (default 0.9)
beta2
exponential decay rate for second moment (default 0.999)
epsilon
numerical stability constant (default 1e-8)
(define opt (make-adam (parameters model) learning-rate: 0.001))
RMSprop Optimizer
[procedure] (make-rmsprop parameters #!key (learning-rate 0.01) (alpha 0.99) (epsilon 1e-8) (weight-decay 0.0) (momentum 0.0)) -> optimizer

RMSprop optimizer with optional momentum.

alpha
smoothing constant (default 0.99)
(define opt (make-rmsprop (parameters model) 
                          learning-rate: 0.01
                          alpha: 0.99))
Optimizer Operations
[procedure] (step! optimizer) -> void

Applies parameter updates based on accumulated gradients.
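A typical update step, as used in the training example below:

(backward! loss)          ; accumulate gradients
(step! optimizer)         ; apply parameter updates
(zero-grad-layer! model)  ; clear gradients before the next iteration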

[procedure] (get-learning-rate optimizer) -> number

Returns the current learning rate.

[procedure] (set-learning-rate! optimizer lr) -> void

Updates the learning rate (useful for learning rate scheduling).

; Learning rate decay
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  (set-learning-rate! opt (/ 0.1 (+ 1.0 (* 0.01 epoch))))
  ; ... training code ...
  )
[procedure] (optimizer-state optimizer) -> alist

Returns an association list of optimizer configuration parameters.
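For example, with the Adam optimizer opt defined above (the exact keys in the alist are an assumption):

(optimizer-state opt)
; => e.g. ((learning-rate . 0.001) (beta1 . 0.9) (beta2 . 0.999) ...)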

Examples

Basic Tensor Operations

(import nanograd-autograd)

; Create tensors
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3)))
(define y (make-tensor32 (f32vector 4.0 5.0 6.0) '(3)))

; Operations
(define z (add x y))
(define w (mul x y))

; Compute gradients
(backward! w)
(print-tensor (tensor-grad x))

Training a Neural Network

(import nanograd-autograd nanograd-layer nanograd-optimizer)

; Define model
(define model
  (make-sequential
   (list
    (make-dense-layer 2 8 activation: (make-relu))
    (make-dense-layer 8 1 activation: (make-identity)))
   name: "Regression"))

; Create optimizer
(define optimizer (make-adam (parameters model) learning-rate: 0.01))

; Training loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  
  (for-each
   (lambda (sample)
     (let* ((x (make-tensor32 (car sample) '(2)))
            (target (make-tensor32 (f32vector (cdr sample)) '(1)))
            (pred (forward model x))
            (loss (mse-loss pred target)))
       
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! model)))
   training-data))

Convolutional Neural Network

(define cnn
  (make-sequential
   (list
    (make-conv2d-layer 3 32 3 stride: 1 padding: 1 
                       activation: (make-relu))
    (make-conv2d-layer 32 64 3 stride: 1 padding: 1 
                       activation: (make-relu))
    (make-dense-layer (* 64 8 8) 128 activation: (make-relu))
    (make-dense-layer 128 10 activation: (make-identity)))
   name: "CNN"))

; Forward pass with image tensor (3 channels, 32x32 pixels)
(define img (make-tensor32 (make-f32vector (* 3 32 32)) '(3 32 32)))
(define output (forward cnn img))

Performance Notes

Core linear-algebra kernels (matrix multiplication, dot products, scaling, and gradient accumulation) dispatch to BLAS routines (GEMM/GEMV, DOT, SCAL, AXPY), so throughput depends largely on the BLAS library the blas egg is linked against. Convolutions are implemented via im2col + GEMM, trading extra memory for BLAS-friendly matrix multiplication.

Limitations

Advanced Usage

Custom Backward Functions

; Define a custom operation
(define (my-operation x)
  (let* ((dtype (tensor-dtype x))
         (data (tensor-data x))
         (n (vector-length-for-dtype data dtype))
         (result-data (case dtype
                        ((f32) (make-f32vector n))
                        ((f64) (make-f64vector n)))))
    
    ; Forward computation
    (case dtype
      ((f32)
       (do ((i 0 (+ i 1)))
           ((= i n))
         (f32vector-set! result-data i 
                         (* 2.0 (f32vector-ref data i)))))
      ((f64)
       (do ((i 0 (+ i 1)))
           ((= i n))
         (f64vector-set! result-data i 
                         (* 2.0 (f64vector-ref data i))))))
    
    (let ((result (make-base-tensor result-data 
                                    (tensor-shape x) 
                                    dtype 
                                    (tensor-requires-grad? x))))
      
      ; Define backward function
      (when (tensor-requires-grad? x)
        (set-backward-fn! result
          (lambda ()
            (let ((grad-out (tensor-grad result))
                  (grad-in (case dtype
                            ((f32) (make-f32vector n))
                            ((f64) (make-f64vector n)))))
              ; Gradient: d(2x)/dx = 2
              (case dtype
                ((f32)
                 (do ((i 0 (+ i 1)))
                     ((= i n))
                   (f32vector-set! grad-in i
                                   (* 2.0 (f32vector-ref grad-out i)))))
                ((f64)
                 (do ((i 0 (+ i 1)))
                     ((= i n))
                   (f64vector-set! grad-in i
                                   (* 2.0 (f64vector-ref grad-out i))))))
              (add-to-grad! x grad-in)))
          (list x)))
      
      result)))

Learning Rate Scheduling

; Step decay
(define (step-decay base-lr epoch drop-every drop-rate)
  (* base-lr (expt drop-rate (floor (/ epoch drop-every)))))

; Exponential decay
(define (exp-decay base-lr epoch decay-rate)
  (* base-lr (exp (- (* decay-rate epoch)))))

; Cosine annealing
(define (cosine-annealing base-lr epoch total-epochs)
  (* 0.5 base-lr (+ 1.0 (cos (* 3.14159 (/ epoch total-epochs))))))

; Usage in training loop
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 100))
  (set-learning-rate! optimizer (step-decay 0.1 epoch 30 0.5))
  ; ... training code ...
  )

Gradient Clipping

; Clip gradients by norm (sscal!/dscal! come from the blas egg)
(define (clip-grad-norm! parameters max-norm)
  (let ((total-norm 0.0))
    ; Compute total norm
    (for-each
     (lambda (param)
       (let ((grad (tensor-grad param)))
         (when grad
           (let ((dtype (tensor-dtype param))
                 (n (vector-length-for-dtype grad dtype)))
             (case dtype
               ((f32)
                (do ((i 0 (+ i 1)))
                    ((= i n))
                  (let ((g (f32vector-ref grad i)))
                    (set! total-norm (+ total-norm (* g g))))))
               ((f64)
                (do ((i 0 (+ i 1)))
                    ((= i n))
                  (let ((g (f64vector-ref grad i)))
                    (set! total-norm (+ total-norm (* g g)))))))))))
     parameters)
    
    (let ((total-norm (sqrt total-norm)))
      (when (> total-norm max-norm)
        (let ((scale (/ max-norm total-norm)))
          ; Scale all gradients
          (for-each
           (lambda (param)
             (let ((grad (tensor-grad param)))
               (when grad
                 (let ((n (vector-length-for-dtype 
                          grad 
                          (tensor-dtype param))))
                   (case (tensor-dtype param)
                     ((f32) (sscal! n scale grad))
                     ((f64) (dscal! n scale grad)))))))
           parameters))))))

; Usage
(backward! loss)
(clip-grad-norm! (parameters model) 1.0)
(step! optimizer)

Model Evaluation Mode

; Create tensors without gradient tracking for inference
(define (predict model input-data)
  (let ((x (make-tensor32 input-data 
                          (list (f32vector-length input-data))
                          requires-grad?: #f)))
    (forward model x)))

; Batch prediction
(define (predict-batch model batch-data)
  (map (lambda (input) (predict model input))
       batch-data))

Troubleshooting

Common Errors

Shape mismatch errors

Ensure tensor shapes are compatible for operations:

; Matrix multiplication requires compatible dimensions
(define A (make-tensor32 (make-f32vector 6) '(2 3)))
(define B (make-tensor32 (make-f32vector 6) '(3 2)))
(define C (matmul-op A B))  ; OK: (2,3) × (3,2) = (2,2)

(define D (make-tensor32 (make-f32vector 4) '(2 2)))
(matmul-op A D)  ; Error: incompatible dimensions

Gradient computation cycles

Avoid creating cycles in the computation graph:

; Bad: creates a cycle
(define x (make-tensor32 (f32vector 1.0) '(1)))
(define y (add x x))
(set-backward-fn! x (lambda () (add-to-grad! x (tensor-grad y))) (list y))
(backward! y)  ; Error: computation graph contains cycles

Division by zero

Use safe-div when dividing by potentially zero values:

; Instead of (div a b), use:
(define result (safe-div a b epsilon: 1e-8))

Author

Ivan Raikov

Repository

https://github.com/iraikov/nanograd

Version History

1.0
Initial release

* Core autograd engine
* Dense and convolutional layers
* SGD, Adam, and RMSprop optimizers
* Basic activation and loss functions

See Also

License

GPL-3

References