llama
A high-performance LLAMA2 inference implementation in CHICKEN Scheme, based on Andrej Karpathy's llama2.c and its OCaml port llama2.ml.
Description
This egg provides a complete implementation of the LLAMA2 transformer architecture for text generation. It is built from modular transformer components and uses BLAS for the matrix operations to achieve high performance.
Requirements
System Dependencies
- BLAS library (OpenBLAS, Intel MKL, or system BLAS)
CHICKEN Extensions
- srfi-1 - List library
- srfi-4 - Numeric vectors
- srfi-42 - Comprehensions
- srfi-69 - Hash tables
- vector-lib - Vector utilities
- blas - BLAS bindings
- endian-blob - Endian-aware blob operations
- endian-port - Endian-aware port operations
- random-mtzig - Random number generation
- getopt-long - Command-line option parsing
- test - Unit testing
Installation
# Install system dependencies (Ubuntu/Debian)
sudo apt-get install chicken-bin libchicken-dev libopenblas-dev

# Install CHICKEN extensions
chicken-install llama

# Download example model (15M parameters, ~60MB) and tokenizer.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
wget https://github.com/iraikov/llama-chicken/raw/refs/heads/main/tokenizer.bin
Quick Start
# Basic text generation
llama-cli -c stories15M.bin -p "Once upon a time"

# Creative generation with temperature
llama-cli -c stories15M.bin -t 0.8 -s 100 -p "The meaning of life is"

# Verify model integrity
llama-cli -c stories15M.bin --verify-checkpoint
API
Data Types
config
[record] config
Model configuration parameters.
[procedure] (make-config dim hidden-dim n-layers n-heads n-kv-heads vocab-size seq-len shared-weights)
Creates a new configuration object.
- dim: Model embedding dimension
- hidden-dim: FFN hidden layer dimension
- n-layers: Number of transformer layers
- n-heads: Number of attention heads
- n-kv-heads: Number of key-value heads
- vocab-size: Vocabulary size
- seq-len: Maximum sequence length
- shared-weights: Whether to share input/output embeddings
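As a brief illustration, the dimensions below are the ones used for the stories15M checkpoint in the REPL example later in this document:

;; Configuration for the 15M-parameter stories15M checkpoint
(define config
  (make-config 288    ; dim
               768    ; hidden-dim
               6      ; n-layers
               6      ; n-heads
               6      ; n-kv-heads
               32000  ; vocab-size
               256    ; seq-len
               #t))   ; shared-weights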
[procedure] (config-hidden-dim config)
[procedure] (config-n-layers config)
[procedure] (config-n-heads config)
[procedure] (config-n-kv-heads config)
[procedure] (config-vocab-size config)
[procedure] (config-seq-len config)
[procedure] (config-shared-weights config)
Accessors for configuration fields.
transformer-weights
[record] transformer-weights
Container for all model parameters including embeddings, attention weights, FFN weights, and RoPE frequencies.
[procedure] (make-transformer-weights token-embedding-table rms-att-weight wq wk wv wo rms-ffn-weight w1 w2 w3 rms-final-weight freq-cis-real freq-cis-imag wcls)
Creates a new transformer weights object with all parameter matrices.
run-state
[record] run-state
Runtime state for transformer computation including hidden states, attention caches, and output logits.
[procedure] (make-run-state x xb q k v att key-cache value-cache xb2 hb hb2 logits)
Creates a new runtime state object.
[procedure] (run-state-x state)
[procedure] (run-state-logits state)
[procedure] (run-state-key-cache state)
[procedure] (run-state-value-cache state)
Accessors for runtime state fields.
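The constructor expects pre-allocated buffers. The sketch below shows one way they might be sized, assuming the buffer layout of llama2.c (dim-sized activation buffers, per-layer key/value caches of seq-len entries, vocab-sized logits); the sizes this egg actually expects may differ, and dim is passed explicitly because no config-dim accessor is listed above.

(import srfi-4)

;; Minimal allocation sketch for a run-state, assuming llama2.c buffer sizes.
(define (alloc-run-state config dim)
  (let ((hidden-dim (config-hidden-dim config))
        (n-layers   (config-n-layers config))
        (n-heads    (config-n-heads config))
        (seq-len    (config-seq-len config))
        (vocab-size (config-vocab-size config)))
    (make-run-state
     (make-f32vector dim 0.0)                      ; x
     (make-f32vector dim 0.0)                      ; xb
     (make-f32vector dim 0.0)                      ; q
     (make-f32vector dim 0.0)                      ; k
     (make-f32vector dim 0.0)                      ; v
     (make-f32vector (* n-heads seq-len) 0.0)      ; att
     (make-f32vector (* n-layers seq-len dim) 0.0) ; key-cache
     (make-f32vector (* n-layers seq-len dim) 0.0) ; value-cache
     (make-f32vector dim 0.0)                      ; xb2
     (make-f32vector hidden-dim 0.0)               ; hb
     (make-f32vector hidden-dim 0.0)               ; hb2
     (make-f32vector vocab-size 0.0))))            ; logits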
args
[record] args
Runtime configuration for text generation runs.
[procedure] (make-args checkpoint tokenizer temperature steps prompt seed)
Creates text generation arguments.
- checkpoint: Path to model checkpoint file
- tokenizer: Path to tokenizer file
- temperature: Sampling temperature (0.0 = greedy)
- steps: Number of tokens to generate
- prompt: Input text prompt
- seed: Random seed (optional)
High-Level Functions
[procedure] (run args)
Main inference function. Takes an args object and performs text generation.

(define args (make-args "model.bin" "tokenizer.bin" 0.8 100 "Hello world" #f))
(run args)

[procedure] (transformer token pos config state weights)
Run transformer forward pass for a single token.
- token: Token ID to process
- pos: Position in sequence
- config: Model configuration
- state: Runtime state (modified in place)
- weights: Model parameters
Returns the updated state.
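As a hedged sketch, a minimal greedy decoding loop can be built from transformer and argmax. It assumes config, state, and weights have already been set up, and the initial token ID of 1 (BOS) follows the llama2.c convention rather than anything documented here:

;; Greedy decoding sketch: feed each predicted token back in and take
;; the argmax of the logits at every position.
(define (generate-greedy n-steps config state weights)
  (let loop ((pos 0) (token 1) (out '()))   ; token 1 = BOS (assumption)
    (if (= pos n-steps)
        (reverse out)
        (begin
          (transformer token pos config state weights)
          (let ((next (argmax (run-state-logits state))))
            (loop (+ pos 1) next (cons next out)))))))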
[procedure] (bpe-encode text vocab vocab-scores)
Tokenize text using Byte-Pair Encoding.
- text: Input text string
- vocab: List of vocabulary strings
- vocab-scores: List of BPE merge scores
Returns a list of token IDs.
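For example, assuming vocab and vocab-scores have already been read from tokenizer.bin (a tokenizer loader is not part of the API documented here, so this is only a sketch), a prompt can be tokenized and fed through the model position by position:

;; Tokenize a prompt and run it through the transformer one position at a time.
(define prompt-tokens (bpe-encode "Once upon a time" vocab vocab-scores))

(let loop ((tokens prompt-tokens) (pos 0))
  (unless (null? tokens)
    (transformer (car tokens) pos config state weights)
    (loop (cdr tokens) (+ pos 1))))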
Transformer Components
The modular architecture provides fine-grained control:
[procedure] (token-embedding-lookup state weights token)
Load token embedding into state.
[procedure] (get-rope-frequencies weights pos head-size)
Extract RoPE frequency rows for given position. Returns two values: real and imaginary frequency vectors.
[procedure] (attention-rmsnorm state weights layer-idx config)
Apply RMS normalization for attention layer.
[procedure] (compute-qkv state weights layer-idx config)
Compute Query, Key, Value matrices for given layer.
[procedure] (apply-rope state config freq-real freq-imag)
Apply Rotary Position Embedding to Q and K vectors.
[procedure] (cache-kv state layer-idx pos config)
Store current key and value vectors in attention cache.
[procedure] (compute-attention state layer-idx pos config)
Compute multi-head attention scores and apply to values.
[procedure] (attention-output state weights layer-idx config)
Apply final attention output projection.
[procedure] (ffn-rmsnorm state weights layer-idx config)
Apply RMS normalization for feed-forward network.
[procedure] (compute-ffn-w1w3 state weights layer-idx config)
Compute first part of FFN: W1(x) and W3(x).
[procedure] (apply-swiglu state config)
Apply SwiGLU activation: SiLU(W1(x)) * W3(x).
[procedure] (ffn-output state weights layer-idx config)
Apply final FFN linear transformation.
[procedure] (process-transformer-layer state weights layer-idx pos config freq-real freq-imag)
Process complete transformer layer (attention + FFN blocks).
[procedure] (final-rmsnorm state weights)
Apply final RMS normalization before classification.
[procedure] (compute-logits state weights config)
Compute final classification logits.
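Taken together, process-transformer-layer corresponds roughly to calling the individual components in sequence. The sketch below shows one plausible ordering for a single layer, inferred from the standard LLAMA2 block structure; the residual additions (accum) are omitted and the egg's actual sequence may differ, so treat process-transformer-layer as authoritative:

;; One transformer layer expressed with the component procedures above
;; (ordering is an assumption; residual connections are not shown).
(define (layer-forward state weights layer-idx pos config freq-real freq-imag)
  ;; attention block
  (attention-rmsnorm state weights layer-idx config)
  (compute-qkv state weights layer-idx config)
  (apply-rope state config freq-real freq-imag)
  (cache-kv state layer-idx pos config)
  (compute-attention state layer-idx pos config)
  (attention-output state weights layer-idx config)
  ;; feed-forward block
  (ffn-rmsnorm state weights layer-idx config)
  (compute-ffn-w1w3 state weights layer-idx config)
  (apply-swiglu state config)
  (ffn-output state weights layer-idx config)
  state)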
Utility Functions
[procedure] (rmsnorm output input weights)
RMS normalization with learnable weights.
[procedure] (matmul output input matrix rows cols)
Matrix-vector multiplication using BLAS.
[procedure] (softmax output input size)
Softmax activation with numerical stability.
[procedure] (accum target source)
Vector accumulation for residual connections.
[procedure] (argmax vector)
Find index of maximum element (greedy sampling).
[procedure] (sample probabilities random-state)
Probabilistic sampling from probability distribution.
[procedure] (verify-checkpoint-data checkpoint-file [detailed])
Load and analyze checkpoint file, printing weight statistics.
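As a sketch of how these utilities combine into a token-selection step (random-state is an RNG state such as the one provided by the random-mtzig egg; its construction is not shown here):

(import srfi-4)

;; Select the next token from the logits held in state: greedy when the
;; temperature is 0.0, otherwise temperature-scaled softmax sampling.
(define (next-token state vocab-size temperature random-state)
  (let ((logits (run-state-logits state)))
    (if (= temperature 0.0)
        (argmax logits)
        (begin
          ;; scale logits by 1/temperature before the softmax
          (do ((i 0 (+ i 1))) ((= i vocab-size))
            (f32vector-set! logits i (/ (f32vector-ref logits i) temperature)))
          (let ((probs (make-f32vector vocab-size 0.0)))
            (softmax probs logits vocab-size)
            (sample probs random-state))))))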
Command-Line Interface
The llama-cli command provides easy access to text generation:
llama-cli [options]

Options:
  -h, --help              Show help message
  -c, --checkpoint FILE   Model checkpoint file (required)
  -k, --tokenizer FILE    Tokenizer file (default: tokenizer.bin)
  -t, --temperature NUM   Sampling temperature (default: 0.0)
  -s, --steps NUM         Number of tokens to generate (default: 256)
  -p, --prompt TEXT       Input prompt text (default: empty)
  --seed NUM              Random seed for sampling
  --verify-checkpoint     Verify checkpoint integrity
Examples
Basic Usage
(import llama)

;; Simple text generation
(define args (make-args "stories15M.bin" "tokenizer.bin" 0.5 50 "Once upon a time" #f))
(run args)
Interactive REPL Usage
(import llama)

;; Load model components
(define config (make-config 288 768 6 6 6 32000 256 #t))
(define weights (load-checkpoint "stories15M.bin"))
(define state (make-run-state ...))

;; Generate single token
(transformer 1 0 config state weights)
(argmax (run-state-logits state))

;; Custom sampling with temperature
(define logits (run-state-logits state))
(do ((i 0 (+ i 1))) ((= i (f32vector-length logits)))
  (f32vector-set! logits i (/ (f32vector-ref logits i) 0.8)))
(define probs (softmax (make-f32vector 32000) logits 32000))
(sample probs random-state)
Batch Processing
;; Process multiple prompts
(define prompts '("Hello world" "The meaning of life" "Once upon a time"))

(for-each
 (lambda (prompt)
   (printf "Prompt: ~A~%" prompt)
   (let ((args (make-args "stories15M.bin" "tokenizer.bin" 0.5 50 prompt #f)))
     (run args)
     (newline)))
 prompts)
Component-Level Usage
;; Fine-grained control over generation
(define (custom-generation token config state weights)
  ;; Custom attention processing
  (attention-rmsnorm state weights 0 config)
  (compute-qkv state weights 0 config)
  ;; Skip some layers for faster inference
  (let-values (((freq-real freq-imag) (get-rope-frequencies weights 0 2)))
    (process-transformer-layer state weights 0 0 config freq-real freq-imag)
    (process-transformer-layer state weights 2 0 config freq-real freq-imag))
  ;; Custom final processing
  (final-rmsnorm state weights)
  (compute-logits state weights config))
Configuration
Temperature Guidelines
- 0.0: Deterministic (greedy sampling)
- 0.1-0.3: Focused, coherent output
- 0.5-0.8: Balanced creativity and coherence
- 0.9-1.2: Creative, diverse output
- 1.5+: Highly random, experimental
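For example, the same prompt can be run at both ends of this range (passing an integer as the optional seed argument is assumed here to make the sampled run reproducible):

;; Deterministic (greedy) generation
(run (make-args "stories15M.bin" "tokenizer.bin" 0.0 50 "Once upon a time" #f))

;; More diverse generation at temperature 0.9 with a fixed seed
(run (make-args "stories15M.bin" "tokenizer.bin" 0.9 50 "Once upon a time" 42))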
License
MIT License
Author
Repository
https://github.com/iraikov/llama-chicken
Version History
- 1.0: Initial release with complete LLAMA2 implementation
See Also
- llama2.c - Original C implementation
- llama2.ml - OCaml port
- llama.cl - Common Lisp port
- blas - BLAS bindings for CHICKEN Scheme
- LLAMA2 Paper - Original research paper