llama

A high-performance LLAMA2 inference implementation in CHICKEN Scheme, based on Andrej Karpathy's llama2.c and its OCaml port llama2.ml.

Description

This egg provides a complete implementation of the LLAMA2 transformer architecture for text generation. It is built from small, composable components and uses BLAS for high-performance matrix operations.

Requirements

System Dependencies
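
CHICKEN Scheme
A CHICKEN 5 installation (chicken-bin and libchicken-dev on Debian/Ubuntu)
OpenBLAS
BLAS library used for the matrix routines (libopenblas-dev on Debian/Ubuntu)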

CHICKEN Extensions
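
Extension dependencies (such as the blas egg used for the matrix operations) are resolved automatically by chicken-install.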

Installation

# Install system dependencies (Ubuntu/Debian)
sudo apt-get install chicken-bin libchicken-dev libopenblas-dev

# Install CHICKEN extensions
chicken-install llama

# Download example model (15M parameters, ~60MB) and tokenizer.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
wget https://github.com/iraikov/llama-chicken/raw/refs/heads/main/tokenizer.bin

Quick Start

# Basic text generation
llama-cli -c stories15M.bin -p "Once upon a time"

# Creative generation with temperature
llama-cli -c stories15M.bin -t 0.8 -s 100 -p "The meaning of life is"

# Verify model integrity
llama-cli -c stories15M.bin --verify-checkpoint

API

Data Types

config
[record] config

Model configuration parameters.

[procedure] (make-config dim hidden-dim n-layers n-heads n-kv-heads vocab-size seq-len shared-weights)

Creates a new configuration object.

dim
Model embedding dimension
hidden-dim
FFN hidden layer dimension
n-layers
Number of transformer layers
n-heads
Number of attention heads
n-kv-heads
Number of key-value heads
vocab-size
Vocabulary size
seq-len
Maximum sequence length
shared-weights
Whether to share input/output embeddings

[procedure] (config-dim config)
[procedure] (config-hidden-dim config)
[procedure] (config-n-layers config)
[procedure] (config-n-heads config)
[procedure] (config-n-kv-heads config)
[procedure] (config-vocab-size config)
[procedure] (config-seq-len config)
[procedure] (config-shared-weights config)

Accessors for configuration fields.
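
For example, the configuration matching the stories15M checkpoint used throughout this document:

(define cfg (make-config 288 768 6 6 6 32000 256 #t))
(config-dim cfg)       ;; => 288
(config-n-layers cfg)  ;; => 6
(config-seq-len cfg)   ;; => 256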

transformer-weights
[record] transformer-weights

Container for all model parameters including embeddings, attention weights, FFN weights, and RoPE frequencies.

[procedure] (make-transformer-weights token-embedding-table rms-att-weight wq wk wv wo rms-ffn-weight w1 w2 w3 rms-final-weight freq-cis-real freq-cis-imag wcls)

Creates a new transformer weights object with all parameter matrices.

run-state
[record] run-state

Runtime state for transformer computation including hidden states, attention caches, and output logits.

[procedure] (make-run-state x xb q k v att key-cache value-cache xb2 hb hb2 logits)

Creates a new runtime state object.

[procedure] (run-state-x state)
[procedure] (run-state-logits state)
[procedure] (run-state-key-cache state)
[procedure] (run-state-value-cache state)

Accessors for runtime state fields.

args
[record] args

Runtime configuration for a text generation run.

[procedure] (make-args checkpoint tokenizer temperature steps prompt seed)

Creates text generation arguments.

checkpoint
Path to model checkpoint file
tokenizer
Path to tokenizer file
temperature
Sampling temperature (0.0 = greedy)
steps
Number of tokens to generate
prompt
Input text prompt
seed
Random seed (optional)

High-Level Functions

[procedure] (run args)

Main inference function. Takes an args object and performs text generation.

(define args (make-args "model.bin" "tokenizer.bin" 0.8 100 "Hello world" #f))
(run args)

[procedure] (transformer token pos config state weights)

Run transformer forward pass for a single token.

token
Token ID to process
pos
Position in sequence
config
Model configuration
state
Runtime state (modified in place)
weights
Model parameters

Returns the updated state.
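
For example, a minimal greedy decoding loop (a sketch that assumes config, state, and weights have been loaded as in the Interactive REPL Usage example below):

;; Greedy decoding: feed each predicted token back in as input.
;; Token 1 is the BOS token in llama2.c-style checkpoints.
(let loop ((token 1) (pos 0))
  (when (< pos (config-seq-len config))
    (transformer token pos config state weights)
    (let ((next (argmax (run-state-logits state))))
      (printf "~A " next)   ;; print the sampled token ID
      (loop next (+ pos 1)))))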

[procedure] (bpe-encode text vocab vocab-scores)

Tokenize text using Byte-Pair Encoding.

text
Input text string
vocab
List of vocabulary strings
vocab-scores
List of BPE merge scores

Returns a list of token IDs.
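
A sketch of its use (assuming vocab and vocab-scores have already been read from tokenizer.bin; the loader is not shown here):

(bpe-encode "Once upon a time" vocab vocab-scores)
;; => a list of token IDs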

Transformer Components

The modular architecture provides fine-grained control:

[procedure] (token-embedding-lookup state weights token)

Load token embedding into state.

[procedure] (get-rope-frequencies weights pos head-size)

Extract the RoPE frequency rows for a given position. Returns two values: the real and imaginary frequency vectors.

[procedure] (attention-rmsnorm state weights layer-idx config)

Apply RMS normalization for attention layer.

[procedure] (compute-qkv state weights layer-idx config)

Compute Query, Key, Value matrices for given layer.

[procedure] (apply-rope state config freq-real freq-imag)

Apply Rotary Position Embedding to Q and K vectors.

[procedure] (cache-kv state layer-idx pos config)

Store current key and value vectors in attention cache.

[procedure] (compute-attention state layer-idx pos config)

Compute multi-head attention scores and apply to values.

[procedure] (attention-output state weights layer-idx config)

Apply final attention output projection.

[procedure] (ffn-rmsnorm state weights layer-idx config)

Apply RMS normalization for feed-forward network.

[procedure] (compute-ffn-w1w3 state weights layer-idx config)

Compute first part of FFN: W1(x) and W3(x).

[procedure] (apply-swiglu state config)

Apply SwiGLU activation: SiLU(W1(x)) * W3(x), where SiLU(x) = x * sigmoid(x).

[procedure] (ffn-output state weights layer-idx config)

Apply final FFN linear transformation.

[procedure] (process-transformer-layer state weights layer-idx pos config freq-real freq-imag)

Process complete transformer layer (attention + FFN blocks).

[procedure] (final-rmsnorm state weights)

Apply final RMS normalization before classification.

[procedure] (compute-logits state weights config)

Compute final classification logits.
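
Taken together, these components reproduce a full forward pass. The following sketch illustrates the composition (it is not the egg's internal definition of transformer):

;; Illustrative composition of the components above for one token.
(define (forward token pos config state weights)
  (token-embedding-lookup state weights token)
  (let ((head-size (quotient (config-dim config)
                             (config-n-heads config))))
    (let-values (((freq-real freq-imag)
                  (get-rope-frequencies weights pos head-size)))
      ;; run every layer in sequence
      (do ((l 0 (+ l 1)))
          ((= l (config-n-layers config)))
        (process-transformer-layer state weights l pos config
                                   freq-real freq-imag))))
  (final-rmsnorm state weights)
  (compute-logits state weights config))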

Utility Functions

[procedure] (rmsnorm output input weights)

RMS normalization with learnable weights.

[procedure] (matmul output input matrix rows cols)

Matrix-vector multiplication using BLAS.
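
A sketch of a small multiplication (the row-major flat layout is an assumption carried over from llama2.c, not confirmed by this egg):

(import srfi-4)

(define W (f32vector 1.0 2.0
                     3.0 4.0))  ;; 2x2 matrix, row-major (assumed)
(define x (f32vector 1.0 1.0))
(define y (make-f32vector 2))
(matmul y x W 2 2)              ;; y => #f32(3.0 7.0)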

[procedure] (softmax output input size)

Softmax activation with numerical stability.

[procedure] (accum target source)

Vector accumulation for residual connections.

[procedure] (argmax vector)

Find index of maximum element (greedy sampling).

[procedure] (sample probabilities random-state)

Probabilistic sampling from probability distribution.

[procedure] (verify-checkpoint-data checkpoint-file [detailed])

Load and analyze checkpoint file, printing weight statistics.

Command-Line Interface

The llama-cli command provides easy access to text generation:

llama-cli [options]

Options:
  -h, --help            Show help message
  -c, --checkpoint FILE Model checkpoint file (required)  
  -k, --tokenizer FILE  Tokenizer file (default: tokenizer.bin)
  -t, --temperature NUM Sampling temperature (default: 0.0)
  -s, --steps NUM       Number of tokens to generate (default: 256)
  -p, --prompt TEXT     Input prompt text (default: empty)
  --seed NUM            Random seed for sampling
  --verify-checkpoint   Verify checkpoint integrity

Examples

Basic Usage

(import llama)

;; Simple text generation
(define args (make-args "stories15M.bin" "tokenizer.bin" 0.5 50 "Once upon a time" #f))
(run args)

Interactive REPL Usage

(import llama)

;; Load model components
(define config (make-config 288 768 6 6 6 32000 256 #t))
(define weights (load-checkpoint "stories15M.bin"))
(define state (make-run-state ...))  ;; buffer arguments elided

;; Generate single token
(transformer 1 0 config state weights)
(argmax (run-state-logits state))

;; Custom sampling with temperature: scale the logits in place
(define logits (run-state-logits state))
(do ((i 0 (+ i 1)))
    ((= i (f32vector-length logits)))
  (f32vector-set! logits i (/ (f32vector-ref logits i) 0.8)))

;; softmax fills (and returns) its output vector
(define probs (softmax (make-f32vector 32000) logits 32000))
;; random-state must be an initialized RNG state as expected by sample
;; (its construction is egg-specific and not shown here)
(sample probs random-state)

Batch Processing

;; Process multiple prompts
(define prompts '("Hello world" "The meaning of life" "Once upon a time"))

(for-each (lambda (prompt)
            (printf "Prompt: ~A~%" prompt)
            (let ((args (make-args "stories15M.bin" "tokenizer.bin" 0.5 50 prompt #f)))
              (run args)
              (newline)))
          prompts)

Component-Level Usage

;; Fine-grained control over generation
(define (custom-generation token config state weights)
  ;; Load the token embedding into the state first
  (token-embedding-lookup state weights token)

  ;; Skip some layers for faster inference (here: layers 0 and 2 at pos 0)
  (let ((head-size (quotient (config-dim config) (config-n-heads config))))
    (let-values (((freq-real freq-imag)
                  (get-rope-frequencies weights 0 head-size)))
      (process-transformer-layer state weights 0 0 config freq-real freq-imag)
      (process-transformer-layer state weights 2 0 config freq-real freq-imag)))

  ;; Final normalization and classification
  (final-rmsnorm state weights)
  (compute-logits state weights config))

Configuration

Temperature Guidelines
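
Suggested starting points for the -t option:

0.0
Greedy decoding: always pick the most likely token (deterministic; the default)
0.5 - 0.8
Balanced sampling; the range used in the examples above
1.0 and above
Increasingly random and diverse output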

License

MIT License

Author

Ivan Raikov

Repository

https://github.com/iraikov/llama-chicken

Version History

1.0
Initial release with complete LLAMA2 implementation

See Also
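
llama2.c
Andrej Karpathy's original C implementation (https://github.com/karpathy/llama2.c)
llama2.ml
The OCaml port on which this egg is based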