llama
A high-performance LLAMA2 inference implementation in CHICKEN Scheme, based on Andrej Karpathy's llama2.c and its OCaml port llama2.ml.
Description
This egg provides a complete implementation of the LLAMA2 transformer architecture for text generation. It is built from modular transformer components and uses BLAS for the matrix operations to achieve high performance.
Requirements
System Dependencies
- BLAS library (OpenBLAS, Intel MKL, or system BLAS)
CHICKEN Extensions
- srfi-1 - List library
- srfi-4 - Numeric vectors
- srfi-42 - Comprehensions
- srfi-69 - Hash tables
- vector-lib - Vector utilities
- blas - BLAS bindings
- endian-blob - Endian-aware blob operations
- endian-port - Endian-aware port operations
- random-mtzig - Random number generation
- getopt-long - Command-line option parsing
- test - Unit testing
Installation
# Install system dependencies (Ubuntu/Debian)
sudo apt-get install chicken-bin libchicken-dev libopenblas-dev

# Install CHICKEN extensions
chicken-install llama

# Download example model (15M parameters, ~60MB) and tokenizer.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
wget https://github.com/iraikov/llama-chicken/raw/refs/heads/main/tokenizer.bin
Quick Start
# Basic text generation
llama-cli -c stories15M.bin -p "Once upon a time"

# Creative generation with temperature
llama-cli -c stories15M.bin -t 0.8 -s 100 -p "The meaning of life is"

# Verify model integrity
llama-cli -c stories15M.bin --verify-checkpoint
API
Data Types
config
[record] config
Model configuration parameters.
[procedure] (make-config dim hidden-dim n-layers n-heads n-kv-heads vocab-size seq-len shared-weights)
Creates a new configuration object.
- dim: Model embedding dimension
- hidden-dim: FFN hidden layer dimension
- n-layers: Number of transformer layers
- n-heads: Number of attention heads
- n-kv-heads: Number of key-value heads
- vocab-size: Vocabulary size
- seq-len: Maximum sequence length
- shared-weights: Whether to share input/output embeddings
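As a brief illustration, the dimensions below are the ones used for the stories15M checkpoint in the REPL example later in this document:

;; Configuration for the 15M-parameter stories15M checkpoint
(define config
  (make-config 288    ; dim
               768    ; hidden-dim
               6      ; n-layers
               6      ; n-heads
               6      ; n-kv-heads
               32000  ; vocab-size
               256    ; seq-len
               #t))   ; shared-weights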
[procedure] (config-hidden-dim config)
[procedure] (config-n-layers config)
[procedure] (config-n-heads config)
[procedure] (config-n-kv-heads config)
[procedure] (config-vocab-size config)
[procedure] (config-seq-len config)
[procedure] (config-shared-weights config)
Accessors for configuration fields.
transformer-weights
[record] transformer-weights
Container for all model parameters including embeddings, attention weights, FFN weights, and RoPE frequencies.
[procedure] (make-transformer-weights token-embedding-table rms-att-weight wq wk wv wo rms-ffn-weight w1 w2 w3 rms-final-weight freq-cis-real freq-cis-imag wcls)
Creates a new transformer weights object with all parameter matrices.
run-state
[record] run-state
Runtime state for transformer computation including hidden states, attention caches, and output logits.
[procedure] (make-run-state x xb q k v att key-cache value-cache xb2 hb hb2 logits)
Creates a new runtime state object.
[procedure] (run-state-x state)
[procedure] (run-state-logits state)
[procedure] (run-state-key-cache state)
[procedure] (run-state-value-cache state)
Accessors for runtime state fields.
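The constructor expects pre-allocated buffers. The sketch below shows one way they might be sized, assuming the buffer layout of llama2.c (dim-sized activation buffers, per-layer key/value caches of seq-len entries, vocab-sized logits); the sizes this egg actually expects may differ, and dim is passed explicitly because no config-dim accessor is listed above.

(import srfi-4)

;; Minimal allocation sketch for a run-state, assuming llama2.c buffer sizes.
(define (alloc-run-state config dim)
  (let ((hidden-dim (config-hidden-dim config))
        (n-layers   (config-n-layers config))
        (n-heads    (config-n-heads config))
        (seq-len    (config-seq-len config))
        (vocab-size (config-vocab-size config)))
    (make-run-state
     (make-f32vector dim 0.0)                      ; x
     (make-f32vector dim 0.0)                      ; xb
     (make-f32vector dim 0.0)                      ; q
     (make-f32vector dim 0.0)                      ; k
     (make-f32vector dim 0.0)                      ; v
     (make-f32vector (* n-heads seq-len) 0.0)      ; att
     (make-f32vector (* n-layers seq-len dim) 0.0) ; key-cache
     (make-f32vector (* n-layers seq-len dim) 0.0) ; value-cache
     (make-f32vector dim 0.0)                      ; xb2
     (make-f32vector hidden-dim 0.0)               ; hb
     (make-f32vector hidden-dim 0.0)               ; hb2
     (make-f32vector vocab-size 0.0))))            ; logits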
args
[record] args
Runtime configuration for text generation runs.
[procedure] (make-args checkpoint tokenizer temperature steps prompt seed)
Creates text generation arguments.
- checkpoint: Path to model checkpoint file
- tokenizer: Path to tokenizer file
- temperature: Sampling temperature (0.0 = greedy)
- steps: Number of tokens to generate
- prompt: Input text prompt
- seed: Random seed (optional)
High-Level Functions
[procedure] (run args)
Main inference function. Takes an args object and performs text generation.

(define args (make-args "model.bin" "tokenizer.bin" 0.8 100 "Hello world" #f))
(run args)

[procedure] (transformer token pos config state weights)
Run transformer forward pass for a single token.
- token: Token ID to process
- pos: Position in sequence
- config: Model configuration
- state: Runtime state (modified in place)
- weights: Model parameters
Returns the updated state.
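As a hedged sketch, a minimal greedy decoding loop can be built from transformer and argmax. It assumes config, state, and weights have already been set up, and the initial token ID of 1 (BOS) follows the llama2.c convention rather than anything documented here:

;; Greedy decoding sketch: feed each predicted token back in and take
;; the argmax of the logits at every position.
(define (generate-greedy n-steps config state weights)
  (let loop ((pos 0) (token 1) (out '()))   ; token 1 = BOS (assumption)
    (if (= pos n-steps)
        (reverse out)
        (begin
          (transformer token pos config state weights)
          (let ((next (argmax (run-state-logits state))))
            (loop (+ pos 1) next (cons next out)))))))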
[procedure] (bpe-encode text vocab vocab-scores)
Tokenize text using Byte-Pair Encoding.
- text: Input text string
- vocab: List of vocabulary strings
- vocab-scores: List of BPE merge scores
Returns a list of token IDs.
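For example, assuming vocab and vocab-scores have already been read from tokenizer.bin (a tokenizer loader is not part of the API documented here, so this is only a sketch), a prompt can be tokenized and fed through the model position by position:

;; Tokenize a prompt and run it through the transformer one position at a time.
(define prompt-tokens (bpe-encode "Once upon a time" vocab vocab-scores))

(let loop ((tokens prompt-tokens) (pos 0))
  (unless (null? tokens)
    (transformer (car tokens) pos config state weights)
    (loop (cdr tokens) (+ pos 1))))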
Transformer Components
The modular architecture provides fine-grained control:
[procedure] (token-embedding-lookup state weights token)
Load token embedding into state.
[procedure] (get-rope-frequencies weights pos head-size)
Extract RoPE frequency rows for given position. Returns two values: real and imaginary frequency vectors.
[procedure] (attention-rmsnorm state weights layer-idx config)
Apply RMS normalization for attention layer.
[procedure] (compute-qkv state weights layer-idx config)
Compute Query, Key, Value matrices for given layer.
[procedure] (apply-rope state config freq-real freq-imag)
Apply Rotary Position Embedding to Q and K vectors.
[procedure] (cache-kv state layer-idx pos config)
Store current key and value vectors in attention cache.
[procedure] (compute-attention state layer-idx pos config)
Compute multi-head attention scores and apply to values.
[procedure] (attention-output state weights layer-idx config)
Apply final attention output projection.
[procedure] (ffn-rmsnorm state weights layer-idx config)
Apply RMS normalization for feed-forward network.
[procedure] (compute-ffn-w1w3 state weights layer-idx config)
Compute first part of FFN: W1(x) and W3(x).
[procedure] (apply-swiglu state config)
Apply SwiGLU activation: SiLU(W1(x)) * W3(x).
[procedure] (ffn-output state weights layer-idx config)
Apply final FFN linear transformation.
[procedure] (process-transformer-layer state weights layer-idx pos config freq-real freq-imag)
Process complete transformer layer (attention + FFN blocks).
[procedure] (final-rmsnorm state weights)
Apply final RMS normalization before classification.
[procedure] (compute-logits state weights config)
Compute final classification logits.
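Taken together, process-transformer-layer corresponds roughly to calling the individual components in sequence. The sketch below shows one plausible ordering for a single layer, inferred from the standard LLAMA2 block structure; the residual additions (accum) are omitted and the egg's actual sequence may differ, so treat process-transformer-layer as authoritative:

;; One transformer layer expressed with the component procedures above
;; (ordering is an assumption; residual connections are not shown).
(define (layer-forward state weights layer-idx pos config freq-real freq-imag)
  ;; attention block
  (attention-rmsnorm state weights layer-idx config)
  (compute-qkv state weights layer-idx config)
  (apply-rope state config freq-real freq-imag)
  (cache-kv state layer-idx pos config)
  (compute-attention state layer-idx pos config)
  (attention-output state weights layer-idx config)
  ;; feed-forward block
  (ffn-rmsnorm state weights layer-idx config)
  (compute-ffn-w1w3 state weights layer-idx config)
  (apply-swiglu state config)
  (ffn-output state weights layer-idx config)
  state)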
Utility Functions
[procedure] (rmsnorm output input weights)
RMS normalization with learnable weights.
[procedure] (matmul output input matrix rows cols)
Matrix-vector multiplication using BLAS.
[procedure] (softmax output input size)
Softmax activation with numerical stability.
[procedure] (accum target source)
Vector accumulation for residual connections.
[procedure] (argmax vector)
Find index of maximum element (greedy sampling).
[procedure] (sample probabilities random-state)
Probabilistic sampling from probability distribution.
[procedure] (verify-checkpoint-data checkpoint-file [detailed])
Load and analyze checkpoint file, printing weight statistics.
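As a sketch of how these utilities combine into a token-selection step (random-state is an RNG state such as the one provided by the random-mtzig egg; its construction is not shown here):

(import srfi-4)

;; Select the next token from the logits held in state: greedy when the
;; temperature is 0.0, otherwise temperature-scaled softmax sampling.
(define (next-token state vocab-size temperature random-state)
  (let ((logits (run-state-logits state)))
    (if (= temperature 0.0)
        (argmax logits)
        (begin
          ;; scale logits by 1/temperature before the softmax
          (do ((i 0 (+ i 1))) ((= i vocab-size))
            (f32vector-set! logits i (/ (f32vector-ref logits i) temperature)))
          (let ((probs (make-f32vector vocab-size 0.0)))
            (softmax probs logits vocab-size)
            (sample probs random-state))))))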
Command-Line Interface
The llama-cli command provides easy access to text generation:
llama-cli [options]

Options:
  -h, --help              Show help message
  -c, --checkpoint FILE   Model checkpoint file (required)
  -k, --tokenizer FILE    Tokenizer file (default: tokenizer.bin)
  -t, --temperature NUM   Sampling temperature (default: 0.0)
  -s, --steps NUM         Number of tokens to generate (default: 256)
  -p, --prompt TEXT       Input prompt text (default: empty)
  --seed NUM              Random seed for sampling
  --verify-checkpoint     Verify checkpoint integrity
Examples
Basic Usage
(import llama)

;; Simple text generation
(define args (make-args "stories15M.bin" "tokenizer.bin" 0.5 50 "Once upon a time" #f))
(run args)
Interactive REPL Usage
(import llama)

;; Load model components
(define config (make-config 288 768 6 6 6 32000 256 #t))
(define weights (load-checkpoint "stories15M.bin"))
(define state (make-run-state ...))

;; Generate single token
(transformer 1 0 config state weights)
(argmax (run-state-logits state))

;; Custom sampling with temperature
(define logits (run-state-logits state))
(do ((i 0 (+ i 1))) ((= i (f32vector-length logits)))
  (f32vector-set! logits i (/ (f32vector-ref logits i) 0.8)))
(define probs (softmax (make-f32vector 32000) logits 32000))
(sample probs random-state)
Batch Processing
;; Process multiple prompts
(define prompts '("Hello world" "The meaning of life" "Once upon a time"))

(for-each
 (lambda (prompt)
   (printf "Prompt: ~A~%" prompt)
   (let ((args (make-args "stories15M.bin" "tokenizer.bin" 0.5 50 prompt #f)))
     (run args)
     (newline)))
 prompts)
Component-Level Usage
;; Fine-grained control over generation
(define (custom-generation token config state weights)
  ;; Custom attention processing
  (attention-rmsnorm state weights 0 config)
  (compute-qkv state weights 0 config)
  ;; Skip some layers for faster inference
  (let-values (((freq-real freq-imag) (get-rope-frequencies weights 0 2)))
    (process-transformer-layer state weights 0 0 config freq-real freq-imag)
    (process-transformer-layer state weights 2 0 config freq-real freq-imag))
  ;; Custom final processing
  (final-rmsnorm state weights)
  (compute-logits state weights config))
Configuration
Temperature Guidelines
- 0.0: Deterministic (greedy sampling)
- 0.1-0.3: Focused, coherent output
- 0.5-0.8: Balanced creativity and coherence
- 0.9-1.2: Creative, diverse output
- 1.5+: Highly random, experimental
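For example, the same prompt can be run at both ends of this range (passing an integer as the optional seed argument is assumed here to make the sampled run reproducible):

;; Deterministic (greedy) generation
(run (make-args "stories15M.bin" "tokenizer.bin" 0.0 50 "Once upon a time" #f))

;; More diverse generation at temperature 0.9 with a fixed seed
(run (make-args "stories15M.bin" "tokenizer.bin" 0.9 50 "Once upon a time" 42))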
License
MIT License
Author
Repository
https://github.com/iraikov/llama-chicken
Version History
- 1.0: Initial release with complete LLAMA2 implementation
See Also
- llama2.c - Original C implementation
- llama2.ml - OCaml port
- llama.cl - Common Lisp port
- blas - BLAS bindings for CHICKEN Scheme
- LLAMA2 Paper - Original research paper