[[tags: ai machine-learning nlp llm transformer blas]]

== llama

A high-performance LLAMA2 inference implementation in CHICKEN Scheme, based on Andrej Karpathy's [[https://github.com/karpathy/llama2.c|llama2.c]] and its OCaml port [[https://github.com/jackpeck/llama2.ml|llama2.ml]].

=== Description

This egg provides a complete implementation of the LLAMA2 transformer architecture for text generation. It features modular components and uses BLAS integration for high performance.

=== Requirements

==== System Dependencies

* BLAS library (OpenBLAS, Intel MKL, or system BLAS)

==== CHICKEN Extensions

* [[/eggref/5/srfi-1|srfi-1]] - List library
* [[/eggref/5/srfi-4|srfi-4]] - Numeric vectors
* [[/eggref/5/srfi-42|srfi-42]] - Comprehensions
* [[/eggref/5/srfi-69|srfi-69]] - Hash tables
* [[/eggref/5/vector-lib|vector-lib]] - Vector utilities
* [[/eggref/5/blas|blas]] - BLAS bindings
* [[/eggref/5/endian-blob|endian-blob]] - Endian-aware blob operations
* [[/eggref/5/endian-port|endian-port]] - Endian-aware port operations
* [[/eggref/5/random-mtzig|random-mtzig]] - Random number generation
* [[/eggref/5/getopt-long|getopt-long]] - Command-line option parsing
* [[/eggref/5/test|test]] - Unit testing

=== Installation

<enscript highlight="shell">
# Install system dependencies (Ubuntu/Debian)
sudo apt-get install chicken-bin libchicken-dev libopenblas-dev

# Install CHICKEN extensions
chicken-install llama

# Download example model (15M parameters, ~60MB) and tokenizer.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
wget https://github.com/iraikov/llama-chicken/raw/refs/heads/main/tokenizer.bin
</enscript>

=== Quick Start

<enscript highlight="shell">
# Basic text generation
llama-cli -c stories15M.bin -p "Once upon a time"

# Creative generation with temperature
llama-cli -c stories15M.bin -t 0.8 -s 100 -p "The meaning of life is"

# Verify model integrity
llama-cli -c stories15M.bin --verify-checkpoint
</enscript>

=== API

==== Data Types

===== config

<record>config</record>

Model configuration parameters.

<procedure>(make-config dim hidden-dim n-layers n-heads n-kv-heads vocab-size seq-len shared-weights)</procedure>

Creates a new configuration object.

; dim : Model embedding dimension
; hidden-dim : FFN hidden layer dimension
; n-layers : Number of transformer layers
; n-heads : Number of attention heads
; n-kv-heads : Number of key-value heads
; vocab-size : Vocabulary size
; seq-len : Maximum sequence length
; shared-weights : Whether to share input/output embeddings

<procedure>(config-dim config)</procedure>
<procedure>(config-hidden-dim config)</procedure>
<procedure>(config-n-layers config)</procedure>
<procedure>(config-n-heads config)</procedure>
<procedure>(config-n-kv-heads config)</procedure>
<procedure>(config-vocab-size config)</procedure>
<procedure>(config-seq-len config)</procedure>
<procedure>(config-shared-weights config)</procedure>

Accessors for configuration fields.

===== transformer-weights

<record>transformer-weights</record>

Container for all model parameters, including embeddings, attention weights, FFN weights, and RoPE frequencies.

<procedure>(make-transformer-weights token-embedding-table rms-att-weight wq wk wv wo rms-ffn-weight w1 w2 w3 rms-final-weight freq-cis-real freq-cis-imag wcls)</procedure>

Creates a new transformer weights object with all parameter matrices.

===== run-state

<record>run-state</record>

Runtime state for transformer computation, including hidden states, attention caches, and output logits.

<procedure>(make-run-state x xb q k v att key-cache value-cache xb2 hb hb2 logits)</procedure>

Creates a new runtime state object.

<procedure>(run-state-x state)</procedure>
<procedure>(run-state-logits state)</procedure>
<procedure>(run-state-key-cache state)</procedure>
<procedure>(run-state-value-cache state)</procedure>

Accessors for runtime state fields.

===== args

<record>args</record>

Runtime configuration for text generation runs.
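As a quick illustration of the {{config}} record documented above, here is a sketch that constructs the configuration for the stories15M example model (the same field values used in the REPL example later on this page) and reads a few fields back:

<enscript highlight="scheme">
(import llama)

;; Configuration for the stories15M example model:
;; dim 288, hidden-dim 768, 6 layers, 6 heads, 6 KV heads,
;; vocab 32000, sequence length 256, shared embeddings
(define config (make-config 288 768 6 6 6 32000 256 #t))

(config-dim config)        ;; => 288
(config-n-layers config)   ;; => 6
(config-vocab-size config) ;; => 32000
</enscript>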
<procedure>(make-args checkpoint tokenizer temperature steps prompt seed)</procedure>

Creates text generation arguments.

; checkpoint : Path to model checkpoint file
; tokenizer : Path to tokenizer file
; temperature : Sampling temperature (0.0 = greedy)
; steps : Number of tokens to generate
; prompt : Input text prompt
; seed : Random seed (optional)

==== High-Level Functions

<procedure>(run args)</procedure>

Main inference function. Takes an {{args}} object and performs text generation.

<enscript highlight="scheme">
(define args (make-args "model.bin" "tokenizer.bin" 0.8 100 "Hello world" #f))
(run args)
</enscript>

<procedure>(transformer token pos config state weights)</procedure>

Run the transformer forward pass for a single token.

; token : Token ID to process
; pos : Position in sequence
; config : Model configuration
; state : Runtime state (modified in place)
; weights : Model parameters

Returns the updated state.

<procedure>(bpe-encode text vocab vocab-scores)</procedure>

Tokenize text using Byte-Pair Encoding.

; text : Input text string
; vocab : List of vocabulary strings
; vocab-scores : List of BPE merge scores

Returns a list of token IDs.

==== Transformer Components

The modular architecture provides fine-grained control:

<procedure>(token-embedding-lookup state weights token)</procedure>

Load token embedding into state.

<procedure>(get-rope-frequencies weights pos head-size)</procedure>

Extract RoPE frequency rows for the given position. Returns two values: the real and imaginary frequency vectors.

<procedure>(attention-rmsnorm state weights layer-idx config)</procedure>

Apply RMS normalization for the attention layer.

<procedure>(compute-qkv state weights layer-idx config)</procedure>

Compute Query, Key, and Value matrices for the given layer.

<procedure>(apply-rope state config freq-real freq-imag)</procedure>

Apply Rotary Position Embedding to the Q and K vectors.
<procedure>(cache-kv state layer-idx pos config)</procedure>

Store current key and value vectors in the attention cache.

<procedure>(compute-attention state layer-idx pos config)</procedure>

Compute multi-head attention scores and apply them to the values.

<procedure>(attention-output state weights layer-idx config)</procedure>

Apply the final attention output projection.

<procedure>(ffn-rmsnorm state weights layer-idx config)</procedure>

Apply RMS normalization for the feed-forward network.

<procedure>(compute-ffn-w1w3 state weights layer-idx config)</procedure>

Compute the first part of the FFN: W1(x) and W3(x).

<procedure>(apply-swiglu state config)</procedure>

Apply the SwiGLU activation: SiLU(W1(x)) * W3(x).

<procedure>(ffn-output state weights layer-idx config)</procedure>

Apply the final FFN linear transformation.

<procedure>(process-transformer-layer state weights layer-idx pos config freq-real freq-imag)</procedure>

Process a complete transformer layer (attention + FFN blocks).

<procedure>(final-rmsnorm state weights)</procedure>

Apply the final RMS normalization before classification.

<procedure>(compute-logits state weights config)</procedure>

Compute the final classification logits.

==== Utility Functions

<procedure>(rmsnorm output input weights)</procedure>

RMS normalization with learnable weights.

<procedure>(matmul output input matrix rows cols)</procedure>

Matrix-vector multiplication using BLAS.

<procedure>(softmax output input size)</procedure>

Softmax activation with numerical stability.

<procedure>(accum target source)</procedure>

Vector accumulation for residual connections.

<procedure>(argmax vector)</procedure>

Find the index of the maximum element (greedy sampling).

<procedure>(sample probabilities random-state)</procedure>

Probabilistic sampling from a probability distribution.

<procedure>(verify-checkpoint-data checkpoint-file [detailed])</procedure>

Load and analyze a checkpoint file, printing weight statistics.
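The sampling utilities above fit together as follows. This is a sketch, not a complete program: it assumes {{state}} already holds logits produced by a previous call to {{transformer}}, that {{config}} is the matching model configuration, and that the random state for {{sample}} comes from the [[/eggref/5/random-mtzig|random-mtzig]] egg:

<enscript highlight="scheme">
(import llama srfi-4 (prefix random-mtzig rng:))

;; Assumes `state` holds logits from a previous call to `transformer`
;; and `config` is the matching model configuration.
(define logits (run-state-logits state))
(define vocab-size (config-vocab-size config))

;; Greedy decoding: pick the highest-scoring token.
(define next-token/greedy (argmax logits))

;; Stochastic decoding: normalize the logits with softmax,
;; then draw a token from the resulting distribution.
(define probs (softmax (make-f32vector vocab-size) logits vocab-size))
(define rs (rng:init))   ;; random-mtzig random state
(define next-token/sampled (sample probs rs))
</enscript>

With temperature 0.0 the CLI uses the greedy path ({{argmax}}); with a positive temperature the logits are scaled before the softmax/sample path, as shown in the REPL example below.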
==== Command-Line Interface

The {{llama-cli}} command provides easy access to text generation:

<enscript highlight="shell">
llama-cli [options]

Options:
  -h, --help             Show help message
  -c, --checkpoint FILE  Model checkpoint file (required)
  -k, --tokenizer FILE   Tokenizer file (default: tokenizer.bin)
  -t, --temperature NUM  Sampling temperature (default: 0.0)
  -s, --steps NUM        Number of tokens to generate (default: 256)
  -p, --prompt TEXT      Input prompt text (default: empty)
  --seed NUM             Random seed for sampling
  --verify-checkpoint    Verify checkpoint integrity
</enscript>

=== Examples

==== Basic Usage

<enscript highlight="scheme">
(import llama)

;; Simple text generation
(define args (make-args "stories15M.bin" "tokenizer.bin" 0.5 50 "Once upon a time" #f))
(run args)
</enscript>

==== Interactive REPL Usage

<enscript highlight="scheme">
(import llama)

;; Load model components
(define config (make-config 288 768 6 6 6 32000 256 #t))
(define weights (load-checkpoint "stories15M.bin"))
(define state (make-run-state ...))

;; Generate single token
(transformer 1 0 config state weights)
(argmax (run-state-logits state))

;; Custom sampling with temperature
(define logits (run-state-logits state))
(do ((i 0 (+ i 1))) ((= i (f32vector-length logits)))
  (f32vector-set! logits i (/ (f32vector-ref logits i) 0.8)))
(define probs (softmax (make-f32vector 32000) logits 32000))
(sample probs random-state)
</enscript>

==== Batch Processing

<enscript highlight="scheme">
;; Process multiple prompts
(define prompts '("Hello world" "The meaning of life" "Once upon a time"))

(for-each
 (lambda (prompt)
   (printf "Prompt: ~A~%" prompt)
   (let ((args (make-args "stories15M.bin" "tokenizer.bin" 0.5 50 prompt #f)))
     (run args)
     (newline)))
 prompts)
</enscript>

==== Component-Level Usage

<enscript highlight="scheme">
;; Fine-grained control over generation
(define (custom-generation token config state weights)
  ;; Custom attention processing
  (attention-rmsnorm state weights 0 config)
  (compute-qkv state weights 0 config)
  ;; Skip some layers for faster inference
  (let-values (((freq-real freq-imag) (get-rope-frequencies weights 0 2)))
    (process-transformer-layer state weights 0 0 config freq-real freq-imag)
    (process-transformer-layer state weights 2 0 config freq-real freq-imag))
  ;; Custom final processing
  (final-rmsnorm state weights)
  (compute-logits state weights config))
</enscript>

=== Configuration

==== Temperature Guidelines

* '''0.0''': Deterministic (greedy sampling)
* '''0.1-0.3''': Focused, coherent output
* '''0.5-0.8''': Balanced creativity and coherence
* '''0.9-1.2''': Creative, diverse output
* '''1.5+''': Highly random, experimental

=== License

MIT License

=== Author

[[https://github.com/iraikov|Ivan Raikov]]

=== Repository

[[https://github.com/iraikov/llama-chicken|https://github.com/iraikov/llama-chicken]]

=== Version History

; 1.0 : Initial release with complete LLAMA2 implementation

=== See Also

* [[https://github.com/karpathy/llama2.c|llama2.c]] - Original C implementation
* [[https://github.com/jackpeck/llama2.ml|llama2.ml]] - OCaml port
* [[https://github.com/snunez1/llama.cl|llama.cl]] - Common Lisp port
* [[/eggref/5/blas|blas]] - BLAS bindings for CHICKEN Scheme
* [[https://arxiv.org/abs/2307.09288|LLAMA2 Paper]] - Original research paper