[[tags: ai machine-learning nlp llm transformer blas]]

== llama

A high-performance LLAMA2 inference implementation in CHICKEN Scheme, based on Andrej Karpathy's [[https://github.com/karpathy/llama2.c|llama2.c]] and its OCaml port [[https://github.com/jackpeck/llama2.ml|llama2.ml]].

=== Description

This egg provides a complete implementation of the LLAMA2 transformer architecture for text generation. It features modular components and uses BLAS integration for high performance.

=== Requirements

==== System Dependencies

* BLAS library (OpenBLAS, Intel MKL, or system BLAS)

==== CHICKEN Extensions

* [[/eggref/5/srfi-1|srfi-1]] - List library
* [[/eggref/5/srfi-4|srfi-4]] - Numeric vectors
* [[/eggref/5/srfi-42|srfi-42]] - Comprehensions
* [[/eggref/5/srfi-69|srfi-69]] - Hash tables
* [[/eggref/5/vector-lib|vector-lib]] - Vector utilities
* [[/eggref/5/blas|blas]] - BLAS bindings
* [[/eggref/5/endian-blob|endian-blob]] - Endian-aware blob operations
* [[/eggref/5/endian-port|endian-port]] - Endian-aware port operations
* [[/eggref/5/random-mtzig|random-mtzig]] - Random number generation
* [[/eggref/5/getopt-long|getopt-long]] - Command-line option parsing
* [[/eggref/5/test|test]] - Unit testing

=== Installation

<enscript highlight="shell">
# Install system dependencies (Ubuntu/Debian)
sudo apt-get install chicken-bin libchicken-dev libopenblas-dev

# Install CHICKEN extensions
chicken-install llama

# Download example model (15M parameters, ~60MB) and tokenizer.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
wget https://github.com/iraikov/llama-chicken/raw/refs/heads/main/tokenizer.bin
</enscript>

=== Quick Start

<enscript highlight="shell">
# Basic text generation
llama-cli -c stories15M.bin -p "Once upon a time"

# Creative generation with temperature
llama-cli -c stories15M.bin -t 0.8 -s 100 -p "The meaning of life is"

# Verify model integrity
llama-cli -c stories15M.bin --verify-checkpoint
</enscript>

=== API

==== Data Types

===== config

<record>config</record>

Model configuration parameters.

<procedure>(make-config dim hidden-dim n-layers n-heads n-kv-heads vocab-size seq-len shared-weights)</procedure>

Creates a new configuration object.

; dim : Model embedding dimension
; hidden-dim : FFN hidden layer dimension
; n-layers : Number of transformer layers
; n-heads : Number of attention heads
; n-kv-heads : Number of key-value heads
; vocab-size : Vocabulary size
; seq-len : Maximum sequence length
; shared-weights : Whether to share input/output embeddings

<procedure>(config-dim config)</procedure>
<procedure>(config-hidden-dim config)</procedure>
<procedure>(config-n-layers config)</procedure>
<procedure>(config-n-heads config)</procedure>
<procedure>(config-n-kv-heads config)</procedure>
<procedure>(config-vocab-size config)</procedure>
<procedure>(config-seq-len config)</procedure>
<procedure>(config-shared-weights config)</procedure>

Accessors for configuration fields.

===== transformer-weights

<record>transformer-weights</record>

Container for all model parameters, including embeddings, attention weights, FFN weights, and RoPE frequencies.

<procedure>(make-transformer-weights token-embedding-table rms-att-weight wq wk wv wo rms-ffn-weight w1 w2 w3 rms-final-weight freq-cis-real freq-cis-imag wcls)</procedure>

Creates a new transformer weights object with all parameter matrices.

===== run-state

<record>run-state</record>

Runtime state for transformer computation, including hidden states, attention caches, and output logits.

<procedure>(make-run-state x xb q k v att key-cache value-cache xb2 hb hb2 logits)</procedure>

Creates a new runtime state object.

<procedure>(run-state-x state)</procedure>
<procedure>(run-state-logits state)</procedure>
<procedure>(run-state-key-cache state)</procedure>
<procedure>(run-state-value-cache state)</procedure>

Accessors for runtime state fields.

===== args

<record>args</record>

Runtime configuration for text generation runs.
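As a quick illustration of the {{config}} record documented above, here is a sketch that constructs the configuration for the stories15M example model (the same field values used in the REPL example later on this page) and reads a few fields back:

<enscript highlight="scheme">
(import llama)

;; Configuration for the stories15M example model:
;; dim 288, hidden-dim 768, 6 layers, 6 heads, 6 KV heads,
;; vocab 32000, sequence length 256, shared embeddings
(define config (make-config 288 768 6 6 6 32000 256 #t))

(config-dim config)        ;; => 288
(config-n-layers config)   ;; => 6
(config-vocab-size config) ;; => 32000
</enscript>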
<procedure>(make-args checkpoint tokenizer temperature steps prompt seed)</procedure>

Creates text generation arguments.

; checkpoint : Path to model checkpoint file
; tokenizer : Path to tokenizer file
; temperature : Sampling temperature (0.0 = greedy)
; steps : Number of tokens to generate
; prompt : Input text prompt
; seed : Random seed (optional)

==== High-Level Functions

<procedure>(run args)</procedure>

Main inference function. Takes an {{args}} object and performs text generation.

<enscript highlight="scheme">
(define args (make-args "model.bin" "tokenizer.bin" 0.8 100 "Hello world" #f))
(run args)
</enscript>

<procedure>(transformer token pos config state weights)</procedure>

Run the transformer forward pass for a single token.

; token : Token ID to process
; pos : Position in sequence
; config : Model configuration
; state : Runtime state (modified in place)
; weights : Model parameters

Returns the updated state.

<procedure>(bpe-encode text vocab vocab-scores)</procedure>

Tokenize text using Byte-Pair Encoding.

; text : Input text string
; vocab : List of vocabulary strings
; vocab-scores : List of BPE merge scores

Returns a list of token IDs.

==== Transformer Components

The modular architecture provides fine-grained control:

<procedure>(token-embedding-lookup state weights token)</procedure>

Load token embedding into state.

<procedure>(get-rope-frequencies weights pos head-size)</procedure>

Extract RoPE frequency rows for the given position. Returns two values: the real and imaginary frequency vectors.

<procedure>(attention-rmsnorm state weights layer-idx config)</procedure>

Apply RMS normalization for the attention layer.

<procedure>(compute-qkv state weights layer-idx config)</procedure>

Compute Query, Key, and Value matrices for the given layer.

<procedure>(apply-rope state config freq-real freq-imag)</procedure>

Apply Rotary Position Embedding to the Q and K vectors.
<procedure>(cache-kv state layer-idx pos config)</procedure>

Store current key and value vectors in the attention cache.

<procedure>(compute-attention state layer-idx pos config)</procedure>

Compute multi-head attention scores and apply them to the values.

<procedure>(attention-output state weights layer-idx config)</procedure>

Apply the final attention output projection.

<procedure>(ffn-rmsnorm state weights layer-idx config)</procedure>

Apply RMS normalization for the feed-forward network.

<procedure>(compute-ffn-w1w3 state weights layer-idx config)</procedure>

Compute the first part of the FFN: W1(x) and W3(x).

<procedure>(apply-swiglu state config)</procedure>

Apply the SwiGLU activation: SiLU(W1(x)) * W3(x).

<procedure>(ffn-output state weights layer-idx config)</procedure>

Apply the final FFN linear transformation.

<procedure>(process-transformer-layer state weights layer-idx pos config freq-real freq-imag)</procedure>

Process a complete transformer layer (attention + FFN blocks).

<procedure>(final-rmsnorm state weights)</procedure>

Apply the final RMS normalization before classification.

<procedure>(compute-logits state weights config)</procedure>

Compute the final classification logits.

==== Utility Functions

<procedure>(rmsnorm output input weights)</procedure>

RMS normalization with learnable weights.

<procedure>(matmul output input matrix rows cols)</procedure>

Matrix-vector multiplication using BLAS.

<procedure>(softmax output input size)</procedure>

Softmax activation with numerical stability.

<procedure>(accum target source)</procedure>

Vector accumulation for residual connections.

<procedure>(argmax vector)</procedure>

Find the index of the maximum element (greedy sampling).

<procedure>(sample probabilities random-state)</procedure>

Probabilistic sampling from a probability distribution.

<procedure>(verify-checkpoint-data checkpoint-file [detailed])</procedure>

Load and analyze a checkpoint file, printing weight statistics.
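The sampling utilities above fit together as follows. This is a sketch, not a complete program: it assumes {{state}} already holds logits produced by a previous call to {{transformer}}, that {{config}} is the matching model configuration, and that the random state for {{sample}} comes from the [[/eggref/5/random-mtzig|random-mtzig]] egg:

<enscript highlight="scheme">
(import llama srfi-4 (prefix random-mtzig rng:))

;; Assumes `state` holds logits from a previous call to `transformer`
;; and `config` is the matching model configuration.
(define logits (run-state-logits state))
(define vocab-size (config-vocab-size config))

;; Greedy decoding: pick the highest-scoring token.
(define next-token/greedy (argmax logits))

;; Stochastic decoding: normalize the logits with softmax,
;; then draw a token from the resulting distribution.
(define probs (softmax (make-f32vector vocab-size) logits vocab-size))
(define rs (rng:init))   ;; random-mtzig random state
(define next-token/sampled (sample probs rs))
</enscript>

With temperature 0.0 the CLI uses the greedy path ({{argmax}}); with a positive temperature the logits are scaled before the softmax/sample path, as shown in the REPL example below.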
==== Command-Line Interface

The {{llama-cli}} command provides easy access to text generation:

<enscript highlight="shell">
llama-cli [options]

Options:
  -h, --help             Show help message
  -c, --checkpoint FILE  Model checkpoint file (required)
  -k, --tokenizer FILE   Tokenizer file (default: tokenizer.bin)
  -t, --temperature NUM  Sampling temperature (default: 0.0)
  -s, --steps NUM        Number of tokens to generate (default: 256)
  -p, --prompt TEXT      Input prompt text (default: empty)
  --seed NUM             Random seed for sampling
  --verify-checkpoint    Verify checkpoint integrity
</enscript>

=== Examples

==== Basic Usage

<enscript highlight="scheme">
(import llama)

;; Simple text generation
(define args (make-args "stories15M.bin" "tokenizer.bin" 0.5 50 "Once upon a time" #f))
(run args)
</enscript>

==== Interactive REPL Usage

<enscript highlight="scheme">
(import llama)

;; Load model components
(define config (make-config 288 768 6 6 6 32000 256 #t))
(define weights (load-checkpoint "stories15M.bin"))
(define state (make-run-state ...))

;; Generate single token
(transformer 1 0 config state weights)
(argmax (run-state-logits state))

;; Custom sampling with temperature
(define logits (run-state-logits state))
(do ((i 0 (+ i 1))) ((= i (f32vector-length logits)))
  (f32vector-set! logits i (/ (f32vector-ref logits i) 0.8)))
(define probs (softmax (make-f32vector 32000) logits 32000))
(sample probs random-state)
</enscript>

==== Batch Processing

<enscript highlight="scheme">
;; Process multiple prompts
(define prompts '("Hello world" "The meaning of life" "Once upon a time"))

(for-each
 (lambda (prompt)
   (printf "Prompt: ~A~%" prompt)
   (let ((args (make-args "stories15M.bin" "tokenizer.bin" 0.5 50 prompt #f)))
     (run args)
     (newline)))
 prompts)
</enscript>

==== Component-Level Usage

<enscript highlight="scheme">
;; Fine-grained control over generation
(define (custom-generation token config state weights)
  ;; Custom attention processing
  (attention-rmsnorm state weights 0 config)
  (compute-qkv state weights 0 config)
  ;; Skip some layers for faster inference
  (let-values (((freq-real freq-imag) (get-rope-frequencies weights 0 2)))
    (process-transformer-layer state weights 0 0 config freq-real freq-imag)
    (process-transformer-layer state weights 2 0 config freq-real freq-imag))
  ;; Custom final processing
  (final-rmsnorm state weights)
  (compute-logits state weights config))
</enscript>

=== Configuration

==== Temperature Guidelines

* '''0.0''': Deterministic (greedy sampling)
* '''0.1-0.3''': Focused, coherent output
* '''0.5-0.8''': Balanced creativity and coherence
* '''0.9-1.2''': Creative, diverse output
* '''1.5+''': Highly random, experimental

=== License

MIT License

=== Author

[[https://github.com/iraikov|Ivan Raikov]]

=== Repository

[[https://github.com/iraikov/llama-chicken|https://github.com/iraikov/llama-chicken]]

=== Version History

; 1.0 : Initial release with complete LLAMA2 implementation

=== See Also

* [[https://github.com/karpathy/llama2.c|llama2.c]] - Original C implementation
* [[https://github.com/jackpeck/llama2.ml|llama2.ml]] - OCaml port
* [[https://github.com/snunez1/llama.cl|llama.cl]] - Common Lisp port
* [[/eggref/5/blas|blas]] - BLAS bindings for CHICKEN Scheme
* [[https://arxiv.org/abs/2307.09288|LLAMA2 Paper]] - Original research paper