You are looking at historical revision 16530 of this page. It may differ significantly from its current revision.
lexgen
Description
lexgen is a lexer generator comprised in its core of only four small procedures. The programmer combines these procedures into regular expression pattern matchers.
A pattern matcher procedure takes a list of streams, and returns a new list of streams advanced by every combination allowed by the pattern matcher function. A stream is defined as a list that contains a list of characters consumed by the pattern matcher, and a list of characters not yet consumed.
Note that the number of streams returned by a pattern matcher typically won't match the number of streams passed in. If the pattern doesn't match at all, the empty list is returned.
Library Procedures
Every combinator procedure in this library returns a procedure that takes in a continuation procedure and a list of streams as arguments.
Basic procedures
[procedure] (tok TOKEN PROC) => MATCHERProcedure tok builds a pattern matcher function that, for each stream given, applies a procedure to the given token TOKEN and an input character. If the procedure returns a true value, that value is prepended to the list of consumed elements, and the input character is removed from the list of input elements.
[procedure] (seq MATCHER1 MATCHER2) => MATCHERseq builds a matcher that matches a sequence of patterns.
[procedure] (bar MATCHER1 MATCHER2) => MATCHERbar matches either of two patterns. It's analogous to patterns separated by | in traditional regular expressions.
[procedure] (star MATCHER) => MATCHERstar is an implementation of the Kleene closure. It is analogous to * in traditional regular expressions.
Convenience procedures
These procedures are built from the previous four and are provided for convenience.
[procedure] (try PROC) => PROCConverts a binary predicate procedure to a binary procedure that returns its right argument when the predicate is true, and false otherwise.
[procedure] (char CHAR) => MATCHERMatches a single character.
[procedure] (lst MATCHER-LIST) => MATCHERConstructs a matcher for the sequence of matchers in MATCHER-LIST.
[procedure] (pos MATCHER) => MATCHERPositive closure. Analogous to +.
[procedure] (opt MATCHER) => MATCHEROptional pattern. Analogous to ?.
[procedure] (set CHAR-SET) => MATCHERMatches any of a SRFI-14 set of characters.
[procedure] (range CHAR CHAR) => MATCHERMatches a range of characters. Analogous to character class [].
[procedure] (lit STRING) => MATCHERMatches a literal string s.
[procedure] (bind F P) => MATCHERGiven a rule P and function F, returns a matcher that first applies P to the input stream, then applies F to the returned list of consumed tokens, and returns the result and the remainder of the input stream.
[procedure] (rebind F G P) => MATCHERGiven a rule P and procedures F and G, returns a matcher that first applies F to the input stream, then applies P to the resulting stream, then applies G to the resulting list of consumed elements and returns the result along with the remainder of the input stream.
[procedure] (drop P) => MATCHERGiven a rule P, returns a matcher that always returns an empty list of consumed tokens when P succeeds.
Lexer procedures
[procedure] (longest STREAM-LIST) => STREAMTakes the resulting streams produced by the application of a pattern on a stream (or streams) and selects the longest match if one exists. If STREAM-LIST is empty, it returns #F.
[procedure] (lex MATCHER ERROR STRING) => CHAR-LISTlex takes a pattern and a string, turns the string into a list of streams (containing one stream), applies the pattern, and returns the longest match. Argument ERROR is a single-argument procedure called when the pattern does not match anything.
Streams
A stream is a list that can take one of two forms:
1) A list of two elements: the first element is a list of elements consumed by the pattern matcher; the second element is a list of elements not yet consumed. E.g., the list
((#\a) (#\b #\c #\d #\e))
represents a stream that contains the consumed character a, and the unconsumed characters b c d e.
2) A list of three elements: the first two elements are as before, but the third element is a procedure that is applied to the tail of the unconsumed list, in order to obtain the next character. E.g., the list:
((#\a) (#\b <port>) <procedure (lambda (in) (list (read-char in) in))>
represents a stream that contains the consumed character a, the unconsumed character b, and an input port to read subsequent characters from; and a procedure that reads one character from the input port, and returns it along with the modified port. Note that the use of side-effecting structures such as ports will lead to erroneous results with backtracking parsers.
[procedure] (stream-unfold INIT F G)A helper procedure to transform streams from one format to another. Procedure F must be a procedure of two arguments, a state and a list of unconsumed characters. Procedure G is applied to an unconsumed element, and is expected to return the original element representation, before F was applied to the unconsumed stream.
Examples
(use srfi-1 lexgen) ;; A pattern to match floating point numbers. ;; "-"?(([0-9]+(\\.[0-9]+)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)? (define numpat (let* ((digit (range #\0 #\9)) (digits (pos digit)) (fraction (seq (char #\.) digits)) (significand (bar (seq digits (opt fraction)) fraction)) (exp (seq (set "eE") (seq (opt (set "+-")) digits))) (sign (opt (char #\-)))) (seq sign (seq significand (opt exp))))) (define (err s) (print "lexical error on stream: " s) (list)) (lex numpat err "-123.45e-6") ;; Tokens with position information (define-record-type postok (make-postok pos token) postok? (pos postok-pos ) (token postok-token ) ) (define-record-printer (postok x out) (fprintf out "#<token ~A: ~A>" (postok-pos x) (postok-token x))) (define pos? pair?) (define pos-row car) (define pos-col cdr) (define make-pos cons) ;; Converts an input stream to a stream with position information: (define (stream-pos begtok) (and (postok? begtok) (stream-unfold begtok (lambda (begtok strm) (if (null? strm) '() (let* ((pos0 (postok-pos begtok)) (pos1 (let ((row0 (pos-row pos0)) (col0 (pos-col pos0))) (case (car strm) ((#\newline) (make-pos (+ 1 row0) 1)) ((#\return) (make-pos row0 1)) (else (make-pos row0 (+ 1 col0)))))) (res (cons (make-postok pos1 (car strm)) (cdr strm)))) res))) postok-token))) (define begpos (make-pos 1 0)) (define (getpos p) (let ((f (lambda (in) (and (pair? in) (pair? (cdr in)) (postok-pos (cadr in))))) (g (lambda (i s) (list (make-postok i (car s)))))) (rebind f g p))) (define pos-numpat-stream (list ((stream-pos (make-postok begpos 'start)) `(() ,(string->list "-123.45e-6"))))) (define pbnumpat (let* ((digit (range #\0 #\9)) (digits (star digit)) (fraction (seq (char #\.) digits)) (significand (bar (seq digits (opt fraction)) fraction)) (exp (seq (set "eE") (seq (opt (set "+-")) digits))) (sign (opt (char #\-)) ) (pat (seq (getpos (bind make-sign sign)) (seq (getpos (bind make-significand (longest significand))) (getpos (bind make-exp (longest (opt exp)))))))) pat)) (define (pos-num-parser s) (car (lex pbnumpat err s)))
Requires
Version History
- 3.3 Bug fixes in stream comparison
- 3.2 Improved input stream comparison procedures
- 3.1 Added rebind combinator and stream-unfold procedure
- 3.0 Added an extension mechanism for input streams of different types (to be elaborated and documented in subsequent versions).
- 2.6 Added bind and drop combinators
- 2.5 The seq combinator checks whether the first parser in the sequence has failed
- 2.4 Added (require-library srfi-1); using lset<= instead of equal? in star
- 2.3 Bug fix in procedure range; added procedure cps-table
- 2.2 Bug fix in procedure star
- 2.1 Added procedure lst
- 2.0 Core procedures rewritten in continuation-passing style
- 1.5 Using (require-extension srfi-1)
- 1.4 Ported to Chicken 4
- 1.2 Added procedures try and tok (supersedes pred)
- 1.0 Initial release
License
Based on the SML lexer generator by Thant Tessman.
Copyright 2009 Ivan Raikov and the Okinawa Institute of Science and Technology. This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. A full copy of the GPL license can be found at <http://www.gnu.org/licenses/>.