comparse

Comparse is a library of PEG parser combinators similar to Haskell's Parsec. Its implementation is based on Drew Crampsie's smug for Common Lisp but slightly deviates in naming and provides a number of additional convenience combinators.

  1. comparse
    1. Caveat
    2. Introduction
    3. A note on variadic combinators
    4. API
    5. Examples
      1. RFC 3339 timestamps
    6. About this egg
      1. Source
      2. Author
      3. Version history
      4. License

Caveat

As of now the API documentation is still incomplete, notably missing the main entry point procedure parse.

Introduction

Comparse provides a library of parsers which can be applied to inputs in order to produce parse results. It also provides parser combinators to create new parsers. In Comparse inputs are represented by a parser-input type. It can hold a sequence of any kinds of items (often characters). For example, we can define an input sequence of characters like this:

(import comparse)

(define some-input
  (->parser-input "hello world"))

A parser then is a function which takes an input as its argument and, in case the parse succeeds, returns a parse result (which can be a value of any kind) and the remainder of the input (which is a parser-input again). In case the parse fails it returns #f.

A very basic parser provided by Comparse is the item parser which consumes one item from the given input and returns it as its parse result. If the input is empty, item fails. We can apply it to the input we defined above using the parse procedure:

(parse item some-input)

Which gives us these results:

#\h
#<parser-input #\e #\l #\l #\o #\space #\w #\o #\r #\l #\d ...>

There is another basic parser called fail which always fails, no matter what's passed to it. These primitive parsers are not very useful on their own, of course. In order to do more interesting things Comparse provides a selection of parser combinators. Just like regular function combinators (such as compose or complement), parser combinators are functions which take one or more parsers as their arguments and combine them into a new parser. One of the basic parser combinators in Comparse is bind. It takes a parser and a procedure as its arguments and returns a parser which first applies the passed parser and, if it succeeds, calls the procedure with its parse result. The procedure then must return another parser which will be applied to the remainder of the parser that was passed to bind. Okay, that sounds pretty dazzling, so an example is in order:

(define double-item
  (bind item (lambda (x)
               (lambda (remainder)
                 ;; Note that unlike the `parse' procedure, the
                 ;; internal parser API does not expect multiple
                 ;; return values but rather a pair of values on
                 ;; successful parse.
                 (cons (list x x) remainder)))))

(parse double-item some-input)

Which gives us:

(#\h #\h)
#<parser-input #\e #\l #\l #\o #\space #\w #\o #\r #\l #\d ...>

So the procedure we pass to bind returns a parser which constructs a two-item list from the parse result and returns it with the untouched remainder. As returning a result from a parser without touching the remainder is needed quite frequently, Comparse provides the result parser combinator which does just that. Using it we can write double-item like this:

(define double-item
  (bind item (lambda (x)
               (result (list x x)))))

Comparse provides many convenient and more high-level parser combinators than the ones presented in this introduction. See the Examples section on how to use them to create parsers for practical tasks. Also note that you would usually not manually convert raw inputs (like strings or input ports) into a parser-input object like we did above. Instead you can pass the raw input object directly to the parse procedure which will automatically turn it into a parser-input for you.

A note on variadic combinators

Some parser combinators accept one or more parsers as their arguments. As a convenience when just one argument is passed and it is a list instead of a parser, that list is treated as the argument list. In other words those combinators implicitly apply themselves to the given (single) list argument.

API

[procedure] (result value)

Returns a parser which always succeeds, returning value as the parse result and the untouched input as the remainder.

[procedure] (fail input)

A parser which always fails.

[procedure] (item input)

A parser which consumes and returns the first item of input, i.e. it returns that item as the parse result and the tail of input as the remainder. Fails if input is empty.

[procedure] (end-of-input input)

A parser which returns #t as the parse result and the untouched input as the remainder if input is empty.

[procedure] (bind parser proc)

Returns a parser which applies parser and calls proc with its parse result if it succeeds. proc must return a parser which will be applied to the remainder returned by parser.

[procedure] (satisfies condition . args)

Returns a parser which consumes the next item of its input and applies condition to it like this: (apply condition item args). If that call returns #f, the parser fails. Otherwise it succeeds and returns the item as its result.

[procedure] (is x)

Returns a parser which succeeds when the next item of its input is eqv? to x.

[procedure] (in collection . items)

Returns a parser which succeeds when the next item of the given input is contained in collection. Collection can be a SRFI 14 char-set or a list which is checked for membership with memq. Given more than one argument, membership of the next item is checked for with memq in (cons collection items).

[procedure] (sequence parser . parsers)

Returns a parser which applies the given parsers from left to right and returns the list of their results when all succeed. See also the note on variadic combinators.

[syntax] (sequence* ((binding parser) ...) body ...)

Similar to let* but the bindings are made to the results of a sequence of parsers. Body should use a result form to yield a parse result.

[procedure] (char-seq str)

Returns a parser which consumes the chars of the given string str in the order contained therein, i.e. it parses a literal string. On success the parser returns str.

[procedure] (char-seq-match rx)

Returns a parser which matches an irregex regular expression.

[procedure] (char-seq-split str #!optional allow-end-of-input?)

To-do: document this.

[procedure] (char-seq-until cs #!optional allow-end-of-input?)

To-do: document this.

[procedure] (any-of parser . parsers)

Returns a parser which applies the given parsers from left to right and returns the result of the first one that succeeds. See also the note on variadic combinators.

[procedure] (all-of parser . parsers)

Returns a parser which applies the given parsers from left to right and returns the result of the last one if all succeed. See also the note on variadic combinators.

[procedure] (none-of parser . parsers)

Returns a parser which applies the given parsers from left to right and returns #t as its result if none of them succeed while it doesn't consume any items from the input. See also the note on variadic combinators.

[procedure] (none-of* parser [parsers ...] but)

Returns a parser which applies the given parsers from left to right. If none of them succeed, it applies the parser but and returns its result.

[procedure] (preceded-by parser . parsers)

Returns a parser which applies the given parsers from left to right to the input or the remainder of the preceding parser, respectively, and returns the result of the last one. See also the note on variadic combinators.

[procedure] (followed-by parser following-parser . following-parsers)

Returns a parser which applies parser if the given following-parser[s] succeed on its remainder from left to right. Note that the following-parsers don't consume any input, i.e. they only perform a look-ahead.

See also the note on variadic combinators which applies to the following-parser[s] argument(s) in this case.

[procedure] (enclosed-by open content close)

Returns a parser which applies the given parsers open, content, and close in that order to the input or the remainder of the preceding parser, respectively, and returns the result of content with the remainder of close.

[procedure] (maybe parser #!optional default)

Returns a parser which applies parser to the input and returns its result and remainder if it succeeds. Otherwise it returns default or #f if not specified as the parse result and the untouched input as the remainder.

[procedure] (zero-or-more parser)

Returns a parser which applies parser to its input. If that fails, it returns () as the parse result and the untouched input as the remainder. Otherwise it applies parser again to the remainder of the previous application and so forth, collecting all parse result in a list. That list is then returned as the parse result and the remainder of the last successful application of parser as the remainder.

[procedure] (one-or-more parser)

Returns a parser which applies parser to its input. If that fails, it also fails. Otherwise it applies parser again to the remainder of the previous application and so forth, collecting all parse result in a list. That list is then returned as the parse result and the remainder of the last successful application of parser as the remainder.

[procedure] (repeated parser #!rest args #!key (min 0) max until)

To-do: document this.

[procedure] (skip parser . parsers)

To-do: document this.

[procedure] (as-string parser)

Returns a parser that turns the result of the given parser into a string.

[syntax] (recursive-parser body ...)

To-do: document this.

Examples

RFC 3339 timestamps

RFC 3339 specifies the Internet timestamp format. It looks like this:

 2015-04-19T19:50:36Z

Let's try to implement a parser for it. Thankfully, section 5.6 of the RFC contains an ABNF grammar which is a good thing to start from:

 date-fullyear   = 4DIGIT
 date-month      = 2DIGIT  ; 01-12
 date-mday       = 2DIGIT  ; 01-28, 01-29, 01-30, 01-31 based on
                           ; month/year
 time-hour       = 2DIGIT  ; 00-23
 time-minute     = 2DIGIT  ; 00-59
 time-second     = 2DIGIT  ; 00-58, 00-59, 00-60 based on leap second
                           ; rules
 time-secfrac    = "." 1*DIGIT
 time-numoffset  = ("+" / "-") time-hour ":" time-minute
 time-offset     = "Z" / time-numoffset
 
 partial-time    = time-hour ":" time-minute ":" time-second
                   [time-secfrac]
 full-date       = date-fullyear "-" date-month "-" date-mday
 full-time       = partial-time time-offset
 
 date-time       = full-date "T" full-time

We can start by transcribing the grammar into Comparse. To that end, we start by defining two utility parsers for parsing digits:

(import comparse srfi-14)

(define digit
  (in char-set:digit))

(define (digits n)
  (as-string (repeated digit n)))

Using these and Comparse's built-in combinators we can define our RFC 3339 parser along the ABNF grammar in a pretty straight forward fashion:

(define date-fullyear
  (digits 4))

(define date-month
  (digits 2))

(define date-mday
  (digits 2))

(define time-hour
  (digits 2))

(define time-minute
  (digits 2))

(define time-second
  (digits 2))

(define time-secfrac
  (preceded-by (is #\.) (as-string (one-or-more digit))))

(define time-numoffset
  (sequence (in #\+ #\-) time-hour (is #\:) time-minute))

(define time-offset
  (any-of (is #\Z) time-numoffset))

(define partial-time
  (sequence time-hour (is #\:)
            time-minute (is #\:)
            time-second
            (maybe time-secfrac)))

(define full-date
  (sequence date-fullyear (is #\-)
            date-month (is #\-)
            date-mday))

(define full-time
  (sequence partial-time time-offset))

(define date-time
  (sequence full-date (is #\T) full-time))

Note that each rule is a parser itself which we can test individually. Let's try some examples in the REPL:

 #;1> (parse date-fullyear "1891")
 "1891"
 #<parser-input-end>
 ; 2 values
 
 #;2> (parse full-date "1927-09-04")
 ("1927" #\- "09" #\- "04")
 #<parser-input-end>
 ; 2 values
 
 #;3> (parse time-offset "+00:30")
 (#\+ "00" #\: "30")
 #<parser-input-end>
 ; 2 values

And finally, a full timestamp:

 #;4> (parse date-time "2015-04-19T19:50:36Z")
 (("2015" #\- "04" #\- "19") #\T (("19" #\: "50" #\: "36" #f) #\Z))
 #<parser-input-end>
 ; 2 values

That's pretty good already. It leaves a few things to be desired, though. For one, it would be nicer to have the date and time components to be returned as numbers rather than strings. One way to achieve this would be to walk the parse result and turn all strings into numbers in a separate post-processing step. A nicer approach would be to have the digits parser yield numbers directly. We could define an as-number combinator that works like the built-in as-string combinator:

(define (as-number parser)
  (bind (as-string parser)
        (lambda (s)
          (result (string->number s)))))

The resulting parser tries to apply the given parser, turns its result into a string if it succeeded and then passes that string to our anonymous procedure which turns it into a number via string->number, returning it as the parse result via the result combinator. Let's give it a whirl:

 #;5> (parse (as-number (one-or-more digit)) "123")
 123
 #<parser-input-end>
 ; 2 values

Looks good! We could even get extra fancy and apply some more higher-order functional programming voodoo by using the o procedure from the data-structures unit:

(define (as-number parser)
  (bind (as-string parser)
        (o result string->number)))

Now doesn't that look neat? OK, now all we have to do is swap all as-string calls with as-number calls:

(define (digits n)
  (as-number (repeated digit n)))

(define time-secfrac
  (preceded-by (is #\.) (as-number (one-or-more digit))))

Then we re-evaluate the dependent definitions. Et voilà:

 #;6> (parse date-time "2015-04-19T19:50:36Z")
 ((2015 #\- 4 #\- 19) #\T ((19 #\: 50 #\: 36 #f) #\Z))
 #<parser-input-end>
 ; 2 values

Much better! Now wouldn't it be even nicer to have a slightly less messy result? How about something like this:

(rfc3999-timestamp (2015 4 19) (19 50 36 #f) Z)

Again, we could post-process the parse result to massage it into the desired shape. But let's try to do it right in the parser, too! This is a task for which the sequence* syntax is really handy. First, let's redefine full-date so that it only returns the date components as a list:

(define full-date
  (sequence* ((year date-fullyear)
              (_ (is #\-))
              (month date-month)
              (_ (is #\-))
              (day date-mday))
    (result (list year month day))))

This will apply each parser on the right-hand side to the input in sequence and bind their results to the identifier on the left-hand side. We use _ for parse results we don't care about. Let's give it a try:

 #;7> (parse full-date "2015-04-19")
 (2015 4 19)
 #<parser-input-end>
 ; 2 values

Very well! Now we do the same for full-time, time-numoffset and time-offset:

(define full-time
  (sequence* ((hour time-hour)
              (_ (is #\:))
              (minute time-minute)
              (_ (is #\:))
              (second time-second)
              (secfrac (maybe time-secfrac)))
    (result (list hour minute second secfrac))))

(define time-numoffset
  (sequence* ((signum (in #\+ #\-))
              (hour time-hour)
              (_ (is #\:))
              (minute time-minute))
    (result (list (if (eq? #\+ signum) '+ '-)
                  hour
                  minute))))

(define time-offset
  (any-of (bind (is #\Z) (constantly (result 'Z)))
          time-numoffset))

And finally we can put together the new date-time:

(define date-time
  (sequence* ((date full-date)
              (_ (is #\T))
              (time full-time)
              (offset time-offset))
    (result (list 'rfc3339-timestamp
                  date
                  time
                  offset))))

Let's see if it actually works:

 #;8> (parse date-time "2015-04-19T19:50:36Z")
 (rfc3339-timestamp (2015 4 19) (19 50 36 #f) Z)
 #<parser-input-end>
 ; 2 values

Rejoice! But there's another issue. Take a look a the ABNF grammar again and note the comments next to some of the rules which further constrain the various timestamp components. Currently, our parser also accepts bogus timestamps such as 2015-38-76T91:68:82Z. It would certainly be nice if we could enforce those constraints in the parser, too—and it looks like the RFC authors would have liked to be able to do that, as well, given that they put the comments right next to the grammar rules.

Again, there are various ways to approach this. As always, we could just validate the timestamp components in a post-processing procedure but that would of course spatially distance the constraints from the grammar rules they refer to.

One way to do it in the parser itself would be to encode the valid digit sequences in the grammar, e.g. time-hour could be changed like this to satisfy the constraints:

(define time-hour
  (as-string
   (any-of (sequence (is #\0) digit)
           (sequence (is #\1) digit)
           (sequence (is #\2) (in #\1 #\2 #\3 #\4)))))

As you can imagine, this gets boring quickly, and in the case of date-mday this approach wouldn't work anyway since the range of valid values depends on the given year and month.

A more convenient approach is to put the constraint logic directly into a bind or sequence* body. Applied to the time-hour parser we get something like this:

(define time-hour
  (bind (digits 2)
        (lambda (hour)
          (if (<= 0 hour 23)
              (result hour)
              fail))))

So in case hour is within the given range, we return a result parser with the valid hour as the value, otherwise we return the fail parser which—as the name suggests—always fails. Since we need this kind of range check for various parsers, we could create a helper procedure constrained like this:

(define (constrained parser pred)
  (bind parser
        (lambda (val)
          (if (pred val)
              (result val)
              fail))))

And now we can use it to define our various static constraints:

(define date-month
  (constrained (digits 2)
               (lambda (month)
                 (<= 1 month 12))))

(define time-hour
  (constrained (digits 2)
               (lambda (hour)
                 (<= 0 hour 23))))

(define time-minute
  (constrained (digits 2)
               (lambda (minute)
                 (<= 0 minute 59))))

Now what about constraints for a parser like date-mday which depends on the result of other parsers? Easy: We just turn it into a procedure which dynamically constructs the parser based on the year and month passed to it (see RFC 3339 section 5.7 for details):

(define (leap-year? year)
  (and (zero? (modulo year 4))
       (or (not (zero? (modulo year 100)))
           (zero? (modulo year 400)))))

(define (date-mday year month)
  (constrained (digits 2)
               (lambda (day)
                 (<= 1 day
                     (case month
                       ((1 3 5 7 8 10 12) 31)
                       ((4 6 9 11) 30)
                       ((2) (if (leap-year? year)
                                29
                                28)))))))

And now we can parse days of the Gregorian calendar with proper constraint checks:

 #;9> (parse (date-mday 2012 2) "29")
 29
 #<parser-input-end>
 ; 2 values
 
 #;10> (parse (date-mday 2013 2) "29")
 #f
 #<parser-input char-seq-cursor "29" ...>
 ; 2 values
 
 #;11> (parse (date-mday 2013 8) "30")
 30
 #<parser-input-end>
 ; 2 values

Integrating it into the full-date parser is trivial, we just need to pass the year and month we parsed and bound before parsing the day:

(define full-date
  (sequence* ((year date-fullyear)
              (_ (is #\-))
              (month date-month)
              (_ (is #\-))
              (day (date-mday year month)))
    (result (list year month day))))

There we have it, a fairly complete and compliant RFC 3339 timestamp parser! The astute reader may have noticed that we didn't implement constraint checks for the time-second parser, yet. This is left as an exercise for the reader—don't forget about leap seconds!

On another note, it usually makes sense to provide a version of the top-most parser that's anchored at the input's end for stand-alone use. Consider the following example:

 #;12> (parse date-time "2015-04-21T16:39:57Z some trailing garbage")
 (rfc3339-timestamp (2015 4 21) (16 39 57 #f) Z)
 #<parser-input char-seq-cursor " some trailing garbage" ...>
 ; 2 values

As you can see, the date-time parser happily accepts this input which probably isn't what you want in this case. This is easily remedied by anchoring it at the end of input like this:

 #;13> (parse (followed-by date-time end-of-input) "2015-04-21T16:39:57Z some trailing garbage")
 #f
 #<parser-input char-seq-cursor "2015-04-21T16:39:57Z some trailing garbage" ...>
 ; 2 values
 
 #;14> (parse (followed-by date-time end-of-input) "2015-04-21T16:39:57Z")
 (rfc3339-timestamp (2015 4 21) (16 39 57 #f) Z)
 #<parser-input-end>
 ; 2 values

About this egg

Source

The source code is hosted at Bitbucket. Feel free to fork it and send pull requests there.

Author

Moritz Heidkamp

Version history

2
First version to be compatible with CHICKEN 5.

For previous versions, see documentation of the CHICKEN 4 version of this egg.

License

 Copyright (c) 2013-2018, Moritz Heidkamp
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without
 modification, are permitted provided that the following conditions are
 met:
 
 Redistributions of source code must retain the above copyright
 notice, this list of conditions and the following disclaimer.
 
 Redistributions in binary form must reproduce the above copyright
 notice, this list of conditions and the following disclaimer in the
 documentation and/or other materials provided with the distribution.
 
 Neither the name of the author nor the names of its contributors may
 be used to endorse or promote products derived from this software
 without specific prior written permission.
 
 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
 FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
 COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
 INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
 (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
 SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
 OF THE POSSIBILITY OF SUCH DAMAGE.