html-parser

  1. html-parser
    1. Description
    2. Author
    3. Documentation
      1. Main interface
        1. make-html-parser
      2. Convenience functions
        1. html->sxml
        2. html-strip
    4. Examples
    5. Changelog
    6. License

Description

A permissive, scalable HTML parser.

Author

Alex Shinn

Documentation

html-parser is a permissive HTML parser for people who prefer the scalable interface described in Oleg Kiselyov's SSAX parser; it also provides simple convenience utilities. It correctly handles all invalid HTML, inserting "virtual" start and end tags as needed to maintain the proper tree structure required by the foldts down/up logic. A major goal of this parser is bug-for-bug compatibility with the way common web browsers parse HTML.

Main interface

make-html-parser
[procedure] (make-html-parser . keys)

Returns a procedure of two arguments, an initial seed and an optional input port, which parses the HTML document from the port and returns the final seed. The callbacks are specified in the plist KEYS, whose keys are normal quoted symbols such as 'start: (for portability, and to avoid making this a macro). The following callbacks are recognized:

 START: TAG ATTRS SEED VIRTUAL?
     fdown in foldts, called when a start-tag is encountered.
   TAG:         tag name
   ATTRS:       tag attributes as an alist
   SEED:        current seed value
   VIRTUAL?:    #t iff this start tag was inserted to fix the HTML tree
 END: TAG ATTRS PARENT-SEED SEED VIRTUAL?
     fup in foldts, called when an end-tag is encountered.
   TAG:         tag name
   ATTRS:       tag attributes of the corresponding start tag
   PARENT-SEED: parent seed value (i.e. seed passed to the start tag)
   SEED:        current seed value
   VIRTUAL?:    #t iff this end tag was inserted to fix the HTML tree
 TEXT: TEXT SEED
     fhere in foldts, called when any text is encountered.  May be
     called multiple times between a start and end tag, so you need
     to string-append yourself if desired.
   TEXT:        entity-decoded text
   SEED:        current seed value
 COMMENT: TEXT SEED
     fhere on comment data
 DECL: NAME ATTRS SEED
     fhere on declaration data
 PROCESS: LIST SEED
     fhere on processing-instruction data

In addition, entity mappings may be overridden with the ENTITIES: keyword.
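As a rough illustration of how the seed is threaded through the callbacks, the sketch below counts the start tags that literally appear in a document. It assumes CHICKEN 5 module syntax for the import (CHICKEN 4 would use (use html-parser) instead) and relies on omitted callbacks falling back to defaults. The name count-real-tags is hypothetical and not part of the egg.

```scheme
(import html-parser)  ; assumption: CHICKEN 5 module syntax

;; Hypothetical helper: counts start tags that appear literally in the
;; input, skipping the "virtual" ones inserted to repair the tree.
(define count-real-tags
  (let ((parse
         (make-html-parser
          ;; fdown: the returned value is the seed seen inside the element
          'start: (lambda (tag attrs seed virtual?)
                    (if virtual? seed (+ seed 1)))
          ;; fup: SEED already includes everything counted inside the
          ;; element, so it becomes the seed that follows the element
          'end:   (lambda (tag attrs parent-seed seed virtual?) seed)
          ;; text does not affect the count
          'text:  (lambda (text seed) seed))))
    (lambda (port)
      (parse 0 port))))

(display (count-real-tags (open-input-string "<p>Hello <b>world</b></p>")))
(newline)
```

Because the end callback returns SEED rather than PARENT-SEED, counts made inside an element survive after it closes; returning PARENT-SEED would discard them.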

Convenience functions

html->sxml
[procedure] (html->sxml [port])

Returns the SXML representation of the document from PORT, using the default parsing options.
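A minimal usage sketch, assuming CHICKEN 5 module syntax and a string port as input; the variable name doc is only illustrative.

```scheme
(import html-parser)  ; assumption: CHICKEN 5 module syntax

;; Parse a small fragment from a string port into SXML.
(define doc
  (html->sxml (open-input-string "<p class=\"x\">Hello <b>world</b></p>")))

;; The result is a list of nodes. Each element is a list headed by its
;; tag symbol; attributes, if any, follow in an (@ ...) list.
(write doc)
(newline)
```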

html-strip
[procedure] (html-strip [port])

Returns a string representation of the document from PORT with all tags removed. No whitespace reduction or other rendering is done.
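A minimal usage sketch under the same assumptions (CHICKEN 5 module syntax, input read from a string port); the variable name plain is only illustrative.

```scheme
(import html-parser)  ; assumption: CHICKEN 5 module syntax

;; Strip the markup from a small fragment and keep only its text.
(define plain
  (html-strip (open-input-string "<p>Hello <b>world</b></p>")))

(write plain)
(newline)
```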

Examples

This is the definition of the html->sxml convenience function included in the egg:

 (define html->sxml
   (let ((parse
          (make-html-parser
           'start: (lambda (tag attrs seed virtual?) '())
           'end:   (lambda (tag attrs parent-seed seed virtual?)
                     `((,tag ,@(if (pair? attrs)
                                   `((@ ,@attrs) ,@(reverse seed))
                                   (reverse seed)))
                       ,@parent-seed))
           'decl:    (lambda (tag attrs seed) `((*DECL* ,tag ,@attrs) ,@seed))
           'process: (lambda (attrs seed) `((*PI* ,@attrs) ,@seed))
           'comment: (lambda (text seed) `((*COMMENT* ,text) ,@seed))
           'text:    (lambda (text seed) (cons text seed))
           )))
     (lambda o
       (reverse (apply parse '() o)))))

The parser for html-strip could be defined as:

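 (make-html-parser
   ;; tags contribute nothing; pass the seed through unchanged
   'start: (lambda (tag attrs seed virtual?) seed)
   'end:   (lambda (tag attrs parent-seed seed virtual?) seed)
   ;; write each text chunk to the current output port, keeping the seed
   'text:  (lambda (text seed) (display text) seed))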
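html-strip returns a string, whereas the parser above writes to the current output port. One way to obtain a string directly is to accumulate the text chunks in the seed and join them at the end. This is a sketch and not the egg's actual definition; strip-tags is a hypothetical name, and the import assumes CHICKEN 5.

```scheme
(import html-parser)  ; assumption: CHICKEN 5 module syntax

;; Hypothetical variant of html-strip that builds its result in the seed.
(define strip-tags
  (let ((parse
         (make-html-parser
          ;; tags are ignored; the seed passes straight through
          'start: (lambda (tag attrs seed virtual?) seed)
          'end:   (lambda (tag attrs parent-seed seed virtual?) seed)
          ;; push each text chunk onto the seed (newest first)
          'text:  (lambda (text seed) (cons text seed)))))
    (lambda (port)
      ;; restore document order, then join the chunks into one string
      (apply string-append (reverse (parse '() port))))))

(write (strip-tags (open-input-string "<p>Hello <b>world</b></p>")))
(newline)
```

This also shows the string-append step that the TEXT callback description leaves to the caller, since text between two tags may arrive in several chunks.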

Changelog

License

BSD-style license: http://synthcode.com/license.txt.