[[tags: egg]]

== html-parser

[[toc:]]

=== Description

A permissive, scalable HTML parser.

=== Author

[[/users/alex-shinn|Alex Shinn]]

=== Repository

This egg is hosted on the CHICKEN Subversion repository:

[[https://anonymous@code.call-cc.org/svn/chicken-eggs/release/6/html-parser|https://anonymous@code.call-cc.org/svn/chicken-eggs/release/6/html-parser]]

If you want to check out the source code repository of this egg and
you are not familiar with Subversion, see [[/egg-svn-checkout|this page]].

=== Documentation

{{html-parser}} is intended as a permissive HTML parser for people who
prefer the scalable interface described in Oleg Kiselyov's SSAX parser.
It also provides simple convenience utilities. It handles invalid HTML
by inserting "virtual" start and end tags as needed to maintain the
tree structure required by the foldts down/up logic. A major goal of
this parser is bug-for-bug compatibility with the way common web
browsers parse HTML.

==== Main interface

===== make-html-parser

<procedure>(make-html-parser . keys)</procedure>

Returns a procedure of two arguments, an initial seed and an optional
input port, which parses the HTML document from the port with the
callbacks specified in the plist {{KEYS}} (using normal, quoted symbols,
for portability and to avoid making this a macro). The callback names
are shown in upper case below; in code they are passed as quoted
lowercase symbols such as {{'start:}}, as in the Examples section. The
following callbacks are recognized:

  START: TAG ATTRS SEED VIRTUAL?
      fdown in foldts, called when a start-tag is encountered.
    TAG:         tag name
    ATTRS:       tag attributes as an alist
    SEED:        current seed value
    VIRTUAL?:    #t iff this start tag was inserted to fix the HTML tree

  END: TAG ATTRS PARENT-SEED SEED VIRTUAL?
      fup in foldts, called when an end-tag is encountered.
    TAG:         tag name
    ATTRS:       tag attributes of the corresponding start tag
    PARENT-SEED: parent seed value (i.e. seed passed to the start tag)
    SEED:        current seed value
    VIRTUAL?:    #t iff this end tag was inserted to fix the HTML tree

  TEXT: TEXT SEED
      fhere in foldts, called when any text is encountered.
      May be called multiple times between a start and end tag, so you
      need to string-append the pieces yourself if desired.
    TEXT:        entity-decoded text
    SEED:        current seed value

  COMMENT: TEXT SEED
      fhere on comment data

  DECL: NAME ATTRS SEED
      fhere on declaration data

  PROCESS: LIST SEED
      fhere on processing-instruction data

In addition, entity mappings may be overridden with the {{ENTITIES:}} keyword.

==== Convenience functions

===== html->sxml

<procedure>(html->sxml [port])</procedure>

Returns the SXML representation of the document from {{PORT}}, using
the default parsing options.

===== html-strip

<procedure>(html-strip [port])</procedure>

Returns a string representation of the document from {{PORT}} with all
tags removed. No whitespace reduction or other rendering is done.

==== Misc

===== make-string-reader/ci

<procedure>(make-string-reader/ci str)</procedure>

Generates a Knuth-Morris-Pratt (KMP) reader that works on ports,
returning the text read up until the search string (or the entire port
if the search string isn't found). This is O(n) in the length n of the
string returned, as opposed to {{find-string-from-port?}} in SSAX, which
uses backtracking for an O(nm) algorithm, where m is the length of the
search string. Matching is hard-coded to be case-insensitive, since
that is what HTML needs. A more general utility would abstract the
character matching predicate and possibly provide a limit on the length
of the string read.

===== html-display-escaped-string

<procedure>(html-display-escaped-string str out)</procedure>

Writes an HTML-escaped string to the output port, replacing the
characters {{<>&"'}} with the appropriate HTML entities.

===== html-escape

<procedure>(html-escape str)</procedure>

Returns an HTML-escaped string. Equivalent to using
{{html-display-escaped-string}} with a string output port.

===== html-attr->string

<procedure>(html-attr->string attr)</procedure>

Formats an attribute pair as a string. Both {{(name . "value")}} and
{{(name "value")}} are supported and are turned into {{"name=\"value\""}}.
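
The escaping helpers described above can be tried out with a short sketch like the following. It assumes the egg is installed and imported with {{(import html-parser)}} under CHICKEN 5 or later; the sample strings and the use of {{call-with-output-string}} from {{(chicken port)}} are illustrative choices, not taken from the egg. The only results shown are the ones stated in the text above.

<enscript highlight=scheme>
(import html-parser (chicken port))

;; Escape a string and get the result back as a new string.
(define escaped (html-escape "1 < 2 & 3 > 2"))

;; The same escaping, written directly to a string output port.
;; Per the description of html-escape, this should yield the same text.
(define escaped-via-port
  (call-with-output-string
    (lambda (out)
      (html-display-escaped-string "1 < 2 & 3 > 2" out))))

;; Both attribute shapes are accepted and formatted the same way.
(html-attr->string '(href . "#"))   ; => "href=\"#\""
(html-attr->string '(href "#"))     ; => "href=\"#\""
</enscript>

Writing to a port avoids building an intermediate string when the escaped text is going straight to the output anyway.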

===== html-tag->string

<procedure>(html-tag->string tag attrs)</procedure>

Formats a tag and attribute list as a string. The tag must be a symbol;
the attribute list is processed using {{html-attr->string}}. For
example, {{(html-tag->string 'a '((href "#")))}} is turned into
{{"<a href=\"#\">"}}.

===== sxml-display-as-html

<procedure>(sxml-display-as-html sxml [port])</procedure>

Writes the HTML representation of {{sxml}}, with the optional port
argument defaulting to {{(current-output-port)}}. Processing
instructions, declarations, comments, top nodes and regular nodes are
handled.

===== sxml->html

<procedure>(sxml->html sxml)</procedure>

Converts {{sxml}} to its HTML representation as a string. Equivalent to
using {{sxml-display-as-html}} with a string output port.

=== Examples

This is the definition of the {{html->sxml}} convenience function
included in the egg:

<enscript highlight=scheme>
(define html->sxml
  (let ((parse (make-html-parser
                ;; Each element starts with an empty seed for its children.
                'start: (lambda (tag attrs seed virtual?) '())
                ;; Build the element node and push it onto the parent's seed.
                'end:   (lambda (tag attrs parent-seed seed virtual?)
                          `((,tag ,@(if (pair? attrs)
                                        `((@ ,@attrs) ,@(reverse seed))
                                        (reverse seed)))
                            ,@parent-seed))
                'decl:    (lambda (tag attrs seed) `((*DECL* ,tag ,@attrs) ,@seed))
                'process: (lambda (attrs seed) `((*PI* ,@attrs) ,@seed))
                'comment: (lambda (text seed) `((*COMMENT* ,text) ,@seed))
                'text:    (lambda (text seed) (cons text seed))
                )))
    (lambda o
      (reverse (apply parse '() o)))))
</enscript>

The parser for {{html-strip}} could be defined as:

<enscript highlight=scheme>
(make-html-parser
 'start: (lambda (tag attrs seed virtual?) seed)
 'end:   (lambda (tag attrs parent-seed seed virtual?) seed)
 ;; Print the text and hand the seed back unchanged.
 'text:  (lambda (text seed) (display text) seed))
</enscript>

=== Changelog

* 0.4.1 Remove UTF-8 hacks which are no longer needed
* 0.4 Ported to CHICKEN 6 (by felix)
* 0.3 Add {{meta}} and {{link}} tag handling
* 0.2 Ported to CHICKEN 5 (by felix)
* 0.1 Import upstream as of 2009-01-25

=== License

BSD-style license: [[http://synthcode.com/license.txt]].