ssax
Oleg Kiselyov's XML parser.
Documentation
See the official SSAX homepage for comprehensive documentation.
Requirements
Requires the following extensions:
ssax:xml->sxml
[procedure] (ssax:xml->sxml PORT NAMESPACE-PREFIX-ASSIG)
Reads XML data from PORT and returns its SXML representation. NAMESPACE-PREFIX-ASSIG is an alist of (USER-PREFIX . URI-STRING) pairs that maps user prefixes (symbols) to namespaces (URI strings); it may be the empty list.
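As a rough illustration of the call described above, the following sketch parses two small in-memory documents. It assumes CHICKEN 5 module names (`ssax` and `(chicken port)` for `call-with-input-string`); the variable names `doc` and `ns-doc`, the sample XML, and the `http://example.org/ns` URI are made up for the example.

```scheme
(import ssax (chicken port))

;; Parse a document that uses no namespaces: pass the empty list
;; as NAMESPACE-PREFIX-ASSIG.
(define doc
  (call-with-input-string
   "<greeting lang=\"en\">Hello</greeting>"
   (lambda (port) (ssax:xml->sxml port '()))))

;; Parse a document with a default namespace, asking for the
;; (hypothetical) user prefix `x` to stand for its URI.
(define ns-doc
  (call-with-input-string
   "<root xmlns=\"http://example.org/ns\"><item/></root>"
   (lambda (port)
     (ssax:xml->sxml port '((x . "http://example.org/ns"))))))

(write doc)
(newline)
(write ns-doc)
(newline)
```

The result is an ordinary S-expression rooted at `*TOP*`, so it can be inspected with `car`, `cdr` and friends or handed to an SXML tool.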
ssax:make-parser
[syntax] (ssax:make-parser TAG1 VAL1 [TAG2 VAL2 ...])
Creates a custom XML parser, an instance of the XML parsing framework. Depending on the supplied user handlers, this will be a SAX parser, a DOM parser or a specialized parser.
The arguments to ssax:make-parser are symbol-value pairs, interleaved in the argument list. In other words, TAG1, TAG2 etc. are unquoted(!) symbols that identify the type of value that follows the tag. See below for the list of allowed tags.
The output of this macro is a procedure that represents a parser and accepts two arguments, PORT and SEED. PORT is the port from which to read the XML data, and SEED is the initial value of an accumulator. The accumulator is passed to the first handler procedure, which can extend it and return it; the returned value is then passed on to the next handler, and so on, until the final result is obtained, in a FOLD-like fashion.
Given below are tags and signatures of the corresponding values. Not all tags have to be specified. If some are omitted, reasonable defaults will apply. SEED always represents the current value of the accumulator that will eventually be returned by the parser.
DOCTYPE
procedure PORT DOCNAME SYSTEMID INTERNAL-SUBSET? SEED
If INTERNAL-SUBSET? is #t, the current position in the port is right after we have read #\[ that begins the internal DTD subset. We must finish reading of this subset before we return (or must call ssax:skip-internal-dtd if we aren't interested in reading it).
The port at exit must be at the first symbol after the whole DOCTYPE declaration. The handler-procedure must generate four values:
ELEMS ENTITIES NAMESPACES SEED
See xml-decl::elems for ELEMS. It may be #f to switch off the validation. NAMESPACES will typically contain user prefixes for selected URI symbols. The default handler-procedure skips the internal subset, if any, and returns (values #f '() '() SEED).
UNDECL-ROOT
procedure ELEM-GI SEED
where ELEM-GI is the UNRES-NAME of the root element. This procedure is called when the XML document being parsed contains no DOCTYPE declaration. The handler-procedure, like the DOCTYPE handler-procedure above, must generate four values:
ELEMS ENTITIES NAMESPACES SEED
The default handler-procedure returns (values #f '() '() SEED).
NEW-LEVEL-SEED
procedure ELEM-GI ATTRIBUTES NAMESPACES EXPECTED-CONTENT SEED
where ELEM-GI is a RES-NAME of the element about to be processed. This procedure is to generate the seed to be passed to handlers that process the content of the element.
FINISH-ELEMENT
procedure ELEM-GI ATTRIBUTES NAMESPACES PARENT-SEED SEED
This procedure is called when parsing of ELEM-GI is finished. SEED is the result from the last content parser (or from new-level-seed if the element has empty content). PARENT-SEED is the same seed as was passed to new-level-seed. The procedure is to generate a seed that will be the result of the element parser.
CHAR-DATA-HANDLER
procedure STRING1 STRING2 SEED
The procedure is supposed to handle a chunk of character data STRING1 followed by a chunk of character data STRING2. STRING2 is a short string, often "\n" or even "". Returns a new SEED.
PI
association-list ((PI-TAG . PI-HANDLER) ...)
where PI-TAG is the name of the processing instruction and PI-HANDLER is a procedure PORT PI-TAG SEED.
The handler should read the rest of the PI from PORT, up to and including the combination "?>" that terminates the PI. The handler should return a new seed.
One of the PI-TAGs may be the symbol *DEFAULT*. The corresponding handler will handle PIs that no other handler will. If the *DEFAULT* PI-TAG is not specified, ssax:make-pi-parser will assume the default handler that skips the body of the PI.
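To make the fold-like flow of the seed through the handlers concrete, here is a minimal sketch of a parser that counts the elements in a document. It assumes CHICKEN 5 module names (`ssax`, `(chicken port)`); `count-elements`, `n` and the sample XML are hypothetical names chosen for the example. All three of NEW-LEVEL-SEED, FINISH-ELEMENT and CHAR-DATA-HANDLER are supplied explicitly rather than relying on defaults.

```scheme
(import ssax (chicken port))

;; The seed is a running count of elements whose parsing has finished.
(define (count-elements xml-port)
  ((ssax:make-parser
    NEW-LEVEL-SEED
    ;; Entering an element: hand the current count down to its content.
    (lambda (elem-gi attributes namespaces expected-content seed)
      seed)

    FINISH-ELEMENT
    ;; Leaving an element: SEED already includes every finished
    ;; descendant, so add one for the element itself.
    (lambda (elem-gi attributes namespaces parent-seed seed)
      (+ seed 1))

    CHAR-DATA-HANDLER
    ;; Character data does not affect the count.
    (lambda (string1 string2 seed)
      seed))
   xml-port 0))

(define n
  (call-with-input-string "<a><b/><c><d/></c></a>" count-elements))

(display n)   ; the sample document has four elements: a, b, c, d
(newline)
```

Because the value returned by FINISH-ELEMENT becomes the seed seen by the next sibling, no mutable state is needed to carry the count across the tree.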
ssax:warn
[procedure] (ssax:warn port message . other-messages)
Prints message and any other-messages as a warning. Note that the current implementation ignores port and writes to (current-error-port) instead; the argument is probably kept only for API compatibility.
xml-token-kind
[syntax] (xml-token-kind TOKEN)
Returns the TAG-KIND of the supplied TOKEN.
- TAG-KIND A symbol 'START, 'END, 'PI, 'DECL, 'COMMENT, 'CDSECT or 'ENTITY-REF that identifies a markup token.
xml-token-head
[syntax] (xml-token-head TOKEN)
Returns the UNRES-NAME of the supplied TOKEN. For xml-tokens of kinds 'COMMENT and 'CDSECT, the head is #f.
- UNRES-NAME A name (called GI in the XML Recommendation) as given in an xml document for a markup token: start-tag, PI target, attribute name. If a GI is an NCName, UNRES-NAME is this NCName converted into a Scheme symbol. If a GI is a QName, UNRES-NAME is a pair of symbols: (PREFIX . LOCALPART).
html-entity-unicode-chars
[constant] html-entity-unicode-chars
An association list mapping named entities to Unicode characters (actually UTF-8 byte strings). Intended to be used in an ENTITIES context.
Data Types
- TAG-KIND
A symbol 'START, 'END, 'PI, 'DECL, 'COMMENT, 'CDSECT or 'ENTITY-REF that identifies a markup token
- UNRES-NAME
A name (called GI in the XML Recommendation) as given in an xml document for a markup token: start-tag, PI target, attribute name. If a GI is an NCName, UNRES-NAME is this NCName converted into a Scheme symbol. If a GI is a QName, UNRES-NAME is a pair of symbols: (PREFIX . LOCALPART)
- RES-NAME
An expanded name, a resolved version of an UNRES-NAME. For an element or an attribute name with a non-empty namespace URI, RES-NAME is a pair of symbols, (URI-SYMB . LOCALPART). Otherwise, it's a single symbol.
- ELEM-CONTENT-MODEL
A symbol:
ANY - anything goes, expect an END tag.
EMPTY-TAG - no content, and no END-tag is coming
EMPTY - no content, expect the END-tag as the next token
PCDATA - expect character data only, and no children elements
MIXED
ELEM-CONTENT
- URI-SYMB
A symbol representing a namespace URI, or another symbol chosen by the user to represent the URI. In the former case, URI-SYMB is created by %-quoting the bad URI characters and converting the resulting string into a symbol.
- NAMESPACES
A list representing namespaces in effect. An element of the list has one of the following forms:
(PREFIX URI-SYMB . URI-SYMB) or (PREFIX USER-PREFIX . URI-SYMB) where USER-PREFIX is a symbol chosen by the user to represent the URI.
(#f USER-PREFIX . URI-SYMB) Specification of the user-chosen prefix and a URI-SYMBOL.
(*DEFAULT* USER-PREFIX . URI-SYMB) Declaration of the default namespace
(*DEFAULT* #f . #f) Un-declaration of the default namespace. This notation represents overriding of the previous declaration
A NAMESPACES list may contain several elements for the same PREFIX. The one closest to the beginning of the list takes effect.
- ATTLIST
An ordered collection of (NAME . VALUE) pairs, where NAME is a RES-NAME or an UNRES-NAME. The collection is an ADT.
- STR-HANDLER
A procedure of three arguments, STRING1 STRING2 SEED, returning a new SEED. The procedure is supposed to handle a chunk of character data STRING1 followed by a chunk of character data STRING2. STRING2 is a short string, often "\n" or even "".
- ENTITIES
An assoc list of pairs:
(named-entity-name . named-entity-body)
where named-entity-name is a symbol under which the entity was declared, named-entity-body is either a string, or (for an external entity) a thunk that will return an input port (from which the entity can be read). named-entity-body may also be #f. This is an indication that a named-entity-name is currently being expanded. A reference to this named-entity-name will be an error: violation of the WFC nonrecursion.
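The ENTITIES alist described above can be sketched as follows: an UNDECL-ROOT handler returns a small entity table, and a CHAR-DATA-HANDLER collects the resulting text. This assumes CHICKEN 5 module names (`ssax`, `(chicken port)`); `my-entities`, `text-of`, `text` and the entity names `copy` and `project` are hypothetical, chosen only for illustration. It also assumes the document has no DOCTYPE declaration, so that UNDECL-ROOT is the handler that gets called.

```scheme
(import ssax (chicken port))

;; A hand-written ENTITIES alist: symbol -> replacement string.
;; html-entity-unicode-chars is documented as usable in this position too.
(define my-entities
  '((copy . "(c)")
    (project . "SSAX")))

;; Concatenate all character data in the document into one string.
(define (text-of xml-port)
  ((ssax:make-parser
    UNDECL-ROOT
    ;; No DOCTYPE: switch validation off (#f), supply our entities,
    ;; declare no namespaces, and pass the seed through unchanged.
    (lambda (elem-gi seed)
      (values #f my-entities '() seed))

    NEW-LEVEL-SEED
    (lambda (elem-gi attributes namespaces expected-content seed)
      seed)

    FINISH-ELEMENT
    (lambda (elem-gi attributes namespaces parent-seed seed)
      seed)

    CHAR-DATA-HANDLER
    ;; Append both chunks of character data to the accumulated text.
    (lambda (string1 string2 seed)
      (string-append seed string1 string2)))
   xml-port ""))

(define text
  (call-with-input-string "<p>&copy; &project; authors</p>" text-of))

(write text)
(newline)
```

References to entities that are not in the alist are not covered by this sketch; a real handler would need to decide how to treat them.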
Unicode Compatibility
ssax:xml->sxml will convert numeric entities to UTF-8 byte sequences. html-entity-unicode-chars can be used in an ENTITIES context to map named entities to UTF-8 byte sequences.
Example
; Pretty-print the structure of an XML document, disregarding the
; character data.
; This example corresponds to outline.c of the Expat distribution.
; The example demonstrates how to transform an XML document on the
; fly, as we parse it.
;
; $Id: outline.scm,v 1.2 2002/12/10 22:28:14 oleg Exp $

(define (outline xml-port)
  ; The seed describes the depth of an element relative to the root of the tree
  ; To be more precise, the seed is the string of space characters
  ; to output to indent the current element. The indent increases by two
  ; space characters for the next nested element.
  ((ssax:make-parser
    NEW-LEVEL-SEED
    (lambda (elem-gi attributes namespaces expected-content seed)
      (display seed)              ; indent the element name
      (display elem-gi)           ; print the name of the element
      (newline)
      (string-append "  " seed))  ; advance the indent level

    FINISH-ELEMENT
    (lambda (elem-gi attributes namespaces parent-seed seed)
      parent-seed)                ; restore the indent level

    CHAR-DATA-HANDLER
    (lambda (string1 string2 seed)
      seed))
   xml-port ""))
Author
Oleg Kiselyov, with some Chicken-specific modifications by Kirill Lisovsky. Minor changes by felix winkelmann to make the code suitable as an extension library.
Repository
This egg is hosted on the CHICKEN Subversion repository:
https://anonymous@code.call-cc.org/svn/chicken-eggs/release/6/ssax
If you want to check out the source code repository of this egg and you are not familiar with Subversion, see this page.
Changelog
- 5.1.2 Fix incomplete unhacking of UTF8 support
- 5.1.1 Port/unhack UTF8 support for CHICKEN 6
- 5.1.0 Port to CHICKEN 5
- 5.0.7 Raise more detailed exceptions (thanks to Christian Kellermann)
- 5.0.6 Fix exports for make-*parser macros [Peter Bex, with thanks to Taylor Venable and Stephen Ramsay].
- 5.0.5 Add html-entity-unicode-chars [Jim Ursetto]
- 5.0.0 Port to Chicken a fresh import of the clean upstream CVS tree (which now has downcased names)
- 4.9.8 Convert numeric entities > 255 to UTF-8 [Jim Ursetto]
- 4.9.7 Using ##sys#read/peek-char instead of read/peek-char [Daishi Kato]
- 4.9.6 parser-error now raises a condition [Daishi Kato]
- 4.9.5 Fixed bug in error-reporting function
- 4.9.4 Replaced (apply string-append ...) calls with string-concatenate
- 4.9.3 Adapted to new setup scheme. Fixed a reentrancy-bug [Thanks to Bruce Butterfield]
- 4.9.2 SSAX:warn adds newline [Thanks to Sunnan]
- 4.9.1 Fixed exports for case-sensitivity.
License
Public Domain