srfi-14

Character set library. An abbreviated version of the SRFI is provided in this document. Full documentation is available in the original SRFI-14 document.

On systems that support dynamic loading, the srfi-14 unit can be made available in the interpreter (csi) by entering

(require-extension srfi-14)

This library provides only the Latin-1 character set. To get Unicode semantics, see the utf8 egg. However, information on Unicode character sets is still provided in this document.

Specification

In the following procedure specifications:

A CS parameter is a character set.
An S parameter is a string.
A CHAR parameter is a character.
A CHAR-LIST parameter is a list of characters.
A PRED parameter is a unary character predicate procedure, returning a true/false value when applied to a character.
An OBJ parameter may be any value at all.

Passing values to procedures with these parameters that do not satisfy these types is an error.

Unless otherwise noted in the specification of a procedure, procedures always return character sets that are distinct (from the point of view of the linear-update operations) from the parameter character sets. For example, char-set-adjoin is guaranteed to provide a fresh character set, even if it is not given any character parameters.

Parameters given in square brackets are optional. Unless otherwise noted in the text describing the procedure, any prefix of these optional parameters may be supplied, from zero arguments to the full list. When a procedure returns multiple values, this is shown by listing the return values in square brackets, as well. So, for example, the procedure with signature

 halts? F [X INIT-STORE] -> [BOOLEAN INTEGER]

would take one (F), two (F, X) or three (F, X, INIT-STORE) input parameters, and return two values, a boolean and an integer.

A parameter followed by "..." means zero-or-more elements. So the procedure with the signature

 sum-squares X ...  -> NUMBER

takes zero or more arguments (X ...), while the procedure with signature

 spell-check DOC DICT_1 DICT_2 ... -> STRING-LIST

takes two required parameters (DOC and DICT_1) and zero or more optional parameters (DICT_2 ...).

General procedures

[procedure] (char-set? obj) -> boolean

Is the object OBJ a character set?

[procedure] (char-set= cs_1 ...) -> boolean

Are the character sets equal?

Boundary cases:

 (char-set=) => TRUE
 (char-set= cs) => TRUE

Rationale: transitive binary relations are generally extended to n-ary relations in Scheme, which enables clearer, more concise code to be written. While the zero-argument and one-argument cases will almost certainly not arise in first-order uses of such relations, they may well arise in higher-order cases or macro-generated code. E.g., consider

 (apply char-set= cset-list)

This is well-defined if the list is empty or a singleton list. Hence we extend these relations to any number of arguments. Implementors have reported actual uses of n-ary relations in higher-order cases allowing for fewer than two arguments. The way of Scheme is to handle the general case; we provide the fully general extension.

A counter-argument to this extension is that R5RS's transitive binary arithmetic relations (=, <, etc.) require at least two arguments, hence this decision is a break with the prior convention -- although it is at least one that is backwards-compatible.

[procedure] (char-set<= cs_1 ...) -> boolean

Returns true if every character set CS_I is a subset of character set CS_I+1.

Boundary cases:

 (char-set<=) => TRUE
 (char-set<= cs) => TRUE

Rationale: See char-set= for discussion of zero- and one-argument applications. Consider testing a list of char-sets for monotonicity with

(apply char-set<= cset-list)

[procedure] (char-set-hash cs [bound]) -> integer

Compute a hash value for the character set CS. BOUND is a non-negative exact integer specifying the range of the hash function. A positive value restricts the return value to the range [0,BOUND).

If BOUND is either zero or not given, the implementation may use an implementation-specific default value, chosen to be as large as is efficiently practical. For instance, the default range might be chosen for a given implementation to map all strings into the range of integers that can be represented with a single machine word.

Invariant:

 (char-set= cs_1 cs_2) => (= (char-set-hash cs_1 b) (char-set-hash cs_2 b))

A legal but nonetheless discouraged implementation:

(define (char-set-hash cs . maybe-bound) 1)

Rationale: allowing the user to specify an explicit bound simplifies user code by removing the mod operation that typically accompanies every hash computation, and also may allow the implementation of the hash function to exploit a reduced range to efficiently compute the hash value. E.g., for small bounds, the hash function may be computed in a fashion such that intermediate values never overflow into bignum integers, allowing the implementor to provide a fixnum-specific "fast path" for computing the common cases very rapidly.

Iterating over character sets

[procedure] (char-set-cursor cset) -> cursor
[procedure] (char-set-ref cset cursor) -> char
[procedure] (char-set-cursor-next cset cursor) -> cursor
[procedure] (end-of-char-set? cursor) -> boolean

Cursors are a low-level facility for iterating over the characters in a set. A cursor is a value that indexes a character in a char set. char-set-cursor produces a new cursor for a given char set. The set element indexed by the cursor is fetched with char-set-ref. A cursor index is incremented with char-set-cursor-next; in this way, code can step through every character in a char set. Stepping a cursor "past the end" of a char set produces a cursor that answers true to end-of-char-set?. It is an error to pass such a cursor to char-set-ref or to char-set-cursor-next.

A cursor value may not be used in conjunction with a different character set; if it is passed to char-set-ref or char-set-cursor-next with a character set other than the one used to create it, the results and effects are undefined.

Cursor values are not necessarily distinct from other types. They may be integers, linked lists, records, procedures or other values. This license is granted to allow cursors to be very "lightweight" values suitable for tight iteration, even in fairly simple implementations.

Note that these primitives are necessary to export an iteration facility for char sets to loop macros.

Example:

(define cs (char-set #\G #\a #\T #\e #\c #\h))
 
;; Collect elts of CS into a list.
(let lp ((cur (char-set-cursor cs)) (ans '()))
  (if (end-of-char-set? cur) ans
      (lp (char-set-cursor-next cs cur)
          (cons (char-set-ref cs cur) ans))))
  => (#\G #\T #\a #\c #\e #\h)
 
;; Equivalently, using a list unfold (from SRFI 1):
(unfold-right end-of-char-set? 
              (curry char-set-ref cs)
              (curry char-set-cursor-next cs)
              (char-set-cursor cs))
  => (#\G #\T #\a #\c #\e #\h)

Rationale: Note that the cursor API's four functions "fit" the functional protocol used by the unfolders provided by the list, string and char-set SRFIs (see the example above). By way of contrast, here is a simpler, two-function API that was rejected for failing this criterion. Besides char-set-cursor, it provided a single function that mapped a cursor and a character set to two values, the indexed character and the next cursor. If the cursor had exhausted the character set, then this function returned false instead of the character value, and another end-of-char-set cursor. In this way, the other three functions of the current API were combined together.

[procedure] (char-set-fold kons knil cs) -> object

This is the fundamental iterator for character sets. Applies the function KONS across the character set CS using initial state value KNIL. That is, if CS is the empty set, the procedure returns KNIL. Otherwise, some element C of CS is chosen; let CS' be the remaining, unchosen characters. The procedure returns

 (char-set-fold KONS (KONS C KNIL) CS')

Examples:

;; CHAR-SET-MEMBERS
(lambda (cs) (char-set-fold cons '() cs))
 
;; CHAR-SET-SIZE
(lambda (cs) (char-set-fold (lambda (c i) (+ i 1)) 0 cs))
 
;; How many vowels in the char set?
(lambda (cs) 
  (char-set-fold (lambda (c i) (if (vowel? c) (+ i 1) i))
                 0 cs))

[procedure] (char-set-unfold f p g seed [base-cs]) -> char-set
[procedure] (char-set-unfold! f p g seed base-cs) -> char-set

This is a fundamental constructor for char-sets.

G is used to generate a series of "seed" values from the initial seed: SEED, (G SEED), (G^2 SEED), (G^3 SEED), ...
P tells us when to stop -- when it returns true when applied to one of these seed values.
F maps each seed value to a character. These characters are added to the base character set BASE-CS to form the result; BASE-CS defaults to the empty set. char-set-unfold! adds the characters to BASE-CS in a linear-update -- it is allowed, but not required, to side-effect and use BASE-CS's storage to construct the result.

More precisely, the following definitions hold, ignoring the optional-argument issues:

(define (char-set-unfold p f g seed base-cs) 
  (char-set-unfold! p f g seed (char-set-copy base-cs)))
  
(define (char-set-unfold! p f g seed base-cs)
  (let lp ((seed seed) (cs base-cs))
        (if (p seed) cs                                 ; P says we are done.
            (lp (g seed)                                ; Loop on (G SEED).
                (char-set-adjoin! cs (f seed))))))      ; Add (F SEED) to set.

(Note that the actual implementation may be more efficient.)

Examples:

 (port->char-set p) = (char-set-unfold eof-object? values
                                       (lambda (x) (read-char p))
                                       (read-char p))
 
 (list->char-set lis) = (char-set-unfold null? car cdr lis)

[procedure] (char-set-for-each proc cs) -> unspecified

Apply procedure PROC to each character in the character set CS. Note that the order in which PROC is applied to the characters in the set is not specified, and may even change from one procedure application to another.

Nothing at all is specified about the value returned by this procedure; it is not even required to be consistent from call to call. It is simply required to be a value (or values) that may be passed to a command continuation, e.g. as the value of an expression appearing as a non-terminal subform of a begin expression. Note that in R5RS, this restricts the procedure to returning a single value; non-R5RS systems may not even provide this restriction.

[procedure] (char-set-map proc cs) -> char-set

PROC is a char->char procedure. Apply it to all the characters in the char-set CS, and collect the results into a new character set.

Essentially lifts PROC from a char->char procedure to a char-set -> char-set procedure.

Example:

 (char-set-map char-downcase cset)

Creating character sets

[procedure] (char-set-copy cs) -> char-set

Returns a copy of the character set CS. "Copy" means that if either the input parameter or the result value of this procedure is passed to one of the linear-update procedures described below, the other character set is guaranteed not to be altered.

A system that provides pure-functional implementations of the linear-operator suite could implement this procedure as the identity function -- so copies are not guaranteed to be distinct by eq?.

[procedure] (char-set char_1 ...) -> char-set

Return a character set containing the given characters.

[procedure] (list->char-set char-list [base-cs]) -> char-set
[procedure] (list->char-set! char-list base-cs) -> char-set

Return a character set containing the characters in the list of characters CHAR-LIST.

If character set BASE-CS is provided, the characters from CHAR-LIST are added to it. list->char-set! is allowed, but not required, to side-effect and reuse the storage in BASE-CS; list->char-set produces a fresh character set.

[procedure] (string->char-set s [base-cs]) -> char-set
[procedure] (string->char-set! s base-cs) -> char-set

Return a character set containing the characters in the string S.

If character set BASE-CS is provided, the characters from S are added to it. string->char-set! is allowed, but not required, to side-effect and reuse the storage in BASE-CS; string->char-set produces a fresh character set.

[procedure] (char-set-filter pred cs [base-cs]) -> char-set
[procedure] (char-set-filter! pred cs base-cs) -> char-set

Returns a character set containing every character C in CS such that (PRED C) returns true.

If character set BASE-CS is provided, the characters specified by PRED are added to it. char-set-filter! is allowed, but not required, to side-effect and reuse the storage in BASE-CS; char-set-filter produces a fresh character set.

An implementation may not save away a reference to PRED and invoke it after char-set-filter or char-set-filter! returns -- that is, "lazy," on-demand implementations are not allowed, as PRED may have external dependencies on mutable data or have other side-effects.

Rationale: This procedure provides a means of converting a character predicate into its equivalent character set; the CS parameter allows the programmer to bound the predicate's domain. Programmers should be aware that filtering a character set such as char-set:full could be a very expensive operation in an implementation that provided an extremely large character type, such as 32-bit Unicode. An earlier draft of this library provided a simple predicate->char-set procedure, which was rejected in favor of char-set-filter for this reason.

[procedure] (ucs-range->char-set lower upper [error? base-cs]) -> char-set
[procedure] (ucs-range->char-set! lower upper error? base-cs) -> char-set

LOWER and UPPER are exact non-negative integers; LOWER <= UPPER.

Returns a character set containing every character whose ISO/IEC 10646 UCS-4 code lies in the half-open range [LOWER,UPPER).

If the requested range includes unassigned UCS values, these are silently ignored (the current UCS specification has "holes" in the space of assigned codes).
If the requested range includes "private" or "user space" codes, these are handled in an implementation-specific manner; however, a UCS- or Unicode-based Scheme implementation should pass them through transparently.
If any code from the requested range specifies a valid, assigned UCS character that has no corresponding representative in the implementation's character type, then (1) an error is raised if ERROR? is true, and (2) the code is ignored if ERROR? is false (the default). This might happen, for example, if the implementation uses ASCII characters, and the requested range includes non-ASCII characters.

If character set BASE-CS is provided, the characters specified by the range are added to it. ucs-range->char-set! is allowed, but not required, to side-effect and reuse the storage in BASE-CS; ucs-range->char-set produces a fresh character set.

Note that ASCII codes are a subset of the Latin-1 codes, which are in turn a subset of the 16-bit Unicode codes, which are themselves a subset of the 32-bit UCS-4 codes. We commit to a specific encoding in this routine, regardless of the underlying representation of characters, so that client code using this library will be portable. I.e., a conformant Scheme implementation may use EBCDIC or SHIFT-JIS to encode characters; it must simply map the UCS characters from the given range into the native representation when possible, and report errors when not possible.

[procedure] (->char-set x) -> char-set

Coerces X into a char-set. X may be a string, character or char-set. A string is converted to the set of its constituent characters; a character is converted to a singleton set; a char-set is returned as-is. This procedure is intended for use by other procedures that want to provide "user-friendly," wide-spectrum interfaces to their clients.

Querying character sets

[procedure] (char-set-size cs) -> integer

Returns the number of elements in character set CS.

[procedure] (char-set-count pred cs) -> integer

Apply PRED to the chars of character set CS, and return the number of chars that caused the predicate to return true.

[procedure] (char-set->list cs) -> character-list

This procedure returns a list of the members of character set CS. The order in which CS's characters appear in the list is not defined, and may be different from one call to another.

[procedure] (char-set->string cs) -> string

This procedure returns a string containing the members of character set CS. The order in which CS's characters appear in the string is not defined, and may be different from one call to another.

[procedure] (char-set-contains? cs char) -> boolean

This procedure tests CHAR for membership in character set CS.

The MIT Scheme character-set package called this procedure CHAR-SET-MEMBER?, but the argument order isn't consistent with the name.

[procedure] (char-set-every pred cs) -> boolean
[procedure] (char-set-any pred cs) -> boolean

The char-set-every procedure returns true if predicate PRED returns true of every character in the character set CS. Likewise, char-set-any applies PRED to every character in character set CS, and returns the first true value it finds. If no character produces a true value, it returns false. The order in which these procedures sequence through the elements of CS is not specified.

Note that if you need to determine the actual character on which a predicate returns true, use char-set-any and arrange for the predicate to return the character parameter as its true value, e.g.

(char-set-any (lambda (c) (and (char-upper-case? c) c)) 
              cs)

Character-set algebra

[procedure] (char-set-adjoin cs char_1 ...) -> char-set
[procedure] (char-set-delete cs char_1 ...) -> char-set

Add/delete the CHAR_I characters to/from character set CS.

[procedure] (char-set-adjoin! cs char_1 ...) -> char-set
[procedure] (char-set-delete! cs char_1 ...) -> char-set

Linear-update variants. These procedures are allowed, but not required, to side-effect their first parameter.

[procedure] (char-set-complement cs) -> char-set
[procedure] (char-set-union cs_1 ...) -> char-set
[procedure] (char-set-intersection cs_1 ...) -> char-set
[procedure] (char-set-difference cs_1 cs_2 ...) -> char-set
[procedure] (char-set-xor cs_1 ...) -> char-set
[procedure] (char-set-diff+intersection cs_1 cs_2 ...) -> [char-set char-set]

These procedures implement set complement, union, intersection, difference, and exclusive-or for character sets. The union, intersection and xor operations are n-ary. The difference function is also n-ary, associates to the left (that is, it computes the difference between its first argument and the union of all the other arguments), and requires at least one argument.

Boundary cases:

 (char-set-union) => char-set:empty
 (char-set-intersection) => char-set:full
 (char-set-xor) => char-set:empty
 (char-set-difference CS) => CS

char-set-diff+intersection returns both the difference and the intersection of the arguments -- it partitions its first parameter. It is equivalent to

 (values (char-set-difference CS_1 CS_2 ...)
         (char-set-intersection CS_1 (char-set-union CS_2 ...)))

but can be implemented more efficiently.

Programmers should be aware that char-set-complement could potentially be a very expensive operation in Scheme implementations that provide a very large character type, such as 32-bit Unicode. If this is a possibility, sets can be complimented with respect to a smaller universe using char-set-difference.

[procedure] (char-set-complement! cs) -> char-set
[procedure] (char-set-union! cs_1 cs_2 ...) -> char-set
[procedure] (char-set-intersection! cs_1 cs_2 ...) -> char-set
[procedure] (char-set-difference! cs_1 cs_2 ...) -> char-set
[procedure] (char-set-xor! cs_1 cs_2 ...) -> char-set
[procedure] (char-set-diff+intersection! cs_1 cs_2 cs_3 ...) -> [char-set char-set]

These are linear-update variants of the set-algebra functions. They are allowed, but not required, to side-effect their first (required) parameter.

char-set-diff+intersection! is allowed to side-effect both of its two required parameters, CS_1 and CS_2.

Standard character sets

Several character sets are predefined for convenience:

`char-set:lower-case`	Lower-case letters
`char-set:upper-case`	Upper-case letters
`char-set:title-case`	Title-case letters
`char-set:letter`	Letters
`char-set:digit`	Digits
`char-set:letter+digit`	Letters and digits
`char-set:graphic`	Printing characters except spaces
`char-set:printing`	Printing characters including spaces
`char-set:whitespace`	Whitespace characters
`char-set:iso-control`	The ISO control characters
`char-set:punctuation`	Punctuation characters
`char-set:symbol`	Symbol characters
`char-set:hex-digit`	A hexadecimal digit: 0-9, A-F, a-f
`char-set:blank`	Blank characters -- horizontal whitespace
`char-set:ascii`	All characters in the ASCII set.
`char-set:empty`	Empty set
`char-set:full`	All characters

In Unicode Scheme implementations, the base character sets are compatible with Java's Unicode specifications. For ASCII or Latin-1, we simply restrict the Unicode set specifications to their first 128 or 256 codes, respectively.

Here are the definitions for some of the sets in an ASCII implementation:

`char-set:lower-case`	a-z
`char-set:upper-case`	A-Z
`char-set:letter`	A-Z and a-z
`char-set:digit`	0123456789
`char-set:punctuation`	`!"#%&'()*,-./:;?@[\]_{`}
`char-set:symbol`	$+<=>^`\|~
`char-set:whitespace`	Space, newline, tab, form feed, vertical tab, carriage return
`char-set:blank`	Space and tab
`char-set:graphic`	letter + digit + punctuation + symbol
`char-set:printing`	graphic + whitespace
`char-set:iso-control`	ASCII 0-31 and 127

Character set constants

[constant] char-set:lower-case

For Unicode, a character is lowercase if

it is not in the range [U+2000,U+2FFF], and
the Unicode attribute table does not give a lowercase mapping for it, and
at least one of the following is true:
- the Unicode attribute table gives a mapping to uppercase for the character, or
- the name for the character in the Unicode attribute table contains the words "SMALL LETTER" or "SMALL LIGATURE".

The lower-case ASCII characters are

abcdefghijklmnopqrstuvwxyz

Latin-1 adds another 33 lower-case characters to the ASCII set:

00B5	MICRO SIGN
00DF	LATIN SMALL LETTER SHARP S
00E0	LATIN SMALL LETTER A WITH GRAVE
00E1	LATIN SMALL LETTER A WITH ACUTE
00E2	LATIN SMALL LETTER A WITH CIRCUMFLEX
00E3	LATIN SMALL LETTER A WITH TILDE
00E4	LATIN SMALL LETTER A WITH DIAERESIS
00E5	LATIN SMALL LETTER A WITH RING ABOVE
00E6	LATIN SMALL LETTER AE
00E7	LATIN SMALL LETTER C WITH CEDILLA
00E8	LATIN SMALL LETTER E WITH GRAVE
00E9	LATIN SMALL LETTER E WITH ACUTE
00EA	LATIN SMALL LETTER E WITH CIRCUMFLEX
00EB	LATIN SMALL LETTER E WITH DIAERESIS
00EC	LATIN SMALL LETTER I WITH GRAVE
00ED	LATIN SMALL LETTER I WITH ACUTE
00EE	LATIN SMALL LETTER I WITH CIRCUMFLEX
00EF	LATIN SMALL LETTER I WITH DIAERESIS
00F0	LATIN SMALL LETTER ETH
00F1	LATIN SMALL LETTER N WITH TILDE
00F2	LATIN SMALL LETTER O WITH GRAVE
00F3	LATIN SMALL LETTER O WITH ACUTE
00F4	LATIN SMALL LETTER O WITH CIRCUMFLEX
00F5	LATIN SMALL LETTER O WITH TILDE
00F6	LATIN SMALL LETTER O WITH DIAERESIS
00F8	LATIN SMALL LETTER O WITH STROKE
00F9	LATIN SMALL LETTER U WITH GRAVE
00FA	LATIN SMALL LETTER U WITH ACUTE
00FB	LATIN SMALL LETTER U WITH CIRCUMFLEX
00FC	LATIN SMALL LETTER U WITH DIAERESIS
00FD	LATIN SMALL LETTER Y WITH ACUTE
00FE	LATIN SMALL LETTER THORN
00FF	LATIN SMALL LETTER Y WITH DIAERESIS

Note that three of these have no corresponding Latin-1 upper-case character:

00B5	MICRO SIGN
00DF	LATIN SMALL LETTER SHARP S
00FF	LATIN SMALL LETTER Y WITH DIAERESIS

(The compatibility micro character uppercases to the non-Latin-1 Greek capital mu; the German sharp s character uppercases to the pair of characters "SS," and the capital y-with-diaeresis is non-Latin-1.)

[constant] char-set:upper-case

For Unicode, a character is uppercase if

it is not in the range [U+2000,U+2FFF], and
the Unicode attribute table does not give an uppercase mapping for it (this excludes titlecase characters), and
at least one of the following is true:
- the Unicode attribute table gives a mapping to lowercase for the character, or
- the name for the character in the Unicode attribute table contains the words "CAPITAL LETTER" or "CAPITAL LIGATURE".

The upper-case ASCII characters are

ABCDEFGHIJKLMNOPQRSTUVWXYZ

Latin-1 adds another 30 upper-case characters to the ASCII set:

00C0	LATIN CAPITAL LETTER A WITH GRAVE
00C1	LATIN CAPITAL LETTER A WITH ACUTE
00C2	LATIN CAPITAL LETTER A WITH CIRCUMFLEX
00C3	LATIN CAPITAL LETTER A WITH TILDE
00C4	LATIN CAPITAL LETTER A WITH DIAERESIS
00C5	LATIN CAPITAL LETTER A WITH RING ABOVE
00C6	LATIN CAPITAL LETTER AE
00C7	LATIN CAPITAL LETTER C WITH CEDILLA
00C8	LATIN CAPITAL LETTER E WITH GRAVE
00C9	LATIN CAPITAL LETTER E WITH ACUTE
00CA	LATIN CAPITAL LETTER E WITH CIRCUMFLEX
00CB	LATIN CAPITAL LETTER E WITH DIAERESIS
00CC	LATIN CAPITAL LETTER I WITH GRAVE
00CD	LATIN CAPITAL LETTER I WITH ACUTE
00CE	LATIN CAPITAL LETTER I WITH CIRCUMFLEX
00CF	LATIN CAPITAL LETTER I WITH DIAERESIS
00D0	LATIN CAPITAL LETTER ETH
00D1	LATIN CAPITAL LETTER N WITH TILDE
00D2	LATIN CAPITAL LETTER O WITH GRAVE
00D3	LATIN CAPITAL LETTER O WITH ACUTE
00D4	LATIN CAPITAL LETTER O WITH CIRCUMFLEX
00D5	LATIN CAPITAL LETTER O WITH TILDE
00D6	LATIN CAPITAL LETTER O WITH DIAERESIS
00D8	LATIN CAPITAL LETTER O WITH STROKE
00D9	LATIN CAPITAL LETTER U WITH GRAVE
00DA	LATIN CAPITAL LETTER U WITH ACUTE
00DB	LATIN CAPITAL LETTER U WITH CIRCUMFLEX
00DC	LATIN CAPITAL LETTER U WITH DIAERESIS
00DD	LATIN CAPITAL LETTER Y WITH ACUTE
00DE	LATIN CAPITAL LETTER THORN

[constant] char-set:title-case

In Unicode, a character is titlecase if it has the category Lt in the character attribute database. There are very few of these characters; here is the entire 31-character list as of Unicode 3.0:

01C5	LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
01C8	LATIN CAPITAL LETTER L WITH SMALL LETTER J
01CB	LATIN CAPITAL LETTER N WITH SMALL LETTER J
01F2	LATIN CAPITAL LETTER D WITH SMALL LETTER Z
1F88	GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
1F89	GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI
1F8A	GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI
1F8B	GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI
1F8C	GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI
1F8D	GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI
1F8E	GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
1F8F	GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
1F98	GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI
1F99	GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI
1F9A	GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI
1F9B	GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
1F9C	GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI
1F9D	GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI
1F9E	GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
1F9F	GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
1FA8	GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI
1FA9	GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI
1FAA	GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI
1FAB	GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI
1FAC	GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI
1FAD	GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI
1FAE	GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
1FAF	GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
1FBC	GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
1FCC	GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
1FFC	GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI

There are no ASCII or Latin-1 titlecase characters.

[constant] char-set:letter

In Unicode, a letter is any character with one of the letter categories (Lu, Ll, Lt, Lm, Lo) in the Unicode character database.

There are 52 ASCII letters

abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

There are 117 Latin-1 letters. These are the 115 characters that are members of the Latin-1 char-set:lower-case and char-set:upper-case sets, plus

00AA	FEMININE ORDINAL INDICATOR
00BA	MASCULINE ORDINAL INDICATOR

(These two letters are considered lower-case by Unicode, but not by SRFI 14.)

[constant] char-set:digit

In Unicode, a character is a digit if it has the category Nd in the character attribute database. In Latin-1 and ASCII, the only such characters are 0123456789. In Unicode, there are other digit characters in other code blocks, such as Gujarati digits and Tibetan digits.

[constant] char-set:hex-digit

The only hex digits are 0123456789abcdefABCDEF.

[constant] char-set:letter+digit

The union of char-set:letter and char-set:digit.

[constant] char-set:graphic

A graphic character is one that would put ink on paper. The ASCII and Latin-1 graphic characters are the members of

char-set:letter

char-set:digit

char-set:punctuation

char-set:symbol

[constant] char-set:printing

A printing character is one that would occupy space when printed, i.e., a graphic character or a space character. char-set:printing is the union of char-set:whitespace and char-set:graphic.

[constant] char-set:whitespace

In Unicode, a whitespace character is either

a character with one of the space, line, or paragraph separator categories (Zs, Zl or Zp) of the Unicode character database.
U+0009 Horizontal tabulation (\t control-I)
U+000A Line feed (\n control-J)
U+000B Vertical tabulation (\v control-K)
U+000C Form feed (\f control-L)
U+000D Carriage return (\r control-M)

There are 24 whitespace characters in Unicode 3.0:

0009	HORIZONTAL TABULATION	\t control-I
000A	LINE FEED	\n control-J
000B	VERTICAL TABULATION	\v control-K
000C	FORM FEED	\f control-L
000D	CARRIAGE RETURN	\r control-M
0020	SPACE	Zs
00A0	NO-BREAK SPACE	Zs
1680	OGHAM SPACE MARK	Zs
2000	EN QUAD	Zs
2001	EM QUAD	Zs
2002	EN SPACE	Zs
2003	EM SPACE	Zs
2004	THREE-PER-EM SPACE	Zs
2005	FOUR-PER-EM SPACE	Zs
2006	SIX-PER-EM SPACE	Zs
2007	FIGURE SPACE	Zs
2008	PUNCTUATION SPACE	Zs
2009	THIN SPACE	Zs
200A	HAIR SPACE	Zs
200B	ZERO WIDTH SPACE	Zs
2028	LINE SEPARATOR	Zl
2029	PARAGRAPH SEPARATOR	Zp
202F	NARROW NO-BREAK SPACE	Zs
3000	IDEOGRAPHIC SPACE	Zs

The ASCII whitespace characters are the first six characters in the above list -- line feed, horizontal tabulation, vertical tabulation, form feed, carriage return, and space. These are also exactly the characters recognised by the Posix isspace() procedure. Latin-1 adds the no-break space.

[constant] char-set:iso-control

The ISO control characters are the Unicode/Latin-1 characters in the ranges [U+0000,U+001F] and [U+007F,U+009F].

ASCII restricts this set to the characters in the range [U+0000,U+001F] plus the character U+007F.

Note that Unicode defines other control characters which do not belong to this set (hence the qualifying prefix "iso-" in the name).

[constant] char-set:punctuation

In Unicode, a punctuation character is any character that has one of the punctuation categories in the Unicode character database (Pc, Pd, Ps, Pe, Pi, Pf, or Po.)

ASCII has 23 punctuation characters:

 !"#%&'()*,-./:;?@[\]_{}

Latin-1 adds six more:

00A1	INVERTED EXCLAMATION MARK
00AB	LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
00AD	SOFT HYPHEN
00B7	MIDDLE DOT
00BB	RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
00BF	INVERTED QUESTION MARK

Note that the nine ASCII characters $+<=>^`|~ are not punctuation. They are "symbols."

[constant] char-set:symbol

In Unicode, a symbol is any character that has one of the symbol categories in the Unicode character database (Sm, Sc, Sk, or So). There are nine ASCII symbol characters:

 $+<=>^`|~

Latin-1 adds 18 more:

00A2	CENT SIGN
00A3	POUND SIGN
00A4	CURRENCY SIGN
00A5	YEN SIGN
00A6	BROKEN BAR
00A7	SECTION SIGN
00A8	DIAERESIS
00A9	COPYRIGHT SIGN
00AC	NOT SIGN
00AE	REGISTERED SIGN
00AF	MACRON
00B0	DEGREE SIGN
00B1	PLUS-MINUS SIGN
00B4	ACUTE ACCENT
00B6	PILCROW SIGN
00B8	CEDILLA
00D7	MULTIPLICATION SIGN
00F7	DIVISION SIGN

[constant] char-set:blank

Blank chars are horizontal whitespace. In Unicode, a blank character is either

a character with the space separator category (Zs) in the Unicode character database.
U+0009 Horizontal tabulation (\t control-I)

There are eighteen blank characters in Unicode 3.0:

0009	HORIZONTAL TABULATION	\t control-I
0020	SPACE	Zs
00A0	NO-BREAK SPACE	Zs
1680	OGHAM SPACE MARK	Zs
2000	EN QUAD	Zs
2001	EM QUAD	Zs
2002	EN SPACE	Zs
2003	EM SPACE	Zs
2004	THREE-PER-EM SPACE	Zs
2005	FOUR-PER-EM SPACE	Zs
2006	SIX-PER-EM SPACE	Zs
2007	FIGURE SPACE	Zs
2008	PUNCTUATION SPACE	Zs
2009	THIN SPACE	Zs
200A	HAIR SPACE	Zs
200B	ZERO WIDTH SPACE	Zs
202F	NARROW NO-BREAK SPACE	Zs
3000	IDEOGRAPHIC SPACE	Zs

The ASCII blank characters are the first two characters above -- horizontal tab and space. Latin-1 adds the no-break space.

Author

The SRFI-14 library author is Olin Shivers, but a lot of code was inherited from MIT Scheme via SCSH into the SRFI process. As Olin himself put it:

 - Ported from MIT Scheme runtime by Brian D. Carlstrom.
 - Massively rehacked & extended by Olin Shivers 6/98.
 - Massively redesigned and rehacked 5/2000 during SRFI process.
 At this point, the code bears the following relationship to the
 MIT Scheme code: "This is my grandfather's axe. My father replaced
 the head, and I have replaced the handle." Nonetheless, we preserve
 the MIT Scheme copyright:
    Copyright (c) 1988-1995 Massachusetts Institute of Technology

The CHICKEN extension is based on the code mentioned above, and maintained by the CHICKEN Team.

Repository

This egg is hosted on the CHICKEN Subversion repository:

https://anonymous@code.call-cc.org/svn/chicken-eggs/release/5/srfi-14

If you want to check out the source code repository of this egg and you are not familiar with Subversion, see this page.

License

 Copyright (c) 1988-1994 Massachusetts Institute of Technology.
 Copyright (c) 1988-1995 Massachusetts Institute of Technology
 This material was developed by the Scheme project at the Massachusetts
 Institute of Technology, Department of Electrical Engineering and
 Computer Science.  Permission to copy and modify this software, to
 redistribute either the original software or a modified version, and
 to use this software for any purpose is granted, subject to the
 following restrictions and understandings.
 
 1. Any copy made of this software must include this copyright notice
 in full.
 
 2. Users of this software agree to make their best efforts (a) to
 return to the MIT Scheme project any improvements or extensions that
 they make, so that these may be included in future releases; and (b)
 to inform MIT of noteworthy uses of this software.
 
 3. All materials developed as a consequence of the use of this
 software shall duly acknowledge such use, in accordance with the usual
 standards of acknowledging credit in academic research.
 
 4. MIT has made no warrantee or representation that the operation of
 this software will be error-free, and MIT is under no obligation to
 provide any services, by way of maintenance, update, or otherwise.
 
 5. In conjunction with products arising from the use of this material,
 there shall be no use of the name of the Massachusetts Institute of
 Technology nor of any adaptation thereof in any advertising,
 promotional, or sales literature without prior written consent from
 MIT in each case.

Version History

0.2.1: Fix segfault when calling certain procedures with non-fixnum as index (fixes #1631)
0.2: Compile with similar performance options as in CHICKEN 4.
0.1: Taken from the srfi-14 core library unit and released as an egg for CHICKEN 5.

Introduction

Specification

General procedures

Iterating over character sets

Creating character sets

Querying character sets

Character-set algebra

Standard character sets

Character set constants

Author

Repository

License

Version History