utf8

Unicode support

Repository

This egg is hosted on the CHICKEN Subversion repository:

https://anonymous@code.call-cc.org/svn/chicken-eggs/release/5/utf8

If you want to check out the source code repository of this egg and you are not familiar with Subversion, see this page.

Documentation

To make your code Unicode aware, just do the following:

 (import utf8)

then all core, extra and regex string operations will be Unicode aware. string-length will return the number of codepoints, not the number of bytes, string-ref will index by codepoints and return a char with an integer value up to 2^21, regular expressions will match single codepoints rather than bytes and understand Unicode character classes, etc.
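As a rough illustration of the difference, the sketch below assumes toplevel use (for example in csi) under CHICKEN 5; the string and the name word are made up for the example.

```scheme
(import utf8)

;; "naïve" has 5 codepoints; "ï" (U+00EF) takes 2 bytes in UTF-8,
;; so the underlying native string is 6 bytes long.
(define word "na\u00efve")

;; With utf8 imported, string-length counts codepoints, not bytes.
(display (string-length word))
(newline)

;; string-ref indexes by codepoint and returns the full character.
(display (char->integer (string-ref word 2)))
(newline)
```

Under these assumptions the first call counts codepoints rather than bytes, and the second returns the whole character "ï" rather than its first byte.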

Strings are still native strings and may be passed safely to external libraries (either Scheme or foreign). Libraries that parse strings invariably do so on ASCII character boundaries and are therefore compatible. Libraries that reference strings by index would need to be replaced by a UTF-8-aware version. To my knowledge, all existing eggs are currently UTF-8 safe.

This extension does not load into the toplevel; it is composed of modules and so must be imported. Because its exported identifiers match those of common CHICKEN imports, the conflicting identifiers must be excluded from those imports. The examples below show how to avoid such conflicts:

 (import
   (except scheme
     string-length string-ref string-set! make-string string substring
     string->list list->string string-fill! write-char read-char display)
   (except (chicken base)
     print print*)
   (except (chicken string)
     reverse-list->string ->string conc string-chop string-split
     string-translate substring=? substring-ci=? substring-index
     substring-index-ci)
   (except (chicken io)
     read-string write-string read-token))

If you are using the regex egg:

 (import
   (except regex
     grep regexp string-substitute string-substitute* string-split-fields
     string-match string-match-positions string-search string-search-positions))

Note that not all Chicken string routines have a utf8 version yet:

Module (chicken string): string-chomp, string-compare3, reverse-string-append
Module (chicken pretty-print): pretty-print
Module (chicken format): printf, sprintf, fprintf
Module (chicken io): read-line, write-line, read-lines
Module (chicken irregex): (already utf8 aware, unless disabled)

To use Unicode-aware SRFI-13 and SRFI-14 using UTF-8 semantics:

 (import utf8-srfi-13)
 (import utf8-srfi-14)

The SRFI-14 module provides an alternative to the standard CHICKEN SRFI-14. As a pure superset which handles arbitrary-sized characters, it should be usable as a drop-in replacement. The only aspect related to UTF-8 is that string->char-set assumes the string is UTF-8 encoded.

R7RS support

The core scheme forms exported by this egg conform to R7RS-small; in particular, string->list and string-fill! accept start and end arguments.
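A minimal sketch of the optional start and end arguments, assuming toplevel use under CHICKEN 5; the names greek and middle are illustrative only.

```scheme
(import utf8)

;; "αβγδ" written with escapes; each letter is 2 bytes in UTF-8.
(define greek "\u03b1\u03b2\u03b3\u03b4")

;; start and end are codepoint indices (end is exclusive), as in R7RS,
;; so this selects the characters at indices 1 and 2 (β and γ).
(define middle (string->list greek 1 3))

(display (map char->integer middle))
(newline)
```

The indices refer to characters, so the two-byte encoding of each letter does not affect which elements are selected.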

Unicode char-sets

The default SRFI-14 char-sets are defined using ASCII-only characters, since this is both useful and lighter-weight. To obtain full Unicode char-set definitions, use the unicode-char-sets unit:

 (import unicode-char-sets)

[Note this is the only extension in this egg with a unicode- prefix, because the char-set handling only depends on individual characters and is independent of the character encoding used in strings.]

The following char-sets are provided based on the Unicode properties:

 char-set:alphabetic
 char-set:arabic
 char-set:armenian
 char-set:ascii-hex-digit
 char-set:bengali
 char-set:bidi-control
 char-set:bopomofo
 char-set:braille
 char-set:buhid
 char-set:canadian-aboriginal
 char-set:cherokee
 char-set:common
 char-set:cypriot
 char-set:cyrillic
 char-set:dash
 char-set:default-ignorable-code-point
 char-set:deprecated
 char-set:deseret
 char-set:devanagari
 char-set:diacritic
 char-set:ethiopic
 char-set:extender
 char-set:georgian
 char-set:gothic
 char-set:grapheme-base
 char-set:grapheme-extend
 char-set:grapheme-link
 char-set:greek
 char-set:gujarati
 char-set:gurmukhi
 char-set:han
 char-set:hangul
 char-set:hanunoo
 char-set:hebrew
 char-set:hex-digit
 char-set:hiragana
 char-set:hyphen
 char-set:id-continue
 char-set:id-start
 char-set:ideographic
 char-set:ids-binary-operator
 char-set:ids-trinary-operator
 char-set:inherited
 char-set:join-control
 char-set:kannada
 char-set:katakana
 char-set:katakana-or-hiragana
 char-set:khmer
 char-set:lao
 char-set:latin
 char-set:limbu
 char-set:linear-b
 char-set:logical-order-exception
 char-set:lowercase
 char-set:malayalam
 char-set:math
 char-set:mongolian
 char-set:myanmar
 char-set:noncharacter-code-point
 char-set:ogham
 char-set:old-italic
 char-set:oriya
 char-set:osmanya
 char-set:quotation-mark
 char-set:radical
 char-set:runic
 char-set:shavian
 char-set:sinhala
 char-set:soft-dotted
 char-set:sterm
 char-set:syriac
 char-set:tagalog
 char-set:tagbanwa
 char-set:tai-le
 char-set:tamil
 char-set:telugu
 char-set:terminal-punctuation
 char-set:thaana
 char-set:thai
 char-set:tibetan
 char-set:ugaritic
 char-set:unified-ideograph
 char-set:uppercase
 char-set:variation-selector
 char-set:white-space
 char-set:xid-continue
 char-set:xid-start
 char-set:yi
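As a sketch of how these char-sets might be queried, the example below assumes that char-set-contains? from utf8-srfi-14 follows the SRFI-14 signature (char-set first, then character); the name lambda-char is illustrative only.

```scheme
(import utf8-srfi-14 unicode-char-sets)

;; U+03BB GREEK SMALL LETTER LAMDA, built from its codepoint so the
;; example does not depend on the source file encoding.
(define lambda-char (integer->char #x3bb))

;; A Greek letter should be a member of char-set:greek ...
(display (char-set-contains? char-set:greek lambda-char))
(newline)

;; ... while an ASCII Latin letter should not.
(display (char-set-contains? char-set:greek #\a))
(newline)
```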

Unicode case-mappings

The SRFI-13 case-mapping procedures (string-upcase, etc.) are defined using only ASCII case-mappings, since this is both useful and lighter-weight. To get full Unicode-aware case-mappings, do

 (import utf8-case-map)

which provides the utf8-string-upcase, utf8-string-downcase, and utf8-string-titlecase procedures. These take a first argument of either a string or port, and an optional second argument of locale (as a string), returning the appropriate locale-aware case-mapped string.
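A minimal sketch of the one-argument form, assuming toplevel use under CHICKEN 5; the names word and upper are illustrative only, and the optional locale argument is not exercised here.

```scheme
(import utf8 utf8-case-map)

;; "élan": an ASCII-only case mapping would leave the "é" untouched.
(define word "\u00e9lan")

;; Full Unicode case mapping, using the default locale.
(define upper (utf8-string-upcase word))

(display upper)
(newline)
```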

Byte-strings

Sometimes you may need access to the original string primitives so you can directly access bytes, such as if you were implementing your own regex library or text buffer and wanted optimal performance. For these cases you can simply import and rename or prefix the string procedures from the scheme module, like so:

 (import (rename (only scheme string-ref string-set!)
                 (string-ref byte-ref)
                 (string-set! byte-set!)))

Now, the original string operations which operate at the byte level are available as byte-ref and byte-set!.
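To make the two views concrete, the sketch below reads the same one-character string both ways. It assumes toplevel use under CHICKEN 5 and renames only string-ref; the name s is illustrative only.

```scheme
(import utf8
        (rename (only scheme string-ref)
                (string-ref byte-ref)))

;; "é" (U+00E9) is encoded in UTF-8 as the two bytes #xC3 #xA9.
(define s "\u00e9")

;; Codepoint view, via the utf8 string-ref.
(display (char->integer (string-ref s 0)))
(newline)

;; Byte view, via the renamed native string-ref.
(display (char->integer (byte-ref s 0)))
(newline)
(display (char->integer (byte-ref s 1)))
(newline)
```

The utf8 string-ref sees one character at index 0, while byte-ref sees the two bytes of its encoding at indices 0 and 1.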

For a more general and future-proof way of working with non-Unicode strings, use srfi-207 bytestrings.

Low-level API

Direct manipulation of the UTF-8 encoding is factored out into the utf8-lolevel unit. This includes an abstract string-pointer API, with an analogous string-pointer implementation for ASCII strings in the string-pointer unit. The API is not fixed, however, so use these at your own risk.

Limitations

peek-char currently does not have Unicode semantics (i.e. it peeks only a single byte) to avoid problems with port buffering.

char-sets are not interchangeable between the existing srfi-14 code and Unicode code (i.e. do not pass a Unicode char-set to an external library that directly uses the old srfi-14).

Attempting to mutate literal strings will result in an error if the mutated size does not occupy the same number of bytes as the original. This is standards compliant, since the programmer is not supposed to attempt to mutate literal values, but it may be a little confusing since the error is inconsistent.

Performance

string-length, string-ref and string-set! are all O(n) operations, as opposed to the usual O(1), since UTF-8 is a variable-width encoding. Their use is discouraged: it is much cleaner to use the high-level SRFI-13 procedures and string ports. For examples of common idioms written without these procedures, look at any string-based code in Gauche.
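One possible way to avoid repeated indexing is a single sequential pass over a string port. The sketch below assumes with-input-from-string from (chicken port) and the utf8 version of read-char; the helpers count-chars and non-ascii? are hypothetical names introduced for the example.

```scheme
(import utf8 (chicken port))

;; Count the characters in str that satisfy pred in one pass, reading
;; codepoints sequentially from a string port instead of calling
;; string-ref (which would rescan the string on every call).
(define (count-chars pred str)
  (with-input-from-string str
    (lambda ()
      (let loop ((n 0))
        (let ((c (read-char)))
          (cond ((eof-object? c) n)
                ((pred c) (loop (+ n 1)))
                (else (loop n))))))))

;; True for any character outside the 7-bit ASCII range.
(define (non-ascii? c)
  (> (char->integer c) 127))

;; "naïve café" contains two non-ASCII characters.
(display (count-chars non-ascii? "na\u00efve caf\u00e9"))
(newline)
```

A loop of this shape does O(n) work in total, whereas an index-based loop over string-ref would do O(n) work per character.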

Furthermore, string-set! and other procedures that modify strings in place may invoke gc if the mutated result does not fit within the same UTF-8 encoding size as the original string. If only mutating 7-bit ASCII strings (or only mutating within fixed encoding sizes such as Cyrillic->Cyrillic) then no gc will occur.

string?, string=?, string-append, all R5RS string comparisons, and read-line are unmodified.

Regular expression matching will be just as fast except in the case of Unicode character classes (which were not possible before anyway).

All other procedures incur zero to minor overhead, but keep the same asymptotic performance.

Discussion

There are two ways to add Unicode string support to an existing language: redefine the strings themselves (i.e. add a new string type), or redefine the operations on the strings. The former causes a schism in your string libraries, dividing them between Unicode-aware and not, either doubling your library implementations or limiting them to one type or the other. You can't freely pass strings to other libraries without keeping track of their types and converting when needed. It becomes slow and unwieldy. C and Perl are the only languages I know of that seriously tried this. In Perl, the modules which worked with Unicode strings were minimal, frequent type conversions were needed, a general mess ensued, and Perl very quickly switched to the latter approach. In C as well, the libraries supporting wchar are still minimal, while most libraries still only support char.

UTF-8 is ideal for the in-place sort of extension because it is backwards compatible with ASCII. Any ASCII (7-bit) byte found within a UTF-8 string is guaranteed to be that character, not part of a multibyte character, so parsing libraries that work on ASCII characters work unmodified. This includes most existing text formats and network protocols. The EUC (Extended Unix Code) encodings also have this feature so a similar module could be implemented allowing users to (require 'euc-jp) for example and work in Japanese EUC rather than Unicode. Other encodings such as Shift_JIS satisfy the requirement that an ASCII string has the same meaning in the encoding, but multibyte characters in the encoding may include ASCII bytes, breaking the rule we need for safe ASCII parsing. A few encodings like UTF-16 and UTF-32 are completely incompatible. UTF-16 is primarily only used these days by Java, a victim of the unfortunate fact that at first UTF-16 was fixed width but is no longer with the advent of surrogate pairs. Note that even without this module you can write source code in Chicken in any ASCII compatible encoding like ISO-8859-* or UTF-8 and define symbols with that encoding (letting you replace lambda with syntax for a real greek lambda, for example).
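The ASCII-safety property can be checked directly at the byte level. The sketch below deliberately does not import utf8, so it uses the native byte-oriented string operations of CHICKEN 5; the names path and byte-values are illustrative only.

```scheme
(import scheme)

;; λ (U+03BB, 2 bytes), "/" (1 byte), é (U+00E9, 2 bytes).
(define path "\u03bb/\u00e9")

;; Native string->list yields one char per byte, so this lists the
;; raw byte values of the UTF-8 encoding.
(define (byte-values str)
  (map char->integer (string->list str)))

(display (byte-values path)) ; prints (206 187 47 195 169)
(newline)
```

The only value below 128 is 47, the "/" itself. Every byte belonging to a multibyte character has its high bit set, which is why a byte-level parser that splits on "/" cannot cut a character in half.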

Other languages that use UTF-8 include Perl, Python, and Tcl. XML and a growing number of network standards use UTF-8 by default, and all major databases support UTF-8. Libraries with UTF-8 support include Gtk, SDL, and freetype.

Changelog

3.5.0 ; Port to CHICKEN 5
3.3.0 ;
3.2.0 ;
3.1.0 ;
3.0.0 ; Hello

License

Copyright (c) 2004-2008, Alex Shinn
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following
conditions are met:

  Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  Neither the name of the author nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS
OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
