1. Status of the transition to UNICODE support in the core system
    1. General approach
    2. Internal representation
    3. Modifications needed in the core system
    4. Other changes
    5. Backwards-incompatible changes
    6. Modifications needed in eggs
    7. Status
    8. How to handle invalid UTF-8 sequences
    9. Open questions
    10. Misc

Status of the transition to UNICODE support in the core system

This document describes the necessary steps and the current status of full Unicode support in the core CHICKEN system.

General approach

The core string type (C_STRING_TYPE) changes from a byteblock to a wrapper object, holding a separate bytevector with the actual contents of the string, in UTF-8 encoding. This indirection allows growing or shrinking the contents transparently, should modifications of the string require increasing the buffer size for multibyte sequences (a decrease in size can be done easily by just adjusting the header size). Additional slots in the string are used for caching character-index / byte-offset information to speed up linear iteration over a string's characters.

There is no separate byte-string type, as we already have bytevectors ("blobs") that can be used to hold binary data.

This change is likely to require a major release as there are several syntactic and semantic changes involved.

Internal representation

C_STRING_TYPE drops the C_BYTEBLOCK bit. A string is now a block holding 4 slots: the character count, the bytevector holding the UTF-8 encoded contents, and a cached character index together with its corresponding byte offset.

Index and offset are used to cache the most recent indexing (ref/set/...) operation, in the hope of speeding up linear operations over strings. Both are initialized to zero on string creation.
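
As an illustration, the access pattern this cache is intended to speed up is plain left-to-right indexing, as in the sketch below (standard Scheme): each string-ref can resume decoding at the byte offset cached by the previous access instead of rescanning the UTF-8 buffer from the start.

;; Counts occurrences of a character by sequential indexing; with the
;; index/offset cache each string-ref continues from the previous
;; position, keeping the loop roughly linear despite UTF-8 decoding.
(define (count-char c s)
  (let loop ((i 0) (n 0))
    (if (= i (string-length s))
        n
        (loop (+ i 1)
              (if (char=? (string-ref s i) c) (+ n 1) n)))))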

The buffer is explicitly terminated with a 0-byte.

Conversion from string to bytevector merely extracts and copies the buffer; conversion from bytevector to string decodes the contents, stores the character count and creates a wrapper object.
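
For example (a sketch: utf8->string is mentioned further below; string->utf8 is assumed here as its R7RS counterpart):

;; "λx" consists of 2 characters encoded in 3 UTF-8 bytes.
(define bv (string->utf8 "λx"))  ; extracts and copies the buffer
(bytevector-length bv)           ; => 3
(utf8->string bv)                ; decodes, stores the count => "λx"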

As mutation of strings may reallocate the buffer, symbols hold a 0-terminated bytevector in their name slot. Passing symbols or strings to foreign code can be done without copying (but still involves checking for embedded zero bytes).
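
For instance, a call like the following sketch can hand the 0-terminated buffer to C directly; that the existing c-string argument type avoids the copy under the new representation is my assumption based on the paragraph above.

(import (chicken foreign))

;; strlen(3) receives a pointer to the string's 0-terminated buffer;
;; the runtime still checks for embedded zero bytes before the call.
(define c-strlen (foreign-lambda size_t "strlen" c-string))

(c-strlen "hello")   ; => 5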

Source code is assumed to be UTF-8 encoded, as this is the default encoding used for all character I/O.

Modifications needed in the core system

Other changes

As inserting and replacing characters may require enlarging the backing store of a string, it is necessary to allocate new byte buffers on the fly, even in non-CPS contexts. For this the "scratchspace" feature of the memory management system, originally intended for bignums, is reused, but this requires removing an assertion in C_mutate_scratch_slot, as the string slot holding the byte buffer may be in the heap:

@@ -3299,7 +3297,8 @@ C_regparm C_word C_fcall C_mutate_scratch_slot(C_word *slot, C_word val)
 {
   C_word *ptr = (C_word *)val;
   assert(C_in_scratchspacep(val));
-  assert(slot == NULL || C_in_stackp((C_word)slot));
+/* XXX  assert(slot == NULL || C_in_stackp((C_word)slot));
+*/
   if (*(ptr-1) == ALIGNMENT_HOLE_MARKER) --ptr;
   if (*(ptr-1) == (C_word)NULL && slot != NULL)
     C_scratch_usage += *(ptr-2) + 2;
}

As I understand it, the assertion is not strictly necessary and so far tests seem to run fine.

The following procedures take an optional encoding specifier argument: process, process*, open-input-file, open-output-file, open-input-file*, open-output-file*, tcp-accept and tcp-connect.
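
The exact calling convention is not shown here; assuming the encoding specifier is passed as a trailing symbol naming one of the predefined encodings (see below), calls might look like this sketch:

;; Hypothetical call forms, assuming a symbol encoding specifier:
(define in  (open-input-file "data.txt" 'latin-1))
(define out (open-output-file "copy.txt" 'utf-8))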

number-of-bytes returns the size of the byte buffer of strings and symbols, excluding the implicit zero terminator.
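
Character and byte counts therefore differ for non-ASCII contents, for example:

(string-length "aλb")     ; => 3 (characters; "λ" occupies two bytes)
(number-of-bytes "aλb")   ; => 4 (bytes, excluding the zero terminator)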

Backwards-incompatible changes

Modifications needed in eggs

Status

The core modifications have been done and the test suite passes all tests. The resulting system is able to compile itself (and run the test suite). Tests have so far been performed only on x86_64 Linux and OpenBSD systems.

All character classification is done in C, using code derived from http://git.suckless.org/ubase/; ctype.h is not used anymore.

New procedures: bytevector I/O (in the "chicken.bytevector" module), port-encoding, char-foldcase, string-foldcase.
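
The folding procedures perform Unicode case folding instead of being limited to ASCII; the results below are what Unicode simple case folding prescribes:

(char-foldcase #\Λ)         ; => #\λ
(string-foldcase "ΛΌΓΟΣ")   ; => "λόγοσ"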

Predefined encodings: utf-8, latin-1.
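
port-encoding presumably reports a port's encoding as one of these names; the return convention in this sketch is an assumption:

(port-encoding (current-input-port))   ; => utf-8 (assumed: a symbol)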

The code for the current state can be found in the utf branch in the CHICKEN git(1) repository. You will need to build a version from the utf-bootstrap branch first, as the way literal strings and symbols are encoded in compiled code has changed due to the new string data representation. After you build a (possibly static) chicken executable, you can then build the code from the utf branch using that compiler:

git checkout utf-bootstrap
gmake PLATFORM=<platform> STATICBUILD=1 chicken
mv chicken chicken-utf-bootstrap
git checkout utf
touch *.scm
gmake PLATFORM=<platform> PREFIX=<prefix> CHICKEN=./chicken-utf-bootstrap

How to handle invalid UTF-8 sequences

As Alex Shinn once said so eloquently: "Strings are bitches". They crop up everywhere and their encoding can only be guessed, depending on where they originate: as results of OS APIs on systems with a non-UTF-8 locale, as results of foreign code, from literal strings containing explicitly invalid code sequences and from reading binary data. Every source of bytes must be assumed to have an arbitrary encoding.

There are several strategies that can be followed to handle this. The first is to enforce a well-formed internal representation, or to have multiple representations for different encodings. This complicates type dispatching and internal logic, and requires expensive transformations at the boundaries to the outside world, like file streams and foreign code. Multiple foreign string types would be required, along with endless safety checks for correct encoding.

The problem here is that strings cannot be passed transparently across these boundaries: if foreign code gives me a string that I do not modify, I expect it to accept the same sequence of bytes as valid input, without any conversions to and from some internal representation. If the contents of a file are of no importance to me, I just want to read the data and write it back unchanged, regardless of its encoding.

The strategy that I currently favor is to handle all string data injected into the system transparently: the actual bytes are kept unchanged, and unexpected UTF-8 bytes are decoded and marked as code points in the range U+DC80 - U+DCFF (low/trailing UTF-16 surrogate halves). Encoding converts these marked code points back to their original byte values. Errors are not signalled (with the exception of utf8->string); validation must be done manually, if desired. As I understand it, this is the approach used by the "surrogateescape" error handler in PEP 383, but we force it for all bytevector <-> string conversions. This places some load on code-point indexing when accessing single characters, but the hope is that the index cache in each string somewhat reduces the overhead, while making the system resilient in handling strings with an unknown or invalid encoding.
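
A sketch of the intended round trip; bytevector->string and string->bytevector are hypothetical names for the conversions described above, while utf8->string is the one conversion that signals an error:

(define bv (bytevector #x41 #xFF))   ; "A" followed by a byte that is
                                     ; invalid in any UTF-8 sequence
(define s (bytevector->string bv))   ; no error: #xFF is marked as U+DCFF
(string-ref s 1)                     ; => #\xDCFF
(string->bytevector s)               ; => #u8(#x41 #xFF), bytes restored
;; (utf8->string bv) would signal a decoding error instead.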

Open questions

Misc

If you have suggestions or comments, please add them here or write a mail to the chicken-hackers mailing list.

Many thanks to bevuta IT GmbH for sponsoring this effort!