Status of the transition to UNICODE support in the core system
This document describes the necessary steps and current status for full Unicode support in the core CHICKEN system.
General approach
The core string type (C_STRING_TYPE) changes from a byteblock to a wrapper object, holding a separate bytevector with the actual contents of the string, in UTF-8 encoding. This indirection allows growing or shrinking the contents transparently, should modifications of the string require a larger buffer for multibyte sequences (a decrease in size can be done easily by just adjusting the header size). Additional slots in the string are used for caching character-index / byte-offset information to speed up linear iteration over a string's characters.
There is no separate byte-string type, as we already have bytevectors ("blobs") that can be used to hold binary data.
This change is likely to require a major release as there are several syntactic and semantic changes involved.
Internal representation
C_STRING_TYPE drops the C_BYTEBLOCK bit. A string is now a block holding 4 slots:
- buffer (a bytevector)
- count (length in Unicode code points, fixnum)
- index (codepoint-index of last indexing operation, fixnum)
- offset (byte-index of last indexing operation, fixnum)
Index/offset are used to cache the most recent indexing operation (ref/set/...), in the hope of speeding up linear operations over strings. Both are initialized to zero when the string is created.
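To illustrate the intended effect of the cache (this is only a sketch with made-up names, not the actual runtime code): when iterating forward, finding the byte offset of character N only has to scan the buffer from the cached position instead of re-scanning from the start each time.

  #include <stddef.h>

  /* Advance "offset" past one UTF-8 character; continuation bytes have the
     form 10xxxxxx. */
  static size_t utf8_skip_char(const unsigned char *buf, size_t offset)
  {
    ++offset;
    while((buf[offset] & 0xc0) == 0x80) ++offset;
    return offset;
  }

  /* "cached_index"/"cached_offset" play the role of the index and offset
     slots: if the requested character lies at or after the cached position,
     scanning starts there instead of at the beginning of the buffer. */
  static size_t char_index_to_byte_offset(const unsigned char *buf, size_t target,
                                          size_t *cached_index, size_t *cached_offset)
  {
    size_t i = *cached_index, o = *cached_offset;

    if(target < i) i = o = 0;          /* cache is past the target: restart */

    while(i < target) {
      o = utf8_skip_char(buf, o);
      ++i;
    }

    *cached_index = i;                 /* remember the position for the next lookup */
    *cached_offset = o;
    return o;
  }

The real implementation may of course differ in detail; the point is that a forward scan over a string becomes O(1) per step instead of O(n).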
The buffer is explicitly terminated with a 0-byte.
Conversion from string to bytevector merely extracts and copies the buffer; conversion from bytevector to string decodes the contents, stores the code-point count and creates a wrapper object.
Note that mutation of a string may reallocate its buffer. Symbols hold a 0-terminated bytevector in their name slot. Passing symbols or strings to foreign code can therefore be done without copying (but still involves checking for embedded zero bytes).
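The embedded-zero-byte check amounts to something like the following sketch (illustrative only; the name and calling convention are not the actual runtime API):

  #include <stdbool.h>
  #include <string.h>

  /* The buffer is already 0-terminated, so it can be handed to C functions
     directly, unless it contains embedded zero bytes, which C code would
     misinterpret as the end of the string. "len" excludes the terminator. */
  static bool safe_to_pass_unchanged(const unsigned char *buf, size_t len)
  {
    return memchr(buf, 0, len) == NULL;
  }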
Source code is assumed to be UTF-8 encoded, as this is the default encoding used for all character I/O.
Modifications needed in the core system
- Implement UTF support library in C (DONE).
- Change C_STRING_TYPE to be a wrapper object holding counters and buffer (DONE).
- Change low-level C API to use UTF support library, where needed (DONE).
- Change string-primitives to use proper low-level functionality (DONE).
- Review literal-frame generation in compiler, extend to store UTF strings directly (DONE).
- Prepare bootstrapping compiler (DONE).
- Provide full bytevector API (use R7RS nomenclature) (DONE).
- Remove "blob" API and remove mentions in documentation, use "bytevector" term (DONE).
- Add literal syntax for bytevector strings: #u8"..." (DONE).
- Add print syntax for bytevectors: #u8(...) (DONE).
- Review compiler optimization rewrites that apply to bytevector API (DONE).
- Review scrutinizer and type system (DONE).
- Check OS APIs for places where it is necessary to work around filesystem naming constraints (Python uses a distinct code-point region ("dc") to mask bytes that cannot be encoded properly) (DONE).
- Add tests from utf8 egg to core test suite (DONE).
- Special case SRFI-4 u8vector handling to use bytevectors directly (DONE).
- Change "write-string" / "read-string!" port method to read bytevectors instead (DONE).
- Files can be opened with an encoding (DONE).
- A mechanism exists for extending the available file encodings (DONE).
- file-encoding support for tcp ports (DONE).
- file-encoding support for process ports (DONE).
- Optimize list->string, reverse-list->string, copy-port.
Other changes
As inserting and replacing characters may require enlarging the backing store of a string, it is necessary to allocate new byte buffers on the fly, even in non-CPS contexts. For this, the "scratchspace" feature of the memory management system, originally intended for bignums, is re-used; this requires removing an assertion in C_mutate_scratch_slot, as the string slot holding the byte buffer may be in the heap:
@@ -3299,7 +3297,8 @@ C_regparm C_word C_fcall C_mutate_scratch_slot(C_word *slot, C_word val)
 {
   C_word *ptr = (C_word *)val;
   assert(C_in_scratchspacep(val));
-  assert(slot == NULL || C_in_stackp((C_word)slot));
+/* XXX assert(slot == NULL || C_in_stackp((C_word)slot));
+*/
   if (*(ptr-1) == ALIGNMENT_HOLE_MARKER) --ptr;
   if (*(ptr-1) == (C_word)NULL && slot != NULL)
     C_scratch_usage += *(ptr-2) + 2;
 }
As I understand it, the assertion is not strictly necessary and so far tests seem to run fine.
The following procedures take an optional encoding-specifier argument: process, process*, open-input-file, open-output-file, open-input-file*, open-output-file*, tcp-accept, tcp-connect.
number-of-bytes returns the size of the byte buffer of strings and symbols, excluding the implicit zero terminator.
Backwards-incompatible changes
- file-read, set-pseudo-random-seed! and random-bytes require a bytevector argument.
- The "blob" module has been removed and replaced by the R7RS-compatible "chicken.bytevector" module.
- Strings and symbols passed to foreign code are not copied; they are passed directly.
- SRFI-4 u8vectors and bytevectors are interchangeable.
- String-locatives index by character position, not byte-position. Size changes due to destructive string-mutation are not detected or handled.
- "blob" read-syntax ("#${...}") has been removed.
- read-u8vector, read-u8vector! and write-u8vector have been removed, use the bytevector I/O operations from (chicken io) instead.
- Removed set-port-name!, use SRFI-17 setter instead.
- make-input-port and make-output-port take additional port methods as keyword arguments.
Modifications needed in eggs
- SRFI-13: adapt, probably reuse code from utf8 egg.
- SRFI-14: adapt, probably reuse code from utf8 egg.
- r7rs: Remove/reexport functionality provided by the core system.
- utf8: Remove.
- s11n: Needs to be adapted, as do other serialization eggs.
- srfi-207
- openssl (and other eggs with custom ports).
- Any other bytevector SRFI eggs?
- Probably more, as some eggs make assumptions about the internal string or u8vector representation.
Status
The core modifications have been done and the test suite passes all tests. The resulting system is able to compile itself (and run the test suite). Tests have so far been performed only on x86_64 Linux and OpenBSD systems.
All character classification is done in C, using code derived from http://git.suckless.org/ubase/; ctype.h is not used anymore.
New procedures: bytevector I/O (in the "chicken.bytevector" module), port-encoding, char-foldcase, string-foldcase.
Predefined encodings: utf-8, latin-1.
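For illustration, decoding a latin-1 port boils down to widening each byte into its UTF-8 form, since Latin-1 bytes correspond one-to-one to the first 256 code points (a sketch, not the actual decoder used by the runtime):

  #include <stddef.h>

  /* Decode "len" Latin-1 bytes into UTF-8; "out" needs room for up to 2*len
     bytes. Returns the number of bytes written. */
  static size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
  {
    size_t o = 0;

    for(size_t i = 0; i < len; ++i) {
      unsigned char b = in[i];

      if(b < 0x80) out[o++] = b;                /* ASCII is unchanged */
      else {                                    /* U+0080 - U+00FF: two bytes */
        out[o++] = 0xc0 | (b >> 6);             /* 0xc2 or 0xc3 */
        out[o++] = 0x80 | (b & 0x3f);
      }
    }
    return o;
  }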
How to handle invalid UTF-8 sequences
As Alex Shinn once said so eloquently: "Strings are bitches". They crop up everywhere and their encoding can only be guessed, depending on where they originate: as results of OS APIs on systems with a non-UTF-8 locale, as results of foreign code, from literal strings containing explicitly invalid code sequences and from reading binary data. Every source of bytes must be assumed to have an arbitrary encoding.
There are several strategies that can be followed to handle this. The first is to enforce a well-formed internal representation, or to have multiple representations for different encodings. This complicates type dispatching and internal logic, and requires expensive transformations at the boundaries to the outside world, such as file streams and foreign code. Multiple foreign string types would be required, along with endless safety checks for correct encoding.
The problem here is that strings cannot be passed transparently across these boundaries: if foreign code gives me a string that I do not modify, I expect the foreign code to accept the same sequence of bytes as valid input, without any conversions to and from some internal representation. If the encoding of a file is of no interest to me, I just want to be able to read its contents and write them back unchanged.
The strategy that I currently favor is to handle all string data injected into the system transparently: the actual bytes are kept unchanged, and unexpected UTF-8 bytes are decoded and marked as a code point in the U+DC80 - U+DCFF (low, trailing) UTF-16 surrogate range. Encoding converts these marked code points back to their original byte values. Errors are not signalled (with the exception of utf8->string); validation must be done manually, if desired. As I understand it, this is the approach used by the "surrogateescape" error handler in PEP 383, but here it is forced for all bytevector <-> string conversions. This places some load on code-point indexing when accessing individual characters, but the hope is that the index cache in each string reduces the overhead somewhat, while making the system resilient when handling strings with an unknown or invalid encoding.
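To make the escaping scheme concrete, here is a sketch of the byte-level mapping, following the PEP 383 convention (illustrative only, not the actual runtime code):

  #include <stdbool.h>
  #include <stdint.h>

  /* Decoding: a byte that is not part of a valid UTF-8 sequence becomes a
     code point in the reserved low-surrogate range, preserving its value. */
  static uint32_t escape_invalid_byte(uint8_t b)
  {
    return 0xdc00u + b;       /* bytes 0x80-0xff map to U+DC80-U+DCFF */
  }

  /* Encoding: code points in that range are written back as the original
     raw byte, so unmodified strings round-trip exactly. */
  static bool unescape_byte(uint32_t cp, uint8_t *byte)
  {
    if(cp >= 0xdc80u && cp <= 0xdcffu) {
      *byte = (uint8_t)(cp & 0xffu);
      return true;
    }
    return false;             /* an ordinary code point, encode as UTF-8 */
  }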
Open questions
- Expose ##sys#default-file-encoding?
- Make the default file encoding dependent on environment variables specifying the locale?
- Expose ##sys#register-encoding?
- Measure runtime overhead.
Misc
If you have suggestions or comments, please add them here or write a mail to the chicken-hackers mailing list.
Many thanks to bevuta IT GmbH for sponsoring this effort!