Roadmap for core UNICODE support (historical revision 41027)

You are looking at historical revision 41027 of this page. It may differ significantly from its current revision.

Roadmap for core UNICODE support

This document describes the steps necessary for full Unicode support in the core system.

General approach

The core string type (C_STRING_TYPE) changes from a byteblock to a wrapper object that holds a separate bytevector with the actual contents of the string, in UTF-8 encoding. This indirection allows the content to grow or shrink transparently when a modification of the string changes the number of bytes needed, for example when a single-byte character is replaced by a multibyte sequence.

There is no separate byte string type, as we already have bytevectors ("blobs") that can be used to hold binary data.

This change will require a major release. Even though there will be few source-code changes required, the behaviour is much stricter when passing/returning strings to/from foreign code or when reading non-binary files.

Internal representation

C_STRING_TYPE drops the C_BYTEBLOCK bit. A string is now a block holding 4 slots:

buffer (C_BYTEVECTOR_TYPE)
count (length in Unicode codepoints, fixnum)
index (codepoint-index of last indexing operation, fixnum)
offset (byte-index of last indexing operation, fixnum)

Index/offset are used to cache the most recent indexing (ref/set/...) operation, in the hope of speeding up linear operations over strings.
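
The caching idea can be sketched in plain C as follows. This is only an illustration under stated assumptions: the `ustring` struct, `seq_len` and `ustring_offset` are hypothetical names, the real object would be a tagged block with fixnum slots rather than a C struct, and the buffer is assumed to contain valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the 4-slot wrapper described above. */
typedef struct {
    const unsigned char *buffer; /* UTF-8 bytes, assumed valid */
    size_t nbytes;               /* byte length of buffer */
    size_t count;                /* length in codepoints */
    size_t index;                /* codepoint index of last lookup */
    size_t offset;               /* byte offset of last lookup */
} ustring;

/* Byte length of the UTF-8 sequence that starts with lead byte b. */
static size_t seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Return the byte offset of codepoint i. The scan resumes from the
   cached position when i is at or past it, and restarts from the
   front of the buffer otherwise. The cache is updated on return. */
static size_t ustring_offset(ustring *s, size_t i)
{
    size_t idx = 0, off = 0;
    assert(i < s->count);
    if (i >= s->index) {
        idx = s->index;
        off = s->offset;
    }
    while (idx < i) {
        off += seq_len(s->buffer[off]);
        idx++;
    }
    s->index = idx;
    s->offset = off;
    return off;
}
```

With this scheme a left-to-right loop over a string advances by one sequence per access, while a backwards or random access pattern falls back to a scan from the start.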

Conversion from string to bytevector merely extracts and copies the buffer. Conversion from bytevector to string decodes and validates the bytes, stores the codepoint count and creates a wrapper object. Copying the buffer may be unnecessary in cases where the buffer is not referenced anywhere else.
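
A minimal sketch of the "decode and check" step of the bytevector-to-string conversion, which validates the bytes and produces the value for the count slot in one pass. The function name `utf8_check_count` is hypothetical, and the strictness shown here (rejecting overlong forms, surrogates and values above U+10FFFF) is an assumption about what the checks would cover.

```c
#include <assert.h>
#include <stddef.h>

/* Validate p[0..n) as UTF-8. Returns the number of codepoints,
   or -1 if the byte sequence is not valid. */
static long utf8_check_count(const unsigned char *p, size_t n)
{
    size_t i = 0;
    long count = 0;
    assert(p != NULL || n == 0);
    while (i < n) {
        unsigned char b = p[i];
        unsigned long cp, min;
        size_t len, k;
        if (b < 0x80)                { len = 1; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return -1;                 /* stray continuation or bad lead byte */
        if (len > n - i) return -1;     /* sequence truncated at end of buffer */
        for (k = 1; k < len; k++) {
            unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80) return -1;  /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return -1;                      /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return -1;  /* surrogate */
        if (cp > 0x10FFFF) return -1;                 /* outside Unicode range */
        i += len;
        count++;
    }
    return count;
}
```

A caller would create the wrapper object only when the result is non-negative, and signal an error otherwise.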

As mutation of strings may need to reallocate the buffer, it would be nice to reuse the scratchspace machinery currently used internally for bignums, to avoid CPS calls.
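
To illustrate why a mutation can force a reallocation, the following sketch overwrites one codepoint in a UTF-8 buffer with another of a different encoded length. It uses `realloc` purely as a stand-in for whatever allocation strategy the runtime would use (the scratchspace machinery is not modelled), and `bytebuf`, `utf8_encode` and `bytebuf_set` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer standing in for the string's bytevector. */
typedef struct {
    unsigned char *data;
    size_t len;
} bytebuf;

/* Byte length of the UTF-8 sequence that starts with lead byte b. */
static size_t seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Encode cp as UTF-8 into out and return the number of bytes written. */
static size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Overwrite the codepoint starting at byte offset off (assumed to be a
   sequence boundary) with cp. The buffer grows only when the new
   encoding is longer. Returns 0 on success, -1 if allocation fails. */
static int bytebuf_set(bytebuf *b, size_t off, unsigned long cp)
{
    unsigned char enc[4];
    size_t newlen, oldlen, tail;
    assert(off < b->len && cp <= 0x10FFFF);
    newlen = utf8_encode(cp, enc);
    oldlen = seq_len(b->data[off]);
    tail = b->len - off - oldlen;
    if (newlen > oldlen) {
        unsigned char *p = realloc(b->data, b->len + (newlen - oldlen));
        if (p == NULL) return -1;
        b->data = p;
    }
    /* Shift the bytes after the old sequence, then write the new one. */
    memmove(b->data + off + newlen, b->data + off + oldlen, tail);
    memcpy(b->data + off, enc, newlen);
    b->len = b->len - oldlen + newlen;
    return 0;
}
```

Because the wrapper object holds the buffer in a slot, replacing the buffer in this way does not change the identity of the string itself.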

Modifications needed in the core system

Implement UTF support library in C.
Change C_STRING_TYPE to be a wrapper object holding counters and buffer.
Change low-level C API to use UTF support library, where needed.
Change string-primitives to use proper low-level functionality.
Review literal-frame generation in compiler, extend to store UTF strings directly.
Provide full bytevector API (use R7RS nomenclature).
Extend file API to allow multiple encodings (start with binary and UTF-8).
Deprecate "blob" API and remove mentions in documentation, use "bytevector" term.
Add checks where byte-sequences are converted to UTF-8 to catch invalid sequences.
Extend FFI to decode returned strings properly.
Add literal syntax for bytevector strings: #$"...", including escapes.
Review compiler optimization rewrites to apply only to bytevector API.
Review scrutinizer and type system.
Check the OS API for places where it is necessary to work around file-system naming constraints (Python uses a distinct code-point region, the lone surrogates U+DC80 to U+DCFF ("dc"), to mask bytes that cannot be decoded properly).
Add runtime-option to disable decoding checks when converting byte-sequences to UTF to ease migration.
Add internal (C) hook for creating pluggable text codecs.
Add tests from utf8 egg to core test suite.
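
The byte-masking scheme mentioned for file names in the list above can be sketched as follows. It mirrors the mapping Python uses (an undecodable byte 0x80..0xFF becomes the lone surrogate U+DC80..U+DCFF), but whether the core system would adopt the same mapping is an open point, and both function names are hypothetical.

```c
#include <assert.h>

/* Map an undecodable byte (0x80..0xFF) to a lone low surrogate
   in the range U+DC80..U+DCFF. */
static unsigned long escape_byte(unsigned char b)
{
    assert(b >= 0x80);
    return 0xDC00UL + b;
}

/* If cp is an escaped byte, store the original byte in *out and
   return 1. Otherwise leave *out untouched and return 0. */
static int unescape_codepoint(unsigned long cp, unsigned char *out)
{
    if (cp < 0xDC80UL || cp > 0xDCFFUL) return 0;
    *out = (unsigned char)(cp - 0xDC00UL);
    return 1;
}
```

Note that such codepoints are surrogates, so a strict UTF-8 validity check would have to make an exception for them wherever escaped names are stored.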

Modifications needed in eggs

SRFI-13: adapt, probably reuse code from utf8 egg.
SRFI-14: adapt, probably reuse code from utf8 egg.
r7rs: Remove/reexport functionality provided by the core system.

Open questions

How to make #u8(...) syntax consistently usable? R7RS uses this for bytevectors, SRFI-4 for U8 vectors.
Measure runtime overhead.