You are looking at historical revision 41032 of this page. It may differ significantly from its current revision.

Roadmap for core UNICODE support

This document describes the steps necessary for full Unicode support in the core system.

General approach

The core string type (C_STRING_TYPE) changes from a byteblock to a wrapper object that holds a separate bytevector with the actual contents of the string, in UTF-8 encoding. This indirection allows the content to grow or shrink transparently when a modification of the string requires a different buffer size, for example when a single-byte character is replaced by a multibyte sequence.
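The following is a minimal sketch of why the indirection helps, not the actual runtime code. The names `wrapped_string`, `encode_utf8`, `lead_len` and `wrapped_string_set` are hypothetical, the buffer is a plain `malloc`ed array rather than a bytevector, and the contents are assumed to be valid UTF-8. Callers keep a pointer to the wrapper, so the buffer behind it can be reallocated when a replacement character needs more bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper: callers hold a pointer to this struct, so the
 * buffer behind it can be replaced without invalidating their reference. */
typedef struct {
    uint8_t *buf;   /* heap-allocated UTF-8 contents (assumed valid) */
    size_t len;     /* number of bytes in use */
} wrapped_string;

/* Encode cp (assumed to be a valid scalar value) into out; returns byte count. */
static size_t encode_utf8(uint32_t cp, uint8_t out[4])
{
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}

/* Length in bytes of the sequence that starts with lead byte b. */
static size_t lead_len(uint8_t b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Replace the code point that starts at byte offset off with cp.
 * Returns 0 on success, -1 if the buffer could not be grown. */
static int wrapped_string_set(wrapped_string *s, size_t off, uint32_t cp)
{
    uint8_t enc[4];
    size_t new_n = encode_utf8(cp, enc);
    size_t old_n, tail;

    assert(off < s->len);
    old_n = lead_len(s->buf[off]);
    tail = s->len - off - old_n;          /* bytes after the old sequence */

    if (new_n > old_n) {
        /* Only the buffer moves; the wrapper itself stays where it is. */
        uint8_t *p = realloc(s->buf, s->len + (new_n - old_n));
        if (p == NULL) return -1;
        s->buf = p;
    }
    memmove(s->buf + off + new_n, s->buf + off + old_n, tail);
    memcpy(s->buf + off, enc, new_n);
    s->len = s->len - old_n + new_n;
    return 0;
}
```

Replacing the `b` in `abc` with U+20AC grows the contents from 3 to 5 bytes, while every reference to the wrapper remains valid.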

There is no separate byte string type, as we already have bytevectors ("blobs") that can be used to hold binary data.

This change will require a major release. Even though there will be few source-code changes required, the behaviour is much stricter when passing/returning strings to/from foreign code or when reading non-binary files.

Internal representation

C_STRING_TYPE drops the C_BYTEBLOCK bit. A string is now a block holding 4 slots:

Index/offset are used to cache the most recent indexing (ref/set/...) operation, in the hope of speeding up linear operations over strings.
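As a rough illustration of the caching idea, here is a sketch under stated assumptions. The struct `ustring` and its field names are hypothetical stand-ins and do not reflect the real slot layout, and the buffer is assumed to contain valid UTF-8. A lookup resumes from the cached position when the requested index is at or past it, so a left-to-right loop over the string scans each byte sequence only once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the wrapper object; the field names are
 * illustrative and not taken from the runtime. */
typedef struct {
    const uint8_t *buf;  /* UTF-8 contents (assumed valid) */
    size_t len;          /* length of buf in bytes */
    size_t count;        /* number of code points */
    size_t index;        /* code point index of the last lookup */
    size_t offset;       /* byte offset belonging to that index */
} ustring;

/* Length in bytes of the sequence that starts with lead byte b. */
static size_t utf8_seq_len(uint8_t b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Return the byte offset of code point i and remember it in the cache. */
static size_t ustring_offset(ustring *s, size_t i)
{
    size_t idx = 0, off = 0;

    assert(i < s->count);
    if (i >= s->index) {          /* resume from the cached position */
        idx = s->index;
        off = s->offset;
    }                             /* otherwise rescan from the start */
    while (idx < i) {
        off += utf8_seq_len(s->buf[off]);
        idx++;
    }
    s->index = i;
    s->offset = off;
    return off;
}
```

With this scheme a backwards access still costs a scan from the beginning; whether the real implementation handles that case differently is not stated here.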

Conversion from string to bytevector merely extracts and copies the buffer. Conversion from bytevector to string decodes and checks the contents, stores the code point count and creates a wrapper object. Copying the buffer may be unnecessary in some cases, namely when the buffer is not used anywhere else.
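The "decodes and checks, stores the count" step could look roughly like the sketch below. The function name `utf8_check_and_count` is hypothetical, and the exact set of checks (truncated sequences, overlong forms, surrogates, values above U+10FFFF) is an assumption based on the usual definition of well-formed UTF-8, not a statement about what the runtime will do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Validate buf as UTF-8 and count its code points.
 * Returns the count, or -1 if the input is malformed.
 * Hypothetical helper, not a runtime function. */
static long utf8_check_and_count(const uint8_t *buf, size_t len)
{
    size_t i = 0;
    long count = 0;

    assert(buf != NULL || len == 0);

    while (i < len) {
        uint8_t b = buf[i];
        size_t n;           /* number of continuation bytes expected */
        uint32_t cp, min;   /* decoded value and smallest legal value */
        size_t k;

        if (b < 0x80)                { n = 0; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
        else return -1;     /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < n) return -1;             /* truncated sequence */

        for (k = 1; k <= n; k++) {
            uint8_t c = buf[i + k];
            if ((c & 0xC0) != 0x80) return -1;      /* not a continuation */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min) return -1;                    /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return -1; /* surrogate */
        if (cp > 0x10FFFF) return -1;               /* beyond Unicode range */

        i += n + 1;
        count++;
    }
    return count;
}
```

A caller would reject the bytevector when the result is negative and otherwise store the result as the string's code point count.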

As mutation of strings may need to reallocate the buffer, it would be nice to reuse the scratchspace machinery that is currently used internally for bignums, to avoid CPS calls.

We thought about various clever encoding schemes for transparent handling of byte strings vs. Unicode strings, but in the end it seems cleaner and more in the spirit of Scheme to have distinct types with distinct primitives.

Modifications needed in the core system

Modifications needed in eggs

Open questions

Misc

If you have suggestions or comments, please add them here or write a mail to the chicken-hackers mailing list.

Many thanks to bevuta IT GmbH for sponsoring this effort!