charconv

  1. charconv
    1. Description
    2. Author
    3. Requirements
    4. Documentation
      1. Input/output procedures
      2. Utility procedures
      3. Detection procedures
      4. Automatic detection
    5. Changelog
    6. License

Description

Character encoding utilities

Author

Alex Shinn

Requirements

Documentation

This module provides a convenience layer on top of the iconv module, as well as automatic detection of character encoding schemes. It assumes that your strings are UTF-8 internally (you can use the utf8 module to give strings UTF-8 semantics as well). Given that, all you need to specify is the external encoding you are working with.

Input/output procedures

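The following are direct analogs of the corresponding R5RS file procedures (open-input-file, call-with-input-file, and so on). Each takes an additional ENC argument naming the external encoding: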

[procedure] (open-encoded-input-file FILE ENC)
[procedure] (call-with-encoded-input-file FILE ENC PROC)
[procedure] (with-input-from-encoded-file FILE ENC THUNK)
[procedure] (open-encoded-output-file FILE ENC)
[procedure] (call-with-encoded-output-file FILE ENC PROC)
[procedure] (with-output-to-encoded-file FILE ENC THUNK)

Example:

(use charconv)
(with-input-from-encoded-file "/usr/share/edict/edict" "EUC-JP" read-line)

[procedure] (read-encoded-string ENC [N [PORT]])

An analog of read-string that takes N as a byte count rather than a character count. It may read additional bytes so that the result ends on a character boundary. If you really want exactly N bytes regardless of character boundaries, combine read-string with ces-convert (described below).
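The difference between the two approaches can be sketched as follows. This is a minimal sketch, not a definitive recipe: the file name is hypothetical, the file is assumed to be encoded in EUC-JP, and PORT is assumed to be an ordinary port delivering raw bytes (not one opened with the encoded-file procedures above).

```scheme
(use charconv)

;; Hypothetical input file, assumed to be encoded in EUC-JP.
(define path "sample.euc-jp.txt")

;; Read roughly 64 bytes. The procedure may consume a few extra bytes
;; so that the returned string ends on a character boundary.
(define (read-prefix)
  (call-with-input-file path
    (lambda (port)
      (read-encoded-string "EUC-JP" 64 port))))

;; Read exactly 64 raw bytes with read-string, then convert them.
;; The last character may be cut in half by the byte limit.
(define (read-exact-prefix)
  (call-with-input-file path
    (lambda (port)
      (ces-convert (read-string 64 port) "EUC-JP" "UTF-8"))))
```

With the second form the byte count is exact, but a truncated trailing character is handed to the converter as-is. How that incomplete sequence is treated is not specified here, so the first form is usually the safer choice.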

Utility procedures

The following are copied from the Gauche API. CES stands for Character Encoding Scheme.

[procedure] (ces-equivalent? CES-A CES-B)

Returns #t if CES-A and CES-B are equivalent (aliases), #f otherwise.

[procedure] (ces-upper-compatible? CES-A CES-B)

Returns #t if a string encoded in CES-B can be considered a string in CES-A without conversion.

[procedure] (ces-convert STR FROM [TO])

Returns a new string containing STR converted from encoding FROM to encoding TO.
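As an illustration of how these three procedures fit together, here is a small sketch. The helper name ->utf8 is hypothetical and not part of the module. It passes TO explicitly rather than relying on a default.

```scheme
(use charconv)

;; Hypothetical helper: return STR, which is encoded in FROM, as UTF-8.
;; The conversion is skipped when FROM is an alias of UTF-8, or when a
;; string in FROM can already be treated as UTF-8 without conversion.
(define (->utf8 str from)
  (if (or (ces-equivalent? from "UTF-8")
          (ces-upper-compatible? "UTF-8" from))
      str
      (ces-convert str from "UTF-8")))
```

A call such as (->utf8 raw-bytes "ISO-8859-1") would then go through ces-convert, while (->utf8 raw-bytes "UTF-8") would return its argument unchanged.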

Detection procedures

[procedure] (detect-file-encoding FILE [LOCALE])
[procedure] (detect-encoding STRING [LOCALE])

The detection procedures can correctly identify most common families of encodings, such as UTF-8/16/32, EUC-*, ISO-2022-*, Shift_JIS, or single-byte, without being told the locale. However, they currently include no statistical or linguistic routines, so they cannot distinguish between EUC-JP and EUC-KR, or between any of the single-byte encodings (including ISO-8859-*). In these cases you can specify a locale: for example, when a single-byte encoding is detected, a "de" locale yields the default German single-byte encoding, ISO-8859-1.

The detect-file-encoding procedure also recognizes the Emacs-style

 -*- coding: foo -*-

signature in either of the first two lines.
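The detection procedures might be combined with ces-convert along the following lines. This is a sketch under stated assumptions: the helper names are hypothetical, and the value returned by the detection procedures is assumed to be an encoding name that can be passed directly as the FROM argument of ces-convert.

```scheme
(use charconv)

;; Hypothetical helper: guess the encoding of the raw byte string BYTES
;; (with LOCALE as a hint, e.g. "de") and return it converted to UTF-8.
(define (guess->utf8 bytes locale)
  (ces-convert bytes (detect-encoding bytes locale) "UTF-8"))

;; Hypothetical helper: the same for a whole file. The file is read as
;; raw bytes; detection is done on the file itself so that an Emacs-style
;; coding signature in its first lines can be taken into account.
(define (file->utf8 path locale)
  (let ((enc (detect-file-encoding path locale)))
    (ces-convert (with-input-from-file path read-string) enc "UTF-8")))
```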

Automatic detection

You can also use the automatic detection implicitly in the input procedures by specifying an encoding of "*" or "*<LOCALE>". For example,

(open-encoded-input-file file "*")    ; guess with no locale
(open-encoded-input-file file "*DE")  ; guess with a German locale

For compatibility with the Gauche convention, the encoding "*JP" is equivalent to "*JA", the Japanese locale.
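Automatic detection on input can be paired with an explicit encoding on output to re-encode a file. The following is a minimal sketch: the helper name and its arguments are hypothetical, and the choice of a Japanese locale hint and an EUC-JP target is only for illustration.

```scheme
(use charconv)

;; Hypothetical helper: copy SRC to DST line by line. The encoding of SRC
;; is guessed with a Japanese locale hint; DST is written as EUC-JP.
(define (transcode-to-euc-jp src dst)
  (call-with-encoded-input-file src "*JA"
    (lambda (in)
      (call-with-encoded-output-file dst "EUC-JP"
        (lambda (out)
          (let loop ((line (read-line in)))
            (unless (eof-object? line)
              (write-line line out)
              (loop (read-line in)))))))))
```

Because the copy goes through read-line and write-line, every output line ends with a newline, including the last one even if the source file lacked it.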

Changelog

License

 Copyright (c) 2004-2005, Alex Shinn
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following
 conditions are met:
 
   Redistributions of source code must retain the above copyright notice, this list of conditions and the following
     disclaimer. 
   Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following
     disclaimer in the documentation and/or other materials provided with the distribution. 
   Neither the name of the author nor the names of its contributors may be used to endorse or promote
     products derived from this software without specific prior written permission. 
 
 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS
 OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
 AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR
 CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
 SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
 OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 POSSIBILITY OF SUCH DAMAGE.