accents-substitute (historical revision 22241)

Description

Substitutes accented characters (Latin-1 and UTF-8) in strings with either unaccented ASCII characters or HTML entities.

The accented characters currently supported for both Latin-1 and UTF-8 are: ã, Ã, á, Á, â, Â, à, À, ä, Ä, é, É, ê, Ê, è, È, ë, Ë, í, Í, î, Î, ì, Ì, ï, Ï, õ, Õ, ó, Ó, ô, Ô, ò, Ò, ö, Ö, ú, Ú, û, Û, ù, Ù, ü, Ü, ç and Ç.

The following characters are supported in UTF-8 only: İ, ı, Ğ, ğ, Ş, ş.

Author

Mario Domenech Goulart

Requirements

None

Usage

This extension provides two modules: accents-substitute-latin1 and accents-substitute-utf8.

If you want to replace accented characters in Latin-1 strings, use:

(require-extension accents-substitute-latin1)

(accents-substitute "ação")
=> "acao"

(accents-substitute "ação" mode: 'html)
=> "a&ccedil;&atilde;o"

If you want to replace accented characters in UTF-8 strings, use:

(require-extension accents-substitute-utf8)

(accents-substitute "ação")
=> "acao"

(accents-substitute "ação" mode: 'html)
=> "a&ccedil;&atilde;o"

You can use accents-substitute from both modules in the same program by renaming the procedures when importing them.
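
One way to do the renaming is with prefix import specifiers. The following is a minimal sketch in CHICKEN 4 syntax, assuming the egg is installed; the prefixes latin1: and utf8: and the helper substitute-for are hypothetical names chosen for illustration.

```scheme
;; Assumes the accents-substitute egg is installed (CHICKEN 4 syntax).
;; The prefixes latin1: and utf8: are arbitrary illustrative choices.
(use (prefix accents-substitute-latin1 latin1:)
     (prefix accents-substitute-utf8 utf8:))

;; Hypothetical helper: pick the procedure that matches the
;; encoding of the input bytes.
(define (substitute-for encoding)
  (if (eq? encoding 'latin1)
      latin1:accents-substitute
      utf8:accents-substitute))

;; This source file is assumed to be saved as UTF-8, so the
;; string literal below holds UTF-8 bytes.
(print ((substitute-for 'utf8) "ação"))
```

A (rename ...) specifier achieves the same thing when only a single identifier needs a new name.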

Procedure

[procedure] (accents-substitute str #!key mode)

Substitutes accented characters in str with unaccented ASCII characters (if mode is not given or is 'ascii) or with HTML entities (if mode is 'html).
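
As a rough illustration of the two modes, here is a minimal sketch built on string-translate* from CHICKEN's data-structures unit. It is not the egg's implementation: the name my-accents-substitute and the two small tables are hypothetical, cover only a handful of characters, and assume that both the source file and the input are UTF-8.

```scheme
;; Hypothetical sketch, NOT the egg's actual implementation.
(use data-structures)   ; provides string-translate* (CHICKEN 4)

;; Tiny illustrative tables; the real egg covers many more characters.
(define ascii-table
  '(("ã" . "a") ("á" . "a") ("ç" . "c") ("é" . "e")))

(define html-table
  '(("ã" . "&atilde;") ("á" . "&aacute;")
    ("ç" . "&ccedil;") ("é" . "&eacute;")))

;; Any mode other than 'html (including a missing one) falls back
;; to ASCII substitution.
(define (my-accents-substitute str #!key (mode 'ascii))
  (string-translate* str (if (eq? mode 'html) html-table ascii-table)))

(print (my-accents-substitute "ação"))              ; acao
(print (my-accents-substitute "ação" mode: 'html))  ; a&ccedil;&atilde;o
```

Because string-translate* matches byte sequences, the same approach would work for Latin-1 input if the table keys were Latin-1 encoded instead.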

Example

The following is a practical command-line tool that uses accents-substitute.

Here's how to use it:

Usage: accents-substitute [ --encoding=<utf8|latin1> ] [ --mode=<ascii|html> ] [ input file ]

Default values:
   mode: ascii
   encoding: utf8

Here's the code:

#!/bin/sh
#| -*- scheme -*-
exec csi -s "$0" "$@"
|#

(use
 (rename
  accents-substitute-latin1
  (accents-substitute accents-substitute-latin1))
 (rename
  accents-substitute-utf8
  (accents-substitute accents-substitute-utf8)))

(use posix regex (srfi 1 13))

(define (command-line-argument option args)
  ;; Return the argument associated with the command line option OPTION
  ;; in ARGS, or #f if OPTION is not found in ARGS or has no argument.
  (let ((val (any (cut string-match (string-append option "=(.*)") <>) args)))
    (and val (cadr val))))

(define (usage #!optional exit-code)
  (print "Usage: " (pathname-strip-directory (program-name))
         " [ --encoding=<utf8|latin1> ] [ --mode=<ascii|html> ] [ input file ]")
  (print "\nDefault values:\n"
         "    mode: ascii\n"
         "    encoding: utf8")
  (when exit-code (exit exit-code)))

(let* ((args (command-line-arguments))
       (mode (command-line-argument "--mode" args))
       (encoding (command-line-argument "--encoding" args))
       (paramless-args (remove (cut string-prefix? "--" <>) args))
       (accents-substitute accents-substitute-utf8))

  (when (or (member "-h" args) (member "--help" args))
    (usage 0))

  (when (and encoding (not (member encoding '("utf8" "latin1"))))
    (print "'" encoding "' is not a valid encoding.")
    (exit 1))

  (when (and mode (not (member mode '("ascii" "html"))))
    (print "'" mode "' is not a valid mode.")
    (exit 1))

  (when (equal? encoding "latin1")
    (set! accents-substitute accents-substitute-latin1))

  (let ((port (if (null? paramless-args)
                  (current-input-port)
                  (open-input-file (car paramless-args)))))
    (let loop ()
      (let ((line (read-line port)))
        (unless (eof-object? line)
          (print (accents-substitute line mode: (and mode (string->symbol mode))))
          (loop))))
    (unless (null? paramless-args)
      (close-input-port port))))
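
To see how the command-line-argument helper from the script above behaves, the following sketch repeats its definition and applies it to a sample argument list. The sample-args list and the --verbose flag are hypothetical and exist only to show the case of an option without a value.

```scheme
;; CHICKEN 4 syntax; string-match comes from the regex egg,
;; any from srfi-1, and cut is built in.
(use regex srfi-1)

;; Same definition as in the script above.
(define (command-line-argument option args)
  (let ((val (any (cut string-match (string-append option "=(.*)") <>) args)))
    (and val (cadr val))))

;; Hypothetical argument list for illustration.
(define sample-args '("--mode=html" "--verbose" "input.txt"))

(print (command-line-argument "--mode" sample-args))      ; html
(print (command-line-argument "--encoding" sample-args))  ; #f (option absent)
(print (command-line-argument "--verbose" sample-args))   ; #f (no =value part)
```

Since string-match only succeeds when the whole string matches, an option given without an "=value" part is treated the same as a missing option.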

License

BSD

Version history

0.3: Added UTF-8 support for Turkish characters (İ, ı, Ğ, ğ, Ş, ş). Thanks to Mehmet Köse.
0.2: Use precompiled regexes for HTML mode (a lot faster). Added the regex requirement for compatibility with CHICKEN >= 4.6.2.
0.1: Initial release