Outdated egg!

This is an egg for CHICKEN 3, the unsupported old release. You're almost certainly looking for the CHICKEN 4 version of this egg, if it exists.

If it does not exist, there may be equivalent functionality provided by another egg; have a look at the egg index. Otherwise, please consider porting this egg to the current version of CHICKEN.

ChaSen

Description

An interface to the ChaSen Japanese morphological analyzer library from NAIST. The ChaSen library can be obtained from its homepage at http://chasen.naist.jp/hiki/ChaSen/.

Synopsis

> (use chasen)
> (chasen-parse "日本語の文字列")
((("日本語" "ニホンゴ" "日本語" "名詞-一般")
  ("の" "ノ" "の" "助詞-連体化")
  ("文字" "モジ" "文字" "名詞-一般")
  ("列" "レツ" "列" "名詞-一般")))
>

Procedures

(chasen-parse <string> [<keyword-options> ...])

Parses <string> into a list of sentences. Each sentence is a list of morphemes. If a multi-line format was used (the default), each morpheme is a list of morpheme data; otherwise each morpheme is a single string. The default result therefore looks like:

 (((<inflected-form> <katakana> <base-form> <part-of-speech>) ...) ...)

The keyword arguments are as described below for chasen-config.

<string> should be UTF-8 encoded.
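
As a minimal sketch of working with the default result shape, the following collects the katakana reading of every morpheme. The helper name sentence-readings is hypothetical and not part of the egg; the literal data is copied from the Synopsis so that the snippet runs without ChaSen installed.

```scheme
;; Hypothetical helper, not part of the egg: collect the katakana
;; reading (the second field) of every morpheme in one sentence.
(define (sentence-readings sentence)
  (map cadr sentence))

;; The result of (chasen-parse "日本語の文字列"), copied from the Synopsis.
(define result
  '((("日本語" "ニホンゴ" "日本語" "名詞-一般")
     ("の" "ノ" "の" "助詞-連体化")
     ("文字" "モジ" "文字" "名詞-一般")
     ("列" "レツ" "列" "名詞-一般"))))

;; One list of readings per sentence.
(define readings (map sentence-readings result))
```

The same pattern with car or caddr in place of cadr would select the inflected form or the base form instead.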

(chasen-parse/all <string> [<keyword-options> ...])

Like chasen-parse, but for each sentence returns a list of all possible parsing paths when the parse is ambiguous.

(chasen-parse/current <string>)

Parses <string> using the same options as the previous parse. This is faster when performing multiple parses with the same options.
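
One way to use this for a batch of strings is sketched below, assuming the chasen egg and the ChaSen library are installed. The name parse-batch is hypothetical; only chasen-parse and chasen-parse/current come from the egg.

```scheme
(use chasen)

;; Hypothetical helper, not part of the egg: parse every string in strs.
;; The first string goes through chasen-parse, which establishes the
;; options; the remaining strings reuse them via chasen-parse/current.
(define (parse-batch strs)
  (if (null? strs)
      '()
      ;; Bind the first result with let so that chasen-parse is
      ;; guaranteed to run before any chasen-parse/current call.
      (let ((head (chasen-parse (car strs))))
        (cons head (map chasen-parse/current (cdr strs))))))
```

Each element of the returned list has the same shape as a single chasen-parse result, that is, a list of sentences.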

(chasen-config <keyword-options>)

Directly set options for use with chasen-parse/current. The following keywords are recognized:

multi-line:

Sentences may be spread out over multiple lines and are delimited with Japanese punctuation markers or blank lines. The default is one sentence per line.

cost-width: <cost>

Specify the cost width.

rc-file: <file>

Specify the chasenrc file.

no-rc-file:

Don't load any chasenrc file.

default-format:

Use the default output format.

extended-format:

Provide extended information per morpheme.

format: <fmt-string>

Manually specify a format string. See the output of "chasen -Fh" for details.

The following procedures also provide more direct access to the ChaSen API:

(chasen-config/argv <list-of-strings>)

Equivalent to chasen_getopt_argv. <list-of-strings> should contain only the option strings; a program name of "chasen" is prepended automatically.

(chasen-parse/string <string>)

Equivalent to chasen_sparse_tostr, returning the result as an unparsed string.

Example

;; Split a haiku into its 5-7-5 syllable phrases

(use chasen syntax-case utf8 srfi-1)
(import utf8)

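(define haiku-split
  ;; Characters that are not counted as a syllable of their own:
  ;; small ャ/ュ/ョ, ッ, ン, punctuation and whitespace.
  (let ((non-syllables (string->list "ャュョッン、。！？ 　\t\n")))
    (lambda (str)
      ;; Take morphemes from ls until their katakana readings add up to
      ;; at least n syllables; return the taken morphemes and the rest.
      (define (take-n ls n)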
        (let lp ((i 0) (ls ls) (res '()))
          (if (or (>= i n) (null? ls))
            (values (reverse res) ls)
            (lp (+ i (length (remove (cut memv <> non-syllables)
                                     (string->list (cadar ls)))))
                (cdr ls)
                (cons (car ls) res)))))
      (receive (first-5 rest) (take-n (car (chasen-parse str)) 5)
        (receive (next-7 last-5) (take-n rest 7)
          (list first-5 next-7 last-5))))))

(for-each (lambda (x) (apply print (map car x)))
          (haiku-split "古池や蛙飛込む水の音"))

History

1.0: Initial release
