ChaSen

Description

An interface to the ChaSen Japanese morphological analyzer library from NAIST. The ChaSen library can be obtained from its homepage at http://chasen.naist.jp/hiki/ChaSen/.

Download

chasen.egg

Synopsis

> (use chasen)
> (chasen-parse "日本語の文字列")
((("日本語" "ニホンゴ" "日本語" "名詞-一般")
  ("の" "ノ" "の" "助詞-連体化")
  ("文字" "モジ" "文字" "名詞-一般")
  ("列" "レツ" "列" "名詞-一般")))
>

Procedures

Parses <string> into a list of sentences, where each sentence is a list of morphemes. If a multi-line format is used (the default), each morpheme is a list of morpheme data; otherwise each morpheme is a single string. The default result therefore looks like:

 (((<inflected-form> <katakana> <base-form> <part-of-speech>) ...) ...)
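Given the default result shape above, a morpheme can be taken apart with ordinary list operations. The following is a minimal sketch, not part of the egg: the accessor and helper names are hypothetical, and the sample data is the Synopsis result written out literally so the code runs without ChaSen installed.

```scheme
;; Result of (chasen-parse "日本語の文字列") as shown in the Synopsis,
;; written out literally so this sketch does not need the chasen egg.
(define sample-result
  '((("日本語" "ニホンゴ" "日本語" "名詞-一般")
     ("の" "ノ" "の" "助詞-連体化")
     ("文字" "モジ" "文字" "名詞-一般")
     ("列" "レツ" "列" "名詞-一般"))))

;; Hypothetical accessors for the default four-field morpheme layout:
;; (<inflected-form> <katakana> <base-form> <part-of-speech>)
(define (morpheme-surface m) (car m))
(define (morpheme-reading m) (cadr m))
(define (morpheme-base m) (caddr m))
(define (morpheme-pos m) (cadddr m))

;; Katakana reading of one sentence: concatenate every morpheme's reading.
(define (sentence-reading sentence)
  (apply string-append (map morpheme-reading sentence)))

;; Keep only the morphemes whose part-of-speech string equals pos.
(define (morphemes-with-pos sentence pos)
  (let loop ((ms sentence) (acc '()))
    (cond ((null? ms) (reverse acc))
          ((string=? (morpheme-pos (car ms)) pos)
           (loop (cdr ms) (cons (car ms) acc)))
          (else (loop (cdr ms) acc)))))
```

For the sample data, (sentence-reading (car sample-result)) evaluates to "ニホンゴノモジレツ", and (map morpheme-surface (morphemes-with-pos (car sample-result) "名詞-一般")) evaluates to ("日本語" "文字" "列").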

The keyword arguments are the same as those described below for chasen-config.

<string> should be UTF-8 encoded.

Like chasen-parse, but for each sentence returns a list of all possible parsing paths when the analysis is ambiguous.

Parses using the same options as the previous parse. This is faster when performing multiple parses with the same options.

Directly sets the options used by chasen-parse/current. The following keywords are recognized:

Sentences may be spread out over multiple lines and are delimited with Japanese punctuation markers or blank lines. The default is one sentence per line.

Specify the cost width.

Specify the chasenrc file.

Don't load any chasenrc file.

Use the default output format.

Provide extended information per morpheme.

Manually specify a format string. See the output of "chasen -Fh" for details.

The following procedures also provide more direct access to the ChaSen API:

Equivalent to chasen_getopt_argv. <list-of-strings> should contain only the option strings; a program name of "chasen" is prepended automatically.

Equivalent to chasen_sparse_tostr, returning the result as an unparsed string.
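When working with the unparsed string, the caller has to split it into sentences and fields. The sketch below shows one way to do that in portable Scheme. It assumes a tab-separated output format with one morpheme per line and a line reading "EOS" closing each sentence; the actual layout depends on the format options in effect. The helper names and the sample string are hypothetical and do not come from the egg.

```scheme
;; Split str at every occurrence of the character delim, keeping empty fields.
(define (split-on delim str)
  (let loop ((chars (string->list str)) (field '()) (fields '()))
    (cond ((null? chars)
           (reverse (cons (list->string (reverse field)) fields)))
          ((char=? (car chars) delim)
           (loop (cdr chars) '() (cons (list->string (reverse field)) fields)))
          (else
           (loop (cdr chars) (cons (car chars) field) fields)))))

;; Group raw output lines into sentences. An "EOS" line closes the current
;; sentence (assumed convention); blank lines are skipped.
(define (raw->sentences raw)
  (let loop ((lines (split-on #\newline raw)) (current '()) (sentences '()))
    (cond ((null? lines)
           (reverse (if (null? current)
                        sentences
                        (cons (reverse current) sentences))))
          ((string=? (car lines) "EOS")
           (loop (cdr lines) '() (cons (reverse current) sentences)))
          ((string=? (car lines) "")
           (loop (cdr lines) current sentences))
          (else
           (loop (cdr lines)
                 (cons (split-on #\tab (car lines)) current)
                 sentences)))))

;; Hypothetical raw output, for illustration only.
(define sample-raw
  "文字\tモジ\t文字\t名詞-一般\n列\tレツ\t列\t名詞-一般\nEOS\n")
```

With this sample, (raw->sentences sample-raw) yields one sentence of two four-field morphemes, the same nesting that chasen-parse returns by default.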

Example

;; Split a haiku into its 5-7-5 syllable phrases

(use chasen syntax-case utf8 srfi-1)
(import utf8)

(define haiku-split
  (let ((non-syllables (string->list "ャュョッン、。!?  \t\n")))
    (lambda (str)
      (define (take-n ls n)
        (let lp ((i 0) (ls ls) (res '()))
          (if (or (>= i n) (null? ls))
            (values (reverse res) ls)
            (lp (+ i (length (remove (cut memv <> non-syllables)
                                     (string->list (cadar ls)))))
                (cdr ls)
                (cons (car ls) res)))))
      (receive (first-5 rest) (take-n (car (chasen-parse str)) 5)
        (receive (next-7 last-5) (take-n rest 7)
          (list first-5 next-7 last-5))))))

(for-each (lambda (x) (apply print (map car x)))
          (haiku-split "古池や蛙飛込む水の音"))

Requirements

iconv

Author

Alex Shinn

License

BSD

History

1.0
Initial release