ChaSen
Description
An interface to the ChaSen Japanese morphological analyzer library from NAIST. The ChaSen library can be obtained from its homepage at http://chasen.naist.jp/hiki/ChaSen/.
Synopsis
    > (use chasen)
    > (chasen-parse "日本語の文字列")
    ((("日本語" "ニホンゴ" "日本語" "名詞-一般")
      ("の" "ノ" "の" "助詞-連体化")
      ("文字" "モジ" "文字" "名詞-一般")
      ("列" "レツ" "列" "名詞-一般")))
    >
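The result above is a list of sentences, each a list of four-element morphemes. As a minimal sketch of how such a result might be consumed, the following works on a literal copy of the Synopsis output, so it does not need ChaSen installed. The accessor names and the `sentence-readings` helper are hypothetical and are not part of the chasen egg; the field order (inflected form, katakana, base form, part of speech) is the default format.

```scheme
;; Result copied from the Synopsis: one sentence of four morphemes, each
;; (<inflected-form> <katakana> <base-form> <part-of-speech>).
(define sample-result
  '((("日本語" "ニホンゴ" "日本語" "名詞-一般")
     ("の" "ノ" "の" "助詞-連体化")
     ("文字" "モジ" "文字" "名詞-一般")
     ("列" "レツ" "列" "名詞-一般"))))

;; Hypothetical accessors, named after the fields of the default format.
(define morpheme-surface car)
(define morpheme-reading cadr)
(define morpheme-base caddr)
(define morpheme-pos cadddr)

;; Hypothetical helper: collect the katakana readings of every sentence.
(define (sentence-readings result)
  (map (lambda (sentence) (map morpheme-reading sentence))
       result))

(display (sentence-readings sample-result))
(newline)
```

Naming the accessors once keeps the positional `car`/`cadr` lookups in one place, which helps if a different output format is configured later.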
Procedures
- (chasen-parse <string> [<keyword-options> ...])
Parses <string> into a list of sentences. Each sentence is a list of morphemes. If a multi-line format is used (the default), each morpheme is a list of morpheme data; otherwise each morpheme is a single string. The default result therefore looks like:
(((<inflected-form> <katakana> <base-form> <part-of-speech>) ...) ...)
The keyword arguments are as described below for chasen-config.
<string> should be utf-8 encoded.
- (chasen-parse/all <string> [<keyword-options> ...])
As chasen-parse, but for each sentence returns a list of all possible parsing paths when there are ambiguities.
- (chasen-parse/current <string>)
Parses <string> using the same options as the previous parse. This is faster when performing multiple parses with the same options.
- (chasen-config <keyword-options>)
Directly set options for use with chasen-parse/current. The following keywords are recognized:
- multi-line:
Sentences may be spread out over multiple lines and are delimited with Japanese punctuation markers or blank lines. The default is one sentence per line.
- cost-width: <cost>
Specify the cost width.
- rc-file: <file>
Specify the chasenrc file.
- no-rc-file:
Don't load any chasenrc file.
- default-format:
Use the default output format.
- extended-format:
Provide extended information per morpheme.
- format: <fmt-string>
Manually specify a format string. See the output of "chasen -Fh" for details.
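As a rough sketch of how chasen-config and chasen-parse/current might be combined, the following sets options once and then reuses them for several inputs. The rc-file path is a placeholder, `parse-lines` is a hypothetical helper, and passing the file name as a string is an assumption; adjust all of these to your installation.

```scheme
(use chasen)

;; Placeholder path (an assumption): substitute your own chasenrc.
(chasen-config rc-file: "/usr/local/etc/chasenrc")

;; Hypothetical helper: parse each string with the options set above,
;; without passing keyword options again on every call.
(define (parse-lines lines)
  (map chasen-parse/current lines))

(for-each
 (lambda (sentences) (print sentences))
 (parse-lines '("日本語の文字列" "古池や蛙飛込む水の音")))
```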
The following procedures also provide more direct access to the ChaSen API:
- (chasen-config/argv <list-of-strings>)
Equivalent to chasen_getopt_argv. <list-of-strings> should contain only the option strings; a program name of "chasen" is prepended automatically.
- (chasen-parse/string <string>)
Equivalent to chasen_sparse_tostr, returning the result as an unparsed string.
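The two lower-level procedures can be used together when the raw ChaSen output is wanted. The sketch below assumes that "-F" takes a format string as the following argument (as on the chasen command line) and that "%m" denotes the surface form of each morpheme; confirm both against the output of "chasen -Fh" before relying on them.

```scheme
(use chasen)

;; Assumed format: print only the surface form, one morpheme per line.
(chasen-config/argv '("-F" "%m\n"))

;; The result is a single unparsed string, exactly as ChaSen produced it.
(define raw (chasen-parse/string "日本語の文字列"))

(display raw)
```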
Example
    ;; Split a haiku into its 5-7-5 syllable phrases

    (use chasen syntax-case utf8 srfi-1)
    (import utf8)

    (define haiku-split
      (let ((non-syllables (string->list "ャュョッン、。!? \t\n")))
        (lambda (str)
          (define (take-n ls n)
            (let lp ((i 0) (ls ls) (res '()))
              (if (or (>= i n) (null? ls))
                  (values (reverse res) ls)
                  (lp (+ i (length (remove (cut memv <> non-syllables)
                                           (string->list (cadar ls)))))
                      (cdr ls)
                      (cons (car ls) res)))))
          (receive (first-5 rest) (take-n (car (chasen-parse str)) 5)
            (receive (next-7 last-5) (take-n rest 7)
              (list first-5 next-7 last-5))))))

    (for-each
     (lambda (x) (apply print (map car x)))
     (haiku-split "古池や蛙飛込む水の音"))
Requirements
Author
License
BSD
History
- 1.0
- Initial release