== Outdated egg!

This is an egg for CHICKEN 3, the unsupported old release. You're almost certainly looking for [[/eggref/4/chasen|the CHICKEN 4 version of this egg]], if it exists.

If it does not exist, there may be equivalent functionality provided by another egg; have a look at the [[https://wiki.call-cc.org/chicken-projects/egg-index-4.html|egg index]]. Otherwise, please consider porting this egg to the current version of CHICKEN.

[[tags: egg]]

== ChaSen

=== Description

An interface to the ChaSen Japanese morphological analyzer library from NAIST.

The ChaSen library can be obtained from its homepage at [[http://chasen.naist.jp/hiki/ChaSen/]].

=== Download

[[http://code.call-cc.org/legacy-eggs/3/chasen.egg|chasen.egg]]

=== Synopsis

<enscript language=scheme>
> (use chasen)
> (chasen-parse "日本語の文字列")
((("日本語" "ニホンゴ" "日本語" "名詞-一般")
  ("の" "ノ" "の" "助詞-連体化")
  ("文字" "モジ" "文字" "名詞-一般")
  ("列" "レツ" "列" "名詞-一般")))
>
</enscript>

=== Procedures

* (chasen-parse <string> [<keyword-options> ...])

Parses <string> into a list of sentences. Each sentence is a list of morphemes. If a multi-line format was used (the default), each morpheme is a list of morpheme data; otherwise each morpheme is a single string. So the default result will look like:

  (((<inflected-form> <katakana> <base-form> <part-of-speech>) ...) ...)

The keyword arguments are as described below for chasen-config. <string> should be UTF-8 encoded.

* (chasen-parse/all <string> [<keyword-options> ...])

As chasen-parse, but for each sentence returns a list of all possible parsing paths when there are ambiguities.

* (chasen-parse/current <string>)

Parses using the same options as the previous parse. This is faster when performing multiple parses with the same options.

* (chasen-config <keyword-options>)

Directly sets options for use with chasen-parse/current. The following keywords are recognized:

** multi-line: Sentences may be spread out over multiple lines and are delimited with Japanese punctuation markers or blank lines. The default is one sentence per line.
** cost-width: <cost> Specify the cost width.
** rc-file: <file> Specify the chasenrc file.
** no-rc-file: Don't load any chasenrc file.
** default-format: Use the default output format.
** extended-format: Provide extended information per morpheme.
** format: <fmt-string> Manually specify a format string. See the output of "chasen -Fh" for details.

The following procedures also provide more direct access to the ChaSen API:

* (chasen-config/argv <list-of-strings>)

Equivalent to chasen_getopt_argv. <list-of-strings> should contain only the option strings; a program name of "chasen" is prefixed automatically.

* (chasen-parse/string <string>)

Equivalent to chasen_sparse_tostr, returning the result as an unparsed string.

=== Example

<enscript language=scheme>
;; Split a haiku into its 5-7-5 syllable phrases

(use chasen syntax-case utf8 srfi-1)
(import utf8)

(define haiku-split
  (let ((non-syllables (string->list "ャュョッン、。!? \t\n")))
    (lambda (str)
      ;; Take morphemes from ls until at least n syllables have been
      ;; counted in their katakana readings; return both the taken
      ;; morphemes and the remainder.
      (define (take-n ls n)
        (let lp ((i 0) (ls ls) (res '()))
          (if (or (>= i n) (null? ls))
              (values (reverse res) ls)
              (lp (+ i (length (remove (cut memv <> non-syllables)
                                       (string->list (cadar ls)))))
                  (cdr ls)
                  (cons (car ls) res)))))
      (receive (first-5 rest) (take-n (car (chasen-parse str)) 5)
        (receive (next-7 last-5) (take-n rest 7)
          (list first-5 next-7 last-5))))))

(for-each
 (lambda (x) (apply print (map car x)))
 (haiku-split "古池や蛙飛込む水の音"))
</enscript>

=== Requirements

[[iconv]]

=== Author

[[Alex Shinn]]

=== License

BSD

=== History

; 1.0 : Initial release
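
=== Working with the default result (sketch)

The default result shape described under Procedures, a list of sentences in which each morpheme is a four-element list, can be walked with ordinary list operations. The following is a minimal sketch, not part of the egg: the accessor names (morpheme-inflected and so on) and the helpers sentence-reading and result-readings are hypothetical and introduced only for illustration. The sample data is copied from the Synopsis output so that the code runs without calling ChaSen.

```scheme
;; Sample data copied from the Synopsis output above; no ChaSen call is made.
(define sample-result
  '((("日本語" "ニホンゴ" "日本語" "名詞-一般")
     ("の" "ノ" "の" "助詞-連体化")
     ("文字" "モジ" "文字" "名詞-一般")
     ("列" "レツ" "列" "名詞-一般"))))

;; Hypothetical accessors (not provided by the egg), following the documented
;; field order: inflected form, katakana, base form, part of speech.
(define (morpheme-inflected m) (car m))
(define (morpheme-katakana m) (cadr m))
(define (morpheme-base m) (caddr m))
(define (morpheme-pos m) (cadddr m))

;; Concatenate the katakana readings of every morpheme in one sentence.
(define (sentence-reading sentence)
  (apply string-append (map morpheme-katakana sentence)))

;; Produce one reading string per sentence in a parse result.
(define (result-readings result)
  (map sentence-reading result))

(result-readings sample-result) ; => ("ニホンゴノモジレツ")
```

The same pattern should apply to the value returned by chasen-parse with default options. A custom format: string or the extended-format: keyword may change the morpheme fields, in which case these accessors would need adjusting.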