== Outdated egg!

This is an egg for CHICKEN 3, the unsupported old release.  You're almost certainly looking for [[/eggref/4/chasen|the CHICKEN 4 version of this egg]], if it exists.

If it does not exist, there may be equivalent functionality provided by another egg; have a look at the [[https://wiki.call-cc.org/chicken-projects/egg-index-4.html|egg index]]. Otherwise, please consider porting this egg to the current version of CHICKEN.

[[tags: egg]]

== ChaSen

=== Description

An interface to the ChaSen Japanese morphological analyzer
library from NAIST.  The ChaSen library can be obtained from its
homepage at [[http://chasen.naist.jp/hiki/ChaSen/]].

=== Download

[[http://code.call-cc.org/legacy-eggs/3/chasen.egg|chasen.egg]]

=== Synopsis

<enscript language=scheme>
> (use chasen)
> (chasen-parse "日本語の文字列")
((("日本語" "ニホンゴ" "日本語" "名詞-一般")
  ("の" "ノ" "の" "助詞-連体化")
  ("文字" "モジ" "文字" "名詞-一般")
  ("列" "レツ" "列" "名詞-一般")))
>
</enscript>

=== Procedures

* (chasen-parse <string> [<keyword-options> ...])

Parses <string> into a list of sentences.  Each sentence is a
list of morphemes.  If a multi-line output format is used (the
default), each morpheme is a list of morpheme data; otherwise each
morpheme is a single string.  The default result therefore looks
like:

(((<inflected-form> <katakana> <base-form> <part-of-speech>) ...) ...)

The keyword arguments are as described below for chasen-config.

<string> should be utf-8 encoded.
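
As a minimal sketch of working with the default result layout, the following pulls the katakana reading out of every morpheme.  To stay runnable without the egg installed it uses a literal copy of the result shown in the Synopsis; the accessor names and {{sentence-readings}} are hypothetical helpers, not part of the egg.

<enscript language=scheme>
;; Literal copy of the Synopsis result; in real use this would be the
;; value returned by (chasen-parse "日本語の文字列").
(define sample-result
  '((("日本語" "ニホンゴ" "日本語" "名詞-一般")
     ("の" "ノ" "の" "助詞-連体化")
     ("文字" "モジ" "文字" "名詞-一般")
     ("列" "レツ" "列" "名詞-一般"))))

;; Hypothetical accessors, one per field of the default morpheme layout
;; (<inflected-form> <katakana> <base-form> <part-of-speech>).
(define (morpheme-inflected m) (car m))
(define (morpheme-katakana m) (cadr m))
(define (morpheme-base m) (caddr m))
(define (morpheme-pos m) (cadddr m))

;; Collect the katakana reading of every morpheme, one list per sentence.
(define (sentence-readings result)
  (map (lambda (sentence) (map morpheme-katakana sentence))
       result))

(sentence-readings sample-result)
;; => (("ニホンゴ" "ノ" "モジ" "レツ"))
</enscript>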

* (chasen-parse/all <string> [<keyword-options> ...])

As chasen-parse, but for each sentence returns a list of all
possible parsing paths when there are ambiguities.

* (chasen-parse/current <string>)

Parses using the same options as the previous parse.  This is
faster when performing multiple parses with the same options.

* (chasen-config <keyword-options>)

Directly sets the options used by chasen-parse/current.  The
following keywords are recognized:

** multi-line:

Sentences may be spread out over multiple lines and are delimited
with Japanese punctuation markers or blank lines.  The default is
one sentence per line.

** cost-width: <cost>

Specify the cost width.

** rc-file: <file>

Specify the chasenrc file.

** no-rc-file:

Don't load any chasenrc file.

** default-format:

Use the default output format.

** extended-format:

Provide extended information per morpheme.

** format: <fmt-string>

Manually specify a format string.  See the output of "chasen -Fh"
for details.
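
The configure-once, parse-many pattern described for {{chasen-config}} and {{chasen-parse/current}} might be sketched as follows.  {{parse-all-with-rc}} is a hypothetical helper and the chasenrc path is supplied by the caller; only {{rc-file:}} is used because it is the keyword whose argument form is spelled out above.

<enscript language=scheme>
(use chasen)

;; Parse every string in strs with one shared configuration.
;; rc-path is a caller-supplied path to a chasenrc file (hypothetical;
;; no particular location is assumed here).
(define (parse-all-with-rc rc-path strs)
  (chasen-config rc-file: rc-path)    ; set the options once
  (map chasen-parse/current strs))    ; reuse them for every parse

;; Usage, assuming a chasenrc exists at the given path:
;; (parse-all-with-rc "/path/to/chasenrc" '("古池や" "蛙飛込む" "水の音"))
</enscript>

Each element of the returned list is the parse result for the corresponding input string.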

The following procedures also provide more direct access to the
ChaSen API:

* (chasen-config/argv <list-of-strings>)

Equivalent to chasen_getopt_argv.  <list-of-strings> should
contain only the option strings; a program name of "chasen" is
prepended automatically.

* (chasen-parse/string <string>)

Equivalent to chasen_sparse_tostr, returning the result as an
unparsed string.
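
Since {{chasen-parse/string}} returns unparsed text, the caller has to split it.  The sketch below assumes the raw output has one tab-separated morpheme per line with an {{EOS}} line closing each sentence; that layout, the sample string, and the helper names {{split-on}} and {{raw->morphemes}} are all assumptions for illustration, so check the actual output of your ChaSen configuration before relying on it.

<enscript language=scheme>
;; Split str at every occurrence of the character ch.
;; Tab and newline are ASCII, so this is safe on UTF-8 encoded strings.
(define (split-on ch str)
  (let lp ((chars (string->list str)) (cur '()) (res '()))
    (cond ((null? chars)
           (reverse (cons (list->string (reverse cur)) res)))
          ((char=? (car chars) ch)
           (lp (cdr chars) '() (cons (list->string (reverse cur)) res)))
          (else
           (lp (cdr chars) (cons (car chars) cur) res)))))

;; Turn raw output into a flat list of morphemes, each a list of fields.
;; Blank lines and the assumed "EOS" sentence marker are skipped.
(define (raw->morphemes raw)
  (let lp ((lines (split-on #\newline raw)) (res '()))
    (cond ((null? lines) (reverse res))
          ((or (string=? (car lines) "") (string=? (car lines) "EOS"))
           (lp (cdr lines) res))
          (else
           (lp (cdr lines) (cons (split-on #\tab (car lines)) res))))))

;; Hypothetical sample standing in for (chasen-parse/string "の"):
(raw->morphemes "の\tノ\tの\t助詞-連体化\nEOS\n")
;; => (("の" "ノ" "の" "助詞-連体化"))
</enscript>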

=== Example

<enscript language=scheme>
;; Split a haiku into its 5-7-5 syllable phrases

(use chasen syntax-case utf8 srfi-1)
(import utf8)

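(define haiku-split
  ;; Characters that are not counted as a syllable of their own here:
  ;; small kana, ッ, ン, punctuation, and whitespace.
  (let ((non-syllables (string->list "ャュョッン、。！？ 　\t\n")))
    (lambda (str)
      ;; Take morphemes from ls until at least n syllables are collected.
      ;; Returns two values: the morphemes taken and the remaining ones.
      (define (take-n ls n)
        (let lp ((i 0) (ls ls) (res '()))
          (if (or (>= i n) (null? ls))
            (values (reverse res) ls)
            ;; Count syllables in the katakana reading (second field).
            (lp (+ i (length (remove (cut memv <> non-syllables)
                                     (string->list (cadar ls)))))
                (cdr ls)
                (cons (car ls) res)))))
      ;; Only the first sentence of the parse result is used.
      (receive (first-5 rest) (take-n (car (chasen-parse str)) 5)
        (receive (next-7 last-5) (take-n rest 7)
          (list first-5 next-7 last-5))))))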

(for-each (lambda (x) (apply print (map car x)))
          (haiku-split "古池や蛙飛込む水の音"))
</enscript>

=== Requirements

[[iconv]]

=== Author

[[Alex Shinn]]

=== License

BSD

=== History

; 1.0 : Initial release
