== levenshtein

Levenshtein edit distance

[[toc:]]

== Documentation

Levenshtein is a collection of procedures providing various forms of the Levenshtein edit distance calculation.

The Levenshtein edit distance has been used in areas as diverse as soil sample and language dialect analysis. It is not just for text strings.

=== 8-bit Values Only

Performs edit distance calculation for byte strings & blobs. All procedures return the total edit cost.

==== levenshtein-distance/byte

===== Usage

<enscript language=scheme>
(import levenshtein-byte)
</enscript>

<procedure>(levenshtein-distance/byte SOURCE TARGET)</procedure>

Calculates the edit distance from the {{SOURCE}} to the {{TARGET}}. All costs are unitary.

==== levenshtein-distance/transpose-byte

===== Usage

<enscript language=scheme>
(import levenshtein-transpose-byte)
</enscript>

<procedure>(levenshtein-distance/transpose-byte SOURCE TARGET)</procedure>

Calculates the edit distance from the {{SOURCE}} to the {{TARGET}}, taking into account the Transpose operation. All costs are unitary.

The Transpose operation does not change the size of a string, so the total edit cost is still at least the difference of the sizes of the two strings.

=== Parameterized Edit Distance

A functor implementing an edit distance algorithm parameterized by cost and sequence operation modules.

Performs edit distance calculation for sequences.

==== Usage

<enscript language=scheme>
(import levenshtein-sequence-functor)
</enscript>

==== levenshtein-distance/sequence

<procedure>(levenshtein-distance/sequence SOURCE TARGET [#:insert-cost INSERT-COST] [#:delete-cost DELETE-COST] [#:substitute-cost SUBSTITUTE-COST] [#:get-work-vector GET-WORK-VECTOR] [#:elm-eql ELM-EQL] [#:limit-cost LIMIT-COST]) -> (or false cost)</procedure>

; {{SOURCE}} : {{string}}.
; {{TARGET}} : {{string}}.
; {{INSERT-COST}} : {{number}}, default {{1}}.
; {{DELETE-COST}} : {{number}}, default {{1}}.
; {{SUBSTITUTE-COST}} : {{number}}, default {{1}}.
; {{ELM-EQL}} : {{procedure}}; {{(-> object object boolean)}}, default {{eqv?}}. The equality predicate.
; {{GET-WORK-VECTOR}} : {{procedure}}, default {{make-vector}}.
; {{LIMIT-COST}} : {{number}} or {{#f}}, default {{#f}}. Quit when cost over limit & return {{#f}}.

The {{SOURCE}} & {{TARGET}} must be the same type, which the instantiating sequence module supports.

'''Note''' that the element comparison procedure is passed via the argument list, and not via the sequence implementation module. This is annoying when using strings but useful when using vectors.

=== Parameterized Edit Distance with Trailing

A functor implementing an edit distance algorithm parameterized by a cost operation module.

Performs edit distance calculation for vectors. Allows definition of new edit operations. Will keep track of the edit operations performed; the {{path-matrix}}.

Primarily a toy.

==== Usage

<enscript language=scheme>
(import levenshtein-vector-functor)
</enscript>

==== levenshtein-distance/vector*

<procedure>(levenshtein-distance/vector* SOURCE TARGET [EDIT-OPER ...] [#:elm-eql ELM-EQL] [#:operations OPERATIONS] [#:limit-cost LIMIT-COST]) -> (or false cost) path-matrix</procedure>

Calculates the edit distance from the source vector {{SOURCE}} to the target vector {{TARGET}}. Returns the total edit cost or (values <total edit cost> <performed operations matrix>).

; {{SOURCE}} : {{vector}}.
; {{TARGET}} : {{vector}}.
; {{EDIT-OPER}} : {{levenshtein-operator}}. Edit operation definitions to apply. Defaults are the basic Insert, Delete, and Substitute.
; {{ELM-EQL}} : {{procedure}}; {{(-> object object boolean)}}, default {{char=?}}. The equality predicate.
; {{OPERATIONS}} : {{boolean}}. Include the matrix of edit operations performed? Default {{#f}}.
; {{LIMIT-COST}} : {{(or number false)}}, default {{#f}}. Quit when cost over limit & return {{#f}}.
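
The cost, equality, and limit keywords documented above can be pictured with a small self-contained sketch. It is not the egg's implementation: {{sketch-edit-distance}} and {{chars}} are hypothetical names, the algorithm is the plain two-row dynamic-programming form, costs are assumed to be non-negative, and the limit handling only mirrors the documented "quit when cost over limit & return {{#f}}" behaviour.

<enscript language=scheme>
;; Hypothetical sketch, NOT the egg's implementation: a two-row
;; dynamic-programming edit distance over vectors.
(define (sketch-edit-distance source target
                              #!key (insert-cost 1) (delete-cost 1)
                              (substitute-cost 1) (elm-eql eqv?)
                              (limit-cost #f))
  (let* ((m (vector-length source))
         (n (vector-length target))
         (row0 (make-vector (+ n 1) 0))
         (row1 (make-vector (+ n 1) 0)))
    ;; Row 0: reach the first j target elements by insertions only.
    (do ((j 1 (+ j 1))) ((> j n))
      (vector-set! row0 j (* j insert-cost)))
    (let loop ((i 1) (prev row0) (curr row1))
      (cond
        ((> i m)
         ;; prev now holds the final row; honor the limit.
         (let ((cost (vector-ref prev n)))
           (and (or (not limit-cost) (<= cost limit-cost)) cost)))
        (else
         ;; Column 0: drop the first i source elements by deletions only.
         (vector-set! curr 0 (* i delete-cost))
         (do ((j 1 (+ j 1))) ((> j n))
           (vector-set! curr j
             (min (+ (vector-ref prev j) delete-cost)
                  (+ (vector-ref curr (- j 1)) insert-cost)
                  (+ (vector-ref prev (- j 1))
                     (if (elm-eql (vector-ref source (- i 1))
                                  (vector-ref target (- j 1)))
                         0
                         substitute-cost)))))
         ;; With non-negative costs the row minimum never decreases,
         ;; so once it exceeds the limit the answer must be #f.
         (if (and limit-cost
                  (> (apply min (vector->list curr)) limit-cost))
             #f
             (loop (+ i 1) curr prev)))))))

;; Hypothetical helper: view a string as a vector of characters.
(define (chars s) (list->vector (string->list s)))

(sketch-edit-distance (chars "ctas") (chars "cats") #:elm-eql char=?)
;=> 2
(sketch-edit-distance (chars "ctas") (chars "cats")
                      #:elm-eql char=? #:limit-cost 1)
;=> #f
</enscript>

Passing the equality predicate as a keyword, as the sketch does, is what lets the same code compare characters, numbers, or any other vector elements.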

=== Interface Implementation Modules

==== Cost Implementation Modules

For use with {{levenshtein-distance/sequence}} & {{levenshtein-distance/vector*}}.

* levenshtein-cost-fixnum
* levenshtein-cost-generic
* levenshtein-cost-numbers

==== Sequence Implementation Modules

For use with {{levenshtein-distance/sequence}}.

* levenshtein-sequence-string
* levenshtein-sequence-utf8
* levenshtein-sequence-vector

=== Edit Operators

Edit operation specification.

A set of base operations is predefined, but may be overridden. The base set is identified by the keys Insert, Delete, Substitute, and Transpose.

A printer and reader are provided for edit operations.

==== Usage

<enscript language=scheme>
(import levenshtein-operators)
</enscript>

==== levenshtein-operator

<record>levenshtein-operator</record>
<procedure>(levenshtein-operator-key OPER) -> {{symbol}}</procedure>
<procedure>(levenshtein-operator-name OPER) -> {{string}}</procedure>
<procedure>(levenshtein-operator-cost OPER) -> {{number}}</procedure>
<procedure>(levenshtein-operator-above OPER) -> {{fixnum}}</procedure>
<procedure>(levenshtein-operator-left OPER) -> {{fixnum}}</procedure>

==== make-levenshtein-operator

<procedure>(make-levenshtein-operator KEY NAME COST ABOVE LEFT)</procedure>

Returns a new edit operator.

; {{KEY}} : {{symbol}}. Key for the operation.
; {{NAME}} : {{string}}. Describes the operation.
; {{COST}} : {{number}}. The cost of the operation.
; {{ABOVE}} : {{non-negative-fixnum}}. How far back in the source.
; {{LEFT}} : {{non-negative-fixnum}}. How far back in the target.

==== levenshtein-operator?

<procedure>(levenshtein-operator? OBJECT)</procedure>

Is the {{OBJECT}} a levenshtein operator?

==== clone-levenshtein-operator

<procedure>(clone-levenshtein-operator EDIT-OPERATION [#:key KEY] [#:name NAME] [#:cost COST] [#:above ABOVE] [#:left LEFT])</procedure>

Returns a duplicate of the {{EDIT-OPERATION}}, with field values provided by the optional keyword arguments.
{{EDIT-OPERATION}} may be the key of an already defined edit operation.

==== levenshtein-operator-ref

<procedure>(levenshtein-operator-ref KEY)</procedure>

Get the definition of an edit operation.

==== levenshtein-operator-set!

<procedure>(levenshtein-operator-set! EDIT-OPERATION)</procedure>

Define an edit operation.

==== levenshtein-operator-delete!

<procedure>(levenshtein-operator-delete! EDIT-OPERATION)</procedure>

Removes the {{EDIT-OPERATION}} definition.

{{EDIT-OPERATION}} may be the {{KEY}} of an already defined edit operation.

==== levenshtein-operator-reset

<procedure>(levenshtein-operator-reset)</procedure>

Restore defined edit operations to the base set.

==== levenshtein-operator=?

<procedure>(levenshtein-operator=? A B)</procedure>

Are the {{levenshtein-operator}} {{A}} & {{levenshtein-operator}} {{B}} equal for all fields?

==== levenshtein-insert-operator?
==== levenshtein-delete-operator?
==== levenshtein-substitute-operator?
==== levenshtein-required-operator?

<procedure>(levenshtein-insert-operator? OBJECT)</procedure>
<procedure>(levenshtein-delete-operator? OBJECT)</procedure>
<procedure>(levenshtein-substitute-operator? OBJECT)</procedure>
<procedure>(levenshtein-required-operator? OBJECT)</procedure>

==== levenshtein-required-operators

<procedure>(levenshtein-required-operators) -> (vector-of levenshtein-operator)</procedure>

==== levenshtein-base-operators

<procedure>(levenshtein-base-operators) -> (vector-of levenshtein-operator)</procedure>

==== levenshtein-extended-operators

<procedure>(levenshtein-extended-operators OPERLIST) -> (vector-of levenshtein-operator)</procedure>

; {{OPERLIST}} : {{(list-of levenshtein-operator)}}. Operations to add to the required operations.

=== Path Iterator

==== Usage

<enscript language=scheme>
(import levenshtein-path-iterator)
</enscript>

==== levenshtein-path-iterator

<procedure>(levenshtein-path-iterator PATH-MATRIX) -> (-> (or false list))</procedure>

Creates an optimal edit distance operation path iterator over the performed operations matrix {{PATH-MATRIX}}. The matrix is usually the result of an invocation of {{(levenshtein-distance/vector* ... operations: #t)}}.

Each invocation of the iterator will generate a list of the form: {{((cost source-index target-index levenshtein-operator) ...)}}. The last invocation will return {{#f}}.

=== Path Matrix Print

==== Usage

<enscript language=scheme>
(import levenshtein-print)
</enscript>

==== print-levenshtein-matrix

<procedure>(print-levenshtein-matrix PATH-MATRIX)</procedure>

Displays a readable representation of the {{PATH-MATRIX}} on the {{current-output-port}}.

== Bugs & Limitations

* The functor implementation modules are supposed to be installed in Chicken Home but until CHICKEN 5.4 (I hope) this must be done manually from the retrieved egg source:

<enscript language=bash>
C5_CACHE="$HOME/.cache/chicken-install/levenshtein"
C5_HOME=$(csi -n -R chicken.platform -p '(chicken-home)')
# Keep the globs outside the quotes so the shell expands them.
cp "$C5_CACHE"/levenshtein-cost-*.scm \
   "$C5_CACHE"/levenshtein-sequence-*.scm "$C5_HOME"
</enscript>

* {{levenshtein-print}} assumes a {{levenshtein-operator-key}} print-name is <= 15 characters and that the cost prints in <= 2 characters.
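
A custom operator can be checked against the print limitation above before its matrix is printed. The following is only a sketch: {{fits-print-width?}} is a hypothetical helper, not part of the egg, and it assumes the print-name is what {{symbol->string}} yields and that the cost prints as {{number->string}} renders it.

<enscript language=scheme>
;; Hypothetical helper, not part of the levenshtein egg.
;; KEY is an operator key symbol, COST is its numeric cost.
;; Assumes the widths documented for levenshtein-print: 15 and 2.
(define (fits-print-width? key cost)
  (and (<= (string-length (symbol->string key)) 15)
       (<= (string-length (number->string cost)) 2)))

(fits-print-width? 'Substitute 1)            ;=> #t
(fits-print-width? 'Transpose 100)           ;=> #f, cost needs 3 characters
(fits-print-width? 'SwapAdjacentElements 1)  ;=> #f, key has 20 characters
</enscript>

With the egg loaded, the arguments would come from {{(levenshtein-operator-key OPER)}} and {{(levenshtein-operator-cost OPER)}}.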

== Examples

* Byte Sequence (string or blob) Only

<enscript language=scheme>
(import levenshtein-byte)

(levenshtein-distance/byte "ctas" "cats")
;=> 2
</enscript>

<enscript language=scheme>
(import levenshtein-transpose-byte)

(levenshtein-distance/transpose-byte "ctas" "cats")
;=> 1
</enscript>

* Generics using Functors (assume available below)

<enscript language=scheme>
;until R⁷RS
(define (string->vector s) (list->vector (string->list s)))
</enscript>

* Generic Sequence & Cost

<enscript language=scheme>
(import levenshtein-sequence-functor levenshtein-cost-fixnum levenshtein-sequence-vector)

(module levenshtein-sequence-fixnum-vector = (levenshtein-sequence-functor levenshtein-cost-fixnum levenshtein-sequence-vector))

(import (prefix levenshtein-sequence-fixnum-vector vsfx:))

(vsfx:levenshtein-distance/sequence
  (string->vector "ctas") (string->vector "cats")
  #:elm-eql char=?)
;=> 2
</enscript>

* Generic Cost (vector sequence only) w/ Additional Edit Operations

<enscript language=scheme>
(import levenshtein-operators)
(import levenshtein-vector-functor levenshtein-cost-fixnum)

(module levenshtein-vector-fixnum = (levenshtein-vector-functor levenshtein-cost-fixnum))

(import (prefix levenshtein-vector-fixnum vcfx:))

(vcfx:levenshtein-distance/vector*
  (string->vector "ctas") (string->vector "cats")
  #:elm-eql char=?
  ;NOTE if supplying any additional operations
  ;then must supply all operations beyond those
  ;required (insert & delete). While substitute
  ;is a base operation, it is not necessary.
  (levenshtein-operator-ref 'Substitute)
  (levenshtein-operator-ref 'Transpose))
;=> 1
;=> #f
;=> ; 2 values
</enscript>

* Edit Path Printing

<enscript language=scheme>
(import levenshtein-print) ;only useful for the path-matrix
(import levenshtein-vector-functor levenshtein-cost-fixnum)

(module levenshtein-vector-fixnum = (levenshtein-vector-functor levenshtein-cost-fixnum))

(import (prefix levenshtein-vector-fixnum fixn:))

;NOTE default #:elm-eql keyword parameter is char=?
(let-values (((cost pm)
              (fixn:levenshtein-distance/vector*
                (string->vector "ctas") (string->vector "cats")
                operations: #t)) )
  (print cost)
  (print-levenshtein-matrix pm) )
;=> 2
;=> ( 0 Substitute) ( 1 Delete) ( 2 Delete) ( 3 Delete)
;=> ( 1 Insert) ( 1 Substitute) ( 1 Substitute) ( 2 Delete)
;=> ( 2 Insert) ( 1 Substitute) ( 2 Substitute) ( 2 Substitute)
;=> ( 3 Insert) ( 2 Insert) ( 2 Substitute) ( 2 Substitute)
</enscript>

* Path Iteration

<enscript language=scheme>
; Instantiate the distance measure algorithm
(import levenshtein-path-iterator)
(import levenshtein-vector-functor levenshtein-cost-fixnum)

(module levenshtein-vector-fixnum = (levenshtein-vector-functor levenshtein-cost-fixnum))

(import (prefix levenshtein-vector-fixnum vcfx:))

(define iter
  (levenshtein-path-iterator
    (vcfx:levenshtein-distance/vector*
      (string->vector "YWCQPGK") (string->vector "LAWYQQKPGKA")
      operations: #t)))

; ignoring interpreter feedback & we know the distance is 6
(define r0 (iter))
(define t0 r0)
(define r1 (iter))
(define r2 (iter))
(define r3 (iter))
(define r4 (iter))
(define r5 (iter))
(iter)
; r0 now has #f, since the iterator finishes by returning to the initial caller,
; which is the body of '(define r0 (iter))', thus re-binding r0. However, t0 has
; the original returned value.
</enscript>

== Requirements

[[check-errors]]
[[vector-lib]]
[[srfi-1]]
[[srfi-13]]
[[srfi-63]]
[[srfi-69]]
[[utf8]]
[[miscmacros]]
[[record-variants]]

[[test]]
[[test-utils]]

== Author

[[/users/kon-lovett|Kon Lovett]]

== Version history

; 2.4.2 : Fix {{levenshtein-operator}} printing.
; 2.4.1 : Correctly type cost interface ''constants''. {{sequence-string}} is {{sequence-utf8}}.
; 2.4.0 : Operations extend required insert & delete. Add {{levenshtein-extended-operators}} & others.
; 2.3.0 : Imported implementation modules. {{levenshtein-sequence-string}} is {{levenshtein-sequence-utf8}}.
; 2.2.6 : Deactivate functors tests (see Bugs & Limitations).
; 2.2.7 : Activate functors tests.
; 2.2.6 : Fix {{total-cost}} result.
; 2.2.5 : .
; 2.2.4 : .
; 2.2.3 : .
; 2.2.2 : Fix {{levenshtein-distance/vector*}} type. Add tests.
; 2.2.1 : Fix {{levenshtein-sequence-vector}}.
; 2.2.0 : Use {{record-variants}}.
; 2.1.3 : Fix {{levenshtein-path-iterator}}.
; 2.1.2 : .
; 2.0.0 : Chicken 5 release.
; 1.0.3 : Added types. Re-flow.
; 1.0.2 : Added an "egg tag".
; 1.0.1 : Drop "format-compiler".
; 1.0.0 : Chicken 4 release.

== License

Copyright (c) 2012-2024, Kon Lovett. All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the Software), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.