SRFI-130: Cursor-based string library

This SRFI provides a string library derived from SRFI 13. The biggest difference is that it allows subsequences of strings to be specified by cursors as well as the traditional string indexes.

The implementation provided by this egg uses opaque cursors and is built on top of the utf8 egg, but provides better performance for indexing operations like string-ref/cursor. All forms are fully Unicode-aware (UTF-8).

SRFI Description

This page includes excerpts from the SRFI document, but is primarily intended to document the forms exported by the egg. For a full description of the SRFI, see the SRFI document.

Specification

String cursors

Cursors are opaque objects which can only be created by procedures described in the Cursor Operations section (string-cursor-start, etc.). Accessing an element of a Unicode string denoted by a cursor executes in (notional) O(1) time.

Cursors are valid only for the string they were created from. Mutating a string also invalidates all existing cursors referring to it. The current implementation won't help you catch these errors. This is a challenging problem, and suggestions for solving it are welcome.

Most procedures in this library accept cursors or codepoint indexes. Indexes are more convenient, but operations using them are slower (O(n) in the length of the string). For efficient code using substrings or string indexing, prefer cursors.

Shared storage

The following SRFI 130 procedures are allowed to return a result which shares storage with one or more of their string arguments:

substring/cursors
string-take, string-take-right
string-drop, string-drop-right
string-pad, string-pad-right
string-trim, string-trim-right, string-trim-both
string-split, string-filter, string-remove

Notation

In the following procedure specifications:

An s parameter is a string.
A char parameter is a character.
start and end parameters are half-open string cursors or indexes specifying a substring within a string parameter, and typically restrict a procedure's action to the indicated substring.

When omitted, they default to 0 and the length of the string, respectively; or from another point of view, they default to the start cursor and the post-end cursor, respectively. For indexes, it must be the case that 0 ≤ start ≤ end ≤ (string-length s), for the corresponding parameter s when start and end are indexes, and the corresponding relationship must hold when they are cursors. It is an error unless start and end are both cursors or both indexes.

A pred parameter is a unary character predicate procedure, returning a true/false value when applied to a character.
It is an error if a pred is not pure and functional.
A cursor parameter is either a cursor or an exact non-negative integer specifying an index into a string.
len and nchars parameters are exact non-negative integers specifying a length of a string or some number of characters.
An obj parameter may be any value at all.
Passing values to procedures with these parameters that do not satisfy these types is an error.
Parameters given in square brackets are optional.
Unless otherwise noted in the text describing the procedure, any prefix of these optional parameters may be supplied, from zero arguments to the full list.
When a procedure returns multiple values, this is shown by listing the return values in square brackets, as well.

So, for example, the procedure with signature

[procedure] halts? f [x init-store] → [boolean integer]

would take one (f), two (f, x) or three (f, x, init-store) input parameters, and return two values, a boolean and an integer.

A parameter followed by "..." means zero or more elements.

So the procedure with the signature

[procedure] sum-squares x ... → number

takes zero or more arguments (x ...),

while the procedure with signature

[procedure] spell-check doc dict₁ dict₂ ... → string-list

takes two required parameters (doc and dict₁) and zero or more optional parameters (dict₂ ...).

Procedures

Cursor operations

These procedures are mostly taken from Chibi Scheme.

[procedure] string-cursor? obj → boolean

Returns #t if obj can be a string cursor, and #f otherwise.

[procedure] string-cursor-start s → cursor
[procedure] string-cursor-end s → cursor

Returns the start/post-end cursor of s respectively.

[procedure] string-cursor-next s cursor → cursor
[procedure] string-cursor-prev s cursor → cursor

Returns the cursor into s following/preceding cursor. If cursor is an index, returns one more/less than cursor. It is an error if cursor is the post-end/start cursor of s.

[procedure] string-cursor-forward s cursor nchars → cursor
[procedure] string-cursor-back s cursor nchars → cursor

Returns the cursor into s which follows/precedes cursor by nchars characters. If cursor is an index, returns nchars more/less than cursor. It is an error if the result would be an invalid cursor or index.

[procedure] string-cursor=? cursor[1] cursor[2] → boolean
[procedure] string-cursor<? cursor[1] cursor[2] → boolean
[procedure] string-cursor>? cursor[1] cursor[2] → boolean
[procedure] string-cursor<=? cursor[1] cursor[2] → boolean
[procedure] string-cursor>=? cursor[1] cursor[2] → boolean

Compares two cursors or two indexes pointing into the same string.

[procedure] string-cursor-diff s start end → nchars

Returns the number of characters between start and end in string s. Note that the result is always non-negative if start and end are a valid start-end pair.

[procedure] string-cursor->index s cursor → index
[procedure] string-index->cursor s index → cursor

Converts a cursor / index into s into the corresponding index/cursor. If the argument is already an index/cursor, it is returned unchanged.

Predicates

[procedure] string-null? s → boolean

Is s the empty string?

[procedure] string-every pred s [start end] → value
[procedure] string-any pred s [start end] → value

Checks to see if every/any character in s satisfies pred proceeding from left (index start) to right (index end).

The predicate is "witness-generating":

If string-any returns true, the returned true value is the one produced by the application of the predicate.
If string-every returns true, the returned true value is the one produced by the final application of the predicate to s[end-1].

If string-every is applied to an empty sequence of characters, it simply returns #t.

The names of these procedures do not end with a question mark--this is to indicate that they do not return a simple boolean (#t or #f), but a general value.

Constructors

[procedure] string-tabulate proc len → string

proc is an integer → char procedure. Construct a string of size len by applying proc to each value from 0 (inclusive) to len (exclusive) to produce the corresponding string element. The order in which proc is applied to the indexes is not specified.

Note that the order of arguments is not the same as SRFI 1's list-tabulate, but is the same as tabulation functions in other SRFIs. When this discrepancy was discovered in SRFI 13, it was too late to change SRFI 1.

[procedure] string-unfold stop? mapper successor seed [base make-final] → string

This is a fundamental constructor for strings.

successor is used to generate a series of "seed" values from the initial seed: seed, (successor seed), (successor^2 seed), (successor^3 seed), ...
stop? tells us when to stop--when it returns true when applied to one of these seed values.
mapper maps each seed value to the corresponding character in the result string. These chars are assembled into the string in a left-to-right order.
base is the optional initial/leftmost portion of the constructed string; it defaults to the empty string "".
make-final is applied to the terminal seed value (on which stop? returns true) to produce the final/rightmost portion of the constructed string. It defaults to (lambda (x) "").

string-unfold is a fairly powerful string constructor--you can use it to convert a list to a string, read a port into a string, reverse a string, copy a string, and so forth.

Examples:

(port->string p) = (string-unfold eof-object?
                                  values
                                  (lambda (x) (read-char p))
                                  (read-char p))

(list->string lis) = (string-unfold null? car cdr lis)

(string-tabulate f size) = (string-unfold (lambda (i) (= i size)) f add1 0)

To map f over a list lis, producing a string:

(string-unfold null? (compose f car) cdr lis)

Interested functional programmers may enjoy noting that string-fold-right and string-unfold are in some sense inverses. That is, given operations knull?, kar, kdr, kons, and knil satisfying

(kons (kar x) (kdr x)) = x  and (knull? knil) = #t

then

(string-fold-right kons knil (string-unfold knull? kar kdr x)) = x

and

(string-unfold knull? kar kdr (string-fold-right kons knil s)) = s.

The final string constructed does not share storage with either base or the value produced by make-final.

This combinator sometimes is called an "anamorphism."

[procedure] string-unfold-right stop? mapper successor seed [base make-final] → string

This is a fundamental constructor for strings.

It is equivalent to string-unfold, except that the results of mapper are assembled into the string in a right-to-left order, base is the optional rightmost portion of the constructed string, and make-final produces the leftmost portion of the constructed string.

Conversion

[procedure] string->list/cursors s [start end] → char-list
[procedure] string->vector/cursors s [start end] → char-vector

string->list/cursors and string->vector/cursors return a newly allocated list or vector of the characters that make up the given string. They differ from the R7RS procedures string->list and string->vector by accepting either cursors or indexes.

[procedure] reverse-list->string char-list → string

An efficient implementation of (compose list->string reverse):

(reverse-list->string '(#\a #\B #\c)) → "cBa"

This is a common idiom in the epilog of string-processing loops that accumulate an answer in a reverse-order list. (See also string-concatenate-reverse for the "chunked" variant.)

[procedure] string-join string-list [delimiter grammar] → string

This procedure is a simple unparser--it pastes strings together using the delimiter string.

The grammar argument is a symbol that determines how the delimiter is used, and defaults to infix.

infix means an infix or separator grammar: insert the delimiter between list elements. An empty list will produce an empty string--note, however, that parsing an empty string with an infix or separator grammar is ambiguous. Is it an empty list, or a list of one element, the empty string?
strict-infix means the same as infix, but will signal an error if given an empty list.
suffix means a suffix or terminator grammar: insert the delimiter after every list element. This grammar has no ambiguities.
prefix means a prefix grammar: insert the delimiter before every list element. This grammar has no ambiguities.

The delimiter is the string used to delimit elements; it defaults to a single space " ".

Examples:

(string-join '("foo" "bar" "baz") ":")         => "foo:bar:baz"
(string-join '("foo" "bar" "baz") ":" 'suffix) => "foo:bar:baz:"

;; Infix grammar is ambiguous wrt empty list vs. empty string,
(string-join '()   ":") => ""
(string-join '("") ":") => ""

;; but suffix & prefix grammars are not.
(string-join '()   ":" 'suffix) => ""
(string-join '("") ":" 'suffix) => ":"

Selection

[procedure] string-ref/cursor s cursor → char

Returns a character element of s indexed by a valid cursor or index of s. It differs from the R7RS procedure string-ref by accepting either a cursor or an index.

[procedure] substring/cursors s start end → string
[procedure] string-copy/cursors s [start end] → string

These procedures return a string whose contents are the characters of s beginning with index start (inclusive) and ending with index end (exclusive). If substring/cursors produces the entire string, it may return either s or a copy of s; in some implementations, proper substrings may share memory with s. However, string-copy/cursors always returns a newly allocated string. They differ from the R7RS procedures substring and string-copy by accepting either cursors or indexes.

[procedure] string-take s nchars → string
[procedure] string-drop s nchars → string
[procedure] string-take-right s nchars → string
[procedure] string-drop-right s nchars → string

string-take returns the first nchars of s; string-drop returns all but the first nchars of s. string-take-right returns the last nchars of s; string-drop-right returns all but the last nchars of s.

If these procedures produce the entire string, they may return either s or a copy of s; in some implementations, proper substrings may share memory with s.

Examples:

(string-take "Pete Szilagyi" 6) => "Pete S"
(string-drop "Pete Szilagyi" 6) => "zilagyi"

(string-take-right "Beta rules" 5) => "rules"
(string-drop-right "Beta rules" 5) => "Beta "

It is an error to take or drop more characters than are in the string:

(string-take "foo" 37) => error

[procedure] string-pad s len [char start end] → string
[procedure] string-pad-right s len [char start end] → string

Build a string of length len comprised of s padded on the left (right) by as many occurrences of the character char as needed. If s has more than len chars, it is truncated on the left (right) to length len. Char defaults to #\space.

If len ≤ end-start, the returned value is allowed to share storage with s, or be exactly s (if len = end-start).

(string-pad     "325" 5) => "  325"
(string-pad   "71325" 5) => "71325"
(string-pad "8871325" 5) => "71325"

[procedure] string-trim s [pred start end] → string
[procedure] string-trim-right s [pred start end] → string
[procedure] string-trim-both s [pred start end] → string

Trim s by skipping over all characters on the left / on the right / on both sides that satisfy the second parameter pred. pred defaults to char-whitespace?.

If no trimming occurs, these functions may return either s or a copy of s; in some implementations, proper substrings may share memory with s.

(string-trim-both "  The outlook wasn't brilliant,  \n\r")
    => "The outlook wasn't brilliant,"

Prefixes & suffixes

[procedure] string-prefix-length s1 s2 [start1 end1 start2 end2] → integer
[procedure] string-suffix-length s1 s2 [start1 end1 start2 end2] → integer

Return the length of the longest common prefix/suffix of the two strings. For prefixes, this is equivalent to the "mismatch index" for the strings (modulo the start cursors).

The optional start / end cursors or indexes restrict the comparison to the indicated substrings of s1 and s2.

[procedure] string-prefix? s1 s2 [start1 end1 start2 end2] → boolean
[procedure] string-suffix? s1 s2 [start1 end1 start2 end2] → boolean

Is s1 a prefix/suffix of s2?

The optional start/end cursors or indexes restrict the comparison to the indicated substrings of s1 and s2.

Searching

[procedure] string-index s pred [start end] → cursor
[procedure] string-index-right s pred [start end] → cursor
[procedure] string-skip s pred [start end] → cursor
[procedure] string-skip-right s pred [start end] → cursor

string-index searches through s from the left, returning the cursor of the first occurrence of a character which satisfies the predicate pred. If no match is found, it returns end. string-index-right searches through s from the right, returning the cursor of the successor of the first occurrence of a character which satisfies the predicate pred. If no match is found, it returns start.

The start and end parameters specify the beginning and end cursors or indexes of the search; the search includes the start, but not the end. Be careful of "fencepost" considerations: when searching right-to-left, the first position considered is (string-cursor-prev end), whereas when searching left-to-right, the first index considered is start. That is, the start / end indexes describe the same half-open interval [start, end) in these procedures that they do in all the other SRFI 130 procedures.

The skip functions are similar, but use the complement of the criteria: they search for the first char that doesn't satisfy pred. E.g., to skip over initial whitespace, say

(substring/cursors s (string-skip s char-whitespace?))

Note that the result is always a cursor, even when start and end are indexes. Use string-cursor->index to convert the result to an index. Therefore, these four functions are not entirely compatible with their SRFI 13 counterparts, which return #f on failure.

These functions can be trivially composed with string-take and string-drop to produce take-while, drop-while, span, and break procedures without loss of efficiency.

[procedure] string-contains s1 s2 [start1 end1 start2 end2] → cursor
[procedure] string-contains-right s1 s2 [start1 end1 start2 end2] → cursor

Does string s1 contain string s2?

Returns the cursor in s1 referring to the first character of the first/last instance of s2 as a substring, or #f if there is no match. The optional start / end indexes restrict the operation to the indicated substrings.

The returned cursor is in the range [start1, end1). A successful match must lie entirely in the [start1, end1) range of s1.

Note that the result is always a cursor, even when start1 and end1 are indexes. Use string-cursor->index to convert a cursor result to an index.

(string-contains "eek--what a geek." "ee"
                 12 18) ; Searches "a geek"
    => {Cursor 15}

The name of this procedure does not end with a question mark--this is to indicate that it does not return a simple boolean (#t or #f). Rather, it returns either #f or a cursor.

The whole string

[procedure] string-reverse s [start end] -> string

Reverse the string.

string-reverse returns the result string and does not alter its s parameter.

(string-reverse "Able was I ere I saw elba.")
    => ".able was I ere I saw elbA"
(string-reverse "Who stole the spoons?" 14 20)
    => "snoops"

Unicode note: Reversing a string simply reverses the sequence of code-points it contains. So a combining diacritic a coming after a base character b in string s would come out before b in the reversed result.

[procedure] string-concatenate string-list → string

Append the elements of string-list together into a single string. Guaranteed to return a freshly allocated string.

Note that the (apply string-append string-list) idiom is not robust for long lists of strings, as some Scheme implementations limit the number of arguments that may be passed to an n-ary procedure.

[procedure] string-concatenate-reverse string-list [final-string end] → string

With no optional arguments, this function is equivalent to

(string-concatenate (reverse string-list))

If the optional argument final-string is specified, it is consed onto the beginning of string-list before performing the list-reverse and string-concatenate operations.

If the optional argument end is given, only the characters up to but not including end in final-string are added to the result, thus producing

(string-concatenate
  (reverse (cons (substring final-string
                            (string-cursor-start final-string)
                            end)
                 string-list)))

E.g.

(string-concatenate-reverse '(" must be" "Hello, I") " going.XXXX" 7)
  => "Hello, I must be going."

This procedure is useful in the construction of procedures that accumulate character data into lists of string buffers, and wish to convert the accumulated data into a single string when done.

[procedure] string-fold kons knil s [start end] → value
[procedure] string-fold-right kons knil s [start end] → value

These are the fundamental iterators for strings.

The string-fold operator maps the kons procedure across the string from left to right

(... (kons s[2] (kons s[1] (kons s[0] knil))))

In other words, string-fold obeys the (tail) recursion

(string-fold kons knil s start end) =
    (string-fold kons (kons s[start] knil) start+1 end)

The string-fold-right operator maps the kons procedure across the string from right to left

(kons s[0] (... (kons s[end-3] (kons s[end-2] (kons s[end-1] knil)))))

obeying the (tail) recursion

(string-fold-right kons knil s start end) =
    (string-fold-right kons (kons s[end-1] knil) start end-1)

Examples:

;;; Convert a string to a list of chars.
(string-fold-right cons '() s)

;;; Count the number of lower-case characters in a string.
(string-fold (lambda (c count)
               (if (char-lower-case? c)
                   (+ count 1)
                   count))
             0
             s)

;;; Double every backslash character in S.
(let* ((ans-len (string-fold (lambda (c sum)
                               (+ sum (if (char=? c #\\) 2 1)))
                             0 s))
       (ans (make-string ans-len)))
  (string-fold (lambda (c i)
                 (let ((i (if (char=? c #\\)
                              (begin (string-set! ans i #\\) (+ i 1))
                              i)))
                   (string-set! ans i c)
                   (+ i 1)))
               0 s)
  ans)

The right-fold combinator is sometimes called a "catamorphism."

[procedure] string-for-each-cursor proc s [start end] → unspecified

Apply proc to each cursor of s, in order, excluding the post-end cursor. The optional start / end pairs restrict the endpoints of the loop. This is simply a method of looping over a string that is guaranteed to be safe and correct.

Example:

(let ((s "abcde") (v '()))
  (string-for-each-cursor
    (lambda (cur) (set! v (cons (char->integer (string-ref/cursor s cur)) v)))
    s)
  v) => (101 100 99 98 97)

[procedure] string-replicate s from to [start end] → string

This is an "extended substring" procedure that implements replicated copying of a substring of some string.

s is a string; start and end are optional arguments that demarcate a substring of s, defaulting to 0 and the length of s (i.e., the whole string). Replicate this substring up and down index space, in both the positive and negative directions. For example, if s = "abcdefg", start = 3, and end = 6, then we have the conceptual bidirectionally-infinite string

...  d  e  f  d  e  f  d  e  f d  e  f  d  e  f  d  e  f  d ...
... -9 -8 -7 -6 -5 -4 -3 -2 -1 0 +1 +2 +3 +4 +5 +6 +7 +8 +9 ...

string-replicate returns the substring of this string beginning at index from, and ending at to. Note that these arguments cannot be cursors. It is an error if from is greater than to.

You can use string-replicate to perform a variety of tasks:

To rotate a string left:

(string-replicate "abcdef" 2 8) => "cdefab"

To rotate a string right:

(string-replicate "abcdef" -2 4) => "efabcd"

To replicate a string:

(string-replicate "abc" 0 7) => "abcabca"

Note that

The from / to indexes give a half-open range--the characters from index from up to, but not including, index to.
The from / to indexes are not in terms of the index space for string s. They are in terms of the replicated index space of the substring defined by s, start, and end.
It is an error if start = end--although this is allowed by special dispensation when from = to.

Compatibility note: string-replicate is identical to the xsubstring procedure of SRFI 13, except that the to argument is required.

[procedure] string-count s pred [start end] → integer

Return a count of the number of characters in s that satisfy the pred argument.

[procedure] string-replace s1 s2 start1 end1 [start2 end2] → string

Returns

(string-append (substring/cursors s1 (string-cursor-start s1) start1)
               (substring/cursors s2 start2 end2)
               (substring/cursors s1 end1 (string-cursor-end s1)))

That is, the segment of characters in s1 from start1 to end1 is replaced by the segment of characters in s2 from start2 to end2. If start1 = end1, this simply splices the s2 characters into s1 at the specified index.

Examples:

(string-replace "The TCL programmer endured daily ridicule."
                "another miserable perl drone" 4 7 8 22 ) =>
    "The miserable perl programmer endured daily ridicule."

(string-replace "It's easy to code it up in Scheme." "lots of fun" 5 9) =>
    "It's lots of fun to code it up in Scheme."

(define (string-insert s i t) (string-replace s t i i))

(string-insert "It's easy to code it up in Scheme." 5 "really ") =>
    "It's really easy to code it up in Scheme."

[procedure] string-split s delimiter [grammar limit start end] → list

Returns a list of the words contained in the substring of string from start (inclusive) to end (exclusive). delimiter specifies a string that is to be used as the word separator. This will often be a single character, but multiple characters are allowed for cases like splitting on "\r\n". The returned list will then have one more item than the number of non-overlapping occurrences of the delimiter in the string. If delimiter is an empty string, then the returned list contains a list of strings, each of which contains a single character.

Grammar is a symbol with the same meaning as in the string-join procedure. If it is infix, which is the default, processing is done as described above, except that an empty s produces the empty list; if it is strict-infix, an empty s signals an error. The values prefix and suffix cause a leading/trailing empty string in the result to be suppressed.

If limit is a non-negative exact integer, at most that many splits occur, and the remainder of string is returned as the final element of the list (thus, the result will have at most limit + 1 elements). If limit is not specified or is #f, then as many splits as possible are made. It is an error if limit is any other value.

Use SRFI 115's regexp-split (irregex-split from the irregex module) to split on a regular expression rather than a simple string.

[procedure] string-filter pred s [start end] → string
[procedure] string-remove pred s [start end] → string

Filter the string s, retaining only those characters that satisfy / do not satisfy pred.

If the string is unaltered by the filtering operation, these functions may return either s or a copy of s.

Compatibility note: string-remove is identical to the string-delete procedure of SRFI 13, but the name string-delete is inconsistent with the conventions of SRFI 1 and other SRFIs.

Exceptions

This egg tries to give useful information when things go wrong. Procedure arguments are type-checked; when a type check fails, a condition of kind (exn type assertion) is raised. String bounds errors are signaled by (exn bounds assertion) conditions. This conforms to the condition protocol used by CHICKEN's internal libraries.

See the Module (chicken condition) page for more information.

 Copyright © 2022 Wolfgang Corcoran-Mathe
 
 Permission is hereby granted, free of charge, to any person
 obtaining a copy of this software and associated documentation
 files (the "Software"), to deal in the Software without
 restriction, including without limitation the rights to use,
 copy, modify, merge, publish, distribute, sublicense, and/or
 sell copies of the Software, and to permit persons to whom the
 Software is furnished to do so, subject to the following
 conditions:
 
 The above copyright notice and this permission notice shall be
 included in all copies or substantial portions of the Software.
 
 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
 OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
 HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
 WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
 FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
 OTHER DEALINGS IN THE SOFTWARE.

The SRFI document is copyright Olin Shivers (1998, 1999, 2000) and John Cowan (2016).

Version history

0.1: SRFI-130 ported to Chicken 5.2.0
0.2: Registered srfi-130 feature, and linked to source code in documenation
0.3: Change maintainer and repository information.
0.4 (2021-08-27): Switch to Clinger implementation, add range checks.
1.0 (2022-01-13): First Unicode-aware release.
2.0.0 (2022-09-08): New opaque-cursor implementation.
2.0.1 (2022-09-09): Remove unneeded, unlisted import.