slset

Efficient procedures for dealing with lists of symbols as sets.

Requirements

This egg has no dependencies (except the test egg for running the tests).

This extension provides (more-or-less) drop-in replacement procedures for srfi-1's lset-* set operation procedures, with linear performance. It can be used whenever you use lists *containing only symbols* as sets, but the quadratic performance of the SRFI-1 implementations are causing you grief.

The current implementation uses "plist" property lists to temporarily mark symbols, to avoid traversing lists more than once. This means all operations involving two sets of lengths n and m are O(n+m) instead of O(n*m).

To "upgrade" your code from the SRFI-1 to the slset API, follow the following logic:

- Replace the lset- prefix with slset- - Replace delete-duplicates with slset-deduplicate - Replace delete with slset-delete - Replace member or memq with slset-contains? (optional) - For all the above, drop the comparison predicate (i.e. eq?, usually the first argument, except member and delete)

Note that "linear update" variants are currently not supported. Just drop the bang from the procedure name and you'll be fine.

A special "reified slset" type is also provided to allow O(1) membership testing and adjoining of elements (see below).

Comparison procedures

[procedure] (slset<= LIST_1 ...)

Returns true iff every LIST_i is a subset of LIST_i+1.

List A is a subset of list B if every symbol in A also occurs in B.

(slset<= '(a) '(a b a) '(a b c c)) => #t

(slset<= '(a b) '(a)) =>  #f
(slset<= '(a b) '(a c)) =>  #f

(slset<=) => #t             ; Trivial cases
(slset<= '(a)) => #t

[procedure] (slset= LIST_1 LIST_2 ...) -> boolean

Returns true iff every LIST_i is set-equal to LIST_i+1.

"Set-equal" simply means that LIST_i is a subset of LIST_i+1, and LIST_i+1 is a subset of LIST_i.

(slset= '(b e a) '(a e b) '(e e b a)) => #t

(slset= '(b e a) '(a e b x) '(e e b a)) => #f

(slset=) => #t               ; Trivial cases
(slset= '(a)) => #t

Set element operations

[procedure] (slset-adjoin LIST SYM_1 ...) -> list

Adds the SYM_i symbols not already in the LIST parameter to the result list. The result shares a common tail with the LIST parameter. The new symbols are added to the front of the list, but no guarantees are made about their order.

The LIST parameter is always a suffix of the result -- even if the LIST parameter contains repeated symbols, these are not reduced.

IMPORTANT: This is O(n+m) where n is the number of symbols already in LIST and m the number of symbols to adjoin. This is better than SRFI-1, but it does mean that if you adjoin symbols one by one (a typical situation), it is still O(n*m), just like SRFI-1.

If you know that the set cannot possibly contain the symbols to add, simply use cons to add them.

If you have no knowledge of the symbols and set contents, see reified-slsets below for O(1) set addition without duplicates.

(slset-adjoin '(a b c d c e) 'a 'e 'i 'o 'u) => (u o i a b c d c e)

[procedure] (slset-delete LIST SYM_1 ...) -> list

Removes the SYM_i symbols occurring in the LIST parameter from the result list. If any such symbols occur more than once, those are all removed.

IMPORTANT: This is O(n+m) where n is the number of symbols in LIST and m the number of symbols to delete.

This is better than SRFI-1, but it does mean that if you delete symbols one by one (a typical situation), it is still O(n*m), just like delete in SRFI-1.

See reified-slsets below for slightly better deletion.

(slset-delete '(a b c d a c e) 'b) => (a c d a c e)
(slset-delete '(a b c d a c e) 'a 'e 'i) => '(b c d c)

[procedure] (slset-contains? LIST SYM) -> boolean

Returns #t if SYM occurs in LIST, #f otherwise. This is mostly provided for convenience, as it does not provide any performance improvement over using memq directly - its runtime is O(n).

See reified-slsets below for O(1) membership testing.

(slset-contains? '(a b c d a c e) 'a) => #t
(slset-contains? '(a b c d a c e) 'd) => #t
(slset-contains? '(a b c d a c e) 'i) => #f

Procedures for performing set algebra

[procedure] (slset-union LIST_1 ...) -> list

Returns the union of the lists.

The union of lists A and B is constructed as follows:

If A is the empty list, the answer is B (or a copy of B).
Otherwise, the result is initialised to be list A (or a copy of A).
Proceed through the symbols of list B in a left-to-right order. For each symbol that is not a member of A, it is consed onto the front of the result.

In practice, this means the order of the symbols in LIST_1 are kept as-is at the end of the result, but all the LIST_i for I>1 are added in reverse order to the front of the result.

In the n-ary case, the two-argument list-union operation is simply folded across the argument lists.

(slset-union '(a b c d e) '(a e i o u)) => (u o i a b c d e)

;; Repeated symbols in LIST1 are preserved.
(slset-union '(a a c) '(x a x)) => (x a a c)

;; Trivial cases
(slset-union) => ()
(slset-union '(a b c)) => (a b c)

[procedure] (slset-intersection LIST_1 LIST_2 ...) -> list

Returns the intersection of the lists.

The intersection of lists A and B is comprised of every symbol of A that also occurs in B. Note this implies that an symbol which appears in B and multiple times in list A will also appear multiple times in the result.

The order in which symbols appear in the result is the same as they appear in LIST_1 -- that is, slset-intersection essentially filters LIST_1, without disarranging symbol order. The result may share a common tail with LIST_1.

In the n-ary case, the two-argument list-intersection operation is simply folded across the argument lists.

(slset-intersection '(a b c d e) '(a e i o u)) => (a e)

;; Repeated symbols in LIST1 are preserved.
(slset-intersection '(a x y a) '(x a x z)) => '(a x a)
(slset-intersection '(a x b y b a) '(x a x z) '(a b)) => (a a)

(slset-intersection '(a b c)) => (a b c)     ; Trivial case

[procedure] (slset-difference list_1 list_2 ...) -> list

Returns the difference of the lists -- all the symbols of LIST_1 that do not occur in any of the other LIST_i parameters.

Symbols that are repeated multiple times in the LIST_1 parameter will occur multiple times in the result. The order in which symbols appear in the result is the same as they appear in LIST_1 -- that is, slset-difference essentially filters LIST_1, without disarranging symbol order. The result may share a common tail with LIST_1.

(slset-difference '(a b c d e) '(a e i o u)) => (b c d)

;; Repeated symbols in LIST1 are preserved.
(slset-difference '(a b b c x x d x i e) '(a e i o u) '() '(x x y z z)) => (b b c d)

(slset-difference '(a b c)) => (a b c) ; Trivial case

[procedure] (slset-xor LIST_1 ...) -> list

Returns the exclusive-or of the sets. If there are exactly two lists, this is all the symbols that appear in exactly one of the two lists. The operation is associative, and thus extends to the n-ary case -- the symbols that appear in an odd number of the lists. The result may share a common tail with any of the LIST_i parameters.

More precisely, for two lists A and B, A xor B is a list of

every symbol of A that does not occur in B
every symbol of B that does not occur in A

In the n-ary case, the binary-xor operation is simply folded across the lists.

(slset-xor '(a b c d e) '(a e i o u)) => (d c b i o u)
 
;; For multiple lists, returns symbols that appear in an odd number of lists
(slset-xor '(a b c d e) '(a e i o u) '(b o e) '(x)) => (c d i u e x)

;; Repeated symbols in lists are preserved
(slset-xor '(a b b c d e d) '(a e i o u o)) => (b b c d d i o u o)
(slset-xor '(a b b c d e d) '(a e i o u o) '(x x)) => (b b c d d i o u o x x)

;; Trivial cases.
(slset-xor) => ()
(slset-xor '(a b c d e)) => (a b c d e)

[procedure] (slset-diff+intersection LIST_1 LIST_2 ...) -> [list list]

Returns two values -- the difference and the intersection of the lists. Is equivalent to

(values (slset-difference LIST_1 LIST_2 ...)
        (slset-intersection LIST_1
                            (lset-union = LIST_2 ...)))

but can be implemented more efficiently.

Either of the answer lists may share a common tail with LIST_1. This operation essentially partitions LIST_1.

Deduplication

[procedure] (slset-deduplicate LST)

Removes any symbols occurring more than once in LST. Keeps the first occurrence, removes later occurrences. This is O(n) on the length of the list.

Reified slsets

With symbol set lists, membership testing, deletion and adjoining still requires traversal of the entire list at least once per operation, and marking and unmarking of the list on every operation.

This is fine when the sets are already completed. However, when building sets, it would be nice if one could add a member (without duplicates) with O(1) runtime. And when working with completed sets, it would be nice to do the same for membership testing.

In such situations, one can reify an slset. This wraps the list in a new mutable type. As long as it's reified, the symbols retain their markings, allowing O(1) membership testing and, by extension, O(1) adjoining of new symbols.

This is related to Clojure's notion of "transients". The difference is that a transient is a mutable full copy of the underlying object, whereas the reified slset is a mutable wrapper which refers to the same underlying list.

[procedure] (with-reified-slset LST PROC) -> list ...

Calls PROC with a reified slset object referring to LST as its argument. During the dynamic extent of PROC, all symbols in LST will have an observable marking in their plist.

A call to with-reified-slset is O(n) on the length of LST.

After PROC returns, with-reified-slset returns multiple values:

The first value is the final slset that remains after applying all mutations to the original LST. The rest of the values are the values returned by PROC, allowing one to capture return values from PROC if desired.

Note that most continuations which accept a single value will implicitly discard any extra values, so if you're not interested in what PROC returns, you may act exactly as if with-reified-slset only returned the final slset.

   (let-values ((result (with-reified-slset '(a b c d a c e)
                          (lambda (r)
                            (reified-slset-contains? r 'a) => #t
                            (reified-slset-contains? r 'i) => #f

                            (reified-slset-delete! r 'b) => r
                            (reified-slset->slset r) => (a c d a c e)

                            (reified-slset-adjoin! r 'x) => r
                            (reified-slset-contains? r 'x) => #t

                            (reified-slset-delete! r 'a 'e 'i) => r
                            
                            (reified-slset->slset r) => (x c d c)

                            (reified-slset-contains? r 'a) => #f

                            (reified-slset-contains? r 'c) => #t

                            ;; Return some values
                            (values 1 2 3)))))
     result) => ((x c d c) 1 2 3)

[procedure] (reified-slset? OBJ) -> boolean

Predicate for reified slsets.

Returns #t if OBJ is a reified slset object, #f otherwise.

[procedure] (reified-slset->slset RS) -> list

Extract the slset (the plain list of symbols) from the reified slset RS, with any mutations done so far. Note that this list may be shared with the reified slset, so any mutations made to the list could invalidate the slset assumptions.

[procedure] (reified-slset-adjoin! RS SYM_1 ...) -> reified-slset

Adds the SYM_i symbols not already in the reified slset RS parameter to the set, mutating it in-place. New symbols are added to the front of the list, but no guarantees are made about their order.

Runtime is O(n) on the number of symbols to be added.

Returns RS (for convenience only, the set is updated in-place).

[procedure] (reified-slset-delete! RS SYM_1 ...) -> reified-slset

Runtime is O(n+m), with n the size of the RS set, and m the number of symbols to delete. As a special case, if none of the symbols occur in the set, it is O(m), because filtering of the set is skipped.

Returns RS (for convenience only, the set is updated in-place).

[procedure] (reified-slset-contains? RS SYM) -> boolean

Returns #t if SYM occurs in the reified slset RS, #f otherwise.

Runtime is O(1).

Caveats

Since this egg uses symbol plists under the hood, its performance is dependent the size of plists. In effect, operations on two sets of length n and m will be O(n*L+m*L), where L is the average length of the plists of symbols in both sets. In typical situations, plists will be mostly empty. Therefore, in the documentation above, we ignore the variable L when discussing performance.

Like the similar SRFI-1 lset operations, duplicates are typically kept around in the first set argument only.

All of the egg's procedures accept lists containing other types of elements, not just symbols. Non-symbol elements are silently skipped. If they occur in the first set argument, they will typically not be removed.

Changelog

0.1 Initial release

License

 Copyright (C) 2025 Peter Bex
 Redistribution and use in source and binary forms, with or without
 modification, is permitted.
 
 THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS
 OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
 WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE
 LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
 OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
 BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
 LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
 USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
 DAMAGE.

Description

Author

Repository