sdbm (historical revision 34196) - The CHICKEN Scheme wiki

You are looking at historical revision 34196 of this page. It may differ significantly from its current revision.

sdbm

sdbm is a clone of the SDBM database library.

sdbm

sdbm is a reimplementation of the public-domain SDBM database library, which is itself essentially a clone of NDBM. SDBM provides a simple key-value store with a fixed limit on data length and no ACID semantics to speak of, providing no write locking, no atomicity, no transactions, and little guarantee that your file won't be corrupted in a crash. It also relies on sparse file support, which is not present on filesystems such as HFS+.

Despite these shortcomings, it is a simple implementation without dependencies, written completely in Scheme. And some issues with the original implementation have been remedied: byte order is configurable, and page and directory block size can be adjusted at runtime. Therefore, sdbm might still be useful as a very simple key-value store for non-critical applications.

Joint Database Technology - SDBM and Flat File Databases working in tandem

Where external, binary, persistent, SDBM (see SDBM mentioned in SUCCESSORS section below) database files (tied to program hash tables - such as in PERL application programs) can really be made useful is in using the key/value pairs for random access indexing into a huge relational "text" flat file database composed of many flat files (with fixed-length records) exhibiting parent/child (1-to-many record) relationships. The key would be composed of a: single field, single partial field, or a compound key of multiple single and/or partial fields concatenated together (perhaps with a delimiter character between them such as a pipe "|"). The value in the key/value pair would be the location offset (in bytes) to seek to (i.e. position the file pointer) in a flat file at the start of a specific record wished to be random accessed for: READ, READ/WRITE, or APPEND access. Multiple SDBM files can be setup as alternate indexes into each of the Flat File database text files, each SDBM file containing a different key (composed of a: single field, single partial field, or a compound key of multiple single and/or partial fields concatenated together). An alternate key with duplicates can be created in the SDBM files by making as part of the key, an incremented number perhaps in the range 1-9999.

       Key example:    LastName|IncNbr(perhaps in range 1-9999)
                       "Williams|1" ... "Williams|5745".

When editing Flat File records, the changes are made "in place" overwriting existing data in the flat file record. Be careful to design your user-interface so that any changes in the Flat File data are also made in any corresponding data in your SDBM file key/value pairs so that the indexing is properly maintained. A DELETE flag indicator field can be employed to mark records in both the Flat Files and SDBM files (for later BATCH deletion Server-side during off hours) for an application program to recognize as a BYPASS indicator.

In a multi-user environment, a manual record locking system could be designed to lock a specific record for editing by one user. The username, flat file name, and record offset could be stored in an external SDBM database file (tied to a program hash table) at the time a user makes the request to edit a specific record. Once the record is released from EDIT (SAVE or CANCEL issued), then the lock is removed from the SDBM file. Each time a user makes a request to edit a record, the user-interface to the Flat File database would perform a Lookup to this Lock File to determine if the record in the Flat File was available for edit or already locked by another user. One more thing to consider is the ADDING of records to each flat file in a multi-user environment when done during business hours. It is likely mandatory (for database stability) that the ability to ADD (APPEND) a record to a flat file is performed by a single user, preventing concurrency issues from arising. Any single user (with ADD permission) can make a request to ADD a new record. Doing so would set an ADD/APPEND flag to "on" in a corresponding SDBM database file used for ADD/APPEND locks. Once a new record was added (or aborted), the ADD/APPEND flag would be set to "off", freeing up that flat file for ADD/APPEND by the first user (with permission) making a new request to ADD a record.

If the user-interface was designed well, child records (in a 1-to-many, parent/child relationship) would not be directly editable, but only editable whenever the corresponding parent record was locked for edit.

This is a very stable/safe database system. The binary SDBM files can easily be rebuilt from the Flat File database records. This is more desirable, then let's say, a MS-Access database, where the text data and indexes are stored together in a binary file which can become corrupted making it sometimes difficult to rescue your important textual data. In MS-Access, the database Data and Objects: back-end Tables/Indexes/Data, and front-end Reports/Forms/Macros/etc. are often mistakenly stored in one file (in binary format) - although MS-Access does provide for the means to separate the back-end and front-end into separate files allowing for a much more stable DB system.

One advantage joint/tandem/dual technology Flat File/SDBM databases have over MS-Access (for example) is that they require no MDAC (Microsoft Data Access Components) be installed to each client. ODBC-enabled MS-Access databases without the use of the MS-Access front-end software can be designed to create a huge database (perhaps to 1 Terabyte/5 Billion rows in practicality - depends on whether it is a READ ONLY Data Warehouse or a READ/WRITE Database) where each MDB file is used as a: single table, group of tables, or partial table (common to all the MDB files, and where the data is logically kept segregated for ease of random access to 1, or perhaps 2, MDB files - each MDB file containing as many as 10 million rows).

FLAT FILE/SDBM, or MDB, relational database systems can employ file naming convention to make it easy for a DB application user-interface to determine which file(s) to look in. Example: A flat file named US_CENSUS_2010_TX_A.txt (or .mdb for MS-Access) would be one way to identify a file logically segregated to contain only data associated with Texas citizens whose last name began with the letter "A". A business would need to determine what logical segregation of data made the most sense for their operational needs. Server-side batch EDIT operations and heavy reporting could be performed during off hours. For common data statistics, a statistics table could be maintained (Server-side during off hours) which answered most user questions which would be an aggregate of the data across the entire database system (as in: Stats for the entire U.S., and Stats for each individual State of the 50 States - from the example given above).

For a discussion on this topic: http://www.perlmonks.org/?node=joint+database+technology

      #-- This Perl program retrieves 5 verses of King James Version Bible text
      #-- from a large Flat File (with fixed-length, "text" records) by random access lookup. 
      #-- The Flat File contains 180 complete copies of the KJV Bible, with a bogus
      #-- translation number (tr) assigned to each Bible copy (tr = 1 to 180)
      #-- to make a unique key: {translation_nbr + book_nbr + chapter_nbr + verse_nbr}.
      #-- Record offsets (in bytes) are persistently stored in a binary Perl SDBM database file,
      #-- of key/value pairs, tied to a program hash table. The value is the offset.
      #-- The key is {tr + bk + chp + ver} numbers combined/concatenated.
      #-- If $offset is a negative value, seek from BOTTOM/END of file.
      #-- If $offset is a positive value, seek from BEGIN/TOP of file.
      #-- Each Bible contains 31102 verses of text, of max length 528 charater each.
      #-- But with the compound index {tr + bk + chp + ver} added to the Bible text, 
      #--       for the purpose of proving the random access is working, and
      #-- MIMEbase64 encoding applied (to hide the Bible text), the fixed
      #-- length records have become 760 characters each. Decoding will occur as records are read.
      #-- The Flat File is just under 4 GIG. The SDBM file just under 1 GIG. 
      #-- There are over 5 Million records each, in both the Flat File and SDBM file.
      #--    [180 copies of the Bible times 31102 verses per Bible]
      #-- Flat File, random access, record lookup, is instantaneous. 
      #-- You can use Perl Portable Code: sysopen, syswrite, sysseek, sysread.
      #-- But the below example is Windows O/S specific Perl Code.
      #-- This example is a batch application process (no user front-end), having 5 hard-coded lookup keys.
      #-- You can build a user-interface to instead accept the lookup keys from user input: either typed in,
      #-- or selected from a GUI widget of preloaded values {tr, bk, chp, ver}. 
      #-- A RANGE of values could even be selected to print Bible verses for an entire Book (ex. tr="134", bk="01" i.e. Genesis)

      use Win32API::File 0.08 qw( :ALL );
      use Win32;
      use SDBM_File;
      use Fcntl;
      use MIME::Base64 qw(decode_base64);    

      $PWD=Win32::GetCwd();  #-- working directory
      
      #--- tie the external binary SDBM file contents (of key/value pairs) to a program hash table.
      tie( %BibleVersesIDX, "SDBM_File", '.\BibleFlatFile_760_31102_180_IDX', O_RDONLY, 0444 );

      if (tied %BibleVersesIDX) { print "BibleVersesIDX Hash now tied to external SDBM file\n\n"; }    
      else { print "Could not tie BibleVersesIDX Hash with external SDBM file - Aborting\n\n";  die; }
      
      #-- create a file handle to: open, and random access read from, the Flat File of Bible verses. 
      #-- the flat file already exists, and was preloaded with 5 million plus records.
      #-- if you are unfamiliar with the next line of code, it would be deceiving to you. 
      $hFILE = createFile("$PWD\\BibleFlatFile_760_31102_180.dat", "r");  #-- $hFile is a native Windows file handle
                 
      #--           tr bk chp ver 
      foreach $key ("00101001001", "09066022021", "09101001001", "18001001001", "18066022021") {  
            $offset=$BibleVersesIDX{$key};
            if ($offset < 0) {
                $pos=SetFilePointer( $hFILE, $offset, [], FILE_END);   #-- moves the file pointer to a specific record at $offset  
            } else {
                $pos=SetFilePointer( $hFILE, $offset, [], FILE_BEGIN); #-- moves the file pointer to a specific record at $offset     
            }
            #-- FYI: Don't rely on $pos, because if location is past 2 GIG mark, $pos (the return value) is wrong.
            #-- $offset will be an integer value to seek up to 2 GIG bytes from Top, or Bottom, of a 4 GIG file.

            ReadFile( $hFILE, $Buf, 760, [], [] );  #-- $Buf contains the 760 characters read in from the Flat File
            $decoded_Buf=decode_base64($Buf);   #-- MIMEBASE64 decoded to length 570 from 760
            $decoded_Buf=~s/ *$//;              #-- remove trailing spaces 
            print $decoded_Buf . "\n\n";        #-- print to the screen the decoded Bible verse fetched from the Flat File
      }
      exit;
      END {
         CloseHandle( $hFILE );
         untie( %BibleVersesIDX );
         sleep 5;
      }

Installation

Use chicken-install as usual. But some configuration can be done by defining certain features at compile-time.

Byte order: sdbm-little-endian or sdbm-big-endian set the read and write order of bytes in the file. If no byte order is specified, host order is used, as in the original implementation.
Hash function: sdbm-hash-djb selects an alternate hash function by Dan Bernstein. If no hash function is specified, the native SDBM hash function is used.

To define a feature, set it in CSC_OPTIONS before calling chicken-install:

CSC_OPTIONS="-Dsdbm-hash-djb -Dsdbm-big-endian" chicken-install sdbm

will configure sdbm to use the DJB hash and big-endian order.

Basic interface

[procedure] (open-database pathname #!key flags mode page-block-power dir-block-power) -> db

Opens existing SDBM database pathname or creates an empty database if pathname does not exist. The database resides in two files: pathname.dir (directory file) and pathname.pag (page file). Returns an opaque database object.

Optional keyword arguments are:

flags: flags passed to file-open, default: (+ open/rdwr open/creat)
mode: permissions passed to file-open, default: (+ perm/irwxu perm/irgrp perm/iroth)
page-block-power: bytes in each data page, as a power of 2; default: 12 (4096 bytes)
dir-block-power: bytes in each directory block, as a power of 2; default: 12 (4096 bytes)

The data page size limits the length of a key/value pair, so you may need to increase it to correspond with your maximum pair size. An undersized page can lead to frequent hash bucket splits and a bloated file size with many holes. An oversized page can incur disk performance overhead on read and write, since an entire page is read or written for every operation. Values between 4096 and 16384 bytes seem reasonable.

Note: The SDBM format has no database header, so you must always specify the same page-block-power and dir-block-power for a given database. The reference implementation uses page-block-power of 10 (1024 bytes) and dir-block-power of 12 (4096 bytes).

[procedure] (close-database db)

Close database associated with db.

[procedure] (fetch db key) -> val

Fetch key from SDBM database db, returning the associated value or #f if the key did not exist. The returned value is a string. key is normally a string; if not, it is converted into a string.

[procedure] (store! db key val #!optional (replace #t))

Store key, val pair into SDBM database db. val must be a string; key is converted into a string if not already.

If the key exists, and optional argument replace is #t (the default) then the pair will be replaced. If replace is #f instead, an error is returned.

[procedure] (delete! db key)

Delete key from SDBM database db. If key does not exist, an error is raised.

Enumeration

[procedure] (pair-iterator db) -> iter

Return a new pair iterator object that can be used to iterate over pairs in the SDBM database db. Pass this iterator to next-pair repeatedly.

[procedure] (next-pair iter) -> (key . val)

Return the next pair that iter, a pair-iterator, sees in the database. If there are no more pairs, returns #f. Otherwise, it returns a (key . val) pair, where both values are strings.

[procedure] (pair-fold db kons knil) -> kvs

Perform a fold over all pairs in SDBM database db. knil is the initial value. kons is a procedure of three arguments: (key val kvs). The return value of kons is passed to the next execution of kons in kvs.

Author

Jim Ursetto

Version history

0.1.0: Initial release

License

BSD