Dataset Utilities

A set of routines to load and manage datasets for machine learning / data mining tasks.

A dataset is a table:


Outlook  Temperature  Humidity  Windy  Plays
sunny    hot          high      false  no
sunny    hot          high      true   no

Each column in the table is an attribute, and each row is an instance. Instances have values for each attribute. The whole table is called a relation, and can be given a name.

Exported Procedures

Creating datasets

[procedure] (make-nominal-attribute name value-1 ...)

Creates a nominal attribute with given values, e.g.:

> (make-nominal-attribute 'outlook 'sunny 'overcast 'rainy)

[procedure] (make-numeric-attribute name)

Creates a numeric attribute, e.g.:

> (make-numeric-attribute 'temperature)

[procedure] (make-relation name attributes data)

Creates a relation with the given name. attributes must be a list of attributes, as created by make-nominal-attribute or make-numeric-attribute. data must be a list of lists: each sublist represents one instance and gives that instance's value for every attribute, in the order the attributes are listed.

> (make-relation 'plays-tennis
                  (list (make-nominal-attribute 'outlook 'sunny 'overcast 'rainy)
                        (make-nominal-attribute 'temperature 'hot 'mild 'cool)
                        (make-nominal-attribute 'humidity 'high 'normal)
                        (make-nominal-attribute 'windy 'true 'false)
                        (make-nominal-attribute 'plays 'yes 'no))
                  '((sunny hot high false no)
                    (sunny hot high true no)
                    (overcast hot high false yes)
                    ...
                    (rainy mild high true no)))

Managing datasets

[procedure] (attribute-name attribute)

Returns the name of given attribute.

[procedure] (attribute-definition attribute)

Returns a definition of the type of the given attribute. The definition identifies the attribute as either nominal or numeric.

[procedure] (class-probability relation attribute-name value)

Returns the proportion of instances in relation whose value for attribute-name is value.
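The proportion described here can be sketched in plain Scheme. This is only an illustration, not the library's implementation: the procedure name sketch-class-probability is hypothetical, instances are plain lists rather than a relation, and the attribute is selected by position instead of by name. The data are the four rows shown in the make-relation example.

```scheme
;; Sketch only: counts how many instances hold `value` at position `index`
;; and divides by the total number of instances.
(define (sketch-class-probability instances index value)
  (let loop ((rest instances) (matches 0) (total 0))
    (cond ((null? rest)
           (if (= total 0) 0 (/ matches total)))
          ((equal? (list-ref (car rest) index) value)
           (loop (cdr rest) (+ matches 1) (+ total 1)))
          (else
           (loop (cdr rest) matches (+ total 1))))))

(define tennis
  '((sunny hot high false no)
    (sunny hot high true no)
    (overcast hot high false yes)
    (rainy mild high true no)))

;; Two of the four rows have 'sunny in position 0 (the outlook column).
(sketch-class-probability tennis 0 'sunny)
```

Because the counts are exact integers, the sketch returns an exact rational; the library may return a floating-point number instead.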

[procedure] (entropy relation attribute-name)

Computes the entropy of the given relation, using attribute-name to divide the relation into groups, one per attribute value. attribute-name should name a nominal attribute.
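As a rough illustration of the quantity involved, the following sketch computes Shannon entropy over a plain list of nominal values, such as one column of a relation. The names sketch-count-values and sketch-entropy are hypothetical, and the use of base-2 logarithms is an assumption; the library's own procedure works on a relation and may differ in both respects.

```scheme
;; Count occurrences of each distinct value.
;; Returns an association list of (value . count) pairs.
(define (sketch-count-values values)
  (let loop ((rest values) (counts '()))
    (if (null? rest)
        counts
        (let ((entry (assoc (car rest) counts)))
          (if entry
              (begin (set-cdr! entry (+ (cdr entry) 1))
                     (loop (cdr rest) counts))
              (loop (cdr rest) (cons (cons (car rest) 1) counts)))))))

;; Entropy in bits: minus the sum of p * log2(p) over the distinct values,
;; where p is the proportion of the list taken by each value.
(define (sketch-entropy values)
  (let ((total (length values)))
    (let loop ((counts (sketch-count-values values)) (sum 0.0))
      (if (null? counts)
          sum
          (let ((p (/ (cdar counts) total)))
            (loop (cdr counts)
                  (- sum (* p (/ (log p) (log 2))))))))))

(sketch-entropy '(yes yes no no))   ; an even split gives 1 bit
(sketch-entropy '(no no yes no))    ; an uneven split gives roughly 0.81
```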

[procedure] (filter-instances relation attribute-name value)

Returns a new relation containing those instances of relation which have the given value for attribute-name.

[procedure] (find-attribute-index relation attribute-name)

Returns the index number of given attribute name in relation.

[procedure] (get-attribute-values relation attribute-name)

Returns the values taken by instances in relation for given attribute name.

[procedure] (information-gain relation target-class attribute-name)

Computes the information gain obtained by splitting the data in relation on attribute-name, that is, the reduction in entropy relative to the data as they are. Entropy is computed with respect to target-class.
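A minimal sketch of the usual definition of information gain follows: the entropy of the target column, minus the size-weighted entropy of the target column within each group produced by the split. All names here (log2, keep, distinct, column, sketch-entropy, sketch-information-gain) are hypothetical helpers, instances are plain lists indexed by position, and base-2 logarithms are assumed. It should not be read as the library's implementation.

```scheme
(define (log2 x) (/ (log x) (log 2)))

;; Keep the elements of lst that satisfy pred.
(define (keep pred lst)
  (cond ((null? lst) '())
        ((pred (car lst)) (cons (car lst) (keep pred (cdr lst))))
        (else (keep pred (cdr lst)))))

;; Distinct elements of lst, in order of first appearance.
(define (distinct lst)
  (let loop ((rest lst) (seen '()))
    (cond ((null? rest) (reverse seen))
          ((member (car rest) seen) (loop (cdr rest) seen))
          (else (loop (cdr rest) (cons (car rest) seen))))))

;; Values found at position index in every instance.
(define (column instances index)
  (map (lambda (inst) (list-ref inst index)) instances))

;; Entropy in bits of a list of nominal values.
(define (sketch-entropy values)
  (let ((total (length values)))
    (apply +
           (map (lambda (v)
                  (let ((p (/ (length (keep (lambda (x) (equal? x v)) values))
                              total)))
                    (- (* p (log2 p)))))
                (distinct values)))))

;; Entropy of the target column before the split, minus the weighted
;; entropy of the target column inside each group after the split.
(define (sketch-information-gain instances target-index attr-index)
  (let ((total (length instances)))
    (- (sketch-entropy (column instances target-index))
       (apply +
              (map (lambda (v)
                     (let ((subset (keep (lambda (inst)
                                           (equal? (list-ref inst attr-index) v))
                                         instances)))
                       (* (/ (length subset) total)
                          (sketch-entropy (column subset target-index)))))
                   (distinct (column instances attr-index)))))))

(define tennis
  '((sunny hot high false no)
    (sunny hot high true no)
    (overcast hot high false yes)
    (rainy mild high true no)))

;; Position 4 is the target (plays); position 3 is windy.
(sketch-information-gain tennis 4 3)
```

On these four rows, splitting on outlook (position 0) leaves every group pure, so its gain equals the full entropy of the target column, while splitting on windy yields a smaller gain.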

[procedure] (relation-attributes relation)

Returns a list of attributes for given relation.

[procedure] (relation-data relation)

Returns a list of the instances in the given relation.

[procedure] (relation-name relation)

Returns the name of given relation.

[procedure] (split-instances relation attribute-name)

Given a nominal attribute, returns a list of relations, one per value of attribute-name, each containing the instances of relation that share that value.
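The grouping idea can be sketched on plain lists as follows. The name sketch-split-instances is hypothetical, the attribute is selected by position rather than by name, and the result is an association list of value and instances rather than a list of relations, so this only illustrates the partitioning, not the library's return format.

```scheme
;; Group instances by the value found at position index.
;; Returns a list of (value instance ...) entries, in order of first
;; appearance of each value.
(define (sketch-split-instances instances index)
  (let loop ((rest instances) (groups '()))
    (if (null? rest)
        (reverse groups)
        (let* ((inst (car rest))
               (value (list-ref inst index))
               (entry (assoc value groups)))
          (if entry
              ;; Value already seen: append this instance to its group.
              (begin (set-cdr! entry (append (cdr entry) (list inst)))
                     (loop (cdr rest) groups))
              ;; New value: start a new group.
              (loop (cdr rest) (cons (list value inst) groups)))))))

(define tennis
  '((sunny hot high false no)
    (sunny hot high true no)
    (overcast hot high false yes)
    (rainy mild high true no)))

;; Three groups: sunny (two instances), overcast (one), rainy (one).
(sketch-split-instances tennis 0)
```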

Metrics

[procedure] (euclidean-distance instance-1 instance-2)

Computes the Euclidean distance between the two instances.
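For reference, the Euclidean distance is the square root of the sum of squared differences between corresponding values. The sketch below assumes both instances are equal-length lists of numbers; the name sketch-euclidean-distance is hypothetical, and how the library treats nominal values is not covered here.

```scheme
;; Square root of the sum of squared per-attribute differences.
(define (sketch-euclidean-distance a b)
  (sqrt (apply +
               (map (lambda (x y)
                      (let ((d (- x y)))
                        (* d d)))
                    a b))))

;; The legs 3 and 4 of a right triangle give a distance of 5.
(sketch-euclidean-distance '(0 0) '(3 4))
```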

Importing Data

[procedure] (read-arff filename)

Reads an ARFF definition from the given filename and returns a relation. Currently only nominal and numeric attribute types are supported; sparse files are not supported.

Author

Peter Lane.

License

GPL version 3.0.

Version History

in trunk.