## dataframe

Tabular data structure implementation for data analysis.

## Documentation

The `dataframe` library provides an interface for representing numerical data in tables with rows and columns. It is inspired by the various dataframe implementations found in R, Python and Racket.

The `dataframe` library also provides procedures for loading and saving dataframes, as well as routines for descriptive statistics and linear regression.

### Columns

Each dataframe consists of a collection of columns. Each column is an object consisting of a unique key, a data collection, and an association list of properties. The following operations are defined on columns.

*[procedure]*

`(column? obj)`

Returns true if the given object is a column.

*[procedure]*

`(get-column-properties column)`

Returns an association list of the column's properties.

*[procedure]*

`(get-column-key column)`

Returns the key of the column.

*[procedure]*

`(get-column-collection column)`

Returns the data collection of the column.

*[procedure]*

`(column-deserialize column port)`

Loads the data collection of a column from the given port.

*[procedure]*

`(column-serialize column port)`

Stores the data collection of a column to the given port in an s-expression format.

### Creating data frames

*[procedure]*

`(make-data-frame [column-key-compare: compare-symbol])`

Creates and returns a new dataframe. The optional keyword argument `column-key-compare:` is a procedure that specifies how to compare column keys; the default compares keys as symbols.

*[procedure]*

`(df-insert-column df key collection properties)`

Inserts a new column with the given key, data collection, and properties. Returns a new dataframe with the inserted column.

*[procedure]*

`(df-insert-derived df parent-key key proc properties)`

Inserts a derived column, that is a column whose data elements are obtained by mapping a procedure onto the elements of an existing (parent) column. Returns a new dataframe with the inserted column.

*[procedure]*

`(df-insert-columns df lseq)`

Inserts the columns contained in the given lseq of column objects.
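As a minimal sketch of how these constructors combine, the following builds a small dataframe step by step. It assumes the same imports as the Examples section and that a plain list is an acceptable data collection (as it is there); the keys `x` and `x-squared` and the names `df0`, `df1`, and `df2` are illustrative only.

```scheme
;; Sketch only: imports mirror the Examples section of this document.
(import scheme yasos dataframe)

;; Start from an empty dataframe with the default key comparison.
(define df0 (make-data-frame))

;; Insert a column keyed by 'x, with a list as its data collection
;; and an empty property list.
(define df1 (df-insert-column df0 'x (list 1.0 2.0 3.0 4.0) '()))

;; Derive 'x-squared from the parent column 'x by mapping a procedure
;; over its elements.
(define df2
  (df-insert-derived df1 'x 'x-squared (lambda (x) (* x x)) '()))
```

Because each insertion procedure returns a new dataframe, the result has to be bound to a name (or passed on) to be used.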

### Accessing data frames

*[procedure]*

`(show df)`

Displays a subset of the rows and columns contained in the dataframe.

*[procedure]*

`(row-count df)`

Returns the number of rows in the dataframe.

*[procedure]*

`(df-column df key)`

Returns the column indicated by the given key.

*[procedure]*

`(df-columns df)`

Returns a lazy sequence containing the columns of the dataframe.

*[procedure]*

`(df-filter-columns df proc)`

Returns a filtered lseq of the columns of the dataframe according to the given filter predicate procedure.

*[procedure]*

`(df-select-columns df keys)`

Returns an lseq of the columns of the dataframe that have the keys enumerated in the given list of keys.

*[procedure]*

`(df-keys df)`

Returns the keys of all columns in the dataframe.

*[procedure]*

`(df-items df)`

Returns an lseq of the key-column pairs contained in the dataframe.
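The accessors above might be combined with the column operations as in the following sketch. It assumes the imports from the Examples section; the key `height` and its values are made up for illustration, and no printed output is shown because the exact representation of keys and collections is not specified here.

```scheme
;; Sketch only: imports mirror the Examples section of this document.
(import scheme yasos dataframe)

;; A one-column dataframe with a hypothetical 'height column.
(define df
  (df-insert-column (make-data-frame) 'height (list 1.62 1.75 1.80) '()))

;; Keys of all columns in the dataframe.
(display (df-keys df))
(newline)

;; Look up a column by key, then unpack its key and data collection.
(define height-column (df-column df 'height))
(display (get-column-key height-column))
(newline)
(display (get-column-collection height-column))
(newline)

;; Number of rows in the dataframe.
(display (row-count df))
(newline)
```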

*[procedure]*

`(apply-collections proc df key ...)`

Applies the given procedure to the data collections of the named columns of the dataframe and returns the result as a list.

*[procedure]*

`(apply-columns proc df key ...)`

Applies the given procedure to the named columns of the dataframe and returns the result as a list.

*[procedure]*

`(map-collections proc df key ...)`

Applies the given procedure to the data collections of the named columns of the dataframe and returns the result as a dataframe.

*[procedure]*

`(map-columns proc df key ...)`

Applies the given procedure to the named columns of the dataframe and returns the result as a dataframe.

*[procedure]*

`(reduce-collections proc df seed key ...)`

Folds the given procedure over the data collections of the named columns, starting from the given seed.

### Iterators

*[procedure]*

`(df-for-each-column df proc)`

Applies proc to each column.

*[procedure]*

`(df-for-each-collection df proc)`

Applies proc to the data collection of each column.

*[procedure]*

`(df-gen-rows df)`

Returns a generator procedure that returns the dataframe rows in succession.

*[procedure]*

`(df-gen-columns df)`

Returns a generator procedure that returns the dataframe columns in succession.
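A short sketch of the iterators follows, under the same import assumptions as the Examples section. The keys `a` and `b` are illustrative. Calling the generator with no arguments is an assumption based on the usual Scheme generator convention; how the end of the sequence is signalled is not documented here, so the sketch requests only the first row.

```scheme
;; Sketch only: imports mirror the Examples section of this document.
(import scheme yasos dataframe)

;; Two hypothetical columns of equal length.
(define df
  (df-insert-column
   (df-insert-column (make-data-frame) 'a (list 1 2 3) '())
   'b (list 10 20 30) '()))

;; Visit every column and print its key.
(df-for-each-column
 df
 (lambda (column)
   (display (get-column-key column))
   (newline)))

;; Obtain a row generator; each call is assumed to yield the next row.
(define next-row (df-gen-rows df))
(define first-row (next-row))
```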

### Descriptive statistics

*[procedure]*

`(describe df port)`

Displays a table with the min/max/mean/sdev of each column in the dataframe.

*[procedure]*

`(cmin df)`

Computes the minimum value of each column.

*[procedure]*

`(cmax df)`

Computes the maximum value of each column.

*[procedure]*

`(mean df)`

Computes the mean value of each column.

*[procedure]*

`(median df)`

Computes the median value of each column.

*[procedure]*

`(mode df)`

Computes the mode value of each column.

*[procedure]*

`(range df)`

Computes the difference between maximum and minimum value of each column.

*[procedure]*

`(percentile df)`

Computes the percentile values of each column.

*[procedure]*

`(variance df)`

Computes the variance of each column.

*[procedure]*

`(standard-deviation df)`

Computes the standard deviation of each column.

*[procedure]*

`(coefficient-of-variation df)`

Computes the coefficient of variation of each column.
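The statistics procedures take the whole dataframe and work column by column. The sketch below assumes, as in the Examples section, that they are exported by the `dataframe-statistics` module and that `describe` accepts `#f` as its port argument. The key `score` and its values are illustrative, and the shape of the returned per-column results is not specified here, so they are only bound to names.

```scheme
;; Sketch only: imports mirror the Examples section of this document.
(import scheme yasos dataframe dataframe-statistics)

;; A single hypothetical column of numeric scores.
(define df
  (df-insert-column (make-data-frame)
                    'score
                    (list 2.0 4.0 4.0 4.0 5.0 5.0 7.0 9.0)
                    '()))

;; Summary table (min/max/mean/sdev); #f as the port follows the
;; usage shown in the Examples section.
(describe df #f)

;; Individual per-column statistics.
(define means (mean df))
(define sdevs (standard-deviation df))
```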

### Regression and correlation

*[procedure]*

`(linear-regression df x y)`

Computes the linear regression between the columns with keys x and y.

*[procedure]*

`(correlation-coefficient df x y)`

Computes the correlation coefficient between the columns with keys x and y.
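As a hedged illustration, both procedures can be applied to a column and a column derived from it, with the columns named by key as in the Examples section. The keys `x` and `y` and the linear relation are made up for this sketch; the format of the returned values is not documented here, so they are only bound to names.

```scheme
;; Sketch only: imports mirror the Examples section of this document.
(import scheme yasos dataframe dataframe-statistics)

;; Hypothetical predictor column 'x.
(define df1
  (df-insert-column (make-data-frame) 'x (list 1.0 2.0 3.0 4.0 5.0) '()))

;; Response column 'y derived from 'x as y = 1 + 2x.
(define df2
  (df-insert-derived df1 'x 'y (lambda (x) (+ 1.0 (* 2.0 x))) '()))

;; Columns are identified by their keys.
(define fit (linear-regression df2 'x 'y))
(define r (correlation-coefficient df2 'x 'y))
```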

### I/O

*[procedure]*

`(df-serialize df port)`

Stores the dataframe in an s-expression format to the given port.

*[procedure]*

`(df-deserialize df port)`

Loads the data collections of the dataframe columns from the given port.
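A minimal sketch of saving a dataframe to a file, assuming the imports from the Examples section. The file name `example-df.scm` and the key `x` are hypothetical. Reading the data back would presumably pass an input port to `df-deserialize` in the same way, but its exact requirements on the receiving dataframe are not documented here, so that step is left out.

```scheme
;; Sketch only: imports mirror the Examples section of this document.
(import scheme yasos dataframe)

(define df
  (df-insert-column (make-data-frame) 'x (list 1 2 3) '()))

;; Write the dataframe as s-expressions to a hypothetical file.
(call-with-output-file "example-df.scm"
  (lambda (port)
    (df-serialize df port)))
```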

## Examples

```scheme
;; srfi-1 is imported for list-tabulate
(import scheme srfi-1 yasos dataframe dataframe-statistics)

(define df (make-data-frame))

;; base column: the integers -10 through 89
(define df1
  (df-insert-column df 'base (list-tabulate 100 (lambda (x) (- x 10))) '()))

;; exponential series
(define df2
  (df-insert-derived df1 'base 'exp (lambda (x) (* 2.0 (exp (* 0.1 x)))) '()))

(show df2 #f)
(describe df2 #f)
(linear-regression df2 'base 'exp)
```

## About this egg

### Author

Ivan Raikov

### Version history

- 0.1: Initial release

### License

Copyright 2019 Ivan Raikov.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

A full copy of the GPL license can be found at <http://www.gnu.org/licenses/>.