• egg

## Statistics

This library is a port of Larry Hunter's Lisp statistics library to chicken scheme.

The library provides a number of formulae and methods taken from the book "Fundamentals of Biostatistics" by Bernard Rosner (5th edition).

### Statistical Distributions

To use this library, you need to understand the underlying statistics. In brief:

The Binomial distribution is used when counting discrete events in a series of trials, each of which events has a probability p of producing a positive outcome. An example would be tossing a coin n times: the probability of a head is p, and the distribution gives the expected number of heads in the n trials. The binomial distribution is defined as B(n, p).

The Poisson distribution is used to count discrete events which occur with a known average rate. A typical example is the decay of radioactive elements. A poisson distribution is defined Pois(mu).

The Normal distribution is used for real-valued events which cluster around a specific mean with a symmetric variance. A typical example would be the distribution of people's heights. A normal distribution is defined N(mean, variance).

### Provided Functions

#### Utilities

[procedure] (average-rank value sorted-values)

returns the average position of given value in the list of sorted values: the rank is based from 1.

```> (average-rank 2 '(1 2 2 3 4))
5/2```
[procedure] (beta-incomplete x a b)
[procedure] (bin-and-count items n)

Divides the range of the list of items into n bins, and returns a vector of the number of items which fall into each bin.

```> (bin-and-count '(1 1 2 3 3 4 5) 5)
#(2 1 2 1 1)```
[procedure] (combinations n k)

returns the number of ways to select k items from n, where the order does not matter.

[procedure] (factorial n)

returns the factorial of n.

[procedure] (find-critical-value p-function p-value #:increasing?)

given a monotonic function p-function taking a single value x to y, returns the value of x which makes (p-function x) closest to p-value. A boolean keyword parameter #:increasing? determines if function should be increasing or decreasing (the default).

[procedure] (fisher-z-transform r)

returns the transformation of a correlation coefficient r into an approximately normal distribution.

[procedure] (gamma-incomplete a x)
[procedure] (gamma-ln x)
[procedure] (permutations n k)

returns the number of ways to select k items from n, where the order does matter.

[procedure] (random-normal mean sd)

returns a random number distributed with specified mean and standard deviation.

[procedure] (random-pick items)

returns a random item from the given list of items.

[procedure] (random-sample n items)

returns a random sample from the list of items without replacement of size n.

[procedure] (random-weighted-sample n items weights)

returns a random sample from the list of items without replacement of size n, where each sample has a defined probability of selection (weight).

[procedure] (sign n)

returns 0, 1 or -1 according to if n is zero, positive or negative.

[procedure] (square n)
[procedure] (cumsum sequences)

returns the cumulative sum of a sequence.

#### Descriptive statistics

These functions provide information on a given list of numbers, the items. Note, the list does not have to be sorted.

[procedure] (mean items)

returns the arithmetic mean of the items (the sum of the numbers divided by the number of numbers).

`(mean '(1 2 3 4 5)) => 3`
[procedure] (median items)

returns the value which separates the upper and lower halves of the list of numbers.

`(median '(1 2 3 4)) => 5/2`
[procedure] (mode items)

returns two values. The first is a list of the modes and the second is the frequency. (A mode of a list of numbers is the most frequently occurring value.)

```> (mode '(1 2 3 4))
(1 2 3 4)
1
> (mode '(1 2 2 3 4))
(2)
2
> (mode '(1 2 2 3 3 4))
(2 3)
2```
[procedure] (geometric-mean items)

returns the geometric mean of the items (the result of multiplying the items together and then taking the nth root, where n is the number of items).

`(geometric-mean '(1 2 3 4 5)) => 2.60517108469735`
[procedure] (range items)

returns the difference between the biggest and the smallest value from the list of items.

`(range '(5 1 2 3 4)) => 4`
[procedure] (percentile items percent)

returns the item closest to the percent value if the items are sorted into order; the returned item may be in the list, or the average of adjacent values.

```(percentile '(1 2 3 4) 50) => 5/2
(percentile '(1 2 3 4) 67) => 3```
[procedure] (variance items)
[procedure] (standard-deviation items)
[procedure] (coefficient-of-variation items)

returns 100 * (std-dev / mean) of the items.

`(coefficient-of-variation '(1 2 3 4)) => 51.6397779494322`
[procedure] (standard-error-of-the-mean items)

returns std-dev / sqrt(length items).

` (standard-error-of-the-mean '(1 2 3 4)) => 0.645497224367903`
[procedure] (mean-sd-n items)

returns three values, one for the mean, one for the standard deviation, and one for the length of the list.

```> (mean-sd-n '(1 2 3 4))
5/2
1.29099444873581
4```

#### Distributional functions

[procedure] (binomial-probability n k p)

returns the probability that the number of positive outcomes for a binomial distribution B(n, p) is k.

```> (do-ec (: i 0 11)
(format #t "i = ~d P = ~f~&" i (binomial-probability 10 i 0.5)))
i = 0 P = 0.0009765625
i = 1 P = 0.009765625
i = 2 P = 0.0439453125
i = 3 P = 0.1171875
i = 4 P = 0.205078125
i = 5 P = 0.24609375
i = 6 P = 0.205078125
i = 7 P = 0.1171875
i = 8 P = 0.0439453125
i = 9 P = 0.009765625
i = 10 P = 0.0009765625```
[procedure] (binomial-cumulative-probability n k p)

returns the probability that less than k positive outcomes occur for a binomial distribution B(n, p).

```> (do-ec (: i 0 11)
(format #t "i = ~d P = ~f~&" i (binomial-cumulative-probability 10 i 0.5)))
i = 0 P = 0.0
i = 1 P = 0.0009765625
i = 2 P = 0.0107421875
i = 3 P = 0.0546875
i = 4 P = 0.171875
i = 5 P = 0.376953125
i = 6 P = 0.623046875
i = 7 P = 0.828125
i = 8 P = 0.9453125
i = 9 P = 0.9892578125
i = 10 P = 0.9990234375```
[procedure] (binomial-ge-probability n k p)

returns the probability of k or more positive outcomes for a binomial distribution B(n, p).

[procedure] (binomial-le-probability n k p)

returns the probability k or fewer positive outcomes for a binomial distribution B(n, p).

[procedure] (poisson-probability mu k)

returns the probability of k events occurring when the average is mu.

```> (do-ec (: i 0 20)
(format #t "P(X=~2d) = ~,4f~&" i (poisson-probability 10 i)))
P(X= 0) = 0.0000
P(X= 1) = 0.0005
P(X= 2) = 0.0023
P(X= 3) = 0.0076
P(X= 4) = 0.0189
P(X= 5) = 0.0378
P(X= 6) = 0.0631
P(X= 7) = 0.0901
P(X= 8) = 0.1126
P(X= 9) = 0.1251
P(X=10) = 0.1251
P(X=11) = 0.1137
P(X=12) = 0.0948
P(X=13) = 0.0729
P(X=14) = 0.0521
P(X=15) = 0.0347
P(X=16) = 0.0217
P(X=17) = 0.0128
P(X=18) = 0.0071
P(X=19) = 0.0037```
[procedure] (poisson-cumulative-probability mu k)

returns the probability of less than k events occurring when the average is mu.

```> (do-ec (: i 0 20)
(format #t "P(X=~2d) = ~,4f~&" i (poisson-cumulative-probability 10 i)))
P(X= 0) = 0.0000
P(X= 1) = 0.0000
P(X= 2) = 0.0005
P(X= 3) = 0.0028
P(X= 4) = 0.0103
P(X= 5) = 0.0293
P(X= 6) = 0.0671
P(X= 7) = 0.1301
P(X= 8) = 0.2202
P(X= 9) = 0.3328
P(X=10) = 0.4579
P(X=11) = 0.5830
P(X=12) = 0.6968
P(X=13) = 0.7916
P(X=14) = 0.8645
P(X=15) = 0.9165
P(X=16) = 0.9513
P(X=17) = 0.9730
P(X=18) = 0.9857
P(X=19) = 0.9928```
[procedure] (poisson-ge-probability mu k)

returns the probability of k or more events occurring when the average is mu.

[procedure] (normal-pdf x mean variance)

returns the likelihood of x given a normal distribution with stated mean and variance.

```> (do-ec (: i 0 11)
(format #t "~3d ~,4f~&" i (normal-pdf i 5 4)))
0 0.0088
1 0.0270
2 0.0648
3 0.1210
4 0.1760
5 0.1995
6 0.1760
7 0.1210
8 0.0648
9 0.0270
10 0.0088```
[procedure] (convert-to-standard-normal x mean variance)

returns a value for x rescaling the given normal distribution to a standard N(0, 1).

```> (convert-to-standard-normal 5 6 2)
-1/2```
[procedure] (phi x)

returns the cumulative distribution function (CDF) of the standard normal distribution.

```> (do-ec (: x -2 2 0.4)
(format #t "~4,1f ~,4f~&" x (phi x)))
-2.0 0.0228
-1.6 0.0548
-1.2 0.1151
-0.8 0.2119
-0.4 0.3446
0.0 0.5000
0.4 0.6554
0.8 0.7881
1.2 0.8849
1.6 0.9452```
[procedure] (z percentile)

returns the inverse of the standard normal distribution. Input is a percentile, between 0.0 and 1.0.

[procedure] ( t-distribution degrees-of-freedom percentile)

returns the point in the t-distribution given the degrees-of-freedom and percentile. degrees-of-freedom must be a positive integer, and percentile a value between 0.0 and 1.0.

[procedure] (chi-square degrees-of-freedom percentile)

returns the point at which chi-square distribution has percentile to its left, using given degrees-of-freedom.

[procedure] (chi-square-cdf x degrees-of-freedom)

returns the probability that a random variable is to the left of x using the chi-square distribution with given degrees-of-freedom.

#### Confidence intervals

These functions report bounds for an observed property of a distribution: the bounds are tighter as the confidence level, alpha, varies from 0.0 to 1.0.

[procedure] (binomial-probability-ci n p alpha)

returns two values, the upper and lower bounds on an observed probability p from n trials with confidence (1-alpha).

```> (binomial-probability-ci 10 0.8 0.9)
0.724273681640625
0.851547241210938
; 2 values```
[procedure] (poisson-mu-ci k alpha)

returns two values, the upper and lower bounds on the poisson parameter if k events are observed; the bound is for confidence (1-alpha).

```> (poisson-mu-ci 10 0.9)
8.305419921875
10.0635986328125
; 2 values```
[procedure] (normal-mean-ci mean standard-deviation k alpha)

returns two values, the upper and lower bounds on the mean of the normal distibution of k events are observed; the bound is for confidence (1-alpha).

```> (normal-mean-ci 0.5 0.1 10 0.8)
0.491747852700165
0.508252147299835
; 2 values```
[procedure] (normal-mean-ci-on-sequence items alpha)

returns two values, the upper and lower bounds on the mean of the given items, assuming they are normally distributed; the bound is for confidence (1-alpha).

```> (normal-mean-ci-on-sequence '(1 2 3 4 5) 0.9)
2.40860081649174
3.59139918350826
; 2 values```
[procedure] (normal-variance-ci standard-deviation k alpha)

returns two values, the upper and lower bounds on the variance of the normal distibution of k events are observed; the bound is for confidence (1-alpha).

[procedure] (normal-variance-ci-on-sequence items alpha)

returns two values, the upper and lower bounds on the variance of the given items, assuming they are normally distributed; the bound is for confidence (1-alpha).

[procedure] normal-sd-ci standard-deviation k alpha)

returns two values, the upper and lower bounds on the standard deviation of the normal distibution of k events are observed; the bound is for confidence (1-alpha).

[procedure] (normal-sd-ci-on-sequence sequence items)

returns two values, the upper and lower bounds on the standard deviation of the given items, assuming they are normally distributed; the bound is for confidence (1-alpha).

#### Hypothesis testing

These functions report on the significance of an observed sample against a given distribution.

##### (parametric)
[procedure] (z-test x-bar n #:mu #:sigma #:tails)

Given x-bar the sample mean, n the number in the sample, #:mu the distribution mean (defaults to 0), #:sigma the distribution standard deviation (defaults to 1), and #:tails the significance to report on:

• ':both, the probability of the difference between x-bar and #:mu
• ':positive, the probability that observation is >= x-bar
• ':negative, the probability that observation is <= x-bar

e.g. given a distribution with mean 50 and standard deviation 10

```; probability that a single observation is <= 40
> (z-test 40 1 #:mu 50 #:sigma 10 #:tails ':negative)
0.158655
; probability that 10 observations are <= 40
> (z-test 40 10 #:mu 50 #:sigma 10 #:tails ':negative)
0.000783
; probability that 5 observations give a mean of 40
> (z-test 40 5 #:mu 50 #:sigma 10)
0.025347```
[procedure] (z-test-on-sequence observations #:mu #:sigma #:tails)

As for z-test except x-bar and n are computed from given observations.

[procedure] (t-test-one-sample x-bar sd n mu #:tails)

Given observed data with mean x-bar, standard devation sd and number of observations n (n < 30), return the significance of the sample compared with the population mean mu. #:tails is one of:

• ':both two-sided (default)
• ':positive one-sided, x-bar >= mu
• ':negative one-sided, x-bar <= mu
[procedure] (t-test-one-sample-on-sequence observations mu #:tails)

As for t-test-one-sample except x-bar, sd and n are computed from given observations.

[procedure] (t-test-paired t-bar sd n #:tails)

Computes the significance of the differences between two sequences of data: the differences are given as their mean, t-bar, standard deviation, sd, and number of measurements, n.

[procedure] (t-test-paired-on-sequences before after #:tails)

Computes the significance of the difference between two sequences of data: one before an experimental change and one after. #:tails is as for t-significance.

```> (t-test-paired-on-sequences '(4 3 5) '(1 1 3))
0.0198039411803931```
[procedure] (t-test-two-sample mean-1 sd-1 n-1 mean-2 sd-2 n-2 #:variances-equal? #:variance-significance-cutoff #:tails)

Computes the significance of the difference of two means given the sample standard deviations and sizes.

[procedure] (t-test-two-sample-on-sequences sequence-1 sequence-2 #:tails)

Significance of difference of two sequences of observations.

[procedure] (f-test variance-1 n1 variance-2 n2 #:tails)

Tests for the equality of two variances.

[procedure] (chi-square-test-one-sample observed-variance sample-size test-variance #:tails)

Tests for significance of difference between an observed and a test variance.

[procedure] (binomial-test-one-sample p-hat n p #:tails #:exact?)

Returns the significance of a one sample test with n observations, observed probability p-hat and expected probability p.

[procedure] (binomial-test-two-sample p-hat-1 n-1 p-hat-2 n-2 #:tails #:exact?)

Returns the significance of a two sample test.

[procedure] (fisher-exact-test a b c d #:tails)

Given a 2x2 contingency table, returns a p value using Fisher's exact test. a and b form the first row of the contingency table, c and d the second row.

[procedure] (mcnemars-test a-discordant-count b-discordant-count #:exact?)

For measuring effectiveness of, e.g., one treatment over another. a-discordant-count is the number of times when A worked, b-discordant-count the number of times B worked.

[procedure] (poisson-test-one-sample observed mu #:tails #:approximate?)

Computes significance of the number of observed events under a Poisson distribution against mu expected events.

##### (non parametric)
[procedure] (sign-test plus-count minus-count #:exact? #:tails)
[procedure] (sign-test-on-sequence sequence-1 sequence-2 #:exact? #:tails)

Takes two equal-sized sequences of observations, and reports if the entries of one are different to those in the other.

[procedure] (wilcoxon-signed-rank-test differences #:tails)

Given at least 16 differences, reports if the positive differences are significantly larger or smaller than the negative differences.

[procedure] (wilcoxon-signed-rank-test-on-sequences sequence-1 sequence-2 #:tails)

Given two sequences of at least 16 observations, computes wilcoxon-signed-rank-test on the differences.

[procedure] (chi-square-test-rxc contingency-table)

Given a contingency table (a SRFI-63 array), returns significance of relation between row and column variable.

[procedure] (chi-square-test-for-trend row1-counts row2-counts)

Returns p significance of trend, and prints a string to show if increasing or decreasing.

#### Sample size estimates

[procedure] (t-test-one-sample-sse mean-1 mean-2 sigma-1 #:alpha #:1-beta #:tails)

Returns the size of sample necessary to distinguish a normally distributed sample with mean-2 from a population mean-1 standard deviation sigma-1. The significance #:alpha (defaults to 0.05), power #:1-beta (0.95) and sides #:tails (':both) may be altered.

```> (t-test-one-sample-sse 5.0 5.2 0.5)
163```
[procedure] (t-test-two-sample-sse mean-1 sigma-1 mean-2 sigma-2 #:alpha #:1-beta #:tails #:sample-ratio)

Returns the size of sample necessary to distinguish a normally distributed sample N(mean-1, sigma-1) from a normally distributed sample N(mean-2, sigma-2). The significance #:alpha (defaults to 0.05), power #:1-beta (0.95), sides #:tails (':both) and sample-ratio #:sample-ratio (1) may be altered.

[procedure] (t-test-paired-sse difference-mean difference-sigma #:alpha #:1-beta #:tails)

Returns the size of sample to produce a given mean and standard deviation in the differences of two samples.

[procedure] (binomial-test-one-sample-sse p-estimated p-null #:alpha #:1-beta #:tails)

Returns the size of sample needed to test whether an observed probability is significantly different from a particular binomial null hypothesis with a significance alpha and a power 1-beta.

[procedure] (binomial-test-two-sample-sse p-one p-two #:alpha #:1-beta #:tails #:sample-ratio)

Returns the size of sample needed to test if given two binomial probabilities are significantly different. #:sample-ratio can be given if the two samples differ in size.

[procedure] (binomial-test-paired-sse pd pa #:alpha #:1-beta #:tails)

Sample size estimate for McNemar's discordant pairs test.

[procedure] (correlation-sse rho #:alpha #:1-beta)

Returns the size of sample necessary to find a correlation of value rho with significance #:alpha (defaults to 0.05) and power #:1-beta (defaults to 0.95).

```> (correlation-sse 0.80 #:alpha 0.05 #:1-beta 0.9)
11```

#### Correlation and regression

[procedure] (linear-regression xs ys)

Given a line definition as lists of point coordinates, first prints to the terminal and then returns 5 values for the best fitting line through the points:

• the y-intercept
• the slope
• the correlation coefficient, r
• the square of the correlation coefficient, r^2
• the significance of the difference of the slope from zero, p

(This is also called the Pearson correlation; used when relation expected to be linear. Also see spearman-rank-correlation.)

```> (linear-regression '(1.0 2.0 3.0) '(0.1 0.3 0.8))
Intercept = -0.3, slope = 0.35, r = 0.970725343394151, R^2 = 0.942307692307692, p = 0.154420958311267
-0.3
0.35
0.970725343394151
0.942307692307692
0.154420958311267
; 5 values```
[procedure] (correlation-coefficient xs ys)

As above, but only returns the value of r:

```> (correlation-coefficient '(1.0 2.0 3.0) '(0.1 0.3 0.8))
0.970725343394151```
[procedure] (correlation-test-two-sample r1 n1 r2 n2 #:tails)

Returns the significance of the similarity between two correlations. #:tails determines how the comparison is made: ':both measures the difference, ':negative if r1 < r2 and #':positive if r2 > r1.

[procedure] (correlation-test-two-sample-on-sequences points-1 points-2 #:tails)

As above, but computes the correlations from given lists of points.

[procedure] (spearman-rank-correlation xs ys)

Returns two values, the Spearman Rank measure of correlation between the given lists of point coordinates, and the p-significance of the correlation. (This correlation is used for non-linear relations; compare with linear-regression.)

#### Significance test functions

[procedure] (t-significance t-value degrees-of-freedom #:tails)

returns the probability of t-value for given degrees-of-freedom. The keyword #:tails modifies the calculation to be two-sided (the default) with ':both, or one-sided, ':positive or ':negative.

```> (t-significance 0.2 5)
0.849360513995829
> (t-significance 0.2 5 #:tails ':positive)
0.424680256997915
> (t-significance 0.2 5 #:tails ':negative)
0.575319743002086```
[procedure] (f-significance f-value numerator-dof denominator-dof #:one-tailed?)

returns the probability of f-value for given numerator-dof and denominator-dof. The boolean keyword #:one-tailed? indicates if calculation is two-sided (the default) or not.

```> (f-significance 1.5 8 2)
0.920449812578091
> (f-significance 1.5 8 2 #:one-tailed? #t)
0.460224906289046```

### Authors

Peter Lane wrote the Scheme version of this library. The original Lisp version was written by Larry Hunter.

### Repository

This egg is hosted on the CHICKEN Subversion repository:

https://anonymous@code.call-cc.org/svn/chicken-eggs/release/5/statistics

If you want to check out the source code repository of this egg and you are not familiar with Subversion, see this page.

GPL version 3.0.

### Version History

• 0.12: removed GSL dependency
• 0.11: refactoring correlation and regression interface to take two separate dataset arguments
• 0.9: ported to CHICKEN 5
• 0.8: added cumsum and random-weighted-sample
• 0.5: fixed warning in compilation (thanks to Felix for pointing it out)
• 0.4: all functions should now be working
• 0.3: some error fixes and addition of tests for majority of functions
• 0.2: fixed some errors in keywords and find-critical-value
• 0.1: initial package