Statistics

Statistics

This library is a port of Larry Hunter's Lisp statistics library to chicken scheme.

The library provides a number of formulae and methods taken from the book "Fundamentals of Biostatistics" by Bernard Rosner (5th edition).

Statistical Distributions

To use this library, you need to understand the underlying statistics. In brief:

The Binomial distribution is used when counting discrete events in a series of trials, each of which events has a probability p of producing a positive outcome. An example would be tossing a coin n times: the probability of a head is p, and the distribution gives the expected number of heads in the n trials. The binomial distribution is defined as B(n, p).

The Poisson distribution is used to count discrete events which occur with a known average rate. A typical example is the decay of radioactive elements. A poisson distribution is defined Pois(mu).

The Normal distribution is used for real-valued events which cluster around a specific mean with a symmetric variance. A typical example would be the distribution of people's heights. A normal distribution is defined N(mean, variance).

Provided Functions

Utilities

[procedure] (average-rank value sorted-values)

returns the average position of given value in the list of sorted values: the rank is based from 1.

> (average-rank 2 '(1 2 2 3 4))
5/2

[procedure] (beta-incomplete x a b)
[procedure] (bin-and-count items n)

Divides the range of the list of items into n bins, and returns a vector of the number of items which fall into each bin.

> (bin-and-count '(1 1 2 3 3 4 5) 5)
#(2 1 2 1 1)

[procedure] (combinations n k)

returns the number of ways to select k items from n, where the order does not matter.

[procedure] (factorial n)

returns the factorial of n.

[procedure] (find-critical-value p-function p-value #:increasing?)

given a monotonic function p-function taking a single value x to y, returns the value of x which makes (p-function x) closest to p-value. A boolean keyword parameter #:increasing? determines if function should be increasing or decreasing (the default).

[procedure] (fisher-z-transform r)

returns the transformation of a correlation coefficient r into an approximately normal distribution.

[procedure] (gamma-incomplete a x)
[procedure] (gamma-ln x)
[procedure] (permutations n k)

returns the number of ways to select k items from n, where the order does matter.

[procedure] (random-normal mean sd)

returns a random number distributed with specified mean and standard deviation.

[procedure] (random-pick items)

returns a random item from the given list of items.

[procedure] (random-sample n items)

returns a random sample from the list of items without replacement of size n.

[procedure] (random-weighted-sample n items weights)

returns a random sample from the list of items without replacement of size n, where each sample has a defined probability of selection (weight).

[procedure] (sign n)

returns 0, 1 or -1 according to if n is zero, positive or negative.

[procedure] (square n)
[procedure] (cumsum sequences)

returns the cumulative sum of a sequence.

Descriptive statistics

These functions provide information on a given list of numbers, the items. Note, the list does not have to be sorted.

[procedure] (mean items)

returns the arithmetic mean of the items (the sum of the numbers divided by the number of numbers).

(mean '(1 2 3 4 5)) => 3

[procedure] (median items)

returns the value which separates the upper and lower halves of the list of numbers.

(median '(1 2 3 4)) => 5/2

[procedure] (mode items)

returns two values. The first is a list of the modes and the second is the frequency. (A mode of a list of numbers is the most frequently occurring value.)

> (mode '(1 2 3 4))
(1 2 3 4)
1
> (mode '(1 2 2 3 4))
(2)
2
> (mode '(1 2 2 3 3 4))
(2 3)
2

[procedure] (geometric-mean items)

returns the geometric mean of the items (the result of multiplying the items together and then taking the nth root, where n is the number of items).

(geometric-mean '(1 2 3 4 5)) => 2.60517108469735

[procedure] (range items)

returns the difference between the biggest and the smallest value from the list of items.

(range '(5 1 2 3 4)) => 4

[procedure] (percentile items percent)

returns the item closest to the percent value if the items are sorted into order; the returned item may be in the list, or the average of adjacent values.

(percentile '(1 2 3 4) 50) => 5/2
(percentile '(1 2 3 4) 67) => 3

[procedure] (variance items)
[procedure] (standard-deviation items)
[procedure] (coefficient-of-variation items)

returns 100 * (std-dev / mean) of the items.

(coefficient-of-variation '(1 2 3 4)) => 51.6397779494322

[procedure] (standard-error-of-the-mean items)

returns std-dev / sqrt(length items).

 (standard-error-of-the-mean '(1 2 3 4)) => 0.645497224367903

[procedure] (mean-sd-n items)

returns three values, one for the mean, one for the standard deviation, and one for the length of the list.

> (mean-sd-n '(1 2 3 4))
5/2
1.29099444873581
4

Distributional functions

[procedure] (binomial-probability n k p)

returns the probability that the number of positive outcomes for a binomial distribution B(n, p) is k.

> (do-ec (: i 0 11) 
         (format #t "i = ~d P = ~f~&" i (binomial-probability 10 i 0.5)))
i = 0 P = 0.0009765625
i = 1 P = 0.009765625
i = 2 P = 0.0439453125
i = 3 P = 0.1171875
i = 4 P = 0.205078125
i = 5 P = 0.24609375
i = 6 P = 0.205078125
i = 7 P = 0.1171875
i = 8 P = 0.0439453125
i = 9 P = 0.009765625
i = 10 P = 0.0009765625

[procedure] (binomial-cumulative-probability n k p)

returns the probability that less than k positive outcomes occur for a binomial distribution B(n, p).

> (do-ec (: i 0 11) 
         (format #t "i = ~d P = ~f~&" i (binomial-cumulative-probability 10 i 0.5)))
i = 0 P = 0.0
i = 1 P = 0.0009765625
i = 2 P = 0.0107421875
i = 3 P = 0.0546875
i = 4 P = 0.171875
i = 5 P = 0.376953125
i = 6 P = 0.623046875
i = 7 P = 0.828125
i = 8 P = 0.9453125
i = 9 P = 0.9892578125
i = 10 P = 0.9990234375

[procedure] (binomial-ge-probability n k p)

returns the probability of k or more positive outcomes for a binomial distribution B(n, p).

[procedure] (binomial-le-probability n k p)

returns the probability k or fewer positive outcomes for a binomial distribution B(n, p).

[procedure] (poisson-probability mu k)

returns the probability of k events occurring when the average is mu.

> (do-ec (: i 0 20) 
         (format #t "P(X=~2d) = ~,4f~&" i (poisson-probability 10 i)))
P(X= 0) = 0.0000
P(X= 1) = 0.0005
P(X= 2) = 0.0023
P(X= 3) = 0.0076
P(X= 4) = 0.0189
P(X= 5) = 0.0378
P(X= 6) = 0.0631
P(X= 7) = 0.0901
P(X= 8) = 0.1126
P(X= 9) = 0.1251
P(X=10) = 0.1251
P(X=11) = 0.1137
P(X=12) = 0.0948
P(X=13) = 0.0729
P(X=14) = 0.0521
P(X=15) = 0.0347
P(X=16) = 0.0217
P(X=17) = 0.0128 
P(X=18) = 0.0071
P(X=19) = 0.0037

[procedure] (poisson-cumulative-probability mu k)

returns the probability of less than k events occurring when the average is mu.

> (do-ec (: i 0 20) 
         (format #t "P(X=~2d) = ~,4f~&" i (poisson-cumulative-probability 10 i)))
P(X= 0) = 0.0000
P(X= 1) = 0.0000
P(X= 2) = 0.0005
P(X= 3) = 0.0028
P(X= 4) = 0.0103
P(X= 5) = 0.0293
P(X= 6) = 0.0671
P(X= 7) = 0.1301
P(X= 8) = 0.2202
P(X= 9) = 0.3328
P(X=10) = 0.4579
P(X=11) = 0.5830
P(X=12) = 0.6968
P(X=13) = 0.7916
P(X=14) = 0.8645
P(X=15) = 0.9165
P(X=16) = 0.9513
P(X=17) = 0.9730
P(X=18) = 0.9857
P(X=19) = 0.9928

[procedure] (poisson-ge-probability mu k)

returns the probability of k or more events occurring when the average is mu.

[procedure] (normal-pdf x mean variance)

returns the likelihood of x given a normal distribution with stated mean and variance.

> (do-ec (: i 0 11) 
         (format #t "~3d ~,4f~&" i (normal-pdf i 5 4)))
 0 0.0088
 1 0.0270
 2 0.0648
 3 0.1210
 4 0.1760
 5 0.1995
 6 0.1760
 7 0.1210
 8 0.0648
 9 0.0270
10 0.0088

[procedure] (convert-to-standard-normal x mean variance)

returns a value for x rescaling the given normal distribution to a standard N(0, 1).

> (convert-to-standard-normal 5 6 2)
-1/2

[procedure] (phi x)

returns the cumulative distribution function (CDF) of the standard normal distribution.

> (do-ec (: x -2 2 0.4)
         (format #t "~4,1f ~,4f~&" x (phi x)))
-2.0 0.0228
-1.6 0.0548
-1.2 0.1151
-0.8 0.2119
-0.4 0.3446
 0.0 0.5000
 0.4 0.6554
 0.8 0.7881
 1.2 0.8849
 1.6 0.9452

[procedure] (z percentile)

returns the inverse of the standard normal distribution. Input is a percentile, between 0.0 and 1.0.

[procedure] ( t-distribution degrees-of-freedom percentile)

returns the point in the t-distribution given the degrees-of-freedom and percentile. degrees-of-freedom must be a positive integer, and percentile a value between 0.0 and 1.0.

[procedure] (chi-square degrees-of-freedom percentile)

returns the point at which chi-square distribution has percentile to its left, using given degrees-of-freedom.

[procedure] (chi-square-cdf x degrees-of-freedom)

returns the probability that a random variable is to the left of x using the chi-square distribution with given degrees-of-freedom.

Confidence intervals

These functions report bounds for an observed property of a distribution: the bounds are tighter as the confidence level, alpha, varies from 0.0 to 1.0.

[procedure] (binomial-probability-ci n p alpha)

returns two values, the upper and lower bounds on an observed probability p from n trials with confidence (1-alpha).

> (binomial-probability-ci 10 0.8 0.9)
0.724273681640625 
0.851547241210938
; 2 values

[procedure] (poisson-mu-ci k alpha)

returns two values, the upper and lower bounds on the poisson parameter if k events are observed; the bound is for confidence (1-alpha).

> (poisson-mu-ci 10 0.9)
8.305419921875
10.0635986328125
; 2 values

[procedure] (normal-mean-ci mean standard-deviation k alpha)

returns two values, the upper and lower bounds on the mean of the normal distibution of k events are observed; the bound is for confidence (1-alpha).

> (normal-mean-ci 0.5 0.1 10 0.8)
0.491747852700165
0.508252147299835
; 2 values

[procedure] (normal-mean-ci-on-sequence items alpha)

returns two values, the upper and lower bounds on the mean of the given items, assuming they are normally distributed; the bound is for confidence (1-alpha).

> (normal-mean-ci-on-sequence '(1 2 3 4 5) 0.9)
2.40860081649174
3.59139918350826
; 2 values

[procedure] (normal-variance-ci standard-deviation k alpha)

returns two values, the upper and lower bounds on the variance of the normal distibution of k events are observed; the bound is for confidence (1-alpha).

[procedure] (normal-variance-ci-on-sequence items alpha)

returns two values, the upper and lower bounds on the variance of the given items, assuming they are normally distributed; the bound is for confidence (1-alpha).

[procedure] normal-sd-ci standard-deviation k alpha)

returns two values, the upper and lower bounds on the standard deviation of the normal distibution of k events are observed; the bound is for confidence (1-alpha).

[procedure] (normal-sd-ci-on-sequence sequence items)

returns two values, the upper and lower bounds on the standard deviation of the given items, assuming they are normally distributed; the bound is for confidence (1-alpha).

Hypothesis testing

These functions report on the significance of an observed sample against a given distribution.

(parametric)

[procedure] (z-test x-bar n #:mu #:sigma #:tails)

Given x-bar the sample mean, n the number in the sample, #:mu the distribution mean (defaults to 0), #:sigma the distribution standard deviation (defaults to 1), and #:tails the significance to report on:

':both, the probability of the difference between x-bar and #:mu
':positive, the probability that observation is >= x-bar
':negative, the probability that observation is <= x-bar

e.g. given a distribution with mean 50 and standard deviation 10

; probability that a single observation is <= 40
> (z-test 40 1 #:mu 50 #:sigma 10 #:tails ':negative)
0.158655
; probability that 10 observations are <= 40
> (z-test 40 10 #:mu 50 #:sigma 10 #:tails ':negative)
0.000783
; probability that 5 observations give a mean of 40
> (z-test 40 5 #:mu 50 #:sigma 10)
0.025347

[procedure] (z-test-on-sequence observations #:mu #:sigma #:tails)

As for z-test except x-bar and n are computed from given observations.

[procedure] (t-test-one-sample x-bar sd n mu #:tails)

Given observed data with mean x-bar, standard devation sd and number of observations n (n < 30), return the significance of the sample compared with the population mean mu. #:tails is one of:

':both two-sided (default)
':positive one-sided, x-bar >= mu
':negative one-sided, x-bar <= mu

[procedure] (t-test-one-sample-on-sequence observations mu #:tails)

As for t-test-one-sample except x-bar, sd and n are computed from given observations.

[procedure] (t-test-paired t-bar sd n #:tails)

Computes the significance of the differences between two sequences of data: the differences are given as their mean, t-bar, standard deviation, sd, and number of measurements, n.

[procedure] (t-test-paired-on-sequences before after #:tails)

Computes the significance of the difference between two sequences of data: one before an experimental change and one after. #:tails is as for t-significance.

> (t-test-paired-on-sequences '(4 3 5) '(1 1 3))
0.0198039411803931

[procedure] (t-test-two-sample mean-1 sd-1 n-1 mean-2 sd-2 n-2 #:variances-equal? #:variance-significance-cutoff #:tails)

Computes the significance of the difference of two means given the sample standard deviations and sizes.

[procedure] (t-test-two-sample-on-sequences sequence-1 sequence-2 #:tails)

Significance of difference of two sequences of observations.

[procedure] (f-test variance-1 n1 variance-2 n2 #:tails)

Tests for the equality of two variances.

[procedure] (chi-square-test-one-sample observed-variance sample-size test-variance #:tails)

Tests for significance of difference between an observed and a test variance.

[procedure] (binomial-test-one-sample p-hat n p #:tails #:exact?)

Returns the significance of a one sample test with n observations, observed probability p-hat and expected probability p.

[procedure] (binomial-test-two-sample p-hat-1 n-1 p-hat-2 n-2 #:tails #:exact?)

Returns the significance of a two sample test.

[procedure] (fisher-exact-test a b c d #:tails)

Given a 2x2 contingency table, returns a p value using Fisher's exact test. a and b form the first row of the contingency table, c and d the second row.

[procedure] (mcnemars-test a-discordant-count b-discordant-count #:exact?)

For measuring effectiveness of, e.g., one treatment over another. a-discordant-count is the number of times when A worked, b-discordant-count the number of times B worked.

[procedure] (poisson-test-one-sample observed mu #:tails #:approximate?)

Computes significance of the number of observed events under a Poisson distribution against mu expected events.

(non parametric)

[procedure] (sign-test plus-count minus-count #:exact? #:tails)
[procedure] (sign-test-on-sequence sequence-1 sequence-2 #:exact? #:tails)

Takes two equal-sized sequences of observations, and reports if the entries of one are different to those in the other.

[procedure] (wilcoxon-signed-rank-test differences #:tails)

Given at least 16 differences, reports if the positive differences are significantly larger or smaller than the negative differences.

[procedure] (wilcoxon-signed-rank-test-on-sequences sequence-1 sequence-2 #:tails)

Given two sequences of at least 16 observations, computes wilcoxon-signed-rank-test on the differences.

[procedure] (chi-square-test-rxc contingency-table)

Given a contingency table (a SRFI-63 array), returns significance of relation between row and column variable.

[procedure] (chi-square-test-for-trend row1-counts row2-counts)

Returns p significance of trend, and prints a string to show if increasing or decreasing.

Sample size estimates

[procedure] (t-test-one-sample-sse mean-1 mean-2 sigma-1 #:alpha #:1-beta #:tails)

Returns the size of sample necessary to distinguish a normally distributed sample with mean-2 from a population mean-1 standard deviation sigma-1. The significance #:alpha (defaults to 0.05), power #:1-beta (0.95) and sides #:tails (':both) may be altered.

> (t-test-one-sample-sse 5.0 5.2 0.5)
163

[procedure] (t-test-two-sample-sse mean-1 sigma-1 mean-2 sigma-2 #:alpha #:1-beta #:tails #:sample-ratio)

Returns the size of sample necessary to distinguish a normally distributed sample N(mean-1, sigma-1) from a normally distributed sample N(mean-2, sigma-2). The significance #:alpha (defaults to 0.05), power #:1-beta (0.95), sides #:tails (':both) and sample-ratio #:sample-ratio (1) may be altered.

[procedure] (t-test-paired-sse difference-mean difference-sigma #:alpha #:1-beta #:tails)

Returns the size of sample to produce a given mean and standard deviation in the differences of two samples.

[procedure] (binomial-test-one-sample-sse p-estimated p-null #:alpha #:1-beta #:tails)

Returns the size of sample needed to test whether an observed probability is significantly different from a particular binomial null hypothesis with a significance alpha and a power 1-beta.

[procedure] (binomial-test-two-sample-sse p-one p-two #:alpha #:1-beta #:tails #:sample-ratio)

Returns the size of sample needed to test if given two binomial probabilities are significantly different. #:sample-ratio can be given if the two samples differ in size.

[procedure] (binomial-test-paired-sse pd pa #:alpha #:1-beta #:tails)

Sample size estimate for McNemar's discordant pairs test.

[procedure] (correlation-sse rho #:alpha #:1-beta)

Returns the size of sample necessary to find a correlation of value rho with significance #:alpha (defaults to 0.05) and power #:1-beta (defaults to 0.95).

> (correlation-sse 0.80 #:alpha 0.05 #:1-beta 0.9)
11

Correlation and regression

[procedure] (linear-regression xs ys)

Given a line definition as lists of point coordinates, first prints to the terminal and then returns 5 values for the best fitting line through the points:

the y-intercept
the slope
the correlation coefficient, r
the square of the correlation coefficient, r^2
the significance of the difference of the slope from zero, p

(This is also called the Pearson correlation; used when relation expected to be linear. Also see spearman-rank-correlation.)

> (linear-regression '(1.0 2.0 3.0) '(0.1 0.3 0.8))
Intercept = -0.3, slope = 0.35, r = 0.970725343394151, R^2 = 0.942307692307692, p = 0.154420958311267
-0.3
0.35
0.970725343394151
0.942307692307692
0.154420958311267
; 5 values

[procedure] (correlation-coefficient xs ys)

As above, but only returns the value of r:

> (correlation-coefficient '(1.0 2.0 3.0) '(0.1 0.3 0.8))
0.970725343394151

[procedure] (correlation-test-two-sample r1 n1 r2 n2 #:tails)

Returns the significance of the similarity between two correlations. #:tails determines how the comparison is made: ':both measures the difference, ':negative if r1 < r2 and #':positive if r2 > r1.

[procedure] (correlation-test-two-sample-on-sequences points-1 points-2 #:tails)

As above, but computes the correlations from given lists of points.

[procedure] (spearman-rank-correlation xs ys)

Returns two values, the Spearman Rank measure of correlation between the given lists of point coordinates, and the p-significance of the correlation. (This correlation is used for non-linear relations; compare with linear-regression.)

Significance test functions

[procedure] (t-significance t-value degrees-of-freedom #:tails)

returns the probability of t-value for given degrees-of-freedom. The keyword #:tails modifies the calculation to be two-sided (the default) with ':both, or one-sided, ':positive or ':negative.

> (t-significance 0.2 5)
0.849360513995829
> (t-significance 0.2 5 #:tails ':positive)
0.424680256997915
> (t-significance 0.2 5 #:tails ':negative)
0.575319743002086

[procedure] (f-significance f-value numerator-dof denominator-dof #:one-tailed?)

returns the probability of f-value for given numerator-dof and denominator-dof. The boolean keyword #:one-tailed? indicates if calculation is two-sided (the default) or not.

> (f-significance 1.5 8 2)
0.920449812578091
> (f-significance 1.5 8 2 #:one-tailed? #t)
0.460224906289046

Authors

Peter Lane wrote the Scheme version of this library. The original Lisp version was written by Larry Hunter.

Repository

This egg is hosted on the CHICKEN Subversion repository:

https://anonymous@code.call-cc.org/svn/chicken-eggs/release/5/statistics

If you want to check out the source code repository of this egg and you are not familiar with Subversion, see this page.

License

GPL version 3.0.

Version History

0.12: removed GSL dependency
0.11: refactoring correlation and regression interface to take two separate dataset arguments
0.9: ported to CHICKEN 5
0.8: added cumsum and random-weighted-sample
0.5: fixed warning in compilation (thanks to Felix for pointing it out)
0.4: all functions should now be working
0.3: some error fixes and addition of tests for majority of functions
0.2: fixed some errors in keywords and find-critical-value
0.1: initial package