Wiki
Download
Manual
Eggs
API
Tests
Bugs
show
edit
history
You can edit this page using
wiki syntax
for markup.
Article contents:
[[tags: egg]] [[toc:]] == Statistics This library is a port of [[http://compbio.ucdenver.edu/hunter/|Larry Hunter]]'s Lisp statistics library to chicken scheme. The library provides a number of formulae and methods taken from the book "Fundamentals of Biostatistics" by Bernard Rosner (5th edition). === Statistical Distributions To use this library, you need to understand the underlying statistics. In brief: The [[http://en.wikipedia.org/wiki/Binomial_distribution|Binomial distribution]] is used when counting discrete events in a series of trials, each of which events has a probability p of producing a positive outcome. An example would be tossing a coin {{n}} times: the probability of a head is {{p}}, and the distribution gives the expected number of heads in the {{n}} trials. The binomial distribution is defined as B(n, p). The [[http://en.wikipedia.org/wiki/Poisson_distribution|Poisson distribution]] is used to count discrete events which occur with a known average rate. A typical example is the decay of radioactive elements. A poisson distribution is defined Pois(mu). The [[http://en.wikipedia.org/wiki/Normal_distribution|Normal distribution]] is used for real-valued events which cluster around a specific mean with a symmetric variance. A typical example would be the distribution of people's heights. A normal distribution is defined N(mean, variance). === Provided Functions ==== Utilities <procedure>(average-rank value sorted-values)</procedure> returns the average position of given value in the list of sorted values: the rank is based from 1. > (average-rank 2 '(1 2 2 3 4)) 5/2 <procedure>(beta-incomplete x a b)</procedure> <procedure>(bin-and-count items n)</procedure> Divides the range of the list of {{items}} into {{n}} bins, and returns a vector of the number of items which fall into each bin. > (bin-and-count '(1 1 2 3 3 4 5) 5) #(2 1 2 1 1) <procedure>(combinations n k)</procedure> returns the number of ways to select {{k}} items from {{n}}, where the order does not matter. <procedure>(factorial n)</procedure> returns the factorial of {{n}}. <procedure>(find-critical-value p-function p-value #:increasing?)</procedure> given a monotonic function {{p-function}} taking a single value {{x}} to {{y}}, returns the value of {{x}} which makes {{(p-function x)}} closest to {{p-value}}. A boolean keyword parameter {{#:increasing?}} determines if function should be increasing or decreasing (the default). <procedure>(fisher-z-transform r)</procedure> returns the transformation of a correlation coefficient {{r}} into an approximately normal distribution. <procedure>(gamma-incomplete a x)</procedure> <procedure>(gamma-ln x)</procedure> <procedure>(permutations n k)</procedure> returns the number of ways to select {{k}} items from {{n}}, where the order does matter. <procedure>(random-normal mean sd)</procedure> returns a random number distributed with specified mean and standard deviation. <procedure>(random-pick items)</procedure> returns a random item from the given list of items. <procedure>(random-sample n items)</procedure> returns a random sample from the list of items without replacement of size {{n}}. <procedure>(random-weighted-sample n items weights)</procedure> returns a random sample from the list of items without replacement of size {{n}}, where each sample has a defined probability of selection (weight). <procedure>(sign n)</procedure> returns 0, 1 or -1 according to if {{n}} is zero, positive or negative. <procedure>(square n)</procedure> <procedure>(cumsum sequences)</procedure> returns the cumulative sum of a sequence. ==== Descriptive statistics These functions provide information on a given list of numbers, the {{items}}. Note, the list does not have to be sorted. <procedure>(mean items)</procedure> returns the arithmetic mean of the {{items}} (the sum of the numbers divided by the number of numbers). (mean '(1 2 3 4 5)) => 3 <procedure>(median items)</procedure> returns the value which separates the upper and lower halves of the list of numbers. (median '(1 2 3 4)) => 5/2 <procedure>(mode items)</procedure> returns two '''values'''. The first is a list of the ''modes'' and the second is the frequency. (A mode of a list of numbers is the most frequently occurring value.) > (mode '(1 2 3 4)) (1 2 3 4) 1 > (mode '(1 2 2 3 4)) (2) 2 > (mode '(1 2 2 3 3 4)) (2 3) 2 <procedure>(geometric-mean items)</procedure> returns the geometric mean of the {{items}} (the result of multiplying the items together and then taking the nth root, where n is the number of items). (geometric-mean '(1 2 3 4 5)) => 2.60517108469735 <procedure>(range items)</procedure> returns the difference between the biggest and the smallest value from the list of {{items}}. (range '(5 1 2 3 4)) => 4 <procedure>(percentile items percent)</procedure> returns the item closest to the {{percent}} value if the {{items}} are sorted into order; the returned item may be in the list, or the average of adjacent values. (percentile '(1 2 3 4) 50) => 5/2 (percentile '(1 2 3 4) 67) => 3 <procedure>(variance items)</procedure> <procedure>(standard-deviation items)</procedure> <procedure>(coefficient-of-variation items)</procedure> returns 100 * (std-dev / mean) of the {{items}}. (coefficient-of-variation '(1 2 3 4)) => 51.6397779494322 <procedure>(standard-error-of-the-mean items)</procedure> returns std-dev / sqrt(length items). (standard-error-of-the-mean '(1 2 3 4)) => 0.645497224367903 <procedure>(mean-sd-n items)</procedure> returns three '''values''', one for the mean, one for the standard deviation, and one for the length of the list. > (mean-sd-n '(1 2 3 4)) 5/2 1.29099444873581 4 ==== Distributional functions <procedure>(binomial-probability n k p)</procedure> returns the probability that the number of positive outcomes for a binomial distribution B(n, p) is k. > (do-ec (: i 0 11) (format #t "i = ~d P = ~f~&" i (binomial-probability 10 i 0.5))) i = 0 P = 0.0009765625 i = 1 P = 0.009765625 i = 2 P = 0.0439453125 i = 3 P = 0.1171875 i = 4 P = 0.205078125 i = 5 P = 0.24609375 i = 6 P = 0.205078125 i = 7 P = 0.1171875 i = 8 P = 0.0439453125 i = 9 P = 0.009765625 i = 10 P = 0.0009765625 <procedure>(binomial-cumulative-probability n k p)</procedure> returns the probability that less than {{k}} positive outcomes occur for a binomial distribution B(n, p). > (do-ec (: i 0 11) (format #t "i = ~d P = ~f~&" i (binomial-cumulative-probability 10 i 0.5))) i = 0 P = 0.0 i = 1 P = 0.0009765625 i = 2 P = 0.0107421875 i = 3 P = 0.0546875 i = 4 P = 0.171875 i = 5 P = 0.376953125 i = 6 P = 0.623046875 i = 7 P = 0.828125 i = 8 P = 0.9453125 i = 9 P = 0.9892578125 i = 10 P = 0.9990234375 <procedure>(binomial-ge-probability n k p)</procedure> returns the probability of {{k}} or more positive outcomes for a binomial distribution B(n, p). <procedure>(binomial-le-probability n k p)</procedure> returns the probability {{k}} or fewer positive outcomes for a binomial distribution B(n, p). <procedure>(poisson-probability mu k)</procedure> returns the probability of {{k}} events occurring when the average is {{mu}}. > (do-ec (: i 0 20) (format #t "P(X=~2d) = ~,4f~&" i (poisson-probability 10 i))) P(X= 0) = 0.0000 P(X= 1) = 0.0005 P(X= 2) = 0.0023 P(X= 3) = 0.0076 P(X= 4) = 0.0189 P(X= 5) = 0.0378 P(X= 6) = 0.0631 P(X= 7) = 0.0901 P(X= 8) = 0.1126 P(X= 9) = 0.1251 P(X=10) = 0.1251 P(X=11) = 0.1137 P(X=12) = 0.0948 P(X=13) = 0.0729 P(X=14) = 0.0521 P(X=15) = 0.0347 P(X=16) = 0.0217 P(X=17) = 0.0128 P(X=18) = 0.0071 P(X=19) = 0.0037 <procedure>(poisson-cumulative-probability mu k)</procedure> returns the probability of less than {{k}} events occurring when the average is {{mu}}. > (do-ec (: i 0 20) (format #t "P(X=~2d) = ~,4f~&" i (poisson-cumulative-probability 10 i))) P(X= 0) = 0.0000 P(X= 1) = 0.0000 P(X= 2) = 0.0005 P(X= 3) = 0.0028 P(X= 4) = 0.0103 P(X= 5) = 0.0293 P(X= 6) = 0.0671 P(X= 7) = 0.1301 P(X= 8) = 0.2202 P(X= 9) = 0.3328 P(X=10) = 0.4579 P(X=11) = 0.5830 P(X=12) = 0.6968 P(X=13) = 0.7916 P(X=14) = 0.8645 P(X=15) = 0.9165 P(X=16) = 0.9513 P(X=17) = 0.9730 P(X=18) = 0.9857 P(X=19) = 0.9928 <procedure>(poisson-ge-probability mu k)</procedure> returns the probability of {{k}} or more events occurring when the average is {{mu}}. <procedure>(normal-pdf x mean variance)</procedure> returns the likelihood of {{x}} given a normal distribution with stated mean and variance. > (do-ec (: i 0 11) (format #t "~3d ~,4f~&" i (normal-pdf i 5 4))) 0 0.0088 1 0.0270 2 0.0648 3 0.1210 4 0.1760 5 0.1995 6 0.1760 7 0.1210 8 0.0648 9 0.0270 10 0.0088 <procedure>(convert-to-standard-normal x mean variance)</procedure> returns a value for {{x}} rescaling the given normal distribution to a standard N(0, 1). > (convert-to-standard-normal 5 6 2) -1/2 <procedure>(phi x)</procedure> returns the cumulative distribution function (CDF) of the standard normal distribution. > (do-ec (: x -2 2 0.4) (format #t "~4,1f ~,4f~&" x (phi x))) -2.0 0.0228 -1.6 0.0548 -1.2 0.1151 -0.8 0.2119 -0.4 0.3446 0.0 0.5000 0.4 0.6554 0.8 0.7881 1.2 0.8849 1.6 0.9452 <procedure>(z percentile)</procedure> returns the inverse of the standard normal distribution. Input is a percentile, between 0.0 and 1.0. <procedure>( t-distribution degrees-of-freedom percentile)</procedure> returns the point in the t-distribution given the {{degrees-of-freedom}} and {{percentile}}. {{degrees-of-freedom}} must be a positive integer, and {{percentile}} a value between 0.0 and 1.0. <procedure>(chi-square degrees-of-freedom percentile)</procedure> returns the point at which chi-square distribution has {{percentile}} to its '''left''', using given {{degrees-of-freedom}}. <procedure>(chi-square-cdf x degrees-of-freedom)</procedure> returns the probability that a random variable is to the '''left''' of {{x}} using the chi-square distribution with given {{degrees-of-freedom}}. ==== Confidence intervals These functions report bounds for an observed property of a distribution: the bounds are tighter as the confidence level, alpha, varies from 0.0 to 1.0. <procedure>(binomial-probability-ci n p alpha)</procedure> returns two values, the upper and lower bounds on an observed probability {{p}} from {{n}} trials with confidence {{(1-alpha)}}. > (binomial-probability-ci 10 0.8 0.9) 0.724273681640625 0.851547241210938 ; 2 values <procedure>(poisson-mu-ci k alpha)</procedure> returns two values, the upper and lower bounds on the poisson parameter if {{k}} events are observed; the bound is for confidence {{(1-alpha)}}. > (poisson-mu-ci 10 0.9) 8.305419921875 10.0635986328125 ; 2 values <procedure>(normal-mean-ci mean standard-deviation k alpha)</procedure> returns two values, the upper and lower bounds on the mean of the normal distibution of {{k}} events are observed; the bound is for confidence {{(1-alpha)}}. > (normal-mean-ci 0.5 0.1 10 0.8) 0.491747852700165 0.508252147299835 ; 2 values <procedure>(normal-mean-ci-on-sequence items alpha)</procedure> returns two values, the upper and lower bounds on the mean of the given {{items}}, assuming they are normally distributed; the bound is for confidence {{(1-alpha)}}. > (normal-mean-ci-on-sequence '(1 2 3 4 5) 0.9) 2.40860081649174 3.59139918350826 ; 2 values <procedure>(normal-variance-ci standard-deviation k alpha)</procedure> returns two values, the upper and lower bounds on the variance of the normal distibution of {{k}} events are observed; the bound is for confidence {{(1-alpha)}}. <procedure>(normal-variance-ci-on-sequence items alpha)</procedure> returns two values, the upper and lower bounds on the variance of the given {{items}}, assuming they are normally distributed; the bound is for confidence {{(1-alpha)}}. <procedure>normal-sd-ci standard-deviation k alpha)</procedure> returns two values, the upper and lower bounds on the standard deviation of the normal distibution of {{k}} events are observed; the bound is for confidence {{(1-alpha)}}. <procedure>(normal-sd-ci-on-sequence sequence items)</procedure> returns two values, the upper and lower bounds on the standard deviation of the given {{items}}, assuming they are normally distributed; the bound is for confidence {{(1-alpha)}}. ==== Hypothesis testing These functions report on the significance of an observed sample against a given distribution. ===== (parametric) <procedure>(z-test x-bar n #:mu #:sigma #:tails)</procedure> Given {{x-bar}} the sample mean, {{n}} the number in the sample, {{#:mu}} the distribution mean (defaults to 0), {{#:sigma}} the distribution standard deviation (defaults to 1), and {{#:tails}} the significance to report on: * {{':both}}, the probability of the difference between {{x-bar}} and {{#:mu}} * {{':positive}}, the probability that observation is {{>= x-bar}} * {{':negative}}, the probability that observation is {{<= x-bar}} e.g. given a distribution with mean 50 and standard deviation 10 ; probability that a single observation is <= 40 > (z-test 40 1 #:mu 50 #:sigma 10 #:tails ':negative) 0.158655 ; probability that 10 observations are <= 40 > (z-test 40 10 #:mu 50 #:sigma 10 #:tails ':negative) 0.000783 ; probability that 5 observations give a mean of 40 > (z-test 40 5 #:mu 50 #:sigma 10) 0.025347 <procedure>(z-test-on-sequence observations #:mu #:sigma #:tails)</procedure> As for {{z-test}} except {{x-bar}} and {{n}} are computed from given {{observations}}. <procedure>(t-test-one-sample x-bar sd n mu #:tails)</procedure> Given observed data with mean {{x-bar}}, standard devation {{sd}} and number of observations {{n}} ({{n < 30}}), return the significance of the sample compared with the population mean {{mu}}. {{#:tails}} is one of: * {{':both}} two-sided (default) * {{':positive}} one-sided, {{x-bar >= mu}} * {{':negative}} one-sided, {{x-bar <= mu}} <procedure>(t-test-one-sample-on-sequence observations mu #:tails)</procedure> As for {{t-test-one-sample}} except {{x-bar}}, {{sd}} and {{n}} are computed from given {{observations}}. <procedure>(t-test-paired t-bar sd n #:tails)</procedure> Computes the significance of the differences between two sequences of data: the differences are given as their mean, {{t-bar}}, standard deviation, {{sd}}, and number of measurements, {{n}}. <procedure>(t-test-paired-on-sequences before after #:tails)</procedure> Computes the significance of the difference between two sequences of data: one before an experimental change and one after. {{#:tails}} is as for {{t-significance}}. > (t-test-paired-on-sequences '(4 3 5) '(1 1 3)) 0.0198039411803931 <procedure>(t-test-two-sample mean-1 sd-1 n-1 mean-2 sd-2 n-2 #:variances-equal? #:variance-significance-cutoff #:tails)</procedure> Computes the significance of the difference of two means given the sample standard deviations and sizes. <procedure>(t-test-two-sample-on-sequences sequence-1 sequence-2 #:tails)</procedure> Significance of difference of two sequences of observations. <procedure>(f-test variance-1 n1 variance-2 n2 #:tails)</procedure> Tests for the equality of two variances. <procedure>(chi-square-test-one-sample observed-variance sample-size test-variance #:tails)</procedure> Tests for significance of difference between an observed and a test variance. <procedure>(binomial-test-one-sample p-hat n p #:tails #:exact?)</procedure> Returns the significance of a one sample test with {{n}} observations, observed probability {{p-hat}} and expected probability {{p}}. <procedure>(binomial-test-two-sample p-hat-1 n-1 p-hat-2 n-2 #:tails #:exact?)</procedure> Returns the significance of a two sample test. <procedure>(fisher-exact-test a b c d #:tails)</procedure> Given a 2x2 contingency table, returns a p value using Fisher's exact test. {{a}} and {{b}} form the first row of the contingency table, {{c}} and {{d}} the second row. <procedure>(mcnemars-test a-discordant-count b-discordant-count #:exact?)</procedure> For measuring effectiveness of, e.g., one treatment over another. {{a-discordant-count}} is the number of times when A worked, {{b-discordant-count}} the number of times B worked. <procedure>(poisson-test-one-sample observed mu #:tails #:approximate?)</procedure> Computes significance of the number of observed events under a Poisson distribution against {{mu}} expected events. ===== (non parametric) <procedure>(sign-test plus-count minus-count #:exact? #:tails)</procedure> <procedure>(sign-test-on-sequence sequence-1 sequence-2 #:exact? #:tails)</procedure> Takes two equal-sized sequences of observations, and reports if the entries of one are different to those in the other. <procedure>(wilcoxon-signed-rank-test differences #:tails)</procedure> Given at least 16 differences, reports if the positive differences are significantly larger or smaller than the negative differences. <procedure>(wilcoxon-signed-rank-test-on-sequences sequence-1 sequence-2 #:tails)</procedure> Given two sequences of at least 16 observations, computes {{wilcoxon-signed-rank-test}} on the differences. <procedure>(chi-square-test-rxc contingency-table)</procedure> Given a contingency table (a SRFI-63 array), returns significance of relation between row and column variable. <procedure>(chi-square-test-for-trend row1-counts row2-counts)</procedure> Returns p significance of trend, and prints a string to show if increasing or decreasing. ==== Sample size estimates <procedure>(t-test-one-sample-sse mean-1 mean-2 sigma-1 #:alpha #:1-beta #:tails)</procedure> Returns the size of sample necessary to distinguish a normally distributed sample with {{mean-2}} from a population {{mean-1}} standard deviation {{sigma-1}}. The significance {{#:alpha}} (defaults to 0.05), power {{#:1-beta}} (0.95) and sides {{#:tails}} (':both) may be altered. > (t-test-one-sample-sse 5.0 5.2 0.5) 163 <procedure>(t-test-two-sample-sse mean-1 sigma-1 mean-2 sigma-2 #:alpha #:1-beta #:tails #:sample-ratio)</procedure> Returns the size of sample necessary to distinguish a normally distributed sample N(mean-1, sigma-1) from a normally distributed sample N(mean-2, sigma-2). The significance {{#:alpha}} (defaults to 0.05), power {{#:1-beta}} (0.95), sides {{#:tails}} (':both) and sample-ratio {{#:sample-ratio}} (1) may be altered. <procedure>(t-test-paired-sse difference-mean difference-sigma #:alpha #:1-beta #:tails)</procedure> Returns the size of sample to produce a given mean and standard deviation in the differences of two samples. <procedure>(binomial-test-one-sample-sse p-estimated p-null #:alpha #:1-beta #:tails)</procedure> Returns the size of sample needed to test whether an observed probability is significantly different from a particular binomial null hypothesis with a significance alpha and a power 1-beta. <procedure>(binomial-test-two-sample-sse p-one p-two #:alpha #:1-beta #:tails #:sample-ratio)</procedure> Returns the size of sample needed to test if given two binomial probabilities are significantly different. {{#:sample-ratio}} can be given if the two samples differ in size. <procedure>(binomial-test-paired-sse pd pa #:alpha #:1-beta #:tails)</procedure> Sample size estimate for McNemar's discordant pairs test. <procedure>(correlation-sse rho #:alpha #:1-beta)</procedure> Returns the size of sample necessary to find a correlation of value {{rho}} with significance {{#:alpha}} (defaults to 0.05) and power {{#:1-beta}} (defaults to 0.95). > (correlation-sse 0.80 #:alpha 0.05 #:1-beta 0.9) 11 ==== Correlation and regression <procedure>(linear-regression xs ys)</procedure> Given a line definition as lists of point coordinates, first prints to the terminal and then returns 5 '''values''' for the best fitting line through the points: * the y-intercept * the slope * the correlation coefficient, r * the square of the correlation coefficient, r^2 * the significance of the difference of the slope from zero, p (This is also called the Pearson correlation; used when relation expected to be linear. Also see {{spearman-rank-correlation}}.) > (linear-regression '(1.0 2.0 3.0) '(0.1 0.3 0.8)) Intercept = -0.3, slope = 0.35, r = 0.970725343394151, R^2 = 0.942307692307692, p = 0.154420958311267 -0.3 0.35 0.970725343394151 0.942307692307692 0.154420958311267 ; 5 values <procedure>(correlation-coefficient xs ys)</procedure> As above, but only returns the value of ''r'': > (correlation-coefficient '(1.0 2.0 3.0) '(0.1 0.3 0.8)) 0.970725343394151 <procedure>(correlation-test-two-sample r1 n1 r2 n2 #:tails)</procedure> Returns the significance of the similarity between two correlations. {{#:tails}} determines how the comparison is made: {{':both}} measures the difference, {{':negative}} if {{r1 < r2}} and {{#':positive}} if {{r2 > r1}}. <procedure>(correlation-test-two-sample-on-sequences points-1 points-2 #:tails)</procedure> As above, but computes the correlations from given lists of points. <procedure>(spearman-rank-correlation xs ys)</procedure> Returns two '''values''', the Spearman Rank measure of correlation between the given lists of point coordinates, and the p-significance of the correlation. (This correlation is used for non-linear relations; compare with {{linear-regression}}.) ==== Significance test functions <procedure>(t-significance t-value degrees-of-freedom #:tails)</procedure> returns the probability of {{t-value}} for given {{degrees-of-freedom}}. The keyword {{#:tails}} modifies the calculation to be two-sided (the default) with {{':both}}, or one-sided, {{':positive}} or {{':negative}}. > (t-significance 0.2 5) 0.849360513995829 > (t-significance 0.2 5 #:tails ':positive) 0.424680256997915 > (t-significance 0.2 5 #:tails ':negative) 0.575319743002086 <procedure>(f-significance f-value numerator-dof denominator-dof #:one-tailed?)</procedure> returns the probability of {{f-value}} for given {{numerator-dof}} and {{denominator-dof}}. The boolean keyword {{#:one-tailed?}} indicates if calculation is two-sided (the default) or not. > (f-significance 1.5 8 2) 0.920449812578091 > (f-significance 1.5 8 2 #:one-tailed? #t) 0.460224906289046 === Authors [[/users/peter-lane|Peter Lane]] wrote the Scheme version of this library. The original Lisp version was written by [[http://compbio.ucdenver.edu/hunter/|Larry Hunter]]. === Repository This egg is hosted on the CHICKEN Subversion repository: [[https://anonymous@code.call-cc.org/svn/chicken-eggs/release/5/statistics|https://anonymous@code.call-cc.org/svn/chicken-eggs/release/5/statistics]] If you want to check out the source code repository of this egg and you are not familiar with Subversion, see [[/egg-svn-checkout|this page]]. === License GPL version 3.0. === Version History * 0.12: removed GSL dependency * 0.11: refactoring correlation and regression interface to take two separate dataset arguments * 0.9: ported to CHICKEN 5 * 0.8: added cumsum and random-weighted-sample * 0.5: fixed warning in compilation (thanks to Felix for pointing it out) * 0.4: all functions should now be working * 0.3: some error fixes and addition of tests for majority of functions * 0.2: fixed some errors in keywords and find-critical-value * 0.1: initial package
Description of your changes:
I would like to authenticate
Authentication
Username:
Password:
Spam control
What do you get when you multiply 8 by 7?