Unit ProbFun/ProbF87

 

Probability distribution functions for statistical calculations

Copyright 1990 by J. W. Rider

 

This unit uses the math and specfun units (available separately)

for defining float types, and the beta-, erf-, and gamma-related

functions.

 

This unit is not intended to be a self-contained tutorial in

probability.  The probability cumulative distribution functions

(cdf's) are provided with the caveat that they only work with the

correct inputs.  The "probability" returned assumes that the

"null hypothesis" (that some number is a random variate with a

particular probability distribution) is true.  Based upon the

returned probability, you can determine at what level you want to

accept or reject the "null hypothesis".

 

This unit provides "probability distributions" rather than

"statistical routines".  They will not help you compute

statistics.  However, they will tell you what the probability

of computing a statistic from a particular distribution will be.

 

Because of the number of probability distributions available, I

have tried to adopt a consistent (but not standard anywhere else)

way of describing the functions.  Each distribution has a base

"prefix" to which is appended either "CDF", "INV", or "PF".

"CDF" indicates that the function is a "cumulative distribution

function".  Where possible, I've strived to make this

consistently the probability that a random variate will be less

than or equal to ("<=") the "x" argument.  "INV" indicates an

"inverse cumulative distribution function" which takes a

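probability as the argument and returns an "x" for which the

"CDF" would yield the given probability.  "PF" indicates a

"probability function".  For a continuous distribution, this is

the probability density function, which is the derivative of the

"CDF".  (Or, inversely, the "CDF" is the integral from minus

infinity to "x" of the "PF".)  For a discrete distribution, it

is the probability of observing exactly "x".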
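
The relationship among the three kinds of function can be checked
numerically.  The sketch below is illustrative only and does not
use this unit: it defines its own stand-ins (DemoCdf, DemoInv,
DemoPf, all hypothetical names) for an exponential variate with
mean 1, and declares "xfloat" locally as a plain real.

```pascal
program CdfInvPfDemo;
{ Illustrative sketch only; these are not the unit's routines.    }
type
  xfloat = real;  { stand-in for the float type of the math unit }

{ CDF of an exponential variate with mean 1: P(X <= x) }
function DemoCdf(x: xfloat): xfloat;
begin
  if x <= 0.0 then
    DemoCdf := 0.0
  else
    DemoCdf := 1.0 - exp(-x);
end;

{ INV: the x for which DemoCdf(x) = prob; valid for 0 <= prob < 1 }
function DemoInv(prob: xfloat): xfloat;
begin
  DemoInv := -ln(1.0 - prob);
end;

{ PF: the density, that is, the derivative of DemoCdf }
function DemoPf(x: xfloat): xfloat;
begin
  if x < 0.0 then
    DemoPf := 0.0
  else
    DemoPf := exp(-x);
end;

const
  h = 1.0e-4;  { step for the central-difference derivative }
var
  x, slope: xfloat;
begin
  x := 1.5;
  { INV undoes CDF, so this prints x again }
  writeln('INV(CDF(x)) = ', DemoInv(DemoCdf(x)):0:6);
  { PF should agree with a numerical derivative of CDF }
  slope := (DemoCdf(x + h) - DemoCdf(x - h)) / (2.0 * h);
  writeln('PF(x)       = ', DemoPf(x):0:6);
  writeln('dCDF/dx     = ', slope:0:6);
end.
```

The last two lines printed should agree to several decimal
places.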

 

Not all distributions have a complete set of functions defined.

 

Probability functions supplied for:

 

  PROB DIST             PREFIX  CDF     PF      INV     type

  Beta                  beta-    y      y               cnts

  Binomial              bin-     y      y               disc

  Cauchy                cauchy-  y      y        y      cnts

  Chi-square            chs-     y      y               cnts

  Double-Exponential

     (Laplacian)        dx-      y      y        y      cnts

  Error                 erf-     y      y               cnts

  Exponential           x-       y      y        y      cnts

  Snedecor's F          f-       y      y               cnts

  Gamma                 gam-     y      y               cnts

  Gaussian (Normal)     g-       y      y               cnts

  Geometric             geo-            y               disc

  Hypergeometric        hgeo-           y               disc

  Kolmogorov-Smirnov's

     D                  ks-      y                      cnts

  Maxwell               maxwell-                        cnts

  Pascal

    (Negative Binomial) pas-     y      y               disc

  Poisson               poi-     y      y               disc

  Rayleigh              ray-                            cnts

  Student's T           t-       y      y               cnts

  Uniform

     (Rectangular)      u-       y      y        y      cnts

 

 

In most of the references cited at the end of this document, some

sort of mathematical expression is provided for any particular

distribution.  Unfortunately, this is often insufficient to

determine when the practitioner should use a particular

distribution.  Knowing precisely what circumstances yield random

variates that follow a particular distribution can be especially

fruitful in determining which hypotheses can be tested.

Consequently, I have tried to explain those circumstances in the

descriptions that follow.

 

The concept of a "Bernoulli trial" occurs frequently in

relationship with probability distributions.  Briefly, a

Bernoulli trial has a fixed probability of success without regard

to when it is tried, and any one Bernoulli trial is independent

of all others.

 

 

BETA. The beta distribution is a continuous distribution related to

the binomial distribution.  Mathematically, the Beta distribution

describes the distribution of the probability of success of "n"

Bernoulli trials given "s" successes, rather than the

distribution of the number of successes out of "n" Bernoulli

trials with a given probability "p". If X1 and X2 are independent

chi-square random variates with "degrees of freedom" v1 and v2

respectively, then the expression "X1/(X1+X2)" will follow a beta

distribution with parameters v1/2 and v2/2.

 

     Limits:  0 <= x <= 1

              0 < dof1,dof2

 

function betacdf(x,dof1,dof2:xfloat):xfloat

function betapf(x,dof1,dof2:xfloat):xfloat

 

 

BINOMIAL. The binomial distribution describes the number of

successes out of a specific number of Bernoulli trials.

 

        The CDF returns the probability of fewer than "k"

successes out of "n" trials with probability "p" of success per

trial.  Low values indicate that "p" is likely too big;  high

values, "p" too small, for fewer than "k" events out of "n".

 

     Limits:  0 <= k <= n

              0 <= p <= 1

 

function bincdf(k,n,p:xfloat):xfloat

function binpf(k,n,p:xfloat):xfloat
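
As a rough illustration of the description above, the following
self-contained sketch computes a binomial PF and a CDF defined as
the probability of fewer than "k" successes.  It is not the unit's
code: the names SketchBinPf and SketchBinCdf are hypothetical, the
counts are taken as integers for simplicity, and "xfloat" is
declared locally as a plain real.

```pascal
program BinomialSketch;
{ Illustrative sketch only; these are not the unit's routines.    }
type
  xfloat = real;  { stand-in for the float type of the math unit }

{ P(exactly k successes in n trials) = C(n,k) p^k (1-p)^(n-k).  }
{ The binomial coefficient is built up as a running product.    }
function SketchBinPf(k, n: integer; p: xfloat): xfloat;
var
  i: integer;
  term: xfloat;
begin
  term := 1.0;
  for i := 1 to k do
    term := term * (n - k + i) / i * p;
  for i := 1 to n - k do
    term := term * (1.0 - p);
  SketchBinPf := term;
end;

{ P(fewer than k successes): the PF summed over 0 .. k-1 }
function SketchBinCdf(k, n: integer; p: xfloat): xfloat;
var
  j: integer;
  total: xfloat;
begin
  total := 0.0;
  for j := 0 to k - 1 do
    total := total + SketchBinPf(j, n, p);
  SketchBinCdf := total;
end;

begin
  { ten tosses of a fair coin }
  writeln('P(exactly 5 heads)    = ', SketchBinPf(5, 10, 0.5):0:6);
  writeln('P(fewer than 5 heads) = ', SketchBinCdf(5, 10, 0.5):0:6);
end.
```

For ten tosses of a fair coin the two values are 252/1024 (about
0.246) and 386/1024 (about 0.377).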

 

 

CAUCHY. The continuous Cauchy distribution is peculiar in the

sense that it has no well-defined mean or variance.  However, it

arises in some physical phenomena.

 

function cauchycdf(x:xfloat):xfloat

function cauchyinv(prob:xfloat):xfloat

function cauchypf(x:xfloat):xfloat
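
Since the functions above take no location or scale argument, a
reasonable guess is that they describe the standard Cauchy
distribution.  Under that assumption, all three have closed forms,
sketched below.  The names SketchCauchyCdf, SketchCauchyInv, and
SketchCauchyPf are hypothetical stand-ins, not the unit's routines.

```pascal
program CauchySketch;
{ Illustrative sketch only: standard Cauchy (location 0, scale 1). }
type
  xfloat = real;  { stand-in for the float type of the math unit }

{ P(X <= x) = 1/2 + arctan(x)/pi }
function SketchCauchyCdf(x: xfloat): xfloat;
begin
  SketchCauchyCdf := 0.5 + arctan(x) / pi;
end;

{ Inverse of the CDF: tan(pi*(prob - 1/2)); valid for 0 < prob < 1 }
function SketchCauchyInv(prob: xfloat): xfloat;
var
  a: xfloat;
begin
  a := pi * (prob - 0.5);
  SketchCauchyInv := sin(a) / cos(a);  { tangent of a }
end;

{ Density: 1 / (pi * (1 + x^2)) }
function SketchCauchyPf(x: xfloat): xfloat;
begin
  SketchCauchyPf := 1.0 / (pi * (1.0 + sqr(x)));
end;

begin
  writeln('cdf(1)    = ', SketchCauchyCdf(1.0):0:6);
  writeln('inv(0.75) = ', SketchCauchyInv(0.75):0:6);
  writeln('pf(0)     = ', SketchCauchyPf(0.0):0:6);
end.
```

Three quarters of the probability lies below x = 1, and the
density peaks at 1/pi (about 0.318) at the origin.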

 

 

CHI-SQUARE. If X1, X2, ..., XN are independent gaussian random

variates of zero mean and unit variance, then the sum of the

squares of the variates will follow a continuous Chi-square

distribution with "N" degrees of freedom.

 

        The CDF returns the probability that an observed

chi-square statistic will be less than "chs" with "dof"

degrees-of-freedom.  Low values indicate "cooked" or "biased"

experimentation.  High values indicate significant differences

between model predictions and experimental outcomes.

 

     Limits:  0 < chs,dof

 

function chscdf(chs,dof:xfloat):xfloat

function chspf(chs,dof:xfloat):xfloat

 

 

DOUBLE-EXPONENTIAL. The double-exponential or Laplacian is a

continuous distribution that is a double-ended version of the

exponential.

 

function dxcdf(x:xfloat):xfloat

function dxinv(prob:xfloat):xfloat

function dxpf(x:xfloat):xfloat
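
The "double-ended" shape can be made concrete with a short sketch.
It assumes the standard Laplacian (centered at zero, unit scale),
which is a guess based on the single-argument signatures above.
The names SketchDxPf, SketchDxCdf, and SketchDxInv are
hypothetical, not the unit's routines.

```pascal
program LaplaceSketch;
{ Illustrative sketch only: standard Laplacian (location 0, scale 1). }
type
  xfloat = real;  { stand-in for the float type of the math unit }

{ Density: an exponential decay mirrored about zero }
function SketchDxPf(x: xfloat): xfloat;
begin
  SketchDxPf := 0.5 * exp(-abs(x));
end;

{ P(X <= x), handled separately on each side of zero }
function SketchDxCdf(x: xfloat): xfloat;
begin
  if x < 0.0 then
    SketchDxCdf := 0.5 * exp(x)
  else
    SketchDxCdf := 1.0 - 0.5 * exp(-x);
end;

{ Inverse of the CDF; valid for 0 < prob < 1 }
function SketchDxInv(prob: xfloat): xfloat;
begin
  if prob < 0.5 then
    SketchDxInv := ln(2.0 * prob)
  else
    SketchDxInv := -ln(2.0 * (1.0 - prob));
end;

begin
  writeln('pf(0)       = ', SketchDxPf(0.0):0:6);
  writeln('cdf(0)      = ', SketchDxCdf(0.0):0:6);
  writeln('cdf(-1)     = ', SketchDxCdf(-1.0):0:6);
  writeln('inv(cdf(2)) = ', SketchDxInv(SketchDxCdf(2.0)):0:6);
end.
```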

 

 

ERROR.  The distribution of the absolute values of gaussian

variates (with zero mean and variance 1/2).  For positive x,

this is the same as the "error function" (ERF) defined in the

SpecFun unit.  There is one difference:  ERF is an "odd" function

in that "ERF(-x)=-ERF(x)".  For negative values of "x", the CDF

and PF functions here are strictly zero.

 

    Limits:  0 <= x

 

function erfcdf(x:xfloat):xfloat

function erfpf(x:xfloat):xfloat

 

 

EXPONENTIAL.  The continuous exponential distribution describes

the intervals between Poisson events.

 

        The CDF returns the probability that an observed

exponential deviate (mean 1) will be less than "x".  Another

easy-to-understand function, not as useless as "ucdf".

 

     Limits:  0 <= x

 

function xcdf(x:xfloat):xfloat

function xinv(prob:xfloat):xfloat

function xpf(x:xfloat):xfloat

 

 

SNEDECOR'S F.  If X1 and X2 are independent chi-square variates

with v1 and v2 degrees of freedom, then the expression (the

"F-ratio") (X1/v1)/(X2/v2) follows an F-distribution.

 

        The CDF returns the probability that an observed F-ratio

will be less than "f" with "dof1" and "dof2" degrees of freedom.

Low and high values indicate significant differences between two

sample variances.

 

     Limits:  0 < f,dof1,dof2

 

function fcdf(f,dof1,dof2:xfloat):xfloat

function fpf(f,dof1,dof2:xfloat):xfloat

 

 

GAMMA.

 

     Limits: 0 <= x

             0 < p < 1

 

function gamcdf(x,p:xfloat):xfloat

function gampf(x,p:xfloat):xfloat

 

 

GAUSSIAN.  The Gaussian (Normal) distribution arises as the sum

of many small, independent random variates.

 

        The CDF returns the probability that a random gaussian

deviate (mean 0, var 1) will be less than "x". Another

easy-to-understand function, and quite useful considering the

number of ways that the gaussian distribution arises.

 

function gcdf(x:xfloat):xfloat

function gpf(x:xfloat):xfloat

 

 

GEOMETRIC. The geometric distribution describes the interval

between Bernoulli successes, or the number of trials until the

first success.

 

function geopf(x,p:xfloat):xfloat
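
A minimal sketch of the "number of trials until first success"
reading follows.  It assumes "x" counts trials starting from 1;
whether the unit counts trials or failures is not stated here, so
treat that as an assumption.  SketchGeoPf is a hypothetical name,
not the unit's routine.

```pascal
program GeometricSketch;
{ Illustrative sketch only; this is not the unit's routine.       }
type
  xfloat = real;  { stand-in for the float type of the math unit }

{ P(first success occurs on trial x) = p * (1-p)^(x-1), x = 1, 2, ... }
function SketchGeoPf(x: integer; p: xfloat): xfloat;
var
  i: integer;
  q: xfloat;
begin
  if x < 1 then
    SketchGeoPf := 0.0
  else
  begin
    q := p;
    for i := 1 to x - 1 do
      q := q * (1.0 - p);  { one factor per earlier failure }
    SketchGeoPf := q;
  end;
end;

var
  x: integer;
begin
  { rolling a die until the first six appears }
  for x := 1 to 5 do
    writeln('P(first six on roll ', x, ') = ',
            SketchGeoPf(x, 1.0 / 6.0):0:6);
end.
```

Each successive probability is smaller than the last by the
factor (1-p), here 5/6.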

 

 

HYPERGEOMETRIC.  This is perhaps the most primitive of the

probability distributions in this collection.  In a finite

population of "Npop" items there is a specific number "T" of

items of interest.  Examine "Nsamp" of the population items

(sampled without replacement).  The number of items of interest

in the sample follows a Hypergeometric distribution.

 

     Limits: 0 <= x <= min(Nsamp,T)

             0 <= Nsamp,T <= Npop

 

function hgeopf(x,Nsamp,T,Npop:xfloat):xfloat
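
The sampling scheme described above leads to a ratio of binomial
coefficients.  The sketch below is illustrative only: Choose and
SketchHgeoPf are hypothetical names, the arguments are taken as
integers, and the argument order simply mirrors the declaration
above.

```pascal
program HypergeometricSketch;
{ Illustrative sketch only; this is not the unit's routine.       }
type
  xfloat = real;  { stand-in for the float type of the math unit }

{ Binomial coefficient C(n,k) as a running product; 0 if k is out of range }
function Choose(n, k: integer): xfloat;
var
  i: integer;
  c: xfloat;
begin
  if (k < 0) or (k > n) then
    Choose := 0.0
  else
  begin
    c := 1.0;
    for i := 1 to k do
      c := c * (n - k + i) / i;
    Choose := c;
  end;
end;

{ P(x items of interest in the sample) =                         }
{   C(T,x) * C(Npop-T, Nsamp-x) / C(Npop, Nsamp)                 }
function SketchHgeoPf(x, Nsamp, T, Npop: integer): xfloat;
begin
  SketchHgeoPf := Choose(T, x) * Choose(Npop - T, Nsamp - x)
                  / Choose(Npop, Nsamp);
end;

begin
  { five cards dealt from a 52-card deck that holds four aces }
  writeln('P(no aces) = ', SketchHgeoPf(0, 5, 4, 52):0:6);
  writeln('P(one ace) = ', SketchHgeoPf(1, 5, 4, 52):0:6);
end.
```

For the card example the two probabilities come to roughly 0.66
and 0.30.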

 

 

KOLMOGOROV-SMIRNOV D.

 

        The CDF returns the probability that the observed

D-statistic will be less than "d". High values indicate a

significant difference between source distributions.

 

function kscdf(d,dof:xfloat):xfloat; { NR calls this "PROBKS" }

 

 

MAXWELL.  If X1, X2, and X3 are independent, gaussian random

variates with zero mean and unit variance, then sqrt( sqr(X1) +

sqr(X2) + sqr(X3)) has a Maxwell distribution.  This distribution

arises in three-dimensional applications with "spherical error

probabilities".

 

     Limits:  0 <= x

 

 

PASCAL. (Negative Binomial)  The distribution of failures in a

run of Bernoulli trials that have exactly "n" successes where the

probability of success of each trial is "p".

 

     Limits:  0 <= x

              0 < p < 1

 

function pascdf(x,n,p:xfloat):xfloat

function paspf(x,n,p:xfloat):xfloat
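
One common way to write the Pascal PF is sketched below.  It
assumes the run of trials ends on the "n"-th success, so that "x"
failures can be interleaved with the first n-1 successes in
C(x+n-1, x) ways.  SketchPasPf is a hypothetical name, not the
unit's routine, and the counts are taken as integers.

```pascal
program PascalSketch;
{ Illustrative sketch only; this is not the unit's routine.       }
type
  xfloat = real;  { stand-in for the float type of the math unit }

{ P(x failures before the n-th success) = C(x+n-1, x) p^n (1-p)^x }
function SketchPasPf(x, n: integer; p: xfloat): xfloat;
var
  i: integer;
  term: xfloat;
begin
  term := 1.0;
  for i := 1 to n do
    term := term * p;                            { p^n }
  for i := 1 to x do
    term := term * (n - 1 + i) / i * (1.0 - p);  { C(x+n-1,x) (1-p)^x }
  SketchPasPf := term;
end;

var
  x: integer;
  total: xfloat;
begin
  { three successes wanted, each trial succeeding half the time }
  writeln('P(0 failures) = ', SketchPasPf(0, 3, 0.5):0:6);
  writeln('P(2 failures) = ', SketchPasPf(2, 3, 0.5):0:6);
  { the probabilities over all x should add up to very nearly 1 }
  total := 0.0;
  for x := 0 to 60 do
    total := total + SketchPasPf(x, 3, 0.5);
  writeln('sum over x = 0..60: ', total:0:6);
end.
```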

 

 

POISSON. Poisson is a limiting case of the binomial distribution

as the probability of each individual Bernoulli event goes to

zero, and the number of trials goes to infinity, but the expected

number of events remains constant.

 

        The CDF returns the probability that a Poisson (mean

"mu") random event will be less than "k" (that is, 0 to k-1).

Low values indicate that "mu" is too high; high, "mu" too low.

 

     Limits: 0 <= k

             0 < mu

 

function poicdf(k,mu:xfloat):xfloat

function poipf(k,mu:xfloat):xfloat
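
A minimal sketch of the Poisson PF, and of a CDF defined as above
(the sum over 0 to k-1), is given below.  SketchPoiPf and
SketchPoiCdf are hypothetical names, not the unit's routines, and
"k" is taken as an integer.

```pascal
program PoissonSketch;
{ Illustrative sketch only; these are not the unit's routines.    }
type
  xfloat = real;  { stand-in for the float type of the math unit }

{ P(exactly k events) = exp(-mu) * mu^k / k!, built up term by term }
function SketchPoiPf(k: integer; mu: xfloat): xfloat;
var
  i: integer;
  term: xfloat;
begin
  term := exp(-mu);
  for i := 1 to k do
    term := term * mu / i;
  SketchPoiPf := term;
end;

{ P(fewer than k events): the PF summed over 0 .. k-1 }
function SketchPoiCdf(k: integer; mu: xfloat): xfloat;
var
  j: integer;
  total: xfloat;
begin
  total := 0.0;
  for j := 0 to k - 1 do
    total := total + SketchPoiPf(j, mu);
  SketchPoiCdf := total;
end;

begin
  { an average of three events per interval }
  writeln('P(exactly 2 events)    = ', SketchPoiPf(2, 3.0):0:6);
  writeln('P(fewer than 2 events) = ', SketchPoiCdf(2, 3.0):0:6);
end.
```

With mu = 3 these come to roughly 0.224 and 0.199.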

 

 

RAYLEIGH.  If X1 and X2 are independent gaussian random variates

with zero mean and unit variance, then sqrt( sqr(X1) + sqr(X2))

has a Rayleigh distribution.  This distribution arises in

two-dimensional applications with "circular error probabilities".

 

     Limits: 0 <= x
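
No Rayleigh functions are listed for this unit, but the
definition above yields simple closed forms.  The sketch below is
illustrative only: SketchRayCdf, SketchRayPf, and SketchRayInv are
hypothetical names that merely follow the unit's naming pattern.

```pascal
program RayleighSketch;
{ Illustrative sketch only; the unit lists no Rayleigh routines.  }
type
  xfloat = real;  { stand-in for the float type of the math unit }

{ P(sqrt(X1^2 + X2^2) <= x) = 1 - exp(-x^2/2) for x >= 0 }
function SketchRayCdf(x: xfloat): xfloat;
begin
  if x <= 0.0 then
    SketchRayCdf := 0.0
  else
    SketchRayCdf := 1.0 - exp(-0.5 * sqr(x));
end;

{ Density: x * exp(-x^2/2) for x >= 0 }
function SketchRayPf(x: xfloat): xfloat;
begin
  if x < 0.0 then
    SketchRayPf := 0.0
  else
    SketchRayPf := x * exp(-0.5 * sqr(x));
end;

{ Inverse of the CDF; valid for 0 <= prob < 1 }
function SketchRayInv(prob: xfloat): xfloat;
begin
  SketchRayInv := sqrt(-2.0 * ln(1.0 - prob));
end;

begin
  writeln('cdf(1)        = ', SketchRayCdf(1.0):0:6);
  writeln('pf(1)         = ', SketchRayPf(1.0):0:6);
  { radius of the circle that contains half of the probability }
  writeln('median radius = ', SketchRayInv(0.5):0:6);
end.
```

The median radius, sqrt(2 ln 2) or about 1.177 standard
deviations, is the circle that contains half of the probability.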

 

 

STUDENT'S T. Student's T describes the distribution of sample

means drawn from a normal distribution whose variance is

estimated from the sample.

 

        The CDF returns the probability that an observed

t-statistic will be greater than "t" (or less than "-t") with

"dof" degrees of freedom.  Two-tail test.  Low values indicate

significant differences between sample means.

 

function tcdf(t,dof:xfloat):xfloat

function tpf(t,dof:xfloat):xfloat

 

 

UNIFORM. (Rectangular) The trivial "uniform" probability

distribution function.

 

        The CDF returns the probability that an observed uniform

deviate between 0 and 1 will be less than "x". Not particularly

useful, but provided because the distribution is easy to

understand.

 

       Limits: 0 <= x <= 1

 

function ucdf(x:xfloat):xfloat

function uinv(prob:xfloat):xfloat

function upf(x:xfloat):xfloat

 

 

References:

 

[HMF]   Abramowitz and Stegun, Handbook of Mathematical Functions,

        Government Printing Office. (also available as a Dover

        reprint)

 

[HMS]   Beyer, Handbook of Mathematical Sciences, CRC Press.

 

[BST]   Beyer, Basic Statistical Tables, CRC Press.

 

[SNA]   Knuth, Seminumerical Algorithms.

 

[FFP]   Menzel, Fundamental Formulas of Physics, Dover reprint.

 

[HAM]   Pearson, Handbook of Applied Mathematics, Van Nostrand

        Reinhold.

 

[NR]    Press, et al., Numerical Recipes, Cambridge.