[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

12. vpdl: Probability Distributions (Experimental)

Chapter summary: vpdl is a library of probability distributions.

This library contains data structures to represent univariate and multivariate probability distributions and provides algorithms to operate on them. This includes estimating a distribution from data points and sampling from a distribution. vpdl is built on top of vnl and vnl_algo.

vpdl is comprised of two programming paradigms: generic programming and polymorphic inheritance. The generic programming part is in its own sub-library called vpdt. vpdt is a template library (like STL). There is no compiled library for vpdt, only a collection of header files in subdirectory vpdl/vpdt. vpdt works with vnl types, but in many cases can generalize to other types. The goal of vpdt is to provide generic implementations that are both time and memory efficient when types are known at compile time.

The rest of vpdl uses a polymorphic design and provides greater run time flexibility and easy of use. It is limited to distributions over scalar, vnl_vector, and vnl_vector_fixed types.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

12.1 History

vpdl is the combination of two different design patterns for probability distributions. It was formed from the merger of three contrib libraries: mul/vpdfl, mul/pdf1d, and brl/bbas/bsta.

Created by Manchester, vpdfl provided a polymorphic hierarchy (using virtual functions) for multivariate distributions based on vnl_vector and vnl_matrix types. For univariate distributions, pdf1d mirrored the design of vpdfl, but used scalar types (i.e. double). These libraries were very flexible at run time. Both distribution type and, in the case of vpdfl, dimension could be selected at run time.

Created by Brown, bsta provided a generic programming hierarchy (using templates) for both univariate and multivariate distributions. Template parameters specified scalar type (float or double) and dimension. Templates allowed the same code base to use scalars in the univariate case and vnl_vector_fixed and vnl_matrix_fixed in the multivariate case. The goal of bsta was to be very efficient. Many optimizations are possible by assuming types and dimension are known at compile time.

vpdl was designed as a core library to meet the needs of both original designs. The main library generalizes Manchester's polymorphic design with a templated distribution library. This creates polymorphic inheritance hierarchy for each choice to template parameters. The templates allow hierarchies based on scalar, vnl_vector, and vnl_vector_fixed types. When possible, the implementations of vpdl distributions are wrappers around corresponding vpdt distributions. The vpdt distributions provide even more generalized implementations but without virtual functions or inheritance.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

12.2 Polymorphic Distributions

Each polymorphic distribution is derived (directly or indirectly) from a common templated base class called vpdl_distribution<T,n>. Template parameter T specifies the scalar numeric type (float or double) and n specifies the dimension. In particular:

The distribution class is an abstract base class. The most important functions it defines are:

virtual T density (const vector &pt) const=0;

Evaluate the unnormalized density at a point.

virtual T norm_const () const=0;

The normalization constant for the density.

virtual T prob_density (const vector &pt) const;

Evaluate the probability density at a point. Defaults to density(pt) * norm_const().

These functions allow the density of the distribution to be evaluated independently from the normalization constant. This is important because the normalization constant is sometimes expensive to compute and not required for some calculations. Furthermore, the normalization constant is independent of the evaluation point so it can be computed once in advance and reused. Alternatively, the function prob_density returns the normalized density directly.

The type vector is a typedef defined in the class that selects the appropriate vector class depending on T and n. Similarly, the type matrix is defined to the appropriate matrix type.

Several other functions are defined on the distribution base class including: gradient_density, cumulative_prob, compute_mean, and compute_covar. All of these functions provide a polymorphic interface to probability distributions. However, in most cases, the vpdl classes are simply light wrappers around generic vpdt implementations that do the real work.

One additional component of the polymorphic wrappers is a class hierarchy. In vpdt, the distribution classes need not have any inheritance relations, but in vpdl the currently implemented distributions are arranged into the following class hierarchy. Each top level class below is derived directly from vpdl_distribution (template arguments are omitted).


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

12.3 Generic Probability (vpdt)

The subdirectory vpdl/vpdt contains the generic programming components of this library. vpdt is a self-sufficient template library that does not depended on the rest of vpdl. However, there is nothing to stop vpdl distribution classes from being used as vpdt distribution types as long as they meet the requirements. Whenever possible, vpdl distributions should be wrappers around a generic vpdt implementation. This strategy allows the code to be used both in generic and polymorphic designs.

Generic programming uses concepts to implement compile time type polymorphism rather than inheritance to implement run time object-oriented polymorphism. Concepts are sets of requirements that a data type must meet. To satisfy a concept a class must define the required typedefs, static constants, member functions, or related global functions. Classes that satisfy a concept need not be related by inheritance. The Standard Template Library (STL) is built upon the definition of concepts. In STL, some concepts (like Iterators) are general enough to include built-in types (like pointers). In vpdt, there are also some basic concepts which can be satisfied by built-in types. However, the most useful concepts (like Distributions) are a bit more specific and require a class implementation.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

12.3.1 Basic Concepts

A collection of more basic concepts will form the building blocks of the probability distribution concept. The relevant concepts are:

vnl data structures satisfiy these concepts and are used in practice. However, other representations could be used if the appropriate functions are provided. These concepts are all interrelated by various function requirements. The type that meets the Field concept is considered the primary type. For each Field type there should be a unique associated Scalar, Vector, and Matrix type The vpdt_field_traits<F> struct defines these types for each F. vpdt_field_traits<F> needs to be specialized for each Field type. This traits struct defines the following typedefs:

vpdt_field_traits<F>::dimension

This is actually a static const unsigned int and indicates the fixed dimension of the Field type

vpdt_field_traits<F>::scalar_type

The Scalar type associated with the Field type

vpdt_field_traits<F>::field_type

The Field type itself (F==field_type). Redundant, but provided for consistency.

vpdt_field_traits<F>::vector_type

The Vector type associated with the Field type

vpdt_field_traits<F>::matrix_type

The Matrix type associated with the Field type

In the special case of univariate distributions, all four of these types become equal, and dimension==1. There is also a special typedef vpdt_field_traits<F>::type_is_scalar defined to void that only exists if F is scalar. Similarly, vpdt_field_traits<F>::type_is_vector is defined to void only if F is non-scalar. The existence of these special types is used to disambiguate some template specializations, and to provide faster implementations for the scalar case.

Scalar

A probability distribution is defined over some scalar field. The Scalar concept represents this value at each point in space. It is satisfied by floating point types. For now this is likely to be limited to double or float. Scalars must support all built in arithmetic operators (+, -, *, /, etc.) as well as all standard cmath functions like sqrt, exp, and log. If the type does not support these functions directly, it must be able to implicitly cast itself to and from a type that does.

Field

Each point in a probability space is represented by the Field concept. When used in vpdl distributions the Field is satisfied by either a scalar, a vnl_vector, or a vnl_vector_fixed. The Field has an associated Scalar and dimension. For dimension n, a Field contains n values of its Scalar type. A Field actually has two measure of dimension: a fixed (compile time) dimension and a variable (run time) dimension. These dimensions may or may not be equal. Field types with data allocated on the stack will have size specified at compile time. The fixed dimension and variable dimension will both equal this fixed size. Field types with data allocated on the heap will have a fixed dimension of zero and a variable dimension equal to the number of currently allocated Scalar objects.

A Field type will be equal to its Scalar type in the case of univariate distributions. Since Scalar types are usually built-in C++ types, they can not be required to implement any member functions. Instead, they are required to implement a set of overloaded global functions to perform the required set of actions. For the standard types used in vpdl, these functions are implemented in vpdl/vpdt/vpdt_access.h. For some types that satisfy the concept these functions are trivial or even empty. In the table below the Field type is denoted field and its corresponding Scalar type is scalar.

unsigned int vpdt_size(const field&)

Access the run time size (i.e. dimension)

void vpdt_set_size(field&, unsigned int)

Change the run time size for heap-based structures, otherwise do nothing.

void vpdt_fill(field&, const scalar&)

Fill all dimensions of the field with the scalar value

const scalar& vpdt_index(const field&, unsigned int)

Access a scalar element reference (const) by index

scalar& vpdt_index(field&, unsigned int)

Access a scalar element reference (non-const) by index

The Field type should also support the subtraction operator on two Field types. The return type for the difference of two Field types is a Vector type.

Vector

The Vector concept represents the difference between two Field values. In the case of the default vnl types, the Field and Vector types are the same. The requirements for the Vector are a superset of the requirements for a Field. A Vector should support all the operation of Field plus several other arithmetic operations:

Vectors should also define these global functions

Matrix

The Matrix concept actually refers to a square matrix. It is used for second order statistics (like covariance). Since it is square, the matrix also has a single (fixed or variable) dimension like the Field and Vector. It requires all the same vpdt_access.h functions except the signature of the vpdt_index function now has two indices

A few arithmetic operations are also needed:


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

12.3.2 Distribution Concept

The Distribution concept is the central concept in vpdt. The Distribution is a probability distribution class that is analogous to vpdl_distribution<T,n>. In fact, vpdl_distribution<T,n> and all its subclasses satisfy the Distribution concept. The Distribution requires a subset of the functions that are declared virtual on vpdl_distribution<T,n>. The following example lays out the required API that a Distribution is required to have. In this example, it is assumed that a consistent set of basic types have been defined such that T satisfies Scalar, F satisfies Field, V satisfies Vector, and M satisfies Matrix.

 
class vpdt_distribution
{
 public:
  //: The field type
  typedef F field_type;


  //: Return the variable (run time) dimension
  unsigned int dimension() const;

  //: Evaluate the unnormalized density at a point
  // This must be multiplied by norm_const() to integrate to 1
  T density(const F& pt) const;

  //: Compute the gradient of the density function, returned in \a g
  // The return value of the function is the density itself
  T gradient_density(const F& pt, V& g) const;

  //: Compute the normalization constant (independent of sample point).
  T norm_const() const;

  //: Evaluate the cumulative distribution function at a point
  // This is the integral of the density function from negative infinity
  // (in all dimensions) to the point in question
  T cumulative_prob(const F& pt) const;

  //: Compute the mean of the distribution.
  void compute_mean(F& m) const;

  //: Compute the covariance matrix of the distribution.
  void compute_covar(M& c) const;
};

The other member functions that are defined in vpdl_distribution<T,n> are not needed for a vpdt Distribution because they can be implemented in terms of the required functions. These extra functions are provided as global functions instead. For example, T vpdt_prob_density(const dist& d, const F& pt) is a function that returns d.density(pt) * d.norm_const(). These functions can be overloaded when a more efficient implementation exists for a particular distribution class. For example, the default vpdt_log_density function applied to a Gaussian distribution would compute the log of an exponential. An overload of vpdt_log_density for Gaussians can eliminate this extra unnecessary step.


[ << ] [ >> ]           [Top] [Contents] [Index] [ ? ]

This document was generated by Rumpelstiltskin on November, 22 2009 using texi2html 1.78.