Estimation of covariance matrices
In multivariate statistics, the importance of the Wishart distribution stems in part from the fact that it is the probability distribution of the maximum likelihood estimator of the covariance matrix of a multivariate normal distribution. Although no one is surprised that the estimator of the population covariance matrix is simply the sample covariance matrix, the mathematical derivation is perhaps not widely known and is surprisingly subtle and elegant.
The multivariate normal distribution
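A random vector $X \in \mathbb{R}^{p\times 1}$ (a $p\times 1$ "column vector") has a multivariate normal distribution with a nonsingular covariance matrix $\Sigma$ precisely if $\Sigma \in \mathbb{R}^{p\times p}$ is a positive-definite matrix and the probability density function of $X$ is
$$f(x) = (2\pi)^{-p/2} \det(\Sigma)^{-1/2} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
where $\mu \in \mathbb{R}^{p\times 1}$ is the expected value. The matrix $\Sigma$ is the higher-dimensional analog of what in one dimension would be the variance.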
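As a rough illustration of the density, the following sketch evaluates it for the special case $p = 2$ using only the Python standard library. The function name `mvn_density_2d` and the example inputs are assumptions made for this illustration; the covariance matrix is assumed to be symmetric.

```python
import math

def mvn_density_2d(x, mu, sigma):
    """Bivariate normal density at x; sigma is assumed symmetric positive definite."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # For a symmetric 2x2 matrix, positive definiteness means a > 0 and det > 0.
    if a <= 0 or det <= 0:
        raise ValueError("sigma must be positive definite")
    # Inverse of a 2x2 matrix via the adjugate formula.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T sigma^{-1} (x - mu).
    quad = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    p = 2
    return (2 * math.pi) ** (-p / 2) * det ** -0.5 * math.exp(-0.5 * quad)

# With the identity covariance, the density at the mean is 1 / (2 * pi).
density_at_mean = mvn_density_2d([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In practice one would normally use a linear-algebra library and work with the log-density; the hand-written 2×2 inverse is only there to keep the sketch self-contained.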
Maximum-likelihood estimation
Suppose now that $X_1, \dots, X_n$ are independent and identically distributed with the distribution above. Based on the observed values $x_1, \dots, x_n$ of this sample, we wish to estimate $\Sigma$ (we adhere to the convention of writing random variables as capital letters and data as lower-case letters).
First steps
It is fairly readily shown that the maximum-likelihood estimate of the expected value $\mu$ is the "sample mean"
$$\bar{x} = \frac{x_1 + \cdots + x_n}{n}.$$
See the section on estimation in the article on the normal distribution for details; the process here is similar.
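For concreteness, the sample mean of vector-valued data is just the componentwise average. A minimal sketch, with a made-up data set and a hypothetical helper name `sample_mean`:

```python
def sample_mean(xs):
    """Componentwise average of a list of equal-length vectors."""
    n = len(xs)
    p = len(xs[0])
    return [sum(x[j] for x in xs) / n for j in range(p)]

# Three illustrative observations in R^2 (invented for this example).
data = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
x_bar = sample_mean(data)
```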
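Since the estimate of $\mu$ does not depend on $\Sigma$, we can just substitute it for $\mu$ in the likelihood function
$$L(\mu, \Sigma) \propto \det(\Sigma)^{-n/2} \exp\left(-\frac{1}{2}\sum_{i=1}^n (x_i-\mu)^T \Sigma^{-1} (x_i-\mu)\right)$$
and then seek the value of $\Sigma$ that maximizes this.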
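We have
$$L(\bar{x}, \Sigma) \propto \det(\Sigma)^{-n/2} \exp\left(-\frac{1}{2}\sum_{i=1}^n (x_i-\bar{x})^T \Sigma^{-1} (x_i-\bar{x})\right).$$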
The trace of a 1 × 1 matrix
Now we come to the first surprising step.
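Regard the scalar
$$(x_i-\bar{x})^T \Sigma^{-1} (x_i-\bar{x})$$
as the trace of a $1\times 1$ matrix!
This makes it possible to use the identity $\operatorname{tr}(AB) = \operatorname{tr}(BA)$, which holds whenever $A$ and $B$ are matrices so shaped that both products exist. We get
$$L(\bar{x}, \Sigma) \propto \det(\Sigma)^{-n/2} \exp\left(-\frac{1}{2}\sum_{i=1}^n \operatorname{tr}\left((x_i-\bar{x})^T \Sigma^{-1} (x_i-\bar{x})\right)\right)$$
$$= \det(\Sigma)^{-n/2} \exp\left(-\frac{1}{2}\sum_{i=1}^n \operatorname{tr}\left((x_i-\bar{x})(x_i-\bar{x})^T \Sigma^{-1}\right)\right)$$
(so now we are taking the trace of a $p\times p$ matrix!)
$$= \det(\Sigma)^{-n/2} \exp\left(-\frac{1}{2}\operatorname{tr}\left(S\Sigma^{-1}\right)\right)$$
where
$$S = \sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})^T \in \mathbb{R}^{p\times p}.$$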
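The trace step can be checked numerically. The sketch below, again restricted to $p = 2$ and using made-up data and an arbitrary positive-definite $\Sigma$, compares the sum of the scalar quadratic forms with the trace of the scatter matrix times $\Sigma^{-1}$. All names here (`inverse_2x2`, `scatter`, `trace_form`, and so on) are assumptions for illustration.

```python
def inverse_2x2(m):
    """Inverse of a nonsingular 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Four illustrative observations in R^2 and an arbitrary positive-definite Sigma.
data = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0], [6.0, 2.0]]
n = len(data)
x_bar = [sum(x[j] for x in data) / n for j in range(2)]
sigma_inv = inverse_2x2([[2.0, 1.0], [1.0, 3.0]])

sum_of_quadratic_forms = 0.0
scatter = [[0.0, 0.0], [0.0, 0.0]]  # accumulates the sum of outer products
for x in data:
    dev = [x[0] - x_bar[0], x[1] - x_bar[1]]
    # Scalar (x_i - x_bar)^T Sigma^{-1} (x_i - x_bar).
    sum_of_quadratic_forms += sum(
        dev[i] * sigma_inv[i][j] * dev[j] for i in range(2) for j in range(2)
    )
    # Outer product (x_i - x_bar)(x_i - x_bar)^T added into the scatter matrix.
    for i in range(2):
        for j in range(2):
            scatter[i][j] += dev[i] * dev[j]

# tr(S Sigma^{-1}) = sum over i, j of S[i][j] * Sigma^{-1}[j][i].
trace_form = sum(scatter[i][j] * sigma_inv[j][i] for i in range(2) for j in range(2))
```

The two quantities `sum_of_quadratic_forms` and `trace_form` agree up to floating-point rounding, which is the content of the cyclic-trace identity applied term by term.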
Using the spectral theorem
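It follows from the spectral theorem of linear algebra that a positive-definite symmetric matrix $S$ has a unique positive-definite symmetric square root $S^{1/2}$ (the scatter matrix $S$ is positive definite with probability 1 provided $n > p$). We can again use the "cyclic property" of the trace to write
$$\det(\Sigma)^{-n/2} \exp\left(-\frac{1}{2}\operatorname{tr}\left(S^{1/2}\Sigma^{-1}S^{1/2}\right)\right).$$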
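Let $B = S^{1/2}\Sigma^{-1}S^{1/2}$. Since $\det(B) = \det(S)\det(\Sigma)^{-1}$, we have $\det(\Sigma)^{-n/2} = \det(S)^{-n/2}\det(B)^{n/2}$, and the expression above becomes
$$\det(S)^{-n/2}\det(B)^{n/2}\exp\left(-\frac{1}{2}\operatorname{tr}(B)\right).$$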
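The positive-definite matrix $B$ can be diagonalized, and then the problem of finding the value of $B$ that maximizes
$$\det(B)^{n/2}\exp\left(-\frac{1}{2}\operatorname{tr}(B)\right)$$
reduces to the problem of finding the values of the diagonal entries $\lambda_1, \dots, \lambda_p$ that maximize
$$\lambda_i^{n/2}\exp(-\lambda_i/2).$$
The reduction works because the determinant of $B$ is the product of its eigenvalues and the trace is their sum, so the objective factors into one such term per eigenvalue. This is just a calculus problem and we get $\lambda_i = n$, so that $B = nI_p$, i.e., $n$ times the $p\times p$ identity matrix.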
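The one-variable calculus step can be written out as follows. The function name $g$ is introduced here only for illustration; it is the logarithm of the quantity being maximized, which has the same maximizer because the logarithm is increasing.

```latex
\begin{align*}
g(\lambda) &= \log\left(\lambda^{n/2} e^{-\lambda/2}\right)
            = \frac{n}{2}\log\lambda - \frac{\lambda}{2}, \qquad \lambda > 0, \\
g'(\lambda) &= \frac{n}{2\lambda} - \frac{1}{2} = 0
            \quad\Longrightarrow\quad \lambda = n, \\
g''(\lambda) &= -\frac{n}{2\lambda^{2}} < 0 .
\end{align*}
```

Since $g$ is strictly concave on $\lambda > 0$, the stationary point $\lambda = n$ is the unique maximizer.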
Concluding steps
Finally we get
$$\Sigma = S^{1/2}B^{-1}S^{1/2} = S^{1/2}\left(\frac{1}{n}I_p\right)S^{1/2} = \frac{S}{n},$$
i.e., the $p\times p$ "sample covariance matrix" (here with divisor $n$ rather than $n-1$)
$$\frac{S}{n} = \frac{1}{n}\sum_{i=1}^n (X_i-\bar{X})(X_i-\bar{X})^T$$
is the maximum-likelihood estimator of the "population covariance matrix" $\Sigma$. At this point we are using a capital $X$ rather than a lower-case $x$ because we are thinking of it "as an estimator rather than as an estimate", i.e., as something random whose probability distribution we could profit by knowing. This random matrix can be shown to have a Wishart distribution with $n-1$ degrees of freedom.
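As a sanity check on the conclusion, the sketch below computes the scatter matrix divided by $n$ for a small made-up data set with $p = 2$ and compares the log-likelihood (up to an additive constant, with the mean fixed at the sample mean) at that matrix against a few other positive-definite candidates. The helper names and the candidate matrices are assumptions chosen for illustration, not part of the derivation.

```python
import math

def scatter_matrix(xs):
    """Sum of outer products of deviations from the sample mean (p = 2)."""
    n = len(xs)
    x_bar = [sum(x[j] for x in xs) / n for j in range(2)]
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in xs:
        dev = [x[0] - x_bar[0], x[1] - x_bar[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += dev[i] * dev[j]
    return s

def profile_loglik(sigma, s, n):
    """-(n/2) log det(sigma) - (1/2) tr(s sigma^{-1}), dropping constants."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    tr = sum(s[i][j] * inv[j][i] for i in range(2) for j in range(2))
    return -0.5 * n * math.log(det) - 0.5 * tr

data = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0], [6.0, 2.0]]
n = len(data)
S = scatter_matrix(data)

# Candidate maximizer from the derivation: the scatter matrix divided by n.
sigma_hat = [[S[i][j] / n for j in range(2)] for i in range(2)]
ll_hat = profile_loglik(sigma_hat, S, n)

# A few other positive-definite matrices to compare against.
candidates = [
    [[S[i][j] / (n - 1) for j in range(2)] for i in range(2)],  # divisor n - 1
    [[3.5, 0.0], [0.0, 2.0]],                                   # off-diagonals dropped
    [[1.0, 0.0], [0.0, 1.0]],                                   # identity
]
candidate_logliks = [profile_loglik(c, S, n) for c in candidates]
```

Each entry of `candidate_logliks` comes out below `ll_hat`, which is consistent with the scatter matrix divided by $n$ being the maximizer. Comparing against a handful of candidates is of course only a spot check, not a proof.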
