General Least-Squares - Direct Solutions and Bundle Adjustments

by David Manthey (
Copyright 1999-2005 by David Manthey

A PDF version of this paper is available here: LEAST.PDF

Table of Contents

List of Symbols
General Technique
   Potential Problems
Linear Technique
Error Residuals, Ellipsoids, and Confidence
An Example Problem
   A Linear Solution
   A Nonlinear Solution
   An Error Ellipse


The method of least squares is used to solve a set of linear equations having more equations than unknown variables. Since there are more equations than variables, the solution will not be exactly correct for each equation; rather, the process minimizes the sum of the squares of the residual errors. The method is very powerful and can be applied to numerous applications.

In the general case, the least-squares method is often used to solve a set of non-linear equations that have been linearized using a first-order Taylor-series expansion. Solving non-linear equations is an iterative process using Newton’s method. The speed of convergence is dependent on the quality of an initial guess for the solution. The non-linear least-squares method is often referred to as a bundle adjustment since all of the values of an initial guess of the solution are modified together (adjusted in a bundle). This technique is also occasionally referred to as the Gauss-Newton method.

The least-squares method is not new; Legendre invented the method is 1805, and reference books have mentioned least squares in their titles as early as the 1870s. However, in most books about least squares, the general method is bound inextricably with the book's primary subject matter. There is little uniformity between different books, or even within a single book, on the designation of different variables. This makes it difficult to understand the method. This paper provides a new description of least squares which hopefully describes the process in a simple and straight-forward manner. A complete derivation of the method is not provided, only a functional description. Familiarity with matrices and partial derivatives is assumed.

An example problem is provided, computing the location of a circle based on a set of points. The example shows a linear solution and an iterative nonlinear solution to this problem as well as some error analysis.

List of Symbols

aicoefficients of a linear equation
erelative size of adjustment compared to initial value
Fi(x1,x2,x3,...)equation based on a set of unknowns. This is not necessarily a function. The equation must be solved for the value (i.e., Fi(x1,x2,x3,...)=ki) whose error will be minimized
Fisher(a,v1,v2) the Fisher distribution
Jthe Jacobian matrix. This is the partial differentials of each equation with respect to each unknown. Sometimes written J=Fi(x1,x2,x3,...), where is del, the gradient operator
Kvector of residuals. Each component is the difference between the observation, ki, and the equation evaluated for the initial guess, xi0
kian observation. The least-squares process minimizes the error with respect to the observations
Nthe normal matrix. N=JtWJ
Qcofactor matrix. The inverse of W. This is typically the standard deviations of the measurements
Qxxthe covariance matrix. Also called the variance-covariance matrix. Qxx=N-1
Q'xxa sub-matrix of the covariance matrix. Used to calculate an error ellipse
qxxija value within the covariance matrix
rthe degrees of freedom. This is the number of equations minus the number of unknowns
S02the reference variance. This is used to scale the covariance matrix
Saxislength of the semi-axis of an error ellipse based on the covariance matrix
Saxis%length of the semi-axis of an error ellipse with a specific confidence level
Wweighting matrix. This is a square symmetric matrix. For independent measurements, this is a diagonal matrix. Larger values indicate greater significance
withe diagonal components of the weighting matrix
Xvector of initial guesses for the unknowns
X'vector of refined guesses for the unknowns. X'=X+DX
xian unknown value. This is solved for in the least-squares process
x'ithe refined guess for an unknown value. x'i=xi0+Dxi
xi0the initial guess for an unknown
DXvector of adjustment values. The initial guesses are adjusted by this amount
Dxian adjustment value. This is added to the initial guess to improve the solution
G(v)the Gamma distribution

General Technique

The general least squares process can be used to solve a set of equations for a set of unknowns. The only requirement is that there are at least as many equations as there are unknowns.

If the equations are linear, the least-squares process will produce a direct solution for the unknowns. If the equations are not linear, an initial guess of the unknowns is required, and the result is an adjustment to the initial parameters. This is repeated until the results converge (the adjustments become very close to zero). The linear case is an adjustment using zero as the initial guess of all parameters.

The process requires a set of equations with the unknowns on one side and some known quantity on the other. Let xi be the set of unknowns, and let the equations be of the form


where ki is the observation (value) whose least-squares error will be minimized. Since there are more equations than unknowns, the solution of the unknowns will not be exact. Using the solution to compute the equation, Fi(x1,x2,x3,...), will not generate the exact observation value, ki. The square of the difference between the evaluated equation and the observation is minimized.

There is typically one equation for each observation. In photogrammetry, this might be one equation for each x pixel coordinate and one equation for each y pixel coordinate. Each equation is not required to have all of the unknowns in it.

The Jacobian matrix, J, is the matrix of the partial differentials of each equation with respect to each unknown. That is,

In general, the height of the Jacobian matrix will be larger than the width, since there are more equations than unknowns.

Furthermore, let the vector K be the vector of the residuals. A residual is the difference between the observation and the equation calculated using the initial values. That is

One further parameter is a weighting matrix, W. This is a matrix which includes the expected confidence of each equation and also includes any dependence of the equations. A larger value in the weighting matrix increases the importance of the corresponding equation (larger values indicate greater confidence). It is a square symmetric matrix with one row per equation. The main diagonal contains the weights of the individual equations, while the off-diagonal entries are the dependencies of equations upon one another. If all of the observations are independent, this will be a diagonal matrix. The cofactor matrix, Q, is the inverse of the weighting matrix (i.e., Q=W-1).

Let xi0 be an initial guess for the unknowns. The initial guesses can have any finite real value, but the system will converge faster if the guesses are close to the solution. Also, let X be the vector of these initial guesses. That is

It is desirable to solve for the adjustment values, DX. This is the vector of adjustments for the unknowns

where, based on the initial guess, X, and an adjustment, DX, a set of new values are computed


To solve for DX, (see the references for the reasoning behind this solution)


In various texts, the normal matrix, N, is defined as


and the covariance matrix (sometimes referred to as the variance-covariance matrix), Qxx, is defined as


If the weighting matrix is diagonal, then DX can be solved by row reduction of the matrix

where w1, w2, w3, ... are the diagonal elements of the weighting matrix. Note that the left side of the matrix is the normal matrix, N.

The initial guesses, xi0, are updated using the solution of the adjustment matrix, DX, as follows




The process is repeated using the new values, x'i, as the initial guesses until the adjustments are close to zero.

Practically, an adjustment is close to zero when it is small compared to the absolute magnitude of the value it is adjusting, i.e., Dxi < xi0 e, where e is a small value. The actual value for e can be selected based on the number of decimal digits of precision used in the calculations. Typically, the order of magnitude of e will be a few less than the number of digits of precision. For example, if the calculations are done on a computer using standard double precision (8-byte) values, the computer can hold around 15 digits of precision; therefore e is about 10-12.

Potential Problems

There are conditions where the solution will not converge or will converge to undesirable values. This process finds a local minimum. As such, there may be a better solution than the one found. A solution is dependent on the equations, Fi(x1,x2,x3,...), being continuous in x1,x2,x3,.... The first and second derivatives do not need to be continuous, but if the equations are not continuous, there is no guaranty that the process will converge. Also, in certain circumstances, even if the equations are continuous, the solution may not converge. This can happen when the first and second derivatives of the equations have significantly different values at the initial values than at the solution.

In any case where the solution does not converge, a solution may still be able to be obtained if different starting values, xi0, are used.

Linear Technique

For sets of linear equations, the least-squares process will produce a direct solution for the unknowns. The linear case is mathematically the same as the general case where an adjustment is performed using zero as the initial guess of all parameters. Only a single iteration is required for convergence.

The equations must be of the form


The Jacobian matrix, J, is therefore

where aij is the ith coefficient of the jth equation.

Since the initial guesses are all zero, the vector of residuals, K, is

If the weighting matrix is diagonal, then DX can be solved by row reduction of the matrix

The final solution will be the adjustment values. That is




Error Residuals, Ellipsoids, and Confidence

The covariance matrix, Qxx, contains the variance of each unknown and the covariance of each pair of unknowns. The quantities in Qxx need to be scaled by a reference variance. This reference variance, S02, is related to the weighting matrix and the residuals by the equation

where r is the number of degrees of freedom (i.e., the number of equations minus the number of unknowns).

For any set of quantities, an error ellipse can be calculated. The dimensions and orientations of the ellipse are calculated from the coefficients of the covariance matrix. Only the coefficients of the covariance matrix in the relevant rows and columns are used. This is the appropriate n x n sub-matrix, where n is the number of dimensions for the error ellipse. The sub-matrix is symmetric.

The ellipse matrix is composed of entries from the covariance matrix. For example, a three-dimensional error ellipsoid is computed from

where qxxij are values from the covariance matrix Qxx, and a, b, and c are the indices for the unknowns for which the ellipse is computed.

The error ellipse semi-axes are given by

The orientation of the error ellipse is the column eigenvectors of Q'xx.

To determine the error to a specific confidence level, the length of the semi-axis is multiplied by a confidence factor based on the Fisher distribution using the formula

where the confidence is a number from 0 to 1, with 1 being complete confidence, and r is the number of degrees of freedom. The Fisher distribution is determined from the equation

where the Gamma function, G(v), is given by

An Example Problem

Given a set of two-dimensional points, find the circle best represented by these points.

For this example, there are 82 points, see also the figure to the right:


Let the points be denoted (xi,yi). That is, (x1,y1)=(3,-8), (x2,y2)=(4,-9), (x3,y3)=(5,-9), , ...

A circle can be defined by the equation

where (x0,y0) is the center of the circle and r is the radius of the circle.

Ideally, it is desirable to minimize the square of the distance between the points defining the circle and the circle. That is, minimize

This is equivalent to performing the least-squares process using the equations

A Linear Solution

The equation of a circle is not linear in the unknown values x0, y0, and r. The equation of a circle can be written in the linear form



This can be rewritten for the original unknowns as

Note that by using the linear form of the circle equation, the square of the distance from each point to the circle in not the value that is being minimized.

Using the equation A(x2+y2)+Bx+Cy=1 with the unknowns A, B, and C, the Jacobian matrix is

Placing numerical values in the Jacobian matrix gives

Since all initial guesses are zero, the residual vector is

Lacking other information, the weighting matrix, W, is the identity matrix. That is


The unknowns, A, B, and C, can be solved for using the equation


This can be simplified to

Numerically, this is

Solving for the unknowns, this is

Solving for the circle parameters, this is


This solution is shown superimposed on the original points in the figure to the right. Note that the circle does not fit the data points very well because the solution used the linear form of the circle equations. A much better fit can be obtained using the nonlinear equations.

A Nonlinear Solution

Instead of solving for the linear parameters, A, B, and C, it is more desirable to solve for the values x0, y0, and r using an equation that minimizes the distance from the points to the circle. The following equation is used

The Jacobian matrix is


The solution of x0, y0, and r will require multiple iterations. A starting guess for these values is required. The starting guess can be arbitrary provided the equations are valid. A starting guess of


will be used.

Placing numerical values in the Jacobian matrix gives

The residual vector is

Again, lacking other information, the weighting matrix, W, is the identity matrix.

The adjustments of the unknowns, Dx0, Dy0, and Dr, can be solved for using the equation


Numerically, this is

Solving for the adjustment values, this is

The circle parameters are then adjusted to


This new solution is now used to compute a new Jacobian matrix and a new residual vector, which are then used to again solve for DX. The next iteration is

The circle parameters are now adjusted to


Additional iterations are performed until the adjustment values become close to zero. When performing computations with ten significant figures, after nine iterations, the values have converged to


This solution is shown superimposed on the original points in the figure to the right. Note that the circle fits the data points significantly better than the linear solution.

An Error Ellipse

To determine the expected error of the nonlinear solution of the circle, the error ellipse can be calculated. To illustrate this, the error ellipse of the circle’s center will be computed.

The error ellipse is computed using the covariance matrix

Since only the error ellipse for the center of the circle, (x0,y0), is desired, the matrix of interest is

The matrix Q'xx is composed of the entries in the first two columns and rows of the matrix Qxx since x0 corresponds to the first column and y0 to the second column of the Jacobian matrix.

The eigenvalues of Q'xx are

The eigenvectors specify the orientation of the ellipse. The eigenvectors are

The reference variance is computed based on the residual matrix and on the degrees of freedom. The number of degrees of freedom is the number of observations (82 observations – the number of points used to compute the circle) minus the number of unknowns (3 unknowns - x0, y0, and r). The reference variance is

The error ellipse can be computed for any confidence level. The Fisher distribution needs to be computed for the selected confidence. Although the reference variance was computed based on the total number of unknowns, the Fisher distribution is computed only for the number of unknowns in the ellipse. For 95% confidence


Based on this information, the semi-axes of the error ellipse are

The 95% confidence error ellipse is shown on the same graph as the circle, see figure to the right. The error ellipse has been magnified by a factor of 5 to make it more visible.

The meaning of the 95% confidence error ellipse is that if more points are added to improve the calculation of the circle, there is 95% probability that the center of the circle will remain within the area specified by the ellipse. As expected, as more points are used to compute the circle, the area of the ellipse becomes smaller. If the points lie closer to the actual circle, this will also reduce the ellipse area.


[1] Nash, Stephan G. and Sofer, Ariela, Linear and Nonlinear Programming, McGraw-Hill Companies, Inc, New York, 1996. ISBN 0-07-046065-5.

[2] Wolf, Paul R. and Ghilani, Charles D., Adjustment Computations, John Wiley and Sons, Inc, New York, 1997. ISBN 0-471-16833-5.