
Fast Algorithms for Robust Regression

Thorsten Bernholt Robin Nunkesser

Department of Computer Science, University of Dortmund

Statistical Computing 2006


Outline

Introduction
  Robust Regression
  Time Series Analysis in Intensive Care
Statistics and Computer Science
  Using the Toolkits of Computer Science
  Problem transformation
  Example


Robust Regression
Definition (Donoho and Huber, 1983)
The (finite sample) breakdown point is the smallest fraction of data points that need to be changed to have an unbounded effect on the estimate.

[Figure: Number of international phone calls originated in Belgium (in millions), 1950-1970, with the LS and LQD regression fits.]
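To make the definition concrete, here is a minimal numerical sketch (my own illustration, not part of the slides): corrupting a single observation drags the least-squares slope arbitrarily far, so LS has breakdown point 1/n, which is why high-breakdown fits such as LQD are of interest.

```python
import numpy as np

# Illustration of the breakdown point: one corrupted observation is enough
# to change the least-squares slope without bound (breakdown point 1/n).
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=20)

for outlier in (41.0, 1e3, 1e6):       # push the last y-value further and further away
    y_bad = y.copy()
    y_bad[-1] = outlier
    slope, intercept = np.polyfit(x, y_bad, deg=1)
    print(f"y[19] = {outlier:>9.0f}  ->  LS slope = {slope:.2f}")
```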


Time Series Analysis


[Figure: Heart rate of a patient in intensive care, plotted against time.]

Time series data is monitored online, e.g. in intensive care
Regression techniques have to be applied to a moving time window
Robust regression may be used to reduce the effect of outliers (see the sketch below)
Need for fast offline and online algorithms
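A sketch of the moving-window idea (my own illustration; the window width and the choice of the Repeated Median slope, one of the estimators named later in the talk, are assumptions): a robust slope is recomputed for every time window, so single artefacts in the monitored signal barely affect the estimated trend.

```python
import numpy as np

def repeated_median_slope(t, y):
    """Repeated Median slope (Siegel): median over i of the median over j != i
    of the pairwise slopes (y[j] - y[i]) / (t[j] - t[i])."""
    n = len(t)
    outer = []
    for i in range(n):
        slopes = [(y[j] - y[i]) / (t[j] - t[i]) for j in range(n) if j != i]
        outer.append(np.median(slopes))
    return float(np.median(outer))

def moving_window_trend(t, y, width=31):
    """Robust trend (slope) estimate for every moving time window (offline version)."""
    return np.array([repeated_median_slope(t[i:i + width], y[i:i + width])
                     for i in range(len(y) - width + 1)])
```

The online setting mentioned above is what calls for dynamic update algorithms that avoid recomputing the estimate from scratch for every new window.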

Using the Toolkits of Computer Science


Problems from Statistics often need to be reformulated or transformed.

Definition
An algorithmic problem consists of a description of the set of allowable inputs and a description of a function that maps each allowable input to a non-empty set of correct outputs (answers, results).

Computational Geometry is a closely related field of research
The geometric flavour of statistics becomes apparent when a sample is regarded as a set of points in Euclidean space.
Searching nearest neighbours may be reformulated to compute the Hodges-Lehmann estimator and an estimator of scale (a brute-force sketch of the estimator follows below).
It is often useful to consider underlying decision problems
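As a concrete instance of the point-set view, here is a brute-force sketch of the Hodges-Lehmann location estimator, the median of the pairwise means (conventions differ on whether the pairs i = j are included; this sketch uses i < j). The nearest-neighbour reformulation mentioned above is what opens the door to faster algorithms; the O(n^2) version below only illustrates the definition.

```python
import numpy as np
from itertools import combinations

def hodges_lehmann(sample):
    """Hodges-Lehmann location estimator: median of the pairwise means
    (x_i + x_j) / 2 over all pairs i < j.  Brute force: O(n^2) time and space."""
    means = [(a + b) / 2.0 for a, b in combinations(sample, 2)]
    return float(np.median(means))

print(hodges_lehmann([1.0, 2.0, 3.0, 4.0, 100.0]))  # barely affected by the outlier 100
```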


Decision Problems

Definition
A decision problem is an algorithmic problem where the set of outputs is restricted to Yes and No.

Obvious decision problems are:
Is the optimal value of the objective function better than x?
Is the local solution y the global solution?


Problem Transformation: The Power of Geometric Duality


Search for a point in an arrangement of lines instead of a line through a set of points.
Map a point (a, b) to the line y = ax − b and the line y = ax + b to the point (a, −b).
Vertical distances and incidence relations are preserved.

[Figure: a set of points in primal space and the corresponding arrangement of lines in dual space.]
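A minimal sketch of the duality map, assuming the standard convention stated above ((a, b) maps to y = ax − b); the helper names are mine. The check at the end confirms the property exploited throughout the talk: the vertical distance between a point and a line is the same in primal and dual space.

```python
def dual_line_of_point(a, b):
    """Point (a, b) in primal space maps to the line y = a*x - b in dual space."""
    return (a, -b)            # (slope, intercept) of the dual line

def dual_point_of_line(slope, intercept):
    """Line y = slope*x + intercept maps to the point (slope, -intercept)."""
    return (slope, -intercept)

# One example: vertical distance between a point and a line is preserved.
a, b = 2.0, 3.0            # primal point
c, d = 0.5, -1.0           # primal line y = 0.5*x - 1
primal_dist = abs(b - (c * a + d))

m, t = dual_line_of_point(a, b)          # dual line of the point
u, v = dual_point_of_line(c, d)          # dual point of the line
dual_dist = abs(v - (m * u + t))

assert abs(primal_dist - dual_dist) < 1e-12
```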


Modifications to Geometric Duality

Adding or subtracting a dimension may lead to a known problem


Searching for nearest neighbours in the plane reduces to querying extreme points of convex hulls in R^3 (see the lifting-map sketch below)

Adding or deleting points may help


We apply this to a robust regression estimator later on

Other duality concepts besides point/line duality may be used
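A sketch of the "add a dimension" trick for the nearest-neighbour example above, using the classical lifting map onto a paraboloid; the use of scipy and the function name are my choices, not part of the talk.

```python
import numpy as np
from scipy.spatial import ConvexHull

def delaunay_via_lifting(points2d):
    """Lift (x, y) to (x, y, x^2 + y^2).  The downward-facing facets of the
    3-d convex hull of the lifted points project to the Delaunay triangulation,
    whose edges contain every nearest-neighbour pair of the planar points."""
    p = np.asarray(points2d, dtype=float)
    lifted = np.column_stack([p, (p ** 2).sum(axis=1)])
    hull = ConvexHull(lifted)
    lower = hull.equations[:, 2] < 0          # facets whose outward normal points down
    return hull.simplices[lower]              # triangles as index triples into points2d
```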


Overview and Newest Result


Improved static or dynamic algorithms for Repeated Median, Median Absolute Deviation, Least Median of Squares, Least Quartile Difference

Definition (Croux et al., 1994)


Consider n points p_i in the plane and let h = ⌊(n + 3)/2⌋. The LQD solution to the regression problem is given by the slope of the line L which minimises the (h choose 2)-th order statistic of { |r_i(L) − r_j(L)| | 1 ≤ i < j ≤ n }.

[Figure: example of the residual r_i(L) of the point p_i with respect to the line L.]

The problem has O(n^4) possible solutions
Original running time: O(n^5 log n)
We achieve O(n^2 log n)
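A brute-force sketch of the objective in this definition (my own code, not the fast algorithm): the intercept cancels out of the residual differences, so the LQD slope is the minimiser of the function below over all candidate slopes. Scanning the O(n^4) candidate solutions directly is what the dual formulation on the next slides avoids.

```python
import math
import numpy as np

def lqd_objective(x, y, slope):
    """(h choose 2)-th smallest value of |r_i - r_j| over all pairs i < j,
    where r_i = y_i - slope * x_i; the intercept cancels in the differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    h = (n + 3) // 2                              # h = floor((n + 3) / 2)
    r = y - slope * x
    diffs = np.abs(r[:, None] - r[None, :])[np.triu_indices(n, k=1)]
    k = math.comb(h, 2)                           # rank of the order statistic
    return float(np.partition(diffs, k - 1)[k - 1])
```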


Application of Geometric Duality


We map data values consisting of n points (x_i, y_i) to 2·(n choose 2) lines
  L+_{i,j}: v = +(x_i − x_j)u − (y_i − y_j)
  L−_{i,j}: v = −(x_i − x_j)u + (y_i − y_j).

Example of the modified dual space
[Figure: the arrangement of dual lines in the (u, v)-plane.]

In this arrangement, we search the lowest point (β, r) with (n choose 2) + (h choose 2) subjacent or intersecting lines.
β equals the slope of the LQD fit, r equals the minimised order statistic.
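A sketch of the construction (the function name is mine): each pair of data points contributes the two lines L+ and L− above, and the point (β, r) lies on or above both of them exactly when |r_i − r_j| ≤ r for the slope β, which is what makes the "lowest point with enough subjacent or intersecting lines" search equivalent to minimising the order statistic.

```python
from itertools import combinations

def dual_lines(points):
    """Map n data points to the 2 * (n choose 2) dual lines
       L+_{i,j}: v = +(x_i - x_j) * u - (y_i - y_j)
       L-_{i,j}: v = -(x_i - x_j) * u + (y_i - y_j)
    returned as (slope, intercept) pairs in the (u, v)-plane."""
    lines = []
    for (xi, yi), (xj, yj) in combinations(points, 2):
        dx, dy = xi - xj, yi - yj
        lines.append((dx, -dy))      # L+
        lines.append((-dx, dy))      # L-
    return lines
```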


Results from Computational Geometry

The dual problem is equivalent to two problems from Computational Geometry:
Minimum k-level point
k-violation linear programming

Corollary (Cole et al., 1987; Roos and Widmayer, 1994; Chan, 1999)


It is possible to compute the LQD estimator for n data points in the plane in expected running time O(n^2 log n) or deterministic running time O(n^2 log^2 n).


One of our Algorithms

Theoretically superior algorithms are often hard to implement or even impractical
Our own algorithms achieve slightly inferior theoretical running times
The framework of the algorithms:
1. Map the input consisting of n data values to 2·(n choose 2) lines, using time O(n^2).
2. Search for the optimal solution with the help of the underlying decision problem.
3. Output the solution.


The Underlying Decision Problem


We need to decide, for a given height, whether a local solution exists at this height or below.

Example for the Decision Problem
[Figure: the dual line arrangement intersected by a horizontal line at the queried height.]

Compute all intersections of the lines with this height.
Sift through the sorted intersections and update the number of subjacent lines accordingly.
If the number equals (n choose 2) + (h choose 2), decide YES.

Running time: O(n^2 log n)
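A sketch of this sweep, assuming the (slope, intercept) representation of the dual lines from before; the event ordering and the handling of horizontal lines are my own simplifications. The threshold to check is (n choose 2) + (h choose 2), and sorting the O(n^2) crossings dominates the running time.

```python
def decide(lines, height, threshold):
    """Is there a point at the given height with at least `threshold` dual lines
    on or below it (lines passing through the point count as well)?"""
    events = []          # (u-coordinate of the crossing, change in the 'below' count)
    below = 0            # lines on or below the height at u = -infinity
    for m, b in lines:
        if m == 0:
            below += b <= height           # horizontal line: below everywhere or nowhere
            continue
        u = (height - b) / m
        if m > 0:
            below += 1                     # increasing line: below to the left of u,
            events.append((u, -1))         # above to the right of u
        else:
            events.append((u, +1))         # decreasing line: below to the right of u
    # At equal u, count arriving lines before departing ones, so that a point
    # lying on several crossing lines sees all of them.
    events.sort(key=lambda e: (e[0], -e[1]))
    best = below
    for _, delta in events:
        below += delta
        best = max(best, below)
    return best >= threshold
```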


Randomised Search for the Global Solution


A lower and an upper bound for the height of the optimal solution are stored while the algorithm runs.
1. Initialise the search:
   Initialise 0 as the lower bound and find a trivial local solution to initialise the upper bound.

2. Search for the global solution:
   Calculate the number of intersections that lie between the lower and the upper bound.
   Choose one of these intersections uniformly at random.
   Decide whether the height of this intersection becomes the new lower or the new upper bound.

3. Stopping criterion:
   Search until no intersection remains between the lower and the upper bound.

Expected number of times the decision problem has to be solved: O(log n).
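A sketch of this search loop; the three helpers (counting intersection heights between the bounds, sampling one uniformly at random, and the decision problem at a given height) are assumed to be given, and the efficient counting is the subject of the next slide.

```python
def randomized_search(lower, upper, count_between, sample_between, decide):
    """Randomised search for the optimal height.
    `lower` and `upper` bracket the optimum (0 and the height of a trivial
    local solution); `count_between(lo, hi)` and `sample_between(lo, hi)`
    count / draw uniformly at random an intersection height strictly between
    the bounds, and `decide(h)` answers the decision problem at height h."""
    while count_between(lower, upper) > 0:
        h = sample_between(lower, upper)
        if decide(h):
            upper = h        # a solution exists at or below h: new upper bound
        else:
            lower = h        # no solution at or below h: new lower bound
    return upper             # expected O(log n) iterations, hence O(log n) calls to decide
```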



Calculating the number of intersections efficiently

Example for Intersections
[Figure: an arrangement of lines crossed by two horizontal heights; the lines are labelled 1-11 in the order of their intersections with the upper line.]

Calculating the number of intersections:
1. Label the lines according to their intersection with the upper horizontal line.
2. Interpret the intersections with the lower horizontal line as a permutation of these labels (e.g. (8, 1, 5, 2, 10, 3, 7, 4, 9, 6, 11)).
3. Calculate the number of inversions of the permutation, e.g. with merge sort.

Running time: O(n^2 log n)
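A sketch of step 3 (merge-sort inversion counting; the function is mine): every inversion of the permutation corresponds to exactly one pair of lines that swap their vertical order, i.e. to one intersection between the two heights.

```python
def count_inversions(perm):
    """Number of pairs i < j with perm[i] > perm[j], counted during merge sort."""
    def sort(seq):
        if len(seq) <= 1:
            return seq, 0
        mid = len(seq) // 2
        left, inv_left = sort(seq[:mid])
        right, inv_right = sort(seq[mid:])
        merged, inv = [], inv_left + inv_right
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
                inv += len(left) - i     # every remaining left element is larger
        merged += left[i:] + right[j:]
        return merged, inv
    return sort(list(perm))[1]

# Permutation from the slide: 18 inversions, i.e. 18 intersections between the heights.
print(count_inversions([8, 1, 5, 2, 10, 3, 7, 4, 9, 6, 11]))
```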



Summary

Results from Computational Geometry are applicable to problems from Statistics.
Solving dual or equivalent problems may lead to superior running times.
Results from Statistics may also help computer scientists, e.g. in the analysis of running times.


Thank you!


Bibliography
Chan, T. M., 1999. Geometric applications of a randomized optimization technique. Discrete and Computational Geometry 22 (4), 547–567.
Cole, R., Sharir, M., Yap, C. K., 1987. On k-hulls and related problems. SIAM J. Comput. 16 (1), 61–77.
Croux, C., Rousseeuw, P. J., Hössjer, O., 1994. Generalized S-estimators. J. Amer. Statist. Assoc. 89, 1271–1281.
Donoho, D., Huber, P., 1983. The notion of breakdown point. In: Bickel, P., Doksum, K., Hodges, J. L. (Eds.), A Festschrift for Erich L. Lehmann. Wadsworth, pp. 157–184.
Roos, T., Widmayer, P., 1994. k-violation linear programming. Inf. Process. Lett. 52 (2), 109–114.

