
Linear Regression

Regression
Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Examples:
Height, gender, weight → shoe size
Audio features → song year
Processes, memory → power consumption
Historical financials → future stock price
Many more

Linear Least Squares Regression

Example: Predicting shoe size from height, gender, and weight

For each observation we have a feature vector, x, and label, y:

    x = [x_1  x_2  x_3]^T

We assume a linear mapping between features and label:

    y ≈ w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3

Linear Least Squares Regression

Example: Predicting shoe size from height, gender, and weight

We can augment the feature vector to incorporate the offset:

    x = [1  x_1  x_2  x_3]^T

We can then rewrite this linear mapping as a scalar (dot) product:

    ŷ = Σ_{i=0}^{3} w_i x_i = w^T x
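A minimal NumPy sketch of this prediction step (the weights and feature values below are made up for illustration):

    import numpy as np

    # Hypothetical learned weights; w[0] is the offset term w_0
    w = np.array([2.0, 0.1, -0.5, 0.05])
    x_raw = np.array([170.0, 1.0, 65.0])   # e.g., height, gender, weight

    # Augment the feature vector with a leading 1 to absorb the offset
    x = np.concatenate(([1.0], x_raw))

    # The prediction is a single dot product: y_hat = w^T x
    y_hat = w @ x
    print(y_hat)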

Why a Linear Mapping?


Simple
Often works well in practice
Can introduce complexity via feature extraction

1D Example
Goal: find the line of best fit

x coordinate: features
y coordinate: labels

    y = w_0 + w_1 x

w_0: intercept / offset
w_1: slope

Evaluating Predictions
Can measure closeness between label and prediction

Shoe size: better to be off by one size than by 5 sizes
Song year prediction: better to be off by a year than by 20 years

What is an appropriate evaluation metric or loss function?

Absolute loss: |y - ŷ|

Squared loss: (y - ŷ)^2, which has nice mathematical properties

How Can We Learn Model (w)?

Assume we have n training points, where x^(i) denotes the ith point

Recall two earlier points:
Linear assumption: ŷ = w^T x
We use squared loss: (y - ŷ)^2

Idea: Find w that minimizes the squared loss over the training points:

    min_w  Σ_{i=1}^{n} (w^T x^(i) - y^(i))^2

Given n training points with d features, we define:

X ∈ R^{n×d}: matrix storing points
y ∈ R^n: real-valued labels
ŷ ∈ R^n: predicted labels, where ŷ = Xw
w ∈ R^d: regression parameters / model to learn

Least Squares Regression: Learn a mapping (w) from features to labels that minimizes the residual sum of squares:

    min_w  ||Xw - y||_2^2

Equivalent to  min_w Σ_{i=1}^{n} (w^T x^(i) - y^(i))^2  by definition of the Euclidean norm

Find the solution by setting the derivative to zero

1D case (scalar feature x per point):

    f(w) = ||wx - y||_2^2 = Σ_{i=1}^{n} (w x^(i) - y^(i))^2

    df/dw (w) = 2 Σ_{i=1}^{n} x^(i) (w x^(i) - y^(i)) = 0

    ⇒  w Σ_i x^(i) x^(i) - Σ_i x^(i) y^(i) = 0

    ⇒  w = (x^T x)^(-1) x^T y
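A quick numeric check of the 1D formula on arbitrary toy data:

    import numpy as np

    # Toy 1D data (scalar feature per point, no offset term)
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.1, 3.9, 6.2, 7.8])

    # Closed-form 1D solution: w = (x^T x)^(-1) x^T y
    w = (x @ y) / (x @ x)
    print(w)  # roughly 2, since y is approximately 2x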
Least Squares Regression: Learn a mapping (w) from features to labels that minimizes the residual sum of squares:

    min_w  ||Xw - y||_2^2

Closed-form solution:  w = (X^T X)^(-1) X^T y   (if the inverse exists)
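A minimal NumPy sketch of the closed-form solution; the data is randomly generated for illustration, and np.linalg.solve is used instead of forming an explicit inverse:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 3
    X = rng.normal(size=(n, d))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + 0.1 * rng.normal(size=n)

    # Closed form: w = (X^T X)^(-1) X^T y
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(w_hat)  # close to w_true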

Overfitting and Generalization

We want good predictions on new data, i.e., generalization
Least squares regression minimizes training error, and could overfit
Simpler models are more likely to generalize (Occam's razor)
Can we change the problem to penalize model complexity?
Intuitively, models with smaller weights are simpler


Ridge Regression: Learn a mapping (w) that minimizes the residual sum of squares along with a regularization term:

    min_w  ||Xw - y||_2^2  +  λ ||w||_2^2
          (training error)    (model complexity)

Closed-form solution:  w = (X^T X + λ I_d)^(-1) X^T y

The free parameter λ trades off between training error and model complexity
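A minimal NumPy sketch of the ridge closed-form solution (lam plays the role of the regularization parameter λ; the data is randomly generated for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 100, 3
    X = rng.normal(size=(n, d))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

    lam = 0.1  # regularization parameter

    # Ridge closed form: w = (X^T X + lam * I_d)^(-1) X^T y
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    print(w_ridge)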

Millionsong
Regression Pipeline

Supervised Learning Pipeline:

    Obtain Raw Data → Split Data → Feature Extraction → Supervised Learning → Evaluation → Predict

The full dataset is split into a training set and a test set; features are extracted, a model is learned on the training set, its accuracy is evaluated on the test set, and the final model is used to make a prediction for each new entity.
Obtain Raw Data

Goal: Predict a song's release year from audio features

Raw Data: Millionsong Dataset from the UCI ML Repository
Western, commercial tracks from 1980-2014
12 timbre averages (features) and release year (label)


Split Data: Train on the training set, evaluate with the test set

The test set simulates unobserved data
Test error tells us whether we've generalized well


Feature Extraction: Quadratic features


Compute pairwise feature interactions
Captures covariance of initial timbre features
Leads to a non-linear model relative to raw features


Given 2-dimensional data, the quadratic features are:

    x = [x_1  x_2]^T    →    Φ(x) = [x_1^2,  x_1 x_2,  x_2 x_1,  x_2^2]
    z = [z_1  z_2]^T    →    Φ(z) = [z_1^2,  z_1 z_2,  z_2 z_1,  z_2^2]

More succinctly:

    Φ(x) = [x_1^2,  √2 x_1 x_2,  x_2^2]
    Φ(z) = [z_1^2,  √2 z_1 z_2,  z_2^2]

Equivalent inner products (both representations give the same value):

    Φ(x)^T Φ(z) = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2
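A minimal NumPy sketch of this pairwise (quadratic) feature expansion; the helper name quad_features is ours, not part of any library:

    import numpy as np

    def quad_features(x):
        # All pairwise products x_i * x_j, flattened into a vector
        return np.outer(x, x).ravel()

    x = np.array([2.0, 3.0])
    print(quad_features(x))  # [x1^2, x1*x2, x2*x1, x2^2] = [4. 6. 6. 9.]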


Supervised Learning: Least Squares Regression

Learn a mapping from entities to continuous labels given a training set
Audio features → song year



Ridge Regression: Learn a mapping (w) that minimizes the residual sum of squares along with a regularization term:

    min_w  ||Xw - y||_2^2  +  λ ||w||_2^2
          (training error)    (model complexity)

Closed-form solution:  w = (X^T X + λ I_d)^(-1) X^T y

The free parameter λ trades off between training error and model complexity

How do we choose a good value for this free parameter?

Most methods have free parameters / hyperparameters to tune
First thought: Search over multiple values, and evaluate each on the test set
But the goal of the test set is to simulate unobserved data
We may overfit if we use it to choose hyperparameters
Second thought: Create another hold-out dataset, a validation set, for this search (see the sketch below)
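A minimal sketch of carving out a validation set alongside the test set, assuming the data is already shuffled (the split fractions are arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 1000, 12
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)

    # 70% train, 15% validation, 15% test (arbitrary fractions)
    n_train, n_val = int(0.7 * n), int(0.15 * n)
    X_train, y_train = X[:n_train], y[:n_train]
    X_val, y_val = X[n_train:n_train + n_val], y[n_train:n_train + n_val]
    X_test, y_test = X[n_train + n_val:], y[n_train + n_val:]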

[Pipeline diagram, updated: the full dataset is now split into training, validation, and test sets; the model is tuned on the validation set before its accuracy is measured on the test set.]

Evaluation (Part 1): Hyperparameter tuning

Training set: train various models
Validation set: evaluate various models (e.g., via grid search)
Test set: evaluate the final model's accuracy


Grid Search: Exhaustively search through the hyperparameter space

Define and discretize the search space (linear or log scale)
Evaluate points via validation error

[Figure: grid over hyperparameter 1, the regularization parameter λ (10^-8, 10^-6, 10^-4, 10^-2, log scale), crossed with hyperparameter 2]
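A minimal sketch of grid search over λ for ridge regression, picking the value with the lowest validation RMSE (the data and grid values are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 200, 12
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

    # Simple train/validation split
    X_train, y_train = X[:150], y[:150]
    X_val, y_val = X[150:], y[150:]

    def ridge_fit(X, y, lam):
        # Ridge closed form: (X^T X + lam * I)^(-1) X^T y
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    def rmse(y_true, y_pred):
        return np.sqrt(np.mean((y_true - y_pred) ** 2))

    # Log-scale grid of candidate regularization values
    grid = [1e-8, 1e-6, 1e-4, 1e-2]
    val_err = {lam: rmse(y_val, X_val @ ridge_fit(X_train, y_train, lam)) for lam in grid}
    best_lam = min(val_err, key=val_err.get)
    print(best_lam, val_err[best_lam])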

Evaluating Predictions
How can we compare labels and predictions for n validation points?

Least squares optimization involves the squared loss, (y - ŷ)^2, so it seems reasonable to use the mean squared error (MSE):

    MSE = (1/n) Σ_{i=1}^{n} (y^(i) - ŷ^(i))^2

But MSE's unit of measurement is the square of the quantity being measured, e.g., squared years for song year prediction
More natural to use the root-mean-square error (RMSE), i.e., √MSE


Evaluation (Part 2): Evaluate the final model

Training set: train various models
Validation set: evaluate various models
Test set: evaluate the final model's accuracy


Predict: The final model can then be used to make predictions on future observations, e.g., new songs


Distributed ML:
Computation and Storage

Challenge: Scalability
Classic ML techniques are not always suitable for modern datasets

Data grows faster than Moore's Law
[Chart: growth of overall data, particle accelerator data, and DNA sequencer data outpacing Moore's Law, 2010-2015; source: IDC report, Kathy Yelick, LBNL]

Machine Learning + growing Data ⇒ Distributed Computing

Least Squares Regression: Learn a mapping (w) from features to labels that minimizes the residual sum of squares:

    min_w  ||Xw - y||_2^2

Closed-form solution:  w = (X^T X)^(-1) X^T y   (if the inverse exists)

How do we solve this computationally?
The computational profile is similar for ridge regression

Computing the Closed-Form Solution

    w = (X^T X)^(-1) X^T y

Computation: O(nd^2 + d^3) operations

Consider the number of arithmetic operations (+, -, ×, /)

Computational bottlenecks:
Matrix multiply X^T X: O(nd^2) operations
Matrix inverse: O(d^3) operations
Other methods (Cholesky, QR, SVD) have the same complexity

Storage Requirements

    w = (X^T X)^(-1) X^T y

Computation: O(nd^2 + d^3) operations
Storage: O(nd + d^2) floats

Consider storing values as floats (8 bytes)

Storage bottlenecks:
X^T X and its inverse: O(d^2) floats
X: O(nd) floats

Big n and Small d

    w = (X^T X)^(-1) X^T y

Computation: O(nd^2 + d^3) operations
Storage: O(nd + d^2) floats

Assume O(d^3) computation and O(d^2) storage are feasible on a single machine
Storing X and computing X^T X are the bottlenecks

Can distribute storage and computation!
Store data points (rows of X) across machines
Compute X^T X as a sum of outer products (see the sketch below)
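A minimal sketch of computing X^T X as a sum of per-point outer products, simulating the partitioned "map" step with plain Python lists (in a real cluster each partition would live on a different worker; no specific framework API is assumed):

    import numpy as np
    from functools import reduce

    rng = np.random.default_rng(4)
    n, d = 6, 3
    X = rng.normal(size=(n, d))

    # Pretend the rows of X are partitioned across 3 workers
    partitions = [X[0:2], X[2:4], X[4:6]]

    # "map": each worker sums the outer products of its local points (a d x d matrix)
    def local_outer_sum(part):
        return sum(np.outer(x, x) for x in part)

    partials = [local_outer_sum(p) for p in partitions]

    # "reduce": sum the small d x d partial results
    XtX = reduce(np.add, partials)
    print(np.allclose(XtX, X.T @ X))  # True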

Matrix Multiplication via Inner Products

Each entry of the output matrix is the inner product of a row of the left input matrix with a column of the right input matrix

[Worked numeric example: each output entry computed as a row-column inner product]

Matrix Multiplication via Outer Products

The output matrix is the sum of outer products between the columns of the left input matrix and the corresponding rows of the right input matrix

[Worked numeric example: the same product computed as a sum of rank-one outer products]

Computing X^T X as a sum of outer products:

    X^T X = Σ_{i=1}^{n} x^(i) (x^(i))^T

Example: n = 6; 3 workers
workers: (x^(1), x^(5)), (x^(3), x^(4)), (x^(2), x^(6))
map: each worker computes the sum of outer products x^(i) (x^(i))^T over its local points
    O(nd) distributed storage; O(nd^2) distributed computation; O(d^2) local storage
reduce: sum the d × d partial results, invert, and multiply to obtain w
    O(d^3) local computation; O(d^2) local storage

Distributed ML:
Computation and Storage,
Part II

Big n and Small d

    w = (X^T X)^(-1) X^T y

Computation: O(nd^2 + d^3) operations
Storage: O(nd + d^2) floats

Assume O(d^3) computation and O(d^2) storage are feasible on a single machine

Can distribute storage and computation!
Store data points (rows of X) across machines
Compute X^T X as a sum of outer products

Big n and Big d

    w = (X^T X)^(-1) X^T y

Computation: O(nd^2 + d^3) operations
Storage: O(nd + d^2) floats

As before, storing X and computing X^T X are bottlenecks
Now, storing and operating on X^T X is also a bottleneck
Can't easily distribute!

[Map/reduce diagram as before: with big d, the O(d^2) local storage and O(d^3) local computation in the reduce step become bottlenecks.]


1st Rule of thumb


Computation and storage should be linear (in n, d)

Big n and Big d

We need methods that are linear in time and space
One idea: Exploit sparsity
Explicit sparsity can provide orders-of-magnitude storage and computational gains
Sparse data is prevalent:
Text processing: bag-of-words, n-grams
Collaborative filtering: ratings matrix
Graphs: adjacency matrix
Categorical features: one-hot encoding
Genomics: SNPs, variant calling

Example representation of the same vector (see the sketch below):
    dense:  [1., 0., 0., 0., 0., 0., 3.]
    sparse: size: 7, indices: [0, 6], values: [1., 3.]
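A minimal sketch of the same dense-vs-sparse idea using scipy.sparse, which stores only the nonzero values and their indices (scipy is assumed to be available):

    import numpy as np
    from scipy.sparse import csr_matrix

    dense = np.array([1., 0., 0., 0., 0., 0., 3.])

    sparse = csr_matrix(dense)   # stored as a 1 x 7 sparse row
    print(sparse.shape)    # (1, 7)
    print(sparse.indices)  # [0 6]
    print(sparse.data)     # [1. 3.]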

Big n and Big d

We need methods that are linear in time and space
One idea: Exploit sparsity
Explicit sparsity can provide orders-of-magnitude storage and computational gains
Latent sparsity assumptions can be used to reduce dimension, e.g., PCA, low-rank approximation (unsupervised learning)

[Figure: a large matrix approximated by rank-r factors]

Another idea: Use different algorithms
Gradient descent is an iterative algorithm that requires O(nd) computation and O(d) local storage per iteration (see the sketch below)
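A minimal sketch of (full-batch) gradient descent for the least squares objective: each iteration touches every point once (O(nd) work) and only keeps the d-dimensional w and gradient (O(d) storage); the step size and iteration count are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(5)
    n, d = 500, 10
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    w = np.zeros(d)
    alpha = 0.1  # step size (arbitrary)

    for _ in range(200):
        # Gradient of the average squared error: (2/n) X^T (Xw - y), O(nd) per iteration
        grad = (2.0 / n) * X.T @ (X @ w - y)
        w -= alpha * grad

    print(np.sqrt(np.mean((X @ w - y) ** 2)))  # training RMSE after 200 iterations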

Closed Form Solution for Big n and Big d

Example: n = 6; 3 workers

[Map/reduce diagram as before: O(nd) distributed storage, O(nd^2) distributed computation, and O(d^2) local storage in the map step; O(d^3) local computation and O(d^2) local storage in the reduce step. The local d^2 and d^3 terms are the problem when d is large.]

Gradient Descent for Big n and Big d

Example: n = 6; 3 workers
workers: (x^(1), x^(5)), (x^(3), x^(4)), (x^(2), x^(6))
map: each worker computes its local contribution to the gradient
    O(nd) distributed storage; O(nd) distributed computation; O(d) local storage
reduce: sum the d-dimensional partial gradients and update w
    O(d) local computation; O(d) local storage
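A minimal sketch of one distributed gradient step, again simulating the partitions in plain Python: each "map" result is only a d-dimensional vector, and the "reduce" is a cheap O(d) sum (no specific cluster framework API is assumed):

    import numpy as np
    from functools import reduce

    rng = np.random.default_rng(6)
    n, d = 6, 4
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    w = np.zeros(d)
    alpha = 0.1  # step size (arbitrary)

    # Rows partitioned across 3 simulated workers
    parts = [(X[0:2], y[0:2]), (X[2:4], y[2:4]), (X[4:6], y[4:6])]

    # "map": each worker computes the gradient contribution of its local points (a d-vector)
    def local_grad(Xp, yp, w):
        return 2 * Xp.T @ (Xp @ w - yp)

    partial_grads = [local_grad(Xp, yp, w) for Xp, yp in parts]

    # "reduce": sum the d-dimensional partials and take one gradient step
    grad = reduce(np.add, partial_grads)
    w = w - alpha * grad
    print(w)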
