
Linear Regression

Regression
Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Examples:
Height, gender, weight → shoe size
Audio features → song year
Processes, memory → power consumption
Historical financials → future stock price
Many more

Linear Least Squares Regression

Example: Predicting shoe size from height, gender, and weight

For each observation we have a feature vector, x, and label, y:

    x = [x_1  x_2  x_3]^T

We assume a linear mapping between features and label:

    y ≈ w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3

Linear Least Squares Regression

Example: Predicting shoe size from height, gender, and weight

We can augment the feature vector to incorporate the offset:

    x = [1  x_1  x_2  x_3]^T

We can then rewrite this linear mapping as a scalar (dot) product:

    ŷ = Σ_{i=0}^{3} w_i x_i = w^T x
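A minimal NumPy sketch of this prediction step (the weights and feature values below are made up for illustration):

    import numpy as np

    # Hypothetical learned weights; w[0] is the offset term w_0
    w = np.array([2.0, 0.1, -0.5, 0.05])
    x_raw = np.array([170.0, 1.0, 65.0])   # e.g., height, gender, weight

    # Augment the feature vector with a leading 1 to absorb the offset
    x = np.concatenate(([1.0], x_raw))

    # The prediction is a single dot product: y_hat = w^T x
    y_hat = w @ x
    print(y_hat)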

Why a Linear Mapping?


Simple
Often works well in practice
Can introduce complexity via feature extraction

1D Example
Goal: find the line of best fit

x coordinate: features
y coordinate: labels

    y = w_0 + w_1 x

w_0: intercept / offset
w_1: slope

Evaluating Predictions
Can measure closeness between label and prediction

Shoe size: better to be off by one size than by 5 sizes
Song year prediction: better to be off by a year than by 20 years

What is an appropriate evaluation metric or loss function?

Absolute loss: |y - ŷ|

Squared loss: (y - ŷ)^2, which has nice mathematical properties

How Can We Learn Model (w)?

Assume we have n training points, where x^(i) denotes the ith point

Recall two earlier points:
Linear assumption: ŷ = w^T x
We use squared loss: (y - ŷ)^2

Idea: Find w that minimizes the squared loss over the training points:

    min_w  Σ_{i=1}^{n} (w^T x^(i) - y^(i))^2

Given n training points with d features, we define:

X ∈ R^{n×d}: matrix storing points
y ∈ R^n: real-valued labels
ŷ ∈ R^n: predicted labels, where ŷ = Xw
w ∈ R^d: regression parameters / model to learn

Least Squares Regression: Learn a mapping (w) from features to labels that minimizes the residual sum of squares:

    min_w  ||Xw - y||_2^2

Equivalent to  min_w Σ_{i=1}^{n} (w^T x^(i) - y^(i))^2  by definition of the Euclidean norm

Find the solution by setting the derivative to zero

1D case (scalar feature x per point):

    f(w) = ||wx - y||_2^2 = Σ_{i=1}^{n} (w x^(i) - y^(i))^2

    df/dw (w) = 2 Σ_{i=1}^{n} x^(i) (w x^(i) - y^(i)) = 0

    ⇒  w Σ_i x^(i) x^(i) - Σ_i x^(i) y^(i) = 0

    ⇒  w = (x^T x)^(-1) x^T y
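A quick numeric check of the 1D formula on arbitrary toy data:

    import numpy as np

    # Toy 1D data (scalar feature per point, no offset term)
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.1, 3.9, 6.2, 7.8])

    # Closed-form 1D solution: w = (x^T x)^(-1) x^T y
    w = (x @ y) / (x @ x)
    print(w)  # roughly 2, since y is approximately 2x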
Least Squares Regression: Learn a mapping (w) from features to labels that minimizes the residual sum of squares:

    min_w  ||Xw - y||_2^2

Closed-form solution:  w = (X^T X)^(-1) X^T y   (if the inverse exists)
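A minimal NumPy sketch of the closed-form solution; the data is randomly generated for illustration, and np.linalg.solve is used instead of forming an explicit inverse:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 3
    X = rng.normal(size=(n, d))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + 0.1 * rng.normal(size=n)

    # Closed form: w = (X^T X)^(-1) X^T y
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(w_hat)  # close to w_true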

Overfitting and Generalization

We want good predictions on new data, i.e., generalization
Least squares regression minimizes training error, and could overfit
Simpler models are more likely to generalize (Occam's razor)
Can we change the problem to penalize model complexity?
Intuitively, models with smaller weights are simpler


Ridge Regression: Learn a mapping (w) that minimizes the residual sum of squares along with a regularization term:

    min_w  ||Xw - y||_2^2  +  λ ||w||_2^2
          (training error)    (model complexity)

Closed-form solution:  w = (X^T X + λ I_d)^(-1) X^T y

The free parameter λ trades off between training error and model complexity
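A minimal NumPy sketch of the ridge closed-form solution (lam plays the role of the regularization parameter λ; the data is randomly generated for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 100, 3
    X = rng.normal(size=(n, d))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

    lam = 0.1  # regularization parameter

    # Ridge closed form: w = (X^T X + lam * I_d)^(-1) X^T y
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    print(w_ridge)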

Millionsong
Regression Pipeline

Supervised Learning Pipeline:

    Obtain Raw Data → Split Data → Feature Extraction → Supervised Learning → Evaluation → Predict

The full dataset is split into a training set and a test set; features are extracted, a model is learned on the training set, its accuracy is evaluated on the test set, and the final model is used to make a prediction for each new entity.
Obtain Raw Data

Goal: Predict a song's release year from audio features

Raw Data: Millionsong Dataset from the UCI ML Repository
Western, commercial tracks from 1980-2014
12 timbre averages (features) and release year (label)


Split Data: Train on the training set, evaluate with the test set

The test set simulates unobserved data
Test error tells us whether we've generalized well


Feature Extraction: Quadratic features


Compute pairwise feature interactions
Captures covariance of initial timbre features
Leads to a non-linear model relative to raw features


Given 2-dimensional data, the quadratic features are:

    x = [x_1  x_2]^T    →    Φ(x) = [x_1^2,  x_1 x_2,  x_2 x_1,  x_2^2]
    z = [z_1  z_2]^T    →    Φ(z) = [z_1^2,  z_1 z_2,  z_2 z_1,  z_2^2]

More succinctly:

    Φ(x) = [x_1^2,  √2 x_1 x_2,  x_2^2]
    Φ(z) = [z_1^2,  √2 z_1 z_2,  z_2^2]

Equivalent inner products (both representations give the same value):

    Φ(x)^T Φ(z) = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2
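A minimal NumPy sketch of this pairwise (quadratic) feature expansion; the helper name quad_features is ours, not part of any library:

    import numpy as np

    def quad_features(x):
        # All pairwise products x_i * x_j, flattened into a vector
        return np.outer(x, x).ravel()

    x = np.array([2.0, 3.0])
    print(quad_features(x))  # [x1^2, x1*x2, x2*x1, x2^2] = [4. 6. 6. 9.]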


Supervised Learning: Least Squares Regression

Learn a mapping from entities to continuous labels given a training set
Audio features → song year



Ridge Regression: Learn a mapping (w) that minimizes the residual sum of squares along with a regularization term:

    min_w  ||Xw - y||_2^2  +  λ ||w||_2^2
          (training error)    (model complexity)

Closed-form solution:  w = (X^T X + λ I_d)^(-1) X^T y

The free parameter λ trades off between training error and model complexity

How do we choose a good value for this free parameter?

Most methods have free parameters / hyperparameters to tune
First thought: Search over multiple values, and evaluate each on the test set
But the goal of the test set is to simulate unobserved data
We may overfit if we use it to choose hyperparameters
Second thought: Create another hold-out dataset, a validation set, for this search (see the sketch below)
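A minimal sketch of carving out a validation set alongside the test set, assuming the data is already shuffled (the split fractions are arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 1000, 12
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)

    # 70% train, 15% validation, 15% test (arbitrary fractions)
    n_train, n_val = int(0.7 * n), int(0.15 * n)
    X_train, y_train = X[:n_train], y[:n_train]
    X_val, y_val = X[n_train:n_train + n_val], y[n_train:n_train + n_val]
    X_test, y_test = X[n_train + n_val:], y[n_train + n_val:]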

[Pipeline diagram, updated: the full dataset is now split into training, validation, and test sets; the model is tuned on the validation set before its accuracy is measured on the test set.]

Evaluation (Part 1): Hyperparameter tuning

Training set: train various models
Validation set: evaluate various models (e.g., via grid search)
Test set: evaluate the final model's accuracy


Grid Search: Exhaustively search through the hyperparameter space

Define and discretize the search space (linear or log scale)
Evaluate points via validation error

[Figure: grid over hyperparameter 1, the regularization parameter λ (10^-8, 10^-6, 10^-4, 10^-2, log scale), crossed with hyperparameter 2]
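A minimal sketch of grid search over λ for ridge regression, picking the value with the lowest validation RMSE (the data and grid values are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 200, 12
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

    # Simple train/validation split
    X_train, y_train = X[:150], y[:150]
    X_val, y_val = X[150:], y[150:]

    def ridge_fit(X, y, lam):
        # Ridge closed form: (X^T X + lam * I)^(-1) X^T y
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    def rmse(y_true, y_pred):
        return np.sqrt(np.mean((y_true - y_pred) ** 2))

    # Log-scale grid of candidate regularization values
    grid = [1e-8, 1e-6, 1e-4, 1e-2]
    val_err = {lam: rmse(y_val, X_val @ ridge_fit(X_train, y_train, lam)) for lam in grid}
    best_lam = min(val_err, key=val_err.get)
    print(best_lam, val_err[best_lam])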

Evaluating Predictions
How can we compare labels and predictions for n validation points?

Least squares optimization involves the squared loss, (y - ŷ)^2, so it seems reasonable to use the mean squared error (MSE):

    MSE = (1/n) Σ_{i=1}^{n} (y^(i) - ŷ^(i))^2

But MSE's unit of measurement is the square of the quantity being measured, e.g., squared years for song year prediction
More natural to use the root-mean-square error (RMSE), i.e., √MSE


Evaluation (Part 2): Evaluate the final model

Training set: train various models
Validation set: evaluate various models
Test set: evaluate the final model's accuracy


Predict: The final model can then be used to make predictions on future observations, e.g., new songs


Distributed ML:
Computation and Storage

Challenge: Scalability
Classic ML techniques are not always suitable for modern datasets

Data grows faster than Moore's Law
[Chart: growth of overall data, particle accelerator data, and DNA sequencer data outpacing Moore's Law, 2010-2015; source: IDC report, Kathy Yelick, LBNL]

Machine Learning + growing Data ⇒ Distributed Computing

Least Squares Regression: Learn a mapping (w) from features to labels that minimizes the residual sum of squares:

    min_w  ||Xw - y||_2^2

Closed-form solution:  w = (X^T X)^(-1) X^T y   (if the inverse exists)

How do we solve this computationally?
The computational profile is similar for ridge regression

Computing the Closed-Form Solution

    w = (X^T X)^(-1) X^T y

Computation: O(nd^2 + d^3) operations

Consider the number of arithmetic operations (+, -, ×, /)

Computational bottlenecks:
Matrix multiply X^T X: O(nd^2) operations
Matrix inverse: O(d^3) operations
Other methods (Cholesky, QR, SVD) have the same complexity

Storage Requirements

    w = (X^T X)^(-1) X^T y

Computation: O(nd^2 + d^3) operations
Storage: O(nd + d^2) floats

Consider storing values as floats (8 bytes)

Storage bottlenecks:
X^T X and its inverse: O(d^2) floats
X: O(nd) floats

Big n and Small d

    w = (X^T X)^(-1) X^T y

Computation: O(nd^2 + d^3) operations
Storage: O(nd + d^2) floats

Assume O(d^3) computation and O(d^2) storage are feasible on a single machine
Storing X and computing X^T X are the bottlenecks

Can distribute storage and computation!
Store data points (rows of X) across machines
Compute X^T X as a sum of outer products (see the sketch below)
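A minimal sketch of computing X^T X as a sum of per-point outer products, simulating the partitioned "map" step with plain Python lists (in a real cluster each partition would live on a different worker; no specific framework API is assumed):

    import numpy as np
    from functools import reduce

    rng = np.random.default_rng(4)
    n, d = 6, 3
    X = rng.normal(size=(n, d))

    # Pretend the rows of X are partitioned across 3 workers
    partitions = [X[0:2], X[2:4], X[4:6]]

    # "map": each worker sums the outer products of its local points (a d x d matrix)
    def local_outer_sum(part):
        return sum(np.outer(x, x) for x in part)

    partials = [local_outer_sum(p) for p in partitions]

    # "reduce": sum the small d x d partial results
    XtX = reduce(np.add, partials)
    print(np.allclose(XtX, X.T @ X))  # True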

Matrix Multiplication via Inner Products

Each entry of the output matrix is the inner product of a row of the left input matrix with a column of the right input matrix

[Worked numeric example: each output entry computed as a row-column inner product]

Matrix Multiplication via Outer Products

The output matrix is the sum of outer products between the columns of the left input matrix and the corresponding rows of the right input matrix

[Worked numeric example: the same product computed as a sum of rank-one outer products]

Computing X^T X as a sum of outer products:

    X^T X = Σ_{i=1}^{n} x^(i) (x^(i))^T

Example: n = 6; 3 workers
workers: (x^(1), x^(5)), (x^(3), x^(4)), (x^(2), x^(6))
map: each worker computes the sum of outer products x^(i) (x^(i))^T over its local points
    O(nd) distributed storage; O(nd^2) distributed computation; O(d^2) local storage
reduce: sum the d × d partial results, invert, and multiply to obtain w
    O(d^3) local computation; O(d^2) local storage

Distributed ML:
Computation and Storage,
Part II

Big n and Small d

    w = (X^T X)^(-1) X^T y

Computation: O(nd^2 + d^3) operations
Storage: O(nd + d^2) floats

Assume O(d^3) computation and O(d^2) storage are feasible on a single machine

Can distribute storage and computation!
Store data points (rows of X) across machines
Compute X^T X as a sum of outer products

Big n and Big d

    w = (X^T X)^(-1) X^T y

Computation: O(nd^2 + d^3) operations
Storage: O(nd + d^2) floats

As before, storing X and computing X^T X are bottlenecks
Now, storing and operating on X^T X is also a bottleneck
Can't easily distribute!

[Map/reduce diagram as before: with big d, the O(d^2) local storage and O(d^3) local computation in the reduce step become bottlenecks.]


1st Rule of thumb


Computation and storage should be linear (in n, d)

Big n and Big d

We need methods that are linear in time and space
One idea: Exploit sparsity
Explicit sparsity can provide orders-of-magnitude storage and computational gains
Sparse data is prevalent:
Text processing: bag-of-words, n-grams
Collaborative filtering: ratings matrix
Graphs: adjacency matrix
Categorical features: one-hot encoding
Genomics: SNPs, variant calling

Example representation of the same vector (see the sketch below):
    dense:  [1., 0., 0., 0., 0., 0., 3.]
    sparse: size: 7, indices: [0, 6], values: [1., 3.]
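A minimal sketch of the same dense-vs-sparse idea using scipy.sparse, which stores only the nonzero values and their indices (scipy is assumed to be available):

    import numpy as np
    from scipy.sparse import csr_matrix

    dense = np.array([1., 0., 0., 0., 0., 0., 3.])

    sparse = csr_matrix(dense)   # stored as a 1 x 7 sparse row
    print(sparse.shape)    # (1, 7)
    print(sparse.indices)  # [0 6]
    print(sparse.data)     # [1. 3.]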

Big n and Big d

We need methods that are linear in time and space
One idea: Exploit sparsity
Explicit sparsity can provide orders-of-magnitude storage and computational gains
Latent sparsity assumptions can be used to reduce dimension, e.g., PCA, low-rank approximation (unsupervised learning)

[Figure: a large matrix approximated by rank-r factors]

Another idea: Use different algorithms
Gradient descent is an iterative algorithm that requires O(nd) computation and O(d) local storage per iteration (see the sketch below)
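A minimal sketch of (full-batch) gradient descent for the least squares objective: each iteration touches every point once (O(nd) work) and only keeps the d-dimensional w and gradient (O(d) storage); the step size and iteration count are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(5)
    n, d = 500, 10
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    w = np.zeros(d)
    alpha = 0.1  # step size (arbitrary)

    for _ in range(200):
        # Gradient of the average squared error: (2/n) X^T (Xw - y), O(nd) per iteration
        grad = (2.0 / n) * X.T @ (X @ w - y)
        w -= alpha * grad

    print(np.sqrt(np.mean((X @ w - y) ** 2)))  # training RMSE after 200 iterations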

Closed Form Solution for Big n and Big d

Example: n = 6; 3 workers

[Map/reduce diagram as before: O(nd) distributed storage, O(nd^2) distributed computation, and O(d^2) local storage in the map step; O(d^3) local computation and O(d^2) local storage in the reduce step. The local d^2 and d^3 terms are the problem when d is large.]

Gradient Descent for Big n and Big d

Example: n = 6; 3 workers
workers: (x^(1), x^(5)), (x^(3), x^(4)), (x^(2), x^(6))
map: each worker computes its local contribution to the gradient
    O(nd) distributed storage; O(nd) distributed computation; O(d) local storage
reduce: sum the d-dimensional partial gradients and update w
    O(d) local computation; O(d) local storage
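A minimal sketch of one distributed gradient step, again simulating the partitions in plain Python: each "map" result is only a d-dimensional vector, and the "reduce" is a cheap O(d) sum (no specific cluster framework API is assumed):

    import numpy as np
    from functools import reduce

    rng = np.random.default_rng(6)
    n, d = 6, 4
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    w = np.zeros(d)
    alpha = 0.1  # step size (arbitrary)

    # Rows partitioned across 3 simulated workers
    parts = [(X[0:2], y[0:2]), (X[2:4], y[2:4]), (X[4:6], y[4:6])]

    # "map": each worker computes the gradient contribution of its local points (a d-vector)
    def local_grad(Xp, yp, w):
        return 2 * Xp.T @ (Xp @ w - yp)

    partial_grads = [local_grad(Xp, yp, w) for Xp, yp in parts]

    # "reduce": sum the d-dimensional partials and take one gradient step
    grad = reduce(np.add, partial_grads)
    w = w - alpha * grad
    print(w)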
