http://www.cs.yale.edu/homes/mmahoney
http://www.cs.rpi.edu/~drinep
Randomized Linear Algebra Algorithms
Goal: To develop and analyze (fast) Monte Carlo algorithms for performing
useful computations on large matrices and tensors.
• Matrix Multiplication
• Computation of the Singular Value Decomposition
• Computation of the CUR Decomposition
• Testing Feasibility of Linear Programs
• Least Squares Approximation
• Tensor computations: SVD generalizations
• Tensor computations: CUR generalization
Randomized Linear Algebra Algorithms
Motivation:
• (Algorithmic) To speed up computations in applications where extremely
large data sets are modeled by matrices/tensors and, e.g., O(n^3) time is
not an option.
• (Algorithmic) To reduce the memory requirements in applications where
the data sets are modeled by matrices/tensors and are too large to store in full.
• (Equivalent to the above) To analyze the accuracy of simple algorithms
when only a small sample of the full data set is available.
• (Structural) To reveal novel structural properties of the datasets, given
sufficient computational time.
Example: the CUR decomposition
[Diagram: A ≈ C · U · R, with a carefully chosen U; C holds a few columns of A and R a few rows of A.]
Goal: make (some norm) of A-CUR small.
Why? Given a sample consisting of a few columns (C) and a few rows (R) of A,
we can compute U and “reconstruct” A as CUR; if the sampling probabilities
are not “too bad”, we get provably good accuracy.
Why? Given sufficient time, we can find C, U, and R such that A – CUR is
“very” small. This might lead to better understanding of the data.
Applications of such algorithms
Matrices arise, e.g., since m objects (documents, genomes, images, web
pages), each with n features, may be represented by an m x n matrix A.
• Covariance Matrices
• Latent Semantic Indexing
• DNA Microarray Data
• Eigenfaces and Image Recognition
• Similarity Queries
• Matrix Reconstruction
• LOTS of other data applications!!
More generally,
• Linear and Nonlinear Programming Applications
• Design of Approximation Algorithms
• Statistical Learning Theory Applications
Overview (1/3)
• Matrix Multiplication
• Feasibility testing of Linear Programs
Overview (2/3)
• Applications of Tensor-CUR
• Hyperspectral data
• Recommendation systems
Overview (3/3)
• Regression problems
• Least squares problems
Computation on Massive Data Sets
Data are too large to fit into main memory; they are either not stored or are
stored in external memory.
Algorithms that compute on data streams examine the stream, keep a small
“sketch” of the data, and perform computations on the sketch.
Munro & Paterson ’78: studied “the relation between the amount of internal
storage available and the number of passes required to select the k-th
highest of n inputs.”
The Pass Efficient Model
Motivation: Amount of disk/tape space has increased enormously; RAM and
computing speeds have increased less rapidly.
• Can store large amounts of data, but
• Cannot process these data with traditional algorithms.
Random Sampling
Random sampling is used to estimate some parameter defined over a very
large set by looking at only a very small subset.
Overview (1/3)
• Matrix Multiplication
• Feasibility testing of Linear Programs
Approximating Matrix Multiplication …
(D. & Kannan FOCS ’01, and D., Kannan, & M. TR ’04, SICOMP ’05)
Problem Statement
Given an m-by-n matrix A and an n-by-p matrix B, approximate the product A·B,
OR, equivalently,
Approximate the sum of n rank-one matrices.
Algorithm
1. Fix a set of probabilities p_i, i = 1…n, summing up to 1.
2. For t = 1 up to s,
set j_t = i, where Pr(j_t = i) = p_i;
(Pick s terms of the sum, with replacement, with respect to the p_i.)
3. Approximate AB by the sum of the s terms, after scaling.
The algorithm (matrix notation)
Algorithm
1. Pick s columns of A to form an m-by-s matrix C and the corresponding
s rows of B to form an s-by-p matrix R.
2. (discard A and B) Approximate A · B by C · R.
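A minimal numpy sketch of this sampling scheme (the function name is mine; the probabilities p_i ∝ |A^(i)||B_(i)|, discussed on the next slide, are the near-optimal choice):

```python
import numpy as np

def approx_matmul(A, B, s, rng=np.random.default_rng(0)):
    """Approximate A @ B by s sampled rank-one terms (column of A times row of B)."""
    n = A.shape[1]
    # Sampling probabilities p_i proportional to |A^(i)| * |B_(i)|.
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = p / p.sum()
    idx = rng.choice(n, size=s, replace=True, p=p)
    scale = 1.0 / np.sqrt(s * p[idx])
    C = A[:, idx] * scale            # m x s: sampled, rescaled columns of A
    R = B[idx, :] * scale[:, None]   # s x p: sampled, rescaled rows of B
    return C @ R                     # unbiased: E[C @ R] = A @ B
```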
The algorithm (matrix notation, cont’d)
• For t = 1 up to s, pick a column A^(j_t) and a row B_(j_t) with probability Pr(j_t = i) = p_i. The (near-)optimal choice is p_i = |A^(i)| |B_(i)| / Σ_j |A^(j)| |B_(j)|.
Error Bounds
For the above algorithm (with the near-optimal probabilities above), the expected error satisfies E[ ||AB – CR||_F ] ≤ ||A||_F ||B||_F / √s.
If ||AB||_F = Ω(||A||_F ||B||_F), then the above bound is a relative error bound.
This happens if there is “not much cancellation” in the multiplication.
Error Bounds (tight concentration)
For the above algorithm, with probability at least 1–δ, ||AB – CR||_F ≤ O(√log(1/δ)) · ||A||_F ||B||_F / √s.
Notice that we removed the expectation (by applying a martingale argument) at
the cost of an extra log(1/δ) factor.
(Markov’s inequality would also remove the expectation, but would introduce an
extra 1/δ factor.)
Special case: B = AT
If B = A^T, then the sampling probabilities are p_i = |A^(i)|^2 / ||A||_F^2, i.e., proportional to the squared column lengths.
Special case: B = AT (cont’d)
(Rudelson & Vershynin ’04, Vershynin ’04)
Improvement for the spectral norm bound for the special case B = AT.
Empirical evaluation: setup
(Data from D. Lewis & E. Cohen, SODA ’97 & J. of Algorithms ’99)
Database
Dense document-concept matrix A with 5000 documents and 320 concepts.
Experiment
Our goal was to identify all document-document matches, i.e., compute AA^T
and identify all entries larger than some threshold τ (the proximity problem).
The algorithm
Approximate AA^T (using our method) by CC^T.
Find all entries of CC^T that are larger than τ – ε, ε > 0 (“candidate
entries”).
Deterministically compute the dot products for the “candidate entries”.
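A sketch of this three-step procedure, assuming length-squared column sampling as in the B = A^T special case above (names and thresholds are illustrative):

```python
import numpy as np

def proximity_matches(A, s, tau, eps, rng=np.random.default_rng(0)):
    """Approximate A A^T by C C^T, flag candidate entries, then verify exactly."""
    n = A.shape[1]
    p = np.linalg.norm(A, axis=0) ** 2
    p = p / p.sum()
    idx = rng.choice(n, size=s, replace=True, p=p)
    C = A[:, idx] / np.sqrt(s * p[idx])       # m x s sketch of A
    approx = C @ C.T                          # approximates A A^T
    # Candidate entries above tau - eps (assumes tau - eps > 0, so the
    # zeroed lower triangle is never flagged).
    cand = np.argwhere(np.triu(approx, k=1) > tau - eps)
    # Deterministically compute the dot products only for the candidates.
    return [(i, j) for i, j in cand if A[i] @ A[j] > tau]
```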
Empirical evaluation: results (τ = 0.85)
Empirical evaluation: results (τ = 0.75)
Experiments: uniform sampling
Random sampling of Linear Programs
(D., Kannan, & M. TR ‘04 & STACS ’05)
Let P^(i) denote the i-th column of the r-by-n matrix P, and suppose that
the following Linear Program is feasible:
The picture …
If feasible, then after sampling a few variables, the induced Linear Program is feasible (with high probability).
Another picture …
[Diagram: the i-th constraint is feasible.]
The picture …
If infeasible, then after sampling a few variables, the induced Linear Program is infeasible (with high probability).
Another picture …
[Diagram: the i-th constraint is infeasible.]
Overview (1/3)
• Matrix Multiplication
• Feasibility testing of Linear Programs
Singular Value Decomposition (SVD)
Rank k approximations (Ak)
U_k (V_k): orthogonal matrix containing the top k left (right) singular vectors of A.
Also, A_k = U_k U_k^T A.
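As a small worked example, the identity A_k = U_k U_k^T A in numpy (illustrative only):

```python
import numpy as np

def rank_k(A, k):
    """Best rank-k approximation of A: A_k = U_k U_k^T A."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    Uk = U[:, :k]
    return Uk @ (Uk.T @ A)
```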
Approximating SVD in O(n) time
(Frieze, Kannan & Vempala FOCS ’98, D., Frieze, Kannan, Vempala & Vinay SODA ’99, JML ’04, D., Kannan, & M. TR ’04, SICOMP ’05)
Given: m x n matrix A
• Sample c columns from A and rescale to form the m x c matrix C.
• Compute the m x k matrix Hk of the top k left singular vectors of C.
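A minimal sketch of these two steps, assuming length-squared column sampling (the choice analyzed in these papers):

```python
import numpy as np

def approx_top_k(A, c, k, rng=np.random.default_rng(0)):
    """Sample c rescaled columns of A into C; return H_k, the top-k left
    singular vectors of C. A is then approximated by H_k H_k^T A."""
    n = A.shape[1]
    p = np.linalg.norm(A, axis=0) ** 2
    p = p / p.sum()
    idx = rng.choice(n, size=c, replace=True, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])       # m x c
    Hk, _, _ = np.linalg.svd(C, full_matrices=False)
    return Hk[:, :k]
```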
Example of randomized SVD
[Figure: the baboon test image A and the matrix C of sampled columns.]
Compute the top k left singular vectors of the matrix C and store them
in the 512-by-k matrix H_k.
Example of randomized SVD (cont’d)
[Figure: the baboon image A and its approximation H_k H_k^T A.]
Element-wise sampling
(Achlioptas & McSherry, STOC ’01, JACM ’05)
More details:
Let p_ij ∈ [0,1] for all i,j. Create the matrix S from A such that: S_ij = A_ij / p_ij with probability p_ij, and S_ij = 0 otherwise.
||A–S||_2 is bounded ⇒ (i) the singular values of A and S are close, and (ii, under
additional assumptions) the top k left (right) singular vectors of S span a subspace
that is close to the subspace spanned by the top k left (right) singular vectors of A.
How to use it
Approximating singular values fast:
• Zero out (a large number of) elements of A, scale the remaining ones appropriately.
• Compute the singular values of the resulting sparse matrix using iterative techniques.
• (Good choice for p_ij: p_ij = s·A_ij^2 / Σ_{i,j} A_ij^2, where s denotes the expected
number of elements that we seek to keep in S.)
• Note: Each element is kept or discarded independently of the others.
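A sketch of this element-wise scheme; the clipping of p_ij at 1 is my addition, since s·A_ij^2 / Σ A_ij^2 can exceed 1 for large entries:

```python
import numpy as np

def sparsify(A, s, rng=np.random.default_rng(0)):
    """Keep entry (i,j) independently with prob p_ij = s*A_ij^2 / sum(A^2);
    rescale kept entries by 1/p_ij so that E[S] = A. About s entries survive."""
    p = np.minimum(1.0, s * A**2 / (A**2).sum())
    keep = (rng.random(A.shape) < p) & (p > 0)
    S = np.zeros_like(A, dtype=float)
    S[keep] = A[keep] / p[keep]
    return S
```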
Overview (1/3)
• Matrix Multiplication
• Feasibility testing of Linear Programs
A novel CUR matrix decomposition
(D. & Kannan, SODA ’03, D., Kannan, & M. TR ’04, SICOMP ’05)
[Diagram: A ≈ C · U · R, with O(1) columns in C and O(1) rows in R.]
The CUR decomposition
Given a large m-by-n matrix A (stored on disk), compute a decomposition
CUR of A such that:
Computing U
Intuition (which can be formalized):
The CUR algorithm essentially expresses every row of the matrix A as a
linear combination of a small subset of the rows of A.
• This small subset consists of the rows in R.
• Given a row of A – say A(i) – the algorithm computes a good fit for the
row A(i) using the rows in R as the basis, by approximately solving
Notice that only c = O(1) elements of the i-th row are given as input.
However, a vector of coefficients u can still be computed.
Computing U (cont’d)
Given c elements of A_(i), the algorithm computes a good fit for the row
A_(i) using the rows in R as the basis, by approximately solving min_u |A_(i) – u R|, restricted to the c sampled coordinates.
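One way to realize this in code, as a rough sketch; here U is simply the pseudoinverse of the rescaled “intersection” of C and R, whereas the papers’ exact construction (with its rank-k truncation and rescaling) differs in details:

```python
import numpy as np

def cur(A, c, r, rng=np.random.default_rng(0)):
    """Length-squared sampling of c columns (C) and r rows (R) of A; U is the
    pseudoinverse of the rescaled r x c 'intersection', so every row of A is
    (approximately) fit by the rows in R."""
    m, n = A.shape
    pc = np.linalg.norm(A, axis=0) ** 2
    pc = pc / pc.sum()
    pr = np.linalg.norm(A, axis=1) ** 2
    pr = pr / pr.sum()
    cols = rng.choice(n, size=c, replace=True, p=pc)
    rows = rng.choice(m, size=r, replace=True, p=pr)
    C = A[:, cols] / np.sqrt(c * pc[cols])
    R = A[rows, :] / np.sqrt(r * pr[rows])[:, None]
    W = C[rows, :] / np.sqrt(r * pr[rows])[:, None]   # r x c intersection
    U = np.linalg.pinv(W)                             # c x r
    return C, U, R                                    # A ~ C @ U @ R
```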
Error bounds for CUR
Other (randomized) CUR decompositions
• For any subset of the columns, denoted C (e.g., chosen by the practitioner)
Other CUR decompositions
(G.W. Stewart, Num. Math. ’99, TR ’04)
Other CUR decompositions, cont’d
(Goreinov, Tyrtyshnikov, and Zamarashkin, LAA ’97 & Goreinov, Tyrtyshnikov, Cont. Math. ’01)
Lower Bounds
Overview (2/3)
• Applications of Tensor-CUR
• Hyperspectral data
• Recommendation systems
Approximating Max-Cut
Max-Cut (NP-hard for general and dense graphs, Max-SNP)
Given a graph G=(V,E), |V|=n, partition V into two disjoint subsets V1 and V2 such
that the number of edges of E that have one endpoint in V1 and one endpoint in
V2 is maximized.
Goemans & Williamson ’94: .878-approximation algorithm (might be tight!)
Let A denote the adjacency matrix of G. All previous algorithms also guarantee an
additive error approximation for the weighted Max-Cut problem (an additive error of
the form ε n^2 A_max, where A_max is the maximum edge weight in the graph).
Our result:
We can approximate the weighted Max-Cut of G(V,E) up to the same additive error,
in constant time and space, after reading the graph a small number of times.
The algorithm (sketch)
Let A denote the adjacency matrix of the weighted graph G=(V,E).
Data mining with CUR
We are given m (> 10^6) objects and n (> 10^5) features describing the objects.
Database
An m-by-n matrix A (Aij shows the “importance” of feature j for object i).
E.g., m documents, represented w.r.t. n terms.
Queries
Given a new object x, find similar objects in the database (nearest
neighbors).
Data mining with CUR (cont’d)
[Figure: a query x and an object d as vectors in feature space.]
Two objects are similar if the angle between the corresponding (normalized)
vectors is small; so, x^T · d = cos(x,d) measures the similarity between the
query x and an object d.
Data mining with CUR (cont’d)
Assume that CUR is an approximation to A, such that CUR is stored
efficiently (e.g. in RAM).
Given a query vector x, instead of computing A · x, compute CUR · x to
identify its nearest neighbors.
(Also recall how we used the matrix multiplication algorithm to solve the
proximity problem, namely find all pairs of objects that are “close”.)
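For example, with the hypothetical cur() sketch above, a query costs three skinny matrix-vector products instead of a pass over A:

```python
import numpy as np

def cur_query(C, U, R, x, top=10):
    """Approximate the scores A @ x by C @ (U @ (R @ x)); return best matches."""
    scores = C @ (U @ (R @ x))
    return np.argsort(-scores)[:top]
```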
Genetic microarray data and CUR
Exploit structural properties of CUR in biological applications:
[Diagram: a genes × experimental conditions data matrix.]
Find a “good” set of genes and arrays to include in C and R?
Provable and/or heuristic strategies are acceptable.
Gene microarray data: M., D., & Alter (UT Austin) (sporulation and cell cycle data).
Recommendation Systems
(D., Raghavan, & Kerenidis, STOC ’02)
The problem:
Assume the existence of m customers and n products. Then, A is an (unknown)
matrix s.t. Aij is the utility of product j for customer i.
Our goal is to recreate A from a few samples, s.t. we can recommend high
utility products to customers.
Prior work:
• Assuming strong clustering of the products, competitive algorithms exist even
with only 2 samples/customer.
• (Azar, Fiat, Karlin, McSherry & Saia STOC ’01) Assuming sampling of Ω(mn)
entries of A and a certain gap requirement, they recreate A very accurately.
Recommendation Systems
The problem:
Assume the existence of m customers and n products. Then, A is an (unknown)
matrix s.t. Aij is the utility of product j for customer i.
Our goal is to recreate A from a few samples, s.t. we can recommend high
utility products to customers.
Recommendation Systems (cont’d)
Question:
Can we get competitive performance by sampling less than Ω (mn) elements?
Answer:
Apply the CUR decomposition.
[Diagram: the m customers × n products matrix A, with a sample of customers (purchases, small surveys) and a sample of products.]
Recommendation Systems (cont’d)
Details:
Sample a constant number of rows and columns of A and compute A’=CUR.
Assuming that A has a “good” low rank approximation,
Overview (2/3)
• Applications of Tensor-CUR
• Hyperspectral data
• Recommendation systems
Datasets modeled as tensors
Q. What do we know about tensor decompositions?
A. Not much, although tensors arise in numerous applications.
[Diagram: an m × n × p tensor A, with modes 1, 2, and 3.]
Tensors in Applications
Tensors appear both in Math and CS.
• Represent high dimensional functions.
• Connections to complexity theory (e.g., matrix multiplication complexity).
• Statistical applications (e.g., Independent Component Analysis, higher order
statistics, etc.).
• Large data-set applications (e.g., Medical Imaging & Hyperspectral Imaging)
Problem: There does not exist a definition of tensor rank (and an associated
tensor SVD) with the nice properties found in the matrix case.
(Lek-Heng Lim ’05: strong impossibility results!)
Heuristic solution: “unfold” the tensor along a mode and apply Linear
Algebra.
Tensor rank
(Hastad, J of Algorithms ’90, DeLaVega, Kannan, Karpinski, Vempala STOC ’05)
[Diagram: a tensor “unfolded” along a mode into a matrix.]
Tensors in real applications
The TensorCUR algorithm (3-modes)
[Diagram: an m genes × n environmental conditions × p time/frequency slices tensor; randomly sample a few (here, 2) of the p slices.]
The TensorCUR algorithm (cont’d)
Let R denote the tensor of the sampled snapshots.
The TensorCUR algorithm (cont’d)
Theorem:
Overview (2/3)
• Applications of Tensor-CUR
• Hyperspectral data
• Recommendation systems
TensorCUR on scientific data sets
Apply the random sampling methodology and kernel-based Laplacian methods to
large physical, chemical, and biological data sets.
Data sets being considered
Sequence and mutational data from G-protein coupled receptors
• to identify mutants with enhanced stability properties.
Simulational data
• to more efficiently conduct large scale computations.
Sampling hyperspectral data
Sample slabs depending on total absorption:
The 65th slab approximately reconstructed
Tissue Classification - Exact Data
Tissue Classification - N_s = 12 & N_f = 1000
TensorCUR on Recommendation Systems
Our previous setup:
Assume the existence of m customers and n products. Then, A is an (unknown)
matrix s.t. Aij is the utility of product j for customer i.
Our goal is to recreate A from a few samples, s.t. we can recommend high
utility products to customers.
[Diagram: the m customers × n products matrix A, with a sample of customers (purchases, small surveys) and a sample of products.]
Recommendation Systems (cont’d)
Comment:
It is folklore knowledge in the economics literature that product utility is an
ordinal and not a cardinal quantity.
Thus, it is more natural to compare products than to assign utility values.
Model revisited:
Every customer has an n-by-n matrix (whose entries are ±1) representing
pairwise product comparisons.
Overall, there are m such matrices, forming an n-by-n-by-m 3-mode array or
a three-dimensional tensor, denoted by A.
We seek to extract the “structure” of this tensor by sampling.
Recommendation Systems (cont’d)
Our goal:
Recreate the tensor A from a few samples in order to recommend high utility
products to the customers.
[Diagram: the n products × n products × m customers tensor of pairwise comparisons.]
Overview (3/3)
• Regression problems
• Least squares problems
Motivation for Kernels (1 of 3)
Motivation for Kernels (2 of 3)
Motivation for Kernels (3 of 3)
If the Gram matrix G, where G_ij = k_ij = (φ(X^(i)), φ(X^(j))), is dense but has
low numerical rank, then calculations of interest still need O(n^2) space
and O(n^3) time:
• matrix inversion in GP prediction,
• quadratic programming problems in SVMs,
• computation of eigendecomposition of G.
Kernel-CUR
(D. & M., COLT ’05, TR ’05)
Main algorithm:
• Randomized algorithm to approximate a Gram matrix.
• Low-rank approximation in terms of columns (and rows) of G = X^T X.
Kernel-CUR Algorithm
Algorithm:
• Pick c columns of G in i.i.d. trials, with replacement and with respect
to the {p_i}; let L be the set of indices of the sampled columns.
• Scale each sampled column (with index i ∈ L) by dividing it by (c·p_i)^{1/2}.
• Let C be the n x c matrix containing the rescaled sampled columns.
• Let W be the c x c submatrix of G with entries G_ij/(c·p_i^{1/2}·p_j^{1/2}), i,j ∈ L.
• Compute W_k^+, the pseudoinverse of the best rank-k approximation of W.
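A sketch of this algorithm; the probabilities p_i ∝ G_ii^2 used here are only one natural choice (the interpretation of the sampling probabilities is discussed below):

```python
import numpy as np

def kernel_cur(G, c, k, rng=np.random.default_rng(0)):
    """Approximate the Gram matrix G by C W_k^+ C^T from c sampled columns."""
    n = G.shape[0]
    p = np.diag(G) ** 2          # assumed choice: p_i proportional to G_ii^2
    p = p / p.sum()
    L = rng.choice(n, size=c, replace=True, p=p)
    scale = 1.0 / np.sqrt(c * p[L])
    C = G[:, L] * scale                            # n x c rescaled columns
    W = G[np.ix_(L, L)] * np.outer(scale, scale)   # c x c rescaled intersection
    # W_k^+: pseudoinverse of the best rank-k approximation of W.
    u, s, vt = np.linalg.svd(W)
    inv = np.where(np.arange(len(s)) < k, 1.0 / np.maximum(s, 1e-12), 0.0)
    Wk_pinv = (vt.T * inv) @ u.T
    return C @ Wk_pinv @ C.T
```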
Notes on the algorithm
Note the structural simplicity of the algorithm:
• C consists of a small number of representative data points.
• W consists of the induced subgraph defined by those points.
Interpreting the sampling probabilities
The Nystrom Method (1 of 3)
Consider the eigenfunction problem:
Discretize with a quadrature rule ({w_j} are the weights and {s_j} are the
quadrature points):
The Nystrom Method (2 of 3)
The Nystrom Method (3 of 3)
Fast Computation with Kernels
[Figure: the adjacency matrix at t = 0 and at t = t*, under kernel-based diffusion.]
Overview (3/3)
• Regression problems
• Least squares problems
Regression problems
(D., M. & Muthukrishnan, ’05)
“Induced” regression problems
[Diagram: the induced subproblem keeps sampled rows of A and the corresponding “rows” of b, with a rescaling to account for the undersampling.]
Regression problems, definition
Exact solution
x_opt = A^+ b, where A^+ is the pseudoinverse of A; A·x_opt = A A^+ b is the
projection of b on the subspace spanned by the columns of A.
Singular Value Decomposition (SVD)
Questions …
Creating an induced subproblem
Algorithm
1. Fix a set of probabilities p_i, i = 1…n, summing up to 1.
2. Pick r indices from {1…n} in r i.i.d. trials, with respect to the p_i’s.
3. For each sampled index j, keep the j-th row of A and the j-th element
of b; rescale both by (1/(r·p_j))^{1/2}.
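A sketch of the induced subproblem, assuming the probabilities p are given (computing probabilities that satisfy the conditions below is the expensive part):

```python
import numpy as np

def induced_lsq(A, b, r, p, rng=np.random.default_rng(0)):
    """Sample r rows of (A, b) i.i.d. w.p. p_j, rescale by (1/(r p_j))^{1/2},
    and solve the small least-squares problem."""
    idx = rng.choice(A.shape[0], size=r, replace=True, p=p)
    scale = 1.0 / np.sqrt(r * p[idx])
    x_tilde, *_ = np.linalg.lstsq(A[idx] * scale[:, None], b[idx] * scale,
                                  rcond=None)
    return x_tilde
```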
The induced subproblem
Our results
If the p_i satisfy certain conditions, then with probability at least 1–δ,
Our results, cont’d
If the p_i satisfy certain conditions, then with probability at least 1–δ,
κ(A): the condition number of A
Back to induced subproblems …
Conditions on the probabilities, SVD
Conditions for the probabilities
The conditions that the p_i must satisfy, for some β_1, β_2, β_3 ∈ (0,1], involve:
• the lengths of the rows of the matrix of left singular vectors of A;
• the component of b not in the span of the columns of A.
Smaller β_i ⇒ more sampling.
In O(nd^2) time we can easily compute p_i’s that satisfy all three conditions,
with β_1 = β_2 = β_3 = 1/3.
(Too expensive!)
Overview (3/3)
• Regression problems
• Least squares problems
Revisiting the CUR decomposition
(D., M. & Muthukrishnan, ’05)
[Diagram: A ≈ C · U · R, with a carefully chosen U: an approximation to A built from O(1) columns and O(1) rows of A.]
Goal: provide (good) bounds for some norm of the error matrix A – CUR.
A simple algorithm
Algorithm
Step 1: pick a few columns of A and include them in C (“basis” columns).
(Any set of columns will do; “good” choices will be discussed later)
Step 2: express all columns of A as linear combinations of the “basis” columns.
Step 2
[Diagram: the i-th column of A expressed as a linear combination of the “basis” columns.]
Step 3
[Diagram: the i-th column of R; D is a diagonal rescaling matrix applied to sampled rows of C.]
The bound holds with probability at least 1–δ; we need to pick r = O(c^2·log(1/δ)/ε^2)
rows.
(Upcoming writeup: we can reduce c^2 to c.)
Constructing D and R: we compute a set of probabilities p_i, and sample
and rescale rows of A in r i.i.d. trials with respect to the p_i’s.
Forming the p_i (1/3)
Forming the p_i (2/3)
Compute p_i that satisfy, for some β_1, β_2, β_3 ∈ (0,1], conditions involving:
• the lengths of the rows of the matrix of left singular vectors of C;
• the component of A not in the span of the columns of C.
Smaller β_i ⇒ more sampling.
Overall decomposition
[Diagram: the overall decomposition A ≈ C · U · R: C = columns of A; R = rows of A, with a diagonal rescaling matrix; U built from the r × c “intersection” of C and R.]
Overall decomposition (a variant)
[Diagram: a variant A ≈ C · U · R, with U the “intersection” of C and R.]
Error bounds hold if C and R contain the columns and rows of A that define a
parallelepiped of maximal volume.
A (first) choice for C
(D., Frieze, Kannan, Vempala & Vinay ’99, ‘03 and D., Kannan, & M. TR ‘04 & SICOMP ‘05)
CUR error
A (second) choice for C
The MultiPass sampling algorithm
R = A; S = empty set;
For i = 1…t
1. Pick c columns, each sampled with probability proportional to the squared
Euclidean length of the corresponding column of the residual R.
2. S = S ∪ {sampled columns from step 1};
3. R = A – P_S A;
End For
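A sketch of the loop, reading “pick c columns” as sampling column indices of A with probability proportional to the squared lengths of the residual’s columns (my reading of the slide):

```python
import numpy as np

def multipass_columns(A, c, t, rng=np.random.default_rng(0)):
    """t rounds of adaptive column sampling; returns selected column indices."""
    S = []
    R = A.copy()                           # residual, starts at A
    for _ in range(t):
        norms = np.linalg.norm(R, axis=0) ** 2
        if norms.sum() == 0:               # A already fully captured
            break
        p = norms / norms.sum()
        S.extend(rng.choice(A.shape[1], size=c, replace=True, p=p).tolist())
        Q, _ = np.linalg.qr(A[:, S])       # orthonormal basis for span(S)
        R = A - Q @ (Q.T @ A)              # R = A - P_S A
    return S
```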
Intuition
R = A; S = empty set;
For i = 1…t
1. Pick c columns, each sampled with probability proportional to the squared
Euclidean length of the corresponding column of the residual R.
2. S = S ∪ {sampled columns from step 1};
3. R = A – P_S A;
End For
CUR error
A (third) choice for C
(It seems possible to use the volume work by Goreinov et al. to design
randomized strategies to pick such columns.)
CUR error
(For any k) In O(nc + mn) time, we can construct C, U, and R such that
Notice that we can pick the rows efficiently, given the columns.
Example: finding ht-SNPs
(data from K. Kidd’s lab at Yale University, joint work with Dr. Paschou at Yale University)
Single Nucleotide Polymorphisms: the most common type of genetic variation in the
genome.
(Locations at the human genome where typically two alternate nucleotide bases are observed.)
Why are SNPs important: they occur quite frequently within the genome allowing the
tracking of disease genes and population histories.
Thus, they are effective markers for genomic research!
There are ∼10 million SNPs in the human genome (under some assumptions).
Research topics
Research Topics (working within a specific chromosomal region of the human genome):
(i) Are different SNPs correlated, either within a population, or across different
populations? (Folklore knowledge: yes).
(ii) Find a “good” set of haplotype tagging-SNPs capturing the diversity of a
chromosomal region of the human genome.
(iii) Is extrapolation feasible? (Recall Nystrom/CUR extrapolation.) This would save (a
lot of) time and money!
Existing literature
Pairwise metrics of SNP correlation, called LD (linkage disequilibrium) distance, based on nucleotide
frequencies and co-occurrences.
Almost no metrics exist for measuring correlation between more than 2 SNPs and LD is very
difficult to generalize.
Exhaustive and semi-exhaustive algorithms in order to pick “good” ht-SNPs that have small LD
distance with all other SNPs.
Using Linear Algebra: an SVD based algorithm was proposed by Lin & Altman, Am. J. Hum. Gen. 2004.
The raw data
SNPs (columns) × individuals (rows); each entry is a genotype:
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG
GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA
GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG
GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG
GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA
GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA
GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA
Notes:
• Each SNP genotype consists of two alleles.
Encoding the data
SNPs (columns) × individuals (rows); each genotype is encoded as –1, 0, or +1:
0 0 0 1 0 -1 1 1 1 0 0 0 0 0 1 0 1 -1 -1 1 -1 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 0 1 -1 -1 1 -1 1 -1 1 1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1
-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 1 0 0 1 -1 -1 1 0 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
-1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 1 -1 1 1 1 0 -1 1 0 1 1 0 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-1 -1 -1 1 -1 -1 1 1 1 -1 1 -1 -1 -1 1 0 1 -1 -1 0 -1 1 1 0 0 1 1 1 -1 -1 -1 1 0 0 0 0 0 0 0 0 0 1 -1 -1 1
-1 -1 -1 1 -1 -1 1 0 1 0 0 0 0 0 1 0 1 -1 -1 0 -1 0 1 -1 0 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1
This means that if we select the top k left singular vectors U_k, we can express every
column (i.e., SNP) of A as a linear combination of the left singular vectors, losing at
most 10% of the information in the matrix.
Nice feature: SVD provides a non-trivial (maybe not achievable) lower bound.
In many cases, the lower bound is attained by the greedy heuristic!
(In our data, at most k+5 columns suffice to extract 90% of the structure.)
[Figure: populations grouped by region: America, Oceania, Asia, Europe, Africa.]
Extrapolation
Given a small number of SNPs for all subjects, and all SNPs for some
subjects, extrapolate the values of the missing SNPs.
[Diagram: the individuals × SNPs matrix; “training” data (all SNPs for most subjects) and the SNP sample (a small number of SNPs, given for all subjects).]
Extrapolation
We split our data into “training” (90%) and “test” (10%).
The training set corresponds to R; for the “test” set we are given data on a
small number of SNPs, picked by running our greedy multipass heuristic on the
“training” set.
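A sketch of the extrapolation step (hypothetical names; sel is the set of sampled SNPs chosen by the multipass heuristic, and predictions are snapped back to the {–1, 0, +1} encoding):

```python
import numpy as np

def extrapolate_snps(train, test_sample, sel):
    """Learn, on the training subjects, a linear map from the sampled SNPs to
    all SNPs; apply it to test subjects, for whom only columns `sel` are known."""
    coef, *_ = np.linalg.lstsq(train[:, sel], train, rcond=None)
    pred = test_sample @ coef              # (test subjects) x (all SNPs)
    return np.clip(np.rint(pred), -1, 1)   # snap back to the {-1, 0, +1} codes
```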
[Figures: extrapolation results for six populations: Rondonian Surui, Arizona Pima, Atayal, Samaritan, European American, African American.]
Overview (3/3)
• Regression problems
• Least squares problems
Conclusions & Open Problems
• Empirical evaluation of different sampling probabilities
• Uniform
• Row/column lengths squared
• Lengths of the rows of the left/right singular vectors
• Inverse squares of row/column lengths to pick outliers
• Others, depending on the problem to be solved?
Conclusions & Open Problems
• Impose other structural properties in CUR-type decompositions
• Non-negativity
• Element quantization, e.g., to 0,1,-1
• Block-SVD type structure