
Statistical Modeling

Let Your Data Speak


(general methodology)

Ján Dolinský
2BridgZ Solutions
jan.dolinsky@2bridgz.com
Statistical Modeling

Introduction

Methodology

Case Studies

Introduction

History of numerical modeling:

• parametric modeling (white-box)

• semi-parametric modeling (gray-box)

• non-parametric modeling (black-box)


energy consumption forecasting, financial market dynamics (FX), fraud detection,
event detection, disease detection, anomaly detection, ...

Introduction

Challenge – construct a model so that its structure reflects the underlying dynamics in the data

Current Best Practice – model selection: generate a model, train its parameters using MSE, and validate it on a validation set

Introduction

• often, all one has is the data itself

• which features relate to the output of interest – variable selection

• how these features relate to the output – model structure building

Methodology

Generally, model building involves:

• variable selection
- irrelevant variables may worsen the prediction quality of a model,
- collinearity & multicollinearity,
- unstable models

• model structure building
- irrelevant model terms (units, neurons, …) worsen prediction quality,
- parsimonious model structures generalize significantly better
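The collinearity point above can be illustrated with a short sketch (hypothetical toy data, plain NumPy least squares): two nearly identical inputs make the design matrix ill-conditioned, so the individual coefficients become unstable even though their sum stays well determined.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)   # nearly collinear with x1
y = 2.0 * x1 + rng.normal(scale=0.1, size=n)

X = np.column_stack([x1, x2, np.ones(n)])
# The condition number of the design matrix explodes under collinearity,
# so the split of the effect between b1 and b2 is numerically unstable.
cond = np.linalg.cond(X)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"condition number: {cond:.1e}")
print("coefficients (b1, b2, intercept):", np.round(coef, 2))
```

Only the combined effect b1 + b2 ≈ 2 is recoverable; dropping the redundant variable restores a stable model.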

Methodology – Variable Selection
[Figure: data matrix – N samples of 10 candidate input variables (e.g. 1.43, 4.21, 0.62, …) with a binary output column (0/1)]

Methodology – Variable Selection

Best-Subset selection

10 + 10·9 + 10·9·8 + … + 10! = 9 864 100 candidate selection paths
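The count above can be verified directly: it is the number of ordered sequences in which 1 to 10 variables can be selected from the 10 candidates.

```python
import math

# Ordered selections of k out of 10 variables, summed over k = 1..10:
# 10 + 10*9 + 10*9*8 + ... + 10!
total = sum(math.perm(10, k) for k in range(1, 11))
print(total)  # 9864100
```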

Methodology – Variable Selection

Instead, one can seek a good path through all possible subsets, e.g. orthogonal forward selection, …

How to drive such a selection – information criteria that optimize generalization
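A minimal sketch of criterion-driven forward selection (not the orthogonal decomposition itself; a plain least-squares fit and BIC as the information criterion are assumptions for illustration, and `forward_select` is a hypothetical helper): at each step the term that most improves the criterion is added, and selection stops as soon as the criterion no longer improves, with no separate validation set.

```python
import numpy as np

def forward_select(X, y, max_terms=None):
    """Greedy forward selection: at each step add the column that
    minimizes BIC; stop when BIC no longer improves."""
    n, p = X.shape
    chosen, best_bic = [], np.inf
    max_terms = max_terms or p
    while len(chosen) < max_terms:
        scores = []
        for j in (j for j in range(p) if j not in chosen):
            A = X[:, chosen + [j]]
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ coef) ** 2)
            bic = n * np.log(rss / n) + (len(chosen) + 1) * np.log(n)
            scores.append((bic, j))
        bic, j = min(scores)
        if bic >= best_bic:
            break            # no improvement: stop here
        best_bic, chosen = bic, chosen + [j]
    return chosen

# toy data: only columns 0 and 3 of 8 candidates are relevant
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=300)
print(sorted(forward_select(X, y)))
```

On this toy problem the relevant columns 0 and 3 are picked up, and the BIC penalty stops the search before most of the irrelevant candidates enter the model.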

Methodology – Variable Selection

● Knowledge Elucidation (interpretable models)

● Prediction Accuracy (generalization)

● Speed

Methodology – Variable Selection

Example: Connectivity Analysis in a Smart Grid

smart meter signals, feeder, noise in the feeder, significance of signals, ...

Methodology – Model Structure Building

Model structures which are linear-in-parameters but nonlinear in dynamics

Methodology – variable expansion

y = b·x + c

y = b·x + d·z + c,   z = x²

Linear in parameters, non-linear in dynamics
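The expansion above can be sketched in a few lines (toy data with assumed coefficients): appending z = x² as a new variable keeps the model linear in (b, d, c), so ordinary least squares still applies even though the fitted curve is nonlinear in x.

```python
import numpy as np

# y = b*x + d*z + c with z = x**2: linear in the parameters (b, d, c),
# nonlinear in the input x, so a single least-squares solve suffices.
rng = np.random.default_rng(2)
x = np.linspace(-2, 2, 100)
y = 1.5 * x + 0.8 * x**2 + 0.5 + 0.05 * rng.normal(size=100)

A = np.column_stack([x, x**2, np.ones_like(x)])  # columns [x, z, 1]
(b, d, c), *_ = np.linalg.lstsq(A, y, rcond=None)
print(round(b, 2), round(d, 2), round(c, 2))  # ≈ 1.5 0.8 0.5
```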
Methodology – Model Structure Building

Generation of a model term dictionary

The dictionary may consist of many terms

Model term selection

Methodology – Model Structure Building

The dictionary may be built using:

Polynomial terms
Basis Functions (RBF, Thin-Plate-Spline, …)
Fourier functions
Spatio-Temporal mixing (ESN, ...)
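As a sketch of how such a dictionary might be generated (Gaussian RBFs, centers placed at the data samples, and a fixed width are assumptions for illustration; `rbf_dictionary` is a hypothetical helper): each RBF term becomes one candidate column, and the model stays linear in the term weights.

```python
import numpy as np

def rbf_dictionary(X, centers, width):
    """Expand inputs into Gaussian RBF terms, one column per center.
    Each column exp(-||x - c||^2 / (2*width^2)) is a candidate model
    term for the subsequent term-selection step."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * width**2))

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(50, 2))       # 2 inputs: x1, x2
centers = X                                # centers at the samples themselves
D = rbf_dictionary(X, centers, width=0.5)  # 50 candidate RBF terms
print(D.shape)  # (50, 50)
```

Term selection would then pick a small subset of these columns, analogous to the 7-term model in the example below.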
Example: RBF terms for non-linear classification

[Diagram: 2 inputs (x1, x2) → statistical expansion into a dictionary of 250 candidate RBF terms (rbf 1 … rbf 250) → variable selection → 7-term final model (e.g. rbf 5, rbf 14, rbf 63, rbf 231, …)]

● Dimensionality is determined automatically

● Generalization is optimized directly using an IC
● Significantly fewer trials than with NN or SVM
● (often) No need for cross-validation
● Significantly lower dimensionality than NN or SVM
Comparison

Neural Networks
- many passes through the data
- structure is nonlinear-in-parameters
- involves nonlinear optimization over the entire model structure
- computationally expensive
- hard to determine an appropriate dimensionality
- elucidation of knowledge from the model structure is practically impossible

Support Vector Machines (& Relevance VM)
- structure is linear-in-parameters
- few hyper-parameters to optimize (cross-validation)
- dimensionality is determined via cross-validation
- elucidation of knowledge is difficult
Comparison

(Automatic) Model Structure Building
- structure is linear-in-parameters
- one or no hyper-parameter to optimize
- (often) cross-validation not necessary
- generalization is directly optimized
- data do not have to be divided into training and testing sets
- dimensionality is determined automatically
- parsimonious model structures with excellent generalization ability, which often elucidate knowledge
Comparison

                       SVM        MB using IC
methodology            MS & CV    Automatic MB
generated models       10-20      1
hyper-parameters       2-3        0-1
final dimensionality   120        7
misclass. rate         9.5%       9%
Future research

Information Geometry
Information Criteria
