
Statistical Modeling

Let Your Data Speak


(general methodology)

Ján Dolinský
2BridgZ Solutions
jan.dolinsky@2bridgz.com
Statistical Modeling

Introduction

Methodology

Case Studies

Introduction

History of numerical modeling:

• parametric modeling (white-box)

• semi-parametric modeling (gray-box)

• non-parametric modeling (black-box)


energy consumption forecasting, financial market dynamics (FX), fraud detection,
event detection, disease detection, anomaly detection, ...

Introduction

Challenge – construct a model so that its structure reflects the underlying dynamics in the data

Current Best Practice – model selection: generate a model, train its parameters using MSE, and validate it on a validation set

Introduction

• often, all one has is the data itself

• which features relate to the output of interest – variable selection

• how these features relate to the output – model structure building

Methodology

Generally, model building involves:

• variable selection
- irrelevant variables may worsen the prediction quality of a model,
- collinearity & multicollinearity,
- unstable models

• model structure building
- irrelevant model terms (units, neurons, …) worsen prediction quality,
- parsimonious model structures generalize significantly better
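The collinearity point above can be illustrated with a short sketch (hypothetical toy data, plain NumPy least squares): two nearly identical inputs make the design matrix ill-conditioned, so the individual coefficients become unstable even though their sum stays well determined.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)   # nearly collinear with x1
y = 2.0 * x1 + rng.normal(scale=0.1, size=n)

X = np.column_stack([x1, x2, np.ones(n)])
# The condition number of the design matrix explodes under collinearity,
# so the split of the effect between b1 and b2 is numerically unstable.
cond = np.linalg.cond(X)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"condition number: {cond:.1e}")
print("coefficients (b1, b2, intercept):", np.round(coef, 2))
```

Only the combined effect b1 + b2 ≈ 2 is recoverable; dropping the redundant variable restores a stable model.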

Methodology – Variable Selection
[Figure: data matrix – N samples of 10 candidate input variables (e.g. 1.43, 4.21, 0.62, …) with a binary output column (0/1)]

Methodology – Variable Selection

Best-Subset selection

10 + 10·9 + 10·9·8 + … + 10! = 9 864 100 candidate selection paths
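The count above can be verified directly: it is the number of ordered sequences in which 1 to 10 variables can be selected from the 10 candidates.

```python
import math

# Ordered selections of k out of 10 variables, summed over k = 1..10:
# 10 + 10*9 + 10*9*8 + ... + 10!
total = sum(math.perm(10, k) for k in range(1, 11))
print(total)  # 9864100
```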

Methodology – Variable Selection

Instead, one can seek a good path through all possible subsets, e.g. orthogonal forward selection, …

How to drive such a selection – information criteria that optimize generalization
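A minimal sketch of criterion-driven forward selection (not the orthogonal decomposition itself; a plain least-squares fit and BIC as the information criterion are assumptions for illustration, and `forward_select` is a hypothetical helper): at each step the term that most improves the criterion is added, and selection stops as soon as the criterion no longer improves, with no separate validation set.

```python
import numpy as np

def forward_select(X, y, max_terms=None):
    """Greedy forward selection: at each step add the column that
    minimizes BIC; stop when BIC no longer improves."""
    n, p = X.shape
    chosen, best_bic = [], np.inf
    max_terms = max_terms or p
    while len(chosen) < max_terms:
        scores = []
        for j in (j for j in range(p) if j not in chosen):
            A = X[:, chosen + [j]]
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ coef) ** 2)
            bic = n * np.log(rss / n) + (len(chosen) + 1) * np.log(n)
            scores.append((bic, j))
        bic, j = min(scores)
        if bic >= best_bic:
            break            # no improvement: stop here
        best_bic, chosen = bic, chosen + [j]
    return chosen

# toy data: only columns 0 and 3 of 8 candidates are relevant
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=300)
print(sorted(forward_select(X, y)))
```

On this toy problem the relevant columns 0 and 3 are picked up, and the BIC penalty stops the search before most of the irrelevant candidates enter the model.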

Methodology – Variable Selection

● Knowledge Elucidation (interpretable models)

● Prediction Accuracy (generalization)

● Speed

Methodology – Variable Selection

Example: Connectivity Analysis in a Smart Grid

smart meter signals, feeder, noise in the feeder, significance of signals, ...

Methodology – Model Structure Building

Model structures which are linear-in-parameters but nonlinear in dynamics

Methodology – variable expansion

y = b·x + c

y = b·x + d·z + c,   z = x²

Linear in parameters, non-linear in dynamics
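The expansion above can be sketched in a few lines (toy data with assumed coefficients): appending z = x² as a new variable keeps the model linear in (b, d, c), so ordinary least squares still applies even though the fitted curve is nonlinear in x.

```python
import numpy as np

# y = b*x + d*z + c with z = x**2: linear in the parameters (b, d, c),
# nonlinear in the input x, so a single least-squares solve suffices.
rng = np.random.default_rng(2)
x = np.linspace(-2, 2, 100)
y = 1.5 * x + 0.8 * x**2 + 0.5 + 0.05 * rng.normal(size=100)

A = np.column_stack([x, x**2, np.ones_like(x)])  # columns [x, z, 1]
(b, d, c), *_ = np.linalg.lstsq(A, y, rcond=None)
print(round(b, 2), round(d, 2), round(c, 2))  # ≈ 1.5 0.8 0.5
```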
Methodology – Model Structure Building

Generation of a model term dictionary

The dictionary may consist of many terms

Model term selection

Methodology – Model Structure Building

The dictionary may be built using:

Polynomial terms
Basis Functions (RBF, Thin-Plate-Spline, …)
Fourier functions
Spatio-Temporal mixing (ESN, ...)
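As a sketch of how such a dictionary might be generated (Gaussian RBFs, centers placed at the data samples, and a fixed width are assumptions for illustration; `rbf_dictionary` is a hypothetical helper): each RBF term becomes one candidate column, and the model stays linear in the term weights.

```python
import numpy as np

def rbf_dictionary(X, centers, width):
    """Expand inputs into Gaussian RBF terms, one column per center.
    Each column exp(-||x - c||^2 / (2*width^2)) is a candidate model
    term for the subsequent term-selection step."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * width**2))

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(50, 2))       # 2 inputs: x1, x2
centers = X                                # centers at the samples themselves
D = rbf_dictionary(X, centers, width=0.5)  # 50 candidate RBF terms
print(D.shape)  # (50, 50)
```

Term selection would then pick a small subset of these columns, analogous to the 7-term model in the example below.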
Example: RBF terms for non-linear classification

[Diagram: 2 inputs (x1, x2) → statistical expansion into a dictionary of 250 candidate RBF terms (rbf 1 … rbf 250) → variable selection → 7-term final model (e.g. rbf 5, rbf 14, rbf 63, rbf 231, …)]

● Dimensionality is determined automatically

● Generalization is optimized directly using an IC
● Significantly fewer trials than with NN or SVM
● (often) No need for cross-validation
● Significantly lower dimensionality than NN or SVM
Comparison

Neural Networks
- many passes through the data
- structure is nonlinear-in-parameters
- involves nonlinear optimization over the entire model structure
- computationally expensive
- hard to determine an appropriate dimensionality
- elucidation of knowledge from the model structure is practically impossible

Support Vector Machines (& Relevance VM)
- structure is linear-in-parameters
- few hyper-parameters to optimize (cross-validation)
- dimensionality is determined via cross-validation
- elucidation of knowledge is difficult
Comparison

(Automatic) Model Structure Building
- structure is linear-in-parameters
- one or no hyper-parameter to optimize
- (often) cross-validation not necessary
- generalization is directly optimized
- data do not have to be divided into training and testing sets
- dimensionality is determined automatically
- parsimonious model structures with excellent generalization ability, which often elucidate knowledge
Comparison

                       SVM        MB using IC
methodology            MS & CV    Automatic MB
generated models       10-20      1
hyper-parameters       2-3        0-1
final dimensionality   120        7
misclass. rate         9.5%       9%
Future research

Information Geometry
Information Criteria
