You are on page 1of 13

Big Data, Data Mining, and Machine Learning: Value Creation for Business

Leaders and Practitioners. Jared Dean.


2014 SAS Institute Inc. Published 2014 by John Wiley & Sons, Inc.

Index

A predictive models, 6770, 68t


AdaBoost, 106 quality of recommendation
adaptive regression, 8283 systems, 170171
additive methods, 106 Auto-ID Center, 6, 236
agglomerative clustering, 138 average linkage, 134
Akaike information criterion averaged squared error (ASE),
(AIC), 69, 88 201
Akka, 46 averaging model process, 15
algorithms, future development averaging process, 1415
of, 238241
aligned box criteria (ABC), 135 B
alternating least squares, 167 BackRub, 6
Amazon, 9 bagging, 124125
analytical tools banking case study, 197204
about, 43 Barksdale, Jim, 163
Java and Java Virtual Machine baseline model, 165166
(JVM) languages, 4446 Bayes, Thomas, 111112
Python, 4950 Bayesian information criterion
R language, 4749 (BIC), 70, 88
SAS system, 5052 Bayesian methods network
Weka (Waikato Environment classification
for Knowledge Analysis), about, 113115
4344 inference in Bayesian
analytics, 1718 networks, 122123
Apache Hadoop project, 7, learning Bayesian networks,
3739, 38f,
8 40f,
0 45 120122
Apple iPhone, 7 Naive Bayes Network, 116117
application program interface parameter learning,
(API), 155, 227 117120
Ashton, Kevin, 6, 236 scoring for supervised learning,
assessing 123124

253
254 INDEX

Bayesian network-augmented Empirical Model Building and


naive Bayes (BAN), 120 Response Surfaces, 57
Bayesian networks Box-Cox transformation
inference in, 122123 method, 78
learning, 120122 breast cancer, 12
Bayesians, 114n14 Breiman, Leo
Beowulf cluster, 35, 35n1 Classification and Regression
Bernoulli distribution, 85n7 Trees, 101
big data Brin, Sergey, 6
See also specific topics Broyden-Fletcher-Goldfarb-
benefits of using, 1217 Shanno (BFGS) method,
complexities with, 1921 9697
defined, 3 building models
fad nature of, 912 about, 223
future of, 233241 methodology for, 5861
generalized linear model response model, 142143
(GLM) applications and, business rules, applying, 223
8990 Butte, Atul, 1
popularity of term, 1011, 11f
regression analysis applications C
and, 8384 C++, 49
timeline for, 58 C language, 4445, 49
value of using, 1112 C4.5 splitting method, 101, 103
Big Data Appliance, 40f 0 Cafarella, Mike, 7, 37
Big Data Dynamic Factor CAHPS, 207
Models for Macroeconomic campaigns, operationalizing,
Measurement and 201202
Forecasting (Diebolt), 3 cancer, 12
Big Data Research and capacity, of disk drives, 28
Development Initiative, 8 CART, 101
binary classification, 6465, 113 case studies
binary state, 225 financial services company,
binning, 199 197204
Bluetooth, 6 health care provider,
boosting, 124125 205214
Boston Marathon, 160n2 high-tech product
Box, George manufacturer, 229232
INDEX 255

mobile application number of clusters, 135137


recommendations, 225228 profiling clusters, 138
online brand management, semisupervised clustering, 133
221224 supervised clustering, 133
technology manufacturer, text clustering, 177
215219 unsupervised clustering, 132
Centers for Medicare and collaborative filtering, 164
Medicaid Services (CMS), complete linkage, 134
206207 computing environment
central processing unit (CPU), about, 2425
2930 analytical tools,
CHAID, 101, 105106 4352
Challenger Space Shuttle, distributed systems, 3541
2021 hardware, 2734
chips, 30 Conan Doyle, Sir Arthur, 55
Chi-square, 88, 105106 considerations, for platforms,
Churchill, Winston, 194 3941
city block (Manhattan) distance, content categorization,
135 177178, 222
classical supercomputers, 36 content-based filtering,
Classification and Regression Trees 163164
(Breiman), 101 contrastive divergence,
classification measures, 68t 169170
classification problems, 9495 Cooks D plot, 81, 88
Clojure, 45, 46 Corrected AIC (AICC), 88
cloud computing, 39 Cortes, Corinna
cluster analysis, 132133 Support-Vector Networks, 109
cluster computing, 3536 cosine similarity measure, 134
clustering techniques, 153, 231 cost, of hardware, 36
clusters/clustering Cox models, 83
agglomerative cluster, 138 Credit Card Accountability
Beowulf cluster, 35, 35n1 Responsibility and Disclosure
divisive clustering, 138 Act (2009), 199200
hierarchical clustering, cubic clustering criteria (CCC),
132, 138 135
K-means clustering, 132, customers, ranking potential, 201
137138 Cutting, Doug, 7, 37
256 INDEX

D database computing, 3637


DAG network structure, 120 decision matrix, 68t
data decision trees, 101107, 102f,
2
See also big data 103f,
3 231
about, 54 Deep Learning, 100
amount created, 6, 8 Defense Advanced Research
common predictive modeling Projects Agency (DARPA),
techniques, 71126 56
democratization of, 8 democratization of data, 8
incremental response detecting patterns, 151153
modeling, 141148 deviance, 88, 88n9
personal, 4 DFBETA statistic, 89, 89f
9
predictive modeling, 5570 Diebolt, Francis X.
preparing for building models, Big Data Dynamic Factor
5859 Models for Macroeconomic
preprocessing, 230 Measurement and
recommendation systems, Forecasting (Diebolt), 3
163174 differentiable, 85n6
segmentation, 127139 dimensionality, reducing,
text analytics, 175191 150151
time series data mining, discrete Fourier transformation
149161 (DFT), 151
validation, 9798 discrete wavelet transformation
data mining (DWT), 151
See also time series data mining disk array, 28n1
See also specific topics distance measures (metrics),
future of, 233241 133134
multidisciplinary nature of, distributed systems
5556, 56f6 about, 3536
shifts in, 2425 considerations, 3941
Data Mining with Decision Trees database computing, 3637
(Rokach and Maimon), 101 file system computing, 3739
Data Partition node, 62 distributed.net, 35
data scientist, 8 divisive clustering, 138
data sets, privacy with, ductal carcinoma, 12
234236 dynamic time warping
DATA Step, 50 methods, 158
INDEX 257

E frequentists, 114n14
economy of scale, 199 functional dimension analysis
Edison, Thomas, 60 (FDA), 219
Empirical Model Building and future
Response Surfaces (Box), 57 of big data, data mining and
ensemble methods, 124126 machine learning, 233241
enterprise data warehouses development of algorithms,
(EDWs), 37, 58, 198199, 238241
217 of software development,
entropy, 105 237238
Estimation of Dependencies Based on
Empirical Data (Vapnik), 107 G
Euclidean distance, 133 gain, 69
exploratory data analysis, Galton, Francis, 75
performing, 5960 gap, 135
external evaluation criterion, Gauss, Carl, 75
134135 Gaussian distribution, 119
extremity-based model, 15 Gaussian kernel, 112
generalized additive models
F (GAM), 83
Facebook, 7 generalized linear models
failing fast paradigm, 18 (GLMs)
feature films, as example of about, 8486
segmentation, 132 applications for big data, 8990
feedback loop, 223 probit, 8689
file crawling, 176 Gentleman-Givens
file system computing, 3739 computational method, 83
financial services company case global minimum, assurance of, 98
study, 197204 Global Positioning System (GPS),
Fisher, Sir Ronald, 98 56
forecasting, new product, Goodnight, Jim, 50
152153 Google, 6, 10
Forrester Wave and Gartner Gosset, William, 80n1
Magic Quadrant, 50 gradient boosting, 106
FORTRAN language, 4445, gradient descent method, 96
46, 49 graphical processing unit (GPU),
fraud, 16, 152 3031
258 INDEX

H I
Hadoop, 7, 3739, 38f,
8 40f,
0 45 IBM, 7, 239
Hadoop Streaming Utility, 49 identity output activation
hardware function, 94
central processing unit (CPU), incremental response modeling
2930 about, 141142
cost of, 36 building model, 142143
graphical processing unit measuring, 143148
(GPU), 3031 index, 177
memory, 3133 inference, in Bayesian networks,
network, 3334 122123
storage (disk), 2729 information retrieval, 176177,
health care provider case study, 183184, 211, 245246
205214 in-memory databases (IMDBs),
HEDIS, 207208 37, 40f
0
Hertzano, Ephraim, 129130 input scaling, 98
Hessian, 9697, 96n11 Institute of Electrical and
Hession Free Learning, 101 Electronic Engineers
Hickey, Rick (programmer), 46 (IEEE), 6
hierarchical clustering, 132, 138 internal evaluation criterion, 134
high-performance marketing Internet, birth of, 5
solution, 202203 Internet of Things, 6, 236237
high-tech product manufacturer interpretability, as drawback of
case study, 229232 SVM, 113
Hinton, Geoffrey E., 100 interval prediction, 6667
Learning Representation by iPhone (Apple), 7
Back-Propagating Errors, 92 IPv4 protocol, 7
Hornik, Kurt IRE, 208
Multilayer Feedforward iterative model building, 6061
networks are Universal
Approximators, 92 J
HOS, 208 Java and Java Virtual Machine
Huber M estimation, 83 (JVM) languages, 5,
hyperplane, 107n13 4446
Hypertext Transfer Protocol Jennings, Ken, 182
(HTTP), 5 Jeopardy (TV show), 7, 180191,
hypothesis, 234 239241
INDEX 259

K MapReduce paradigm, 38
Kaggle, 161n3 marketing campaign process,
kernels, 112 traditional, 198202
K-means clustering, 132, 137138 Markov blanket bayesian
Kolmogorov-Smirnov (KS) network (MB), 120
statistic, 70 Martens, James, 101
massively parallel processing
L
(MPP) databases, 37, 40f,0
Lagrange multipliers, 111
202203, 211213, 217
language detection, 222
Matplotlib library, 49
latent factor models, 164
matrix, rank of, 164n1
Learning Representation by
maximum likelihood
Back-Propagating Errors
estimation, 117
(Rumelhart, Hinton and
Mayer, Marissa, 24
Williams), 92
McAuliffe, Sharon, 20
Legrendre, Adrien-Marie, 75
McCulloch, Warren, 90
lift, 69
mean, 51
likelihood, 114
Medicare Advantage, 206
Limited Memory BFGS (LBFGS)
memory, 3133
method, 9798
methodology, for building
linear regression, 15
models, 5861
LinkedIn, 6
Michael J. Fox Foundation, 161
Linux, 237, 238
microdata, 235n1
Lisp programming language, 46
Microsoft, 237
lobular carcinoma, 2
Minsky, Marvin
locally estimated scatterplot
Perceptrons, 9192
smoothing (LOESS), 83
misclassification rate, 105
logistic regression model, 85
MIT, 236
low-rank matrix factorization,
mobile application
164, 166
recommendations case study,
M 225228
machine learning, future of, model lift, 204
233241 models
See also specific topics creating, 223
Maimon, Oded methodology for building,
Data Mining with Decision Trees, 5861
103 scoring, 201, 213
260 INDEX

monitoring process, 223224 Nike+iPod sensor, 154


monotonic, 85n6 No Harm Intended (Pollack),
Moore, Gordon, 2930 92
Moores law, 2930 Nolvadex, 23
Multicore, 48 nominal classification, 66
multicore chips, 30 nonlinear relationships, 1921
Multilayer Feedforward nonparametric techniques, 15
networks are Universal NoSQL database, 6, 40f0
Approximators (Hornik, Numpy library, 49
Stinchcombe and White), 92 Nutch, 37, 38n2
multilayer perceptrons (MLP),
9394 O
multilevel classification, 66 object-oriented (OO)
programming paradigms,
N 4546
Naive Bayes Network, 116117 one-class support vector
Narayanan, Arvind, 235 machines (SVMs), 144146
National Science Foundation online brand management case
(NSF), 7 study, 221224
National Security Agency, 10 open source projects, 238
neighborhood-based methods, operationalizing campaigns,
164 201202
Nelder, John, 84, 87 optimization, 1617
Nest thermostat, 237 Orange library, 49
Netflix, 235 ordinary least squares,
network, 3334 7682
NetWorkSpaces (NWS), orthogonal regression, 83
48, 48n1 overfitting, 9798
neural networks
about, 15, 9098 P
basic example of, 98101 Page, Larry, 6
diagrammed, 90f,0 93f
3 Papert, Seymour
key considerations, 9798 Perceptrons, 9192
Newtons method, 96 Parallel, 4849
next-best offer, 164 Parallel Virtual Machine (PVM),
Nike+ FuelBand, 154161, 48, 48n1
245246 parameter learning, 117120
INDEX 261

parent child Bayesian network creating models, 200201


(PC), 120 interval prediction, 6667
parsimony, 218 methodology for building
partial least squares regression, models, 5861
83 multilevel classification, 66
path analysis, 218 sEMMA, 6163
Patient Protection and Affordable predictive modeling techniques
Care Act (2010), 206 about, 7172
Pattern library, 49 Bayesian methods network
patterns, detecting, 152153 classification, 113124
Pearl, Judea decision and regression trees,
Probabilistic Reasoning in 101107
Intelligent Systems: Networks of ensemble methods, 124126
Plausible Inference, 121n16 generalized linear models
Pearsons chi-square statistic, 88 (GLMs), 8490
Perceptron, 9094, 91f neural networks, 90101
Perceptrons (Minsky and Papert), regression analysis, 7584
9192 RFM (recency, frequency,
performing exploratory data and monetary modeling),
analysis, 5960 7275, 73t
personal data, 4 support vector machines
piecewise aggregate (SVMs), 107113
approximation (PAA) prior, 114
method, 151 privacy, 234236
Pitts, Walter, 90 PRIZM, 132133
platforms, for file system Probabilistic Programming for
computing, 3739 Advanced Machine Learning
Pollack, Jordan B. (PPAML), 239
No Harm Intended, 92 Probabilistic Reasoning in Intelligent
Polynomial kernel, 112 Systems: Networks of Plausible
POSIX-compliant, 4849 Inference (Pearl), 121n16
posterior, 114 probit, 84n4, 8689
predictive modeling processes
about, 5558 averaging, 1415
assessment of predictive monitoring, 223224
models, 6770 tail-based modeling, 15
binary classification, 6465 Proctor & Gamble, 236
262 INDEX

profiling clusters, 140 how they work, 165170


Progressive Insurances SAS Library, 171173
SnapshotTM device, 162 where they are used,
proportional hazards regression, 164165
83 Red Hat, 238
public radio, as example to reduced-dimension models, 164
demonstrate incremental REG procedure, 51
response, 142143 regression
p-values, 88, 88n10 adaptive, 8283
Python, 4950 linear, 15
orthogonal, 83
Q problems with, 94
quantile regression, 83 quantile, 83
Quasi-Newton methods, 96, robust, 83
100101 regression analysis
about, 7576
R applications for big data, 8384
R language, 4749 assumptions of regression
radial basis function networks models, 82
(RBF), 93 ordinary least squares, 7682
radio-frequency identification power of, 8182
(RFID) tags, 234 techniques, 8283
random access memory (RAM), regression models, assumptions
3133, 44 of, 82
Random Forest, 106, regression trees, 101107
125126 relational database management
rank sign test, 15 system (RDBMS), 36
ranking potential customers, relationship management, 17
201 reproducable research, 234
receiver operating characteristic response model, building,
(ROC), 68 142143
recency, frequency, and Restricted Boltzmann
monetary modeling (RFM), Machines, 100101,
7275, 73t 164, 167169
recommendation systems Rhadoop, 48, 49
about, 163164 RHIPE, 48, 49
assessing quality of, 170171 robust regression, 83
INDEX 263

Rokach, Lior seasonal analysis, 155157


Data Mining with Decision segmentation
Trees, 103 about, 127132
root cause analysis, 231 cluster analysis, 132133
root mean square error (RMSE), distance measures (metrics),
170171 133134
Rosenblatt, Frank, 90 evaluating clustering, 134135
R-Square (R), 81 hierarchical clustering, 138
Rumelhart, David E. K-means algorithm, 137138
Learning Representation by number of clusters, 135137
Back-Propagating Errors, 92 profiling clusters, 138139
Rummikub, 127132, 127n1, RFM, 7273, 73t
128f
8 self-organizing maps (SOM), 93
semisupervised clustering, 133
S sEMMA, 6163
S language, 47 sentiment analysis, 222
Salakhutdinov, R.R., 100 SETI@Home, 35, 36
Sall, John, 50 Sigmoid kernel, 112
sampling, 1314 similarity analysis, 152, 153,
SAS Enterprise Miner software, 157161
71, 127128 SIMILARITY procedure, 158
SAS Global Forum, 158 single linkage, 134
SAS Library, 171173 singular value decomposition
SAS System, as analytical tool, (SVD), 151, 211212
5052 size, of disk drives, 2829
SAS Text Miner software, 175 slack variables, 111
SAS/ETS, 158 Snow, 48
SAS/STAT, 50 software
Scala, 4546 future developments in,
scaled deviance, 88, 88n9 237238
scikit-learn toolkit, 4950 shifts in, 2425
Scipy library, 49 three-dimensional (3D)
scoring computer-aided design
models, 201, 213 (CAD), 3031
for supervised learning, solid state devices (SSDs), 28
123124 spanning tree, 121n15
search, 177 Spark, 46
264 INDEX

speed support vector machines (SVMs),


of CPU, 2930 107113, 144146
of memory, 33 Support-Vector Networks (Cortes
network, 3334 and Vapnik), 111
star ratings, 206207 symbolic aggregate
starting values, 97 approximation (SAX), 219
Stinchcombe, Maxwell
Multilayer Feedforward T
networks are Universal tail-based modeling process, 15
Approximators, 92 tamoxifen citrate, 23
stochastic gradient descent, technology manufacturer case
166167 study, 215219
storage (disk), 2729 teraflop, 239n2
storage area network (SAN), text analytics
33n3 about, 175176
Strozzi, Carlo, 6 content categorization,
Structured Query Language 177178
(SQL), 199 example of, 180191
success stories (case studies) information retrieval, 176177,
financial services company, 183184
197204 text mining, 178180
health care provider, 205214 text clustering, 180
high-tech product text extraction, 176177
manufacturer, 229232 text mining, 178180
mobile application text topic identification, 177
recommendations, 225228 32-bit operating systems,
online brand management, 32n2
221224 three-dimensional (3D)
qualities of, 194195 computer-aided design
technology manufacturer, (CAD) software, 3031
215219 TIBCO, 47
sum of squares error (SSE), time series data mining
134 about, 149150
supercomputers, 36 detecting patterns, 151153
supervised clustering, 133 Nike+ FuelBand, 154161
supervised learning, scoring for, reducing dimensionality,
123124 150151
INDEX 265

TIOBE, 47, 47n1 Estimation of Dependencies Based


tools, analytical on Empirical Data, 107
about, 43 Support-Vector Networks, 111
Java and Java Virtual Machine variance, 105
(JVM) languages, 4446 volatility, of memory, 32
Python, 4950
R language, 4749 W
SAS system, 5052 Wards method, 138
Weka (Waikato Environment Watson, Thomas J., 239
for Knowledge Analysis), Watson computer, 7, 239240
4344 web crawling, 176
traditional marketing campaign Wedderburn, Robert, 84, 87
process, 198202 Weka (Waikato Environment for
tree-augmented naive Bayes Knowledge Analysis),
(TAN), 120121 4344
tree-based methods, 101107 White, Halbert
trend analysis, 157 Multilayer Feedforward
Truncated Newton Hessian Free networks are Universal
Learning, 101 Approximators, 92
Twitter, 221224 Wikipedia, 6, 7
Williams, Ronald J.
U Learning Representation by
UNIVARIATE procedure, 51 Back-Propagating Errors, 92
UnQL, 7 Winsor, Charles, 80n2
unsupervised clustering, 132 winsorize, 80, 80n2
unsupervised segmentation, 135 work flow productivity, 1719
World Wide Web, birth of, 5
V
validation data, 9798 Z
Vapnik, Vladimir Zuckerberg, Mark, 7

You might also like