Mastering Predictive Analytics with R
About this ebook
- Grasp the major methods of predictive modeling and move beyond black box thinking to a deeper level of understanding
- Leverage the flexibility and modularity of R to experiment with a range of different techniques and data types
- Packed with practical advice and tips explaining important concepts and best practices to help you understand quickly and easily
This book is intended for the budding data scientist, predictive modeler, or quantitative analyst with only a basic exposure to R and statistics. It is also designed to be a reference for experienced professionals wanting to brush up on the details of a particular type of predictive model. Mastering Predictive Analytics with R assumes familiarity with only the fundamentals of R, such as the main data types, simple functions, and how to move data around. No prior experience with machine learning or predictive modeling is assumed; however, you should have a basic understanding of statistics and calculus at a high school level.
Mastering Predictive Analytics with R - Rui Miguel Forte
Table of Contents
Mastering Predictive Analytics with R
Credits
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Gearing Up for Predictive Modeling
Models
Learning from data
The core components of a model
Our first model: k-nearest neighbors
Types of models
Supervised, unsupervised, semi-supervised, and reinforcement learning models
Parametric and nonparametric models
Regression and classification models
Real-time and batch machine learning models
The process of predictive modeling
Defining the model's objective
Collecting the data
Picking a model
Preprocessing the data
Exploratory data analysis
Feature transformations
Encoding categorical features
Missing data
Outliers
Removing problematic features
Feature engineering and dimensionality reduction
Training and assessing the model
Repeating with different models and final model selection
Deploying the model
Performance metrics
Assessing regression models
Assessing classification models
Assessing binary classification models
Summary
2. Linear Regression
Introduction to linear regression
Assumptions of linear regression
Simple linear regression
Estimating the regression coefficients
Multiple linear regression
Predicting CPU performance
Predicting the price of used cars
Assessing linear regression models
Residual analysis
Significance tests for linear regression
Performance metrics for linear regression
Comparing different regression models
Test set performance
Problems with linear regression
Multicollinearity
Outliers
Feature selection
Regularization
Ridge regression
Least absolute shrinkage and selection operator (lasso)
Implementing regularization in R
Summary
3. Logistic Regression
Classifying with linear regression
Introduction to logistic regression
Generalized linear models
Interpreting coefficients in logistic regression
Assumptions of logistic regression
Maximum likelihood estimation
Predicting heart disease
Assessing logistic regression models
Model deviance
Test set performance
Regularization with the lasso
Classification metrics
Extensions of the binary logistic classifier
Multinomial logistic regression
Predicting glass type
Ordinal logistic regression
Predicting wine quality
Summary
4. Neural Networks
The biological neuron
The artificial neuron
Stochastic gradient descent
Gradient descent and local minima
The perceptron algorithm
Linear separation
The logistic neuron
Multilayer perceptron networks
Training multilayer perceptron networks
Predicting the energy efficiency of buildings
Evaluating multilayer perceptrons for regression
Predicting glass type revisited
Predicting handwritten digits
Receiver operating characteristic curves
Summary
5. Support Vector Machines
Maximal margin classification
Support vector classification
Inner products
Kernels and support vector machines
Predicting chemical biodegradation
Cross-validation
Predicting credit scores
Multiclass classification with support vector machines
Summary
6. Tree-based Methods
The intuition for tree models
Algorithms for training decision trees
Classification and regression trees
CART regression trees
Tree pruning
Missing data
Regression model trees
CART classification trees
C5.0
Predicting class membership on synthetic 2D data
Predicting the authenticity of banknotes
Predicting complex skill learning
Tuning model parameters in CART trees
Variable importance in tree models
Regression model trees in action
Summary
7. Ensemble Methods
Bagging
Margins and out-of-bag observations
Predicting complex skill learning with bagging
Predicting heart disease with bagging
Limitations of bagging
Boosting
AdaBoost
Predicting atmospheric gamma ray radiation
Predicting complex skill learning with boosting
Limitations of boosting
Random forests
The importance of variables in random forests
Summary
8. Probabilistic Graphical Models
A little graph theory
Bayes' Theorem
Conditional independence
Bayesian networks
The Naïve Bayes classifier
Predicting the sentiment of movie reviews
Hidden Markov models
Predicting promoter gene sequences
Predicting letter patterns in English words
Summary
9. Time Series Analysis
Fundamental concepts of time series
Time series summary functions
Some fundamental time series
White noise
Fitting a white noise time series
Random walk
Fitting a random walk
Stationarity
Stationary time series models
Moving average models
Autoregressive models
Autoregressive moving average models
Non-stationary time series models
Autoregressive integrated moving average models
Autoregressive conditional heteroscedasticity models
Generalized autoregressive conditional heteroscedasticity models
Predicting intense earthquakes
Predicting lynx trappings
Predicting foreign exchange rates
Other time series models
Summary
10. Topic Modeling
An overview of topic modeling
Latent Dirichlet Allocation
The Dirichlet distribution
The generative process
Fitting an LDA model
Modeling the topics of online news stories
Model stability
Finding the number of topics
Topic distributions
Word distributions
LDA extensions
Summary
11. Recommendation Systems
Rating matrix
Measuring user similarity
Collaborative filtering
User-based collaborative filtering
Item-based collaborative filtering
Singular value decomposition
R and Big Data
Predicting recommendations for movies and jokes
Loading and preprocessing the data
Exploring the data
Evaluating binary top-N recommendations
Evaluating non-binary top-N recommendations
Evaluating individual predictions
Other approaches to recommendation systems
Summary
Index
Mastering Predictive Analytics with R
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: June 2015
Production reference: 1100615
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78398-280-6
www.packtpub.com
Credits
Author
Rui Miguel Forte
Reviewers
Ajay Dhamija
Prasad Kothari
Dawit Gezahegn Tadesse
Commissioning Editor
Kartikey Pandey
Acquisition Editor
Subho Gupta
Content Development Editor
Govindan Kurumangattu
Technical Editor
Edwin Moses
Copy Editors
Stuti Srivastava
Aditya Nair
Vedangi Narvekar
Project Coordinator
Shipra Chawhan
Proofreaders
Stephen Copestake
Safis Editing
Indexer
Priya Sane
Graphics
Sheetal Aute
Disha Haria
Jason Monteiro
Abhinash Sahu
Production Coordinator
Shantanu Zagade
Cover Work
Shantanu Zagade
About the Author
Rui Miguel Forte is currently the chief data scientist at Workable. He was born and raised in Greece and studied in the UK. He is an experienced data scientist who has over 10 years of work experience in a diverse array of industries spanning mobile marketing, health informatics, education technology, and human resources technology. His projects include the predictive modeling of user behavior in mobile marketing promotions, speaker intent identification in an intelligent tutor, information extraction techniques for job applicant resumes, and fraud detection for job scams. Currently, he teaches R, MongoDB, and other data science technologies to graduate students in the business analytics MSc program at the Athens University of Economics and Business. In addition, he has lectured at a number of seminars, specialization programs, and R schools for working data science professionals in Athens. His core programming knowledge is in R and Java, and he has extensive experience working with a variety of database technologies, such as Oracle, PostgreSQL, MongoDB, and HBase. He holds a master's degree in electrical and electronic engineering from Imperial College London and is currently researching machine learning applications in information extraction and natural language processing.
Acknowledgments
Behind every great adventure is a good story, and writing a book is no exception. Many people contributed to making this book a reality. I would like to thank the many students I have taught at AUEB, whose dedication and support have been nothing short of overwhelming. They should rest assured that I have learned just as much from them as they have learned from me, if not more. I also want to thank Damianos Chatziantoniou for conceiving a pioneering graduate data science program in Greece. Workable has been a crucible for working alongside incredibly talented and passionate engineers on exciting data science projects that help businesses around the globe. For this, I would like to thank my colleagues and, in particular, the founders, Nick and Spyros, who created a diamond in the rough. I would like to thank Subho, Govindan, Edwin, and all the folks at Packt for their professionalism and patience. To the many friends who offered encouragement and motivation, I would like to express my eternal gratitude. My family and extended family have been an incredible source of support on this project. In particular, I would like to thank my father, Libanio, for inspiring me to pursue a career in the sciences and my mother, Marianthi, for always believing in me far more than anyone else ever could. My wife, Despoina, patiently and fiercely stood by my side even as this book kept me away from her during her first pregnancy. Last but not least, my baby daughter slept quietly and kept a cherubic vigil over her father during the book's final stages of preparation. She helped in ways words cannot describe.
About the Reviewers
Ajay Dhamija is a senior scientist working in Defense R&D Organization, Delhi. He has more than 24 years' experience as a researcher and instructor. He holds an MTech (computer science and engineering) degree from IIT, Delhi, and an MBA (finance and strategy) degree from FMS, Delhi. He has more than 14 research works of international repute in varied fields to his credit, including data mining, reverse engineering, analytics, neural network simulation, TRIZ, and so on. He was instrumental in developing a state-of-the-art Computer-Aided Pilot Selection System (CPSS) containing various cognitive and psychomotor tests to comprehensively assess the flying aptitude of the aspiring pilots of the Indian Air Force. He has been honored with the Agni Award for excellence in self reliance, 2005, by the Government of India. He specializes in predictive analytics, information security, big data analytics, machine learning, Bayesian social networks, financial modeling, Neuro-Fuzzy simulation and data analysis, and data mining using R. He is presently involved with his doctoral work on Financial Modeling of Carbon Finance data from IIT, Delhi. He has written an international best seller, Forecasting Exchange Rate: Use of Neural Networks in Quantitative Finance (http://www.amazon.com/Forecasting-Exchange-rate-Networks-Quantitative/dp/3639161807), and is currently authoring another book on R named Multivariate Analysis using R.
Apart from analytics, Ajay is actively involved in information security research. He has associated himself with various international and national researchers in government as well as the corporate sector to pursue his research on ways to amalgamate two important and contemporary fields of data handling, that is, predictive analytics and information security.
You can connect with Ajay at the following:
LinkedIn: ajaykumardhamija
ResearchGate: Ajay_Dhamija2
Academia: ajaydhamija
Facebook: akdhamija
Twitter: akdhamija
Quora: Ajay-Dhamija
While associating with researchers from Predictive Analytics and Information Security Institute of India (PRAISIA @ www.praisia.com) in his research endeavors, he has worked on refining methods of big data analytics for security data analysis (log assessment, incident analysis, threat prediction, and so on) and vulnerability management automation.
I would like to thank my fellow scientists from Defense R&D Organization and researchers from corporate sectors such as Predictive Analytics & Information Security Institute of India (PRAISIA), which is a unique institute of repute and of its own kind due to its pioneering work in marrying the two giant and contemporary fields of data handling in modern times, that is, predictive analytics and information security, by adopting custom-made and refined methods of big data analytics. They all contributed to presenting a fruitful review for this book. I'm also thankful to my wife, Seema Dhamija, the managing director of PRAISIA, who has been kind enough to share her research team's time with me in order to have technical discussions. I'm also thankful to my son, Hemant Dhamija, who gave his invaluable inputs many a time, which I inadvertently neglected during the course of this review. I'm also thankful to a budding security researcher, Shubham Mittal from MakeMyTrip, for his constant and constructive critiques of my work.
Prasad Kothari is an analytics thought leader. He has worked extensively with organizations such as Merck, Sanofi Aventis, Freddie Mac, Fractal Analytics, and the National Institute of Health on various analytics and big data projects. He has published various research papers in the American Journal of Drug and Alcohol Abuse and American public health. His leadership and analytics skills have been pivotal in setting up analytics practices for various organizations and helping grow them across the globe.
Dawit Gezahegn Tadesse is currently a visiting assistant professor in the Department of Mathematical Sciences at the University of Cincinnati, Cincinnati, Ohio, USA. He obtained his MS in mathematics and PhD in statistics from Auburn University, Auburn, AL, USA in 2010 and 2014, respectively. His research interests include high-dimensional classification, text mining, nonparametric statistics, and multivariate data analysis.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
This book is dedicated to my loving wife Despoina, who makes all good things better and every adventure worthwhile. You are the light of my life and the flame of my soul.
Preface
Predictive analytics, and data science more generally, currently enjoy a huge surge in interest, as predictive technologies such as spam filtering, word completion and recommendation engines have pervaded everyday life. We are now not only increasingly familiar with these technologies, but these technologies have also earned our confidence. Advances in computing technology in terms of processing power and in terms of software such as R and its plethora of specialized packages have resulted in a situation where users can be trained to work with these tools without needing advanced degrees in statistics or access to hardware that is reserved for corporations or university laboratories. This confluence of the maturity of techniques and the availability of supporting software and hardware has many practitioners of the field excited that they can design something that will make an appreciable impact on their own domains and businesses, and rightly so.
At the same time, many newcomers to the field quickly discover that there are many pitfalls that need to be overcome. Virtually no academic degree adequately prepares a student or professional to become a successful predictive modeler. The field draws upon many disciplines, such as computer science, mathematics, and statistics. Nowadays, not only do people approach the field with a strong background in only one of these areas, they also tend to be specialized within that area. Having taught several classes on the material in this book to graduate students and practicing professionals alike, I discovered that the two biggest fears that students repeatedly express are the fear of programming and the fear of mathematics. It is interesting that these are almost always mutually exclusive. Predictive analytics is very much a practical subject but one with a very rich theoretical basis, knowledge of which is essential to the practitioner. Consequently, achieving mastery in predictive analytics requires a range of different skills, from writing good software to implement a new technique or to preprocess data, to understanding the assumptions of a model, how it can be trained efficiently, how to diagnose problems, and how to tune its parameters to get better results.
It feels natural at this point to want to take a step back and think about what predictive analytics actually covers as a field. The truth is that the boundaries between this field and other related fields, such as machine learning, data mining, business analytics, data science and so on, are somewhat blurred. The definition we will use in this book is very broad. For our purposes, predictive analytics is a field that uses data to build models that predict a future outcome of interest. There is certainly a big overlap with the field of machine learning, which studies programs and algorithms that learn from data more generally. This is also true for data mining, whose goal is to extract knowledge and patterns from data. Data science is rapidly becoming an umbrella term that covers all of these fields, as well as topics such as information visualization to present the findings of data analysis, business concepts surrounding the deployment of models in the real world, and data management. This book may draw heavily from machine learning, but we will not cover the theoretical pursuit of the feasibility of learning, nor will we study unsupervised learning that sets out to look for patterns and clusters in data without a particular predictive target in mind. At the same time, we will also explore topics such as time series, which are not commonly discussed in a machine learning text.
R is an excellent platform to learn about predictive analytics and also to work on real-world problems. It is an open source project with an ever-burgeoning community of users. At the time of this writing, R and Python are the two languages most commonly used by data scientists around the world. R has a wealth of different packages that specialize in different modeling techniques and application domains, many of which are directly accessible from within R itself via a connection to the Comprehensive R Archive Network (CRAN). There are also ample online resources for the language, from tutorials to online courses. In particular, we'd like to mention the excellent Cross Validated forum (http://stats.stackexchange.com/) as well as the website R-bloggers (http://www.r-bloggers.com/), which hosts a fantastic collection of articles on using R from different blogs. For readers who are a little rusty, we provide a free online tutorial chapter that evolved from a set of lecture notes given to students at the Athens University of Economics and Business.
The primary mission of this book is to bridge the gap between low-level introductory books and tutorials that emphasize intuition and practice over theory, and high-level academic texts that focus on mathematics, detail, and rigor. Another equally important goal is to instill some good practices in you, such as learning how to properly test and evaluate a model. We also emphasize important concepts, such as the bias-variance trade-off and overfitting, which are pervasive in predictive modeling and come up time and again in various guises and across different models.
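To make the idea of overfitting a little more concrete, here is a minimal sketch in base R. It is our own illustration rather than an example from the later chapters: the simulated data, the sine-shaped signal, the seed, and the helper function `mse` are all assumptions chosen for brevity. It fits polynomials of increasing degree to a small training set and compares the error on the training data with the error on held-out data.

```r
# Illustrative sketch only: simulated data, not a data set from the book.
set.seed(42)                                  # arbitrary seed for repeatability
n <- 60
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.3)              # assumed true signal plus noise
sim <- data.frame(x = x, y = y)

train_idx <- sample(n, 40)                    # 40 training rows, 20 test rows
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Hypothetical helper: mean squared error of a fitted model on a data frame
mse <- function(model, data) {
  mean((data$y - predict(model, newdata = data))^2)
}

degrees <- c(1, 3, 10)
results <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)     # polynomial of degree d
  c(degree = d, train_mse = mse(fit, train), test_mse = mse(fit, test))
}))
print(results)
```

Because each polynomial contains the lower-degree ones as special cases, the training error cannot increase as the degree grows. The test error carries no such guarantee, which is why a model should be judged on data it was not trained on.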
From a programming standpoint, even though we assume that you are familiar with the R programming language, every code sample has been carefully explained and discussed to allow readers to develop their confidence and follow along. That being said, it is not possible to overstress the importance of actually running the code alongside the book or at least before moving on to a new chapter. To make the process as smooth as possible, we have provided code files for every chapter in the book containing all the code samples in the text. In addition, in a number of places, we have written our own, albeit very simple, implementations of certain techniques. Two examples that come to mind are the pocket perceptron algorithm in Chapter 4, Neural Networks, and AdaBoost in Chapter 7, Ensemble Methods. In part, this is done in an effort to encourage users to learn how to write their own functions instead of always relying on existing implementations, as these may not always be available.
Reproducibility is a critical skill in the analysis of data and is not limited to educational settings. For this reason, we have exclusively used freely available data sets and have endeavored to apply specific seeds wherever random number generation has been needed. Finally, we have tried wherever possible to use data sets of a relatively small size in order to ensure that you can run the code while reading the book without having to wait too long or needing access to better hardware than might be available to you. We will remind you that in the real world, patience is an incredibly useful virtue, as most data sets of interest will be larger than the ones we will study.
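As a small sketch of what applying a seed looks like in practice, the following base R snippet shows that resetting the random number generator to the same seed reproduces the same draws, and that a seeded train/test split is repeatable in the same way. The seed value 123, the 80/20 split, and the use of the built-in `iris` data are arbitrary choices made for illustration only.

```r
# Fixing the seed makes pseudo-random draws repeatable across runs.
set.seed(123)                  # 123 is an arbitrary choice
first <- rnorm(5)

set.seed(123)                  # resetting to the same seed restarts the stream
second <- rnorm(5)

identical(first, second)       # TRUE: same seed, same draws

# A seeded train/test split is repeatable in the same way.
set.seed(123)
n_train <- round(0.8 * nrow(iris))            # illustrative 80/20 split
train_idx <- sample(nrow(iris), size = n_train)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]
```

Anyone who runs this code with the same seed and the same version of R should obtain the same split, which is what allows results to be compared from one run to the next.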
While each chapter ends in two or more practical modeling examples, every chapter begins with some theory and background necessary to understand a new model or technique. While we have not shied away from using mathematics to explain important details, we have been very mindful to introduce just enough to ensure that you understand the fundamental ideas involved. This is in line with the book's philosophy of bridging the gap to academic textbooks that go into more detail. Readers with a high-school background in mathematics should trust that they will be able to follow all of the material in this book with the aid of the explanations given. The key skills needed are basic calculus, such as simple differentiation, and key ideas in probability, such as mean, variance, and correlation, as well as important distributions such as the binomial and normal distributions. While we don't provide any tutorials on these, in the early chapters we do try to take things particularly slowly. To address the needs of readers who are more comfortable with mathematics, we often provide additional technical details in the form of tips and give references that act as natural follow-ups to the discussion.
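For readers who would like a quick check on these prerequisites, the following base R sketch touches each of them once. It is only a refresher under our own choice of examples: the built-in `cars` data set, the coin-flip scenario, and the variable names are assumptions made for illustration.

```r
# Summary statistics on the built-in 'cars' data (speed and stopping distance).
speed_mean <- mean(cars$speed)              # sample mean
speed_var  <- var(cars$speed)               # sample variance (n - 1 denominator)
speed_dist_cor <- cor(cars$speed, cars$dist)  # Pearson correlation

# Binomial distribution: probability of exactly 3 heads in 10 fair coin flips.
p_three_heads <- dbinom(3, size = 10, prob = 0.5)   # choose(10, 3) / 2^10

# Normal distribution: probability that a standard normal variable falls
# within one standard deviation of its mean (roughly 0.68).
p_within_one_sd <- pnorm(1) - pnorm(-1)

# Simple differentiation: D() differentiates an expression symbolically.
derivative <- D(expression(x^2 + 3 * x), "x")       # derivative is 2x + 3
slope_at_2 <- eval(derivative, list(x = 2))         # 2 * 2 + 3 = 7
```

If each of these lines makes sense, the mathematics in the chapters that follow should be within reach.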
Sometimes, we have had to give an intuitive explanation of a concept in order to conserve space and avoid creating a chapter with an undue emphasis on pure theory. Wherever this is done, such as with the backpropagation algorithm in Chapter 4, Neural Networks, we have ensured that we explained enough to allow the reader to have a firm-enough hold on the basics to tackle a more detailed piece. At the same time, we have given carefully selected references, many of which are articles, papers, or online texts that are both readable and freely available. Of course, we refer to seminal textbooks wherever necessary.
The book has no exercises, but we hope that you will engage your curiosity to its maximum potential. Curiosity is a huge boon to the predictive modeler. Many of the websites from which we obtain data that we analyze have a number of other data sets that we do not investigate. We also occasionally show how we can generate artificial data to demonstrate the proof of concept behind a particular technique. Many of the R functions to build and train models have other parameters for tuning that we don't have time to investigate. Packages that we employ may often contain other related functions to those that we study, just as there are usually alternatives available to the proposed packages themselves. All of these are excellent avenues for further investigation and experimentation. Mastering predictive analytics comes just as much from careful study as from personal inquiry and practice.
A common ask from students of the field is for additional worked examples to simulate the actual process an experienced modeler follows on a data set. In reality, a faithful simulation would take as many hours as the analysis took in the first place. This is because most of the time spent in predictive modeling is in studying the data, trying new features and preprocessing steps, and experimenting with different models on the result. In short, as we will see in Chapter 1, Gearing Up for Predictive Modeling, exploration and trial and error are key components of an effective analysis. It would have been entirely impractical to compose a book that shows every wrong turn or unsuccessful alternative that is attempted on every data set. Instead of this, we fervently recommend that readers treat every data analysis in this book as a starting point to improve upon, and continue this process on their own. A good idea is to try to apply techniques from other chapters to a particular data set in order to see what else might work. This could be anything, from simply applying a different transformation to an input feature to using a completely different model from another chapter.
As a final note, we should mention that creating polished and presentable graphics in order to showcase the findings of a data analysis is a very important skill, especially in the workplace. While R's base plotting capabilities cover the basics, they often lack a polished feel. For this reason, we have used the ggplot2 package, except where a specific plot is generated by a function that is part of our analysis. Although we do not provide a tutorial for this, all the code to generate the plots included in this book is provided in the supporting code files, and we hope that the user will benefit from this as well. A useful online reference for the ggplot2 package is the section on graphs in the Cookbook for R website (http://www.cookbook-r.com/Graphs).
What this book covers
Chapter 1, Gearing Up for Predictive Modeling, begins our journey by establishing a common language for statistical models and a number of important distinctions we make when categorizing them. The highlight of the chapter is an exploration of the predictive modeling process and through this, we showcase our first model, the k-nearest neighbors (kNN) model.
Chapter 2, Linear Regression, introduces the simplest and most well-known approach to predicting a numerical quantity. The chapter focuses on understanding the assumptions of linear regression and a range of diagnostic tools that are available to assess the quality of a trained model. In addition, the chapter touches upon the important concept of regularization, which addresses overfitting, a common ailment of predictive models.
Chapter 3, Logistic Regression, extends the idea of a linear model from the previous chapter by introducing the concept of a generalized linear model. While there are many examples of such models, this chapter focuses on logistic regression as a very popular method for classification problems. We also explore extensions of this model for the multiclass setting and discover that this method works best for binary classification.
Chapter 4, Neural Networks, presents a biologically inspired model that is capable of handling both regression and classification tasks. There are many different kinds of neural networks, so this chapter devotes itself to the multilayer perceptron network. Neural networks are complex models, and this chapter focuses substantially on understanding the range of different configuration and optimization parameters that play a part in the training process.
Chapter 5, Support Vector Machines, builds on the theme of nonlinear models by studying support vector machines. Here, we discover a different way of thinking about classification problems by trying to fit our training data geometrically using maximum margin separation. The chapter also introduces cross-validation as an essential technique to evaluate and tune models.
Chapter 6, Tree-based Methods, covers decision trees, yet another family of models that have been successfully applied to regression and classification problems alike. There are several flavors of decision trees, and this chapter presents a number of different training algorithms, such as CART and C5.0. We also learn that tree-based methods offer unique benefits, such as built-in feature selection, support for missing data and categorical variables, as well as a highly interpretable output.
Chapter 7, Ensemble Methods, takes a detour from the usual motif of showcasing a new type of model, and instead tries to answer the question of how to effectively combine different models together. We present the two widely known techniques of bagging and boosting and introduce the random forest as a special case of bagging with trees.
Chapter 8, Probabilistic Graphical Models, tackles an active area of machine learning research, that of probabilistic graphical models. These models encode conditional independence relations between variables via a graph structure, and have been successfully applied to problems in a diverse range of fields, from computer vision to medical diagnosis. The chapter studies two main representatives, the Naïve Bayes model and the hidden Markov model. This last model, in particular, has been successfully used in sequence prediction problems, such as predicting gene sequences and labeling sentences with part of speech tags.
Chapter 9, Time Series Analysis, studies the problem of modeling a particular process over time. A typical application is forecasting the future price of crude oil given historical data on the price of crude oil over a period of time. While there are many different ways to model time series, this chapter focuses on ARIMA models while discussing a few alternatives.
Chapter 10, Topic Modeling, is unique in this book in that it presents topic modeling, an approach that has its roots in clustering and unsupervised learning. Nonetheless, we study how this important method can be used in a predictive modeling scenario. The chapter emphasizes the most commonly known approach to topic modeling, Latent Dirichlet Allocation (LDA).
Chapter 11, Recommendation Systems, wraps up the book by discussing recommendation systems that analyze the preferences of a set of users interacting with a set of items, in order to make recommendations. A famous example of this is Netflix, which uses a database of ratings made by its users on movie rentals to make movie recommendations. The chapter casts a spotlight on collaborative filtering, a purely data-driven approach to making recommendations.
Introduction to R gives an introduction and overview of the R language. It is provided as a way for readers to get up to speed in order to follow the code samples in this book. This is available as an online chapter at https://www.packtpub.com/sites/default/files/downloads/Mastering_Predictive_Analytics_with_R_Chapter.
What you need for this book
The only strong requirement for running the code in this book is an installation of R. This is freely available from http://www.r-project.org/ and runs on all the major operating systems. The code in this book has been tested with R version 3.1.3.
All the chapters introduce at least one new R package that does not come with the base installation of R. We do not explicitly show the installation of R packages in the text, but if a package is not currently installed on your system or if it requires updating, you can install it with the install.packages() function. For example, the following command installs the tm package:
> install.packages("tm")
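If we run the same script repeatedly, we may prefer to install a package only when it is missing. The following is a minimal sketch of that idea, not code from this book: the helper name ensure_package and the default CRAN mirror URL are our own assumptions, while requireNamespace() and install.packages() are standard R functions.

```r
# Hypothetical helper (not part of the book's code files):
# install a package from CRAN only if it is not already available.
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  # requireNamespace() returns FALSE instead of raising an error
  # when the package cannot be loaded.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  # Report (invisibly) whether the package is available afterwards.
  invisible(requireNamespace(pkg, quietly = TRUE))
}
```

Calling ensure_package("tm") would then download the tm package only on the first run, assuming an Internet connection is available.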
All the packages we use are available on CRAN. An Internet connection is needed to download and install them as well as to obtain the open source data sets that we use in our real-world examples. Finally, even though not absolutely mandatory, we recommend that you get into the habit of using an Integrated Development Environment (IDE) to work with R. An excellent offering is RStudio (http://www.rstudio.com/), which is open source.
Who this book is for
This book is intended for budding and seasoned practitioners of predictive modeling alike. Most of the material of this book has been used in lectures for graduates and working professionals as well as for R schools, so it has also been designed with the student in mind. Readers should be familiar with R, but even those who have never worked with this language should be able to pick up the necessary background by reading the online tutorial chapter. Readers unfamiliar with R should have had at least some exposure to programming languages such as Python. Those with a background in MATLAB will find the transition particularly easy. As mentioned earlier, the mathematical requirements for the book are very modest, assuming only certain elements from high school mathematics, such as the concepts of mean and variance and basic differentiation.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Finally, we'll use the sort() function of R with the index.return parameter set to TRUE."
A block of code is set as follows:
> iris_cor <- cor(iris_numeric)
> findCorrelation(iris_cor)
[1] 3
> findCorrelation(iris_cor, cutoff = 0.99)
integer(0)
> findCorrelation(iris_cor, cutoff = 0.80)
[1] 3 4
New terms and important words are shown in bold.
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <feedback@packtpub.com>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books — maybe a mistake in the text or the code — we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us at <questions@packtpub.com>, and we will do our best to address the problem.
Chapter 1. Gearing Up for Predictive Modeling
In this first chapter, we'll start by establishing a common language for models and taking an in-depth look at the predictive modeling process. Much of predictive modeling involves the key concepts of statistics and machine learning, and this chapter will provide a brief tour of the core distinctions of these fields that are essential knowledge for a predictive modeler. In particular, we'll emphasize the importance of knowing how to evaluate a model in a way that is appropriate to the type of problem we are trying to solve. Finally, we will showcase our first model, the k-nearest neighbors model, as well as caret, a very useful R package for predictive modelers.
Models
Models are at the heart of predictive analytics and for this reason, we'll begin our journey by talking about models and what they look like. In simple terms, a model is a representation of a state, process, or system that we want to understand and reason about. We make models so that we can draw inferences from them and, more importantly for us in this book, make predictions about the world. Models come in a multitude of different formats and flavors, and we will explore some of this diversity in this book. Models can be equations linking quantities that we can observe or measure; they can also be a set of rules. A simple model with which most of us are familiar from school is Newton's Second Law of Motion. This states that the net force acting on an object causes the object to accelerate in the direction of that force, at a rate proportional to the magnitude of the net force and inversely proportional to the object's mass.
We often summarize this information via an equation using the letters F, m, and a for the quantities involved. We also use the capital Greek letter sigma (Σ) to indicate that we are summing over the forces, and arrows above the letters that denote vector quantities (that is, quantities that have both magnitude and direction):
$$\sum \vec{F} = m\vec{a}$$
This simple but powerful model allows us to make some predictions about the world. For example, if we apply a known force to an object with a known mass, we can use the model to predict how much it will accelerate. Like most models, this model makes some assumptions and generalizations. For example, it assumes that the color of the object, the temperature of the environment it is in, and its precise coordinates in space are all irrelevant to how the three quantities specified by the model interact with each other. Thus, models abstract away the myriad of details of a specific instance of a process or system in question, in this case the particular object in whose motion we are interested, and limit our focus only to properties that matter.
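To make the idea of predicting with a model concrete, here is a minimal sketch in R that treats Newton's Second Law as a prediction function. The function name predict_acceleration and the numeric values are our own illustrative assumptions; the sketch handles only magnitudes along a single direction rather than full vectors.

```r
# Illustrative sketch: predict acceleration (m/s^2) from the net force (N)
# and the mass (kg) using a = F / m. Names and values are hypothetical.
predict_acceleration <- function(net_force, mass) {
  stopifnot(all(mass > 0))  # the model only makes sense for positive mass
  net_force / mass
}

# A net force of 10 N applied to an object with a mass of 2 kg
predict_acceleration(10, 2)
# [1] 5
```

Note that the function takes only force and mass as inputs. Properties such as color or temperature do not appear at all, which is exactly the abstraction the model makes.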
Newton's Second Law is not the only possible model to describe the motion of objects. Students of physics soon discover other more complex models, such as those taking into account relativistic mass. In general, models are considered more complex if they take a larger number of quantities into account or if their structure is more complex. For example, nonlinear models are generally more complex than linear models. Determining which model to use in practice isn't as simple as picking a more complex model over a simpler model. In fact, this is a central theme that we will revisit time and again as we progress through the many different models in this book. To build our intuition as to why this is so, consider the case where our instruments that measure the mass of the object and the applied force are very noisy. Under these circumstances, it might not make sense to invest in using a more complicated model, as we know that the additional accuracy in the prediction won't make a difference because of the noise in the inputs. Another situation where we may want to use the simpler model is if in our application we simply don't need the extra accuracy. A third situation arises where a more complex model involves a quantity that we have no way of measuring. Finally, we might not want to use a more complex model if it turns out that it takes too long to train or make a prediction because of its complexity.
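The point about noisy inputs can be illustrated with a small simulation. This is only a sketch under stated assumptions: we use the relativistic relation F = γ³ma (for a force parallel to the velocity) as a stand-in for a more complex model, and the object's speed, the true force and mass, and the 5 percent measurement noise are all arbitrary choices of ours rather than values from this book.

```r
set.seed(42)

c_light <- 299792458                      # speed of light in m/s
v <- 0.01 * c_light                       # assumed speed of the object
gamma <- 1 / sqrt(1 - (v / c_light)^2)    # Lorentz factor

true_force <- 10                          # newtons (assumed)
true_mass  <- 2                           # kilograms (assumed)
n <- 10000                                # number of simulated measurements

# Noisy instrument readings: 5% Gaussian noise on each input (assumed)
force_obs <- true_force * (1 + rnorm(n, sd = 0.05))
mass_obs  <- true_mass  * (1 + rnorm(n, sd = 0.05))

# Reference value: the complex model evaluated on the exact inputs
a_true <- true_force / (gamma^3 * true_mass)

# Predictions of both models from the noisy readings
a_simple  <- force_obs / mass_obs
a_complex <- force_obs / (gamma^3 * mass_obs)

# Root mean squared error of each model against the reference value
rmse <- function(pred, truth) sqrt(mean((pred - truth)^2))
rmse_simple  <- rmse(a_simple, a_true)
rmse_complex <- rmse(a_complex, a_true)

c(simple = rmse_simple, complex = rmse_complex)
```

Under these assumptions, the two error values come out almost identical, because the measurement noise is far larger than the correction that the more complex model provides.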
Learning from data
In this book, the models we will study have two important and defining characteristics. The first of these is that we will not use mathematical reasoning or logical induction to produce a model from known facts, nor will we build models from technical specifications or business rules; instead, the field of predictive analytics builds models from data. More specifically, we will assume that for any given predictive task that we want to accomplish, we will start with some data that is in some way related to or derived from the task at hand. For example, if we want to build a model to predict annual rainfall in various parts of a country, we might have collected (or have the means to collect) data on rainfall at different locations, while measuring potential quantities of interest, such as the height above sea level, latitude, and longitude. The power of building a model to perform our predictive task stems from the fact that we will use examples of rainfall measurements at a finite list of locations to predict the rainfall in places where we did not collect any data.
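The rainfall scenario can be sketched in a few lines of R. Everything in this sketch is an assumption made for illustration: the data are synthetic rather than real measurements, the relationship used to generate them is invented, and lm() is used only as a convenient stand-in for whichever model we might eventually choose.

```r
set.seed(1)
n <- 50

# Synthetic (made-up) weather stations; these are not real observations
stations <- data.frame(
  elevation = runif(n, 0, 2000),   # height above sea level in metres
  latitude  = runif(n, 35, 45),
  longitude = runif(n, -5, 5)
)

# Invented data-generating process for annual rainfall in millimetres
stations$rainfall <- 400 + 0.3 * stations$elevation +
  10 * (stations$latitude - 40) + rnorm(n, sd = 50)

# Build a model from the finite list of locations where we have data
fit <- lm(rainfall ~ elevation + latitude + longitude, data = stations)

# Predict rainfall at a location where no measurement was collected
new_site <- data.frame(elevation = 850, latitude = 41.2, longitude = 1.5)
predicted <- predict(fit, newdata = new_site)
predicted
```

The model never sees the new location during fitting. Its prediction relies entirely on the patterns learned from the locations that were measured.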
The second important characteristic of the problems for which we will build models is that during the process of building a model from some data