
Understanding the R Ecosystem

David Weisman, Ph.D.


Weisman@Computer.Org

2012. All rights reserved.

Outline
R Is a Powerful System for Data Analysis
R Does Data Mining, Visualization, and Traditional Statistics
R Generates Reports and Web Content
R Interoperates Widely
R is Growing Rapidly
R has a strong support community
You can learn from the R community
You can learn R on your own


R does exploratory data analysis

[Figure: scatterplot matrix of the variables mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, and carb, with density plots and pairwise correlations. Sample automotive data from Motor Trend magazine.]
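A scatterplot matrix like the one on this slide can be sketched with base R. The example assumes the slide's data is the built-in mtcars dataset (1974 Motor Trend road tests), which has the same variable names. The choice of five columns and the plot title are illustrative, not taken from the slide.

```r
# mtcars ships with base R (datasets package): 32 cars, 11 variables.
vars <- mtcars[, c("mpg", "disp", "hp", "drat", "wt")]

# Numeric summaries: quartiles, mean, and range for each column.
print(summary(vars))

# Pairwise Pearson correlations, rounded for readability.
cors <- round(cor(vars), 2)
print(cors)

# Scatterplot matrix with a smoothed trend line in each panel.
pairs(vars, panel = panel.smooth, main = "Motor Trend sample data")
```

The correlation matrix gives the numbers, and the scatterplot matrix shows whether each relationship is roughly linear.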

R clusters multivariate datasets


Hierarchical clustering of financial assets
https://www.rmetrics.org/node/34

K-centroids clustering
http://rgm2.lab.nig.ac.jp/RGM2/func.php?rd_id=flexclust:kcca
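Both kinds of clustering can be tried with base R alone. This is a minimal sketch on the built-in mtcars data rather than the financial data from the slide. It uses kmeans() as a simple centroid-based stand-in for the K-centroids method in the flexclust package, and the choice of three clusters is an assumption made for illustration.

```r
# Standardize so that no single variable dominates the distances.
x <- scale(mtcars[, c("mpg", "disp", "hp", "wt")])

# Hierarchical clustering: build the full tree, then cut it into 3 groups.
tree <- hclust(dist(x), method = "complete")
hc_groups <- cutree(tree, k = 3)
plot(tree, main = "Hierarchical clustering of cars (illustrative)")

# k-means: a centroid-based method from base R (stand-in for flexclust::kcca).
set.seed(42)  # k-means starts from random centers
km <- kmeans(x, centers = 3, nstart = 10)

# Cross-tabulate the two groupings to see how far they agree.
print(table(hierarchical = hc_groups, kmeans = km$cluster))
```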

R classifies data to make predictions

Decision tree predicts credit card ownership
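A classification tree of this kind can be fitted with the rpart package, which is distributed with R as a recommended package. The credit card data from the slide is not available here, so this sketch substitutes the built-in iris dataset and predicts species instead of card ownership.

```r
library(rpart)

# Fit a classification tree: Species is the outcome, all other columns are predictors.
fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)

# Predict the class of each training row and compare with the truth.
pred <- predict(fit, newdata = iris, type = "class")
print(table(predicted = pred, actual = iris$Species))

# Training accuracy (optimistic; use held-out data for an honest estimate).
accuracy <- mean(pred == iris$Species)
print(accuracy)
```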









R mines transaction data for association rules

[Figure: grouped matrix for 5668 rules, plotting grouped left-hand-side (LHS) itemsets such as "390 (whole milk +81)" against right-hand-side (RHS) items such as {whole milk} and {yogurt}. Size: support. Color: lift.]
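The support and lift shown in the figure can be worked out by hand for a single rule. The sketch below uses five made-up shopping baskets (hypothetical data, not the grocery data behind the slide) and base R only. Packages such as arules mine rules like this automatically over large transaction sets.

```r
# Five toy transactions (hypothetical data for illustration).
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "rolls/buns"),
  c("yogurt", "whole milk", "soda"),
  c("soda"),
  c("whole milk")
)

# Fraction of baskets that contain every item in `items`.
support <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Rule under study: {yogurt} => {whole milk}
lhs <- "yogurt"
rhs <- "whole milk"

rule_support    <- support(c(lhs, rhs))            # how often both occur together
rule_confidence <- rule_support / support(lhs)     # P(rhs | lhs)
rule_lift       <- rule_confidence / support(rhs)  # confidence relative to the rhs base rate

print(c(support = rule_support, confidence = rule_confidence, lift = rule_lift))
```

A lift above 1 means the right-hand-side item appears more often in baskets containing the left-hand side than it does overall.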

R analyzes social networks


Affiliation network of school activity clubs

[Figure: network graph whose nodes are labeled with club names, such as Chess Club, Debate, Latin Club, Pep Club, and the choir, band, orchestra, and sports teams.]

http://sna.stanford.edu/sna_R_labs/output/lab_5/5.7_magact_stdnt_actvts_1996_clubs.pdf
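An affiliation network starts from a membership table of people by groups. The sketch below uses a hypothetical four-student, three-club table (not the data behind the slide) and base R matrix algebra to derive a club-by-club network in which the tie strength is the number of shared members.

```r
# Rows are students, columns are clubs; 1 means the student belongs to the club.
# The memberships are made up for illustration.
membership <- matrix(
  c(1, 1, 0,
    1, 0, 0,
    0, 1, 1,
    1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("student", 1:4),
                  c("Chess Club", "Debate", "Latin Club"))
)

# One-mode projection: t(M) %*% M counts shared members for each pair of clubs.
# The diagonal holds each club's own size.
co_membership <- crossprod(membership)
print(co_membership)

# Keep only ties between different clubs.
ties <- co_membership
diag(ties) <- 0
print(ties)
```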

R performs geospatial analysis


Napa Valley Wine Tasting
1. Import winery street addresses from an xls file
2. Geocode with Google Maps
3. Overlay points onto a Google Map

http://blog.revolutionanalytics.com/2012/07/making-beautiful-maps-in-r-with-ggmap.html

R performs financial analytics


Efficient frontier of Swiss pension portfolios
Colors & lines indicate Value-at-risk

https://www.rmetrics.org/blog/RiskSurfaces

R retrieves stock data, generates plots in two lines of R code

http://www.quantmod.com/examples/intro/

R does traditional stats too


anova(lm(yield ~ block + N*K, npk))

            Df  Sum Sq  Mean Sq  F-value  Pr(>F)
block        5   343.3     68.7      4.8  0.00822 **
N            1   189.3    189.3     13.2  0.00247 **
K            1    95.2     95.2      6.6  0.02114 *
N:K          1    33.1     33.1      2.3  0.14959
Residuals   15   215.5     14.4

[Figure: S&P versus jobless claims. Scatterplot with axes for initial jobless claims (3e+05 to 6e+05) and the S&P 500 (800 to 1400); R-squared = 0.88.]
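Both analyses on this slide run in base R. The ANOVA call below is the one from the slide, using the built-in npk dataset. The S&P and jobless-claims data are not included with R, so the regression part is a sketch that substitutes the built-in cars dataset to show where an R-squared value comes from.

```r
# Analysis of variance for the npk field experiment (built-in dataset).
fit <- lm(yield ~ block + N * K, data = npk)
tab <- anova(fit)
print(tab)

# Simple linear regression on a stand-in dataset (cars: stopping distance vs. speed).
reg <- lm(dist ~ speed, data = cars)
r2 <- summary(reg)$r.squared

# Scatterplot with the fitted line, labeled with R-squared.
plot(dist ~ speed, data = cars, main = sprintf("R-squared = %.2f", r2))
abline(reg)
```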


R produces publication-quality PDF reports


knitr Graphics Manual

[Figure: pages from the knitr Graphics Manual by Yihui Xie (April 6, 2012), which documents graphics in the knitr package (version 0.4.4): graphical devices, plot recording, plot rearrangement, control of plot sizes, the tikz device, figure captions, animations, and other plot types such as rgl or GGobi plots. The pages show R code chunks alongside the plots they generate.]

http://yihui.name/knitr/demo/graphics/
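The idea behind such reports is a source file that mixes LaTeX text with R code chunks. The sketch below uses Sweave(), which ships with base R, instead of knitr, because knitr must be installed separately. Both read the same kind of .Rnw file. The file name, text, and chunk contents are made up for illustration, and turning the resulting .tex file into a PDF additionally needs a LaTeX installation.

```r
# A tiny .Rnw document: LaTeX text, an inline \Sexpr{} value, and one figure chunk.
rnw <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The mean fuel economy is \\Sexpr{round(mean(mtcars$mpg), 1)} mpg.",
  "<<fig=TRUE, echo=TRUE>>=",
  "hist(mtcars$mpg, main = 'Fuel economy')",
  "@",
  "\\end{document}"
)

# Work in a temporary directory so no files are left in the project.
out_dir <- tempfile("report_")
dir.create(out_dir)
old_wd <- setwd(out_dir)

writeLines(rnw, "report.Rnw")
Sweave("report.Rnw")            # runs the chunks, writes report.tex and the figure file
tex <- readLines("report.tex")

setwd(old_wd)
print(tex)
```

The computed value and the plot are inserted into the document by R, so the report stays in sync with the data.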

R produces HTML web content


R runs on Windows, Macintosh & Linux

R runs on high-performance, grid, and cloud systems


Hardware:
- Netezza
- Amazon Web Services
- Hadoop / MapReduce
- MPI
- Multicore
- Grid
- GPUs
Big data:
- RevoScaleR
- biglm
- bigmemory, biganalytics, bigtabulate, bigalgebra
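Of the options listed above, multicore processing is the easiest to try, because the parallel package ships with R. This is a minimal sketch. The squaring task is a placeholder for real work, and the worker count of two is an assumption. Forking is not available on Windows, so the code falls back to one core there.

```r
library(parallel)

# mclapply() forks worker processes; on Windows it only supports one core.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Apply a function to each input, spreading the calls across the workers.
res <- mclapply(1:8, function(i) i^2, mc.cores = cores)

# Results come back as a list in input order.
squares <- unlist(res)
print(squares)
```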

R interoperates with other technologies


Relational databases:
- Netezza, Oracle, DB2, PostgreSQL, Microsoft SQL Server, MySQL, SQLite
NoSQL databases:
- MongoDB, HBase, Cassandra, CouchDB
Web:
- Web client via http, html, XML, json
- Web server via RApache
Languages and statistical systems:
- C++, Java, Perl, and Python direct interoperability
- Excel: read and write files, RExcel Excel add-in
- SAS, SPSS, Stata, Systat files
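As one concrete case from the list above, Stata files can be written and read with the foreign package, which is distributed with R as a recommended package. The sketch does a round trip through a temporary .dta file using the built-in mtcars data. The file path is generated for illustration.

```r
library(foreign)

# Write a data frame in Stata's binary format.
path <- tempfile(fileext = ".dta")
write.dta(mtcars, path)

# Read it back into R as a data frame.
back <- read.dta(path)

# Row names are not stored in the file, but the columns survive.
print(head(back))
print(dim(back))
```

The same package also provides read.spss() and read.systat() for SPSS and Systat files.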


R has a long history


1980s S language defined at Bell Labs
1993 R released informally
1995 R fully open sourced
1998 Association for Computing Machinery gave Software System Award to John Chambers, principal designer of S
2009 New York Times published a feature story on R
2012 Active open source project and community; commercial products and support
http://cm.bell-labs.com/cm/ms/departments/sia/S/history.html
http://cran.r-project.org/doc/FAQ/R-FAQ.html
http://cran.r-project.org/doc/html/interface98-paper/paper_2.html

The New York Times recognized R


Business Computing
Data Analysts Captivated by R's Power
By ASHLEE VANCE
Published: January 6, 2009

Photo caption: R first appeared in 1996, when the statistics professors Robert Gentleman, left, and Ross Ihaka released the code as a free software package. (Stuart Isett for The New York Times)

The number of R packages is growing exponentially

http://r4stats.com/articles/popularity/

There are several major R products available


Open source R
Revolution Computing R Workstation
R Enterprise Server
Directly supported in Netezza
Tibco Spotfire S-Plus

Large enterprises are using R

http://www.revolutionanalytics.com

Kaggle, KDNuggets, and Rexer quantified R usage

R Usage

http://blog.kaggle.com

http://www.kdnuggets.com

http://www.rexeranalytics.com


Boston has a highly active R community

R has an active online community

Software   # Blogs
R          365
SAS        40
Stata      8
Others     0-3

Aggregated at http://r-bloggers.com/

http://r4stats.com/articles/popularity/

For Q & A:
- http://stackoverflow.com/questions/tagged/r
- https://stat.ethz.ch/mailman/listinfo/r-help

R has a peer-reviewed journal and conferences


The R Journal, Volume 4/1, June 2012 (ISSN: 2073-4859)

Table of Contents

Editorial

Contributed Research Articles
- Analysing Seasonal Data (Adrian G Barnett, Peter Baker and Annette J Dobson)
- MARSS: Multivariate Autoregressive State-space Models for Analyzing Time-series Data (Elizabeth E. Holmes, Eric J. Ward, Kellie Wills)
- openair - Data Analysis Tools for the Air Quality Community (Karl Ropkins and David C. Carslaw)
- Foreign Library Interface (Daniel Adler)
- Vdgraph: A Package for Creating Variance Dispersion Graphs (John Lawson)
- xgrid and R: Parallel Distributed Processing Using Heterogeneous Groups of Apple Computers (Sarah C. Anoke, Yuting Zhao, Rafael Jaeger and Nicholas J. Horton)
- maxent: An R Package for Low-memory Multinomial Logistic Regression with Support for Semi-automated Text Classification (Timothy P. Jurka)
- Sumo: An Authenticating Web Application with an Embedded R Session (Timothy T. Bergsma and Michael S. Smith)

From the Core
- Who Did What? The Roles of R Package Authors and How to Refer to Them (Kurt Hornik, Duncan Murdoch and Achim Zeileis)

News and Notes
- Changes in R
- Changes on CRAN
- R Foundation News

R/Finance 2012: Applied Finance with R
May 11 & 12, Chicago, IL, USA

Here are several more resources for learning R

R-Project: http://www.r-project.org/

The Comprehensive R Archive Network: http://cran.r-project.org/mirrors.html

Task Views: http://cran.stat.ucla.edu/

Quick-R: http://www.statmethods.net/

Stack Overflow: http://stackoverflow.com/tags/r/info


R Books
Many, many R textbooks are available. Choice depends on:
- statistics emphasis
- data mining emphasis
- software emphasis
- intended application
Some good free R books to get started:
http://cran.r-project.org/doc/manuals/R-intro.pdf
http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
http://cran.r-project.org/doc/contrib/usingR.pdf
http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf
http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf
http://cran.r-project.org/doc/contrib/Lam-IntroductionToR_LHL.pdf
http://ipsur.org/index.html
http://cran.r-project.org/other-docs.html
http://www.oup.com/uk/orc/bin/9780199299881/01student/companions/R_Companion.pdf

R package vignettes are handy tutorials

Support Vector Machines
The Interface to libsvm in package e1071
by David Meyer
Technische Universität Wien, Austria
David.Meyer@ci.tuwien.ac.at
September 12, 2011

"Hype or Hallelujah?" is the provocative title used by Bennett & Campbell (2000) in an overview of Support Vector Machines (SVM). SVMs are currently a hot topic in the machine learning community, creating a similar enthusiasm at the moment as Artificial Neural Networks used to do before. Far from being a panacea, SVMs yet represent a powerful technique for general (nonlinear) classification, regression and outlier detection with an intuitive model representation.
The package e1071 offers an interface to the award-winning C++ implementation by Chih-Chung Chang and Chih-Jen Lin, libsvm (current version: 2.6), featuring:
- C- and ν-classification
- one-class-classification (novelty detection)

http://cran.stat.ucla.edu/web/packages/e1071/vignettes/svmdoc.pdf
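Vignettes can also be found from inside R. The sketch below lists the vignettes of the grid package, chosen only because it ships with base R. The commented lines show how a vignette would be opened in an interactive session. The last of them assumes that e1071 is installed and that its SVM vignette is named "svmdoc", as the URL above suggests.

```r
# List the vignettes installed for one package.
v <- vignette(package = "grid")
print(v$results[, c("Item", "Title"), drop = FALSE])

# In an interactive session, open a vignette by name:
# vignette("grid", package = "grid")

# Or browse every installed vignette in a web browser:
# browseVignettes()

# Assumes the e1071 package is installed and the vignette is named "svmdoc":
# vignette("svmdoc", package = "e1071")
```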


Large libraries of sample graphics with R code


http://rgm2.lab.nig.ac.jp/RGM2/

More libraries of sample graphics with R code


http://gallery.r-enthusiasts.com/thumbs.php


Next steps:
Download R
Try a data mining interface like Rattle
Try a statistics interface like Deducer or R Commander
Try a programmer interface like RStudio
