
WORD CLOUD ANALYSIS

Assignment III

Ansuman Chattopadhyay - 14PGP003


Praveen Kumar J R - 14PGP057
Robin Singh- 14PGP059
Vaibhav Bhatia- 14PGP100
S. Sreenvas - 14PGP119

Text analysis of Tweets

Tool - R
Rationale - R is a scripting language that can be used quite effectively for word cloud
formation and analysis. We used version 3.2.1.
Procedure
We first installed the packages required to build a word cloud.
The following script was run to do so:
install.packages("ROAuth")
install.packages("bitops")
install.packages("digest")
install.packages("rjson")
install.packages("NLP")
install.packages("twitteR")
install.packages("stringr")
install.packages("ggplot2")
install.packages("tm")
install.packages("RColorBrewer")
install.packages("wordcloud")
install.packages("RCurl")
install.packages("httpuv")
install.packages("plyr")
install.packages("RJSONIO")
install.packages("httr")

The libraries were then loaded for use.
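The exact library() calls were omitted from the script; a minimal sketch, assuming the packages installed above, would be:

# Load the installed packages (sketch; load order is not critical)
library(ROAuth)
library(twitteR)
library(stringr)
library(ggplot2)
library(tm)
library(RColorBrewer)
library(wordcloud)
library(RCurl)
library(httr)
library(plyr)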


To find out the predominant terms used in my Twitter account, we first have to
connect to the account through our R script.
The following code does that job:
# Download Certificate File
download.file(url = "http://curl.haxx.se/ca/cacert.pem",destfile="cacert.pem")

# Set SSL certs globally


options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package =
"RCurl")))

The next very important step is to set the API key and secret from my account. The following
code does just that:

# set API key and API secret from Twitter developer site
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"

#Generate the access token after creating the app on Twitter; replace with your
#own values
#the values below don't work
apiKey <- "sb0mWFVbEFNtJnBQO0fWRUcV"
apiSecret <- "7XRvv9FrrL77Z2mHcecF9pygon4GjHtRw49J5RQA3jHWBVpY7"

oauthKey <- "2853123974-OUVIt05vqZRQXYjalZE0kWdoy6ubJyFFWvEzmU"


oauthSecret <- "mJGmEk45558v3xOTWacX28179fzqnBQgwf1jAJhexdqm"
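The script stops short of the actual authentication call; a minimal sketch, assuming the twitteR package's setup_twitter_oauth() helper (not shown in the original), would pass the four values defined above:

# Authenticate the session with the keys above (sketch; real, working keys are required)
setup_twitter_oauth(apiKey, apiSecret, oauthKey, oauthSecret)

The reqURL, accessURL and authURL endpoints are only needed if the older ROAuth handshake is used instead.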

We are essentially searching for the string "#Android", so:


# search tweets for Twitter Trends
tweets = searchTwitter("#Android", n=100)
The above code creates an object called tweets, which holds the results of searching the Twitter
world for #Android.
We then convert the list of tweets into a data frame. From the text of that data frame we
create a corpus.
Significance of a corpus
A corpus is significant in the sense that it can be used to perform semantic
analysis on a data set.
The following code illustrates this:
# Converting Tweets to Data Frame
tweets = do.call("rbind", lapply(tweets, as.data.frame))

dim(tweets)
#Building the corpus
corpus = Corpus(VectorSource(tweets$text))
corpus[[3]]

Now, if we want to analyze word clouds using a machine interface like R, we need
to first prep the source. The prepping was done by converting to lower case, then
removing punctuation and stop words, and finally forming a stemmed corpus.

# Lower Case
corpus = tm_map(corpus, content_transformer(tolower))
corpus[[1]]
#Remove punctuation
corpus = tm_map(corpus, removePunctuation)
corpus[[2]]
#Remove Stop Words
stopwords("english") [1:1000]
corpus = tm_map(corpus, removeWords, c("Android ", stopwords("english")))
corpus[[1]]
#Stemming
corpus = tm_map(corpus, stemDocument)
corpus[[1]]
The last and the most important step is the Word Cloud Formation-

myDTM = DocumentTermMatrix(corpus, control = list(minWordLength=1))


m = as.matrix(myDTM)
v = sort(colSums(m), decreasing=TRUE)
myNames = names(v)
myNames
d = data.frame(word=myNames, freq=v)
wordcloud(d$word, d$freq, min.freq=4)

Output

CONCLUSION
We can conclude from the word cloud that the most prominent terms concerning
Android on Twitter are "androidgam", which is probably the stemmed form of "Android
gaming", and something called "Gameinsight".
