The dataset for this analysis is derived from a labelled dataset of Facebook news posts by multiple news agencies, available on GitHub. Using the post IDs and the Facebook API (in R), we collected the information associated with each post.
Our goal is to classify the posts using the post message, the user comments, and the reactions on it.

II. METHOD
The wordVectors package by bmschmidt was used to tokenize the post message and comments into a 300-dimensional vector space using Mikolov's word2vec skip-gram model. The post message and comment vectors were then averaged for each post and combined in a proportion selected by cross-validation.
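The combination step can be sketched as a simple convex combination; the helper name combine_vectors and the weight alpha below are illustrative, not taken from the paper.

# Sketch: combine the averaged post-message and comment vectors for each post.
# post_vecs and comment_vecs are numeric matrices (one row per post, 300 columns).
combine_vectors <- function(post_vecs, comment_vecs, alpha) {
  alpha * post_vecs + (1 - alpha) * comment_vecs
}

# alpha would be chosen by cross-validation, e.g. by refitting the classifier
# for each value in seq(0, 1, by = 0.1) and keeping the best validation score.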
A. Data Collection

The original labelled dataset available on GitHub had more than 2000 unique posts with class labels. However, the proportion of each class in the dataset was highly imbalanced. To overcome this limitation, we sampled 288 unique posts, with 67 percent of the posts belonging to the class labelled as true and 33 percent to the class labelled as false. This dataset was then randomly partitioned into a 30 percent test set and a 70 percent training set.
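A minimal sketch of the collection step, assuming the Rfacebook package is the route to the Facebook API (the paper only states that the API was called from R); the token and post ID below are placeholders.

library(Rfacebook)

token   <- "XXXX"                            # placeholder OAuth token
post_id <- "228735667216_10154660327037217"  # placeholder post ID from the labelled dataset

# Pull the post message, reactions and comments associated with one labelled post
post_data <- getPost(post = post_id, token = token, n = 1000,
                     comments = TRUE, likes = FALSE, reactions = TRUE)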
B. Vectorization
Per Mikolov et al., words and phrases can be represented by the skip-gram model. In this respect, it becomes possible to apply analogical reasoning through simple vector calculations, exploiting the linear structure of the vector space. For each post, the message vector and the comment vector were obtained by averaging the vectors of their constituent words:

p_j = (1/n) Σ_{i=1}^{n} w_ij

where w_ij is the j-th component of the vector of the i-th word and n is the number of words.

The first approach involved training the vectors on the comments corpus; for the second approach, a vector model pre-trained on 3 billion words from Google News (using Mikolov's word2vec skip-gram model) was chosen.
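Both approaches can be sketched with the wordVectors package; the file names and tokens below are placeholders, and the preprocessing the paper actually used is not reproduced here.

library(wordVectors)   # devtools::install_github("bmschmidt/wordVectors")

# Approach 1: train a 300-dimensional skip-gram model on the comments corpus
comment_model <- train_word2vec("comments.txt", "comment_vectors.bin",
                                vectors = 300, window = 5, threads = 2)

# Approach 2: load vectors pre-trained on the Google News corpus
news_model <- read.vectors("GoogleNews-vectors-negative300.bin")

# Average the vectors of the words appearing in one post message (placeholder tokens)
tokens  <- c("election", "results", "announced")
avg_vec <- colMeans(comment_model[rownames(comment_model) %in% tokens, , drop = FALSE])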
C. Exploration

Due to the high number of predictors (~304), we used various unsupervised clustering techniques, such as k-means, to detect clusters, and Principal Component Analysis (PCA), as shown in Figures 2 and 3, to inspect the predictors in a lower-dimensional space.

Figure 2. Biplot of the principal components of cosine similarity vectors of "true" and "false" trained on the comments corpus.
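As a rough illustration of the PCA and k-means exploration with base R (the data below are random placeholders standing in for the ~304 predictors per post):

set.seed(1)
X      <- matrix(rnorm(288 * 304), nrow = 288)             # placeholder predictor matrix
labels <- sample(c("true", "false"), 288, replace = TRUE)  # placeholder class labels

pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)                    # proportion of variance explained per component
biplot(pca, choices = c(1, 2))  # biplot of the first two components, as in Figure 2

km <- kmeans(pca$x[, 1:2], centers = 2, nstart = 25)
table(km$cluster, labels)       # compare detected clusters with the class labels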
However, the first few principal components failed to explain a large proportion of the variance, and k-means clustering performed poorly in detecting separable clusters.

D. Modelling Techniques

The ratio of predictors to observations is quite high (~1.5), so we decided to explore machine learning techniques suited to this setting, namely the Nearest Shrunken Centroid and Random Forest. Since most of the 300 vector dimensions were uncorrelated, we also explored the performance of a Naive Bayes classifier.
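A minimal sketch of how the three classifiers might be fit in R; the pamr, randomForest and e1071 packages are common choices and are an assumption here, as the paper does not name the implementations.

library(pamr)            # nearest shrunken centroid
library(randomForest)
library(e1071)           # naive Bayes

set.seed(1)
X_train <- matrix(rnorm(200 * 304), nrow = 200)   # placeholder training predictors
y_train <- factor(sample(c("true", "false"), 200, replace = TRUE))

# Nearest Shrunken Centroid: pamr expects predictors in rows and samples in columns
nsc_data <- list(x = t(X_train), y = y_train)
nsc_fit  <- pamr.train(nsc_data)
nsc_cv   <- pamr.cv(nsc_fit, nsc_data)            # cross-validation to pick the shrinkage threshold

rf_fit <- randomForest(x = X_train, y = y_train, ntree = 500)   # random forest
nb_fit <- naiveBayes(x = X_train, y = y_train)                  # naive Bayes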
For the Nearest Shrunken Centroid classifier, let

d_ik = (μ_ik − μ_i) / (m_k (s_i + s_0))   [1]

where μ_ik is the mean of predictor i in class k, μ_i is the overall mean of predictor i, s_i is the pooled within-class standard deviation of predictor i, s_0 is a small positive offset, and m_k is a class-size normalization factor.
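To make the statistic concrete, a small helper computing d_ik for a single predictor is sketched below; the normalization factor m_k and the offset s_0 are passed in rather than derived, since their exact forms follow the cited PAM description [1].

# x: values of one predictor across all posts; y: class labels; k: class of interest
# m_k and s_0 follow the PAM description [1] and are supplied by the caller
shrunken_d_ik <- function(x, y, k, m_k, s_0) {
  mu_ik <- mean(x[y == k])    # class-k mean of the predictor
  mu_i  <- mean(x)            # overall mean of the predictor
  # pooled within-class standard deviation of the predictor
  within_ss <- sum(tapply(x, y, function(v) sum((v - mean(v))^2)))
  s_i <- sqrt(within_ss / (length(x) - length(unique(y))))
  (mu_ik - mu_i) / (m_k * (s_i + s_0))
}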
V. REFERENCES
[1] http://statweb.stanford.edu/~tibs/PAM/Rdist/howwork.html
[2] Joachims, T. (2002). Learning to Classify Text
Using Support Vector Machines. doi:10.1007/978-
1-4615-0907-3
[3] Mikolov, Tomas, et al. "Distributed representations
    of words and phrases and their
    compositionality." Advances in neural information
    processing systems. 2013.
[4] Mikolov, Tomas, et al. "Efficient estimation of
word representations in vector space." arXiv
preprint arXiv:1301.3781 (2013)
[5] Sarkar, S. D., Goswami, S., Agarwal, A., & Aktar,
J. (2014). A Novel Feature Selection Technique
for Text Classification Using Naïve
Bayes. International Scholarly Research
Notices, 2014, 1-10. doi:10.1155/2014/717092
[6] Tibshirani, R., Hastie, T., Narasimhan, B., & Chu,
G. (2002). Diagnosis of multiple cancer types by
shrunken centroids of gene
expression. Proceedings of the National Academy
of Sciences, 99(10), 6567-6572.
doi:10.1073/pnas.082099299
Statement of Contributions
Shantam Gupta
1) Helped develop the idea and explore the technique of using word2vec as a tokenizer for converting text into an n-dimensional vector space.
2) Researched and selected the techniques to be used with a large number of predictors.
3) Wrote the template R code for data preparation, statistical exploration, word vector exploration, nearest shrunken centroid, random forest and naive Bayes.
4) Contributed to writing and formatting the paper.
5) Created the model diagram in Visio illustrating the modelling and prediction process.
Satish Chirra
1) Wrote the code to extract the data for all the posts in our dataset using the Facebook API.
2) Created a model with 300-dimensional vectors from all the comments and posts in the extracted data.
3) Helped with idea exploration and template creation for selecting the algorithms best suited to high-dimensional predictors.
4) Created different weightings for the comment and post vectors and used different methods to measure the performance of the model.
5) Contributed to writing the paper and created visualizations of the end results.