
Undergraduate Research Opportunity Program (UROP) Project Report

Sentiment Analysis of the Twitter Corpus

By Samantha Wong Shin Nee

Department of Computer Science School of Computing National University of Singapore

2013


Project No: U063010
Advisor: Professor Tan Chew Lim
Deliverables:
Report: 1 Volume
Source Code and Output: Uploaded to IVLE

Abstract

Today the reputation of a brand, product or service can change much more rapidly than before with social media, and the range of consumer sentiment and attitude expressed through social media such as Twitter and Facebook has also grown exponentially. Brand reputation mining, that is, monitoring social media sources for language that may affect reputation in a positive or negative way, can be a powerful tool for a company's public relations and marketing departments. We frame our problem as a text classification problem. We work with Twitter data (tweets containing one or more company names), and survey two existing methods in sentiment analysis and text classification. Our goal is to propose a novel approach to detect the polarity of a tweet with respect to a particular company: whether it is positive, negative, or neutral.

Subject Descriptors: I.2.7 Natural Language Processing

Keywords: sentiment analysis, opinion mining, similarity measures, text classification

Implementation Software and Hardware: Mac OS X 10.8, Python

Acknowledgement

I would like to thank my family, friends and advisor. Without them, I would not have been able to complete this project.

List of Tables

2.1 RepLab2013 Corpora Entity-Domain Classification
3.1 Using the Most Common Tweet scheme, with no stemming
3.2 Using the Top Tweet scheme, with no stemming
3.3 Using the Top Tweet scheme, on the EO dataset, a comparison amongst methods
3.4 A Comparison between Our CP Approach and the Best-Performer in RepLab2013
3.5 A Comparison of Time Complexity on the NP dataset

Table of Contents

Title
Abstract
Acknowledgement
List of Tables
1 Introduction
  1.1 Background
  1.2 The Problem
  1.3 Our Solution
  1.4 Report Organization
2 Methodology
  2.1 Data Set
  2.2 Preprocessing
    2.2.1 Dealing with Twitter-specific Artifacts
    2.2.2 Manipulating Punctuations and Emoticons
    2.2.3 Traditional Pre-processing Techniques
  2.3 Similarity Measures
    2.3.1 Jaccard Similarity
    2.3.2 Cosine Similarity
  2.4 A Novel Approach Based on Conditional Probabilities
    2.4.1 Variants on the Normalized Conditional Probability-Based Score
3 Results and Discussion
  3.1 Finding the Best Pre-processing Sequence
  3.2 Comparing Old and New Approaches and their Variants
  3.3 Discussion on Best Performing Methods
  3.4 The Effect of Stemming
  3.5 A Beneficial Complexity Analysis
  3.6 A Note on the Performance of the Naive Bayes Classifier
4 Conclusion
  4.1 Sum of Conditional Probabilities: A Novel Approach
  4.2 Other Contributions
    4.2.1 Use of Tokens
    4.2.2 A Lexicon from Training Data
  4.3 Future Work
References
A Algorithms
  A.1 Tokenizing Non-Space-Separated Text
B Proofs
  B.1 Proof of Relation with Naive Bayes Classifier

Chapter 1

Introduction
Today the reputation of a brand, product or service can change much more rapidly than before with social media, and the range of consumer sentiment and attitude expressed through social media such as Twitter and Facebook has also grown exponentially. Brand reputation mining, that is, monitoring social media sources for language that may affect reputation in a positive or negative way, can be a powerful tool for a company's public relations and marketing departments. We frame our problem as a text classification problem. We work with Twitter data (tweets containing one or more company names), and survey two existing methods in sentiment analysis and text classification. Our goal is to propose a novel approach to detect the polarity of a tweet with respect to a particular company: whether it is positive, negative, or neutral.

1.1

Background

In this section, we briefly discuss the history and background of the problem. None of this previous work uses the technique that we propose in this project; we therefore believe that our algorithm is novel.

1.2

The Problem

In this section, we formally define the problem. We wish to classify a tweet as Positive, Neutral or Negative with respect to a particular entity (brand or product). We rate our classifiers in terms of overall accuracy, and then within each class: precision (reliability), recall (sensitivity) and F-measure.

1.3

Our Solution

We use a method that sums conditional probabilities over the tokens of a test tweet to calculate its score with respect to each class.

1.4

Report Organization

This report first presents our Methodology, then the Results and Discussion, and finally the Conclusion.

Chapter 2

Methodology
Our approach treats the opinion mining task as a text classification problem with three sentiment classes: positive, negative and neutral. We surveyed two existing similarity measures as baselines, Jaccard Similarity and Cosine Similarity, and propose a new method based on conditional probabilities.

2.1

Data Set

We used the RepLab 2013 Corpora, which provided a corpus of 45,679 labeled training tweets concerning 61 different entities in four domains, along with 96,848 test tweets for their competition framework. Tweets were in both Spanish and English. We compare our results with the best-performing groups in the competition. The domains were automotive, banking, music and universities; example entities in each respective domain are Audi, Capital One, Britney Spears and MIT.

Each tweet was related to a particular entity by a single canonical search term. For example, the company Bayerische Motoren Werke AG was associated with the canonical search term BMW. It was possible for a tweet to relate to more than one entity in the same domain, for example if it mentioned both Audi and BMW, but this was rare.

Domain        No. of Entities in Domain   Example Entities
automotive    20                          Audi, BMW, Fiat, Honda, Suzuki, Volvo
banking       11                          Barclays, Goldman Sachs, HSBC, RBS Bank
music         20                          The Beatles, Bon Jovi, PSY, Shakira
universities  10                          University of Berkeley, Harvard University, MIT

Table 2.1: RepLab2013 Corpora Entity-Domain Classification

2.2

Preprocessing

We survey a variety of pre-processing options. It was important to eliminate elements that did not contribute to the polarity of the tweet so as to reduce noise and increase overall precision.

2.2.1

Dealing with Twitter-specic Artifacts

Tweets often depart from the standard lexicon: Twitter users commonly use abbreviations and emoticons, as is typical of casual communication on social media. We first dealt with Twitter-specific artifacts, which include hashtags, reply tokens and links.

Hashtags: # Hashtags were handled in one of two ways: either only the '#' symbol was removed, so that the hashtag word-group (e.g. #thisisawesome) was treated as a regular word (thisisawesome is kept), or the entire hashtag word-group was removed (#thisisawesome is removed). We compare the effectiveness of retaining hashtags using the similarity measures.

Reply Tokens: @ Reply tokens (e.g. @123tomboy) were removed, as were author identifiers (i.e. their usernames). One possible extension is to keep the usernames and match the presence of a particular entity within a username, retaining them as a signal for a particular polarity. For example, the username iloveBMW may consistently put out positively rated tweets concerning BMW. Appendix A shows our algorithm for tokenizing a non-space-separated sequence of words.

Links: http Links were split on punctuation delimiters. The link http://instagr.am/p/THcLAvu-pT/ would be pre-processed into the string http instagr am p THcLAvu-pT.
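The artifact rules above can be sketched in Python; the function name, regular expressions and the `keep_hashtag_word` flag are our own illustrative simplifications, not the report's actual code:

```python
import re

def clean_tweet(text, keep_hashtag_word=True):
    """Strip Twitter-specific artifacts: reply tokens, links and hashtags.

    keep_hashtag_word=True drops only the '#' (keeping 'thisisawesome');
    False drops the entire hashtag word-group.
    """
    # Reply tokens / usernames: remove '@user' mentions entirely.
    text = re.sub(r"@\w+", " ", text)

    # Links: split on punctuation delimiters rather than deleting outright.
    def split_link(match):
        parts = re.split(r"[:/.]+", match.group(0))
        return " " + " ".join(p for p in parts if p) + " "
    text = re.sub(r"https?://\S+", split_link, text)

    # Hashtags: keep the word-group as a regular word, or drop it.
    if keep_hashtag_word:
        text = re.sub(r"#(\w+)", r"\1", text)
    else:
        text = re.sub(r"#\w+", " ", text)
    return " ".join(text.split())
```

On the report's own example, `clean_tweet("see http://instagr.am/p/THcLAvu-pT/")` yields `"see http instagr am p THcLAvu-pT"`.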

2.2.2

Manipulating Punctuations and Emoticons

Another issue we had to contend with was the presence of non-standard punctuation. We first matched all space-split string tokens against a given set of emoticons (the full list of emoticons is provided in the Appendix). Once a segment of a punctuation sequence was matched to an emoticon from the list, we replaced that token with the emoticon. For example, the punctuation sequence ...o.o is a match for the emoticon o.o, so we replace that sequence with o.o in the text. We store the list of matched emoticons for each tweet separately from the string that undergoes a second pre-processing step.

In the second punctuation-related pre-processing step, we handle all remaining non-matched punctuation sequences. Non-standard punctuation includes !!, !?, ?!, !!!, ?!! and !!?!. We treat every run of punctuation symbols not separated by digits or letters as one punctuation token: a sequence such as !!1! is tokenized as !! and !, whilst a sequence such as !!! remains !!!. As a side effect, emoticons such as :), and even unusual space-split Unicode sequences, are kept intact. Because this step would split up emoticons such as o.o, which comprise both letters and punctuation, we re-add the previously matched emoticons to the text after this step.

We also produced one data set that removed all punctuation and digits not matched to the emoticon list, and two further pre-processed data sets: one that removed all punctuation tokens of length one, such as !, and one that did not. We compare the usability of these differently processed data sets using the similarity measures.
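A minimal sketch of this two-step emoticon and punctuation handling, assuming a small stand-in emoticon list (the report uses a fuller list from its Appendix; all names here are illustrative):

```python
import re

# Small stand-in emoticon list; the report's actual list is longer.
EMOTICONS = [":)", ":(", ":D", "o.o", ";)"]

def tokenize_punct(text):
    """Two-step punctuation handling.

    Step 1: pull out any substring matching a known emoticon and set it aside.
    Step 2: treat each remaining run of punctuation as a single token,
    then re-append the matched emoticons so they survive intact.
    """
    found = []
    for emo in sorted(EMOTICONS, key=len, reverse=True):  # longest match first
        if emo in text:
            found.append(emo)
            text = text.replace(emo, " ")
    # A run of non-alphanumeric, non-space characters is one punctuation token.
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    return tokens + found
```

For example, `tokenize_punct("...o.o")` sets aside `o.o` and keeps `...` as a single punctuation token, matching the behaviour described above.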

2.2.3

Traditional Pre-processing Techniques

Traditional pre-processing techniques such as stemming were also used. We used Python's Lancaster stemmer, a more aggressive stemming algorithm than the Porter stemmer.

2.3

Similarity Measures

Similarity measures are a classical means of text classification. We calculate the similarity between a test tweet and every training tweet in the training corpus. We had two methods of assigning a polarity to a given test tweet. The Most Common Tweet (MCT) method assigned the test tweet the most common polarity among the top 50 most similar training tweets. The Top Tweet (TT) method assigned the test tweet the polarity of the single training tweet calculated as most similar to it. We compare the performance of these two methods in the Results section.
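The two assignment schemes can be sketched as follows; the `(token_set, polarity)` training format and all names are illustrative, and `similarity` may be any pairwise measure such as those defined next:

```python
from collections import Counter

def assign_polarity(test_tokens, training, similarity, scheme="TT", k=50):
    """Assign a polarity under the Top Tweet (TT) or Most Common Tweet (MCT) scheme.

    `training` is a list of (token_set, polarity) pairs; `similarity` is any
    pairwise measure (e.g. Jaccard or Cosine).
    """
    # Rank training tweets by similarity to the test tweet, most similar first.
    ranked = sorted(training, key=lambda tw: similarity(test_tokens, tw[0]),
                    reverse=True)
    if scheme == "TT":
        return ranked[0][1]                 # polarity of the most similar tweet
    top_polarities = [pol for _, pol in ranked[:k]]
    return Counter(top_polarities).most_common(1)[0][0]  # MCT: majority of top k
```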

2.3.1

Jaccard Similarity

Jaccard Similarity follows from the bag-of-words model of text, which disregards both word order and word frequency in the tweet. It is simply the ratio of the number of tokens shared by the test tweet and the training tweet to the total number of distinct tokens in the two tweets:

    Jaccard Similarity = |A ∩ B| / |A ∪ B|    (2.1)

where A is the set of tokens in the test tweet, and B is the set of tokens in the training tweet. An example: let A = {warm, summer, night} and B = {warm, coffee}. Then the Jaccard Similarity between A and B is

    |{warm}| / |{warm, summer, night, coffee}| = 1/4 = 0.25.

Henceforth, a token is any space-separated sequence of letters, digits or punctuation remaining after the pre-processing procedures.
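Equation 2.1 translates directly into Python over token lists (an illustrative sketch):

```python
def jaccard(test_tokens, train_tokens):
    """Jaccard similarity (Equation 2.1): shared tokens over the token union."""
    A, B = set(test_tokens), set(train_tokens)
    return len(A & B) / len(A | B) if A | B else 0.0
```

For the worked example, `jaccard(["warm", "summer", "night"], ["warm", "coffee"])` gives 0.25.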

2.3.2

Cosine Similarity

Cosine Similarity follows from the vector space model of text, which disregards word order but takes into account word frequency in the text:

    Cosine Similarity = (A · B) / (|A| |B|)    (2.2)

where A is the token-frequency vector of the test tweet, and B is the token-frequency vector of the training tweet. Using the same example as before, A = [1, 1, 1, 0] and B = [1, 0, 0, 1], where the entries are the frequencies of [warm, summer, night, coffee] in the respective tweet. Then the Cosine Similarity between A and B is

    (1·1 + 1·0 + 1·0 + 0·1) / (√(1²+1²+1²+0²) · √(1²+0²+0²+1²)) = 1/√6 ≈ 0.408.
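Equation 2.2 can be sketched over token lists using frequency counters (illustrative; names are ours):

```python
import math
from collections import Counter

def cosine(test_tokens, train_tokens):
    """Cosine similarity (Equation 2.2) over token-frequency vectors."""
    a, b = Counter(test_tokens), Counter(train_tokens)
    dot = sum(a[t] * b[t] for t in a)                   # A . B
    norm_a = math.sqrt(sum(v * v for v in a.values()))  # |A|
    norm_b = math.sqrt(sum(v * v for v in b.values()))  # |B|
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

On the worked example, `cosine(["warm", "summer", "night"], ["warm", "coffee"])` gives 1/√6 ≈ 0.408.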

2.4

A Novel Approach Based on Conditional Probabilities

We propose and experiment with a new model based on conditional probabilities. Like the vector space model, the proposed model disregards token order, but takes into account token frequency in the text. We calculate the conditional probability measure P(class_i | token_j) for each token in a test tweet. For each of the three classes, Positive, Neutral and Negative, we sum up the measure over the tweet's tokens:

    Score(class_i) = Σ_{j ∈ W_tweet} P(class_i | token_j)    (2.3)

where W_tweet is the list of tokens in the tweet and i ∈ {Positive, Neutral, Negative}. To preserve the frequency value of tokens, duplicate tokens were not eliminated.

We found, however, that it was necessary to normalize the score given the uneven distribution of the training tweets across classes. On average, 70% of the training tweets are classified as Positive, 20% as Neutral, and 10% as Negative, so it was more likely for a word to appear in the Positive class. This gave an unfair advantage to the Positive class score, which was rectified with the normalization:

    Normalized Score(class_i) = [Σ_{j ∈ W_tweet} P(class_i | token_j)] / P(class_i)    (2.4)

The conditional probabilities P(class_i | token_j) are calculated from the text of the training tweets:

    P(class_i | token_j) = (count of token_j in training tweets classified in class_i) / (count of token_j in the entire training corpus).

Tokens that occur in the test tweet but not in the training corpus do not contribute to the score of the test tweet. This scoring system quantitatively takes into account every token seen in training that appears in the test text. The class with the highest score is the assigned polarity of the test tweet.
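A compact sketch of the training and scoring just described (Equations 2.3 and 2.4). All names are ours, and unlike the report's implementation, which writes the probabilities out to file, this version keeps them in memory:

```python
from collections import Counter, defaultdict

CLASSES = ["Positive", "Neutral", "Negative"]

def train_cp(training):
    """Estimate P(class|token) and P(class) from (token_list, label) pairs."""
    tok_class = defaultdict(Counter)   # token -> {class: count}
    class_tweets = Counter()           # class -> number of training tweets
    for tokens, label in training:
        class_tweets[label] += 1
        for t in tokens:
            tok_class[t][label] += 1
    total = sum(class_tweets.values())
    p_class = {c: class_tweets[c] / total for c in CLASSES}
    p_class_given_tok = {
        t: {c: counts[c] / sum(counts.values()) for c in CLASSES}
        for t, counts in tok_class.items()
    }
    return p_class_given_tok, p_class

def classify_cp(tokens, p_class_given_tok, p_class):
    """Normalized Sum of Conditional Probabilities (Equation 2.4).

    Duplicate tokens are summed twice, preserving token frequency;
    unseen tokens contribute nothing.
    """
    scores = {}
    for c in CLASSES:
        s = sum(p_class_given_tok[t][c] for t in tokens if t in p_class_given_tok)
        scores[c] = s / p_class[c]     # normalize by the class prior
    return max(scores, key=scores.get)
```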

2.4.1

Variants on the Normalized Conditional Probability-Based Score

Instead of a score that is the plain sum of conditional probabilities, we also selected schemes that summed a function of the conditional probabilities.

Normalization by Probability of a Token

    Score(class_i) = Σ_{j ∈ W_tweet} P(class_i | token_j) / P(token_j)    (2.5)

    Normalized Score(class_i) = [Σ_{j ∈ W_tweet} P(class_i | token_j) / P(token_j)] / P(class_i)    (2.6)

This was an attempt to emulate the tf-idf concept from information retrieval, inversely weighting the conditional probability of a class given a token by the overall probability of the token.

Normalization by Logarithm

    Score(class_i) = Σ_{j ∈ W_tweet} P(class_i | token_j) / log₂ P(token_j)    (2.7)

    Normalized Score(class_i) = [Σ_{j ∈ W_tweet} P(class_i | token_j) / log₂ P(token_j)] / log₂ P(class_i)    (2.8)

Though the equations stated use log₂, any base is equivalent: logarithms are monotonically increasing functions, and converting a logarithm from base a to base b multiplies every value by the constant factor log_b(a), which has no effect on the relative sizes of the scores.

Normalization by Exponentiation

    Score(class_i) = Σ_{j ∈ W_tweet} exp P(class_i | token_j)    (2.9)

    Normalized Score(class_i) = [Σ_{j ∈ W_tweet} exp P(class_i | token_j)] / exp P(class_i)    (2.10)

The rationale behind these variations was to determine the most appropriate normalization scheme, based on a standard list of functional patterns.
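The four scoring schemes can be summarized in one illustrative helper that returns the unnormalized per-class sum, given aligned lists of P(class|token_j) and P(token_j) values (the interface and names are ours):

```python
import math

def variant_score(cond_probs, p_token, variant="plain"):
    """Per-class score before the class-prior normalization.

    cond_probs[j] = P(class|token_j), p_token[j] = P(token_j), aligned by index.
    """
    if variant == "plain":       # Equation 2.3: plain sum
        return sum(cond_probs)
    if variant == "inv_prob":    # Equation 2.5: divide by token probability
        return sum(p / q for p, q in zip(cond_probs, p_token))
    if variant == "inv_log":     # Equation 2.7: divide by log2 of token probability
        return sum(p / math.log2(q) for p, q in zip(cond_probs, p_token))
    if variant == "exp":         # Equation 2.9: exponentiate each term
        return sum(math.exp(p) for p in cond_probs)
    raise ValueError(variant)
```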

Chapter 3

Results and Discussion

In this section, we present and discuss our results. For our evaluation measures, we used Precision, Recall and the standard F-Measure with β = 1, reported for each class (P: Positive, E: Neutral, N: Negative) as well as overall (O).

3.1

Finding the Best Pre-processing Sequence

We used the Jaccard and Cosine Similarity Measures to select from a large range of pre-processing options. Tables 3.1 and 3.2 compare the various pre-processing options: Table 3.1 was derived using the Most Common Tweet (MCT) scheme, and Table 3.2 using the Top Tweet (TT) scheme. It is clear that the TT scheme outperforms the MCT scheme on every data set. However, it is not clear which data set is best suited for processing, as the data sets that perform well are not consistent between the two similarity measures.

The data sets are: NP (No Punctuation): all punctuation and emoticons removed from the texts; SP (Split Punctuation): all punctuation and emoticons split into characters of length one; APSE (All Punctuation Split Emoticon): punctuation retained in its original state; LPSE (Less Punctuation Split Emoticon): punctuation retained in its original state except for sequences of length one; EO (Emoticons Only): only matched emoticons retained, except those starting with special Twitter symbols; EOS (Emoticons Only Swapped): only matched emoticons retained.

We look at the Top Tweet values in Table 3.2 for the following comparisons. For the Jaccard Similarity Measure, the SP and LPSE data sets fared best, with a 0.004 advantage in overall F-Measure over the worst-performing data set EO, which had an F-Measure of 0.535, a difference of 0.748%. For the Cosine Similarity Measure, the EO data set had the best overall F-Measure of 0.532, a 2.31% difference over its worst-performing data set SP. Since our Sum of Conditional Probabilities model is more similar to the Cosine Similarity Measure, we present the results of our approach and its variants on the EO data set in Table 3.3.

3.2

Comparing Old and New Approaches and their Variants

CP stands for the conditional probabilities approach of Equation 2.4. CPN stands for the inverse-probability variant of Equation 2.6. CPL stands for the inverse-log-probability variant of Equation 2.8. CPE stands for the inverse-exponential-probability variant of Equation 2.10. NB represents the Naive Bayes Classifier. COS is the vanilla Cosine Similarity Measure. COSN is the Cosine Similarity Measure with each vector entry normalized by the frequency of the word in a class. COSL is the Cosine Similarity Measure with each vector entry normalized by the log-frequency of the word in a class (essentially the tf-idf concept). JAC is the Jaccard Similarity Measure.


Table 3.1: Using the Most Common Tweet scheme, with no stemming. Similarity Pre-processing Evaluation Measures Accuracy P NP SP Jaccard 12 EO EOS NP SP Cosine APSE LPSE EO EOS 0.487 0.475 0.470 0.433 0.454 0.470 0.477 0.466 0.595 0.564 0.557 0.475 0.514 0.556 0.574 0.546 0.337 0.351 0.342 0.375 0.365 0.344 0.338 0.348 0.348 0.364 0.370 0.381 0.389 0.372 0.363 0.376 0.427 0.426 0.423 0.410 0.423 0.424 0.425 0.423 0.658 0.667 0.678 0.686 0.687 0.680 0.670 0.675 0.389 0.378 0.360 0.343 0.357 0.362 0.375 0.369 0.214 0.210 0.208 0.189 0.201 0.207 0.209 0.204 0.421 0.418 0.415 0.406 0.415 0.416 0.418 0.416 0.625 0.611 0.612 0.561 0.588 0.612 0.618 0.604 0.361 0.364 0.351 0.358 0.361 0.353 0.356 0.358 0.265 0.266 0.266 0.252 0.265 0.266 0.265 0.264 0.417 0.414 0.410 0.391 0.405 0.410 0.413 0.409 APSE LPSE 0.485 0.470 0.473 0.482 0.577 0.534 0.540 0.572 E 0.353 0.378 0.374 0.357 Recall N 0.372 0.392 0.394 0.362 O 0.434 0.434 0.436 0.430 P 0.676 0.690 0.687 0.675 Precision E 0.383 0.377 0.379 0.378 N 0.218 0.209 0.212 0.214 O 0.4254 0.425 0.426 0.422 P 0.623 0.558 0.605 0.620 F-Measure E 0.368 0.325 0.377 0.367 N 0.275 0.223 0.276 0.269 O 0.422 0.417 0.419 0.419

Table 3.2: Using the Top Tweet scheme, with no stemming. Similarity Pre-processing Evaluation Measures Accuracy P NP SP Jaccard 13 EO EOS NP SP Cosine APSE LPSE EO EOS 0.606 0.604 0.597 0.588 0.594 0.597 0.600 0.595 0.733 0.721 0.714 0.699 0.703 0.710 0.715 0.708 0.457 0.465 0.460 0.459 0.471 0.469 0.468 0.463 0.395 0.409 0.400 0.395 0.401 0.400 0.399 0.402 0.528 0.531 0.525 0.518 0.525 0.526 0.527 0.524 0.690 0.694 0.691 0.687 0.692 0.692 0.692 0.691 0.479 0.481 0.470 0.457 0.464 0.470 0.473 0.471 0.466 0.449 0.440 0.427 0.441 0.444 0.448 0.430 0.545 0.541 0.534 0.524 0.532 0.535 0.538 0.531 0.711 0.707 0.702 0.693 0.697 0.700 0.703 0.699 0.468 0.473 0.465 0.458 0.468 0.470 0.471 0.467 0.428 0.428 0.419 0.410 0.420 0.421 0.422 0.415 0.535 0.536 0.529 0.520 0.528 0.530 0.532 0.527 APSE LPSE 0.605 0.606 0.604 0.606 0.725 0.721 0.717 0.723 E 0.461 0.474 0.475 0.469 Recall N 0.410 0.407 0.410 0.409 O 0.532 0.534 0.534 0.534 P 0.693 0.698 0.697 0.695 Precision E 0.481 0.480 0.478 0.481 N 0.456 0.454 0.454 0.458 O 0.544 0.544 0.543 0.545 P 0.709 0.709 0.707 0.709 F-Measure E 0.471 0.477 0.476 0.475 N 0.432 0.429 0.431 0.432 O 0.537 0.539 0.538 0.539

Table 3.3: Using the Top Tweet scheme, on the EO dataset, a comparison amongst methods. Stemming Method Evaluation Measures Accuracy P CP CPN CPL CPE Not Stemmed 14 Stemmed NB COS COSN COSL JAC CP CPN CPL CPE NB COS COSN COSL JAC 0.662 0.515 0.659 0.551 0.636 0.600 0.589 0.611 0.606 0.661 0.506 0.656 0.551 0.637 0.608 0.589 0.611 0.611 0.815 0.558 0.819 0.604 0.864 0.715 0.685 0.719 0.733 0.814 0.544 0.816 0.604 0.864 0.723 0.685 0.719 0.735 E 0.488 0.458 0.475 0.494 0.339 0.468 0.481 0.489 0.457 0.484 0.454 0.469 0.492 0.341 0.476 0.481 0.489 0.464 Recall N 0.396 0.459 0.379 0.448 0.306 0.399 0.420 0.420 0.395 0.398 0.455 0.382 0.452 0.307 0.409 0.420 0.420 0.411 O 0.566 0.491 0.558 0.515 0.503 0.527 0.529 0.543 0.528 0.565 0.484 0.555 0.516 0.504 0.536 0.529 0.543 0.536 P 0.702 0.705 0.695 0.714 0.655 0.692 0.697 0.706 0.690 0.701 0.700 0.693 0.714 0.655 0.696 0.697 0.706 0.695 Precision E 0.573 0.409 0.576 0.435 0.558 0.473 0.460 0.487 0.479 0.573 0.397 0.569 0.429 0.558 0.480 0.460 0.487 0.483 N 0.603 0.289 0.595 0.333 0.620 0.448 0.425 0.455 0.466 0.601 0.285 0.594 0.341 0.633 0.470 0.425 0.455 0.481 O 0.626 0.468 0.622 0.494 0.611 0.538 0.528 0.549 0.545 0.625 0.461 0.619 0.495 0.615 0.549 0.528 0.549 0.553 P 0.754 0.623 0.752 0.655 0.745 0.703 0.691 0.712 0.711 0.753 0.612 0.750 0.655 0.745 0.709 0.691 0.712 0.714 F-Measure E 0.527 0.432 0.520 0.463 0.422 0.471 0.470 0.488 0.468 0.525 0.424 0.514 0.459 0.423 0.478 0.470 0.488 0.473 N 0.478 0.355 0.463 0.382 0.409 0.422 0.423 0.437 0.428 0.479 0.350 0.465 0.389 0.413 0.437 0.423 0.437 0.443 O 0.586 0.470 0.579 0.500 0.526 0.532 0.528 0.546 0.535 0.586 0.462 0.576 0.501 0.527 0.541 0.528 0.546 0.544

Table 3.4: A Comparison between Our CP Approach and the Best-Performer in RepLab2013.

Approach     Accuracy   Recall   Precision   F-Measure
CP           0.662      0.566    0.626       0.586
RepLab2013   0.683      0.466    0.345       0.382

3.3

Discussion on Best Performing Methods

Our proposed Sum of Conditional Probabilities method outperformed all the other methods and their variants. It had an overall F-Measure of 0.586 for both the stemmed and unstemmed versions of the EO data set. This was a 1.21% increase over the second-best performing method, the inverse-log variant of the Sum of Conditional Probabilities, in the unstemmed section, and a 1.74% increase over the same second-best method in the stemmed section. Finally, we compare our best-performing approach with the best-performing approach in the RepLab 2013 competition. Though our accuracy falls below theirs by 3.17%, our recall and precision measures are significantly higher, by 21.5% and 81.4% respectively. As a result, our F-Measure shows a corresponding improvement of 53.4%.

3.4

The Effect of Stemming

Based on the results in Table 3.3, Lancaster stemming had neither a significant nor a consistent impact on the performance of the classifiers. Whilst it positively influenced some classifiers, namely CPE, COS and JAC, it had no impact on CP, NB, COSN and COSL, and a negative impact on CPN and CPL.

3.5

A Beneficial Complexity Analysis

The Sum of Conditional Probabilities method also provides a significant reduction in time complexity compared to the similarity measures. The similarity measures incur O(CT) time during processing, where C is the size of the training corpus and T is the size of the test corpus, since each test tweet has to be compared with every training tweet individually. The Sum of Conditional Probabilities method incurs only O(T) time, as the conditional probabilities of every token in each training class have already been written out to file in a separate pre-processing step. We observed a correspondingly large difference in running time during experimentation, as shown in Table 3.5.

Table 3.5: A Comparison of Time Complexity on the NP dataset.

Approach   Time
Jaccard    11 minutes
Cosine     1 hour
CP         4 minutes

3.6

A Note on the Performance of the Naive Bayes Classifier

The Naive Bayes Classifier performed relatively poorly, especially on the Neutral and Negative classes. This could be due to the highly skewed nature of the training set, which provides many more Positive examples than Neutral and Negative ones: words tend to appear in the Positive set simply by virtue of its size. This is nonetheless an unexpected anomaly and worth further investigation. A normalizing factor on the final score, for example, might improve performance.


Chapter 4

Conclusion
We set out to survey some existing methods of text classification on a relatively new corpus, and arrived at a simple but effective, platform-independent method for improving sentiment analysis.

4.1

Sum of Conditional Probabilities: A Novel Approach

Our Sum of Conditional Probabilities approach is simple, straightforward and effective. It also has a superior time complexity compared to the existing similarity measures. It quantitatively takes into account every token that appeared in the training corpus, and does not penalize tweets for being long, as the Naive Bayes (NB) Classifier does. It relies only on a training corpus, and is platform-independent. While its use of conditional probabilities bears many resemblances to the NB Classifier (we provide a complete analysis in Appendix B), it is not restricted by the strict conditional independence assumptions on tokens, which appear to be far from true given our NB results. It is an approach that may be adopted for future work in sentiment analysis beyond the Twitter corpus.


4.2
4.2.1

Other Contributions
Use of Tokens

We defined and used tokens instead of words, allowing for the fact that punctuation may carry polarity information.

4.2.2

A Lexicon from Training Data

We produced a file of tokens together with the probability of each class given the token's occurrence, which may be translated into a lexicon of positive, negative and neutral ratings for each token.

4.3

Future Work

One idea is to use Part-Of-Speech tagging to refine this method. Other similarity measures may also be surveyed, such as the Kullback-Leibler Divergence and Minimum Entropy measures. Conclusive guidance on which pre-processing technique works best with which subsequent processing method would also be valuable.


References
B. Pang and L. Lee. Opinion Mining and Sentiment Analysis. 2008.
B. Liu and L. Zhang. A Survey of Opinion Mining and Sentiment Analysis. 2013.
Y. Yang and X. Liu. A Re-Examination of Text Categorization Methods. 1999.


Appendix A

Algorithms
A.1 Tokenizing Non-Space-Separated Text

You have a dictionary of all valid words in your corpus and are given a line of text with no spaces, which you wish to separate into meaningful words. Attempt to match each word in the dictionary (in alphabetical order) against the line of text, starting from the first character. Once a match is found, repeat the procedure on the remaining part of the string (again starting from the beginning of the dictionary). Keep doing this until you reach the end of the input line. If no match can be found for the remaining characters, backtrack to the last matched position and, continuing from that point in the dictionary, try to find another word that matches some prefix of the remaining string. Without backtracking, the algorithm runs in O(DL) time, where D is the size of the dictionary and L is the length of the input line.
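A minimal recursive sketch of this backtracking procedure (function and parameter names are illustrative):

```python
def segment(line, dictionary):
    """Split a no-space line into dictionary words by backtracking search.

    Returns a list of words covering `line`, or None if no cover exists.
    """
    if not line:
        return []
    for word in sorted(dictionary):          # alphabetical order, as described
        if line.startswith(word):
            rest = segment(line[len(word):], dictionary)
            if rest is not None:             # backtrack and try the next word on failure
                return [word] + rest
    return None
```

For example, `segment("abc", {"a", "ab", "c"})` first tries `"a"`, fails on the remainder `"bc"`, backtracks, and succeeds with `["ab", "c"]`.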


Appendix B

Proofs
B.1 Proof of Relation with Naive Bayes Classifier

We begin by substituting Bayes' Theorem into the conditional probability with the class as posterior. Under the Naive Bayes conditional independence assumption,

    P(token_1 token_2 ... token_n | class_i)
      = P(token_1 | class_i) P(token_2 | class_i) ... P(token_n | class_i)
      = [P(class_i | token_1) P(token_1) / P(class_i)] × [P(class_i | token_2) P(token_2) / P(class_i)] × ...
      = Π_{j ∈ W_tweet} P(class_i | token_j) P(token_j) / P(class_i)    (B.1)

The Naive Bayes Classifier assigns the class with the maximum value of this expression. Taking logarithms,

    argmax_i P(token_1 token_2 ... token_n | class_i)
      = argmax_i Π_{j ∈ W_tweet} P(class_i | token_j) P(token_j) / P(class_i)
      = argmax_i [ Σ_{j ∈ W_tweet} log P(class_i | token_j) - |W_tweet| log P(class_i) + Σ_{j ∈ W_tweet} log P(token_j) ]
      = argmax_i [ Σ_{j ∈ W_tweet} log P(class_i | token_j) - |W_tweet| log P(class_i) ]    (B.2)

since the term Σ_{j ∈ W_tweet} log P(token_j) is the same for all classes. This last expression looks very similar to our Normalized Log variant of the Sum of Conditional Probabilities, except that rather than subtracting log P(class_i) for each j, we divide the sum by log P(class_i). On a feasibility note, we would not be able to compute log P(class_i | token_j) for many tokens, as the value of P(class_i | token_j) is often 0.0; to calculate the NB Classifier we therefore have to manipulate the original (unlogged) values.

