You are on page 1of 17

INFORMS 2012

c 2012 INFORMS | isbn 978-0-9843378-3-5 http://dx.doi.org/10.1287/educ.1120.0105

Mining Social Media: A Brief Introduction


Pritam Gundecha, Huan Liu
Arizona State University, Tempe, Arizona 85287 {pritam@asu.edu, huan.liu@asu.edu} Abstract The pervasive use of social media has generated unprecedented amounts of social data. Social media provides easily an accessible platform for users to share information. Mining social media has its potential to extract actionable patterns that can be benecial for business, users, and consumers. Social media data are vast, noisy, unstructured, and dynamic in nature, and thus novel challenges arise. This tutorial reviews the basics of data mining and social media, introduces representative research problems of mining social media, illustrates the application of data mining to social media using examples, and describes some projects of mining social media for humanitarian assistance and disaster relief for real-world applications.

Keywords social media; data mining; social data; social media mining; social networking sites; blogging; microblogging; crowdsourcing; HADR; privacy; trust

1. Introduction
Data mining research has successfully produced numerous methods, tools, and algorithms for handling large amounts of data to solve real-world problems. Traditional data mining has become an integral part of many application domains including bioinformatics, data warehousing, business intelligence, predictive analytics, and decision support systems. Primary objectives of the data mining process are to eectively handle large-scale data, extract actionable patterns, and gain insightful knowledge. Because social media is widely used for various purposes, vast amounts of user-generated data exist and can be made available for data mining. Data mining of social media can expand researchers capability of understanding new phenomena due to the use of social media and improve business intelligence to provide better services and develop innovative opportunities. For example, data mining techniques can help identify the inuential people in the vast blogosphere, detect implicit or hidden groups in a social networking site, sense user sentiments for proactive planning, develop recommendation systems for tasks ranging from buying specic products to making new friends, understand network evolution and changing entity relationships, protect user privacy and security, or build and strengthen trust among users or between users and entities. Mining social media is a burgeoning multidisciplinary area where researchers of different backgrounds can make important contributions that matter for social media research and development. The objective of this tutorial is to introduce social media, data mining, and their conuencemining social media. We attempt to achieve the goal by presenting representative and interesting research issues and important social media tasks based on our experience and research. This tutorial rst reviews data mining, social media and its types, and the importance of social media mining. In 2, we briey introduce representative issues in social media mining. In 3, we highlight the impact of social media mining using three examples based on our current research. Section 4 illustrates how social media mining is applied in some real-world applicationstwo projects on humanitarian assistance and disaster relief (HADR) carried out in the Data Mining and Machine Learning Laboratory (DMML) at Arizona State University (ASU). We conclude this tutorial in 5.
1

Gundecha and Liu: Mining Social Media: A Brief Introduction Tutorials in Operations Research, c 2012 INFORMS

1.1. Data Mining


Data mining (Tan et al. [55]) is a process of discovering useful or actionable knowledge in large-scale data. Data mining also means knowledge discovery from data (KDD) (Han et al. [24]), which describes the typical process of extracting useful information from raw data. The KDD process broadly consists of the following tasks: data preprocessing, data mining, and postprocessing. These steps need not be separate tasks and can be combined together. Data mining is an integral part of many related elds including statistics, machine learning, pattern recognition, database systems, visualization, data warehouse, and information retrieval (Han et al. [24]). Data mining algorithms are broadly classied into supervised, unsupervised, and semisupervised learning algorithms. Classication is a common example of supervised learning approach. For supervised learning algorithms, a given data set is typically divided into two parts: training and testing data sets with known class labels. Supervised algorithms build classication models from the training data and use the learned models for prediction. To evaluate a classication models performance, the model is applied to the test data to obtain classication accuracy. Typical supervised learning methods include decision tree induction, k -nearest neighbors, naive Bayes classication, and support vector machines. Unsupervised learning algorithms are designed for data without class labels. Clustering is a common example of unsupervised learning. For a given task, unsupervised learning algorithms build the model based on the similarity or dissimilarity between data objects. Similarity or dissimilarity between the data objects can be measured using proximity measures including Euclidean distance, Minkowski distance, and Mahalanobis distance. Other proximity measures such as simple matching coecient, Jaccard coecient, cosine similarity, and Pearsons correlation can also be used to calculate similarity or dissimilarity between the data objects. K -means, hierarchical clustering (agglomerative or partitional methods), and density-based clustering are typical examples of unsupervised learning. Semisupervised learning algorithms are most applicable where there exist small amounts of labeled data and large amounts of unlabeled data. Two typical types of semisupervised learning are semisupervised classication and semisupervised clustering. The former uses labeled data to make classication and unlabeled data to rene the classication boundaries further, and the latter uses labeled data to guide clustering. Cotraining is a representative semisupervised learning algorithm. Active learning algorithms allow users to play an active role in the learning process via labeling. Typically, users are domain experts and their skills are employed to label some data instances for which a machine learning algorithm are condent about its classication. Minimum marginal hyperplane and maximum curiosity are two popular active learning algorithms. Data mining includes other techniques such as association rule mining, anomaly detection, feature selection, instance selection, and visual analytics. Additional details related to these data mining techniques can be found in Han et al. [24], Tan et al. [55], Witten et al. [59], Zhao and Liu [63], and Liu and Motoda [37].

1.2. Social Media


Social media (Kaplan and Haenlein [28]) is dened as a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0 and that allow the creation and exchanges of user-generated content. Social media is conglomerate of dierent types of social media sites including traditional media such as newspaper, radio, and television and nontraditional media such as Facebook, Twitter, etc. Table 1 shows characteristics of dierent types of social media. Social media gives users an easy-to-use way to communicate and network with each other on an unprecedented scale and at rates unseen in traditional media. The popularity of social media continues to grow exponentially, resulting in an evolution of social networks, blogs,

Gundecha and Liu: Mining Social Media: A Brief Introduction Tutorials in Operations Research, c 2012 INFORMS

Table 1. Characteristics of dierent types of social media. Type Online social networking Characteristics Online social networks are Web-based services that allow individuals and communities to connect with real-world friends and acquaintances online. Users interact with each other through status updates, comments, media sharing, messages, etc. (e.g., Facebook, Myspace, LinkedIn). A blog is a journal-like website for users, aka bloggers, to contribute textual and multimedia content, arranged in reverse chronological order. Blogs are generally maintained by an individual or by a community (e.g., Hungton Post, Business Insider, Engadget). Microblogs can be considered same a blogs but with limited content (e.g., Twitter, Tumblr, Plurk). A wiki is a collaborative editing environment that allow multiple users to develop Web pages (e.g., Wikipedia, Wikitravel, Wikihow). Social news refers to the sharing and selection of news stories and articles by community of users (e.g., Digg, Slashdot, Reddit). Social bookmarking sites allow users to bookmark Web content for storage, organization, and sharing (e.g., Delicious, StumbleUpon). Media sharing is an umbrella term that refers to the sharing of variety of media on the Web including video, audio, and photo (e.g., YouTube, Flickr, UstreamTV). The primary function of such sites is to collect and publish usersubmitted content in the form of subjective commentary on existing products, services, entertainment, businesses, places, etc. Some of these sites also provide products reviews (e.g., Epinions, Yelp, Cnet). These sites provide a platform for users seeking advice, guidance, or knowledge to ask questions. Other users from the community can answer these questions based on previous experiences, personal opinions, or relevent research. Answers are generally judged using ratings and comments (e.g., Yahoo! answers, WikiAnswers).

Blogging

Microblogging Wikis

Social news Social bookmarking

Media sharing

Opinion, reviews, and ratings

Answers

microblogs, location-based social networks (LBSNs), wikis, social bookmarking applications, social news, media (text, photo, audio, and video) sharing, product and business review sites, etc. Facebook,1 the social networking site, recorded more than 845 million active users as of December 2011. This number suggests that China (approximately 1.3 billion) and India (approximately 1.1 billion) are the only two countries in the world that have larger populations than Facebook. Facebook and Twitter have accrued more than 1.2 billion users,2 more than thrice the population of the United States and more than the population of any continent except Asia.

1.3. Mining Social Media


Vast amounts of user-generated content are created on social media sites every day. This trend is likely to continue with exponentially more content in the future. Hence, it is critical for producers, consumers, and service providers to gure out management and utility of massive user-generated data. Social media growth is driven by these challenges: (1) How
1 2

http://newsroom.fb.com/content/default.aspx?NewsAreaId=22 (accessed March 2012). Facebook and Twitter have many overlapping users. Hence, 1.2 billion users are not unique.

Gundecha and Liu: Mining Social Media: A Brief Introduction Tutorials in Operations Research, c 2012 INFORMS

can a user be heard? (2) Which source of information should a user use? (3) How can user experience be improved? Answers to these questions are hidden in the social media data. These challenges present ample opportunities for data miners to develop new algorithms and methods for social media. Data generated on social media sites are dierent from conventional attribute-value data for classic data mining. Social media data are largely user-generated content on social media sites. Social media data are vast, noisy, distributed, unstructured, and dynamic. These characteristics pose challenges to data mining tasks to invent new ecient techniques and algorithms. For example, Facebook3 and Twitter4 report Web trac data from approximately 149 million and 90 million unique U.S. visitors per month, respectively. According to the video sharing site YouTube,5 more than 4 billion videos are viewed per day, and 60 hours of videos are uploaded every minute. The picture sharing site Flickr,6 as of August 2011, hosts more than 6 billion photo images. Web-based, collaborative, and multilingual Wikipedia7 hosts over 20 million articles attracting over 365 million readers. Depending on social media platforms, social media data can often be very noisy. Removing the noise from the data is essential before performing eective mining. Researchers notice that spammers (Yardi et al. [61], Chu et al. [12]) generate more data than legitimate users. Social media data are distributed because there is no central authority that maintains data from all social media sites. Distributed social media data pose a daunting task for researchers to understand the information ows on the social media. Social media data are often unstructured. To make meaningful observations based on unstructured data from various data sources is a big challenge. For example, social media sites like LinkedIn, Facebook, and Flickr serve dierent purposes and meet dierent needs of users. Social media sites are dynamic and continuously evolving. For example, Facebook recently brought about many concepts including a users timeline, the creation of in-groups for a user, and numerous user privacy policy changes. The dynamic nature of social media data is a signicant challenge for continuously and speedily evolving social media sites. There are many additional interesting questions related to human behavior can be studied using social media data. Social media can also help advertisers to nd the inuential people to maximize the reach of their products within an advertising budget. Social media can help sociologists to uncover the human behavior such as in-group and out-group behaviors of users. Recently, social media was reported to play an instrumental role in facilitating mass movements such as the Arab Spring8 and Occupy Wall Street.9

2. Issues in Mining Social Media


In this section, we introduce some representative research issues in mining social media.

2.1. Community Analysis


A community is formed by individuals such that those within a group interact with each other more frequently than with those outside the group. Based on the context, a community is also referred to as a group, cluster, cohesive subgroup, or module. Communities can be observed via connections in social media because social media allows people to expand social networks online (Tang and Liu [56]). Social media enables people to connect friends and
3 4 5 6 7 8 9

http://www.quantcast.com/facebook.com (accessed March 2012). http://www.quantcast.com/twitter.com (accessed March 2012). http://www.youtube.com/t/press statistics (accessed March 2012). http://news.softpedia.com/news/Flickr-Boasts-6-Billion-Photo-Uploads-215380.shtml (accessed March 2012). http://en.wikipedia.org/wiki/Wikipedia (accessed March 2012). http://en.wikipedia.org/wiki/Arab Spring (accessed March 2012). http://en.wikipedia.org/wiki/Occupy Wall Street (accessed March 2012).

Gundecha and Liu: Mining Social Media: A Brief Introduction Tutorials in Operations Research, c 2012 INFORMS

nd new users of similar interests. Communities found in social media are broadly classied into explicit and implicit groups. Explicit groups are formed by user subscriptions, whereas implicit groups emerge naturally through interactions. Community analysts are generally faced with issues such as community detection, formation, and evolution. Community detection often refers to the extraction of implicit groups in a network. The main challenges of community detections are that (1) the denition of a community can be subjective, and (2) the lack of ground truth makes community evaluation dicult. Tang and Liu [56] divided community detection methods into four categories: (1) node-centric community detection, where each node satises certain properties such as complete mutuality, reachability, node degrees, frequency of within and outside ties, etc. (examples include cliques, k -cliques, and k -clubs); (2) group-centric community detection, where a group needs to satisfy certain properties (for example, minimum group densities); (3) network-centric community detection, where groups are formed based on partition of network into disjoint sets (examples are spectral clustering and modularity maximization); and (4) hierarchycentric community detection, where the goal is to build a hierarchical structure of communities. This allows the analysis of a network with dierent resolutions. Representative methods are divisive clustering and agglomerative clustering. Social media networks are highly dynamic. Communities can expand, shrink, or dissolve in dynamic networks. Community evolution aims to discover the patterns of a community over time with the presence of dynamic network interactions. Backstrom et al. [8] found that the more friends you have in a group, the more likely you are to join, and communities with cliques grow more slowly than those that are not tightly connected.

2.2. Sentiment Analysis and Opinion Mining


Sentiment analysis and opinion mining aim to automatically extract opinions expressed in the user-generated content. Sentiment analysis and opinion mining tools allow businesses to understand product sentiments, brand perception, new product perception, and reputation management. These tools help users to perceive product opinions or sentiments on a global scale. There are many social media sites reporting user opinions of products in many dierent formats. Monitoring these opinions related to a particular company or product on social media sites is a new challenge. Sentiment analysis is hard because languages used to create contents are ambiguous. Major steps of sentiment analysis are (1) nding relevant documents, (2) nding relevant sections, (3) nding the overall sentiment, (4) quantifying the sentiment, and (5) aggregating all sentiments to form an overview. Basic components of an opinion are (1) an object on which opinion is expressed, (2) an opinion expressed on a object, and (3) the opinion holder. Objects are generally represented as a nite set of features, where each feature represents a nite set of synonymous words or phrases. Opinion mining tasks can be performed at the document level (Turney [58], Pang et al. [49]), sentence level (Rilo and Wiebe [51], Yu and Hatzivassiloglou [62]), or feature level (Hu and Liu [25], Popescu and Etzioni [50], Liu and Maes [36]). Extracting opinions expressed in comparative sentences is a dicult task, and some preliminary work can be found in Jindal and Liu [26], Jindal and Liu [27], Liu [35], and Pang and Lee [48]. Performance evaluation of sentiment analysis is another challenge because of the lack of ground truth.

2.3. Social Recommendation


Traditional recommendation systems attempt to recommend items based on aggregated ratings of objects from users or past purchase histories of users. A social recommendation system makes use of users social network and related information in addition to the traditional recommendation means. Social recommendation is based on the hypothesis that people who are socially connected are more likely to share the same or similar interests

Gundecha and Liu: Mining Social Media: A Brief Introduction Tutorials in Operations Research, c 2012 INFORMS

(homophily), and users can be easily inuenced by the friends they trust and prefer their friends recommendations to random recommendations. Objectives of social recommendation systems are to improve the quality of recommendation and alleviate the problem of information overload. Examples of social recommendation systems are book recommendations based on friends reading lists on Amazon or friend recommendations on Twitter and Facebook. More details on social recommendation systems can be found in Konstas et al. [30], Ma et al. [38], and Backstrom and Leskovec [7].

2.4. Inuence Modeling


Social scientists have been exploring inuence and homophily in social networks for quite some time (McPherson et al. [42], Lazarsfeld and Merton [34]). It is important to know whether the underlying social network is inuence driven or homophily driven. For example, in the advertisement industry, if the social network is inuence driven, then the inuential users should be identied and incentivized to promote the product or services to the members of the social network. However, if the social network is homophily (similarity) driven, then some individual users should be directly targeted to promote sales. Most social networks have a mixture of both homophily and inuence. Hence, distinguishing them is a challenge. Aral et al. [5] and La Fond and Neville [33] gave details on distinguishing social inuence and homophily in social networks. Kempe et al. [29] studied the inuence maximization problem. For a given information propagation model, inuence maximization aims to identify the set of initial inuential users from a given snapshot of a social network such that they can inuence the maximum number of other users within given budget constraints. Agarwal et al. [3, 4] presented a preliminary model to identify inuential bloggers in a community. The blogosphere obeys a power law distribution with a few blogs being extremely inuential and a huge number of blogs being largely unknown. They reported that active bloggers are not necessarily inuential and proposed eective inuence measures to identify inuential bloggers. A more general discussion on modeling and data mining in the blogosphere is given by Agarwal and Liu [2].

2.5. Information Diusion and Provenance


Researchers study how information diuses and explore dierent models of information diusion, including the independent cascade model, the threshold model, the susceptible infected model, and the susceptibleinfectedrecovered model. Many such models have been studied (Bailey [10], Granovetter [21], Mahajan [40], Macy [39], Berger [11]). Researchers apply these models to analyze the spread of rumors, computer viruses, and diseases during outbreaks. Two important problems from the social media viewpoint are (1) how information spreads in a social media network and which factors aect the spread, and (2) what plausible sources are, given some information from social media. The rst problem of information diusion has received good attention from researchers. The second problem is still an open research probleminformation provenance in social mediaand recognized as a key issue to dierentiate rumors from truth. Because social media data are distributed and dynamic, the conventional techniques used in classical provenance research cannot be directly applied in social media.

2.6. Privacy, Security, and Trust


The low barrier to and pervasive use of social media give rise to concerns on user privacy and security issues. New challenges arise due to users opposing needs: on one hand, a user would like to have as many friends and share as much as possible, and on the other hand, a user would like to be as private as possible when needed. However, being gregarious requires openness and transparency, but being private constricts ones sharing. In addition, a social networking site has its business needs to encourage users to easily nd each other and expand

Gundecha and Liu: Mining Social Media: A Brief Introduction Tutorials in Operations Research, c 2012 INFORMS

their friendship networks as widely as possiblein other words, to be open. Hence, social media poses new security challenges to fend o security threats to users and organizations. With the variety of personal information disclosed in user proles (e.g., information about other users and user networks may be indirectly accessible), individuals may put themselves and members of their social networks at risk for a variety of attacks. Social media has been the target of numerous passive as well as active attacks including stalking, cyberbullying, malvertizing, phishing, social spamming, scamming, and clickjacking. Gross and Acquisti [22] showed that only a few users change the default privacy preferences on Facebook. In some cases, user proles are completely public, making information available and providing a communication mechanism to anyone who wants to access it. It is no secret that when a prole is made public, malicious users including stalkers, spammers, and hackers can use sensitive information for their personal gain. Sometimes malevolent users can even cause physical or emotional distress to other users (Rosenblum [52]). Narayanan and Shmatikov [45, 46] demonstrated how users privacy can be weakened if an attacker knows the presence of connections among users. Wondracek et al. [60] presented a successful scheme to breach privacy by exploiting only the group membership information of users. Liu and Maes [36] pointed out a lack of privacy awareness and found a large number of social network proles in which people described themselves with a rich vocabulary in terms of their passions and interests. Krishnamurthy and Wills [31] discussed the problem of leakage of personally identiable information and how it can be misused by third parties (Narayanan and Shmatikov [46]). Squicciarini et al. [53] introduced a novel collective privacy mechanism for better managing shared content between users. Fang and LeFevre [14] focused on helping users to understand simple privacy settings, but did not consider additional problems such as attribute inference (Zheleva and Getoor [64]) or shared data ownership (Squicciarini et al. [53]). Zheleva and Getoor [64] showed how an adversary can exploit an online social network with a mixture of public and private user proles to predict the private attributes of users. Baden et al. [9] presented a framework where users dictate who may access their information based on publicprivate encryptiondecryption algorithms. Social trust depends on many factors that cannot be easily modeled in a computational system. Many dierent versions of denition of trust are proposed in the literature (Deutsch [13], Sztompka [54], Mui et al. [44], Olmedilla et al. [47], Grandison and Sloman [20], Artz and Gil [6]). A highly cited thesis on trust computation (Marsh [41]) provides theoretical perspectives of modeling trust, but its complex nature makes it very dicult to apply, especially to social networks (Golbeck and Hendler [18]). Trust between any two people is observed to be aected by many factors including past experiences, opinions expressed and actions taken, contributions to spreading rumors, inuence by others opinions, and motives to gain something extra. Another important aspect of trust is the trustworthiness of user-generated content. Moturu and Liu [43] provided an intuitive scoring measure to quantify the trustworthiness of health-related user-generated content in social media.

3. Illustrative Examples of Mining Social Media


In this section, we present some examples in our research to illustrate how to mine social media data to address novel research issues. The rst example is about assessing user vulnerability on a social networking sites to maintain user privacy. The second example explores the importance of social and historical ties on a location-based social network. The third example introduces a method that can take advantage of multifaceted trust for predictive analytics.

Gundecha and Liu: Mining Social Media: A Brief Introduction Tutorials in Operations Research, c 2012 INFORMS

3.1. Assessing User Vulnerability on a Social Networking Site (Gundecha et al. [23])
Attribute-value data are a principal data form in social media. Attributes available for every user on a social networking site can be categorized into two major types: individual attributes and community attributes. Individual attributes are those attributes that contain individual user information. Individual attributes include personal information such as gender, birth date, phone number, home address, etc. Community attributes are those attributes that contain information about the friends of a user. Community attributes include friends that are traceable from a users prole (i.e., a users friends list), tagged pictures, and wall interactions. Using the privacy and security settings of a prole, a user can control the visibility of most individual attributes but has limited control over the visibility of most community attributes. For example, Facebook users these days can control photo tagging and the sharing of their friend list with the public but still can not control friends sharing their friend lists or uploading photos of them from their proles to the public. Our earlier work (Gundecha et al. [23]) used a large-scale Facebook data set to assess attribute visibility, which can be used to obtain general behavior of Facebook users. For example, Facebook users do not usually disclose their mobile phone number. Hence, users that do disclose phone numbers have a propensity to vulnerability because they disclose more sensitive information in their proles. A large portion of users are either not careful or not aware of consequences of their actions on the privacy information of their friends. Thus, protecting community attributes is important in protecting user privacy. A novel mechanism was proposed by Gundecha et al. [23] to enable users to protect against vulnerability. Often users on a social networking site are unaware that they could pose a threat to their friends because of their vulnerability. Gundecha et al. [23] showed that it is feasible to measure a users vulnerability based on three factors: (1) the users privacy settings that can reveal personal information, (2) the users action on a social networking site that can expose his or her friends personal information, and (3) friends actions on a social networking site that can reveal the users personal information. Based on these factors, Gundecha et al. [23] formally presented one of the earliest models for vulnerability reduction. They proposed a four-step procedure to estimate user vulnerability: (1) estimate risk to privacy due to individual attributes (referred to as the I -index), (2) estimate risk to privacy due to community attributes (referred to as the C -index), (3) estimate the visibility of a user based on the I -index and C -index (referred to as the P -index), and (4) estimate the user vulnerability based on the P -indexes of a user and his one-hop friends (referred to as the V -index). Figures 1(a) and 1(b) show the relationship between the I -index and C -index as well as that between the P -index and V -index for 100,000 randomly chosen Facebook users. Note that users are sorted in ascending order of their I -indexes and P -indexes. The x-axis and y -axis show users and their index values, respectively. A users vulnerable friend is dened as a friend whose unfriending will lower the V -index score of a user. This denition of a vulnerable friend can be generalized to multiple vulnerable friends. Figure 2 shows a comparison of the V -index values of each user before and after the unfriending of the users k most vulnerable friends. For each graph in Figure 2, the x-axis and y -axis indicate users and their V -index values, respectively. Without loss of generality, we sorted all users, in the ascending order based on the existing V -index values, before we plotted the graphs in Figure 2. We ran the experiment on 300,000 randomly selected users of the Facebook data set. Figure 2 shows the performance comparison of V -index values for each user before (+) and after () unfriending the k most vulnerable friends. We ran the experiments for dierent values of k including 1, 2, 10, and 50. As expected, vulnerability decreases consistently as the value of k increases, as seen in Figures 2(a)2(d).

Gundecha and Liu: Mining Social Media: A Brief Introduction Tutorials in Operations Research, c 2012 INFORMS

Figure 1. Relationship among index values for each user.


(a) I -index and Cindex for each user
1.0 C-index 0.8 I-index 0.8 1.0 V -index P -index

(b) P -index and Vindex for each user

Index values

0.6 0.4 0.2 0 0 2.5 5.0 7.5 10

Index values

0.6 0.4 0.2 0 0 2.5 5.0 7.5 10

Users

104

Users

104

3.2. Exploring SocialHistorical Ties on Location-Based Social Networks (Gao et al. [16])
LBSNs have been a popular form of social media in recent years. They provide locationrelated services that allow users to check in at geographical locations and share such experience with their friends. Millions of check-in records in LBSNs contain rich social and geographical information and provide a unique opportunity for researchers to study users
Figure 2. Performance comparison of V -index values for each user before (+) and after () unfriending the k most vulnerable friends from his or her social network.
(a) Most vulnerable
M 1-index 0.6 V-index 0.6

(b) 2 most vulnerable


M 2-index V-index

Index values

0.4

Index values

0.4

0.2

0.2

0 0 1 2 3

0 0 1 2 3

Users (c) 10 most vulnerable


M 10-index 0.6 V-index

105

Users (d) 50 most vulnerable


M 50-index 0.6 V-index

105

Index values

0.4

Index values

0.4

0.2

0.2

0 0 1 2 3

0 0 1 2 3

Users

105

Users

105

10

Gundecha and Liu: Mining Social Media: A Brief Introduction Tutorials in Operations Research, c 2012 INFORMS

social behavior from a spatialtemporal aspect, which in turn enables a variety of services including place advertisement or recommendation, trac forecasting, and disaster relief. To understand a users check-in behavior, it is inevitable to perform a historical analysis of users. It is because the historical check-ins provide rich information about a users interests and hints about when and where a particular user would like to go. In addition, social correlation theory suggests to consider users social ties, because human movement is usually aected by their social events, such as visiting friends, going out with colleagues, and so on. These two relationship ties can shape the users check-in experience on LBSNs, and each tie gives rise to a dierent probability of check-in activity, which indicates that people in dierent spatialtemporalsocial circles have dierent interactions. The historical ties of a users check-in behavior have two properties on LBSNs. First, a users check-in history approximately follows a power-law distribution; i.e., a user goes to a few places many times and to many places a few times. Second, the historical ties have a short-term eect. Taking advantage of the similarity between language modeling and location-based social network mining, the work of Gao et al. [16] introduced the PitmanYor process to location-based social networks to model the historical ties of a user i for his checkin behavior cn+1 = l at time (n + 1) and location l, specically, the power-law distribution and short-term eect of historical ties, denoted as historical model (HM) as shown below:
i, i i PH (cn+1 = l) = PHP Y (cn+1 = l | u, tul , d|u| , r|u| , tu ),

where u, tul , d|u| , r|u| , and tu are parameters. A socialhistorical model (SHM) is proposed to explore user is check-in behavior integrating both of the social and historical eects:
i i i PSH (cn+1 = l) = PH (cn+1 = l) + (1 )PS (cn+1 = l),

where
i PS (cn+1 = l) = uj N (ui ) i, j sim(ui , uj )PHP Y (cn+1 = l),

where N indicates user is set of neighbors. The experiments with location prediction on a real-world LBSN compare the proposed methods (HM and SHM) with some baseline methods. The results are plotted in Figure 3, demonstrating that the proposed methods best model users check-ins in terms of location prediction; in other words, social and historical ties can help location prediction. This work nds that a users friends can inuence his next location because users that have shared friends tend to go to similar locations than those without. The power-law property and short-term eect are observed in historical ties; thus, a historical model is introduced to capture these properties. The experimental results on location prediction demonstrate that the proposed approach is suitable in capturing a users check-in property and outperforms current models.

3.3. Discerning Multifaceted Trust in a Connected World (Tang et al. [57])


The issue of trust has attracted increasing attention from the community of social media research. Trust, as a social concept, naturally has multiple facets, indicating multiple and heterogeneous trust relationships between users. Here is a multifaceted trust example from Epinions. Figure 4(a) shows single trust relationships between user 1 and his 20 friends. Here, we can see that user 7 is the more trustable for user 1. Figures 4(b) and 4(c) show their multifaceted trust relationships in the categories home and garden and restaurants, respectively. For the category home and garden, user 7 is not necessary the most trusted

Gundecha and Liu: Mining Social Media: A Brief Introduction Tutorials in Operations Research, c 2012 INFORMS

11

Figure 3. The performance comparison of prediction models demonstrates that by considering social and historical ties, the proposed models can help location prediction.
0.40 MFC MFT Order-1 Order-2 HM SHM

0.35

Prediction accuracy

0.30

0.25

0.20

0.15 10 20 30 40 50 60 70 80 90

Fraction of training set


Note. MFC, most frequent check-in model; MFT, most frequent time model; Order-1, order-1 Markov model; Order-2, order-2 Markov model.

friend of user 1. This shows that trust relationships in dierent categories vary. Thus, people trust others dierently in dierent facets. There are two challenges to study in obtaining multifaceted trust between users: rst, the representation of multiple and heterogeneous trust relationships between users, and second, estimating the strength of multifaceted trust. Traditionally, trust is represented by an adjacency matrix. However, this cannot capture the multifaceted trust relations. Tang et al. [57] developed a new algorithm, mTrust, that extends a matrix representation to a tensor representation, adding an extra dimension for facet description. Previous work observed a strong correlation between trust and user similarity in the context of rating systems. Therefore, it is reasonable to embed trust strength inference in rating prediction. Thus, to evaluate the usefulness of multifaced trust, this work embeds the multifaceted trust inference in the framework of rating prediction. Interesting ndings from the experiments are that (1) more than 20% of reciprocal links are heterogeneous, (2) more than 14% transitive trust relations are heterogeneous, and (3) more
Figure 4. Single trust and multifaceted trust relationships of one user in Epinions.
(a) Single trust
5 4 3 2 6 7 8 9 10 2 3 4

(b) Trust in home and garden


5 6 7 8 9 10 11 1 12 21 20 19 Pajek 18 15 17 16 Pajek 14 21 13 2

(c) Trust in restaurants


5 4 3 6 7 8 9 10 1 11 12 13 20 14 19 18 17 16 Pajek 15

1 21

11 12 13

20 14 19 18 17 16 15

Note. The thickness of a line indicates the level of trust.

12

Gundecha and Liu: Mining Social Media: A Brief Introduction Tutorials in Operations Research, c 2012 INFORMS

than 11% of cocitation trust relations are heterogeneous. With these ndings, mTrust can be applied to many online tasks such as improving rating prediction, enabling facet-sensitive ranking, and making status theory applicable to reciprocal links.

4. Employing Social Media in Real-World Applications


Social media has been increasingly used in a wide range of domains; examples include political campaigns (e.g., presidential elections), mass movements (e.g., organizing Occupy Wall Street movements, Arab Spring), as well as disaster and crisis response and relief coordination. In this section, we show how social media mining is used for HADR. Government and nongovernmental organizations (NGOs) have been faced with challenges to eectively respond to crises such as natural disasters (e.g., tsunamis, hurricanes, earthquakes). Providing ecient relief to the victims is a primary focus in saving lives, minimizing further losses in the aftermath, and accelerating recovery. Crowdsourcing tools via social media such as Twitter, Ushahidi, and Sahana have proven useful in gathering information during and after a crisis. However, they are designed to go beyond crowdsourcing. Additional capabilities are required for response coordination, secured collaboration, and trust building among relief organizations. Goolsby [19] pointed out the need for such a social media system and described the eort to build it to allow dierent organizations to share crowdsourced and groupsourced information, and analyze and visualize the processed information for intelligent decision making. In this section, we describe two social media prototype systems designed to facilitate ecient collaboration among disparate organizations for eective and coordinated responses to crises.

4.1. ASU Coordination Tracker (Gao et al. [17])


When natural disasters occur, the international community would join forces to provide disaster relief and humanitarian assistance. Some prominent examples include the Haiti earthquake and the subsequent cholera outbreak, and the devastating earthquake and tsunami in Japan. Social media has revolutionized the use of traditional media and played an important role in these events as an information collector and disseminator, and as a communication and collaboration tool. As one of the most important functions of social media, crowdsourcing is a collaborative information sharing mechanism based on the principle of collective wisdom. Wikipedia10 is a perfect example of crowdsourcing, where people collectively publish information on various topics. Crowdsourcing is capable of leveraging participatory social media services and tools to collect information, and it allows crowds to participate in various HADR tasks. Its integration with crisis maps has been a very eective crowdsourcing application in HADR eorts. One of the earliest such systems was Alive in Afghanistan,11 where people in Afghanistan submitted their reports on accidents and terrorist attacks across Afghanistan. Although crowdsourcing applications can provide accurate and timely information about a crisis for decision making, current crowdsourcing applications still fall short in supporting disaster relief eorts (Gao et al. [15]). Most importantly, current applications do not provide a common mechanism specically designed for collaboration and coordination between disparate relief organizations. For example, relief organizations that work independently can cause conicts and complicate the relief eorts. If independent organizations duplicate a response, it will draw resources away from the other areas in need that could use the duplicated supplies. It can also result in delayed response to other disaster areas. Furthermore, because of the noisy and chaotic nature of crowdsourced data, current crowdsourcing applications cannot provide readily useful information for disaster relief eorts.
10 11

http://www.wikipedia.org/ (accessed March 2012). http://aliveinafghanistan.org/ (accessed March 2012).

Gundecha and Liu: Mining Social Media: A Brief Introduction Tutorials in Operations Research, c 2012 INFORMS

13

To address these shortfalls of crowdsourcing to a certain extent, an online coordination system, the ASU Coordination Tracker (ACT), was devised. The ACT is an event response coordination system with the primary goal of facilitating multiorganization response (military, governments, NGOs, etc.) to an event such as a disaster and providing relief organizations the means for better collaboration and coordination during a crisis. It is a speedy and eective approach with easy, open communication and streamlined coordination. The major features and advantages of the ACT are summarized below: leveraging crowdsourcing information to provide the means for a groupsourcing response for organizations to assist persons aected by a crisis eectively, developing novel strategies to analyze the requests and maximize the coordinate eorts of crowdsourced data for disaster relief, visualizing the requests to give a global view of the request distribution to facilitate responders, and increasing the coordination eciency by optimally responding to requests. The ACT consists of ve functional modules: request collection, request analysis, response, coordination, and situation awareness. Raw requests from users are collected via crowdsourcing and groupsourcing methods. The request analysis module takes advantage of both data mining technology and expert knowledge to iteratively capture the essential content of raw requests. Based on the demand and disaster background, raw requests are analyzed and classied into various categories. Essential contents extracted from the raw data are then stored in a requests pool. The system visualizes the requests pool on its crisis map, with its selective visibility bonded to the access levels of users (organizations). The response module is designed to allow relief organizations to contribute, receive, and evaluate dierent response options and plan response actions. Relief organizations are able to respond to requests through the crisis map directly. The coordination model uses the interagency concept (Goolsby [19]) to avoid response conicts while maintaining centralized control. The ACT displays every available request on the crisis map. Each request is in one of the following four states: available, in process, in delivery, or delivered. To avoid conicts, relief organizations are not able to respond to requests that are being fullled by another organization. A statistics module runs in the background to help track relief progress for situation awareness, relief strategy adjustments, and further decision making. The ACT is a developing system that enables coordination among organizations during disaster relief. We are investing approaches to improve collaboration eciency and provide dierential security to relief organizations and decision makers. The disaster relief ASU Crisis Response Game (Abbasi et al. [1]) has demonstrated that by both leveraging crowdsourcing information and providing means for a groupsourcing response, organizations can eectively assist victims.

4.2. TweetTracker (Kumar et al. [32])


TweetTracker12 is a Twitter-based analytic and visualization tool. The focus of the tool is to help HADR relief organizations to acquire situational awareness during disasters and emergencies to aid disaster relief eorts (Kumar et al. [32]). New social media platforms, such as Twitter microblogs, demonstrate their value and cability to provide information that is not easily attainable from traditional media. For example, during the Mumbai blasts of 2011,13 rsthand information from the aected region was available on Twitter moments after the blast. TweetTracker is designed to help to track, analyze, review, and monitor tweets. This is achieved through near-real-time tracking of tweets with specic keywords/hashtags and tweets generated from the region aected by the crisis. The tool supports monitoring and
12 13

http://tweettracker.fulton.asu.edu/ (accessed March 2012). http://ibnlive.in.com/news/mumbai-blasts-twitter-joins-hands-to-help/167345-3.html (accessed March 2012).

14

Gundecha and Liu: Mining Social Media: A Brief Introduction Tutorials in Operations Research, c 2012 INFORMS

analysis of the collected tweets via real-time trending, data reduction, historical review, and integrated data mining techniques. TweetTracker consists of three main components: (1) a Twitter stream reader, (2) a data storage module, and (3) a data mining and visualization module. The Twitter stream reader is a data collection module that continually crawls tweets through the Twitter streaming API (Application Programming Interface).14 Tweets are ltered based on user-specied keywords, hashtags, and geolocations. The data storage module is responsible for storing and indexing the collected tweets into a relational database for use by the visualization module. The data mining and visualization module is a Web-based user interface to the collected tweets and a means to analyze the collected tweets. It provides geospatial visualization of tweets related to a particular event on a map, summarizes the tweets, and visualizes the trending keywords in the form of a word cloud, and it can identify popular resources (URLs) and users mentioned in the tweets. The tool also includes built-in language translation support for monitoring of multilingual tweets. TweetTracker has been used in tracking, visualizing, and analyzing activities including the Arab Spring movement, the Occupy Wall Street movement, and various natural disasters such as earthquakes and cholera outbreaks.

5. Summary
Valuable information is hidden in vast amounts of social media data, presenting ample opportunities social media mining to discover actionable knowledge that is otherwise dicult to nd. Social media data are vast, noisy, distributed, unstructured, and dynamic, which poses novel challenges for data mining. In this tutorial, we oer a brief introduction to mining social media, use illustrative examples to show that burgeoning social media mining is spearheading the social media research, and demonstrate its invaluable contributions to real-world applications. As a main type of big data, social media is nding its many innovative uses, such as political campaigns, job applications, business promotion and networking, and customer services, and using and mining social media is reshaping business models, accelerating viral marketing, and enabling the rapid growth of various grassroots communities. It also helps in trend analysis and sales prediction. Social media data will continue their rapid growth in the foreseeable future. We are faced with an increasing demand for new algorithms and social media mining tools. Existing preliminary success in social media mining research eorts convincingly demonstrates the promising future of the emerging social media mining community and will help to expand research and development and explore online and o-line human behavior and interaction patterns. Acknowledgments
The authors thank Huiji Gao, Shamanth Kumar, Jiliang Tang, and DMML members for their assistance and feedback in preparing this tutorial. Some projects described in this brief introductory survey were sponsored by the Oce of Naval Research [ONR N000141010091]; the Army Research Oce [ARO 025071]; and the National Science Foundation [Grant 0812551].

References
[1] M. A. Abbasi, S. Kumar, J. A. Andrade Filho, and H. Liu. Lessons learned in using social media for disaster reliefASU Crisis Response Game. Proceedings of the International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction. Springer-Verlag, Berlin, 282289, 2012. [2] N. Agarwal and H. Liu. Modeling and Data Mining in Blogosphere. Morgan & Claypool Publishers, San Rafael, CA, 2009.
14

https://dev.twitter.com/docs/streaming-api (accessed March 2012).

Gundecha and Liu: Mining Social Media: A Brief Introduction Tutorials in Operations Research, c 2012 INFORMS

15

[3] N. Agarwal, H. Liu, L. Tang, and P. S. Yu. Identifying the inuential bloggers in a community. Proceedings of the International Conference on Web Search and Web Data Mining. Association for Computing Machinery, New York, 207218, 2008. [4] N. Agarwal, H. Liu, L. Tang, and P. S. Yu. Modeling blogger inuence in a community. Social Network Analysis and Mining 2(2):139162, 2012. [5] S. Aral, L. Muchnik, and A. Sundararajan. Distinguishing inuence-based contagion from homophily-driven diusion in dynamic networks. Proceedings of the National Academy of Sciences of the United States of America 106(51):21544, 2009. [6] D. Artz and Y. Gil. A survey of trust in computer science and the Semantic Web. Web Semantics: Science, Services and Agents on the World Wide Web 5(2):5871, 2007. [7] L. Backstrom and J. Leskovec. Supervised random walks: Predicting and recommending links in social networks. Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. Association for Computing Machinery, New York, 635644, 2011. [8] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: Membership, growth, and evolution. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, 4454, 2006. [9] R. Baden, A. Bender, N. Spring, B. Bhattacharjee, and D. Starin. Persona: An online social network with user-dened privacy. ACM SIGCOMM Computer Communication Review 39(4):135146, 2009. [10] N. T. J. Bailey. The Mathematical Theory of Infectious Diseases and Its Applications. Charles Grin, High Wycombe, UK, 1975. [11] E. Berger. Dynamic monopolies of constant size. Journal of Combinatorial Theory, Series B 83(2):191200, 2001. [12] Z. Chu, S. Gianvecchio, H. Wang, and S. Jajodia. Who is tweeting on Twitter: Human, bot, or cyborg? Proceedings of the 26th Annual Computer Security Applications Conference. Association for Computing Machinery, New York, 2130, 2010. [13] M. Deutsch. Cooperation and trust: Some theoretical notes. Nebraska Symposium on Motivation. University of Nebraska Press, Lincoln, 1962. [14] L. Fang and K. LeFevre. Privacy wizards for social networking sites. Proceedings of the 19th International Conference on World Wide Web. Association for Computing Machinery, New York, 351360, 2010. [15] H. Gao, G. Barbier, and R. Goolsby. Harnessing the crowdsourcing power of social media for disaster relief. IEEE Intelligent Systems 26(3):1014, 2011. [16] H. Gao, J. Tang, and H. Liu. Exploring social-historical ties on location-based social networks. Proceedings of the 6th International AAAI Conference on Weblogs and Social Media. Association for the Advancement of Articial Intelligence, Palo Alto, CA, 2012. [17] H. Gao, X. Wang, G. Barbier, and H. Liu. Promoting coordination for disaster relief: From crowdsourcing to coordination. Proceedings of the 4th Conference on Social Computing, Behavioral-Cultural Modeling and Prediction. Springer-Verlag, Berlin, 197204, 2011. [18] J. Golbeck and J. Hendler. Inferring binary trust relationships in web-based social networks. ACM Transactions on Internet Technology 6(4):497529, 2006. [19] R. Goolsby. Social media as crisis platform: The future of community maps/crisis maps. ACM Transactions on Intelligent Systems and Technology 1(1):111, 2010. [20] T. Grandison and M. Sloman. A survey of trust in Internet applications. IEEE Communications Surveys & Tutorials 3(4):216, 2009. [21] M. Granovetter. Threshold models of collective behavior. American Journal of Sociology 83(6):14201443, 1978. [22] R. Gross and A. Acquisti. Information revelation and privacy in online social networks. Proceedings of the 2005 ACM Workshop on Privacy in the Electronic Society. Association for Computing Machinery, New York, 7180, 2005. [23] P. Gundecha, G. Barbier, and H. Liu. Exploiting vulnerability to secure user privacy on a social networking site. The 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, 2011.

16

Gundecha and Liu: Mining Social Media: A Brief Introduction Tutorials in Operations Research, c 2012 INFORMS

[24] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, 2011. [25] M. Hu and B. Liu. Mining and summarizing customer reviews. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, 168177, 2004. [26] N. Jindal and B. Liu. Identifying comparative sentences in text documents. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, New York, 244251, 2006. [27] N. Jindal and B. Liu. Opinion spam and analysis. Proceedings of the International Conference on Web Search and Web Data Mining. Association for Computing Machinery, New York, 219230, 2008. [28] A. M. Kaplan and M. Haenlein. Users of the world, unite! The challenges and opportunities of social media. Business Horizons 53(1):5968, 2010. Tardos. Maximizing the spread of inuence through a social [29] D. Kempe, J. Kleinberg, and E. network. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, 137146, 2003. [30] I. Konstas, V. Stathopoulos, and J. M. Jose. On social networks and collaborative recommendation. Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, New York, 195202, 2009. [31] B. Krishnamurthy and C. E. Wills. On the leakage of personally identiable information via online social networks. ACM SIGCOMM Computer Communication Review 40(1):112117, 2010. [32] S. Kumar, R. Zafarani, and H. Liu. Understanding user migration patterns across social media. Twenty-Fifth International Conference on Articial Intelligence. Association for the Advancement of Articial Intelligence, Palo Alto, CA, 2011. [33] T. La Fond and J. Neville. Randomization tests for distinguishing social inuence and homophily eects. Proceedings of the 19th International Conference on World Wide Web. Association for Computing Machinery, New York, 601610, 2010. [34] P. F. Lazarsfeld and R. K. Merton. Friendship as a social process: A substantive and methodological analysis. Freedom and Control in Modern Society 18:1866, 1954. [35] B. Liu. Sentiment analysis and subjectivity. Handbook of Natural Language Processing. CRC Press, Boca Raton, FL, 627666, 2010. [36] H. Liu and P. Maes. InterestMap: Harvesting social network proles for recommendations. Workshop: Beyond Personalization, San Diego, 2005. [37] H. Liu and H. Motoda. Computational Methods of Feature Selection. Chapman & Hall, Boca Raton, FL, 2008. [38] H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King. Recommender systems with social regularization. Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. Association for Computing Machinery, New York, 287296, 2011. [39] M. W. Macy. Chains of cooperation: Threshold eects in collective action. American Sociological Review 56(6):730747, 1991. [40] V. Mahajan, E. Muller, and F. M. Bass. New product diusion models in marketing: A review and directions for research. Journal of Marketing 54(1):126, 1990. [41] S. P. Marsh. Formalising trust as a computational concept. Ph.D. thesis, Deptartment of Computing Science and Mathematics, University of Stirling, Stirling, UK, 1994. [42] M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology 27:415444, 2001. [43] S. T. Moturu and H. Liu. Quantifying the trustworthiness of social media content. Distributed and Parallel Databases 29(3):239260, 2011. [44] L. Mui, M. Mohtashemi, and A. Halberstadt. A computational model of trust and reputation for E-businesses. Proceedings of the 35th Annual Hawaii Conference on System Sciences (HICSS02). IEEE Computer Society, Washington, DC, 24312439, 2002.

Gundecha and Liu: Mining Social Media: A Brief Introduction Tutorials in Operations Research, c 2012 INFORMS

17

[45] A. Narayanan and V. Shmatikov. Robust de-anonymization of large sparse datasets. Proceedings of the 2008 IEEE Symposium on Security and Privacy. IEEE Computer Society, Washington, DC, 111-125, 2008. [46] A. Narayanan and V. Shmatikov. De-anonymizing social networks. Proceedings of the 2009 IEEE Symposium on Security and Privacy. IEEE Computer Society, Washington, DC, 173187, 2009. [47] D. Olmedilla, O. Rana, B. Matthews, and W. Nejdl. Security and trust issues in semantic grids. Proceedings of the Dagsthul Seminar, Semantic Grid: The Convergence of Technologies 5271:896902, 2005. [48] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends Information Retrieval 2(12):1135, 2008. in

[49] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classication using machine learning techniques. Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Vol. 10. Association for Computational Linguistics, Stroudsburg, PA, 7986, 2002. [50] A. M. Popescu and O. Etzioni. Extracting product features and opinions from reviews. Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Stroudsburg, PA, 339346, 2005. [51] E. Rilo and J. Wiebe. Learning extraction patterns for subjective expressions. Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Stroudsburg, PA, 105112, 2003. [52] D. Rosenblum. What anyone can know: The privacy risks of social networking sites. IEEE Security and Privacy 5(3):4049, 2007. [53] A. C. Squicciarini, M. Shehab, and F. Paci. Collective privacy management in social networks. Proceedings of the 18th International Conference on World Wide Web. Association for Computing Machinery, New York, 521530, 2009. [54] P. Sztompka. Trust: A Sociological Theory. Cambridge University Press, Cambridge, UK, 1999. [55] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Pearson Addison Wesley, Boston, 2006. [56] L. Tang and H. Liu. Community Detection and Mining in Social Media, Vol. 2. Morgan & Claypool Publishers, San Rafael, CA, 2010. [57] J. Tang, H. Gao, and H. Liu. mTrust: Discerning multi-faceted trust in a connected world. Proceedings of the Fifth ACM International Conference on Web Search and Data Mining. Association for Computing Machinery, New York, 93102, 2012. [58] P. D. Turney. Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classication of reviews. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, 417424, 2002. [59] I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2011. [60] G. Wondracek, T. Holz, E. Kirda, and C. Kruegel. A practical attack to de-anonymize social network users. Proceedings of the 2010 IEEE Symposium on Security and Privacy. IEEE Computer Society, Washington, DC, 223238, 2010. [61] S. Yardi, D. Romero, G. Schoenebeck, and D. Boyd. Detecting spam in a Twitter network. First Monday 15(1):14, 2009. [62] H. Yu and V. Hatzivassiloglou. Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Stroudsburg, PA, 129136, 2003. [63] Z. A. Zhao and H. Liu. Spectral Feature Selection for Data Mining. Chapman & Hall/CRC Press, Virginia Beach, VA, 2012. [64] E. Zheleva and L. Getoor. To join or not to join: the illusion of privacy in social networks with mixed public and private user proles. Proceedings of the 18th International Conference on World Wide Web. Association for Computing Machinery, New York, 531540, 2009.