A Unified Model for Stable and Temporal Topic Detection from Social Media Data
Hongzhi Yin, Bin Cui, Hua Lu, Yuxin Huang and Junjie Yao
Peking University; Aalborg University

Outline
- Motivation
- Problem Formulation
- A Basic Solution
  - A User-Temporal Mixture Model
- Enhancement of the Basic Solution
  - Regularization Technique
  - Burst-Weighted Boosting
- Experiments
- Q/A

Motivation
- Two different types of topics are mixed together on social media platforms such as Twitter, Weibo and Delicious.
- Temporal topics are temporally coherent, meaningful themes. They are time-sensitive and usually concern popular real-life events or hot spots, i.e., breaking events in the real world.
- Stable topics usually concern users' regular interests and their daily routine discussions, e.g., their moods and statuses.

One Example in Twitter
- Temporal topic: Dead pigs in Shanghai
- Stable topic: Big Data

Another Example in Twitter
- Temporal topic: Independence Day
- Stable topic: Animal Adoption

We can tell temporal and stable topics apart from their temporal distributions and their descriptive words.

Motivation (Cont.)
- Discovering different topics of events that are coherent in temporal space
  - Detecting bursty events, such as disasters (e.g., earthquakes), politics (e.g., elections), and public events (e.g., the Olympics)
  - Analyzing topic trends
- Extracting stable topics that are coherent in user-interest space
  - Finding users' intrinsic interests and better modeling user preferences

Problem Formulation
- A user-time-associated document d is a text document associated with a time stamp and a user.
- A temporal topic is a temporally coherent theme; in other words, words that emerge close together in time are clustered into one topic.
- Example: given a collection of user-time-associated tweets, the desired temporal topics are the events happening at different times.
- Formally, each topic, whether temporal ($x$) or stable ($z$), is represented by a word distribution over the vocabulary $V$, e.g. $\{P(w \mid z)\}_{w \in V}$, where $\sum_{w \in V} P(w \mid z) = 1$.

Problem Formulation (Cont.)
- A topic distribution in the time dimension is the distribution of topics given a specific time interval. Formally, $P(x \mid t)$ is the probability of temporal topic $x$ given time interval $t$.
- A topic distribution in user space is the distribution of topics given a specific user. Formally, $P(z \mid u)$ is the probability of stable topic $z$ given user $u$.

Problem Formulation (Cont.)
- A User-Time-Keyword Matrix M is a hyper-matrix whose three dimensions refer to user, time and keyword. A cell $M_{u,t,w}$ stores the frequency of word w generated by user u within time interval t.
- Given a collection of user-time-associated documents C, we first formulate matrix M, and then address two tasks:
  - Task 1: Detecting Temporal Topics
  - Task 2: Extracting Stable Topics

Problem Formulation (Cont.)
- Detecting a set of temporal topics that are event-driven
  - Detecting bursty events, such as disasters (e.g., earthquakes), politics (e.g., elections), and public events (e.g., the Olympics)
  - Analyzing topic trends
- Extracting a set of stable topics that are interest-driven
  - Finding users' intrinsic interests and better modeling user preferences

A User-Time Mixture Model
Main Insights
- To find both temporal and stable topics in a unified manner, we propose a topic model that simultaneously captures two observations:
  - Words generated around the same time are more likely to have the same event-driven temporal topic.
  - Words generated by the same user are more likely to have the same interest-driven stable topic.
- The former helps find event-driven temporal topics, while the latter helps identify interest-driven stable topics.

Combine user and time information
- We assume that when a user u generates a word w at time t, he/she is probably influenced by two factors: the breaking news/events occurring at time t and his/her intrinsic interests.
- Breaking events are modeled by temporal topics, and user intrinsic interests are modeled by stable topics.
- The likelihood that user u generates word w at time t is as follows:
  $$P(w \mid u, t) = \lambda_U \sum_{z} P(w \mid z)\, P(z \mid u) + \lambda_T \sum_{x} P(w \mid x)\, P(x \mid t)$$
- Parameters $\lambda_U$ and $\lambda_T$ are mixing weights controlling the choice of motivation factor; they also denote the proportions of stable topics and temporal topics in the dataset. It is worth mentioning that they are learnt automatically instead of being fixed.

Parameter Estimation
- The log-likelihood of the whole user-time-associated document collection C is
  $$L(C) = \sum_{u} \sum_{t} \sum_{w} M_{u,t,w} \log P(w \mid u, t)$$
- We use the E-M algorithm to estimate the model parameters:
  - E-step: compute the expectation.
  - M-step: maximize; the solution has a closed form.
- Please refer to Section 4.2 for the details of the E-M algorithm.

Spatial Regularization
Intuitions
- If two users are connected in the social network space, they are more likely to share the same or similar interests/topics.
- A topic is interest-coherent if the people who are interested in this topic are also close in the network space.
- [Figure: a user marked "?" whose network neighbors are all labeled DB] More likely to be a DB person or an IR person?
- Intuition: users' interests are similar to those of their neighbors.

Spatial Regularization
Topic Model With Spatial Regularization
- A regularized data likelihood is defined by combining the data log-likelihood with a spatial regularizer.
- The spatial regularizer plays the role of spatial smoothing for user interests.

Parameter Estimation
- Maximize, using Newton-Raphson.
- Smooth using the spatial regularizer: in each iteration, a user's interest is smoothed by his/her spatial neighbors.

Insights
- In topic models, words with a high occurrence rate, i.e., popular words, have a high probability of appearing at the top positions in each discovered topic.
- These popular words are mostly general words denoting abstract concepts. In stable topics, they can illustrate the domain of a topic at first glance.
- However, in temporal topics, words with notable bursty features are superior in expressing temporal information, since users are more interested in bursty words than in abstract concepts when browsing temporal topics.

Example: Michael Jackson's Death
- In this temporal topic, we expect the bursty words “mj”, “michael jackson” and “moonwalk” to become the dominant words rather than the general words “world”, “news” and “death”.
- But these general words cannot be removed as stop words
  since they can help illustrate the stable topics.

Burst-Weighted Boosting
- We implement a bursty boosting step to escalate the probability of these bursty words during the detection of temporal topics.
- We first compute the bursty degree of each word in each time interval (Yao et al. ICDE 2010).
- A boosting step is then taken after every few E-M iterations.
- In this step, a word w has its generation probability boosted in a temporal topic only if w's bursty period overlaps with that of the topic.

Data Sets
- Twitter data set (Mar. 2009 to Oct. 2009)
- Delicious data set (Feb. 2008 to Dec. 2009)
- Sina Weibo (2011)

Data Sets
- Twitter: People on this platform often discuss social events and their daily life. The data set contains 9,884,640 tweets posted by 456,024 users in the period of Mar. 2009 to Oct. 2009. Each user in this data set published at least 200 posts. We first removed all the stop words.
- Delicious: Delicious is a collaborative tagging system on which users can upload and tag web pages. We collected 200,000 users and their tagging behaviors from the period of Feb. 2008 to Dec. 2009. The data set contains 7,103,622 tags. Topics on technology and electronics account for more than half of the tags. Breaking news is also present.

Compared Methods
- Our models
  - BUT is the basic model.
  - EUTS is the model enhanced with spatial regularization.
  - EUTB is the model enhanced with both spatial regularization and burst-weighted boosting.
- PLSA Model on Time Slices (Mei et al. KDD'05)
- Individual Detection Method (Wang et al. KDD'07)
- Topic Over Time Model (TOT) (Wang et al. KDD'06)
- TimeUserLDA (Diao et al. ACL'12)

Time Stamp Prediction Comparison

Topic Quality Comparison
- Excellent: a nicely presented temporal topic
- Good: a topic containing bursty features
- Poor: a topic without obvious bursty features

Stable Topics Detected in Delicious
Temporal Topics Detected in Delicious
Stable Topics Detected in Twitter
Temporal Topics Detected in Twitter
Stable Topics (Sina Weibo)
Temporal Topics (Sina Weibo)
Temporal Topic Trends Analysis

Thank You!
Any questions?
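
Supplementary sketch for the Parameter Estimation slides. The slides defer the actual E-M updates to Section 4.2, so the code below is only a minimal illustration of E-M for a two-factor mixture of this kind, not the authors' implementation. It assumes the likelihood P(w | u, t) = λ_U Σ_z P(w | z) P(z | u) + λ_T Σ_x P(w | x) P(x | t), stores the nonzero cells of the User-Time-Keyword matrix M in a dictionary, and leaves out spatial regularization and burst-weighted boosting. The names `fit_user_time_mixture`, `lam_u` and the toy `counts` are hypothetical.

```python
import math
import random


def normalize(vec):
    """Scale a non-negative vector so that it sums to 1 (uniform if all zero)."""
    total = sum(vec)
    if total <= 0:
        return [1.0 / len(vec)] * len(vec)
    return [v / total for v in vec]


def random_dist(n, rng):
    """Random strictly positive probability vector of length n."""
    return normalize([rng.random() + 0.1 for _ in range(n)])


def fit_user_time_mixture(counts, n_stable, n_temporal, n_iter=50, seed=0):
    """E-M for the assumed mixture; counts maps (user, time, word) -> frequency."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in counts})
    times = sorted({t for _, t, _ in counts})
    words = sorted({w for _, _, w in counts})
    w_idx = {w: i for i, w in enumerate(words)}
    n_w = len(words)
    p_z_u = {u: random_dist(n_stable, rng) for u in users}    # P(z | u)
    p_x_t = {t: random_dist(n_temporal, rng) for t in times}  # P(x | t)
    p_w_z = [random_dist(n_w, rng) for _ in range(n_stable)]    # P(w | z)
    p_w_x = [random_dist(n_w, rng) for _ in range(n_temporal)]  # P(w | x)
    lam_u = 0.5  # weight of the user-interest factor; lambda_T = 1 - lam_u
    loglik_history = []
    for _ in range(n_iter):
        acc_z_u = {u: [0.0] * n_stable for u in users}
        acc_x_t = {t: [0.0] * n_temporal for t in times}
        acc_w_z = [[0.0] * n_w for _ in range(n_stable)]
        acc_w_x = [[0.0] * n_w for _ in range(n_temporal)]
        mass_u = total = loglik = 0.0
        for (u, t, w), n in counts.items():
            i = w_idx[w]
            # E-step: joint probability of every (factor, topic) explanation.
            s = [lam_u * p_z_u[u][z] * p_w_z[z][i] for z in range(n_stable)]
            e = [(1.0 - lam_u) * p_x_t[t][x] * p_w_x[x][i]
                 for x in range(n_temporal)]
            norm = sum(s) + sum(e)
            loglik += n * math.log(norm)  # log-likelihood under current parameters
            for z in range(n_stable):
                r = n * s[z] / norm  # expected count explained by stable topic z
                acc_z_u[u][z] += r
                acc_w_z[z][i] += r
                mass_u += r
            for x in range(n_temporal):
                r = n * e[x] / norm  # expected count explained by temporal topic x
                acc_x_t[t][x] += r
                acc_w_x[x][i] += r
            total += n
        # M-step: closed-form re-estimation from the expected counts.
        p_z_u = {u: normalize(v) for u, v in acc_z_u.items()}
        p_x_t = {t: normalize(v) for t, v in acc_x_t.items()}
        p_w_z = [normalize(v) for v in acc_w_z]
        p_w_x = [normalize(v) for v in acc_w_x]
        lam_u = mass_u / total
        loglik_history.append(loglik)
    return {"p_z_u": p_z_u, "p_x_t": p_x_t, "p_w_z": p_w_z, "p_w_x": p_w_x,
            "lam_u": lam_u, "words": words, "loglik": loglik_history}


# Toy data (made up for illustration): two users, three time intervals.
counts = {
    ("alice", 0, "database"): 3, ("alice", 1, "database"): 3,
    ("alice", 1, "earthquake"): 2, ("alice", 2, "query"): 2,
    ("bob", 0, "music"): 3, ("bob", 1, "earthquake"): 3,
    ("bob", 1, "music"): 1, ("bob", 2, "guitar"): 2,
}
model = fit_user_time_mixture(counts, n_stable=2, n_temporal=2, n_iter=30)
print("lambda_U:", round(model["lam_u"], 3))
```

Because the mixing weight is re-estimated in the M-step from the expected counts, it is learnt rather than fixed, which mirrors the remark on the mixture-model slide. The recorded log-likelihood values can be used to check convergence.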
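
Supplementary sketch for the Spatial Regularization slides. The slides state only that, in each iteration, a user's interest is smoothed by his/her spatial neighbors; the exact update is not shown. The code below therefore assumes one simple form of such smoothing, a weighted average between a user's own topic distribution and the mean distribution of the neighbors. The function name `smooth_user_interests`, the weight `gamma`, and the toy network are hypothetical and are not taken from the slides.

```python
def smooth_user_interests(p_z_u, neighbors, gamma=0.5):
    """Return a smoothed copy of P(z | u) for every user.

    p_z_u:     dict user -> list of stable-topic probabilities
    neighbors: dict user -> iterable of neighboring users in the network
    gamma:     assumed smoothing weight in [0, 1]; 0 leaves interests unchanged
    """
    smoothed = {}
    for user, dist in p_z_u.items():
        nbrs = [v for v in neighbors.get(user, ()) if v in p_z_u]
        if not nbrs:
            # Isolated users keep their own estimate.
            smoothed[user] = list(dist)
            continue
        # Mean topic distribution of the user's neighbors.
        mean_nbr = [sum(p_z_u[v][z] for v in nbrs) / len(nbrs)
                    for z in range(len(dist))]
        # A convex combination of two distributions is again a distribution.
        smoothed[user] = [(1.0 - gamma) * own + gamma * nb
                          for own, nb in zip(dist, mean_nbr)]
    return smoothed


# Toy version of the DB/IR figure: topics are ordered [DB, IR].
interests = {
    "unknown": [0.5, 0.5],
    "db1": [0.9, 0.1],
    "db2": [0.9, 0.1],
    "db3": [0.9, 0.1],
}
network = {"unknown": ["db1", "db2", "db3"]}
result = smooth_user_interests(interests, network, gamma=0.5)
print(result["unknown"])  # the DB probability moves from 0.5 toward 0.9
```

In a full pipeline, a step like this would be applied to the user-topic distributions between E-M iterations, so that the undecided user in the figure drifts toward the DB interest of the surrounding users.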