presentation - SNOW Workshop

One Day in Twitter:
Topic Detection Via Joint Complexity
Gérard Burnside¹, Dimitris Milioris¹,² and Philippe Jacquet¹
¹ Bell Labs, Alcatel-Lucent, France
² École Polytechnique ParisTech
Snow Challenge @ WWW 2014
Overview
Motivation & Challenges
I-Complexity
Joint Complexity
Theoretical Background
Snow Challenge Dataset
Topic Detection
Headlines
Keywords
Media URLs
Benefits
Conclusions
Future work
Motivation

Online social media services have seen a huge expansion:

The value of information has increased dramatically

Interactions and communication between users help predict the
evolution of information

The ability to study Social Networks can provide relevant info in real
time
Challenges
The study of social networks poses several research challenges:
• Searching in social media is still an open problem
  - posts are short and arrive in tremendous quantity in real time
• Information on the correlation between groups of users
  - predict media consumption, network resources, traffic
  - improve QoS
• Analyzing the relationships between members of a group/community
  - reveal important teams
• Spam and advertisement detection
  - continuously growing amount of irrelevant information
I – Complexity

X is a sequence and I(X) is the set of its factors (distinct substrings)

Example: X = apple, then:
I(X) = {a, p, l, e, ap, pp, pl, le, app, ppl, ple, appl, pple, apple, v}

|I(X)| is the complexity of a sequence

|I(X)| = 15 (v denotes the empty string)
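
The count above can be checked by brute force. The following is a minimal sketch, not the authors' implementation; the function name `factors` is chosen here for illustration, and enumerating every substring takes quadratic time, which is acceptable only for short strings such as tweets.

```python
def factors(x):
    """Return I(x): every distinct substring of x, plus the empty string."""
    subs = {x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1)}
    subs.add("")  # the slide counts the empty string as a factor
    return subs


print(len(factors("apple")))  # 15
```

The set comprehension removes duplicates automatically, which is why the repeated letter "p" is counted only once.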
Joint Complexity [1]

The information contained in a string may be revealed by
comparing with a reference string

The Joint Complexity is the number of common distinct
factors in two sequences

J(X, Y) = |I(X) ∩ I(Y)|

An efficient way to estimate the degree of similarity of two sequences

The decomposition of a sequence into its factors is done with
suffix trees

A simple, fast and low-complexity method to store factors and recall them from memory
[1] P. Jacquet, D. Milioris and W. Szpankowski, “Classification of Markov Sources Through Joint String
Complexity: Theory and Experiments”, in IEEE International Symposium on Information Theory
(ISIT’13), Istanbul, Turkey, July 2013.
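
The definition J(X, Y) = |I(X) ∩ I(Y)| can be illustrated directly with set intersection. This is a brute-force sketch under the assumption that the empty string is counted as a common factor, as in the I-Complexity example; the names `factors` and `joint_complexity` are illustrative and do not come from the authors' code, which relies on suffix trees instead.

```python
def factors(x):
    """Return I(x): every distinct substring of x, plus the empty string."""
    subs = {x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1)}
    subs.add("")
    return subs


def joint_complexity(x, y):
    """Return J(x, y) = |I(x) & I(y)|, the number of common distinct factors."""
    return len(factors(x) & factors(y))


print(joint_complexity("apple", "maple"))  # 9
```

The nine common factors of "apple" and "maple" are a, p, l, e, ap, pl, le, ple and the empty string.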
Suffix Trees Superposition
JC(apple, maple) = 9

Figure: Suffix tree superposition of X = apple and Y = maple
It reveals the common factors of X and Y, and gives a similarity metric
Time to build a suffix tree = O(n log n)
Space in memory = O(n), where n is the length of the tweet
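
The idea of superposing two trees can be sketched with a naive, uncompressed suffix trie built from nested dictionaries. This is an assumption made for brevity: a trie needs quadratic space and time, unlike the suffix tree with the bounds quoted above, but the counting principle is the same, since every node reachable in both structures corresponds to one common factor. All function names are illustrative.

```python
def suffix_trie(s):
    """Build a naive suffix trie: each node is a dict mapping a character to a child."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def common_nodes(a, b):
    """Count nodes present in both tries, including the shared root (empty string)."""
    return 1 + sum(common_nodes(a[ch], b[ch]) for ch in a if ch in b)


def joint_complexity(x, y):
    """Superpose the two tries and count the nodes they share."""
    return common_nodes(suffix_trie(x), suffix_trie(y))


print(joint_complexity("apple", "maple"))  # 9
```

The recursion depth is bounded by the length of the shorter string plus one, so it stays small for tweet-sized inputs.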
Theoretical Background [2]




JC is expected to be in $n^{\kappa}$, with $\kappa < 2$

In the presence of quasi-duplicates, JC is in $n^{2}$

When the topics are the same, $JC = \frac{2 \log 2}{h}\, n$, where $h$ is the entropy of
the source.
Used to verify the thresholds Thrlow and Thrmax

[2] D. Milioris and P. Jacquet, “Joint Sequence Complexity Analysis: Application to Social Networks
Information Flow”, in Bell Laboratories Technical Journal, Issue on Data Analytics, Vol. 18, No. 4,
2014. (DOI: 10.1002/bltj.21647).
Snow Data Challenge

Tweets were collected for 24 hours:
  - between Tue Feb. 25, 18:00 and Wed Feb. 26, 18:00 (GMT)
  - by following 556,295 users, and
  - also looking for specific keywords (Syria; terror; Ukraine; bitcoin)

Total tweets: 1,041,062

N = 96 timeslots (new timeslot = every 15 minutes)

Challenge: Provide one or more (max 10) different topics
per timeslot (headline, set of keywords, Media URLs)
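
Assigning a tweet to one of the 96 timeslots is simple arithmetic on its timestamp. The sketch below assumes the year 2014 (inferred from the venue, since the slide gives only the day and month) and assumes that slots are numbered from 0; `timeslot_index` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Start of the collection window; the year is an assumption based on WWW 2014.
START = datetime(2014, 2, 25, 18, 0, tzinfo=timezone.utc)
SLOT = timedelta(minutes=15)
N_SLOTS = 96  # 24 hours / 15 minutes


def timeslot_index(created_at):
    """Return the slot number (0..95) of a timestamp, or None outside the window."""
    offset = created_at - START
    if offset < timedelta(0) or offset >= N_SLOTS * SLOT:
        return None
    return offset // SLOT  # timedelta // timedelta yields an int


print(timeslot_index(START + timedelta(hours=1, minutes=20)))  # 5
```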
Topic Detection

Timeslot representation via connected weighted graphs

Each tweet is a node in the graph and an adjacency matrix
(triangular) holds the weight (JC) of every edge
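
A triangular adjacency matrix for one timeslot could be filled as follows. This is a sketch only: it uses the brute-force factor sets from the definition rather than suffix trees, it stores the weights in the upper triangle (the slide does not say which half is used), and the names `factors` and `jc_matrix` are illustrative.

```python
def factors(x):
    """Return every distinct substring of x, plus the empty string."""
    subs = {x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1)}
    subs.add("")
    return subs


def jc_matrix(tweets):
    """Return an n x n matrix whose upper triangle holds JC(tweets[i], tweets[j])."""
    n = len(tweets)
    sets = [factors(t) for t in tweets]  # compute each factor set only once
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # JC is symmetric, so one half is enough
            matrix[i][j] = len(sets[i] & sets[j])
    return matrix


m = jc_matrix(["apple", "maple", "xyz"])
print(m[0][1])  # 9
```

Because JC is symmetric, storing one half of the matrix loses no information and halves the number of comparisons.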
Algorithms
Most Representative and Central Tweets

The best-ranked tweet is chosen unconditionally

The second one is picked only if its JC score with the
first one is below a chosen threshold Thrlow, otherwise it
is added to the list of related tweets of the first tweet

Similarly, the third one is picked only if its JC score with
the first two is below Thrlow, etc.

This ensures that the topics are dissimilar enough and it
classifies best ranked tweets into topics at the same time
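
The selection rule described above can be sketched as a greedy loop over tweets that are already sorted by rank. Two points are assumptions made here: the ranking itself is taken as given, and a rejected tweet is attached to the central tweet it resembles most, which the slide spells out only for the second tweet. `select_topics` and its parameters are hypothetical names.

```python
def factors(x):
    """Return every distinct substring of x, plus the empty string."""
    subs = {x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1)}
    subs.add("")
    return subs


def joint_complexity(x, y):
    return len(factors(x) & factors(y))


def select_topics(ranked, jc, thr_low):
    """Greedily split ranked tweets into (central tweet, related tweets) topics."""
    topics = []
    for tweet in ranked:
        scores = [jc(tweet, central) for central, _ in topics]
        if all(score < thr_low for score in scores):
            # Dissimilar to every central tweet so far (always true for the first).
            topics.append((tweet, []))
        else:
            # Assumption: attach to the most similar existing topic.
            best = max(range(len(topics)), key=scores.__getitem__)
            topics[best][1].append(tweet)
    return topics


print(select_topics(["apple", "maple", "zzz"], joint_complexity, thr_low=5))
# [('apple', ['maple']), ('zzz', [])]
```

The toy threshold of 5 only suits these very short strings; the values used on real tweets are discussed on the Headlines slide.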
Headlines

By removing punctuation, special characters, etc., from each
central tweet, we construct the headline of each topic. We then
run through the list of related tweets to keep only tweets that
are different enough from the central one (no duplicates)

We do so by keeping only the tweets whose JC score with the
central tweet and all previously kept related tweets is below a chosen
threshold Thrmax.

We first chose the values 400 and 600 for Thrlow and Thrmax
respectively, but many topics had only one related tweet (all the others were
retweets), so we decided to lower that threshold to 240
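
The two steps of this slide can be sketched as follows. The sketch assumes that "different enough" means a JC score strictly below Thrmax, since JC grows with similarity, and that punctuation is replaced by spaces; neither detail is confirmed by the slides. `make_headline` and `distinct_related` are hypothetical names.

```python
import re


def factors(x):
    """Return every distinct substring of x, plus the empty string."""
    subs = {x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1)}
    subs.add("")
    return subs


def joint_complexity(x, y):
    return len(factors(x) & factors(y))


def make_headline(tweet):
    """Replace punctuation and special characters by spaces, then collapse spaces."""
    return " ".join(re.sub(r"[^\w\s]", " ", tweet).split())


def distinct_related(central, related, jc, thr_max):
    """Keep related tweets that are not near-duplicates of the central tweet
    or of any related tweet kept so far (assumed rule: JC below thr_max)."""
    kept = []
    for tweet in related:
        if all(jc(tweet, other) < thr_max for other in [central] + kept):
            kept.append(tweet)
    return kept


print(make_headline("Hello, world!!"))  # Hello world
print(distinct_related("apple", ["apple", "maple", "zzz"], joint_complexity, 10))
# ['maple', 'zzz']
```

In the toy example the exact copy of "apple" scores 15 against the central tweet and is dropped, which mirrors how retweets would be filtered out.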
Keywords

In the bag of words constructed from the list of related
tweets, we remove articles (stop-words), punctuation,
special characters, etc.

We get a list of words, and we order them by decreasing
frequency of occurrence.

Finally we report the k most frequent words, in a list of
keywords
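
The keyword step amounts to a frequency count over a cleaned bag of words. In the sketch below, the stop-word list is a tiny hypothetical sample, and lowercasing the words before counting is an assumption that the slides do not mention.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "in", "to", "and", "is"}


def top_keywords(tweets, k):
    """Return the k most frequent words of the tweets, most frequent first."""
    words = []
    for tweet in tweets:
        # \w+ drops punctuation and special characters while tokenizing.
        words += [w for w in re.findall(r"\w+", tweet.lower()) if w not in STOP_WORDS]
    return [word for word, _ in Counter(words).most_common(k)]


print(top_keywords(["The protest in Kyiv", "Kyiv protest grows", "Kyiv!"], 2))
# ['kyiv', 'protest']
```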
Media URLs

The body of a tweet (in the JSON file format) contains URL
information for links to media files, under the path

entities → media → media_url.

We scan the original json format in order to retrieve such a
URL, from the most representative tweet or any of its related
tweets, pointing to valid photos or pictures in a jpg, png or
gif format, and then we report these pictures along with the
headlines and the set of keywords

Almost half of the headlines (47%) produced by our method
had an image retrieved from the original tweet.
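
Retrieving an image for a topic can be sketched as a lookup along the entities → media → media_url path of each tweet. The sketch assumes tweets are already parsed into dictionaries, checks only the file extension (not whether the link still resolves), and uses a made-up example URL; `media_urls` and `first_image` are hypothetical names.

```python
IMAGE_EXTENSIONS = (".jpg", ".png", ".gif")


def media_urls(tweet):
    """Return the image URLs found under entities -> media -> media_url."""
    urls = []
    for item in tweet.get("entities", {}).get("media", []):
        url = item.get("media_url", "")
        if url.lower().endswith(IMAGE_EXTENSIONS):
            urls.append(url)
    return urls


def first_image(central, related):
    """Return the first image URL from the central tweet or its related tweets."""
    for tweet in [central] + related:
        urls = media_urls(tweet)
        if urls:
            return urls[0]
    return None  # no image for this topic


# Made-up tweets for illustration; real tweets carry many more fields.
central = {"text": "no picture here", "entities": {}}
related = [{"entities": {"media": [{"media_url": "http://example.com/photo.jpg"}]}}]
print(first_image(central, related))  # http://example.com/photo.jpg
```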
Benefits

Both message classification and identification of growing
trends in real time (trend sensing) → submitted to KDD’14

Track the information and timeline within a social network

Deal with languages other than English without specific preprocessing or dictionaries, because the method is:

simple, context-free, with no grammar and does not use semantics
Conclusions

Implementation of a topic detection method applied to a
dataset of tweets emitted during a 24 hour period

It relies heavily on the concept of Joint String Complexity,
which has the benefits of:
  - being language agnostic
  - not requiring humans to deal with lists of keywords
  - high algorithmic efficiency

The results obtained are satisfactory and promising on
the SNOW dataset and on non-Latin languages
(e.g. Greek)
Future Work, Improvements

Use the theoretical background to fix the threshold values
automatically, rather than the empirical ones chosen in this
work

Fix a discarding threshold to remove topics that are not significant
enough, thus allowing some not very active timeslots to
contain fewer or more than a fixed number of topics.

Handle topics that were cut in half between two timeslots
(since they were arbitrarily divided into 15-minute slots)

Extend the JC metric to make topological classification of
tweets and perform clustering based on this distance
Publications related to JC
• D. Milioris and P. Jacquet, “Joint Sequence Complexity Analysis:
Application to Social Networks Information Flow,” Bell Laboratories
Technical Journal, Issue on Data Analytics, Vol. 18, No. 4, 2014
• P. Jacquet, D. Milioris, and W. Szpankowski, “Classification of Markov
Sources Through Joint String Complexity: Theory and Experiments,” Proc.
IEEE Internat. Symp. Inform. Theory (ISIT ’13)
• P. Jacquet and W. Szpankowski, “Joint String Complexity for Markov
Sources,” Proc. 23rd Internat. Meeting on Probabilistic, Combinatorial, and
Asymptotic Methods for the Anal. of Algorithms (AofA ’12)
• P. Jacquet, “Common Words Between Two Random Strings,” Proc. IEEE
Internat. Symp. on Inform. Theory (ISIT ’07)
------------------------------------------------------------
• P. Jacquet and W. Szpankowski, “Analytical Depoissonization and Its
Applications,” Theoret. Comput. Sci., 201:1-2 (1998), 1–62.
• P. Jacquet and W. Szpankowski, “Autocorrelation on Words and Its
Applications: Analysis of Suffix Trees by String-Ruler Approach,” J.
Combin. Theory Ser. A, 66:2 (1994), 237–269.

Questions?