Which Learning Algorithms Really Matter (Industrially)?

©MapR Technologies 2013
Me, Us

Ted Dunning, Chief Application Architect, MapR
– Committer and PMC member: Mahout, ZooKeeper, Drill
– Bought the beer at the first HUG

MapR
– Distributes more open source components for Hadoop
– Adds major technology for performance, HA, and industry-standard APIs

Info
– Hash tag: #mapr
– See also: @ApacheMahout @ApacheDrill @ted_dunning @MapR
Topic For Today

What is important? What is not?
Why?
What is the difference from academic research?
Some examples
What is Important?

Deployable
– Clever prototypes don't count if they can't be standardized

Robust
– Mishandling is common

Transparent
– Will degradation be obvious?

Skillset and mindset matched?
– How long will your fancy data scientist enjoy doing standard ops tasks?

Proportionate
– Where is the highest value per minute of effort?
Academic Goals vs Pragmatics

Academic goals
– Reproducible
– Isolate theoretically important aspects
– Work on novel problems

Pragmatics
– Highest net value
– Available data is constantly changing
– Diligence and consistency have larger impact than cleverness
– Many systems feed themselves; exploration and exploitation are both important
– Engineering constraints on budget and schedule
Example 1:
Making Recommendations Better
Recommendation Advances

What are the most important algorithmic advances in recommendations over the last 10 years?

Cooccurrence analysis?
Matrix completion via factorization?
Latent-factor log-linear models?
Temporal dynamics?
The Winner – None of the Above

What are the most important algorithmic advances in recommendations over the last 10 years?

1. Result dithering
2. Anti-flood
The Real Issues

Exploration
Diversity
Speed
Not the last fraction of a percent
Result Dithering

Dithering is used to re-order recommendation results
– The re-ordering is done randomly

Dithering is guaranteed to make off-line performance worse

Dithering also has a near-perfect record of making actual performance much better

"Made more difference than any other change"
Simple Dithering Algorithm

Generate a synthetic score from log rank plus Gaussian noise:
$s = \log r + \mathcal{N}(0, \varepsilon)$

Pick the noise scale to provide the desired level of mixing:
$\Delta r \propto r \, e^{\varepsilon}$

Typically $\varepsilon \in [0.4, 0.8]$

Oh… and use $\lfloor t/T \rfloor$ as the random seed, so the ordering stays stable within each time window of length $T$
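A minimal sketch of this dithering scheme (the function name and the 10-minute window are illustrative assumptions; ε is treated as the standard deviation of the noise):

```python
# Minimal result dithering. `results` is assumed to be a list of
# items already sorted by the recommender (best first).
import math
import random
import time

def dither(results, epsilon=0.5, window=600):
    # Seeding with floor(t / window) keeps the ordering stable within
    # each window (here 10 minutes), so a user who reloads the page
    # sees consistent results.
    rng = random.Random(math.floor(time.time() / window))
    # Synthetic score: s = log(rank) + N(0, epsilon); smaller is better.
    scored = [(math.log(rank) + rng.gauss(0.0, epsilon), item)
              for rank, item in enumerate(results, start=1)]
    return [item for _, item in sorted(scored, key=lambda p: p[0])]
```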
Example … ε = 0.5

[Table: twelve dithered top-8 result lists; with ε = 0.5 the top few original ranks stay near the front of each list, while deeper items (ranks up to ~34) occasionally surface.]
Example … ε = log 2 = 0.69

[Table: twelve dithered top-8 result lists; with the larger ε = 0.69 there is noticeably more mixing, and items from as deep as rank ~44 surface.]
Exploring The Second Page
Lesson 1:
Exploration is good
Example 2:
Bayesian Bandits
Bayesian Bandits

Based on Thompson sampling
Very general sequential test
Near-optimal regret
Trades off exploration and exploitation
Possibly the best known solution for the exploration/exploitation problem
Incredibly simple
Thompson Sampling

Select each option according to the probability that it is the best

The probability that option $i$ is the best can be computed from the posterior:
$P(i \text{ is best}) = \int I\!\left[\, E[r_i \mid \theta] = \max_j E[r_j \mid \theta] \,\right] P(\theta \mid D) \, d\theta$

But I promised a simple answer
Thompson Sampling – Take 2

Sample θ from the posterior: $\theta \sim P(\theta \mid D)$

Pick $i$ to maximize the expected reward: $i = \arg\max_j E[r_j \mid \theta]$

Record the result from using $i$
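A minimal sketch for the common Bernoulli-reward case (click / no-click), assuming a Beta(1, 1) prior per option; the class name is illustrative:

```python
# Thompson sampling for Bernoulli rewards with Beta posteriors.
import random

class ThompsonSampler:
    def __init__(self, n_options):
        self.alpha = [1] * n_options  # successes + 1 (Beta prior)
        self.beta = [1] * n_options   # failures + 1 (Beta prior)

    def pick(self):
        # Sample one plausible reward rate per option from the
        # posterior, then act greedily on that single sample.
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def record(self, i, reward):
        # Update the posterior for the option actually used.
        if reward:
            self.alpha[i] += 1
        else:
            self.beta[i] += 1
```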
Fast Convergence
[Plot: regret vs. n for n = 0 to 1100; the Bayesian Bandit with a Gamma-Normal model drives regret down much faster than ε-greedy with ε = 0.05.]
Thompson Sampling on Ads
An Empirical Evaluation of Thompson Sampling. Olivier Chapelle and Lihong Li, 2011.
Bayesian Bandits versus Result Dithering

Many useful systems are difficult to frame in fully Bayesian form
Thompson sampling cannot be applied without posterior sampling
You can still do useful exploration with dithering
But it is better to use Thompson sampling where possible
Lesson 2:
Exploration is pretty easy to do and pays big benefits.
Example 3:
On-line Clustering
The Problem

k-means clustering is useful for feature extraction or compression
At scale and at high dimension, the desirable number of clusters increases
A very large number of clusters may require more passes through the data
Super-linear scaling is generally infeasible
The Solution

Sketch-based algorithms produce a compact sketch of the data
Streaming k-means uses adaptive dp-means to produce this sketch in the form of many weighted centroids that approximate the original distribution
The size of the sketch grows very slowly with increasing data size
Many operations, such as clustering, are well behaved on sketches

Fast and Accurate k-means for Large Datasets. Michael Shindler, Alex Wong, Adam Meyerson.
Revisiting k-means: New Algorithms via Bayesian Nonparametrics. Brian Kulis, Michael I. Jordan.
An Example
[Figures: two example clusterings omitted]
The Cluster Proximity Features

Every point can be described by the nearest cluster
– 4.3 bits per point in this case
– Significant error that can be decreased (to a point) by increasing the number of clusters

Or by the proximity to the 2 nearest clusters (2 × 4.3 bits + 1 sign bit + 2 proximities)
– Error is negligible
– Unwinds the data into a simple representation

Or we can increase the number of clusters (an n-fold increase adds log n bits per point and decreases error by sqrt(n))
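A minimal sketch of the two-nearest-cluster features (the function name is hypothetical), assuming `centroids` is a k × d array of cluster centers:

```python
import numpy as np

def proximity_features(x, centroids):
    # Distances from the point to every centroid.
    d = np.linalg.norm(centroids - x, axis=1)
    # Ids of the two nearest clusters (~2 log2(k) bits) plus the two
    # proximities; together they "unwind" the point into a simple
    # representation.
    nearest = np.argsort(d)[:2]
    return nearest, d[nearest]
```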
Diagonalized Cluster Proximity
Lots of Clusters Are Fine
Typical k-means Failure
[Figure: if two seeds land in the same true cluster, Lloyd's algorithm cannot recover; the result is that two other clusters get glued together.]
Streaming k-means Ideas

By using a sketch with lots (k log N) of centroids, we avoid pathological cases

We still get a very good result if the sketch is created
– in one pass
– with approximate search

In fact, adaptive dp-means works just fine

In the end, the sketch can be used for clustering or …
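A minimal sketch of such an adaptive dp-means pass (the threshold schedule and parameter values are illustrative assumptions, not the tuning from the talk):

```python
import numpy as np

def build_sketch(points, threshold=1.0):
    # One pass: each point either folds into the nearest centroid or,
    # if it is farther than the current threshold from everything,
    # starts a new centroid. The threshold grows adaptively so the
    # number of centroids stays modest.
    centroids = [np.asarray(points[0], dtype=float)]
    weights = [1.0]
    for p in points[1:]:
        p = np.asarray(p, dtype=float)
        dists = [np.linalg.norm(p - c) for c in centroids]
        i = int(np.argmin(dists))
        if dists[i] < threshold:
            # Weighted running mean of the points in this centroid.
            centroids[i] = (weights[i] * centroids[i] + p) / (weights[i] + 1)
            weights[i] += 1.0
        else:
            centroids.append(p)
            weights.append(1.0)
            threshold *= 1.05  # loosen slightly as centroids accumulate
    return np.array(centroids), np.array(weights)
```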
Lesson 3:
Sketches make big data small.
Example 4:
Search Abuse
Recommendations
Alice got an apple and a puppy
Charles got a bicycle
Recommendations
Alice got an apple and a puppy
Bob got an apple
Charles got a bicycle
Recommendations
What else would Bob like?
Log Files
[Figure: raw log lines pairing each user (Alice, Bob, Charles) with the items they touched]
History Matrix: Users by Items
[Figure: binary users-by-items matrix with a ✔ wherever a user interacted with an item]
Co-occurrence Matrix: Items by Items
How do you tell which co-occurrences are useful?

[Figure: items-by-items matrix of co-occurrence counts derived from the history matrix]
Co-occurrence Binary Matrix
[Figure: the co-occurrence counts reduced to a binary matrix; entries that are not anomalous are dropped]
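One standard test for deciding which co-occurrences are anomalous is the log-likelihood ratio (the G² statistic used in Mahout's co-occurrence analysis); a minimal sketch, where k11, k12, k21, k22 are the cells of the 2 × 2 contingency table for a pair of items:

```python
import math

def xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    # k11: both items seen together; k12/k21: one without the other;
    # k22: neither. Large values flag co-occurrence beyond chance.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * max(0.0, row + col - mat)
```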
Indicator Matrix: Anomalous Co-Occurrence
Result: the marked row will be added to the indicator field in the item document…

[Figure: indicator matrix with ✔ marking the anomalous co-occurrences for one item]
Indicator Matrix
That one row from the indicator matrix becomes the indicator field in the Solr document used to deploy the recommendation engine.

id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
indicators: (t1)

Note: data for the indicator field is added directly to the metadata for a document in the Solr index. You don't need to create a separate index for the indicators.
Internals of the Recommender Engine
[Figures: two architecture diagrams of the recommender engine omitted]
Looking Inside LucidWorks
Real-time recommendation query and results: evaluation.
What should we recommend if a new user listened to 2122: Fats Domino and 303: The Beatles?
The recommendation is "1710: Chuck Berry".
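A query of roughly this shape might be issued against Solr (the core name and field name here are illustrative; the select-handler parameters are standard Solr):

```python
import requests

# Match the new user's history against the indicators field and let
# Solr's relevance ranking produce the recommendations.
params = {
    "q": "indicators:(2122 303)",  # items the new user interacted with
    "fl": "id,title",              # fields to return
    "rows": 10,                    # number of recommendations
    "wt": "json",                  # JSON response format
}
resp = requests.get("http://localhost:8983/solr/items/select", params=params)
for doc in resp.json()["response"]["docs"]:
    print(doc["id"], doc["title"])
```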
Real-life example
Lesson 4:
Recursive search abuse pays:
search can implement recommendations, which can in turn implement search.
Summary
