Relational Learning via Collective
Matrix Factorization
A Bayesian Matrix Factorization Model
for Relational Data
UAI 2010
Authors: Ajit P. Singh & Geoffrey J. Gordon
Presenter: Xian Xing Zhang
Basic ideas
• Collective matrix factorization is proposed for
relational learning when an entity participates
in multiple relations.
• Several matrices (with different types of
support) are factored simultaneously with
shared parameters.
• CMF is extended to a hierarchical Bayesian
model to enhance the sharing of statistics.
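The idea of factoring several matrices with shared parameters can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a binary matrix X modeled as sigmoid(U V^T) and a real-valued matrix Y modeled as V Z^T, fitted by plain gradient descent rather than Bayesian inference. The function name, step size, rank, and regularization weight are all hypothetical choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def collective_factorize(X, Y, k=2, lr=0.02, reg=0.01, steps=2000, seed=0):
    """Jointly fit X ~ sigmoid(U V^T) (binary) and Y ~ V Z^T (real-valued).

    The factor V is shared, so both matrices inform the same parameters.
    Assumed loss: Bernoulli negative log-likelihood on X, squared error on Y,
    plus an L2 penalty on all factors.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    n_y, r = Y.shape
    assert n == n_y, "columns of X and rows of Y must index the same entities"
    U = 0.1 * rng.standard_normal((m, k))
    V = 0.1 * rng.standard_normal((n, k))
    Z = 0.1 * rng.standard_normal((r, k))
    losses = []
    for _ in range(steps):
        PX = sigmoid(U @ V.T)          # predicted probabilities for X, m x n
        RX = PX - X                    # gradient of the Bernoulli loss w.r.t. the logits
        RY = V @ Z.T - Y               # residual of the squared-error loss, n x r
        loss = (-(X * np.log(PX + 1e-12) + (1 - X) * np.log(1 - PX + 1e-12)).sum()
                + 0.5 * (RY ** 2).sum()
                + 0.5 * reg * ((U ** 2).sum() + (V ** 2).sum() + (Z ** 2).sum()))
        losses.append(loss)
        gU = RX @ V + reg * U
        gV = RX.T @ U + RY @ Z + reg * V   # V receives gradient from both matrices
        gZ = RY.T @ V + reg * Z
        U -= lr * gU
        V -= lr * gV
        Z -= lr * gZ
    return U, V, Z, losses

# Toy data: 6 x 5 binary relation and 5 x 4 real-valued relation sharing 5 entities.
rng = np.random.default_rng(0)
X = (rng.uniform(size=(6, 5)) < 0.5).astype(float)
Y = rng.uniform(size=(5, 4))
U, V, Z, losses = collective_factorize(X, Y)
Y_hat = V @ Z.T   # reconstruction of Y through the shared factor V
```

The gradient for V sums contributions from both reconstructions, which is where the sharing of parameters takes effect.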
An example of application
• Functional Magnetic Resonance Imaging (fMRI):
– fMRI data can be viewed as a relation (real valued),
Response(stimulus, voxel) ∈ [0, 1]
– Stimulus side information: a binary relation, Co-occurs(word, stimulus) ∈ {0, 1}, which records
whether the stimulus word co-occurs with other commonly used words in a large text corpus
– The goal is to predict unobserved values of the
Response relation
Basic model description
• In the fMRI example, the Co-occurs relation is an m×n
matrix X; the Response relation is an n×r matrix Y.
• Likelihood of each matrix X and Y:
• Co-occurs (p_X) is modeled by the Bernoulli
distribution, Response (p_Y) is modeled by a
Hierarchical Collective Matrix Factorization
• Information between entities can only be shared indirectly, through
another factor: e.g., in f(UV^T), two distinct rows of U are correlated only
through V.
• The hierarchical prior acts as a shrinkage estimator for the rows of a factor,
pooling information indirectly, through Θ.
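The shrinkage effect can be illustrated with a small worked formula. The Gaussian form of the prior, the noisy estimate \hat{F}_i, and the scalar variances \sigma^2 and \tau^2 are assumptions made for illustration; the slides only state that rows of a factor are pooled through Θ.

```latex
% Assumed Gaussian prior on each row F_i of a factor, with shared hyperparameters \Theta
F_i \mid \Theta \sim \mathcal{N}(\mu, \Sigma), \qquad \Theta = \{\mu, \Sigma\}

% Scalar illustration: noisy estimate \hat{F}_i \sim \mathcal{N}(F_i, \sigma^2), prior F_i \sim \mathcal{N}(\mu, \tau^2)
\mathbb{E}[F_i \mid \hat{F}_i]
  = \frac{\tau^2}{\tau^2 + \sigma^2}\,\hat{F}_i
  + \frac{\sigma^2}{\tau^2 + \sigma^2}\,\mu
```

Under these assumptions, a row with little data (large \sigma^2) is pulled strongly toward the shared mean \mu, while a well-observed row stays close to its own estimate.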
Bayesian Inference
• Hessian Metropolis-Hastings:
– Random-walk Metropolis-Hastings samples from a Gaussian proposal
distribution whose mean is the sample at time t, F_i(t), and whose
covariance matrix must be specified by hand, which is problematic.
– HMH uses both the gradient and the Hessian to automatically construct a
proposal distribution at each sampling step. This is claimed as the
main technical contribution of the UAI 2010 paper.
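One way to build a proposal from the gradient and Hessian is sketched below. It is not taken from the paper: it assumes the proposal is a Gaussian centered at one Newton step from the current sample, with covariance equal to the inverse of the negative Hessian. The target density (a logistic likelihood with a standard normal prior) and all function names are hypothetical.

```python
import numpy as np

def log_target(x, A, b):
    """Hypothetical non-Gaussian log-density: logistic likelihood plus N(0, I) prior."""
    z = A @ x
    return np.sum(b * z - np.logaddexp(0.0, z)) - 0.5 * x @ x

def grad_hess(x, A, b):
    """Gradient and Hessian of log_target at x."""
    p = 1.0 / (1.0 + np.exp(-(A @ x)))
    grad = A.T @ (b - p) - x
    hess = -(A.T * (p * (1.0 - p))) @ A - np.eye(len(x))
    return grad, hess

def proposal(x, A, b):
    """Assumed proposal: mean is one Newton step from x, covariance is (-Hessian)^-1."""
    g, H = grad_hess(x, A, b)
    cov = np.linalg.inv(-H)
    cov = 0.5 * (cov + cov.T)          # symmetrize against round-off
    return x + cov @ g, cov

def log_gauss(y, mean, cov):
    """Log-density of a multivariate Gaussian at y."""
    d = y - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(y) * np.log(2 * np.pi))

def hessian_mh(x0, A, b, n_samples, rng):
    """Metropolis-Hastings with a state-dependent Gaussian proposal."""
    x = np.asarray(x0, dtype=float)
    samples, accepted = [], 0
    for _ in range(n_samples):
        m_fwd, c_fwd = proposal(x, A, b)
        y = rng.multivariate_normal(m_fwd, c_fwd)
        m_rev, c_rev = proposal(y, A, b)
        # The proposal is asymmetric, so the ratio includes both proposal densities.
        log_alpha = (log_target(y, A, b) + log_gauss(x, m_rev, c_rev)
                     - log_target(x, A, b) - log_gauss(y, m_fwd, c_fwd))
        if np.log(rng.uniform()) < log_alpha:
            x, accepted = y, accepted + 1
        samples.append(x.copy())
    return np.array(samples), accepted / n_samples

# Toy problem: 20 observations, 2 parameters.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 2))
b = (rng.uniform(size=20) < 0.5).astype(float)
samples, rate = hessian_mh(np.zeros(2), A, b, n_samples=500, rng=rng)
```

Because the proposal adapts to the local curvature of the target, no covariance has to be tuned by hand, in contrast to the random-walk sampler.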
Related work
Experiment setting
• The Co-occurs(word, stimulus) relation is collected by
measuring whether or not the stimulus word occurs within
five tokens of a word in the Google Tera-word corpus.
• Hold-out prediction:
• Fold-in prediction (to predict a new row in Y)
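The construction of the Co-occurs relation described above can be sketched as follows. This is a toy illustration under assumptions: the corpus is a short in-memory token list rather than the Google corpus, and the function name and word lists are hypothetical.

```python
import numpy as np

def cooccurs(tokens, stimulus, word, window=5):
    """Return 1 if `word` appears within `window` tokens of any occurrence of `stimulus`."""
    for i, tok in enumerate(tokens):
        if tok != stimulus:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        if any(tokens[j] == word for j in range(lo, hi) if j != i):
            return 1
    return 0

# Hypothetical toy corpus, word list, and stimulus list.
tokens = "the chef used a sharp knife to cut the ripe apple on the table".split()
words = ["cut", "chef", "table"]
stimuli = ["knife", "apple"]

# Binary matrix with one row per word and one column per stimulus.
X = np.array([[cooccurs(tokens, s, w) for s in stimuli] for w in words])
```

Here "cut" lies within five tokens of both "knife" and "apple", while "chef" is within range of "knife" only and "table" of "apple" only.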
Experiment results
• Existing methods force one to choose between
ignoring parameter uncertainty or making
Gaussianity assumptions.
• Non-Gaussian response types significantly
improve predictive accuracy.
• While non-Gaussianity complicates the
construction of proposal distributions for
Metropolis-Hastings, it does have a significant
impact on predictive accuracy.
