
Notes on Factorisation Machine

To start with, consider the model equation of a linear regression model with a two-way interaction effect:

\begin{equation} f(x) = \beta_0 + \sum_{i=1}^{p} \beta_i x_i + \sum_{i=1}^{p} \sum_{j=i+1}^{p} \beta_{i, j} x_i x_j \end{equation}

where

- \(\beta_0\) is the intercept term,
- \(\beta_i\) models the strength of the i-th variable,
- \(\beta_{i,j}\) models the interaction between the i-th and j-th variables.

To estimate the parameters, we can first create the interaction variables \(x_{i,j} = x_i \cdot x_j\) and add them to the design matrix \(X\). This converts the problem into an ordinary linear regression, which can be solved by least squares.
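Here is a minimal numpy sketch of that approach (the function name and setup are illustrative, not from any particular library):

```python
import numpy as np
from itertools import combinations

def fit_linear_with_interactions(X, y):
    """Fit the two-way interaction model by expanding the design matrix."""
    n, p = X.shape
    # one column per pair (i, j) with i < j: the interaction variable x_i * x_j
    inter = [X[:, i] * X[:, j] for i, j in combinations(range(p), 2)]
    Z = np.column_stack([np.ones(n), X] + inter)   # intercept | main effects | interactions
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # ordinary least squares
    return beta
```

Note the expanded design matrix has \(1 + p + p(p-1)/2\) columns, which is why this direct approach breaks down when \(p\) is large or the data are sparse.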

The Factorisation Machine has the same model equation but estimates the interaction parameters \(\beta_{i,j}\) differently, by factorising them: provided a sufficiently large \(k\), the \((p \times p)\) matrix \(\beta\) can be approximated (factorised) as \(V V^T\) for some lower-dimensional matrix \(V\) of size \((p \times k)\).

In this way, each interaction parameter becomes

\begin{equation} \hat{\beta}_{i,j} := \langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{i, f} v_{j, f} \end{equation}

where \(k\) is a hyper-parameter that defines the dimensionality of the factorisation.
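A quick numpy sketch of the factorisation itself (the sizes are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 10, 3
V = rng.normal(size=(p, k))      # the low-rank factor matrix V (p x k)
B = V @ V.T                      # induced (p x p) interaction matrix
i, j = 2, 5
# entry (i, j) is exactly the inner product <v_i, v_j>
assert np.isclose(B[i, j], V[i] @ V[j])
```

So instead of \(p(p-1)/2\) free interaction parameters, the model only learns the \(p \times k\) entries of \(V\).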

The model can be trained with stochastic gradient descent under various loss functions (squared loss, logit loss, or hinge loss). For squared loss, the gradient of the FM model is

\begin{equation} \frac{\partial{f}}{\partial{\theta}} = \begin{cases} 1, & \textrm{ if } \theta \textrm{ is } \beta_0 \\ x_i, & \textrm{ if } \theta \textrm{ is } \beta_i \\ x_i(\sum_{j=1}^{p} v_{j, f} x_j) - v_{i, f} x_i^2, & \textrm{ if } \theta \textrm{ is } v_{i, f} \end{cases} \end{equation}
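A hedged sketch of one SGD update on a single example under squared loss, applying the gradients above (the function name, learning rate, and the factor of 2 from differentiating \((f(x) - y)^2\) are my choices, not prescribed by the paper):

```python
import numpy as np

def sgd_step(x, y, beta0, beta, V, lr=0.01):
    """One SGD update on a single example (x, y) under squared loss."""
    p = x.shape[0]
    # naive O(k p^2) prediction; a linear-time version follows below
    f = beta0 + beta @ x + sum((V[i] @ V[j]) * x[i] * x[j]
                               for i in range(p) for j in range(i + 1, p))
    err = 2.0 * (f - y)                  # d(squared loss)/d f
    s = V.T @ x                          # s_f = sum_j v_{j,f} x_j, shape (k,)
    beta0 = beta0 - lr * err             # df/d(beta_0) = 1
    beta = beta - lr * err * x           # df/d(beta_i) = x_i
    V = V - lr * err * (np.outer(x, s) - V * (x ** 2)[:, None])  # df/d(v_{i,f})
    return beta0, beta, V
```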

Due to this factorisation of the interactions, no model parameter depends directly on a pair of variables. The computation can be reduced from quadratic \(\mathcal{O}(kn^2)\) to linear \(\mathcal{O}(kn)\); see Lemma 3.1 in Rendle's paper.
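The lemma rewrites the pairwise sum as \(\sum_{i<j} \langle v_i, v_j \rangle x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_i v_{i,f} x_i \right)^2 - \sum_i v_{i,f}^2 x_i^2 \right]\), which needs only one pass over the features. A small sketch, checked against the naive quadratic double sum (data and sizes are made up for the check):

```python
import numpy as np

def fm_predict(x, beta0, beta, V):
    """FM prediction in linear time, using the Lemma 3.1 identity."""
    s = V.T @ x                                      # sum_i v_{i,f} x_i, shape (k,)
    pairwise = 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))
    return beta0 + beta @ x + pairwise

# sanity check against the naive quadratic double sum
rng = np.random.default_rng(1)
p, k = 6, 3
x, beta, V = rng.normal(size=p), rng.normal(size=p), rng.normal(size=(p, k))
beta0 = 0.1
naive = beta0 + beta @ x + sum((V[i] @ V[j]) * x[i] * x[j]
                               for i in range(p) for j in range(i + 1, p))
assert np.isclose(fm_predict(x, beta0, beta, V), naive)
```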

Another advantage is that we can estimate the interaction parameter \(\beta_{i,j}\) even when there are not enough observations of the pair \((x_i, x_j)\), because each \(v_i\) is learned from all interactions involving the i-th variable. This is a desirable feature when working with sparse datasets: Rendle showed that FMs outperform Support Vector Machines on highly sparse datasets.

The Factorisation Machine looks promising: it is fast to train and performs well on sparse datasets. I haven't fully understood it yet, but I am keen to apply it to a Kaggle competition and gain more insights.
