Yi Tang Data Scientist with Emacs

Notes on Factorisation Machine

To start with, consider model equation of a linear regression model with two-way interaction effect:

\begin{equation} f(x) = \beta_0 + \sum_{i=1}^{p} \beta_i x_i + \sum_{i=1}^{p} \sum_{j=i+1}^{p} \beta_{i, j} x_i x_j \end{equation}

where

\(\beta_o\)
is the interception term,
\(\beta_i\)
models the strength of the i-th variable.
\(\beta_{i,j}\)
models the interaction between i-th and j-th variable.

To estimate the parameters, we can firstly create the interactive variables 1 \(x_{i,j} = x_i \cdot x_j\) and add them to the design matrix \(X\). It converts the problem into a linear regression which can be easily solved by least squares2.

The Factorisation Models have the same model equation but use a different way of estimating the interaction parameter \(\beta_{i,j}\) by factorising it: provided a sufficiently large \(k\), the \(\beta\) \((p \times p)\) matrix can be approximated (factorised) by \(V \bullet V^T\) using a lower dimension matrix \(V\) \((p \times k)\).

In this way, interaction parameter becomes

\begin{equation} \hat{\beta}_{i,j} := \langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{i, f} v_{j, f} \end{equation}

Where the k is a hyper-parameter that defines the dimensionality of the factorisation.

The model equation can be solved using stochastic gradient descent with various of loss (square loss, logit, or hinge loss). For square loss 3, the gradient of FM model is

\begin{equation} \frac{\partial{f}}{\partial{\theta}} = \begin{cases} 1, & \textrm{ if } \theta \textrm{ is } \beta_0 \\ x_i, & \textrm{ if } \theta \textrm{ is } w_i \\ x_i(\sum_{j=1}^{p} v_{j, f} x_j) - v_{i, f} x_i^2, & \textrm{ if } \theta \textrm{ is } v_{i, f} \end{cases} \end{equation}

Due to this factorisation of interaction there is no model parameter that directly depends on the two variables. The computation can reduced from quadratic \(\mathcal{O}(kn^2)\) to linear \(\mathcal{O}(kn)\), see Lemma 3.1 in Rendle's paper 4.

Another advantage of this is we are able to estimate the interaction parameter \(\beta_{i,j}\) when there is not enough observation for \((x_i, x_j)\). This is a desired feature when working with sparse dataset. Rendle showed FM outperforms Support Vector Machine on hugely sparse dataset.

Factorisation Machine looks promising: it's fast to train and performs well on sparse dataset. I haven't fully understand it yet but am keen to apply it to a Kaggle competition and gain more insights.

How to Create a Screencast GIF in Emacs

nil

I've always wanted to create a GIF using Emacs to demonstrate some features, it just looks so cool. I finally got a chance after attending the Leeds Code Dojo. The final exercise is bit unusual; we have to write a basic expression evaluation program without using the eval function in whatever language we choose. The first problem we had was to figure out the order of sub-expression to evaluate. For example, in (5 * (2 + 1) ) expression, we know we firstly add 2 to 1 to get the 3, and then multiply 3 by 5. It sounds trivial but it is actually hard to write a program to do that.

I used regular expression1 to locate the most inner expression to evaluate, then replaced the expression with its evaluating result, and continued these two steps until there was no expression2.

The above GIF shows each step in a expression evaluation program written in Emacs Lisp.

This post show how to make GIF in Emacs on Ubuntu system.

Dependencies

There are three packages to install first. We need recordmydesktop to capture the motion of the screen, mplayer to view the video, and imagemagic to convert the recorded video into GIF file. They can be installed easily using the apt-get command, as in the following bash shell script:

sudo apt-get install recordmydesktop mplayer imagemagick

On Emacs side, I use camcorder package to control the workflow. It is hosted in MELPA repository, and can be installed by

(package-install 'camcorder)

Then everything should work nicely together.

Workflow

After these packages are installed, creating a GIF is simply, only requiring three steps.

1. Initiate the recording

In Emacs,

  • Switch to the buffer we want to record, let's call this buffer the recording buffer,
  • Initiate the recording by M-x camcorder-record command,
  • Choose where to save the video file, then

A new frame with the recording buffer will pop up. It is wrapped inside a white rectangular box. Everything inside the box will be recorded and saved in the video file. Note, if we move the window or overlay it with other windows, we probably get undesired results.

2. Record Choose the recording buffer/frame,

  • Press F-11 to pause/resume,
  • Show some cool things,
  • Press F-12 to stop,

Note the demonstration must have an effect on the recording buffer, and we can use with-current-buffer function to dump the output for a particular buffer, for example,

(with-current-buffer "Demo_Buffer"
  (insert "Start to demo: "))

will insert "Start to demo: " into the Demo_Buffer.

I found it is useful to wrap the demonstration into a function and bind to a key because I will probably run it many times.

(defun yt/camcorder-show-off ()
  (interactive)
  (goto-char (point-min))
  (insert "going to show you something cool, don't blink your eyes.")
  (sleep-for 2)
  ;;;; apply some functions
  (insert "\nExciting isn't?"))

(define-key camcorder-mode-map [f5] 'yt/camcorder-show-off)

There are two functions that are helpful control the flow. Use sleep-for function to let the program wait few seconds, and use y-or-n-p to let us choose whether to proceed or switch flow.

3. Make gif

After the demo is finished,

  • Type M-x camcorder-convert to convert a video file to a GIF file,
  • Choose a file name for the GIF file,
  • Select convert method, and choose use mplay with imagicstick.

We probably repeat the step 1-3 multiple times until we are happy with the GIF.

Footnotes:

1

Regular expression might not be suitable for this task, and it works

2

Everything is actually an expression

Migrate to Ubuntu

My MacBookPro's hard drive stooped working last week and I managed to recover most of the data from a Time Machine back-up 6 months ago. But I couldn't get the mu4e and mu working. I feed up with googling, trying, and decide to immigrate to Ubuntu. It would save me from a lot of frustrations and time in making my Mac and office PC work the same way.

Ideally, I will built a Ubuntu on Mac which is exactly the same as the one on my office PC, by just copy over everything 1. As a minimalist, I decided to build the system from scratch and install software one by one so that I can have an better understanding of what are the necessities for me.

In the last few days, I become extra mindful about the what and how I used the Ubuntu system in the office, and realise the things I need can be grouped into three categories:

  1. Configuration,
    1. the .ssh folder for the ssh-agent,
    2. the .fonts folder for new fonts,
    3. the .mbsynrc file for sync emails,
    4. the .ledgerrc.
  2. Software for
    1. Development: like git, gcc, Emacs, and R.
    2. Writing: org-mode, LaTeX,
    3. Email: mu, mu4e, and mbsync.
    4. Finance: ledger.
  3. Personal git repositories
    1. public reposity on GitHub,
    2. private reposities on BitBucket

For 1), since they are small, I can zip up and copy over, or even better, create a git repository so that sync on two machines becomes better easier.

For 2), I need to find the software's package name in the Ubuntu's software repository, and then install all of them by a script. The dependencies should be resolved automatically.

For 3), I need to create a shared folder between the host system and the Ubuntu system, and then copy over the ~/git/ folder.

It really sounds like a plan! I am going to download the Ubuntu installation file now and hopefully the transition will be very smooth.

My Expeirence with Repetitive Strain Injury (RSI)

Someday I typed more than 80 thousand times just in Emacs. This is pretty awesome at first sight but it can cause serious health problem.

Last month, I felt burning pain of my forearms. It is an symptom of Repetitive strain injury (RSI). I realised that if continue typing like that, one day I will never able to do programming, like the Emacs celebrities in Xah Lee' article about RSI.

Since then I've deliberately tried to avoid aimless and unproductive typing, take more typing breaks, think though things before trying, write more on paper.

Conditions are getting better: I don't feel server pain any more, only sometimes uncomfortable.

But I need to find a better way to improve it. Because sometimes I got the idea, but can't touch the keyboard. This feeling really suck.

So I investigated the Hydra package and use it to group related commands together so that use only two keys are needed to perform frequent tasks.

For example, to search something in current project, instead of typing M-x helm proj grep, that's 16 keystrokes, I only need F5 G with Hydra. The implementation is listed in this post.

But calling functions/commands in Emacs counts only a small proportion of my typing; most of the time, I write code and report.

This is where Yasnippets kicks in, it enable me to type less without losing quality. For example, I use this snippet quite often when writing R code,

res <- sapply(seq_len(n), function(i) {
    ## 
})

That's more than 40 keystrokes. Yasnippets can short it to only 6s! After I type sapply and then hit TAB, it will expand to the region above.

I will investigate the Yasnippet package next week. If you know any good tutorials for Yasnippet or snippets for writing R code, please share your resources.

Start Enjoying Regular Expression In Emacs

The search-forward-regexp, replace-match, and match-string functions work together nicely, and makes my job much easier and enjoyable!

I am writing a release notes for the a software updates. Part of the process is to associate the SVN Revision number that relates to important changes, so that others can backtrack and review the code and see what exactly has been implemented.

In Phabricator, the revision number will be render automatically. Clicking them takes me to the exact revision, showing the difference with previous version. But the documentation will be eventually built by Sphinx and hosted on a remote server. So I have to manually add the URL to all the SVN revision number. For example, to replace rS1234 to

[[http://phabricator.domain.co.uk/rS1234][rS1234]]

There are 31 revision number in the whole document. I could do it manually but for the long term benefits, it would be more efficient write a function to process it automatically, maybe others can use it as well.

Implementation

The first thing I noticed is each SVN revision numbers consist of two letters (rS) and few digits. Because the four digits I don't know beforehand, I have to use regular expression to do the pattern search.

The tricky bit here is to retrieve the values that matched the pattern, because of it is needed to construct the URL that points to the commits, and I also need to replace the it with differnet values.

The procedure can be summarised as:

  1. Find the revision number that match the patterns described above. I use search-forward-regexp() to search the pattern "rS[0-9]+", which means a string that starts with rS with one or more digits.a
  2. retrieve the values that matched the pattern. This is done by match-string().
  3. replace the revision number with the constructed URL. This is done by replace-match(), and I use concat() to combine the IP address with the revision number.

The following is a workable implementation:

(defvar revision-pattern "rS[0-9]+"
  "The RegExp pattern of the SVN revision number")

(defvar repo-url "http://10.0.0.11/"
  "The IP address of the SVN repository")

(defun yt/add-link-to-SVN-revision-number ()
  "add links to svn commits identifier"
  (interactive)
  (while (search-forward-regexp revision-pattern)
    (let* ((commit (match-string 0))
           (link (concat repo-url commit)))
      (replace-match "")
      (org-insert-link nil link commit))))

Note the last two lines of the function can be simplified as

(replace-match (concat "[[" link
                       "][" commit "]]"))                       

You can easily adopt the code and make it applicable to your case, just modify the revision-pattern and repo-url variables. But beware that you should not apply the function to the same buffer more than once, otherwise you will get something crazy like this:

[[http://10.0.0.11/[[http://10.0.0.11/rS1234][rS1234]]][[[http://10.0.0.11/rS1234][rS1234]]]]

One way to make it better is to have a test before replacing: if the revision number is already associated with a URL, then do nothing. If you have figure out how to do it, please let me know and I've happy to update this post.

My posts published last year showed my frustration with regular expression in Emacs. But now I am looking forward doing more text processing with it, because it will be fun!

The search-forward-regexp, replace-match, and match-string functions work together nicely and make the my job much easier and enjoyable!

What's your favourite functions in regular expression? Do you have something to recommend?

If you have any questions or comments, please post them below. If you liked this post, you can share it with your followers or follow me on Twitter!