Yi Tang Data Science and Emacs

A Workflow for Using Git to Track SVN Repository

Version control is a complex topic, and ideas such as branching and the different types of merging are hard to understand. I only understand the basics of Git, yet it already makes my life a lot easier: I am managing about 10 repositories at the moment without much effort.

But my colleagues use SVN as the central storage for scripts. Switching to SVN is not a problem; I would just need a few weeks to transfer my knowledge and start using it. But I am reluctant to learn something basic and end up with duplicated knowledge, and I also use GitHub and Bitbucket, which are Git based. Yet sticking to Git would make it impossible to work with my colleagues.

Then I found out that the Git developers have already made an effort to bridge Git and other version control systems, like SVN. git svn allows me to use just Git commands for staging, cherry-picking, pulling etc., and then upload to the remote SVN repository with a single command. I really like the idea of transferring skills from one system to another without any cost; it makes me believe Git is great, and I can continue to use Magit in Emacs!

Here are the basic steps and comments for this workflow:

  1. Create a folder: mkdir ProjRepo
  2. Create an empty Git repository inside it: git init
  3. Add the following to .git/config, changing the URL to the right repository:

     [svn-remote "svn"]
         url = https://your.svn.repo
         fetch = :refs/remotes/git-svn

  4. Pull from the central SVN repository into this folder: git svn fetch svn
  5. Switch to a local branch that tracks the SVN remote branch: git checkout -b svn git-svn
  6. Modify or add files
  7. Use git add and git commit to snapshot local changes
  8. From time to time, update the local repository: git svn rebase
  9. Finally, upload the local changes to the central SVN repository: git svn dcommit

See the official manual, 8.1 Git and Other Systems - Git and Subversion, and the git-svn documentation for more details.

Why Use Emacs 1 - Emacs Speaks Statistics

I am a statistician; coding in R and writing reports is what I do most of the day. I have been a long way in searching for the perfect editor for me: I tried RStudio, Sublime Text and TextMate, and settled down happily with ESS/Emacs, for both coding and writing.

There are three features that made me decide:

Auto Formatting

Scientists have a reputation for being bad programmers who write code that is unreadable and therefore incomprehensible to others. I intend to become a top-level programmer and follow a style guide strictly. This means I have to spend some time adding and removing spaces in the code.

To my surprise, Emacs does it for me automatically, just by hitting TAB, and it also indents smartly, which makes me comfortable writing a long function call and splitting it into multiple lines. Also, if I misplace a ')' or ']', the formatting becomes strange, which reminds me to check. Here's an example:

rainfall.subset london,
rainfall.pairs,
rainfall.dublin)

Search Command History

I frequently search the command history. Imagine I was producing a plot and realised something was missing in the data, so I went back and fixed the data first, and now want to run the ggplot command again. I could press the Up/Down buttons many times, or just search once or twice. M-r ggplot( gives me the most recent command I typed containing the keyword ggplot(; I then press RET to select the command, which might be ggplot(gg.df, aes(lon, lat, col = city)) + geom_line() + ...... If it is not the one I want, I press C-r again to choose the second most recent one, and repeat until I find the right one.

Literate Programming

I am a supporter of literate statistical analysis and believe we should put code, results and discoveries together when developing models. RStudio provides an easy-to-use tool for this purpose, but it does not support different R sessions, so if I need to generate a report, I have to re-run all the code from the beginning, which isn't practical for me with large volumes of data because it takes quite long.

ESS and org-mode work really well together via Babel, which is friendlier to use. I can choose to run only part of the code and have the output inserted automatically, with no need to copy and paste. Also, I can choose where to execute the code: on my local machine, on the remote server, or on both at the same time.

These only scratch the surface of ESS; there are a lot more useful features, like spell checking for comments and documentation templates, that make me productive, and I would recommend anyone who uses R to learn ESS/Emacs. The following is my current setting.

;; Adapted with one minor change from Felipe Salazar at
;; http://www.emacswiki.org/emacs/EmacsSpeaksStatistics
(require 'ess-site)
(setq ess-ask-for-ess-directory nil)       ;; start R in the default folder
(setq ess-local-process-name "R")
(setq ansi-color-for-comint-mode 'filter)  ;; strip ANSI colour codes from output
(setq comint-scroll-to-bottom-on-input t)
(setq comint-scroll-to-bottom-on-output t)
(setq comint-move-point-for-output t)
(setq ess-eval-visibly-p 'nowait)          ;; do not wait while ESS is evaluating

(defun my-ess-start-R ()
  "Start R in a second window unless an *R* buffer already exists."
  (interactive)
  (if (not (member "*R*" (mapcar (function buffer-name) (buffer-list))))
      (progn
        (delete-other-windows)
        (setq w1 (selected-window))
        (setq w1name (buffer-name))
        (setq w2 (split-window w1 nil t))
        (R)
        (set-window-buffer w2 "*R*")
        (set-window-buffer w1 w1name))))

(defun my-ess-eval ()
  "Evaluate the active region, or the current line and step to the next."
  (interactive)
  (my-ess-start-R)
  (if (and transient-mark-mode mark-active)
      (call-interactively 'ess-eval-region)
    (call-interactively 'ess-eval-line-and-step)))

(add-hook 'ess-mode-hook
          (lambda ()
            (local-set-key [(shift return)] 'my-ess-eval)))
(add-hook 'inferior-ess-mode-hook
          (lambda ()
            (local-set-key [C-up] 'comint-previous-input)
            (local-set-key [C-down] 'comint-next-input)))
(add-hook 'ess-mode-hook
          (lambda ()
            (flyspell-prog-mode)
            (run-hooks 'prog-mode-hook)
            ;; (prog-mode)
            ))

;; REF: http://stackoverflow.com/questions/2901198/useful-keyboard-shortcuts-and-tips-for-ess-r
;; Control and up/down arrow keys to search history with matching what you've already typed:
(define-key comint-mode-map [C-up] 'comint-previous-matching-input-from-input)
(define-key comint-mode-map [C-down] 'comint-next-matching-input-from-input)

Send Stylish MIME in Emacs

Last Updated: 18 Jan 2015

This is the first technical article on this blog; however, the main purpose is not to analyse the problem and provide solutions, but to tell the story of an ordinary person trying to pursue his vision in a multi-language environment (Emacs and HTML) of which he only knows the basics. I hope you find it interesting to read; for those who care more about the solution than the problem-solving approach, please see the last section.

The Problem

The first time I thought I needed a fancy email was when I sent a quick model update to my colleague. I had a table like this:

Conditioning Variable   Dependent Variable   Probability
k >= 50                 t >= 50              0.154
k >= 50                 t >= 100             0.111
k >= 50                 t >= 200             0.078

It was written in org-mode, in which I can do the formatting quickly and nicely. But once copied over to Outlook, it looks messy, and the columns do not line up.

| Conditioning Variable | Dependent Variable | Probability |
|-----------------------+--------------------+-------------|
| k >= 50 | t >= 50 | 0.154 |
| k >= 50 | t >= 100 | 0.111 |
| k >= 50 | t >= 200 | 0.078 |

The correct way is to insert a table in Outlook. First, I have to export the table to a CSV file, then open it in Excel, and finally copy it over to Outlook, which will recognise it as a table.

HTML Attachment Solution

I guess the purpose of that email was to give my colleague a few numbers in a way that lets him compare them and get a feel for the model. So the formatting is really necessary, but the workaround is really tedious.

I have another colleague who is an HTML expert and produced a company CSS style sheet. He kindly customised it to match the org-export classes, i.e. org-ur, org-table, org-list.

So what I did was export the org file as HTML and attach it to the email, so that my colleague could simply click and open it in a browser, which gives him a nicely formatted table. But people get hundreds of emails per day and seem to dislike attachments.

Paradise of MIME

I noticed Bernt Hansen pointed out in his famous Org Mode - Organize Your Life In Plain Text! that he uses org-mime to send HTML email. MIME, which stands for Multipurpose Internet Mail Extensions, is an extension to plain email that enables users to exchange rich data including images, tables, video etc.

org-mime can convert an org file into HTML in a way that email services like Office365 or Gmail can recognise and render with pre-defined styles.

The default style looks awful: the font, the colour, the size, basically nothing is right. Recently I sent about 3-5 emails using this style; I suspect the readers spent less time reading and comprehending them, and therefore the message was not conveyed.

But the workflow is fascinating: I call the org-mime-subtree function, then I just type a few email addresses. There is no need to switch to the system or Outlook; everything is done in Emacs, at the exact point where the main content is generated.

So I was thinking: what if the email looked as good as the attachment? What if I could apply the style to the email? That would look fantastic!

I did my research: org-mime does indeed provide a feature that lets the user change the HTML style, and two examples are shown on Worg. The package first generates the HTML, and then searches and replaces certain chunks. For example,

<p> 
  this is a paragraph 
</p>

will become something like this, depending on the user's specification,

<p style="color: blue">
  this is a paragraph 
</p>

The search-replace mechanism works fine for a small email. It takes a pair of values (element, style), where element can be a paragraph, table, list etc. and style can be colour, font, size etc. The problem is that these pairs do not quite match a standard CSS file,

body {
    font-family: "Helvetica Neue", "Lucida Grande", "Lucida Sans Unicode", Helvetica, Arial, sans-serif !important;
    font-size: 14px;
}
body #content {
    padding-top: 70px;
}

One can process the CSS file and feed the package a long list of pairs. But this approach does not seem safe: I quickly skimmed the CSS file and found things I couldn't understand, for example the body #content block above.
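To make the mismatch concrete, here is a minimal Python sketch of that "process the CSS file into pairs" idea. It is not how org-mime works; the function names parse_css and inline_styles and the sample CSS are made up for illustration, and the parser only handles very simple rules with no quotes, comments or nested blocks.

```python
import re

# A small, simplified sample in the spirit of the style sheet above.
CSS = """
body {
    font-family: Helvetica, Arial, sans-serif;
    font-size: 14px;
}
p {
    color: blue;
}
body #content {
    padding-top: 70px;
}
"""

def parse_css(css):
    """Return a list of (selector, declarations) pairs from very simple CSS."""
    pairs = []
    for selector, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
        declarations = " ".join(body.split())  # collapse whitespace and newlines
        pairs.append((selector.strip(), declarations))
    return pairs

def inline_styles(html, pairs):
    """Add style="..." to bare tags; return (new_html, skipped_selectors)."""
    skipped = []
    for selector, style in pairs:
        if re.fullmatch(r"[a-z][a-z0-9]*", selector):
            # Plain element selector: rewrite every bare <tag> occurrence.
            html = re.sub(r"<%s>" % selector,
                          '<%s style="%s">' % (selector, style), html)
        else:
            # Compound selectors such as "body #content" have no single-tag pair.
            skipped.append(selector)
    return html, skipped

html, skipped = inline_styles("<p>\n  this is a paragraph\n</p>", parse_css(CSS))
print(html)
print("not inlined:", skipped)
```

The plain p rule turns into an (element, style) pair without trouble, while body #content ends up in the skipped list, which is exactly the kind of rule that made this approach feel unsafe.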

Hack org-mime

I think the most problem-free approach is to follow org-export-html and ensure the generated email has the same style as the exported HTML. The org-mime package will eventually implement this, but I don't want to wait and decided to hack.

The script is formatted in a nice way and looks like a textbook C program: it first declares variables and functions, with concise documentation, so one can visualise the structure after reading for 5-10 minutes. But the implementation is way beyond my knowledge of the Emacs Lisp language. I looked up almost every function that is used; take this snippet for example,

(with-temp-buffer
  (insert html)
  (goto-char (point-min))
  (run-hooks 'org-mime-html-hook)
  (buffer-string))

I had no idea what it meant. You know the feeling when you try to learn a foreign language but pick a book that is way above your level, and you find there is not a single word you can understand, and you are like What The Hell? That was my feeling.

The strategy I came up with was to build up my Emacs Lisp vocabulary: try to understand the functions/processes and translate them into plain English, for example,

with-temp-buffer
    create a temporary buffer
insert
    insert the string, in this case called html, at point
goto-char
    move the cursor, which is called point in Emacs, to somewhere
point-min
    the beginning of a buffer/file
run-hooks
    run the functions linked to org-mime-html-hook
buffer-string
    return the buffer's contents as a string

Now that I understood each word, I needed to combine them to understand the meaning of this snippet. I tried to write it in plain English, and the first attempt was like this:

create a temporary buffer, insert the generated html, then move the cursor to the very start, and then apply the other functions linked to org-mime-html-hook

I continued this word-sentence-paragraph process and came to understand a few functions. But it can go on and on, and the more I learn about Emacs Lisp, the further I digress from my original goal: apply the style sheet to the HTML email. I guess this is a common dilemma when working with multiple languages. Usually I follow my interests, but this time I chose to focus on achieving the goal.

MIME Solution

It turned out to be the right decision. The concept of "inline CSS" is mentioned in the script; I googled it and found the solution within 10 minutes. I realised that all I need to do is add a block at the beginning of the HTML mail!! BINGO!

<head>
  <style>
    ...
  </style>
</head>
<!-- html email content starts here -->

Emacs Configuration

Here are the settings:

(require 'org-mime)
(add-hook 'org-mime-html-hook
          (lambda ()
            (insert
             "       
<head>
<style>
/* content of the .css file */
</style>
</head>"
             ))
          t)

Emacs for Writing

Last Updated: 31 Dec 2014

Do you use Emacs for writing LaTeX, Markdown, or org documents? Do you have a set of specific settings only for writing? In this article I will share my experience of configuring a writing mode in Emacs that makes it the most efficient writing tool for me.

Word Count

I try to write as concisely as possible, and I use word count as a benchmark. Counting words is not a trivial task in my case because I have a habit of commenting, even in general writing. I may comment out a whole paragraph and leave a note beside it about why; these are kept because they will be helpful when editing and reviewing. These comments and notes should not be counted, since the reader can't see them.

In addition to comments, there is a full list of things that do not count in technical articles, like source code, tables, figure captions etc. Some people may add the reference section to the list as well.

org-wc provides the org-wc-subtree function, which knows what to count and what not to count. Also, org-wc-display will loop through all sections and overlay the number of words on each section headline. It is particularly useful when I need to know which sections need trimming down and which need more.
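The counting rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not org-wc's actual algorithm: the function name count_words and the sample text are hypothetical, and the rule is deliberately crude (it skips comment lines, keyword lines, table rows and anything inside a #+BEGIN_.../#+END_... block).

```python
SAMPLE = """\
I try to write concisely.
# this paragraph was too long, cut it
#+BEGIN_SRC R
x <- c(1, 2, 3)
#+END_SRC
| a | b |
Only readers' words count.
"""

def count_words(org_text):
    """Count prose words, skipping comments, keywords, tables and blocks."""
    total = 0
    in_block = False
    for line in org_text.splitlines():
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith("#+begin_"):
            in_block = True          # source, example, comment blocks etc.
            continue
        if lowered.startswith("#+end_"):
            in_block = False
            continue
        if in_block:
            continue
        # "# comment" lines, "#+KEYWORD:" lines and table rows are not prose.
        if stripped.startswith("#") or stripped.startswith("|"):
            continue
        total += len(stripped.split())
    return total

print(count_words(SAMPLE))
```

Only the two prose sentences in the sample contribute to the total; the comment, the source block and the table row are ignored.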

One of my daily achievements is to complete a writing challenge: either write about 500 words or write for 45 minutes, whichever comes first. It is like a racing game for me, so knowing the time and the number of words is important. Tracking time is simple in Org-mode, but tracking words is problematic: I have to call the org-wc-subtree function manually. I raised an issue on GitHub and was guided to nanowrimo mode, which updates the word count while I am typing and shows it on the mode line.

It works out of the box for me. The number of words is adjacent to the time I spent, which makes them very convenient to compare. It also calculates the average number of words per minute, and uses this number to predict how long I need to achieve my daily goal (which is 500 words).

[Image: Screenshot 2014-12-26 14.39.07.png]

The picture above shows that I spent 30 minutes editing and there are 254 words in this section.
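As a rough illustration of that prediction, here is a small Python sketch using the numbers from the screenshot (254 words in 30 minutes, goal of 500 words). The function name minutes_to_goal is made up, and assuming a constant pace is my simplification; nanowrimo mode may well compute its estimate differently.

```python
def minutes_to_goal(words_so_far, minutes_so_far, goal_words=500):
    """Return (pace in words per minute, minutes still needed at that pace)."""
    pace = words_so_far / minutes_so_far
    remaining = max(goal_words - words_so_far, 0)
    if pace == 0:
        return pace, float("inf")   # nothing written yet, no estimate possible
    return pace, remaining / pace

pace, eta = minutes_to_goal(254, 30)
print(f"{pace:.1f} words/min, about {eta:.0f} more minutes")
```

At the pace in the screenshot, the remaining 246 words would take roughly as long again as the first 254 did.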

Variable-width Font

I have had a little OCD about fonts since university. I use Times New Roman for formal reports and any other serif font for general writing, because they make paragraphs and text easier to read.

There was a time my friend passed me a PDF file and asked me to review it. The problem was that it was in Arial font (I think), which looked terrible and made writing unpleasant. This experience made me think about what the best font for writing is.

I did some research and came across the concept of variable-width fonts. As a programmer, I use Adobe's Source Code Pro font as the default, which means I face a monospaced font all day. In a monospaced font, each character takes up the same space.

In a variable-width font, on the other hand, each character takes a width corresponding to its shape. For example, the width of "i" is about a quarter of that of "w". Needless to say, a variable-width font is closer to the nature of handwriting. Emacs has a built-in variable-pitch-mode that can change the font.

But will it make any difference to my writing? I am not sure at the moment, but I would like to have a special font that I use solely for writing. The link between the font and my writing mind will gradually become firm, and eventually increase my productivity in writing.

Sentence Highlight

Writing requires thinking and concentration. People have their own tips that help them stay focused and get writing done; they may relate to a place, a time or tools.

I tried many tips, like meditating before writing, drinking coffee and cutting off the internet, but none of them works very well; the effects seem random. One problem I have in writing is that I jump between sections quite often.

I tried highlighting one sentence at a time so that I can focus on the one I am writing. I found that the hl-sentence package does exactly what I want. Also, I followed the author's suggestion and tweaked the configuration to blur the other sentences to reduce the noise.

The current setting is twofold (highlight the current sentence, blur the rest) and helps me focus naturally: I don't need to force myself not to look at other sentences.

The sentence highlight feature also has a big impact on my writing process by making editing easier. One thing I want to achieve is a proper length for each sentence/paragraph: if it is too short, I will merge it; if it is too long, I will break it up into short sentences. The highlights give me a visual sense of the length, which I used to get by reading or counting. To check exactly how many sentences a paragraph has, I move the cursor to the end of a sentence with M-e, and then count how many flashes it takes to reach the end of the paragraph.

[Image: Screenshot 2014-12-26 18.54.03.png]

Wrap-up

I am fairly happy with nanowrimo, hl-sentence, variable-pitch-mode and the powerful Emacs. Thanks to all the authors who wrote the scripts: because of their quality work, many things work out of the box and I am able to integrate them seamlessly into my current workflow. My writing has become more efficient and productive, which makes me believe Emacs is the best writing tool for me.

Which program do you use for writing? Which feature do you like most?

How Do I Build This Blog

To Reader

Learning skills are becoming more and more important nowadays because there is just so much to learn, either for the daily job or out of personal interest. But have you ever thought about the way you learn, or how good your learning skills are? In this article, I want to share my experience of learning how to build this blog using Jekyll, and how I examine the way I learn via data and statistical analysis. It is worth reading if you:

  1. want to improve your learning skill,
  2. are trying to learn Jekyll or want to build a personal website,
  3. are interested in quantified-self project.

Why I Learn Jekyll/Blog

All my high-school classmates knew that I was really bad at writing. Things became worse when I was at university studying mathematics. I started to avoid writing at all costs: I deliberately volunteered to do the coding or maths bit and let others do the writing. Everybody in the group seemed to like me.

During my studies at Warwick University, I took a fancy to statistics and really enjoyed telling people the relationships between all sorts of facts. Then I came across The Guardian's Data Journalism program; data journalism is a term in use since 2009/2010 to describe a journalistic process based on analyzing and filtering large data sets for the purpose of creating a news story [1]. I found it really cool! I tried to analyse a new dataset from the World Bank and wanted to write an article, but when I sat down my mind went completely blank. I managed to write a few paragraphs but they were really awful. That was the first time I wished I had proper writing skills. I didn't bother to learn then because I planned to go back to China soon, but I decided to work in the UK at the last minute.

Last week, I was doing a statistical consultancy project. I did the analysis and wrote code in the morning, then tried to write it up in the afternoon. I realised it is pretty easy for me to do the analysis and coding, and it was really a pain to write it up, but I do enjoy it. So I decided to build a personal blog so that I can practise my writing skills and, at the same time, promote the use of statistics in daily life.

I had no prior knowledge of building a website and spent a couple of evenings/weekends sitting in front of a computer in a library or cafe trying to learn. The project spanned two months and took me about 20 hours in total to finally build this blog. Motivation really is crucial to learning. What makes this learning experience really unique is that I have data about my learning.

Process Raw Data

I have a very good habit: I record every single task I do, in terms of how much time I spent and what I did. Take this task for example,

PM (strcture)
SCHEDULED: <2014-12-11 Thu 17:45>
:LOGBOOK:  
CLOCK: [2014-12-11 Thu 17:41]--[2014-12-11 Thu 18:28] =>  0:47
:END:      
:PROPERTIES:
:Effort:   0:45
:END:
[2014-12-10 Wed 22:28]
wait for 4 minutes 
- [ ] go though all the headlines,
- [ ] group into few categories, like 1) learn, 2) apply, 3) improve etc.
- [ ] start to edit 

This note includes all the info I need to know:

  1. I created the task on Wednesday and estimated it would take 45 minutes to complete,
  2. I planned to do it on Thursday,
  3. I started on Thursday at 17:41, 4 minutes before the scheduled time,
  4. it took me 47 minutes to do the job, 2 minutes longer than expected,
  5. what I did (the main body).
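The arithmetic behind points 3 and 4 can be reproduced from the raw note. The following Python sketch is only an illustration under my own assumptions: the helper names clocked_minutes and hhmm_to_minutes are made up, and the regular expression only covers CLOCK lines shaped like the one in the example above.

```python
import re
from datetime import datetime

# Matches e.g. "CLOCK: [2014-12-11 Thu 17:41]--[2014-12-11 Thu 18:28] =>  0:47"
CLOCK_RE = re.compile(
    r"CLOCK: \[(\d{4}-\d{2}-\d{2}) \w+ (\d{2}:\d{2})\]"
    r"--\[(\d{4}-\d{2}-\d{2}) \w+ (\d{2}:\d{2})\]"
)

def clocked_minutes(line):
    """Return the minutes between the two timestamps of a CLOCK line, or None."""
    m = CLOCK_RE.search(line)
    if m is None:
        return None
    fmt = "%Y-%m-%d %H:%M"  # the weekday name is skipped by the regex
    start = datetime.strptime(m.group(1) + " " + m.group(2), fmt)
    end = datetime.strptime(m.group(3) + " " + m.group(4), fmt)
    return int((end - start).total_seconds() // 60)

def hhmm_to_minutes(text):
    """Convert an Effort value such as "0:45" into minutes."""
    hours, minutes = text.strip().split(":")
    return int(hours) * 60 + int(minutes)

spent = clocked_minutes("CLOCK: [2014-12-11 Thu 17:41]--[2014-12-11 Thu 18:28] =>  0:47")
effort = hhmm_to_minutes("0:45")
print(spent, "minutes spent,", spent - effort, "minutes over the estimate")
```

Running the same two helpers over every entry of the project is one way to accumulate the time per sub-task.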

I gathered all the notes that are relevant to this project and accumulated the time I spent on each sub-task. It can be summarised as a table:

Table 1: Clock summary at [2014-12-10 Wed 21:42]

Headline                                     Time
Total time                                  17:22
TODO blog                                   17:22
  DONE Jekyll official guide on github…             1:22
  workflow                                          1:45
  TODO how to change the look of html               2:11
  org-jeykll workflow (final)                       1:26
  DONE blog, does not looks good on…                0:10
  discovery jeykll template                         1:57
  DONE Jeykll disaster                              1:00
  NEXT tweak jkyll (add social content…             1:17
  DONE Jekyll (disqus and google…                   0:25
  Jekll,                                            2:23
  Jekyll general search                             0:53
  NEXT intro to jekyll                              0:28
  Jekyll code highlight                             1:03
  DONE Jekyll - non-doing action                    1:02

It tells me that I spent 17 hours and 22 minutes in total on this project. The first question is: what do these 17 hours mean to me? I spent 500-plus hours playing video games between 2013 and 2014, and I really got nothing out of it. I really enjoyed this learning experience because I had a definite goal I wanted to achieve, I put in the effort, and I got some results.

The rest of the table tells me the tasks I did in chronological order. The first task was to read the official Jekyll guide, on which I spent 1 hour 22 minutes. This table looks really awful and gives you an insight into what real-world data looks like. I didn't really know what to do with it, so I had to refer to the way I learned as a child.

In our education system, the way we learn is that we start at a basic level, study the topic, apply it, make mistakes, correct them, and continue this process at the next level. This gave me the idea to define "levels". So I skimmed the whole project, from beginning to end, and grouped the data into five categories:

Basis
    the foundations of Jekyll, websites and the HTML language,
Features
    the extended features provided by Jekyll, for example adding social network links or a discussion/comments section,
Workflow
    integrating the publishing process into my current workflow and trying to automate as much of the process as possible,
Try
    trying to implement something new for my own needs, or trying other people's ideas,
Fix
    fixing problems along the way as I built up the website.

Analysis 1 - Mixture of levels

The levels increase linearly, each one depending on the previous level. Ideally, the most cost-effective way for a person to learn is to focus on one step, master it, and then go on to the next. This is the way of our education system. But that won't be the case for a self-study project, which is most likely to be interest-driven. It would be very interesting to see how I was jumping between these five levels. I sorted out the timeline for each task I did and plotted the time against the levels.

a.gg.plot.png

Figure 1: timeline view

The x-axis is the timeline in minutes for this project and the y-axis is the level I defined. I can tell which level I was in at any point of the study. For example, for the first 200 minutes I was studying the basic knowledge, then I spent 35 minutes on Feature, and so on.

I am very surprised that I went to the Workflow level so early, at about 20% of the way through this project. It actually makes a lot of sense, because this project spanned 2 months, and having a workflow that suits me really accelerated it: it took no time to pick up again after leaving the project for a few days.

There was a time when the website was broken and I could not figure out why. It turned out I had misspelled the name of a configuration file.

Finally, I spent some time googling how other people use Jekyll, tried quite a few ideas, and never went back to a lower level.

Analysis 2 - Time distribution

It would also be interesting to see how long I spent on each level, i.e.

a.pie.chart.2.png

Figure 2: time distribution

Level      Time (Min)   Percentage
Basis             184         0.18
Feature           276         0.26
Workflow          191         0.18
Fix                70         0.07
Try               321         0.31

The pie chart should be read in the clockwise direction, and it is ordered by level. I spent 3 hours at the Basis level and to be honest I understand much about Jekyll. It is interesting to see that I spent 8 percentage points more of my time on the extended features than on the basic knowledge, i.e. using Jekyll rather than studying it.
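The Percentage column can be re-derived from the minutes with a few lines of Python. The variable names are my own; the numbers are the ones in the table above.

```python
minutes = {"Basis": 184, "Feature": 276, "Workflow": 191, "Fix": 70, "Try": 321}

total = sum(minutes.values())
# Share of the total time per level, rounded to two decimals as in the table.
shares = {level: round(m / total, 2) for level, m in minutes.items()}

print(f"total: {total} minutes = {total // 60}:{total % 60:02d}")
for level, share in shares.items():
    print(f"{level:<10}{minutes[level]:>5}{share:>7.2f}")
```

The total comes to 1042 minutes, i.e. the 17:22 reported in Table 1, and the gap between Feature (0.26) and Basis (0.18) is the 8 points mentioned above.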

Thanks to all the volunteers working on connecting Jekyll and Org mode, I only spent 3 hours setting up a workflow that includes writing an article in Org mode (plain text), converting it to an HTML web page, and then uploading it to my blog. Jekyll is sophisticated, well-tested software that can be configured easily, and I didn't run into any problems.

The remaining 5.4 hours were the most inefficient in this project. At that time I was keen to include a Table of Contents and Code Syntax Highlighting in my blog. I did a lot of further research into solutions and tried many of them. But none fit well with the foundation I had built up. Some were buggy and created more problems. If you really need these features, see this blog.

Analysis 3 - Benefits

The simple question: is it worth spending 18 hours building a website? Timewise, I have spent 20 hours on writing, but this number does not count, because I can write without a blog at all. But I feel that having this blog encourages me to write because:

  1. The writing has a destination. It is not a digital file on my computer that I will forget in a few weeks, or a page in a notebook that I hardly ever look back at and may lose. All my writing will be on this single website, which has a unique address that everybody can visit whenever and wherever they are.
  2. The aim of writing has changed. It is not only to express myself, like a diary, but to share a journal with others. I have to consider the readers' feelings: will they like it? Will they understand it? So the writing becomes more proactive, involves more thinking about human relationships, and is thus more fun.
  3. Writing is linked to this website, which is linked to the quantified-self project and Emacs/org-mode, which are in turn linked back to statistics and programming, my two main passions. Writing is not something that occurred to me one day and that I then swore to master, but something within my passion network that extends my passion to another area.

Conclusion

[2014-12-19 Fri 13:21] First, quantify my time on learning is

Words: 1657, Write: 5 Hours

Learning How to Learn: Powerful mental tools to help you master tough subjects
The Data Journalism Handbook

Footnotes:

[1] wiki
