Yi Tang Data Science and Emacs

Less Excel, More R/Python in Emacs

Table of Contents

  1. Excel Is Great
  2. But
  3. Emacs Has More To Offer

Excel Is Great

Regardless of how powerful and convenient the R/Python data ecosystem becomes, there is still value in looking at the data in Excel, especially when exploring the data together with less technical people.

Thanks to its simple interface, Excel is widely used in data analysis: hover the mouse to select columns, apply filters, then calculate some statistics. Most of the time that is all it takes to get the answers the clients are seeking.

I recently realised that being transparent and working with the tools that clients use plays a crucial role in strengthening trust and delivering impact to the business. Sometimes I think I should do more in Excel.

But

The problem with Excel is reproducibility: I’m not able to codify the clicks done in Excel and integrate them into an automated data pipeline. It is rather foolish to have quality control procedures in the system, including code reviews, automated testing, CI etc., but at the very end drop all those gatekeepers and fall back on error-prone manual steps.

Plus, it is far more efficient to have everything done in one place, giving a smooth process with no friction. It is a key factor in enabling quick turnaround.

So I am motivated to limit my use of Excel to delivering data to the business and paired data analysis. Once again, I have been looking into how much can be done without leaving Emacs.

Emacs Has More To Offer

I was pleased to discover the ess-view-data package and its Python counterpart python-view-data. They interact with an active R/Python session in Emacs and print out data.frame objects in plain text, a.k.a. view data. What’s more, they can process the data before viewing, for example subsetting the data row/column-wise, summarising the dataset etc.

The package keeps a record of the data processing pipeline so in the end I would have a copy of the R/Python code that generates the output. I can then effortlessly transfer the code to a script to ensure reproducibility in the future.

Another benefit derives from having a plain text buffer for the data. It is handy for exploring large datasets with an excessive number of columns. For example, the dataset I work on daily has about 300 columns. It contains different flavours of the financials: raw values, imputed, ranked, smoothed etc.

It’s not possible to remember all the column names, even after spending extra time giving them meaningful names and checking that the correct columns are referred to. Having a persistent plain text buffer that I can search makes finding the right column names a lot easier. It also helps to check what’s in and not in the data.

That’s my first impression of Shuguang Sun’s packages; they look promising.

Blog in Emacs - Use Jekyll's Draft Mode

Why?

I wasn’t aware of Jekyll’s draft mode. My workaround was manually changing the published field in the front matter to true when the post was ready to publish. It works fine. However, with native support from Jekyll, there are more benefits to using the draft mode.

To start with, I like that the drafts are saved in the _drafts folder, not mixed with published posts in the _posts folder. It is much cleaner and easier to manage. At a glance, I can see which posts I am drafting.

It also gives peace of mind: only posts under the _posts folder are exported and shown in my blog. It ensures I don’t accidentally publish a post that is still a draft.

Once there are files in the _drafts folder, adding the --drafts argument to the jekyll serve command is all I need to be able to see the drafts locally.

Of course, I also need to write a bit of Lisp code to integrate the draft mode into my blogging workflow. That is what the rest of this post is about.

Implementation

For a blog post, I have the source file in org-mode and its exported file in Markdown. Now there is a new location dimension: they can be either in the _drafts or _posts folder.

mode     source file (org mode)                exported md in Jekyll
draft    org/_drafts/on_image.org              jekyll/_drafts/on_image.md
publish  org/_posts/2027_02_08_on_image.org    jekyll/_posts/2027_02_28_on_image.md

In terms of content, the published post and its final draft, and their exported counterparts, are the same, only in different locations. Their content could differ to allow some flexibility, e.g. the published post could have higher-resolution screenshots. This feature could be implemented in the future. For now, I follow the simple “same but in different places” rule.

The new process looks like this: when I publish a post, it moves the org file from the _drafts to the _posts folder, adds a date to the filename (which I have already), and then triggers the exporting process. To avoid duplication, it removes the original org file and its exported draft in Markdown.

To achieve that, the main missing piece from my current Emacs configuration is the yt/jekyll-find-export function (see below). For a post in _drafts or _posts, it finds the full path of the corresponding exported markdown file. I can then delete it or start the exporting process.

 
(defun yt/jekyll-is-draft-p ()
  "if the file is inside of the draft directory, it is a draft."
  (let ((draft-dir  (file-truename jekyll-source-drafts-dir))
        (filepath (file-truename (buffer-file-name))))
    (string-prefix-p draft-dir filepath)))


(defun yt/jekyll-find-export ()
  "find the full path to the exported file of the current post."
  (let* ((src-file (file-name-nondirectory (buffer-file-name)))
         (dest-file (file-name-with-extension src-file ".md")))
    (if (yt/jekyll-is-draft-p)
        (file-name-concat jekyll-site-draft-dir dest-file)
      (file-name-concat jekyll-site-post-dir dest-file))))
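
The move-and-rename part of the publishing step described above could be sketched roughly as follows. This is a minimal sketch, not the code from my configuration: the names my/jekyll-dated-name and my/jekyll-publish-draft are hypothetical, and the underscore date prefix is an assumption copied from the file names in the table above. Exporting and deleting the stale Markdown draft are deliberately left out.

```elisp
(defun my/jekyll-dated-name (file &optional time)
  "Return the base name of FILE prefixed with a date.
For example on_image.org becomes 2027_02_08_on_image.org.
TIME defaults to the current time.  Hypothetical helper; the
underscore date format is an assumption."
  (concat (format-time-string "%Y_%m_%d_" time)
          (file-name-nondirectory file)))

(defun my/jekyll-publish-draft (draft-file posts-dir)
  "Move DRAFT-FILE into POSTS-DIR under a dated name.
Return the new path.  Exporting and cleaning up the old Markdown
draft are left to the caller."
  (let ((dest (expand-file-name (my/jekyll-dated-name draft-file)
                                posts-dir)))
    ;; `rename-file' signals an error if DEST already exists,
    ;; so an existing post is never overwritten silently.
    (rename-file draft-file dest)
    dest))
```

Calling (my/jekyll-publish-draft (buffer-file-name) "org/_posts/") from a draft buffer would then be followed by the export, with yt/jekyll-find-export locating the old Markdown draft to delete.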

Blog in Emacs - Work with Images

I do my best to keep my blog simple; I would not use images/videos unless I can’t demonstrate something well enough in plain text, for example, to demonstrate a mobile app using a screenshot (Learn in Emacs - Building Up Vocabulary) or to show how to represent stock price charts for a neural network (Speed Up Sparse Boolean Data).

Even when I do, I keep the usage to the bare minimum: all I do is insert the image, centre it, and put a caption on top of it.

Org-mode supports images well: with a few additional HTML attributes for each inserted image, I can finely control the images’ position, alignment, size etc.

However, I can’t get the benefits because I migrated my blog posts from HTML to the Markdown format for its simplicity. Plus Jekyll comes with its little quirks when it comes to Markdown images. So I have to write something for myself.

I managed to achieve a satisfactory workflow for my simple usage of images in Emacs. It works well for the Jekyll site. Here’s the code and explanation.

  • org-download: the package I use to create the images for blogging from various sources.

    I can drag images from external applications to Emacs, including browsers, Preview, or iPhoto. The images will be saved in the /project/assets/org-download folder per my matrix/project setup.

    For applications that I can’t drag images from, I take a screenshot inside Emacs by calling the org-download-screenshot function.

  • yt/jekyll-copy-org-download-to-assets: a little helper function that transfers the files under the org-download folder to the /assets folder in a Jekyll site.

    It lists the files in the source org-download folder and provides them as a selection list. It comes with auto-completion and fuzzy matching to help me choose the file.

    It also strips out the special characters in the filename; otherwise the URL will be broken in Jekyll.

  • yt/jekyll-insert-image: lists the files in the /assets folder so I can choose easily which image to use.

    It brings up the Liquid template for an image so I don’t have to remember its syntax. It ensures the file path is in the correct format (starts with /assets/); I just fill in the caption and size after selecting the file.

An extract of the code is listed below for demonstration purposes. Future updates will be reflected in my .emacs.d git repo.

 

(defun yt/jekyll-insert-image (src caption)
  (interactive (list (read-file-name "images to include: " jekyll-assets-dir)
                     (read-string "Caption: ")))
  (insert (format jekyll-insert-image-liquid-template (file-name-nondirectory src) caption)))

(defun yt/jekyll-copy-org-download-to-assets (file)
  "copy file from project org-download folder to the blog assets folder.
it ensures there's no underscore(_) in the file name.
"
  (interactive (list (read-file-name "file to copy: " org-download-image-dir)))
  (let* ((ext (file-name-extension file ))
        (base (file-name-base file))
        (dest-base (jekyll-make-slug base))
        (dest-file (expand-file-name (file-name-with-extension dest-base ext) jekyll-site-assets-dir)))
    (copy-file file dest-file)
    dest-file))

Learn in Emacs - Building Up Vocabulary

Table of Contents

  1. WHY?
  2. Workflow for Building Vocabulary
  3. Revision on Mobile Devices
  4. Org-mode Based Simple Study Strategies
  5. Emacs Lisp Implementation

WHY?

Research shows having effective and rapid communication can boost creativity and spark joy1. I believe in it from my personal experience in conversing, reading a book in my native language or trying to understand a large codebase.

I wasn’t able to achieve similar results when it came to using English. In the past I tried to improve my English language skills to boost my productivity in reading books and to make reading more enjoyable. My approach was to practise reading and writing more. However, I started questioning its effectiveness. This year I decided to take one step back to focus on the basics and improve my vocabulary.

I want to take the “slip-box” method2, which proved to be effective for learning the Emacs Lisp language. It is a bottom-up approach, so I would have one note for each word with the explanation in it, and links to other similar words or words I got confused with.

One advantage is that I can also leverage my existing setup.

Workflow for Building Vocabulary

When I come across a new word whose meaning I’m not sure about, I will

  1. move the cursor to the word,
  2. press F1 d to look it up in the dictionary; the result will be shown in the osx-dictionary buffer,
  3. read its meaning and try to understand it,
  4. press r to listen to the pronunciation and read after it. I usually repeat it a couple of times to deepen the memory,
  5. press a to create an atomic note. It has the dictionary meaning in it for future reference,
  6. edit the note to add my understanding and copy the sentence/paragraph that contains the new word,
  7. press C-c C-c to save it to my vocabulary database, which is just a folder with flat org-mode files.

There’s quite a lot of automation so I can focus on understanding the word (Step 3) and writing a good note (Step 6) in my own words.

This workflow depends on two Emacs packages:

  • osx-dictionary: it interfaces with macOS’s dictionary app. It displays the meaning and says the pronunciation.

    The package is well written and easy to work with; I managed to extend it to add Steps 5-7 with little effort.

    It has limitations: it works only in macOS and it only outputs one dictionary. Adding the meaning in Chinese requires a few more manual steps: 1) press o to open the Dictionary.app, 2) go to the Chinese dictionary tab and copy the meaning, and 3) paste it to the note in Emacs.

    I personally find the Oxford dictionary that macOS uses hard to follow. From time to time I have to visit https://dictionary.cambridge.org/ to find an explanation that I can understand. In transforming my old vocabulary notes to the new format, I found the explanations from vocabulary.com to be the best. I might have to resurrect my voca-builder3 package.

  • org-roam: it interfaces with org-mode for creating atomic notes. It avoids duplication: if there’s a note for the word that exists already, it opens the note, so I can have a look and enrich it.

    I can link notes/words in my vocabulary database which is very useful because for me learning by comparing is super effective.

    Org-mode provides a lot of functionality that might be useful to facilitate learning in the future.

Revision on Mobile Devices

Once I have a fleet of notes, the next step is to revise them regularly. The routine I’m trying to get myself into is rereading the notes I created over the last few days while waiting for the tube/bus; I call it a revision break.

So far I have an Emacs lisp program4 that filters all my notes by time so I have last_24_hours.org, last_3_days.org and last_7_days.org. These files are synced with iCloud so they are available to review on my iPad and iPhone using the beorg App.


Reading my vocabulary notes on iPhone
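
The time-based filtering mentioned above could be sketched as below. This is only an illustration under stated assumptions, not my actual program: the function name my/voca-notes-since is hypothetical, and it uses each file's modification time, whereas the real filter may well use the creation timestamp instead.

```elisp
(require 'seq)

(defun my/voca-notes-since (dir hours)
  "Return the .org files in DIR modified within the last HOURS hours.
Hypothetical helper; it relies on file modification times."
  (let ((cutoff (time-subtract (current-time)
                               (seconds-to-time (* hours 3600)))))
    (seq-filter
     (lambda (file)
       ;; Keep the file when its modification time is after the cutoff.
       (time-less-p cutoff
                    (file-attribute-modification-time
                     (file-attributes file))))
     (directory-files dir t "\\.org\\'"))))
```

With this, (my/voca-notes-since dir 24) would give the candidates for last_24_hours.org, and (* 3 24) or (* 7 24) hours the other two files.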

Org-mode Based Simple Study Strategies

For the dedicated study sessions, I need a few strategies to shortlist the notes. I think they will be based on the metadata of the note. With org-mode’s API, it should be easy to implement.

I haven’t done it yet, but the idea is to score the notes from 0 to 5: 5 means the most important notes, so I would study them first; 0 means unimportant notes, which go to the bottom.

There can be multiple scores, for example, one for pronunciation. A word whose pronunciation I got completely wrong would get a 5, and a word would get a 3 if I sometimes got it wrong and sometimes got it right.

Another score is how many times I have looked up the word. There are words that I just keep forgetting, or keep confusing with another similar word. So the ‘visited_at’ property gets a timestamp appended at the time of visiting, and the score is calculated from the number of timestamps.
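
Since this part is not implemented yet, the following is only a sketch of how the ‘visited_at’ idea might look with org-mode’s property API. The function names are hypothetical, the timestamp format is an assumption (chosen to contain no spaces, because Org separates multi-valued properties with spaces), and capping the score at 5 is my assumption to keep it on the 0 to 5 scale.

```elisp
(require 'org)

(defun my/voca-record-visit ()
  "Append the current time to the visited_at property of the entry at point."
  (org-entry-add-to-multivalued-property
   (point) "visited_at" (format-time-string "%Y-%m-%dT%H:%M:%S")))

(defun my/voca-visit-score ()
  "Return a 0-5 score from the number of visited_at timestamps at point."
  ;; One point per recorded visit, capped at 5 (an assumption).
  (min 5 (length (org-entry-get-multivalued-property (point) "visited_at"))))
```

Calling my/voca-record-visit whenever a note is opened would be enough to build up the history; the score can then be computed on demand when shortlisting notes.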

Emacs Lisp Implementation

Adding an action to the header line in osx-dictionary’s buffer.

 
(require 'osx-dictionary)
(setq osx-dictionary-mode-header-line
      (append '((:propertize "a" face mode-line-buffer-id)
                ": Add to vocabulary"
                "    ")
              osx-dictionary-mode-header-line))

Binding yt/add-to-vocabulary to the key a in the osx-dictionary buffer. It creates a note using org-roam.

 
(defvar vocabulary-repo-dir "~/matrix/learning/meta-leanring/vocabulary"
  "where to save the vocabulary notes.")
(defvar yt/voca--roam-template
  '(("d" "default" plain "%?" :target
     (file+head "%<%Y%m%d%H%M%S>-${slug}.org" "#+title: ${title}

%?

#+begin_example
%(yt/osx-dict--get-meaning)
#+end_example

")
     :unnarrowed t))
  "roam template for vocabulary notes")


(defun yt/add-to-vocabulary ()
  "add a new vocabulary note for the highlighted region or word at
point."
  (interactive)
  (let* ((org-roam-directory (expand-file-name "notes" vocabulary-repo-dir))
         (org-roam-db-location (expand-file-name "org-roam.db" org-roam-directory ))
         (org-roam-capture-templates yt/voca--roam-template))
    (org-roam-node-find nil (yt/osx-dict--get-word-and-pronounce))))

(defun yt/osx-dict--get-word-and-pronounce ()
  "extract the word and its pronunciation from the *osx-dictionary* buffer"
  (with-current-buffer "*osx-dictionary*"
    (goto-char (point-min))
    (search-forward "|" nil nil 2)
    (buffer-substring-no-properties (point-min) (point))))

(defun yt/osx-dict--get-meaning ()
  "wrap the *osx-dictionary* buffer content as a string"
  (with-current-buffer "*osx-dictionary*"
    (buffer-substring-no-properties (point-min) (point-max))))

(define-key osx-dictionary-mode-map "a" 'yt/add-to-vocabulary)

Footnotes

1 reference is lost; it is somewhere in the book “The Second Mountain”, the chapter on religions.

2 from the book “How to Take Smart Notes”. I plan to reread this book in early 2024.

3 My first Emacs package in 2015, https://github.com/yitang/voca-builder

4 next blog post is on my lisp programs

Atomic Habit in Emacs - Keep Git Repos Clean

Table of Contents

  1. Why?
  2. Emacs Lisp Helper
  3. Practise

Why?

I am having a hard time keeping my git repositories clean: there are just too many repositories, I counted 31 in total, and I have 5 computers where I work on them.

The consequence is that sometimes I get surprised at seeing a lot of seemingly useful changes that are not committed to the git repo. I had to stop whatever I was doing to just think about what to do with those changes. It breaks the flow!

There are other occasions where I thought I had fixed some bugs, but I don’t have the patches on my laptop. It turned out I hadn’t pushed them to the cloud, so I have to log back in to the right server to run a couple of git commands, or if I don’t have access to the servers, I have to fix the bugs from scratch again. It is inefficient!

It can happen a lot in active projects where I work on multiple systems and multiple git repos, or when I travel. I plan to revisit my filesystem (which is inspired by Stephen Wolfram 1) and tech setup to reduce the number of repos by merging them and keeping only 1 laptop, 1 workstation and 1 server. This is something for the summer; it can reduce the severity of the problem but cannot eliminate it.

At the moment, I just have to become more disciplined in managing files, e.g. to have an atomic habit of checking my git repos regularly, or at least once at the end of the day, or as part of the shutdown ritual after finishing a task2.

Emacs Lisp Helper

The 3rd Law of Behavior Change is make it easy.

James Clear, Atomic Habits

To facilitate the forming of this habit, I implemented a utility function in Lisp to list the dirty git repos and provide a clickable link to the magit-status buffer of each repo. With one click on the hyperlink, I can start to run git commands via the mighty magit package. I bind this action to the keystroke F9-G.

 
(defun yt/git--find-unclean-repo (root-dir)
  "Return the git repos under ROOT-DIR that have uncommitted changes."
  ;; Collect the results in a local variable rather than a global one.
  (let (out)
    (dolist (dir (directory-files-recursively root-dir "\\.git$" t))
      (message "checking repo %s" dir)
      (let* ((git-dir (file-name-parent-directory dir))
             (default-directory git-dir))
        ;; `git status --porcelain' prints nothing for a clean repo.
        (unless (string= "" (shell-command-to-string "git status --porcelain"))
          (push git-dir out))))
    out))


(defun yt/dirty-git-repos (&optional root-dir)
  "List the dirty git repos under ROOT-DIR in an Org buffer.
Each repo gets a clickable link to its magit-status buffer.
ROOT-DIR defaults to `default-directory' when called from Lisp."
  (interactive (list (read-directory-name "Where's the root directory? ")))
  (let ((buffer (get-buffer-create "*test-git-clean*"))
        (git-repos (yt/git--find-unclean-repo (or root-dir default-directory))))
    (with-current-buffer buffer
      (unless (derived-mode-p 'org-mode)
        (org-mode))
      ;; Start from an empty buffer so repeated calls do not pile up.
      (erase-buffer)
      (insert (format "Number of dirty git repos: %s " (length git-repos)))
      (dolist (git-repo git-repos)
        (insert (format "\n[[elisp:(magit-status \"%s\")][%s]]" git-repo git-repo))))
    ;; Show the list so the links can be clicked.
    (pop-to-buffer buffer)))

The workhorse is the git status --porcelain command: if the git repo is clean, it returns nothing; otherwise, it outputs the names of the files whose changes are not checked in. In the example below, the first file is modified (M), and the second file is untracked (??).

 M config/Dev-R.el
?? snippets/org-mode/metric
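
My functions only check whether this output is empty, but if the individual entries were ever needed, the format is simple enough to split. The helper below is a hypothetical sketch (my/git-parse-porcelain is not part of my configuration): it assumes the short format shown above, where each line has a two-character status, a space, and then the path.

```elisp
(defun my/git-parse-porcelain (output)
  "Parse OUTPUT of `git status --porcelain' into (STATUS . PATH) pairs.
STATUS is the two-character code, e.g. \" M\" or \"??\"."
  (mapcar (lambda (line)
            ;; Columns 0-1 hold the status, column 2 is a space,
            ;; and the path starts at column 3.
            (cons (substring line 0 2) (substring line 3)))
          ;; Do not trim lines: a leading space is part of the status.
          (split-string output "\n" t)))
```

This sketch does not treat renamed entries (old -> new) or quoted paths specially, so it is only suitable for simple cases like the two lines above.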

The rest of the code turns the results into a user-friendly format in Org-mode. What’s interesting is that Org-mode provides a kind of hyperlink that evaluates Lisp expressions. Take the example below:

 
[[elisp:(magit-status "/foo")][Git Status of Repo /foo]]

The description of the hyperlink is “Git Status of Repo /foo”. After I click it, it runs the expression (magit-status "/foo"), which shows the git status of the /foo repo in a dedicated buffer.

Before executing, it will ask for confirmation. It can be a bit annoying and inconvenient at first, which naturally leads to the temptation of removing this behaviour by setting org-link-elisp-confirm-function to nil. I discourage you from doing so in case someone embeds malicious code (for example rm -rf ~/) in a hyperlink, so make sure to check that variable’s documentation before changing it3!
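
A middle ground between the default prompt and disabling it altogether might look like the sketch below. It is an illustration rather than my own configuration: it swaps the prompt for a single-key one and skips it only for links that consist of exactly one magit-status call. The regexp is my own assumption about what is safe enough, so review it before relying on it.

```elisp
(require 'org)

;; Ask with a single keypress (y/n) instead of typing "yes".
(setq org-link-elisp-confirm-function #'y-or-n-p)

;; Skip the prompt only when the whole link is one `magit-status'
;; call with a literal string argument; everything else still asks.
(setq org-link-elisp-skip-confirm-regexp
      "\\`(magit-status \"[^\"]*\")\\'")
```

Links generated by yt/dirty-git-repos match this pattern, while anything longer or different still triggers the confirmation.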

Practise

It was fun to write the Lisp functions. I learnt how to use optional function arguments and interactive so that the function can be used both interactively and programmatically. I would very much like to spend more time coding, to enhance it with some ideas I got from reading Xu Chunyang’s osx-dictionary package4.

However, the effectiveness of those functions has little to do with the extra features I had in mind; it really depends on how I use them. Solving the problem requires deliberate practice and changing my behaviour so that cleaning git repos becomes a habit of mine, which is always the hardest part.

One key indicator for this habit5 can be the number of check-ins: I will see whether there is a substantial increase from today.

Footnotes

1 see Stephen Wolfram’s blog posts

2 Cal Newport, Deep Work, Page 151

3 https://orgmode.org/manual/Code-Evaluation-Security.html

4 https://github.com/xuchunyang/osx-dictionary.el

5 inspired by Andrew Grove’s book High Output Management
