Yi Tang Data Science and Emacs

Atomic Habit in Emacs - Keep Git Repos Clean

Table of Contents

  1. Why?
  2. Emacs Lisp Helper
  3. Practise

Why?

I am having a hard time keeping my git repositories clean: there are just too many repositories, I counted 31 in total, and I have 5 computers where I work on them.

The consequence is that sometimes I get surprised at seeing a lot of seemingly useful changes that are not committed to the git repo. I had to stop whatever I was doing to just think about what to do with those changes. It breaks the flow!

There are other occasions where I thought I had fixed some bugs, but the patches are not on my laptop. It turned out I didn’t push the changes to the cloud, so I have to log back in to the right server to run a couple of git commands or, if I don’t have access to the servers, fix the bugs from scratch again. It is inefficient!

It can happen a lot in active projects where I work on multiple systems and multiple git repos, or when I travel. I plan to revisit my filesystem (which is inspired by Stephen Wolfram 1) and tech setup to reduce the number of repos by merging them and keeping only 1 laptop, 1 workstation and 1 server. This is something for the summer; it can reduce the severity of the problem but cannot eliminate it.

At the moment, I just have to become more disciplined in managing files, e.g. to have an atomic habit of checking my git repo regularly, or at least do it once at the end of the day, or as part of the shutdown ritual after finishing a task2.

Emacs Lisp Helper

The 3rd Law of Behavior Change is make it easy.

James Clear, Atomic Habits

To facilitate the forming of this habit, I implemented a utility function in Emacs Lisp to list the dirty git repos and provide a clickable link to the magit-status buffer of each repo. With one click on the hyperlink, I can start running git commands via the mighty magit package. I bind this action to the keystroke F9-G.

 
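(defun yt/git--find-unclean-repo (root-dir)
  "Return the git repos under ROOT-DIR that have uncommitted changes."
  ;; Keep the result in a local variable instead of a global `out'.
  (let ((out nil))
    (dolist (dir (directory-files-recursively root-dir "\\.git$" t))
      (message "checking repo %s" dir)
      ;; `file-name-parent-directory' requires Emacs 29 or later.
      (let* ((git-dir (file-name-parent-directory dir))
             (default-directory git-dir))
        ;; A clean repo prints nothing for `git status --porcelain'.
        (unless (string= "" (shell-command-to-string "git status --porcelain"))
          (push git-dir out))))
    out))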


(defun yt/dirty-git-repos (&optional root-dir)
  "List the dirty git repos under ROOT-DIR.
Each repo is shown as a clickable link to its magit-status buffer.
ROOT-DIR defaults to `default-directory'."
  (interactive (list (read-directory-name "Where's the root directory? ")))
  (let ((buffer (get-buffer-create "*test-git-clean*"))
        (git-repos (yt/git--find-unclean-repo (or root-dir default-directory))))
    (with-current-buffer buffer
      (unless (eq major-mode 'org-mode)
        (org-mode))
      ;; Start from an empty buffer so repeated calls do not pile up.
      (erase-buffer)
      (insert (format "Number of dirty git repos: %s " (length git-repos)))
      (dolist (git-repo git-repos)
        (insert (format "\n[[elisp:(magit-status \"%s\")][%s]]" git-repo git-repo))))
    ;; Show the report.
    (pop-to-buffer buffer)))

The workhorse is the git status --porcelain command: if the git repo is clean, it returns nothing; otherwise, it outputs the names of the files whose changes are not checked in. In the example below, the first file is modified (M) and the second file is untracked (??).

 M config/Dev-R.el
?? snippets/org-mode/metric
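
To make the status codes concrete, here is a minimal Bash sketch that reads porcelain-style lines from standard input and labels them. The function name describe_porcelain is hypothetical, and it only recognises the two codes shown above plus a staged modification; any other line is reported as "other".

```bash
# Label each line of `git status --porcelain` output read from stdin.
# Each line is "XY PATH": two status characters, a space, then the path.
describe_porcelain() {
    local line path
    while IFS= read -r line; do
        path=${line#???}   # drop the two status characters and the space after them
        case "$line" in
            '?? '*) echo "untracked: $path" ;;
            ' M '*) echo "modified, not staged: $path" ;;
            'M  '*) echo "modified, staged: $path" ;;
            *)      echo "other: $line" ;;
        esac
    done
}

printf ' M config/Dev-R.el\n?? snippets/org-mode/metric\n' | describe_porcelain
# modified, not staged: config/Dev-R.el
# untracked: snippets/org-mode/metric
```

Inside a real repo the same function could be fed with git status --porcelain | describe_porcelain.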

The rest of the code turns the output into a user-friendly format in Org-mode. What’s interesting is that Org-mode provides a kind of hyperlink that evaluates Lisp expressions. Take the example below:

 
[[elisp:(magit-status "/foo")][Git Status of Repo /foo]]

The description of the hyperlink is “Git Status of Repo /foo”. After I click it, it runs the expression (magit-status "/foo"), which shows the git status of the /foo repo in a dedicated buffer.

Before executing the expression, Org-mode asks for confirmation. This can be a bit annoying and inconvenient at first, which naturally leads to the temptation of removing this behaviour by setting org-link-elisp-confirm-function to nil. I discourage you from doing so in case someone embeds malicious code (for example, rm -rf ~/) in a hyperlink, so make sure to check that variable’s documentation before changing it3!

Practise

It was fun to write the Lisp functions. I learnt how to use the optional function argument and interactive so that the function can be used both interactively and programmatically. I would very much like to spend more time coding, to enhance it with some ideas I got from reading Xu Chunyang’s osx-dictionary package4.

However, the effectiveness of those functions has little to do with the extra features I had in mind; it really depends on how I use them. Solving the problem requires deliberate practice and changing my behaviour so that cleaning git repos becomes a habit of mine, which is always the hardest part.

One key indicator for this habit5 can be the number of check-ins: I will see whether there’s a substantial increase from today.

Footnotes

1 see Stephen Wolfram’s blog posts

2 Cal Newport, Deep Work, Page 151

3 https://orgmode.org/manual/Code-Evaluation-Security.html

4 https://github.com/xuchunyang/osx-dictionary.el

5 inspired by Andrew Grove’s book High Output Management

GPG in Emacs - Functions to Decrypt and Delete All

Table of Contents

  1. Motivation
  2. Emacs Lisp Implementation
  3. Bash Implementation

Motivation

Continuing from my last post, the EPA provides a seamless interface when working with GPG files in Emacs. But there are situations where I have to work with GPG files using other programs (mostly Python), where EPA cannot help.

For those cases, I have to decrypt the GPG files first before using them (for example, calling pandas.read_csv).

Obviously, there’s no point in encrypting a file if there is a decrypted version next to it. So I also need a function to delete all the decrypted files.

Emacs Lisp Implementation

Since I run Python inside Emacs, I wrote Lisp functions to decrypt the GPG files and to delete all the decrypted files.

 
(defun yt/gpg--decrypt-recursively (root-dir)
  "Decrypt all the files ending in .gpg under ROOT-DIR.
Each decrypted file has the same filename but without the .gpg extension.

It stops if the decryption fails."
  ;; Prompt for the directory when called interactively.
  (interactive "DRoot directory: ")
  ;; The trailing \\' anchors the match at the end of the filename.
  (dolist (file (directory-files-recursively root-dir "\\.gpg\\'"))
    ;; the 2nd argument for epa-decrypt-file can only be the base filename without the directory.
    (let ((default-directory (file-name-directory file)))
      (epa-decrypt-file file (file-name-base file)))))

(defun yt/gpg--delete-decrypted-files (root-dir)
  "Delete the decrypted files under ROOT-DIR.

e.g. if there's a file foo.tar.gz.gpg, it attempts to remove the foo.tar.gz file."
  (interactive "DRoot directory: ")
  (dolist (file (directory-files-recursively root-dir "\\.gpg\\'"))
    (delete-file (file-name-sans-extension file))))

A bit of explanation:

  • directory-files-recursively: searches for files with a pattern. Here, it returns all the files ending with .gpg under the given root-dir,
  • dolist: loops over the GPG files to process them one by one,
  • epa-decrypt-file: decrypts a GPG file into a new file.
  • delete-file: deletes a given filename.

It seems the epa-decrypt-file function does not like the new filename with the directory in its path, so I have to set the default directory (working directory) and use the base filename after removing the directory as a workaround.

Bash Implementation

It would be useful to have this functionality outside of Emacs, so I implemented the counterparts in Bash.

 
function decrypt_recursively() {
    # PS: this function is equivalent to `gpg --decrypt-files $1/**/*.gpg`
    # NB: the output of $(find ...) is split on whitespace, so filenames must not contain spaces.
    for fn in $(find "$1" -iname "*.gpg")
    do
        echo "decrypt ${fn} to ${fn%.*}"
        gpg -o "${fn%.*}" -d "${fn}"
    done
}


function remove_decrypted_files() {
    for fn in $(find "$1" -iname "*.gpg")
    do
        echo "removing ${fn%.*}"
        rm "${fn%.*}"
    done
}

The interface is the same: given a root directory, it decrypts all the GPG files or deletes the decrypted files.

A little bit of Bash:

  • $1: refers to the first function argument, $2 refers to the second function argument and so on. This is the Bash way. When the function is called, $1 will be replaced with the actual argument, here it means the root directory.

  • $(find …): is a list of files returned by the find program. In this context, it stands for all the files whose filename ends with .gpg.

    The same can be achieved with the ls program and a glob, but it will be a lot slower 1 and requires some configuration on macOS 2.

  • ${fn%.*}: removes the last file extension of the variable $fn, for example, foo.tar.gz.gpg becomes foo.tar.gz.

    Another approach is using $(basename $fn .gpg) to remove the .gpg extension explicitly; note that basename also drops the directory part of the path.

  • for, do, done: loops through each file.
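
The ways of stripping the extension mentioned above can be compared side by side. This is only an illustrative sketch; the path data/foo.tar.gz.gpg is a made-up example.

```bash
fn="data/foo.tar.gz.gpg"

echo "${fn%.*}"                 # data/foo.tar.gz  (shortest suffix matching .* is removed)
echo "${fn%.gpg}"               # data/foo.tar.gz  (only a literal .gpg suffix is removed)
echo "$(basename "$fn" .gpg)"   # foo.tar.gz       (the directory part is dropped as well)
```

The ${fn%.gpg} form is a little stricter than ${fn%.*}: it leaves the name untouched when the file does not end in .gpg.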

The Bash functions have the advantage of being easily incorporated into the system, for example, calling the remove_decrypted_files function automatically prior to shutting down or after login.
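
One caveat of looping over $(find …) is that the shell splits the result on whitespace, so a file such as "my notes.csv.gpg" would be processed as two broken names. A possible whitespace-safe variant is sketched below; the name remove_decrypted_files_safe is hypothetical, and it assumes a find that supports -iname and -exec … + (GNU and BSD find both do).

```bash
# Delete the decrypted counterpart of every *.gpg file under "$1".
remove_decrypted_files_safe() {
    # find hands each path to the inline script as a separate argument,
    # so a filename is never re-split on spaces.
    find "$1" -type f -iname '*.gpg' -exec sh -c '
        for fn in "$@"; do
            echo "removing ${fn%.*}"
            rm -f -- "${fn%.*}"
        done
    ' sh {} +
}
```

It is called the same way as the original, e.g. remove_decrypted_files_safe ~/projects. The rm -f means a missing decrypted file is silently skipped.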

Footnotes

1 why glob is slow

2 how to enable globstar option in MacOS

GPG in Emacs - First Step Towards Data Security

Table of Contents

  1. WHY?
  2. GNU Privacy Guard (GPG)
  3. EPA - Emacs Interface to GPG
  4. Org-Agenda and Dired
  5. Lisp to Close all GPG Files

WHY?

I have growing concerns about data security. It is not that I have something to hide; it’s that I don’t like how my data is being harvested by the big corporations for their own benefit, which is mostly trying to sell me stuff that I don’t need or have already purchased. Seeing the advertisements specifically targeting me motivates me to do something.

Setting up my own personal cloud seems a bit too extreme, and I don’t have the time for it anyway. So I did a little “off-the-grid” experiment in which I exclusively used an offline Debian laptop for data-sensitive work (password management, personal finance, diary etc). It is about as secure as it gets, but the problem is accessibility: I can only work when I have access to the physical hardware.

It becomes infeasible when I travel, and it gives me some headaches to maintain one more system. Also, the laptop’s screen is only 720p, so I can literally see the pixels when I write; it feels criminal not to use the MBP’s Retina display. Lastly, it cannot be off the grid completely; at some point, I have to back it up to the cloud.

So I spent some time researching and learning. I just need a data protection layer so that I don’t have to worry about leaking private data accidentally by myself, or the cloud storage provider getting hacked.

The benefits include not only having peace of mind but also encouraging myself to work on those types of projects with greater convenience.

GNU Privacy Guard (GPG)

GPG is the tool I settled on. It is 24-year-old software that enables encrypting/decrypting files, emails or online communication in general. It is part of the GNU project, which carries a lot of weight with me.

There are two methods in GPG:

  • Symmetric method: The same password is used to both encrypt and decrypt the file, thus the symmetric in its name.
  • Asymmetric method: It requires a public key to encrypt, and a separate private key to decrypt.

There seems to be no clear winner between the two methods1. I chose the asymmetric method simply for its ease of use. The symmetric method requires typing the password twice whenever I save/encrypt the file, which seems too much.

The GPG command line interface is simple. Take the below snippet as an example,

 
gpg -r "Bob" -e foo.org
gpg -o foo2.org -d foo.org.gpg

The first line encrypts the foo.org file using the public key identified as “Bob”. It results in a file named foo.org.gpg.

The second line decrypts the foo.org.gpg file to foo2.org, which will be identical to foo.org.

EPA - Emacs Interface to GPG

Emacs provides a better interface to GPG: Its EPA package enables me to encrypt/decrypt files in place. So I don’t have to keep jumping between the decrypted file (foo.org) and the encrypted file (foo.org.gpg) while working on it.

Below is the simple configuration that works well for me, followed by an explanation.

 
(require 'epa-file)
(epa-file-enable)
(setq epa-file-encrypt-to "foo@bar.com")
(setq epg-pinentry-mode 'loopback)

  • epa-file-enable: is called to add hooks to find-file so that decryption starts after opening a file in Emacs. I believe it also ensures that encryption starts when saving a GPG file.

    To stop this behaviour, call the (epa-file-disable) function.

  • epa-file-encrypt-to: to choose the default key for encryption.

    This variable can be file specific, for example, to use the key belonging to foo2@bar.com, drop the following in the file

    ;; -*- epa-file-encrypt-to: ("foo2@bar.com") -*-
    
  • epg-pinentry-mode: should be set to loopback so that GPG reads the password from Emacs’ minibuffer, otherwise, an external program (pinentry if installed) is used.

Org-Agenda and Dired

There are more benefits Emacs offers in working with GPG files. Once I have EPA configured, the org-agenda command works pretty well with encrypted files with no extra effort.

In the simplified example below, I have two GPG files as org-agenda-files. When org-agenda is called, Emacs first tries to decrypt the foo.org.gpg file. It requires me to type the passphrase in the minibuffer.

The passphrase will be cached by the GPG Agent and will be used to decrypt bar.org.gpg, assuming the same key is used for both files. So I only need to type the passphrase once.

 
(setq org-agenda-files '("foo.org.gpg" "bar.org.gpg"))
(org-agenda)

After that, org-agenda works as if these GPG files are normal unencrypted files; I can extract TODO lists, view the clock summary report, search text and check schedules/deadlines etc.

Dired provides functions to encrypt (shortcut “:e”) and decrypt (shortcut “:d”) multiple marked files in a dired buffer. Under the hood, they call the epa-encrypt-file and epa-decrypt-file functions.

Lisp to Close all GPG Files

It seems that once a GPG file is opened in Emacs, its buffer stays decrypted for as long as it is open. So I need a utility function to close all the GPG buffers in Emacs to avoid leakage.

 
(defun yt/gpg--kill-gpg-buffers ()
  "Attempt to close all the buffers whose name ends with .gpg.

It will ask for confirmation if the buffer is modified but unsaved."
  (interactive)
  ;; Arguments: REGEXP, INTERNAL-TOO (nil), NO-ASK (t).
  (kill-matching-buffers "\\.gpg$" nil t))

Before I share my screens or start working in a coffee shop, I would call this function to ensure I close all buffers with sensitive data.

Footnotes

1 stackexchange: symmetric vs asymmetric method

Jekyll in Emacs - Align URL with Headline

Table of Contents

  1. Problem
  2. Solution
  3. Implementation

Problem

While I was working on improving the URL in my last post, I noticed the URLs are not readable, for example,

http://yitang.uk/2023/12/18/jekyll-in-emacs-update-blog-post-title-and-date/#org0238b9f

The URL links to the section called Code, so a much better URL should be

http://yitang.uk/2023/12/18/jekyll-in-emacs-update-blog-post-title-and-date/#Code

My notes show I have had this issue for 9 months. I made another attempt, but still could not find a solution!

Solution

I then switched to tidying up my Emacs configuration, and the variable org-html-prefer-user-labels caught my eye.

Its documentation says

By default, Org generates its own internal ID values during HTML
export.

When non-nil use user-defined names and ID over internal ones.

So “#org0238b9f” is generated by org-mode. These IDs are randomly generated; they change whenever I update and re-export the file. It means every time I update a blog post, it breaks the URLs. This was a problem I wasn’t aware of.

Anyway, what’s important is that, in the end, it says

Independently of this variable, however, CUSTOM_ID are always
used as a reference.

That’s it, I just need to set CUSTOM_ID. That’s the solution to my problem. It is hidden in the documentation of some variables…

Implementation

So I need a function to loop through each node, and set the CUSTOM_ID property to its headline. The org-mode API provides three helpful functions for working with org files:

  • org-entry-get: to get a textual property of a node. the headline title is referenced as “ITEM”,
  • org-entry-put: to set a property of a node,
  • org-map-entries: to apply a function to each node.

I changed the final function a bit so that it can be used as an export hook (org-export-before-processing-functions) as an experiment. With this setup, it runs automatically whenever I export a blog post in org-mode to Markdown. Also, it works on the exported copy, so it leaves the original org file unchanged.

The code is listed below. It can also be found at my .emacs.d git repo which includes many other useful Emacs configurations for Jekyll.

 
(defun yt/jekyll--create-or-update-custom_id-field ()
  "Set the CUSTOM_ID property of the current entry to its headline.
This way the URL of the section reflects the headline.

By default, the URL to a section ends with a randomly generated ID."
  (org-entry-put nil "CUSTOM_ID" (org-entry-get nil "ITEM")))

(defun yt/jekyll--create-or-update-custom_id-field-buffer (backend)
  "Set CUSTOM_ID on every entry when exporting with the jekyll-md BACKEND."
  (when (eq backend 'jekyll-md)
    (org-map-entries #'yt/jekyll--create-or-update-custom_id-field)))

(add-hook 'org-export-before-processing-functions
          #'yt/jekyll--create-or-update-custom_id-field-buffer)
 

Jekyll in Emacs - Update Blog Post Title and Date

Table of Contents

  1. Emacs Lisp Time
  2. Code

I’m the type of writer who writes first and comes up with the title later. The title in the end is usually rather different to what I started with. To change the title is straightforward - update the title and date fields in the front matter.

However, doing so leads to discrepancies between the title and date fields in front matter and the filename. In Jekyll, the filename consists of the original date and title when the post is first created.

This can be confusing sometimes in finding the file when I want to update a post. I have to rely on grep/ack to find the right files. A little bit of inefficiency is fine.

Recently, I realised that readers sometimes can be confused as well because the URL apparently also depends on the filename.

For example, I have my previous post in a file named 2022-12-08-trx-3970x.md. It indicates that I started writing it on 08 Dec with the initial title “trx 3970x”. A couple of days later on 13 Dec, I published the post with the title “How Much Does Threadripper 3970x Help in Training LightGBM Models?”.

The URL is however yitang.uk/2022/12/13/trx-3970x. It has the correct updated publish date, but the title is still the old one. This is just how Jekyll works.

Anyways, the correct URL should be

http://yitang.uk/2023/12/13/how-much-does-threadripper-3970x-help-in-training-lightgbm-models/

From that point, I decided to write a bit of Emacs Lisp code to help the readers.

Emacs Lisp Time

The core functionality is updating the filename and front matter to have the same publish date and title. It can be broken down into three parts:

  1. When called, it prompts for a new title. The publish date is fixed to whenever the function is called.

  2. It renames the current blog post file with the new date and title. It also updates the title and date fields in the front matter accordingly.

  3. It deletes the old file, closes the related buffer, and opens the new file so I can continue to work on it.
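
For readers who want the same naming rule outside Emacs, the filename part of the steps above can be sketched in Bash. This is only an approximation under my own assumptions: make_slug and new_post_filename are hypothetical names, and the slug rule (lowercase, runs of non-alphanumeric characters become a single hyphen) may differ from what the jekyll-make-slug function used in the Lisp code produces.

```bash
# Turn a title into a URL-friendly slug.
make_slug() {
    # lowercase, replace each run of non-alphanumerics with one hyphen,
    # then trim a leading or trailing hyphen
    printf '%s' "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | sed -e 's/[^a-z0-9]\{1,\}/-/g' -e 's/^-//' -e 's/-$//'
}

# Build a Jekyll-style filename: {date}-{slug}.{ext}
# $1 = title, $2 = extension without the dot, $3 = optional date (defaults to today)
new_post_filename() {
    local day="${3:-$(date +%Y-%m-%d)}"
    printf '%s-%s.%s\n' "$day" "$(make_slug "$1")" "$2"
}

new_post_filename "How Much Does Threadripper 3970x Help in Training LightGBM Models?" md 2022-12-13
# 2022-12-13-how-much-does-threadripper-3970x-help-in-training-lightgbm-models.md
```

Leaving out the third argument uses today's date, which mirrors the behaviour of fixing the publish date to whenever the function is called.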

My Emacs Lisp coding skill is rusty but I managed to get it working in less than 2 hours. I won’t say it looks beautiful, but it does the job!

I spent a bit of time debugging; it turns out that (org-show-all) needs to be called first to unfold the org file. Otherwise, editing with some parts of the content hidden can lead to unexpected results.

I have always found working with filenames/directories in vanilla Emacs Lisp cumbersome. I wonder if there is a modern Lisp library with a better API, something like Python’s pathlib module?

Code

Here are the main functions in case someone needs something similar. They are extracted from my Emacs configuration.

 
(defun yt/jekyll-update-post-name ()
  "Update the post filename with a new title and today's date.

It also updates the front matter."
  (interactive)
  (let* ((title (read-string "new title: "))
         (ext (file-name-extension (buffer-file-name)))  ;; as of now, the ext is always .org.

         ;; the new filename is in the format of {date}-{new-title}.org
         (filename (concat
                    (format-time-string "%Y-%m-%d-")
                    (file-name-with-extension (jekyll-make-slug title) ext)))

         ;; normalise the filename.
         (filename (expand-file-name filename))

         ;; keep the current point which we will go back to after editing.
         (old-point (point)))
    (rename-file (buffer-file-name) filename) ;; update the filename
    (kill-buffer nil)  ;; kill the current buffer, i.e. the old file.
    (find-file filename)  ;; open the new file.
    (set-window-point (selected-window) old-point)  ;; set the cursor to where I was in the old file.

    ;; update the title field.
    ;; note jekyll-yaml-escape is called to ensure the title field is yaml friendly.
    (yt/jekyll-update-frontmatter--title (jekyll-yaml-escape title))))

(defun yt/jekyll-update-frontmatter--title (title)
  "Update the title field in the front matter with TITLE.

Title case is used."
  (let* ((old-point (point)))

    ;; expand all the code/headers/drawers before editing.
    (org-show-all)

    ;; go to the first occurrence of 'title:'.
    (goto-char (point-min))
    (search-forward "title: ")

    ;; replace the whole line with the new title, without touching the kill ring.
    (delete-region (line-beginning-position) (line-end-position))
    (insert (format "title: %s" title))

    ;; ensure the title is in title case; 7 is the length of "title: ".
    (xah-title-case-region-or-line (+ (line-beginning-position) 7) (line-end-position))

    ;; save and reset cursor back to where it started.
    (save-buffer)
    (goto-char old-point)))
 