I have growing concerns about data security. It is not that I have
something to hide, it’s that I don’t like how my data is being
harvested by the big corporations for their own benefit, mostly to
sell me stuff that I don’t need or have already purchased. Seeing
advertisements specifically targeting me motivates me to do something
about it.
Setting up my personal cloud seems a bit too extreme, and I don’t have
the time for it anyway. So I did a little “off-the-grid” experiment
in which I exclusively used an offline Debian laptop for
data-sensitive work (password management, personal finance, diary,
etc.). It is certainly secure, but the problem is accessibility: I can
only work when I have access to the physical machine.
It becomes infeasible when I travel, and maintaining one more system
gives me headaches. Also, the laptop’s screen is only 720p; I can
literally see the pixels when I write, and it feels criminal not to
use the MBP’s Retina display. Lastly, it cannot be off the grid
completely; at some point, I have to back it up to the cloud.
So I spent some time researching and learning. I just need a data
protection layer so that I don’t have to worry about accidentally
leaking private data myself, or about the cloud storage provider getting hacked.
The benefits include not only having peace of mind but also
encouraging myself to work on those types of projects with greater
confidence.
GNU Privacy Guard (GPG)
is the tool I settled on. It is a 24-year-old piece of software that
enables encrypting/decrypting files, emails, and online communication
in general. It is part of the GNU project, which counts for a lot with me.
There are two methods in GPG:
Symmetric method: The same password is used to both encrypt and decrypt
the file, hence the “symmetric” in its name.
Asymmetric method: It requires a public key to encrypt, and a
separate private key to decrypt.
There seems to be no clear winner between the two methods. I chose
the asymmetric method simply for its ease of use. The symmetric method
requires typing the password twice whenever I save/encrypt the file,
which seems too much.
The GPG command line interface is simple. Take the below snippet as an
example.
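A minimal version of such a snippet, assuming a public key identified as “Bob” is already in the keyring (file names are placeholders):

```shell
# Encrypt foo.org for the recipient "Bob"; produces foo.org.gpg
# alongside the original file.
gpg --encrypt --recipient Bob foo.org

# Decrypt foo.org.gpg into foo2.org, identical to the original foo.org.
gpg --output foo2.org --decrypt foo.org.gpg
```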
The first line encrypts the foo.org file using the public key identified as
“Bob”. It results in a file named foo.org.gpg.
The second line decrypts the foo.org.gpg file to foo2.org, which will
be identical to foo.org.
EPA - Emacs Interface to GPG
Emacs provides a better interface to GPG: Its EPA package enables me
to encrypt/decrypt files in place. So I don’t have to keep jumping
between the decrypted file (foo.org) and the encrypted file
(foo.org.gpg) while working on it.
Below is the simple configuration that works well for me, and what
each piece does:
epa-file-enable: called to add hooks to find-file so that
decryption starts after opening a GPG file in Emacs. It also ensures,
I believe, that encryption starts when saving a GPG file.
To stop this behaviour, call the (epa-file-disable) function.
epa-file-encrypt-to: to choose the default key for encryption.
This variable can be file specific, for example, to use the key
belonging to firstname.lastname@example.org key, drop the following in the file
;; -*- epa-file-encrypt-to: ("email@example.com") -*-
epg-pinentry-mode: should be set to loopback so that GPG reads
the password from Emacs’ minibuffer, otherwise, an external program
(pinentry if installed) is used.
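Putting the three pieces together, a minimal configuration along these lines works (the key ID is a placeholder):

```elisp
;; EPA/EPG ship with Emacs; no extra packages needed.
(require 'epa-file)

;; Add hooks so that .gpg files are transparently decrypted
;; when opened and encrypted when saved.
(epa-file-enable)

;; Default recipient key used for encryption (placeholder address).
(setq epa-file-encrypt-to '("email@example.com"))

;; Read the passphrase from the minibuffer instead of an
;; external pinentry program.
(setq epg-pinentry-mode 'loopback)
```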
Org-Agenda and Dired
There are more benefits Emacs offers when working with GPG files. Once
I have EPA configured, the org-agenda command works well with
encrypted files with no extra effort.
In the simplified example below, I have two GPG files as
org-agenda-files. When org-agenda is called, Emacs first tries
to decrypt the foo.org.gpg file. It requires me to type the password
in the minibuffer.
The password is cached by the GPG agent and used to decrypt
bar.org.gpg, assuming the same key is used for both files. So I
only need to type the passphrase once.
After that, org-agenda works as if these GPG files are normal
unencrypted files; I can extract TODO lists, view the clock summary
report, search text and check schedules/deadlines etc.
Dired provides functions to encrypt (shortcut “:e”) and decrypt
(shortcut “:d”) multiple marked files in a dired buffer. Under the
hood, they call the epa-encrypt-file and epa-decrypt-file functions.
Lisp to Close all GPG Files
It seems that once a GPG file is decrypted upon opening or encrypted
upon saving in Emacs, its buffer stays decrypted indefinitely. So I
need a utility function to close all the GPG buffers in Emacs to avoid
leaking data.
Before I share my screens or start working in a coffee shop, I would
call this function to ensure I close all buffers with sensitive data.
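A sketch of such a utility, assuming GPG files are identified by their .gpg extension (the function name is my own):

```elisp
(defun my/close-all-gpg-buffers ()
  "Save and kill every buffer that is visiting a .gpg file."
  (interactive)
  (dolist (buffer (buffer-list))
    (let ((file (buffer-file-name buffer)))
      ;; Only act on buffers visiting encrypted files.
      (when (and file (string-suffix-p ".gpg" file))
        (with-current-buffer buffer
          (save-buffer)
          (kill-buffer))))))
```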
My notes show I have had this issue for nine months. I made another
attempt, but still could not find a solution!
I then switched to tidy up my Emacs configuration, and the variable
org-html-prefer-user-labels caught my eye.
Its documentation says:
By default, Org generates its own internal ID values during HTML
export. When non-nil use user-defined names and ID over internal ones.
So “#org0238b9f” is generated by org-mode. Such IDs are randomly
generated; they change every time I re-export the file. It means every
time I update a blog post, it breaks the URLs. This was a problem I
wasn’t even aware of.
Anyway, what’s important is that, in the end, it says
Independently of this variable, however, CUSTOM_ID are always
used as a reference.
That’s it, I just need to set CUSTOM_ID. That’s the solution to my
problem. It is hidden in the documentation of some variables…
So I need a function to loop through each node, and set the CUSTOM_ID
property to its headline. The org-mode API provides three helpful
functions for working with org files:
org-entry-get: to get a textual property of a node. The headline
title is referenced as “ITEM”,
org-entry-put: to set a property of a node,
org-map-entries: to apply a function to each node.
I changed the final function a bit so it is used as an export hook
(org-export-before-processing-functions) as an experiment. With this
setup, it runs automatically whenever I export a blog post in org-mode
to Markdown. Also, it works on the exported file so it leaves the
original org file unchanged.
The code is listed below. It can also be found at my .emacs.d git repo
which includes many other useful Emacs configurations for Jekyll.
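For reference, a sketch of that approach; the function name is mine, and details may differ from the actual code in the repo:

```elisp
(defun my/org-custom-id-from-headline (_backend)
  "Set each headline's CUSTOM_ID property to its own title.
Intended to run on the copy of the buffer that org-export works
on, so the original org file stays unchanged."
  (org-map-entries
   (lambda ()
     ;; "ITEM" refers to the headline title.
     (let ((title (org-entry-get (point) "ITEM")))
       (unless (org-entry-get (point) "CUSTOM_ID")
         (org-entry-put (point) "CUSTOM_ID" title))))))

;; Run automatically before every export.
(add-hook 'org-export-before-processing-functions
          #'my/org-custom-id-from-headline)
```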
I’m the type of writer who writes first and comes up with the title
later. The title in the end is usually rather different from what I
started with. Changing the title is straightforward: update the
title and date fields in the front matter.
However, doing so leads to discrepancies between the title and date
fields in the front matter and the filename. In Jekyll, the filename
consists of the date and title from when the post was first created.
This can make it confusing to find the file when I want to
update a post; I have to rely on grep/ack to find the right one. A
little bit of inefficiency is fine.
Recently, I realised that readers can sometimes be confused as well,
because the URL also depends on the filename.
For example, I have my previous post in a file named
2022-12-08-trx-3970x.md. It indicates that I started writing it on 08
Dec with the initial title “trx 3970x”. A couple of days later on 13
Dec, I published the post with the title “How Much Does Threadripper
3970x Help in Training LightGBM Models?”.
The URL is however yitang.uk/2022/12/13/trx-3970x. It has the correct
updated publish date, but the title is still the old one. This is just
how Jekyll works.
From that point, I decided to write a bit of Emacs Lisp code to help
with the renaming.
Emacs Lisp Time
The core functionality is updating the filename and the front matter to
have the same publish date and title. It breaks down into three steps:
When called, it prompts for a new title. The publish date is set to
whenever the function is called.
It renames the current blog post file with the new date and title.
It also updates the title and date fields in the front matter
It deletes the old file, closes the related buffer, and opens the
new file so I can continue to work on it.
My Emacs Lisp coding skills are rusty but I managed to get it working in
less than 2 hours. I won’t say it looks beautiful, but it does the job.
I spent a bit of time debugging; it turns out that (org-show-all)
needs to be called first to flatten the org file, otherwise editing
with some parts of the content hidden can lead to unexpected results.
I have always found working with filenames and directories in vanilla
Emacs Lisp cumbersome. I wonder if there is any modern Lisp library
with a better API, something like Python’s pathlib module?
Here are the main functions in case someone needs something similar.
They are extracted from my Emacs configuration.
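A sketch of how those three steps could fit together; the function name, the slug rule, and the front-matter regexps here are my own assumptions, not necessarily the original code:

```elisp
(require 'org)

(defun my/rename-blog-post (title)
  "Rename the current post using today's date and TITLE.
Updates the title/date fields in the front matter, renames the
file on disk, then visits the new file."
  (interactive "sNew title: ")
  ;; Flatten the org buffer first; editing with sections folded
  ;; can lead to unexpected results.
  (org-show-all)
  (let* ((date (format-time-string "%Y-%m-%d"))
         ;; Crude slug: lowercase, non-alphanumerics become hyphens.
         (slug (downcase (replace-regexp-in-string
                          "[^[:alnum:]]+" "-" title)))
         (old-file (buffer-file-name))
         (new-file (expand-file-name
                    (concat date "-" slug "."
                            (file-name-extension old-file))
                    (file-name-directory old-file))))
    ;; Update the front matter fields in place.
    (save-excursion
      (goto-char (point-min))
      (when (re-search-forward "^title: .*$" nil t)
        (replace-match (concat "title: " title)))
      (goto-char (point-min))
      (when (re-search-forward "^date: .*$" nil t)
        (replace-match (concat "date: " date))))
    (save-buffer)
    ;; Rename on disk, drop the old buffer, revisit the new file.
    (rename-file old-file new-file)
    (kill-buffer (current-buffer))
    (find-file new-file)))
```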
Back in my Kaggle days, I always wondered how much my ranking could
improve with a better computer. I finally pulled the trigger (twice)
and got myself a 32-core Threadripper 3970x workstation.
Before I can tell whether it helps my Kaggle competitions or not, I
thought it would be interesting to quantify how much benefit I can get
from upgrading from the i5-13600k to the 3970x when training LightGBM
models.
The TLDR is:
The speedup is 3 times in training LightGBM using the CPU.
To my surprise, training on the GTX 1080Ti GPU is 2 times faster than
on the i5’s CPU.
There are no obvious gains from moving from the GTX 1080Ti to the RTX 3080.
Experiment Setup
I use the data from the Optiver - Trading At The Close
competition. There are about 500,000 rows and 100 features. I train a
3-fold (expanding window) LightGBM model, repeating the same process
with varying numbers of cores to get a performance graph like this:
Threadripper 3970x vs i5-13600k: Train LightGBM Models on CPU
i5-13600k - Efficient Cores Count
The i5-13600k has 6 performance cores and 8 efficient cores. In
practice, I never use more than 6 cores in training ML models. My
theory is mixing fast performance and slow efficient cores leads to a
worse performance than using the performance cores alone. By
specifying 6 cores, I assumed the OS would use only the performance cores.
The results show that I was wrong: using more than 6 cores can give a
considerable performance gain. It reduces the runtime by 10 minutes
going from 6 to 14 cores.
The only plausible explanation is that when training LightGBM with 6
cores, efficient cores are already mixed in. Therefore I see an
increase in performance as I add more cores.
Regardless, I will start to use 12 cores in practice.
3970x - Disappointing Results
I knew the performance gain would not scale linearly with the number of
cores, but I wasn’t expecting that adding more cores could slow down
the training.
The graph shows the 3970x achieves its best performance at 12
cores. After that, adding more cores increases the runtime.
This type of behaviour is usually observed in simple tasks where the
overhead of coordinating between cores outweighs the benefit the extra
cores bring.
But training thousands of decision trees with half a million data
points is definitely not in this simple-task category. So I don’t
understand why this is happening.
i5 vs. 3970x - Training in Parallel
With 6 cores, it took the i5 51 minutes and the 3970x 42 minutes,
which is about a 1.2x speedup; not bad. The same speed boost is also
observed at 10 and 12 cores.
I found this consistent speedup confusing: because the i5 mixes
performance and efficient cores, in theory every core I add on the
3970x should widen the performance margin over the i5.
In general, because of the poor scalability with respect to the number
of cores, the best performance is achieved by training each model with
a small number of cores and running multiple training jobs in
parallel. This is the trick I use to get an extra performance boost
for CPU-bound workloads.
Here’s the setup for each computer:
i5-13600k: use 6 cores to train each model, and train 2 models in
parallel. That leaves 2 cores for OS background activities.
3970x: also use 6 cores to train each model, but train 5 models in
parallel! It also leaves 2 cores for OS background activities.
After a little bit of maths: it takes 14 hours for the 3970x to train
100 models, and 42.8 hours for the i5, so the speedup is 3 times. This
is just based on my theory; it would be good to actually run the
experiment and see the real numbers.
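The arithmetic, as a quick check (per-model runtimes taken from the 6-core results above):

```elisp
;; 3970x: 100 models / 5 in parallel = 20 batches of 42 minutes each.
(/ (* 20 42) 60.0)  ;; => 14.0 hours

;; i5-13600k: 100 models / 2 in parallel = 50 batches of ~51 minutes.
(/ (* 50 51) 60.0)  ;; => 42.5 hours, roughly the 42.8 quoted
```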
Table 1: Training 100 models in parallel setting.

CPU          Runtime of 1 model (min)   No. models in parallel   Total runtime (h)
i5-13600k    51                         2                        42.8
3970x        42                         5                        14
So the most benefit I can get from 3970x is in running multiple
experiments in parallel!
CPU vs. GPU - Impressive Performance
I have a GTX 1080Ti in my i5 PC for running deep learning models and
CUDA code. I never use it for LightGBM because the GPU implementation
was slower than the CPU in 2019 when I tried it.
In the summer, Guolin Ke, the author of LightGBM, promised a
significant improvement in GPU performance when he was looking for
volunteers to work on improving LightGBM’s GPU algorithm.
Since I have the experiments set up already, it took me little time to
repeat them using the GPU trainer. All I did was add
device_type=gpu to the configuration files.
Table 2: Runtime of training a single model.
The result shocked me: I can get a 2x speedup just by switching from
the i5 to the 1080Ti with one additional line in the config, and it
outperforms the 3970x in the single-model setting by a big margin!
Is the 3970x worth it?
I found myself asking this question after seeing the results. In the
context of this experiment, no: it makes no sense to spend £2,000 for
a 3x speedup when I can simply switch to the 1080Ti for a 2x speedup
at no cost.
However, the reason I went for the Threadripper and the TRX40 platform
is the 128 PCIe 4.0 lanes. The workstation is capable of running 4
GPUs at the same time at full capability, whereas the i5 can only run one.
If I had 4 RTX 3080s installed, it would finish training 100 models in
just under 8 hours! That’s a 5.25x speedup over the i5 and 1.75x over
the 3970x in the parallel setting.
This calculation is not just for entertainment. It turns out that
utilising multiple GPUs to train gradient boosted trees can be really
effective.
This static blog was built using Jekyll in 2014. It has survived for 7
years, which is a success when it comes to personal blogging. Part of
the reason is having a good blogging workflow: write posts in Org
Mode, export to HTML with a front matter, build the site using Jekyll,
send the folder to an Amazon S3 bucket, and that’s it. All done in
Emacs, of course.
I added a few things to the workflow to enhance the reading
experience, including code highlighting, centred images with captions,
a table of contents, etc. There are more features I want to add, but
at the same time, I want to be able to just write.
With that mindset, whenever there were issues, I applied quick fixes
without a deep understanding of the actual causes. It seemed efficient,
until recently some fixes became counter-productive.
I started seeing underscores (_) exported as \_ and <p> tags appearing
in code snippets. These sound like quick fixes, but I just couldn’t
get them right after a few hours. For the last few posts, I had to fix
them manually in each read-edit-export-fix iteration.
Revisit the Tech Stack
I have an ambitious goal for this blog, so it is time to look under
the carpet. I studied the technologies used for this blog: Jekyll, AWS
and Org Mode exporting. It was a good chance to practise Org-roam for
taking atomic notes. The time was well spent as I learnt a lot.
I was impressed I got the whole thing up and running 7 years ago. I
don’t think I have the willpower to do it now.
Still, there are a lot of things that I do not have a good
understanding of, e.g. the Liquid templates, HTML and CSS tags, etc.
The syntax just puts me off.
Long Ride with Jekyll
I prefer a simple format like Org Mode or Markdown and would rather
not deal with HTML/CSS at all. There have been a couple of occasions
when I could not resist the temptation to look for an alternative to
Jekyll. No luck in the search; it seems HTML is the only way because
it is native to the web.
So the plan is to stick with Jekyll for at least a few years. In the
next couple of weeks, I’d try to fix all the issues; after that,
gradually add more features to enhance the writing and reading
experience.
I hope people who use a similar tech stack (Org-mode, Emacs,
Jekyll, AWS) can benefit from my work.