Yi Tang Data Science and Emacs

Retiring Raspberry Pi 4 as Home Server and NAS

Table of Contents

  1. Good Start for Self-Hosting
  2. Lack of NAS Capacity
  3. Looking for a Successor
  4. Unexpected
  5. Setting up z170a
  6. Power Consumption

Good Start for Self-Hosting

The little Raspberry Pi 4 (RP4) has served me well over the last two years. I have used it to host NextCloud/Syncthing for syncing files between devices, to scrape financial data from Yahoo Finance, and to run Time Machine backups for macOS.

The latest addition to the service stack is paperless-ngx. It allows my Canon printer/scanner to send digital copies of documents directly to the RP4 or Gmail.

The RP4 handles all these demands without showing any sign of struggle. It draws as little as 6 W, while the Xbox One S draws 11 W even while sleeping. Thanks to the energy crisis in the UK, I have come to appreciate the energy efficiency of the RP4. The ARM chip in it really impressed me.

Lack of NAS Capacity

A 3TB portable hard drive (WD My Passport) was attached to the RP4 to store media data. The USB 3.0 connection is surprisingly stable and fast. With both ends connected by Ethernet cables, the file transfer speed can reach up to 100 MB/s. When my MacBook Pro is on Wi-Fi, the speed drops to about 40-50 MB/s, but that is still great given the convenience.

Later I started using it as a NAS to store my Final Cut Pro library. The 4K home gym videos I shoot on the iPhone 12 Pro are numerous1! The hard drive kept filling up.

I could get another portable hard drive, but it would just fill up again, perhaps in less than a month. So it occurred to me that I need a proper home server with full NAS capacity.

Looking for a Successor

I did a bit of research but was not able to find a good product. I suspect the reason is that NAS builds are a niche area while the PC industry is gaming-centric, focused on faster, bigger, and fancier hardware with unnecessary RGB lights; that is where the profits are, I presume.

I came across some innovative products on AliExpress from China, such as the TopTon N5105 board. It is more powerful, consumes only slightly more electricity, and has 6 SATA ports! It would be a perfect successor to my RP4.

But I am not comfortable ordering electronics from AliExpress; returning an item or sending it back for repair would be a nightmare.

PS: The company has grown fast and continued to innovate, and the product lines have extended to the Intel N100 with an additional NVMe drive and a USB-C port. Their website and marketing materials have been notched up quite a bit. I somewhat regret not taking the risk back then.

Unexpected

The other day I was re-organising (again) my home office and had to move a bookshelf. I started moving it without taking everything off, and a motherboard fell off. It was the z170a with an i5-6600K and a heat sink attached. The motherboard came from my first desktop, which I purchased 10 years ago when I started participating in Kaggle competitions in 2014.

After a quick inspection, I saw that some pins were bent. I felt ashamed and sorry that I had not taken better care of the motherboard. So I made a promise: if it survived the fall, I would use it for my NAS.

Well, it did, so I had found my NAS.

Setting up z170a

While putting it together, one SATA port snapped off, but the rest are still fine. Apart from that, everything went smoothly. Debian 12 has become much easier to install thanks to the isohybrid images, and the non-free firmware is now part of the installation image itself.

The server setup scripts and configuration are saved in a selfhosted-services git repository, so restoring the services took little effort.

I had one little trick: I assigned the RP4's IP address to the new z170a server so that I didn't have to change anything on the client side. This was rather easy to achieve: a few clicks in the ASUS router web UI and then a reboot.

While setting it up, I noticed the z170a system is much more responsive, thanks to the 3.5 GHz i5-6600K CPU and an SSD that is much faster than the SD card. I was able to run multiple processes at the same time.

The longest part was copying files from the 3TB portable hard drive to the z170a's internal HDD, which took about 20 hours.

It also has great room for expansion: there are three free SATA ports for HDDs and two PCIe slots.

Power Consumption

The only downside is that it consumes a lot more electricity. When testing the bare system, it drew only 10W. After putting everything together with the additional HDDs, fans, and Ethernet cable, the power meter jumped to 45W. I removed the hard drives one by one to see where the power was going.

  • No HDD: 27W
  • IronWolf alone: 32W, a 5W increase.
  • IronWolf + Seagate: 37W, another 5W increase.
  • IronWolf + Seagate + Toshiba: 45W, an 8W increase.

So I kept only the IronWolf, which is a 3TB NAS-grade HDD.

I also tried tweaking the BIOS and the Linux kernel to reach deeper C-states, but it felt like over-engineering, so I have happily settled at 27W.

Footnotes

1 I record my weightlifting to correct and improve my technique.

Use Ledger-Cli to Track DIY Project Expenses

Table of Contents

  1. Personal Technical Challenge
  2. Baby Steps
  3. Why? - Effort Estimation

Personal Technical Challenge

I used ledger-cli1 before and it was a painful experience. The problem was not rooted in the tool but in how I intended to use it: I wanted to track all my expenses, from buying a cup of coffee to booking a holiday package. When I started this journey, there was a massive jump from knowing little to nothing about personal finance to doing double-entry accounting in plain text.

Though I gave up, it introduced me to the idea of owning my bank transaction data in text files on my personal computer. So over the years, I manually curated about 8 years of historical transaction data.

If you haven’t done so, I strongly recommend you go to your banks’ websites and download the transaction data manually, going as far back as you can. You will notice that banks only give access to 3-5 years of data2. It’s a shame that banks use outdated technologies, but it is better than having nothing.

Since I had the data, I did some analysis and charts in Python/R. But I kept wondering what ledger-cli could offer. I occasionally saw blog posts on ledger-cli in the Emacs communities, so there must be something out there.

It has also become a personal challenge. I chose not to give up but to put it aside and tackle it again once I was older.

Baby Steps

Hopefully, I have become smarter as well. This time, to ensure I can successfully adopt the tool, I am reducing the scope to tracking only DIY project expenses.

I love DIY and wish I had more days for DIY projects. It is usually labour-intensive, and I feel hyped and extremely confident after a couple of DIY sessions. Pairing it with learning ledger-cli, a cognitively intensive activity, would make a nice bundle3.

Though the usage is simple, the question it answers is important: I want to know, during or after a DIY project, exactly how much it costs. I could use a much simpler tool, like a spreadsheet or a pen and notebook, but I want this to be a stepping stone to adopting ledger-cli properly in the future.

Why? - Effort Estimation

I need an accurate answer on the actual costs so that I can use the data to train myself in cost estimation. This is a very important skill to have as a homeowner; it puts me in a much better position when negotiating with tradesmen. A lot of people in the UK complain that they or their relatives got ripped off by tradesmen.4

In general, house repairs and improvements are getting much more expensive every year, due to the shortage of labourers, inflation, Brexit, etc. To give an example from my last two quotes, adding an electrical socket cost £240 and replacing a small section of water pipe cost £500.

I already have a good habit of using org-mode to track time; my goal is to add ledger-cli to my system to track expenses. After that, I will know whether it is really worth doing the DIY myself or finding a proper tradesman. The total cost itself is not the only metric that matters, but it is a very essential one to have.

Footnotes

1 https://ledger-cli.org/

2 Why don’t banks give access to all your transaction activity?

3 I might have picked it up from Atomic Habits.

4 How many of you have been ripped off by builders / tradesmen? (or know someone closely that has)

Finding Highly Correlated Features

Table of Contents

  1. Motivation
  2. Implementation
  3. Parameterisation

Motivation

From a modelling perspective, having highly correlated features in the dataset is not a big problem. Regularised Lasso/Ridge regressions are designed to deal with this kind of dataset, and ensemble trees are robust enough to be almost immune to it. Of course, every model requires proper hyperparameter tuning with proper cross-validation.

The problem arises in understanding the feature contributions: if there are 5 features that are highly correlated, each individual contribution could be tiny, but their true contribution should be assessed by adding the contributions together and considering them as a group, e.g. adding their coefficients in Ridge, or adding their feature importances in LightGBM.
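
As a tiny illustration of that aggregation (a sketch, not from the original post; the variables importances and groups are hypothetical), summing per-feature importances within each correlated group could look like this:

# Hypothetical: `importances` maps feature name -> importance (e.g. taken from
# a fitted model), `groups` maps a group label -> set of correlated features.
group_importance = {
    label: sum(importances[feature] for feature in members)
    for label, members in groups.items()
}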

If their aggregated feature importance indeed turns out to be small, I can remove them from the model to get a simpler model. A mistake I used to make was removing correlated features based on their individual feature importances, which leads to less performant models.

A better and cleaner approach is to clean up the correlated features to begin with. Then I won’t need to aggregate feature importances, and it speeds up the model development cycle: there are fewer features to look at, to train the model on, to verify the data quality of, etc. When the model goes live in production, it also translates to less data to source and maintain.

Implementation

So I need to enrich my tool set to identify highly correlated features. I couldn’t find an existing library that does that, so I implemented it myself.

The key steps are:

  1. Based on the correlation matrix, create a correlation long table. Each row stands for the correlation between feature $X_1$ and feature $X_2$. Assuming there are three features in the dataset, the table looks like this.

    Row  X1  X2  Corr
    1    A   B   0.99
    2    A   C   0.80
    3    B   C   0.95
  2. Remove rows where the correlation is less than the threshold $T$. This significantly reduces the input to Step 3 (a sketch of Steps 1 and 2 follows this list).

    If the threshold is 0.9, then Row 2 will be removed.

  3. Treat the correlation table as a directed graph:

    1. Let $E$ be the set of unexplored nodes, initially filled with all the features in $X_1$, and let $R$ be the result.

    2. For each node in $E$:

      1. Traverse the graph in a depth-first fashion until there are no connections left, adding each connected node to the result $R$ as it is visited.

      2. Remove the nodes collected in $R$ from the remaining nodes to explore in $E$.

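Here is a minimal sketch of Steps 1 and 2, assuming the raw features live in a pandas DataFrame df. The helper name build_correlation_table, the use of absolute correlations, and the upper-triangle trick are my own choices rather than part of the original post; its output is the ds object consumed by the Step 3 code below.

import numpy as np
import pandas as pd

def build_correlation_table(df: pd.DataFrame, threshold: float,
                            method: str = "pearson") -> pd.DataFrame:
    """Steps 1 and 2: long correlation table, filtered by the threshold T."""
    corr = df.corr(method=method).abs()

    # Keep the upper triangle only (k=1 excludes the diagonal) so each pair
    # appears once and the resulting graph is directed with no cycles.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    long_table = corr.where(mask).stack()      # Series indexed by (X1, X2)
    long_table.index.names = ["X1", "X2"]

    # Step 2: drop pairs below the threshold; keep a DataFrame multi-indexed
    # by (X1, X2) so that ds.loc['A'].index lists the features connected to A.
    return long_table[long_table >= threshold].to_frame("Corr")
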
The vanilla Python code corresponding to Step 3 is listed below. The ds object is a pandas.DataFrame, multi-indexed by $X_1$ and $X_2$, so ds.loc['A'].index gives all the features connected to $A$ whose correlation with $A$ is larger than the provided threshold.

 
import pandas as pd
from collections import defaultdict


def find_neighbors(ds: pd.DataFrame, root: str, res: set):
    """Recursively collect into res the nodes connected to root in the graph ds."""
    res.add(root)
    if root not in ds.index:
        return
    for n in ds.loc[root].index.tolist():
        if n not in res:  # skip nodes that have already been collected
            find_neighbors(ds, n, res)


def find_correlated_groups(ds: pd.DataFrame):
    """Group highly correlated features.

    The ds object is a pandas.DataFrame, multi-indexed by X1 and X2.
    """
    res = defaultdict(set)

    # continue until all nodes are visited.
    cols = ds.index.get_level_values(0).unique().tolist()
    while len(cols) != 0:

        # always start from the root as ds is a directed graph.
        col = cols[0]
        find_neighbors(ds, col, res[col])

        # remove the connected nodes from the remaining ones to explore.
        for x in res[col]:
            if x in cols:
                cols.remove(x)
    return res

The result is a collection of mutually exclusive groups. Each group contains a set of highly correlated features, for example

Group A: {A, B, C}
Group D: {D, K, Z}

The next step is to decide, within each group, which feature to keep and which to remove. The deciding factors can be data availability (e.g. choose the feature with the least missingness), the cost of data sourcing (e.g. free to download from the internet), or familiarity (e.g. the feature is well understood by people), etc.
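
As an illustration of the missingness criterion, a hypothetical helper (not from the original post) could rank each group by the fraction of missing values and keep only the best feature:

def features_to_drop(df: pd.DataFrame, groups: dict) -> list:
    """Within each correlated group, keep the feature with the least
    missingness and mark the rest for removal."""
    to_drop = []
    for members in groups.values():
        # Rank the group members by their fraction of missing values.
        ranked = sorted(members, key=lambda col: df[col].isna().mean())
        to_drop.extend(ranked[1:])  # keep ranked[0], drop the rest
    return to_drop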

Parameterisation

There are two hyperparameters:

  • The correlation type: it can be Pearson for numerical data or Spearman for ordinal/categorical data. For a large dataset, calculating the correlation matrix can take some time.

  • The correlation threshold $T$: the higher the threshold, the fewer features are removed, so it is less effective. However, if the threshold is set too low, it leads to a high false-positive rate; two features can be correlated yet still complement each other in the model.

I would test a range of values from 0.9 to 1 and review the results (a sketch of such a sweep follows the bullet points below). The graph below shows the number of features to remove at varying thresholds.

  • When $T=0.9$, there are about 95 groups, and in total 153 features to remove.
  • When $T=1$, there are 29 groups, and in total 37 features to remove.
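
To produce numbers like these, a hypothetical sweep could reuse build_correlation_table and find_correlated_groups from above; the grid of threshold values is illustrative:

for t in [0.90, 0.92, 0.95, 0.98, 1.00]:
    ds = build_correlation_table(df, threshold=t)
    groups = find_correlated_groups(ds)
    # Keeping one feature per group, the rest are candidates for removal.
    n_drop = sum(len(g) - 1 for g in groups.values())
    print(f"T={t:.2f}: {len(groups)} groups, {n_drop} features to remove")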


Proper end-to-end test runs are required to identify the best hyperparameters. As a quick rule of thumb, those 37 duplicated features identified with $T=1$ can be dropped without further testing.

The group sizes at varying thresholds $T$ provide an interesting insight into the data. The 75th percentile of the group sizes is plotted, which suggests that apart from the 33 duplicated features, there are a large number of paired features (i.e. group size of 2) whose correlation is high, above 92%.


Less Excel, More R/Python in Emacs

Table of Contents

  1. Excel Is Great
  2. But
  3. Emacs Has More To Offer

Excel Is Great

Regardless of how powerful and convenient the R/Python data ecosystem becomes, there is still value in looking at the data in Excel, especially when exploring the data together with less technical people.

Thanks to its intuitive interface, Excel is widely used in data analysis: hover the mouse to select columns, apply filters, then calculate some statistics. Most of the time that is all it takes to get the answers the clients are seeking.

I recently realised that having transparency and working with the tools that clients use plays a crucial role in strengthening trust and delivering impact to the business. Sometimes I think I should do more in Excel.

But

The problem with Excel is reproducibility: I’m not able to codify the clicking done in Excel and integrate it into the automated data pipeline. It is rather foolish to have quality-control procedures in the system, including code reviews, automated testing, CI, etc., only to drop all those gatekeepers at the very end and go for error-prone manual steps.

Plus, it is far more efficient to have everything done in one place, giving a smooth process with no friction. It is a key factor in enabling quick turnaround.

So I have a motive to limit the use of Excel to delivering data to the business and to paired data analysis. Once again, I have been looking into how much can be done without leaving Emacs.

Emacs Has More To Offer

I was pleased to discover the ess-view-data package and its Python counterpart python-view-data. They interact with an active R/Python session in Emacs and print out data.frame objects in plain text, a.k.a. viewing the data. What’s more, they can process the data before viewing it, for example subsetting the data row- or column-wise, summarising the dataset, etc.

The package keeps a record of the data processing pipeline, so in the end I have a copy of the R/Python code that generates the output. I can then effortlessly transfer the code to a script to ensure reproducibility in the future.

Another benefit comes from having a plain-text buffer for the data. It is handy when exploring large datasets with an excessive number of columns. For example, the dataset I work on on a daily basis has about 300 columns. It contains different flavours of the financials: the raw values, imputed, ranked, smoothed, etc.

It’s not possible to remember all the column names, even after spending extra time giving them meaningful names and ensuring the correct columns are referred to. Having a persistent plain-text buffer that I can search makes finding the right column names a lot easier. It also helps to check what is and isn’t in the data.

That’s my first impression of Shuguang Sun’s packages; they look promising.

Blog in Emacs - Use Jekyll's Draft Mode

Why?

I wasn’t aware of Jekyll’s draft mode. My workaround was to manually change the published field in the front matter to true when a post was ready to publish. It worked fine. However, with native support from Jekyll, there are more benefits to using the draft mode.

To start with, I like having drafts saved in the _drafts folder, not mixed with published posts in the _posts folder. It is much cleaner and easier to manage. At a glance, I can see which posts I am drafting.

It also gives me peace of mind: only posts under the _posts folder are exported and shown on my blog, which ensures I don’t accidentally publish a draft.

Once there are files in the _drafts folder, adding the --drafts argument to the jekyll serve command is all I need to see the drafts locally.

Of course, I also need to write a bit of Lisp code to integrate the draft mode into my blogging workflow. That is what the remainder of this post is about.

Implementation

For a blog post, I have the source file in org-mode and its exported file in Markdown. Now there is a new location dimension: they can live in either the _drafts or the _posts folder.

mode     source file (org mode)                exported md in Jekyll
draft    org/_drafts/on_image.org              jekyll/_drafts/on_image.md
publish  org/_posts/2027_02_08_on_image.org    jekyll/_posts/2027_02_28_on_image.md

In terms of content, the published post and its final draft, along with their exported counterparts, are the same; they only live in different locations. Their content could differ to allow some flexibility, e.g. the published post could have higher-resolution screenshots, but that is a feature to implement in the future. For now, I follow the simple “same but in different places” rule.

The new process looks like this: when I publish a post, it moves the org file from the _drafts to the _posts folder, adds a date to the filename (which I already have), and then triggers the exporting process. To avoid duplication, it removes the original org file and its exported Markdown draft.

To achieve that, the main missing piece in my current Emacs configuration is the yt/jekyll-find-export function (see below). For a post in _drafts or _posts, it finds the full path of the corresponding exported Markdown file. I can then delete it or start the exporting process.

 
(defun yt/jekyll-is-draft-p ()
  "if the file is inside of the draft directory, it is a draft."
  (let ((draft-dir  (file-truename jekyll-source-drafts-dir))
        (filepath (file-truename (buffer-file-name))))
    (string-prefix-p draft-dir filepath)))


(defun yt/jekyll-find-export ()
  "find the full path to the exported file of the current post."
  (let* ((src-file (file-name-nondirectory (buffer-file-name)))
         (dest-file (file-name-with-extension src-file ".md")))
    (if (yt/jekyll-is-draft-p)
        (file-name-concat jekyll-site-draft-dir dest-file)
      (file-name-concat jekyll-site-post-dir dest-file))))