Yi Tang Data Science and Emacs

Setup ssh-agent Systemd Service for Emacs

Problem Statement

My personal desktop is not booting (the motherboard is probably dead), so I have been setting up my server so I can work while sorting things out.

I got stuck getting Magit to work in emacsclient: I thought I could run ssh-add inside Emacs and that would allow Magit to access my git repos over ssh, but apparently that is not the case.

After some digging, I learnt that the problem I have to solve is to run one ssh-agent in the background and then make Emacs/Magit, or any other program, hook onto it. Then, once I run ssh-add and type the passphrase for the first time, either inside Emacs or in a bash terminal, everything works.

Implementation

Drop the following unit file into ~/.config/systemd/user/ssh-agent.service.

[Unit]
Description=SSH key agent

[Service]
Type=simple
Environment=SSH_AUTH_SOCK=%t/ssh-agent.socket
ExecStart=/usr/bin/ssh-agent -D -a $SSH_AUTH_SOCK

[Install]
WantedBy=default.target

The important things are

  1. The environment variable SSH_AUTH_SOCK is specified. The socket can live anywhere, as long as the same environment variable in the other programs points to the same location.
  2. ssh-agent is invoked with the -a option so it binds to the socket address specified in the step above.

The %t is a specifier1 in systemd; it is equivalent to the $XDG_RUNTIME_DIR variable in Debian. It points to the runtime temporary directory, which apparently is safer2 than the /tmp directory. The runtime directory is cleaned up after the ssh-agent stops, so it is non-persistent.

To start the ssh-agent service:

 
systemctl enable --user ssh-agent
systemctl start --user ssh-agent

After that, update the unit file of Emacs to include this line (see my blog post Managing Emacs Server as Systemd Service for the full setup).

Environment=SSH_AUTH_SOCK=%t/ssh-agent.socket
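
One way to apply this, sketched under the assumption that the Emacs daemon runs as a user unit named emacs.service (as in that earlier post), is to add the line via an override drop-in rather than editing the unit file directly:

systemctl edit --user emacs
# in the editor that opens, add:
#   [Service]
#   Environment=SSH_AUTH_SOCK=%t/ssh-agent.socket
systemctl restart --user emacs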

To make it work for the bash shell and any program launched from a bash terminal, add this line to ~/.bashrc.

 
export SSH_AUTH_SOCK="$XDG_RUNTIME_DIR/ssh-agent.socket"
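
To confirm the hook-up works end to end, open a new terminal (so the updated ~/.bashrc is sourced) and check the agent; the key path below is only an example, substitute your own:

echo $SSH_AUTH_SOCK          # should point at the socket under the runtime directory
ssh-add ~/.ssh/id_ed25519    # prompts for the passphrase once
ssh-add -l                   # the key should now be listed, and Magit can push/pull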

Alternatives

There are programs developed to solve this specific problem (see the Debian wiki). While using such a program (e.g. keychain) seems like a simpler alternative, I prefer systemd as the unified approach for managing background services. I have been using it for emacsclient, and now I am adding ssh-agent to it.

What is your preference? How do you solve this problem?

Footnotes

1 All the specifiers are listed here.

2 I am not a security expert but the StackExchange comments seem to make sense.

Retiring Raspberry Pi 4 as Home Server and NAS

Table of Contents

  1. Good Start for Self-Hosting
  2. Lack of NAS Capacity
  3. Looking for a Successor
  4. Unexpected
  5. Setting up z170a
  6. Power Consumption

Good Start for Self-Hosting

The little Raspberry Pi 4 (RP4) served me well over the last two years. I used it to host NextCloud/Syncthing for syncing files between devices, to scrape financial data from Yahoo Finance, and to run Time Machine backups for macOS.

The latest addition to the service stack is paperless-ngx. It allows my Canon printer/scanner to send digital copies of documents directly to the RP4 or Gmail.

The RP4 handles all the demands without showing any signs of struggle. It draws as little as 6W, while the Xbox One S draws 11W when sleeping. With the energy crisis in the UK, I have come to appreciate the energy efficiency of the RP4. The ARM chip in it really impressed me.

Lack of NAS Capacity

A 3TB portable hard drive (WD My Passport) was attached to the RP4 to store media data. The USB 3.0 connection is surprisingly stable and fast. With both ends connected by ethernet cables, the file transfer speed can reach up to 100 MB/s. When my MacBook Pro uses Wi-Fi, the speed drops to about 40-50 MB/s, but it is still great because of the convenience.

Later I started using it as a NAS to store my Final Cut Pro library. The 4K home gym videos I shot on an iPhone 12 Pro are numerous1! The hard drive keeps filling up.

I could get another portable hard drive, but then it would get filled up again, perhaps in less than a month. So it occurred to me that I need a proper home server with full NAS capacity.

Looking for a Successor

I did a bit of research but was not able to find a good product. I suspect the reason is that NAS builds are a niche area while the PC industry is gaming-centric, focusing on faster, bigger, and fancier hardware with unnecessary RGB lights; that is where the profits are, I presume.

I came across some innovative products on AliExpress from China, such as the TopTon N5105 board. It is more powerful, consumes only slightly more electricity, and has 6 SATA ports! It would be a perfect successor to my RP4.

But I am not comfortable ordering electronics from AliExpress; returning an item or sending it back for repair would be a nightmare.

PS: The company is growing fast and has continued to innovate; the product line has extended to the Intel N100 with an additional NVMe drive and a USB-C port. Their website and marketing materials have been notched up quite a bit. I kind of regret not taking the risk back then.

Unexpected

The other day, I was re-organising (again) my home office and had to move a bookshelf. I started moving it without taking everything off, and a motherboard fell off. It was the z170a with an i5-6600k and a heat sink attached to it. The motherboard was from my first desktop, which I purchased 10 years ago when I started participating in Kaggle competitions in 2014.

After a quick inspection, I saw some pins were bent. I felt ashamed and sorry for the motherboard, as I had not taken care of it. So I made a promise: if it survived the fall, I would use it for my NAS.

Well, it did, so I had found my NAS.

Setting up z170a

While putting it together, one SATA port snapped off, but the rest are still fine. Apart from that, everything went smoothly. Debian 12 has become much easier to install thanks to the isohybrid technology, and the non-free firmware is now part of the installation image itself.

The server setup scripts and configuration are saved in a selfhosted-services git repository, so restoring the services took little effort.

I had one little trick: I assigned the RP4’s IP address to the new z170a server so that on the client side I didn’t have to change anything. This was achieved rather easily: a few clicks in the ASUS router web UI and then a reboot.

While setting it up, I noticed the z170a system is much more responsive, thanks to the 3.5 GHz i5-6600k CPU and an SSD that is much faster than the SD card. I was able to run multiple processes at the same time.

The longest part was copying the files from the 3TB portable hard drive to the z170a’s internal HDD, which took about 20 hours.

It also has great extensibility: there are 3 free SATA ports for HDDs and two PCIe slots.

Power Consumption

The only downside is that it consumes a lot more electricity. When testing the bare board, it drew only 10W. After putting everything together with the additional HDDs, fans, and ethernet cable, the power meter jumped to 45W. I removed the hard drives one by one to see where the power was going.

  • No HDD: 27W
  • IronWolf alone: 32W, a 5W increase.
  • IronWolf + Seagate: 37W, another 5W increase.
  • IronWolf + Seagate + Toshiba: 45W, an 8W increase.

So I kept only the IronWolf, which is a 3TB NAS-grade HDD.

I also tried tweaking the BIOS and the Linux kernel to reach deeper C-states, but I felt it was over-engineering, so I have happily settled at 27W.

Footnotes

1 I record weightlifting to correct and improve my techniques.

Use Ledger-Cli to Track DIY Project Expenses

Table of Contents

  1. Personal Technical Challenge
  2. Baby Steps
  3. Why? - Effort Estimation

Personal Technical Challenge

I used ledger-cli1 before and it was a painful experience. The problem was not rooted in the tool but in how I intended to use it: I wanted to track all my expenses, from buying a cup of coffee to booking a holiday package. When I started this journey, there was a massive jump from knowing little to nothing about personal finance to doing double-entry accounting in plain text.

Though I gave up, it introduced me to the idea of owning my bank transaction data in text files on my personal computer. So over the years, I manually curated about 8 years of historical transaction data.

If you haven’t done so, I strongly recommend you go to your banks’ websites and download your transaction data manually, going as far back as you can. You will notice that banks only give access to 3-5 years of data2. It’s a shame that banks use outdated technologies, but it is better than having nothing.

Since I had the data, I did some analysis and charts in Python/R. But I kept wondering what ledger-cli could offer. I occasionally saw blog posts on ledger-cli in the Emacs communities, so there must be something out there.

It has also become a personal challenge: I chose not to give up but to put it aside, to tackle it again once I was older.

Baby Steps

Hopefully, I have become smarter as well. This time, to ensure I can successfully adopt the tool, I am reducing the scope to tracking only DIY project expenses.

I love DIY and wish I had more days for DIY projects. It is usually labour-intensive, and I feel hyped and extremely confident after a couple of DIY sessions. Pairing it with learning ledger-cli, a cognitively intensive activity, would make a nice bundle3.

Though the usage is simple, the question it can answer is important: I want to know, during or after a DIY project, exactly how much it costs. I could use a much simpler tool, like a spreadsheet or a pen and notebook, but I want this to be a stepping stone to acquiring ledger-cli properly in the future.
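
To give a flavour of what that minimal usage could look like, here is a sketch of a ledger journal for a made-up fence project; the file name, dates, payees and account names are all hypothetical:

; diy.ledger -- one DIY project, one expense account
2024/05/03 Screwfix
    Expenses:DIY:Fence       £32.50
    Assets:Bank:Current

2024/05/10 B&Q
    Expenses:DIY:Fence       £18.99
    Assets:Bank:Current

Running ledger -f diy.ledger balance Expenses:DIY then reports the running total for the project.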

Why? - Effort Estimation

I need an accurate answer on the actual costs so that I can use the data to train myself in cost estimation. This is a very important skill to have as a homeowner; it puts me in a much better position when negotiating with tradesmen. A lot of people in the UK have complained that they or their relatives got ripped off by tradesmen.4

In general, house repairs and improvements are getting much more expensive every year, due to the shortage of labourers, inflation, Brexit, etc. To give an example from my last two quotes, adding an electrical socket costs £240 and replacing a small section of water pipe costs £500.

I have a good habit of using org-mode to track time; my goal is to add ledger-cli to my system to track expenses. After that, I will know whether it is really worth doing the DIY myself or finding a proper tradesman. The total cost itself is not the only metric that matters, but it is a very essential one to have.

Footnotes

1 https://ledger-cli.org/

2 Why don’t banks give access to all your transaction activity?

3 I might pick it up from Atomic Habit

4 How many of you have been ripped off by builders / tradesmen? (or know someone closely that has)

Finding Highly Correlated Features

Table of Contents

  1. Motivation
  2. Implementation
  3. Parameterisation

Motivation

From a modelling perspective, having highly correlated features in the dataset is not a big problem. Regularised Lasso/Ridge regression is designed to deal with this kind of dataset, and ensemble trees are robust enough to be almost immune to it. Of course, all models require proper hyperparameter tuning with proper cross-validation.

The problem arises in understanding the feature contributions: if there are 5 features that are highly correlated, their individual contributions could be tiny, but their true contribution should be assessed by aggregating them as a group, e.g. adding their coefficients in Ridge, or adding their feature importances in LightGBM.

If their aggregated feature importance does turn out to be small, I can remove them from the model to get a simpler model. A mistake I used to make was removing correlated features based on their individual feature importance; it leads to less performant models.

A better and cleaner approach is to clean up the correlated features to begin with. Then I won’t need to do the feature-importance aggregation, and it speeds up the model development cycle: there are fewer features to look at, to train the model on, to verify the data quality of, etc. When the model goes live in production, it translates to less data to source and maintain.

Implementation

So I need to enrich my tool set to identify highly correlated features. I couldn’t find an existing library that does that, so I implemented it myself.

The key steps are:

  1. Based on the correlation matrix, create a correlation long table. Each row stands for the correlation between feature $X_1$ and feature $X_2$. Assuming there are three features in the dataset, the table looks like this.

    Row  X1  X2  Corr
    1    A   B   0.99
    2    A   C   0.80
    3    B   C   0.95
  2. Remove rows whose correlation is less than the threshold $T$. This significantly reduces the input to Step 3 (a sketch of Steps 1 and 2 follows this list).

    If the threshold is 0.9, then Row 2 will be removed.

  3. Treat the correlation table as a directed graph,

    1. Let $E$ be the set of unexplored nodes, initially filled with all the features in column $X_1$, and let $R$ be the result.

    2. For each node in $E$,

      1. Traverse the graph in a depth-first fashion until there are no connections left, adding the visited node to the result $R$ at each step.

      2. Remove the connected nodes in $R$ from the remaining nodes to explore in $E$.
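
As a rough sketch of Steps 1 and 2, assuming the features sit in a pandas DataFrame df, the long table can be built from the upper triangle of the correlation matrix; build_corr_table is a hypothetical helper name, not part of my original code.

import numpy as np
import pandas as pd

def build_corr_table(df: pd.DataFrame, threshold: float, method: str = "pearson") -> pd.DataFrame:
    """Steps 1 and 2: long-format correlation table, filtered by the threshold."""
    corr = df.corr(method=method).abs()
    # keep only the upper triangle so each pair appears once (a directed edge).
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    ds = upper.stack().rename("Corr").rename_axis(["X1", "X2"]).to_frame()
    # Step 2: drop the pairs whose correlation is below the threshold.
    return ds[ds["Corr"] >= threshold]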

The vanilla Python code corresponding to Step 3 is listed below. The ds object is a pandas.DataFrame, multi-indexed by $X_1$ and $X_2$, so ds.loc['A'].index gives all the connected features from $A$ whose correlation with $A$ is larger than the provided threshold.

 
from collections import defaultdict

import pandas as pd


def find_neighbors(ds: pd.DataFrame, root: str, res: set):
    """Recursively find the nodes connected with root in the graph ds."""
    res.add(root)
    if root in ds.index:
        ns = ds.loc[root].index.tolist()
        for n in ns:
            # skip nodes already visited to avoid redundant recursion.
            if n not in res:
                find_neighbors(ds, n, res)


def find_correlated_groups(ds: pd.DataFrame):
    """Group the highly correlated features.

    The ds object is a pandas.DataFrame, multi-indexed by X1 and X2.
    """
    res = defaultdict(set)

    # continue until all nodes are visited.
    cols = ds.index.get_level_values(0).unique().tolist()
    while len(cols) != 0:

        # always start from the root as ds is a directed graph.
        col = cols[0]
        find_neighbors(ds, col, res[col])

        # remove connected nodes from the remaining.
        for x in res[col]:
            if x in cols:
                cols.remove(x)
    return res
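
A usage sketch tying the pieces together; df and build_corr_table refer to the hypothetical helper sketched after the step list above, and the 0.95 threshold is arbitrary:

ds = build_corr_table(df, threshold=0.95)
groups = find_correlated_groups(ds)
for root, members in groups.items():
    print(root, "->", sorted(members))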

The result is a collection of mutually exclusive groups. Each group contains a set of highly correlated features, for example

Group A: {A, B, C}
Group D: {D, K, Z}

The next step is to decide which feature to keep within each group and remove the rest. The deciding factors can be data availability (e.g. choose the feature with the least missingness), the cost of data sourcing (e.g. free to download from the internet) or familiarity (e.g. the feature is well understood by people), etc.

Parameterisation

There are two hyperparameters:

  • The correlation type: It can be Pearson for numerical data and Spearman for ordinal/categorical data. For a large dataset, it would take some time to calculate the correlation matrix.

  • The correlation threshold $T$: The higher the threshold, the fewer features are removed, so it is less effective. However, if the threshold is set too low, it leads to a high false positive rate, e.g. two features can be correlated but still complement each other in the model.

I would test a range of values from 0.9 to 1 and review the results; a sketch of such a sweep follows the list below. Plotting the number of features to remove at varying thresholds shows that:

  • When $T=0.9$, there are about 95 groups, and in total 153 features to remove.
  • When $T=1$, there are 29 groups, and in total 37 features to remove.
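
A sketch of such a sweep, reusing find_correlated_groups and the hypothetical build_corr_table helper from above (the 0.01 step size is arbitrary):

import numpy as np

for t in np.arange(0.90, 1.001, 0.01):
    ds_t = build_corr_table(df, threshold=t)
    groups = find_correlated_groups(ds_t)
    # keep one feature per group, so the rest are candidates for removal.
    n_remove = sum(len(g) - 1 for g in groups.values())
    print(f"T={t:.2f}: {len(groups)} groups, {n_remove} features to remove")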


Proper end-to-end test runs are required to identify the best hyperparameters. As a quick rule of thumb, those 37 duplicated features identified with $T=1$ can be dropped without further testing.

The group sizes at varying thresholds $T$ provide an interesting insight into the data. Plotting the 75th percentile of the group sizes suggests that, apart from the 33 duplicated features, there is a large number of paired features (i.e. group size of 2) whose correlation is high, above 92%.


Less Excel, More R/Python in Emacs

Table of Contents

  1. Excel Is Great
  2. But
  3. Emacs Has More To Offer

Excel Is Great

Regardless of how powerful and convenient the R/Python data ecosystem becomes, there is still value in looking at the data in Excel, especially when exploring the data together with less technical people.

Thanks to its simple interface, Excel is widely used in data analysis: hover the mouse to select columns, apply filters, then calculate some statistics. Most of the time, that is all it takes to get the answers the clients are seeking.

I recently realised that being transparent and working with the tools that clients use plays a crucial role in strengthening trust and delivering impact to the business. Sometimes I think I should do more in Excel.

But

The problem with Excel is reproducibility: I’m not able to codify the clicking done in Excel and integrate it into the automated data pipeline. It is rather foolish to have quality-control procedures in the system, including code reviews, automated testing, CI, etc., but at the very end drop all those gatekeepers and go for error-prone manual steps.

Plus, it is far more efficient to have everything done in one place, giving a smooth process with no friction. That is a key factor in enabling quick turnaround.

So I have a motive to limit the use of Excel to delivering data to the business and paired data analysis. Once again, I have been looking into how much can be done without leaving Emacs.

Emacs Has More To Offer

I was pleased to discover the ess-view-data package and its Python counterpart, python-view-data. They interact with an active R/Python session in Emacs and print out data.frame objects in plain text, a.k.a. view the data. What’s more, they can process the data before viewing it, for example subsetting the data row- or column-wise, summarising the dataset, etc.

The package keeps a record of the data-processing pipeline, so in the end I have a copy of the R/Python code that generates the output. I can then effortlessly transfer the code to a script to ensure reproducibility in the future.

Another benefit derives from having a plain-text buffer for the data. It is handy for exploring large datasets with an excessive number of columns. For example, the dataset I work on on a daily basis has about 300 columns. It contains different flavours of the financials: the raw values, imputed, ranked, smoothed, etc.

It’s not possible to remember all the column names, even after spending extra time giving them meaningful names and ensuring the correct columns are referred to. Having a persistent plain-text buffer that I can search makes finding the right column names a lot easier. It also helps to check what is and is not in the data.

That’s my first impression of Shuguang Sun’s packages; they look promising.

If you have any questions or comments, please post them below. If you liked this post, you can share it with your followers or follow me on Twitter!