Yi Tang Data Science and Emacs

State of This Blog

Table of Contents

This static blog was built with Jekyll in 2014. It has survived for 7 years, which counts as a success when it comes to personal blogging. Part of the reason is a good blogging workflow: write posts in Org Mode, export to HTML with a front matter, build the site with Jekyll, send the folder to an Amazon S3 bucket, and that’s it. All done in Emacs, of course.

Technical Debt

I added a few things to the workflow to enhance the reading experience, including code highlighting, centred images with captions, a table of contents, etc. There are more features I want to add, but at the same time, I want to be able to just write.

With that mindset, whenever there were issues, I applied quick fixes without a deep understanding of the actual causes. It seemed efficient, until recently some fixes became counter-productive.

I started seeing underscores (_) exported as \_ and <p> tags appearing in code snippets. Both sound like quick fixes, but I just couldn’t get them right after a few hours. For the last few posts, I had to fix them manually in each read-edit-export-fix iteration.

Revisit the Tech Stack

I have an ambitious goal for this blog, so it is time to stop sweeping issues under the carpet. I studied the technologies used for this blog: Jekyll, AWS, and Org Mode exporting. It was a good chance to practise Org-roam for taking atomic notes. The time was well spent as I learnt a lot.

I was impressed that I got the whole thing up and running 7 years ago. I don’t think I would have the willpower to do it now.

Still, there are a lot of things that I do not have a good understanding of, e.g. the Liquid templates, HTML and CSS tags, etc. The syntax just puts me off.

Long Ride with Jekyll

I would prefer to write in a simple format like Org Mode or Markdown and not have to deal with HTML/CSS at all. On a couple of occasions I could not resist the temptation to look for an alternative to Jekyll, with no luck. It seems HTML is the only way because it is native to the web.

So the plan is to stick with Jekyll for at least a few more years. In the next couple of weeks, I’ll try to fix all the issues; after that, I’ll gradually add more features to enhance the writing and reading experience.

I hope people who use a similar tech stack (Org-mode, Emacs, Jekyll, AWS) can benefit from my work.

Setup Emacs Servers in macOS

I switched to macOS last year for editing home gym videos. I was, and still am, amazed by how fast the M1 chip is at exporting 4K videos. macOS has also enriched the Emacs experience, enough to deserve another blog post.

So I have been slowly adapting my Emacs configuration and workflow to macOS. One of the changes is the Emacs server.

The goal is to have fully loaded Emacs instances running all the time so I can use them anytime and anywhere, in the terminal or via Spotlight. They are started upon login. In case Emacs crashes (rare, but more often than I’d like), or I have to stop them because I messed up the configuration, they restart automatically.

Emacs Server Configuration

I had this set up on Linux using systemd, as described in my previous blog post.

In macOS, launchd is the service manager, and launchctl is its command-line interface for listing, starting, and stopping services.

To set up an Emacs server, create a plist file in the ~/Library/LaunchAgents folder. In my case, I named it emacs_work.plist.

 1: # cat ~/Library/LaunchAgents/emacs_work.plist
 2: <plist version="1.0">
 3:   <dict>
 4:     <key>Label</key>
 5:     <string>emacs_work</string>
 6:     <key>ProgramArguments</key>
 7:     <array>
 8:       <string>/opt/homebrew/opt/emacs-plus@29/bin/emacs</string>
 9:       <string>--fg-daemon=work</string>
10:     </array>
11:     <key>RunAtLoad</key>
12:     <true/>
13:     <key>KeepAlive</key>
14:     <true/>    
15:     <key>StandardOutPath</key>
16:     <string>/tmp/emacs_work.stdout.log</string>
17:     <key>StandardErrorPath</key>
18:     <string>/tmp/emacs_work.stderr.log</string>
19:   </dict>
20: </plist>

It is an extension of Emacs Plus’ plist file. I made a few changes to run two Emacs servers: one for work (data science, research) and one for personal use (GTD, books). Taking the "work" server as an example, the important attributes of the plist configuration file are:

Line 5
The unique service name to launchctl
Line 8
The full path to the Emacs program. In my case, it is /opt/homebrew/opt/emacs-plus@29/bin/emacs
Line 9
The "--fg-daemon" option sets the Emacs server name to "work". Later I can connect to this server by passing the "-s work" option to emacsclient
Line 13
KeepAlive is set to true, so launchd keeps trying to restart the server in case of failure
Line 16 and 18
The locations of the standard output and error files. They are used for debugging. Occasionally I have to check those files to see why the Emacs servers stopped working, usually because I introduced bugs into my .emacs.d.

With the updated plist files in place, I start the Emacs servers with

launchctl load -w ~/Library/LaunchAgents/emacs_work.plist
launchctl load -w ~/Library/LaunchAgents/emacs_org.plist

launchctl list | grep -i emacs is a handy snippet that lists the status of the services whose names include "emacs". The output I have right now is

PID    Exit Code  Label
1757   0          emacs_org
56696  0          emacs_work

It shows both Emacs servers are running fine with exit code 0.

Launch Emacs GUI in Terminal

I can now open an Emacs GUI connected to the "work" Emacs server by running emacsclient -c -s work &. The -c option creates a new graphical frame, and the -s option specifies which server to connect to.

Launch Emacs GUI in Spotlight

In macOS, I find it natural to open applications using Spotlight: press ⌘ + space to invoke Spotlight, type "work" in the search bar to narrow the search down to the "emacs_work" application, and hit return to launch it. It achieves the same thing as the command above but can be used anywhere.

I uploaded a demo video on YouTube to show it in action. You might want to watch it at 0.5x speed because I typed so fast...

To implement this shortcut, open the "Automator" application, create a new "Application", select "Run Shell Script", and paste the following bash code

/opt/homebrew/opt/emacs-plus@29/bin/emacsclient \
    --no-wait \
    --quiet \
    --suppress-output \
    --create-frame -s work \
    "$@"

and save it as emacsclient_work in the ~/Applications folder.

Essentially, the bash script above is wrapped up as a macOS application named emacsclient_work, and Spotlight searches the Applications folder by default.

Speed Up Sparse Boolean Data

I’m working on replicating the (Re-)Imag(in)ing Price Trends paper. The idea is to train a Convolutional Neural Network (CNN) "trader" to predict stocks’ returns. What makes this paper interesting is that the model uses images of the pricing data, not the traditional time-series format. It takes financial charts like the one below and tries to mimic traders’ behaviour, buying and selling stocks to optimise future returns.


Alphabet 5-day Bar Chart Showing OHLC Price and Volume Data

I like this idea, so it became my final assignment for the Deep Learning Systems: Algorithms and Implementation course.

Imaging On-the-fly

To train the model, the price and volume data are transformed into black-and-white images, which are just 2D matrices of 0s and 1s. For only around 100 stocks’ pricing history, there are about 1.2 million images in total.

I used an on-the-fly imaging process during training: in each batch, it loads the pricing data for a given stock, samples one day in the history, slices a chunk of pricing data, and then converts it to an image. It takes about 2 milliseconds (ms) to do all that, so it takes around 40 minutes to loop through all 1.2 million images.

%%timeit 
df = MarketData(DATA_DIR)['GOOGL']
imager = ImagingOHLCV(img_resolution, price_prop=price_prop)
img = imager(df.tail(5))

1.92 ms ± 26.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
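The imaging step measured above can be sketched roughly as below. This is a minimal stand-in, not the paper's actual code: the bar-chart layout (open tick on the left, high-low bar in the middle, close tick on the right) follows the figure above, while the function name and the 32-pixel height are assumptions for illustration.

```python
import numpy as np

def ohlc_to_image(o, h, l, c, height=32):
    """Sketch: map n days of OHLC prices to a binary image of shape
    (height, 3*n), using one 3-pixel-wide column per day."""
    o, h, l, c = map(np.asarray, (o, h, l, c))
    lo, hi = l.min(), h.max()

    def row(p):
        # Scale a price to an integer pixel row (0 = lowest price).
        return np.round((p - lo) / (hi - lo) * (height - 1)).astype(int)

    img = np.zeros((height, 3 * len(o)), dtype=np.uint8)
    for day in range(len(o)):
        img[row(o)[day], 3 * day] = 1                      # open tick
        img[row(l)[day]:row(h)[day] + 1, 3 * day + 1] = 1  # high-low bar
        img[row(c)[day], 3 * day + 2] = 1                  # close tick
    return img[::-1]  # flip so row 0 is the top of the chart

# Two made-up days of OHLC data -> a 32x6 binary image
img = ohlc_to_image([10, 11], [12, 13], [9, 10], [11, 12])
```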

To train 10 epochs, that's 400 minutes of loading data. To train one epoch on the full dataset with 5,000 stocks, that's about 2,000 minutes of loading data alone!

PyTorch uses multiprocessing to load data on the CPU while training on the GPU, so the problem is less severe there. But I'm using needle, the deep learning framework we developed during the course, and it doesn't have this functionality yet.

During training with needle, the GPU utilisation is only around 50%. Now that all the components in the end-to-end pipeline are almost complete, it is time to train with more data, go deeper (larger/more complicated models), try hyper-parameter tuning, etc.

But before moving to the next stage, I need to improve the IO.

Scipy Sparse Matrix

In the image above, there are a lot of black pixels, or zeros, in the data matrix. In general, only 5%-10% of the pixels in this dataset are white.

So my first attempt was to use scipy's sparse matrix instead of numpy's dense matrix: I save the sparse matrix, load it, and then convert it back to a dense matrix for training the CNN model.

%%timeit
img_sparse = sparse.csr_matrix(img)
sparse.save_npz('/tmp/sparse_matrix.npz', img_sparse)
img_sparse_2 = sparse.load_npz('/tmp/sparse_matrix.npz')
assert np.all(img_sparse_2 == img)

967 µs ± 4.99 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

It reduces the IO time to about 1 ms, roughly half the time. Not bad, but I was expecting a lot more given how sparse the data is.

Numpy Bites

Then I realised the data behind the images is just 0s and 1s; in fact, a lot of 0s and only some 1s. So I can ignore the 0s, save only the 1s, and reconstruct the images from them.

It is so simple that numpy already has functions for this type of data processing. The numpy.packbits function packs the image matrix of 0s and 1s into a compact 1D uint8 array, storing 8 binary values per byte. Then numpy.unpackbits does the inverse: it reconstructs the image matrix from the packed array.

This process reduces the time of loading one image to 0.2 milliseconds; that's 10 times faster than the on-the-fly method, with only a few lines of code.

%%timeit 
temp_file = "/tmp/img_np_bites.npy"
img_np_bites = np.packbits(img.astype(np.uint8))
np.save(temp_file, img_np_bites)
img_np_bites = np.load(temp_file)
img_np_bites = np.unpackbits(img_np_bites).reshape(img.shape)
assert np.all(img_np_bites == img)

194 µs ± 3.95 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
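To make the packing concrete, here is a tiny standalone example (the values are made up, not from the benchmark above): 10 bits become 2 bytes, with the second byte zero-padded.

```python
import numpy as np

bits = np.array([1, 0, 0, 0, 0, 0, 0, 1, 1, 1], dtype=np.uint8)
packed = np.packbits(bits)  # 8 bits per byte, zero-padded at the end
# 0b10000001 = 129 and 0b11000000 = 192
assert packed.tolist() == [129, 192]

# count= drops the padding bits on the way back
restored = np.unpackbits(packed, count=bits.size)
assert np.all(restored == bits)
```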

Another benefit is that the file size is much smaller: 188 bytes compared to 1,104 bytes for the sparse matrix. So it takes only about 226MB of disk space to save all 1.2 million images!

Path('/tmp/img_np_bites.npy').stat().st_size, Path('/tmp/sparse_matrix.npz').stat().st_size

188, 1104

Problems of Having Millions of Files

It took only a couple of minutes to generate 1.2 million files on my Debian machine - so quick! But then I realised this approach is not scalable without modification, because there's a limit to the number of files the OS can accommodate. The technical term is inode. According to this StackExchange question, once the filesystem is created, the inode limit cannot be increased (yes, I was there).

Without going down the database route, one quick workaround is to bundle images together, for example, 256 images per file. Later in training, I load 256 images in one go, then split them into chunks. Just ensure the number of images per file is a multiple of the training batch size, so I don't have to deal with unequal batch sizes. Since bundled images are always trained together, bundling reduces the randomness of SGD, so I won't bundle too many images together; 256 sounds about right.
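The bundling idea combines naturally with packbits. A sketch, where the file name and the 64x60 image size are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for 256 real chart images: binary, roughly 7% white pixels.
images = (rng.random((256, 64, 60)) < 0.07).astype(np.uint8)

# Write one file per bundle of 256 images, packed 8 pixels per byte.
np.save("/tmp/bundle_0000.npy", np.packbits(images))

# In training: load one bundle, unpack it, and split it into batches.
packed = np.load("/tmp/bundle_0000.npy")
bundle = np.unpackbits(packed, count=images.size).reshape(images.shape)
batches = np.split(bundle, 4)  # e.g. batch size 64
assert np.all(bundle == images)
```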

LSP and other tools can also cause problems when they monitor folders with a large number of files. Moving the image files out of the project folder is the way to go, so Emacs won't complain or freeze.

PoorMan's CI in Emacs

I have been working on the Deep Learning Systems course. It is the hardest course I have studied since university. I would never have thought that I'd need CI for a personal study project; it just shows how complex this course is.

Here is the setup: the goal is to develop a pytorch-like DL library that supports ndarray ops and autograd, and to implement DL models, LSTM for example, from scratch. That's the exciting maths part. The tricky part is that it supports both CPU devices, via C++11, and GPU devices, via CUDA. On the user-facing side, the interface is written in Python. I worked on my M1 laptop most of the time, and switched to my Debian desktop for the CUDA implementation.

It was a fine Saturday afternoon. I had made a breakthrough implementing the gradient of the convolution op in Python after a couple of hours of tinkering in a local coffee shop. I rushed home and booted up Debian to test the CUDA backend, only to find an "illegal memory access" error!

It took me a few cycles of rolling back to previous changes in git to find where the problems were. It made me think about the need for CI. In the ideal scenario, I would have a CI that automatically runs the tests on both CPU and CUDA devices, to ensure a bug-fix on the CPU side doesn't introduce new bugs on the CUDA side, and vice versa. But I don't have this setup at home.

Two Components of PoorMan CI

So I implemented what I call PoorMan CI. It is a semi-automated process that gives me some of the benefits of full CI. I tried hard to refrain from doing anything fancy because I don't have time: the final homework is due in a few days. The outcome is simple yet powerful.

The PoorMan CI consists of two parts:

  1. a bunch of bash functions that I can call to run the tests, capture the outputs, save them in a file, and version-control them

    For example, wrap the below snippet in a single function

pytest -l -v -k "not training and cuda" \
       > test_results/2022_12_11_12_48_44__fce5edb__fast_and_cuda.log
git add test_results/2022_12_11_12_48_44__fce5edb__fast_and_cuda.log

  2. a log file where I keep track of the code changes, and whether each new change fixes anything or breaks anything.

    In the example below, I have a bullet point for each change committed to git, with a short summary and a link to the test results. The fce5edb and f43d7ab are the git commit hashes.

    - fix grid setup, from (M, N) to (P, M)!
    [[file:test_results/2022_12_11_12_48_44__fce5edb__fast_and_cuda.log]]
    
    - ensure all data/parameters are in the right device. cpu and cuda, all pass! milestone.
    [[file:test_results/2022_12_11_13_51_22__f43d7ab__fast_and_cuda.log]]
    

As you can see, it is very simple!

Benefits

It changed my development cycle a bit: each time before I claim something is done or fixed, I run this process, which takes about 2 minutes for the two fast runs. I use this time to reflect on what I've done so far, write a short summary of what got fixed and what broke, check the test results into git, update the test log file, etc.

It sounds tedious, but I found myself enjoying it: it gives me confidence and reassurance about the progress I'm making. The time spent reflecting also gives my brain a break and provides clarity on where to go next.

In just a few hours of using it, I was amazed at how easy it is to introduce new issues while fixing existing ones.

Implement in Org-mode

I don't have to use Org-mode for this, but I don't want to leave Emacs :) Plus, Org-mode shines in literate programming where code and documentation are put together.

This is actually how I implemented it in the first place. This section is dedicated to showing how to do it in Org-mode. I'm sure I will come back to this shortly, so it serves as documentation for myself.

Here is what I did. I have a file called poorman_ci.org; a full example can be found in this gist. An extract is shown below.

I group all the tests logically into "fast and cpu", "fast and cuda", "slow and cpu", and "slow and cuda". I have a top-level header named grouped tests, and each group has its own 2nd-level header.

The top header has a property drawer where I specify the shell session within which the tests are run, so that

* grouped tests
:PROPERTIES:
:CREATED:  [2022-12-10 Sat 11:32]
:header-args:sh:    :session *hw4_test_runner* :async :results output :eval no
:END:
  1. it is persistent: I can switch to the shell buffer named hw4_test_runner and intervene if needed
  2. it runs asynchronously in the background

All the shell code blocks under the grouped tests header inherit those attributes.

The first code block defines the variables used to create a run id, built from the timestamp and the git commit hash. The run id is shared by all the code blocks.

#+begin_src sh :eval no
wd="./test_results/"
ts=$(date +"%Y_%m_%d_%H_%M_%S")
git_hash=$(git rev-parse --verify --short HEAD)
echo "run id: " ${ts}__${git_hash}
#+end_src

To run the code block, move the cursor inside the code block, and hit C-c C-c (control c control c).

Then I define the first code block, which runs all the tests on the CPU except the language model training. I name this batch of tests "fast and cpu".

#+begin_src sh :var fname="fast_and_cpu.log"
fname_full=${wd}/${ts}__${git_hash}__${fname}
pytest -l -v -k "not language_training and cpu" \
     2>&1 | tee ${fname_full}
#+end_src
  1. It creates the full path of the test results file. The fname variable is set in the code block header, which is a nice feature of Org-mode.
  2. pytest provides an intuitive interface for filtering tests, here I use "not language_training and cpu".
  3. The tee program shows the outputs and errors, and at the same time saves them to a file.

Similarly, I define code blocks for "fast and cuda", "slow and cpu", "slow and cuda".

So at the end of the development cycle, I open the poorman_ci.org file, run the code blocks sequentially, and manually update the change log. That's all.

Machine Learning in Emacs - Copy Files from Remote Server to Local Machine

dired-rsync is a great addition to my Machine Learning workflow in Emacs

Table of Contents

For machine learning projects, I have tweaked my workflow so that interaction with the remote server is kept to a minimum. I prefer to do everything locally on my laptop (M1 Pro), where I have all the tools for data analysis, visualisation, debugging, etc., and I can do it all without lag or Wi-Fi.

The only use of the servers is running computationally intensive tasks like recursive feature selection, hyperparameter tuning, etc. For that, I ssh to the server, start tmux, git pull to update the codebase, and run a bash script that I prepared locally to fire off hundreds of experiments. All done in Emacs of course, thanks to Lukas Fürmetz’s vterm.

The only thing left is getting the experiment results back to my laptop. I used two approaches for copying the data locally: a file manager GUI and the rsync tool in the CLI.

Recently I discovered dired-rsync, which works like a charm: it combines the two approaches above, providing an interactive way of running the rsync tool in Emacs. What’s more, it integrates seamlessly into my current workflow.

They all have their own use cases. In this post, I briefly describe these three approaches for copying files, with a focus on dired-rsync: how to use it, how to set it up, and my thoughts on how to enhance it.

Note that RL stands for remote location, i.e. a folder on a remote server, and LL stands for local location, the RL’s counterpart. The task in question is how to efficiently copy files from RL to LL.

File Manager GUI

This is the simplest approach and requires little technical skill. The RL is mounted in the file manager, which acts as an access point, so it can be used just like a local folder.

I usually have two tabs open side by side, one for the RL and one for the LL, compare the differences, and then copy what is useful and exists in the RL but not in the LL.

I used this approach on my Windows work laptop, where rsync is not available, so I had to copy files manually.

Rsync Tool in CLI

The rsync tool is similar to cp and scp, but it is much more powerful:

  1. It copies files incrementally, so it can be stopped at any time without losing progress
  2. The output shows which files have been copied, which remain, the copying speed, the overall progress, etc
  3. Files and folders can be included/excluded by specifying patterns

I have a bash function in the project’s script folder as a shorthand, like this

copy_from_debian_to_laptop () {
    # first argument to this function
    folder_to_sync=$1
    # define where the RL is 
    remote_project_dir=debian:~/Projects/2022-May
    # define where the LL is 
    local_project_dir=~/Projects/2022-May          
    rsync -avh --progress \
	  ${remote_project_dir}/${folder_to_sync}/ \
	  ${local_project_dir}/${folder_to_sync}
}

To use it, I first cd (change directory) to the project directory in the terminal, call the copy_from_debian_to_laptop function, and use TAB completion to quickly get the directory I want to copy, for example

copy_from_debian_to_laptop experiment/2022-07-17-FE

This function is called more often from an org-mode file where I keep track of all the experiments.

Emacs’ Way: dired-rsync

This approach is a blend of the previous two, letting the user enjoy the benefits of a GUI for exploring and the power of rsync.

What’s more, it integrates into the current workflow by simply calling dired-rsync instead of dired-do-copy, i.e. pressing the r key instead of the C key, using the configuration in this post.

For those who are not familiar with copying files using dired in Emacs, here is the step-by-step process:

  1. Open two dired buffers, one at the RL and one at the LL, either manually or using bookmarks
  2. Mark the files/folders to copy in the RL dired buffer
  3. Press the r key to invoke dired-rsync
  4. It asks where to copy to. The default destination is the LL, so press Enter to confirm.

After that, a unique process buffer, named *rsync with a timestamp suffix, is created to show the rsync output. I can stop the copying by killing the process buffer.

Setup for dired-rsync

The dired-rsync-options variable controls the output shown in the process buffer. It defaults to "-az --info=progress2", which shows the overall progress in one line, clean and neat (not on macOS though, see Issue 36). Sometimes I prefer "-azh --progress" so I can see exactly which files are copied.

There are other options, for showing progress in the modeline (dired-rsync-modeline-status) and hooks for sending notifications on failure/success (dired-rsync-failed-hook and dired-rsync-success-hook).

Overall, the library is well designed and the default options work for me, so I can keep a bare-minimal configuration, as below (borrowed from ispinfx):

(use-package dired-rsync
  :demand t
  :after dired
  :bind (:map dired-mode-map ("r" . dired-rsync))
  :config (add-to-list 'mode-line-misc-info '(:eval dired-rsync-modeline-status 'append))
  )

There are two more things to do on the system side:

  1. In macOS, the default rsync is a 2010 version. It does not work with the latest rsync I have on the Debian server, so I upgraded it using brew install rsync.

  2. There is no way of typing a password in the process buffer, so I have to make sure I can rsync without the remote server asking for one. It sounds complicated, but fortunately it only takes a few steps, as described in Setup Rsync Between Two Servers Without Password.

Enhance dired-rsync with compilation mode

It’s such a great library and it makes my life much easier. Still, it could be improved further to provide a better user experience, for example, by keeping the process buffer alive as a log after the copying has finished, in case the user wants to have a look later.

At the moment, there’s no easy way of changing the arguments sent to rsync. I might want to do a dry-run (adding the -n argument) so I can see exactly which files would be copied before running, or I might need to exclude certain files/folders, or rerun the copying when new files have been generated on the RL.

If you have used a compilation buffer before, you know where I am going. That’s right: I am thinking of turning the rsync process buffer into compilation mode, so it would inherit these two features:

  1. Press g to rerun the rsync command when I know new files have been generated on the RL
  2. Press C-u g (g with a prefix) to change the rsync arguments before running, for a dry-run, inclusion, or exclusion

I don’t have much experience in elisp, but I had a quick look at the source code and it seems there’s no easy way of implementing this idea, so it’s something to add to my ever-growing Emacs wish-list.

In fact, the limitation comes from using lower-level elisp functions. The Emacs Lisp manual section on Process Buffers states that

Many applications of processes also use the buffer for editing input to be sent to the process, but this is not built into Emacs Lisp.

What a pity. For now, I enjoy using it and will look for opportunities to improve it.

If you have any questions or comments, please post them below. If you liked this post, you can share it with your followers or follow me on Twitter!