Yi Tang Data Science and Emacs

Machine Learning in Emacs - Copy Files from Remote Server to Local Machine

dired-rsync is a great addition to my Machine Learning workflow in Emacs


For machine learning projects, I have tweaked my workflow so that interaction with the remote server is kept to a minimum. I prefer to do everything locally on my laptop (M1 Pro), where I have all the tools for data analysis, visualisation, debugging, etc., and I can do all of that without lag or Wi-Fi.

The only use of servers is running computationally intensive tasks like recursive feature selection, hyperparameter tuning, etc. For that, I ssh to the server, start tmux, git pull to update the codebase, and run a bash script that I prepared locally to fire off hundreds of experiments. All done in Emacs of course, thanks to Lukas Fürmetz's vterm.

The only thing left is getting the experiment results back to my laptop. I used two approaches for copying the data to local: a file manager GUI and the rsync tool in the CLI.

Recently I discovered dired-rsync, which works like a charm - it combines the two approaches above, providing an interactive way of running the rsync tool in Emacs. What's more, it integrates seamlessly into my current workflow.

They all have their own use cases. In this post, I briefly describe these three approaches for copying files, with a focus on dired-rsync: how to use it, how to set it up, and my thoughts on how to enhance it.

Note that RL stands for remote location, i.e. a folder on a remote server, and LL stands for local location, the RL's counterpart. The action in question is how to efficiently copy files from RL to LL.

File Manager GUI

This is the simplest approach and requires little technical skill. The RL is mounted in the file manager, which acts as an access point, so it can be used just like a local folder.

I usually have two tabs open side by side, one for RL and one for LL, compare the differences, and then copy what is useful and exists in RL but not in LL.

I use this approach on my Windows work laptop, where rsync is not available, so I have to copy files manually.

Rsync Tool in CLI

The rsync tool is similar to cp and scp but much more powerful:

  1. It copies files incrementally, so it can stop at any time without losing progress
  2. The output shows which files are copied, which remain, the copying speed, overall progress, etc.
  3. Files and folders can be included/excluded by specifying patterns

I have a bash function in the project's script folder as a shorthand, like this:

copy_from_debian_to_laptop () {
    # first argument to this function
    folder_to_sync=$1
    # define where the RL is 
    remote_project_dir=debian:~/Projects/2022-May
    # define where the LL is 
    local_project_dir=~/Projects/2022-May          
    rsync -avh --progress \
	  ${remote_project_dir}/${folder_to_sync}/ \
	  ${local_project_dir}/${folder_to_sync}
}

To use it, I first cd (change directory) to the project directory in a terminal, call the copy_from_debian_to_laptop function, and use TAB completion to quickly get the directory I want to copy, for example

copy_from_debian_to_laptop experiment/2022-07-17-FE

This function is called more often from an org-mode file where I keep track of all the experiments.

Emacs’ Way: dired-rsync

This approach is a blend of the previous two, enabling the user to enjoy the benefits of a GUI for exploring and the power of rsync.

What's more, it integrates so well into my current workflow: I simply switch from calling dired-copy to calling dired-rsync, i.e. press the r key instead of the C key, using the configuration in this post.

For those who are not familiar with copying files using dired in Emacs, here is the step-by-step process:

  1. Open two dired buffers, one at RL and one at LL, either manually or using bookmarks
  2. Mark the files/folders to copy in the RL dired buffer
  3. Press the r key to invoke dired-rsync
  4. It asks where to copy to. The default destination is LL, so press Enter to confirm.

After that, a unique process buffer, named *rsync with a timestamp suffix, is created to show the rsync output. I can stop the copying by killing the process buffer.

Setup for dired-rsync

The dired-rsync-options variable controls the output shown in the process buffer. It defaults to "-az --info=progress2", which shows the overall progress in one line, clean and neat (not on macOS though, see Issue 36). Sometimes I prefer "-azh --progress" so I can see exactly which files are copied.

There are other options for showing progress in the modeline (dired-rsync-modeline-status) and hooks for sending notifications on failure/success (dired-rsync-failed-hook and dired-rsync-success-hook).

Overall the library is well designed, and the default options work for me, so I can get away with the bare-minimal configuration below (borrowed from ispinfx):

(use-package dired-rsync
  :demand t
  :after dired
  :bind (:map dired-mode-map ("r" . dired-rsync))
  :config (add-to-list 'mode-line-misc-info '(:eval dired-rsync-modeline-status 'append))
  )

There are two more things to do on the system side:

  1. On macOS, the default rsync is a 2010 version. It does not work with the latest rsync I have on the Debian server, so I upgraded it using brew install rsync.

  2. There is no way of typing a password, a limitation of using a process buffer, so I have to ensure I can rsync without the remote server asking for one. It sounds complicated, but fortunately it only takes a few steps, as described in Setup Rsync Between Two Servers Without Password.
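Passwordless rsync boils down to key-based SSH authentication. A minimal sketch (the host name debian is illustrative, and the key goes into a temp dir here so nothing real is touched; real usage would use ~/.ssh):

```shell
# Generate an SSH key pair with an empty passphrase (temp dir for the demo)
keydir=$(mktemp -d)
ssh-keygen -t ed25519 -N "" -q -f "$keydir/id_ed25519"
ls "$keydir"

# In real usage, install the public key on the server once; after that,
# rsync/ssh to it will no longer prompt for a password:
#   ssh-copy-id -i ~/.ssh/id_ed25519.pub debian
```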

Enhance dired-rsync with compilation mode

It's such a great library that makes my life much easier. It could be improved further to provide an even better user experience - for example, keeping the process buffer alive as a log after the copying has finished, because the user might want to have a look later.

At the moment, there's no easy way of changing the arguments sent to rsync. I might want to test a dry run (adding the -n argument) so I can see exactly which files are going to be copied before running, or I might need to exclude certain files/folders, or rerun the copy if new files have been generated on RL.

If you have used a compilation buffer before, you know where I am going. That's right, I am thinking of turning the rsync process buffer into compilation mode; it would then inherit these two features:

  1. Press g to rerun the rsync command when I know there are new files generated on the RL
  2. Press C-u g (g with prefix) to change the rsync arguments before running it for dry-run, inclusion or exclusion

I don't have much experience in elisp, but I had a quick look at the source code and it seems there's no easy way of implementing this idea, so it's something to add to my ever-growing Emacs wish-list.

In fact, the limitation comes from using lower-level elisp functions. The Emacs Lisp manual on Process Buffers states that

Many applications of processes also use the buffer for editing input to be sent to the process, but this is not built into Emacs Lisp.

What a pity. For now I simply enjoy using it and keep looking for opportunities to use it more.

Move Between Windows in Emacs using windmove


Started Seeing

The good thing about Emacs is that you can always tweak it to suit your needs. For years I’ve been doing it for productivity reasons. Now for the first time, I’m doing it for health reasons.

Life can be sht sometimes. When I was in my mid-20s, I was reshaping every aspect of my life for good, but the optician told me my vision could only get worse. I wasn't paying much attention, busy with my first job and learning.

Last month, I was told my right eye's vision had got a whole point worse, whatever that means. Now I'm wearing a new pair of glasses, seeing the world in 4K with both eyes, noticing so many details. It makes the world vibrant and exciting. It comes with a price though: my eyes get tired quickly, and it has become easy to get annoyed by little things.

One of them is switching windows in Emacs. Even though I am still adjusting to the new glasses, I decided to take some action.

Ace-Window

Depending on the complexity of the task, I usually have about 4-8 windows laid out on my 32-inch monitor. If that's not enough, I add a second frame with a similar window layout, doubling the number of windows to 8-16.

So I found myself switching between windows all the time. The action itself is straightforward with ace-window.

The process can be broken down into five steps:

  1. Invoke the ace-window command by pressing the F2 key,
  2. The Emacs buffers are dimmed,
  3. A red number pops up at the top-left corner of each window,
  4. I press the number key to switch to the window it is associated with,
  5. After that, the content of each Emacs buffer is brought back.

This gif from the ace-window git repo demonstrates the process.

This approach depends on visual feedback - I have to look at the corner of the window to see the number. Also, the screen flashes twice during the process.

I tried removing the background dimming, increasing the font size of the number to make it easier to see, and a bunch of other tweaks.

In the end, my eyes were not satisfied.

Windmove

So I started looking for alternative approaches and found windmove which is built-in.

The idea is simple - keep moving to the adjacent window, left, right, up, or down, until I arrive at the window I want.

So it uses the relative location between windows instead of assigning each window a unique number and then using the number for switching.

Is it really better? Well, with this approach I use my eyes a lot less, as I do not have to look for the number. Plus, it feels more natural as I do not need to work out the directions; somehow I just know I need to move right twice or whatever to get to the destination.

The only issue I have had so far is a conflict with org-mode's calendar. I like the keybindings in org-mode, so I disabled windmove in org-mode's calendar with help from this Stack Overflow question.

The following five lines of code are all I need to use windmove.

(windmove-default-keybindings)
(define-key org-read-date-minibuffer-local-map (kbd "<left>") (lambda () (interactive) (org-eval-in-calendar '(calendar-backward-day 1))))
(define-key org-read-date-minibuffer-local-map (kbd "<right>") (lambda () (interactive) (org-eval-in-calendar '(calendar-forward-day 1))))
(define-key org-read-date-minibuffer-local-map (kbd "<up>") (lambda () (interactive) (org-eval-in-calendar '(calendar-backward-week 1))))
(define-key org-read-date-minibuffer-local-map (kbd "<down>") (lambda () (interactive) (org-eval-in-calendar '(calendar-forward-week 1))))

I created a git branch for switching from ace-window to windmove. I will try it for a month before merging it into the master branch.

Back to where it started

After using it for a few days, I realised this is the very package I used for switching windows back in 2014 when I started learning Emacs. I later switched to ace-window because it looked pretty cool.

Life is changing, my perspectives are changing, and so is my Emacs configuration. This time, it goes back to where I started 8 years ago.

Wireless Backup Solution Using Raspberry Pi for MacOS

If you need automated backups for Time Machine and have a Raspberry Pi, you will find this post useful.

Motivation

After 3 months of using my brand-new MacBook Pro 14 M1 Pro, one of the USB-C ports stopped working. I will have to send it back; not sure what Apple will do with it, but I can't bear the risk of losing data. So I need a backup.

In fact, I need to back up regularly for situations like this, and that's why I worked on it.

Wireless Backup Solution

The easiest solution is to get a USB-C portable SSD, plug it into my laptop, open Time Machine to start the backup, do it once a week, and call it a day.

But I'm reluctant to add more devices to my already cluttered home lab. There are a few hard drives in my drawers; it would be good to utilise them.

So I decided to set up a Time Machine backup solution using my Raspberry Pi 4. The benefits are

  1. no additional costs, saving me about £50-£100
  2. no need to buy new stuff, so fewer things to take care of
  3. wireless backup, keeping my desk clean

Later I realised the benefits of a wireless backup are easy to overlook. It can back up anytime, anywhere in my house. Also, because of the convenience, I can have more granular backups - instead of a weekly backup, I have hourly backups without getting out the cables and hard drives. I do less but get more value out of it.

The only concern I had was the speed. It turns out that with the SMB 3 protocol, I get 55 MB/s write speed and 40 MB/s read speed from the laptop to the Raspberry Pi. So in theory, it would take around 2.5 hours to back up my 500 GB laptop. That might sound like a lot, but only the first backup takes that long; the subsequent incremental backups are much smaller and faster - for example, just now Time Machine completed a new backup within 3 minutes in the background without my noticing.
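The 2.5-hour figure is just throughput arithmetic, easy to reproduce as a one-liner (using decimal units, 500 GB = 500,000 MB):

```shell
# Time for the initial 500 GB backup at 55 MB/s write speed, in hours
awk 'BEGIN { printf "%.1f hours\n", 500 * 1000 / 55 / 3600 }'
# -> 2.5 hours
```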

A portable USB-C SSD could finish the backup within minutes, but it's overkill for an ordinary user like me, and it's inconvenient.

So I'm satisfied with the current solution.

Set Up Raspberry Pi

I read a few guides on setting up a Raspberry Pi for Time Machine, and I found this guide the most accurate and useful.

One thing I noticed is that AFP (Apple Filing Protocol) is deprecated, so make sure you use SAMBA as the protocol.

Additionally, I followed this Stack Overflow answer to auto-mount the SAMBA share, so that every time I reboot my laptop, Time Machine is ready to back up.

Time Machine Backup frequency

By default, Time Machine does hourly backup.

If you feel an hourly backup is not necessary, you can change it by updating this file:

/System/Library/LaunchDaemons/com.apple.backupd-helper.plist

for example, to change the frequency from hourly to daily backups, change the interval value from 3600 to 86400 (the values are in seconds).
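As a sketch of that edit, practised on a throwaway copy first (the real file lives under /System, which is protected by SIP on modern macOS, so take care; the Interval key fragment below is an assumption based on the description above, not the full plist):

```shell
# Work on a local copy of a minimal plist fragment, not the real file
plist=$(mktemp)
printf '<key>Interval</key>\n<integer>3600</integer>\n' > "$plist"

# hourly (3600 s) -> daily (86400 s); sed keeps a .bak backup
sed -i.bak 's|<integer>3600</integer>|<integer>86400</integer>|' "$plist"
grep -c 86400 "$plist"
# -> 1
```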

In the end, I left it with the default, so it does many small backups hourly instead of one big backup daily.

Backup for Backups

After a couple of hours of work, I managed to get a wireless backup solution for my laptop, so I won't have to worry about data loss. Plus, I can time-travel through files at hourly intervals.

One concern that occurred to me was that the backups sit on a single hard drive. If that hard drive died, I would lose all my backups.

To solve that problem, I would have to go down the rabbit hole of doing backups for backups: back up to a remote location or the cloud, or set up a Raspberry Pi RAID.

At the moment, I'm not very concerned - I have Apple iCloud backing up my photos, videos, notes, etc., and I use GitHub to host my org files and code. So a backup for backups is not necessary for me for now.

Managing Emacs Server as Systemd Service

Using Emacs Server Without Systemd

I live in Emacs entirely, apart from using a browser for googling. Having an Emacs server running in the background makes Emacs available all the time, so I won't worry about closing it accidentally.

It is not hard to do that, just run

emacs --daemon

in the command line to start the Emacs server. It will load the user configuration file as usual. Then run

emacsclient -c & 

to open an Emacs GUI instance that uses the Emacs server. That's how I have been doing it for a while.

A better approach is using systemd, the service manager of Linux. Whenever my Debian 11 laptop boots up, systemd starts a bunch of services in parallel; for example, NetworkManager connects the Wi-Fi and Bluetooth connects the wireless keyboard, so everything is ready after I log in. And I want Emacs to be ready as well.

I could achieve that by simply having a shell script run automatically after login, but there are benefits to using systemd: it has a bunch of sub-commands for managing services, for example checking logs, status, etc.

It's a nice tool to have; I can use it for other things too, for example a Jupyter Notebook server.

That's why I pulled the trigger and spent 2 hours implementing and testing it. Here's the technical bit.

How to Implement As Systemd Service

In order to use systemd to manage the Emacs server, I first need a configuration file (called a unit file). The Debian Wiki provides a short description of the syntax and parameters of unit files.

I found a simple one on the Emacs Wiki. It looks like this:

[Unit]
Description=Emacs text editor
Documentation=info:emacs man:emacs(1) https://gnu.org/software/emacs/

[Service]
Type=forking
ExecStart=/usr/bin/emacs --daemon
ExecStop=/usr/bin/emacsclient --eval "(kill-emacs)"
Environment=SSH_AUTH_SOCK=%t/keyring/ssh
Restart=on-failure

[Install]
WantedBy=default.target

The important parameters are:

ExecStart
It tells systemd what to do when starting the Emacs service; in this case it runs the /usr/bin/emacs --daemon command.

ExecStop
It tells systemd what to do when shutting down the Emacs service; in this case it runs the /usr/bin/emacsclient --eval "(kill-emacs)" command.

If you are using an Emacs installed in a different directory, you have to change /usr/bin/emacs to wherever your Emacs is located.

Then save the configuration file as ~/.config/systemd/user/emacs.service.

After that run

systemctl enable --user emacs

so that systemd registers the configuration file and will start the Emacs service at boot time.

To start the Emacs service right now, use

systemctl start --user emacs

This is what I see in my console when checking the status with systemctl status --user emacs:

emacs.service - Emacs text editor
Loaded: loaded (/home/yitang/.config/systemd/user/emacs.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2021-06-14 09:12:26 BST; 24h ago
Docs: info:emacs
man:emacs(1)
https://gnu.org/software/emacs/
Main PID: 5222 (emacs)
Tasks: 5 (limit: 19027)
Memory: 154.7M
CPU: 3min 25.049s
CGroup: /user.slice/user-1000.slice/user@1000.service/app.slice/emacs.service
├─ 5222 /usr/bin/emacs --daemon
└─16086 /usr/bin/aspell -a -m -d en_GB -p /home/yitang/git/.emacs.d/local/ispell-dict --encoding=utf-8

Jun 14 09:11:57 7270 emacs[5222]: No event to add
Jun 14 09:11:57 7270 emacs[5222]: Package dash-functional is obsolete; use dash 2.18.0 instead
Jun 14 09:12:01 7270 emacs[5222]: Loading /home/yitang/git/.emacs.d/config/org-mode.el (source)...done
Jun 14 09:12:01 7270 emacs[5222]: Loading /home/yitang/git/.emacs.d/config/refile.el (source)...
Jun 14 09:12:01 7270 emacs[5222]: Loading /home/yitang/git/.emacs.d/config/refile.el (source)...done
Jun 14 09:12:01 7270 emacs[5222]: Loading /home/yitang/git/.emacs.d/config/scripting.el (source)...
Jun 14 09:12:26 7270 emacs[5222]: Loading /home/yitang/git/.emacs.d/config/scripting.el (source)...done
Jun 14 09:12:26 7270 emacs[5222]: Loading /home/yitang/git/.emacs.d/load_config.el (source)...done
Jun 14 09:12:26 7270 emacs[5222]: Starting Emacs daemon.
Jun 14 09:12:26 7270 systemd[4589]: Started Emacs text editor.

Enhance User Experience

So far, I have the following two tweaks to make using systemd more pleasant.

sudo Privilege

The Emacs server is started under my own account, so it doesn't have sudo privileges. In order to edit files that require sudo permission, simply open the file in Emacs, or from the command line with

emacsclient -c FILENAME

then type M-x sudo inside Emacs and enter the sudo password. If the password is correct, I can edit and save the file as the sudo user.

Environment Variables

The customised shell configuration in .bashrc is loaded when opening an interactive shell session. So the Emacs server managed by systemd does not have the environment variables, aliases, functions, or whatever else is defined in .bashrc.

This Stack Overflow post provides the rationale and shows how to tweak the unit file so that systemd loads .bashrc.

This problem can be solved a lot more easily on the Emacs side, using the exec-path-from-shell package. It ensures the environment variables inside Emacs are the same as in the user's interactive shell.

Simply putting the following in your .emacs will do the trick.

(exec-path-from-shell-initialize)

Start Emacs Server Before Login?

The systemd services under my account only start after I log in. Because I have tons of Emacs configuration, I still have to wait a few seconds before the Emacs server is ready. So it would be awesome to have the Emacs server start loading before I log in.

This doesn't seem simple to implement because, technically, it would require the Emacs service to be defined at the system level, yet it would load files from my personal home directory without me being logged in. That might still be okay since I'm the sole user of my laptop, but I would have to tweak the permissions and would probably end up with an insecure permission setup.

So I leave this idea here.

Kaggle Avito Demand Prediction Challenge - 22nd Solution

The Avito Demand Prediction Challenge asks Kagglers to predict the "demand" likelihood of an advertisement. If a listed second-hand iPhone 6 is selling for £20,000, then the "demand" is likely to be very low. This is my first competition building a model using tabular data, text, and images.

I teamed up with Rashmi, Abhimanyu, Yiang, and Samrat, and we finished 22nd among 1,917 teams. So far, I have four silver medals and my rank is 542 among 83,588 Kagglers.

This was an interesting competition for me. I was about to quit the competition, and Kaggle, because of other commitments in life and work. Just one day before the team merge deadline, Rashmi asked me to join; at that time my position was 880th, around the top 50%, while Rashmi's team was about 82nd. So I decided to join and finish this competition, on which I had already spent many hours.

Final Ensemble Models

As part of this team, I worked on the final ensemble models. Immediately after joining, I completed 5 tasks:

  1. Make sure everyone uses the same agreed cross-validation schema. This is essential for building an ensemble model.
  2. Provide a model_zoo.md document to keep track of all level-1 models, their train/valid/LB scores, the features used, and the file paths to their OOF/test predictions.
  3. Write merge_oof.py to combine all OOF/test predictions together.
  4. Write R scripts for the glmnet ensemble.
  5. Write Python scripts for the LightGBM ensemble.

Once a new model is built, the other team members update model_zoo.md and upload the data to a private GitHub repo. Then I update merge_oof.py to include the new model's results and run the glmnet and LightGBM ensembles. We automated this ensemble workflow, so it takes little effort to see the ensemble model's performance.

I spent some time analysing the coefficients/weights of the L1 models and tried excluding models with negative or low weights. To my surprise, it didn't help at all. The final submission is a glmnet ensemble of 41 models (LGB + XGB + NN).

Also, the LightGBM ensemble has a much better CV score but a worse LB score. I suspect there is leakage in some L1 models, and glmnet is more robust to leakage since it's a linear model. Unfortunately, there was not enough time to identify which models had leakage.

Collaboration

This is my 2nd time working in a team. Although there's a lot of room for improvement in our collaboration compared with a professional data science team, for a nights-and-weekends project we did a really good job as a team.

The setup for collaboration:

  1. Slack for discussion. We have channels for general, final_ensemble, random for cat photos, etc.
  2. We also used Slack for sharing features, which I personally don't like.
  3. A private GitHub repo for sharing code and OOF/test predictions.
  4. Monday.com for managing tasks. It gives a nice overview of what everyone is up to.

We tried very hard to get a gold, but other teams worked even harder. At one point we were 17th, but we finished 22nd.

Some Kagglers to Avoid

Finally, while we waited out the last hour before the final deadline, we had a lovely discussion about our past disqualification experiences. We were all shocked to find that, while on different teams in the Toxic competition, we had each teamed up with the same person. We shared that person's multiple Kaggle accounts and added them to our personal block-lists.

If you have any questions or comments, please post them below. If you liked this post, you can share it with your followers or follow me on Twitter!