Yi Tang - Data Scientist with Emacs

Managing Emacs Server as Systemd Service

Using Emacs Server Without Systemd

I live in Emacs entirely, apart from using a browser for googling. Having an Emacs server running in the background makes Emacs available all the time, so I don't have to worry about closing it accidentally.

It is not hard to do that: just run

emacs --daemon

on the command line to start the Emacs server. It will load the user configuration file as usual. Then run

emacsclient -c & 

to open an Emacs GUI frame that connects to the Emacs server. That's how I have been doing it for a while.

A better approach is to use systemd, the service manager of Linux. Whenever my Debian 11 laptop boots up, systemd starts a bunch of services in parallel: for example, NetworkManager connects to the WiFi and Bluetooth connects the wireless keyboard, so everything is ready after I log in. I want Emacs to be ready as well.

I could achieve that by simply having a shell script run automatically after login, but there are benefits to using systemd: it has a bunch of sub-commands for managing services, for example checking logs, status and so on, as shown below.
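
For example, once the Emacs service described below is set up, day-to-day management is just a few standard sub-commands (user-level versions shown here):

systemctl status --user emacs    # is the service running?
journalctl --user -u emacs       # read the service's log output
systemctl restart --user emacs   # restart it after changing the configuration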

It's a nice tool to have; I can use it to manage other services too, for example a Jupyter Notebook server.

That's why I pulled the trigger and spent two hours implementing and testing it. Here are the technical bits.

How to Implement It as a Systemd Service

In order to use systemd to manage the Emacs server, I first need a configuration file (called a unit file). The Debian Wiki provides a short description of the syntax and parameters of unit files.

I found a simple one on the EmacsWiki. It looks like this:

[Unit]
Description=Emacs text editor
Documentation=info:emacs man:emacs(1) https://gnu.org/software/emacs/

[Service]
Type=forking
ExecStart=/usr/bin/emacs --daemon
ExecStop=/usr/bin/emacsclient --eval "(kill-emacs)"
Environment=SSH_AUTH_SOCK=%t/keyring/ssh
Restart=on-failure

[Install]
WantedBy=default.target

The important parameters are

ExecStart
tells systemd what to do when starting the Emacs service; in this case it runs the /usr/bin/emacs --daemon command.

ExecStop
tells systemd what to do when shutting down the Emacs service; in this case it runs the /usr/bin/emacsclient --eval "(kill-emacs)" command.

If you are using an Emacs installed in a different location, you have to change /usr/bin/emacs (and /usr/bin/emacsclient) to wherever your Emacs is located.
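
If you are not sure where the binaries live, which will tell you (the paths in the comments are only examples):

which emacs         # e.g. /usr/local/bin/emacs
which emacsclient   # e.g. /usr/local/bin/emacsclient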

Then save the configuration file as ~/.config/systemd/user/emacs.service.

After that run

systemctl enable --user emacs

so that systemd registers the unit (for user services it symlinks the file under ~/.config/systemd/user/default.target.wants/) and will start the Emacs service automatically when I log in.

To start the Emacs service right now, use

systemctl start --user emacs

This is what I see in my console when I check the service status (systemctl status --user emacs):

emacs.service - Emacs text editor
Loaded: loaded (/home/yitang/.config/systemd/user/emacs.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2021-06-14 09:12:26 BST; 24h ago
Docs: info:emacs
man:emacs(1)
https://gnu.org/software/emacs/
Main PID: 5222 (emacs)
Tasks: 5 (limit: 19027)
Memory: 154.7M
CPU: 3min 25.049s
CGroup: /user.slice/user-1000.slice/user@1000.service/app.slice/emacs.service
├─ 5222 /usr/bin/emacs --daemon
└─16086 /usr/bin/aspell -a -m -d en_GB -p /home/yitang/git/.emacs.d/local/ispell-dict --encoding=utf-8

Jun 14 09:11:57 7270 emacs[5222]: No event to add
Jun 14 09:11:57 7270 emacs[5222]: Package dash-functional is obsolete; use dash 2.18.0 instead
Jun 14 09:12:01 7270 emacs[5222]: Loading /home/yitang/git/.emacs.d/config/org-mode.el (source)...done
Jun 14 09:12:01 7270 emacs[5222]: Loading /home/yitang/git/.emacs.d/config/refile.el (source)...
Jun 14 09:12:01 7270 emacs[5222]: Loading /home/yitang/git/.emacs.d/config/refile.el (source)...done
Jun 14 09:12:01 7270 emacs[5222]: Loading /home/yitang/git/.emacs.d/config/scripting.el (source)...
Jun 14 09:12:26 7270 emacs[5222]: Loading /home/yitang/git/.emacs.d/config/scripting.el (source)...done
Jun 14 09:12:26 7270 emacs[5222]: Loading /home/yitang/git/.emacs.d/load_config.el (source)...done
Jun 14 09:12:26 7270 emacs[5222]: Starting Emacs daemon.
Jun 14 09:12:26 7270 systemd[4589]: Started Emacs text editor.

Enhance User Experience

So far, I have the following two tweaks to make using systemd more pleasant.

sudo Privilege

The Emacs server is started under my own account, so it doesn't have sudo privileges. In order to edit files that require sudo permission, simply open the file in Emacs, or from the command line with

emacsclient -c FILENAME

then type M-x sudo inside Emacs and enter the sudo password. If the password is correct, I can edit and save the file as the sudo user.
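
(M-x sudo is a small helper command rather than something built into Emacs; a minimal sketch of such a command, using TRAMP's sudo method to re-open the current file as root, could look like the following.)

(defun sudo ()
  "Re-open the current file as root via TRAMP's sudo method."
  (interactive)
  (when buffer-file-name
    (find-alternate-file (concat "/sudo:root@localhost:" buffer-file-name))))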

Environment Variables

The customised shell configuration in .bashrc is only loaded when opening an interactive shell session, so the Emacs server managed by systemd will not have the environment variables, aliases, functions or whatever else is defined in .bashrc.

This Stack Overflow post explains the rationale and how to tweak the unit file so that systemd loads .bashrc.

This problem can be solved much more easily on the Emacs side, using the exec-path-from-shell package. It ensures that the environment variables inside Emacs are the same as in the user's interactive shell.

Simply putting the following in your .emacs does the trick.

(exec-path-from-shell-initialize)
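
A slightly more defensive variant (assuming the package is installed, for example from MELPA) only does the work when Emacs runs as a daemon, and copies extra variables explicitly:

(require 'exec-path-from-shell)
(when (daemonp)
  (exec-path-from-shell-initialize)
  ;; copy any extra variables the service would otherwise miss
  (exec-path-from-shell-copy-env "SSH_AUTH_SOCK"))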

Start Emacs Server Before Login?

The systemd services under my account only start after I log in. Because I have tons of Emacs configuration, I still have to wait a few seconds before the Emacs server is ready. So it would be awesome to have the Emacs server start loading before I log in.

This doesn't seem simple to implement: technically, it would require the Emacs service to be defined at the system level, yet it would load files in my personal home directory without me being logged in. That might still be okay since I'm the sole user of my laptop, but I would have to tweak the permissions and would probably end up with an insecure permission setting.

So I leave this idea here.

Kaggle Avito Demand Prediction Challenge - 22nd Solution

The Avito Demand Prediction Challenge asks Kagglers to predict the "demand" likelihood of an advertisement. If a listed second-hand iPhone 6 is priced at £20,000, then the "demand" is likely to be very low. This is my first competition building models with tabular data, text, and images.

I teamed up with Rashmi, Abhimanyu, Yiang and Samrat, and we finished 22nd among 1,917 teams. So far, I have four silver medals and my rank is 542 among 83,588 Kagglers.

This was an interesting competition for me. I was about to quit the competition, and Kaggle, because of other commitments in life and work. Just one day before the team merge deadline, Rashmi asked me to join; at that time my position was around 880th (roughly the top 50%) and Rashmi's team was around 82nd. So I decided to join and finish this competition, on which I had already spent many hours.

Final Ensemble Models

As part of this team, I worked on the final ensemble models. Immediately after joining, I completed five tasks:

  1. make sure everyone uses the same agreed cross-validation schema. This is essential for building an ensemble model.
  2. provide a model_zoo.md document to keep track of all level-1 models, their train/valid/LB scores, features used, and the file paths to their oof/test predictions.
  3. write merge_oof.py to combine all oof/test predictions together.
  4. write R scripts for the glmnet ensemble.
  5. write Python scripts for the LightGBM ensemble.

Once a new model is built, the other team members update model_zoo.md and upload the data to a private GitHub repo. Then I update merge_oof.py to include the new model's results, and run the glmnet and LightGBM ensembles. We had this ensemble workflow automated, so it takes little effort to see the ensemble model's performance.

I spent some time analysing the coefficients/weights of the L1 models and tried excluding models with negative or low weights. To my surprise it didn't help at all. The final submission is a glmnet ensemble of 41 models (lgb + xgb + NN).

Also, the LightGBM ensemble had a much better CV score but a worse LB score. I suspect there is leakage in some L1 models, and glmnet is more robust to leakage since it's a linear model. Unfortunately, there wasn't enough time to identify which models leaked.

Collaboration

This was my second time working in a team. There is still a lot of room for improvement in how we collaborated compared with a professional data science team, but as a nights-and-weekends project we did a really good job as a team.

The setup for collaboration:

  1. Slack for discussion. We had channels for general, final_ensemble, random for cat photos, etc.
  2. We also used Slack for sharing features, which I personally don't like.
  3. A private GitHub repo for sharing code and oof/test predictions.
  4. Monday.com for managing tasks. It gives a nice overview of what everyone is up to.

We tried very hard to get a gold, but other teams worked even harder. At one point we were 17th, and we finished 22nd.

Some Kagglers to Avoid

Finally, while we waited out the last hour before the final deadline, we had a lovely discussion about our past disqualification experiences. We were all shocked to discover that, in the Toxic competition, we had been on different teams but had each teamed up with the same person. We shared that person's multiple Kaggle accounts and added them to our personal block-lists.

Build Notification Features

Data processing takes longer and longer as data volumes grow. Users may check in frequently to see whether it has finished. Most of the time they will find it hasn't, and every check forces a context switch that breaks the flow of whatever they were doing.

Sometimes the user can't stop doing so, either because they are impatient or because they really have a deadline to catch. Also, less often, they might find errors in the processing, either because a QC check fails or because the job runs out of computational resources.

Given this, it makes a lot of sense to have your program actively inform the user of progress so that they don't need to check in at all: users are notified immediately when the whole process completes, or when there's an error they need to act on.

This blog post walks through the basics of sending emails in Python: composing and sending out emails, with each component broken down into small pieces. That makes it easier to debug and test your email program and to personalise your emails. In the end, you should be able to build an email robot.

Prerequisite

Before going into the technical details, you have to check that you are able to send out emails. You need the

  • [ ] SMTP server,
  • [ ] user name,
  • [ ] password,
  • [ ] port number, and
  • [ ] communication protocol.

You can easily find this information from your email service provider; Gmail, for example, documents it in its help pages.

To check that you have all the information correct, run the following snippet. It will try to send an empty email to yourself. Make sure you fill in the username and password before hitting go.

  import smtplib
  username = <Fill In>
  password = <Fill In>
  conn = smtplib.SMTP("smtp.gmail.com", 587)  # your provider's SMTP host and STARTTLS port
  conn.starttls()  # set connection to TLS mode
  conn.login(username, password,)  # Log in to the remote server
  conn.sendmail(username, [username], 'For testing')  # Send emails
  conn.quit()  # close connection.
  

If you don't see error messages, that's great: you are all set. There should be an email in your Inbox, with no subject and "For testing" in the main body.

If you do, you need to double-check the information and try again. If you are sure the information is correct but still can't send, check your network configuration; maybe a firewall is blocking the connection.
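
While debugging, it can also help to see the full SMTP conversation; smtplib can print it via set_debuglevel (same hypothetical Gmail settings as above):

  import smtplib
  conn = smtplib.SMTP("smtp.gmail.com", 587)
  conn.set_debuglevel(1)  # echo the SMTP dialogue to stderr
  conn.starttls()
  conn.quit()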

Once you are able to send an empty email, the next step is to compose a full email.

Compose An Email in Python

An email consists of multiple parts: the subject, body, attachments, signature, and also some metadata including from, to, and date.

First, create an object of the MIMEMultipart class. It will be the building block of your email; the approach described here is to add each component to it.

Email meta data

Start with meta data.

   from email.mime.multipart import MIMEMultipart
   from email.utils import formatdate

   msg = MIMEMultipart()
   msg['From'] = 'your_email_address@somewhere.com'
   msg['To'] = 'email_friend_#28473@somewhere.com, email_friend_#122xs2212@somewhere.com'
   msg['Subject'] = 'Hey'
   msg['Date'] = formatdate(localtime=True)  # standard.
   

Email body

For plain text, simply add

   from email.mime.text import MIMEText

   body_txt = '''
   Hello,

   Just to tell you Python is awesome.
   '''
   plain_body = MIMEText(body_txt, "plain")
   msg.attach(plain_body)
   

You could also compose a more complex HTML email in Python, or use a different tool to generate the HTML and import it in Python.

   html_txt = '''\
   <html>
   <head></head>
   <body>
   <p>Hi!<br>
   How are you?<br>
   Here is the <a href="http://www.python.org">link</a> you wanted.
   </p>
   </body>
   </html>
   '''
   html_body = MIMEText(html_txt, 'html')
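
If you want to send both the plain-text and the HTML version of the body, the usual approach is a multipart/alternative container, so the mail client picks whichever part it can render. A small sketch reusing msg, plain_body and html_body from above:

   from email.mime.multipart import MIMEMultipart

   alternative = MIMEMultipart('alternative')
   alternative.attach(plain_body)  # fallback for clients that can't show HTML
   alternative.attach(html_body)   # preferred rich version
   msg.attach(alternative)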

   

Attachment

You can attach files of any type to an email. For specific types there are dedicated classes: for images you could use MIMEImage, and for audio MIMEAudio.

But you don't have to be specific: MIMEApplication is sufficient for all cases, sending the file as a generic binary attachment. Use it as follows:

 from os.path import basename
 from email.mime.application import MIMEApplication

 with open(fpath, 'rb') as fp:
     part = MIMEApplication(fp.read(), Name=basename(fpath))  # raw file content
     part['Content-Disposition'] = 'attachment; filename="%s"' % basename(fpath)  # attachment description.
     msg.attach(part)  # attach to the msg.
 

Put Everything Together

   from os.path import basename
   import smtplib
   from email.mime.application import MIMEApplication
   from email.mime.multipart import MIMEMultipart
   from email.mime.text import MIMEText
   from email.utils import formatdate

   # Configure your email
   username = <Fill in>
   password = <Fill in>
   recipients = <Fill in>  # a list of recipient addresses
   subject = 'Hey'
   attachments = []  # add attachment here.
   body_txt = '''
      Hello,

      Just to tell you Python is awesome.
      '''

   # Email - meta data
   msg = MIMEMultipart()
   msg['From'] = username
   msg['To'] = ', '.join(recipients)
   msg['Subject'] = subject
   msg['Date'] = formatdate(localtime=True)  # standard.

   # Email - main body
   plain_body = MIMEText(body_txt, "plain")
   msg.attach(plain_body)

   # attachments
   for fpath in attachments or []:
       with open(fpath, 'rb') as fp:
           part = MIMEApplication(fp.read(), Name=basename(fpath))  # raw file content
           part['Content-Disposition'] = 'attachment; filename="%s"' % basename(fpath)  # attachment description.
           msg.attach(part)  # attach to the msg.


   # send out email
   conn = smtplib.SMTP("smtp.gmail.com", 587)  # your provider's SMTP host and STARTTLS port
   conn.starttls()  # set connection to TLS mode
   conn.login(username, password)  # Log in to the remote server
   conn.sendmail(username, recipients, msg.as_string())  # send to the recipients
   conn.quit()  # close connection.

   

Wrap Everything in a Class

At JBARML, we are planning to send users emails with progress updates. Many of the data-generating processes are glued together and automated by a workflow manager, and the whole program can take up to weeks to complete. Actively sending progress updates to the user is much more sensible than the user logging in to a remote machine and checking it now and then.

In this case, it makes sense to wrap the snippets above into a small class that the workflow manager can call to email the user about the progress of the data-generating workflow.
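
A minimal sketch of such a class, built from the snippets above (the name EmailNotifier and the Gmail settings are illustrative, not our exact production code):

   from email.mime.multipart import MIMEMultipart
   from email.mime.text import MIMEText
   from email.utils import formatdate
   import smtplib


   class EmailNotifier:
       """Send short progress-update emails from a long-running workflow."""

       def __init__(self, username, password, recipients,
                    host="smtp.gmail.com", port=587):
           self.username = username
           self.password = password
           self.recipients = recipients
           self.host = host
           self.port = port

       def notify(self, subject, body):
           # Compose the message.
           msg = MIMEMultipart()
           msg['From'] = self.username
           msg['To'] = ', '.join(self.recipients)
           msg['Subject'] = subject
           msg['Date'] = formatdate(localtime=True)
           msg.attach(MIMEText(body, "plain"))

           # Send it and close the connection.
           conn = smtplib.SMTP(self.host, self.port)
           conn.starttls()
           conn.login(self.username, self.password)
           conn.sendmail(self.username, self.recipients, msg.as_string())
           conn.quit()


   # Example usage inside a workflow:
   # notifier = EmailNotifier('me@example.com', 'app-password', ['user@example.com'])
   # notifier.notify('Step 3/10 finished', 'QC checks passed, moving on to step 4.')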

etags - Build a TAG for Multiple R Packages

Here is what I tried in order to build a TAGS file for multiple R packages. It enables me to jump to the location where a function/variable is defined and modify it if I want to.

Useful variables and functions

ess-r-package-library-path
default paths in which to find packages; should be a list.
ess-r-package-root-file
if the folder has a DESCRIPTION file, then the folder is an R package.
(ess-build-tags-for-directory DIR TAGFILE)
build tags for DIR and write them to TAGFILE.
tags-table-list
list of file names of tags tables to search.
(visit-tags-table FILE &optional LOCAL)
tell tags commands to use the tags table file FILE.
;; new variable 
(defvar ess-r-package-library-tags nil
  "A TAG file for multiple R packages.")

(setq ess-r-package-library-path '("~/tmp/feather/R" "~/tmp/RPostgres/"))
(setq ess-r-package-library-tags "~/tmp/all_tags")

(dolist (pkg-path ess-r-package-library-path)
  (let ((pkg-name (ess-r-package--find-package-name pkg-path)))
    (unless (and pkg-name pkg-path
                 (file-exists-p (expand-file-name ess-r-package-root-file pkg-path)))
      (error "Not a valid package. No '%s' found in `%s'." ess-r-package-root-file pkg-path))
    (ess-build-tags-for-directory pkg-path ess-r-package-library-tags)
    ))

Note the workhorse is ess-build-tags-for-directory, which does what its name suggests. The core of this function uses the find and etags programs: find locates files with extensions .cpp, .R, .nw and so on, and pipes the list to etags, which generates a TAGS table. These two steps are demonstrated in the following snippet, which is adapted from the source code of ess-build-tags-for-directory.

(setq find-cmd (format "find %s -type f -size 1M \\( -regex \".*\\.\\(cpp\\|jl\\|[RsrSch]\\(nw\\)?\\)$\" \\)" (car ess-r-package-library-path)))

(setq regs (delq nil (mapcar (lambda (l)
                               (if (string-match "'" (cadr l))
                                   nil ;; remove for time being
                                 (format "/%s/\\%d/"
                                         (replace-regexp-in-string "/" "\\/" (nth 1 l) t)
                                         (nth 2 l))))
                             imenu-generic-expression)))
(setq tags-cmd (format "etags -o %s --regex='%s' -" "~/lala"
                       (mapconcat 'identity regs "' --regex='")))

(setq sh-cmd (format "%s | %s" find-cmd tags-cmd))
(shell-command sh-cmd)

Note that when these are used in Emacs, the path to the new TAGS table is appended to the tags-table-list variable, so the user can use xref-find-definitions (M-.) to jump (if the point is on a symbol) or select which function/variable to jump to. The user can then inspect the function/variable definition, or modify it if necessary, and call xref-pop-marker-stack (M-,) to jump back.
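
Registering the generated table can be as simple as the following (using the ess-r-package-library-tags variable defined earlier):

(add-to-list 'tags-table-list (expand-file-name ess-r-package-library-tags))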

Compare RPostgres and RPostgreSQL Package

R is a great language for R&D: it's fast for writing prototypes and has great visualisation tools. One of the constraints of R is that it stores data in system memory. When the data becomes too big to fit in memory, the user has to manually split the dataset and then aggregate the output later. This process is inefficient and error-prone for a non-technical user.

I started an R development project to automate this split-aggregate process. A viable solution is to store the whole dataset in PostgreSQL and let R fetch one small chunk of the data at a time, do the calculation, and then save the output back to PostgreSQL. This solution requires frequent data transfer between the two systems, which could be a performance bottleneck, so I compared two R packages that interface R and PostgreSQL.
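
As an illustration of the chunked pattern (a sketch only: the table and column names are made up, and a second connection is used for writing so the open result set is not interrupted), the DBI interface that both packages implement can fetch a result in pieces:

library(DBI)

# two connections: one streams the input, the other writes the output
con.read  <- dbConnect(RPostgres::Postgres(), dbname = "mydb")
con.write <- dbConnect(RPostgres::Postgres(), dbname = "mydb")

res <- dbSendQuery(con.read, "SELECT * FROM big_table")
while (!dbHasCompleted(res)) {
    chunk <- dbFetch(res, n = 10000)           # pull 10,000 rows at a time
    out <- transform(chunk, score = y * 2)     # placeholder calculation
    dbWriteTable(con.write, "output", out, append = TRUE)
}
dbClearResult(res)
dbDisconnect(con.read)
dbDisconnect(con.write)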

RPostgreSQL
was supported and developed in the Google Summer of Code 2008 program. It is not under active development; the last release was in 2013.
RPostgres
is a newer package which provides similar functionality to RPostgreSQL but is rewritten in C++ using Rcpp. The development is led by Kirill Müller.

Based on my testing, the RPostgres package is about 30% faster than RPostgreSQL.

The testing set-up is quite simple: an R script sends data to, and gets data out of, a remote PostgreSQL database, and logs how long each task takes to complete in R. To reduce the influence of other factors on speed, it repeats this process 20 times and uses the minimal run-time as the final score. The dataset transferred between R and PostgreSQL is a flat table with three columns, and the number of rows varies from ten thousand to one million.

The run-times in seconds are plotted against the number of rows for each package and operation.

[Figure: run time in seconds against number of rows, by package and operation]

Here is a summary of what I observed:

  1. RPostgreSQL is slower than RPostgres. For getting data out it's 75% slower, which is massive! For writing, the difference is closer, about 20%. Combining both scores, it is about 33% slower.
  2. In particular, RPostgreSQL is slower to read than to write (a ratio of about 1.5), whereas RPostgres is quicker to read than to write (a ratio of about 0.8). This is an interesting observation.
  3. Both packages have a nice feature: the reading/writing time depends linearly on the number of rows. This makes time estimation reliable; I would be confident to say that for 2 million rows, it takes the RPostgres package about 6 seconds to read.

I don't know which part of the implementation makes RPostgres faster. I guess it's the use of C++ and the magical Rcpp package.

Here is the script in case you want to run your own tests.

library(data.table)                     
library(ggplot2)
library(microbenchmark)
library(RPostgreSQL)
library(DBI)   
                                        # config for PostgreSQL database
host.name <- NULL
database.name <- NULL
postgres.user <- NULL
postgres.passwd <- NULL
postgres.port <- NULL
temporary.table.name <- NULL

                                        # config for testing
nrows <- seq(10 * 1e3, 1 * 1e6, length = 10)
repeats <- 20


                                        # open PostgreSQL connection
pg.RPostgreSQL <- dbConnect(dbDriver("PostgreSQL"),
                           host = host.name,
                           dbname = database.name,
                           user = postgres.user,
                           password = postgres.passwd,
                           port = postgres.port)
pg.RPostgres <- dbConnect(RPostgres::Postgres(),
                         host = host.name,
                         dbname = database.name,
                         user = postgres.user,
                         password = postgres.passwd,
                         port = postgres.port)

ReadWriteWrapper <- function(pg.connection) {
                                        # helper function 
    write <- function() dbWriteTable(pg.connection, temporary.table.name, dt, overwrite = TRUE)
    read <- function() dbReadTable(pg.connection, temporary.table.name)

    var <- list()
    for (n in nrows) {
                                        # create a dataset
        dt <- data.table(x = sample(LETTERS, n, T),  # character
                        y = rnorm(n), # double
                        z = sample.int(n, replace = TRUE)) # integer

                                        # read and write once first.
        write()
        read()

                                        # run and log run-time
        res <- microbenchmark(write(),
                             read(),
                             times = repeats)

                                        # parse 
        var[[as.character(n)]] <- data.table(num_row = n,
                                            operation = res$expr,
                                            time = res$time)
    }

                                        # aggregate and return
    rbindlist(var)
}

                                        # run
df0 <- ReadWriteWrapper(pg.RPostgres); df1 <- ReadWriteWrapper(pg.RPostgreSQL)
df0$package <- "RPostgres"; df1$package <- "RPostgreSQL"
df <- rbind(df0, df1)
plot.df <- df[, min(time) / 1e9, .(num_row, operation, package)]

## generate plot
plot.df[, operation := gsub("\\(|\\)", "", operation)]
ggplot(plot.df, aes(x=num_row, y=V1, col = package)) +
    geom_path() +
    geom_point() +
    facet_wrap(~operation) +
    theme_bw() +
    labs(x="Number of rows",
         y="Run time (sec)"
         )