Yi Tang Data Science and Emacs

Less Excel, More R/Python in Emacs

Table of Contents

  1. Excel Is Great
  2. But
  3. Emacs Has More To Offer

Excel Is Great

Regardless of how powerful and convenient the R/Python data ecosystem becomes, there is still value in looking at data in Excel, especially when exploring it together with less technical people.

Thanks to its intuitive interface, Excel is widely used in data analysis: hover the mouse to select columns, apply filters, then calculate some statistics. Most of the time that is all it takes to get the answers clients are seeking.

I recently realised that being transparent and working with the tools that clients use plays a crucial role in strengthening trust and delivering impact to the business. Sometimes I think I should do more in Excel.

But

The problem with Excel is reproducibility: I cannot codify the clicking done in Excel and integrate it into an automated data pipeline. It is rather foolish to build quality control procedures into the system, including code reviews, automated testing and CI, only to drop all those gatekeepers at the very end in favour of error-prone manual steps.

Plus, it is far more efficient to have everything done in one place: a smooth process with no friction is a key factor in enabling quick turnaround.

So I am motivated to limit Excel to two uses, delivering data to the business and pairing on data analysis, and once again I have been looking into how much can be done without leaving Emacs.

Emacs Has More To Offer

I was pleased to discover the ess-view-data package and its Python counterpart python-view-data. They interact with an active R/Python session in Emacs and print data.frame objects in plain text, a.k.a. view data. What’s more, they can process the data before viewing it, for example subsetting rows and columns or summarising the dataset.

The package keeps a record of the data processing pipeline, so at the end I have a copy of the R/Python code that generated the output. I can then effortlessly transfer that code to a script to ensure reproducibility in the future.
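
To make that concrete, here is a minimal sketch of the kind of pipeline code such a session might leave behind. The data frame df and its column names are hypothetical, and the exact code the package emits may differ:

    ## Illustrative only: a dplyr pipeline of the kind ess-view-data records.
    ## df and the column names are made up for this example.
    library(dplyr)

    df %>%
      filter(!is.na(revenue_raw)) %>%     ## keep rows with a raw value
      select(id, revenue_raw, revenue_smoothed) %>%
      summarise(mean_raw = mean(revenue_raw))

Dropping code like this into a script file is all it takes to make yesterday’s interactive exploration repeatable.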

Another benefit comes from having a plain text buffer for the data: it is handy when exploring large datasets with an excessive number of columns. For example, the dataset I work on daily has about 300 columns, containing different flavours of the financials: raw values, imputed, ranked, smoothed and so on.

It’s not possible to remember all the column names, even after spending time giving them meaningful names and making sure the correct columns are referred to. Having a persistent plain text buffer that I can search makes finding the right column a lot easier. It also helps me check what is, and is not, in the data.
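
The same check has a programmatic equivalent in R; a quick sketch, again with a hypothetical df and column names:

    ## List the columns whose names mention "revenue"
    grep("revenue", names(df), value = TRUE)

    ## Check directly whether a particular column is present
    "revenue_smoothed" %in% names(df)

Still, reaching for an incremental search in a buffer that is already on screen is usually faster.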

That’s my first impression of Shuguang Sun’s packages: they look promising.

If you have any questions or comments, please post them below. If you liked this post, you can share it with your followers or follow me on Twitter!