Yi Tang Data Scientist with Emacs

Import Irregular Data Files Into R With Regular Expression - an BODC Example

The first step in data analysis is to get the data into the modelling platform. But it may not be as straightforward as it used to be since nowadays statistician are more likely face the data files that are not in CSV or others format that can feed directly to the read.table() function in R, in which cases, we need to understand the data files in terms of the structure and apply pre-process first. My general strategy is to discard the unnecessary information in the data files and hopefully leave a regular data files.

In my last week's post, Why I Should Explore Regular Expression and Why I Haven't, I expressed my interests in Regular Expression and lucky I got a chance to use it for getting the data into R. It provides me a different strategy: pick only what I am interested in.

The Irregular Data Files

The task is simple: I have about 1,800 .text data files downloaded from British Oceanographic Data Centre (BODC). They are the historical tidal data and are separated by year and by port. I need to combine all the data into one giant table in R, and save it later for modelling.

One sample data file looks like this:

Port:              P035
Site:              Wick
Latitude:          58.44097
Longitude:         -3.08631
Start Date:        01JAN1985-00.00.00
End Date:          03OCT1985-19.00.00
Contributor:       National Oceanography Centre, Liverpool
Datum information: The data refer to Admiralty Chart Datum (ACD)
Parameter code:    ASLVZZ01 = Surface elevation (unspecified datum) of the water body                      
  Cycle    Date      Time      ASLVZZ01     Residual  
 Number yyyy mm dd hh mi ssf           f            f 
     1) 1985/01/01 00:00:00      1.0300      -0.3845  
     2) 1985/01/01 01:00:00      1.0400      -0.3884  
     3) 1985/01/01 02:00:00      1.2000      -0.3666

The first 9 lines are the metadata, which describes the port ID, name and location of the port, and other information about the data. The line 10 and 11 are the headers of the data matrix.

First Attempt - Skip Lines

After the glimpse of the data sample, my first thought was to skip the first 12 lines and treat the rest as a regular data files that has space as separator. It can be easily done by using read.table() with skip = 12 option.

read.table(data.file, skip = 12) ## error

It turned out this approach won't work for some files because when the way of measuring tidal were changed, the date and port were highlighted, leaving a second chunk of data matrix but again with metadata and few other characters. It looks like this:

;; end of first chunk 

########################################
 Difference in instrument
########################################

Port: P035
;; other metadata 

Second Attempt - Remove Lines

Although the first attempt isn't success, I've learnt a bit about the structure of the data files. And based on that, I came up with a second approach: read the data files into R as a vector of string, one element for a line, and then remove all the lines which are metadata. They start with Port:, Site: or Longitude: etc or the ### chunk. It can be done using grep function, which tells me exactly which element of the vector contains the metadata.

s <- readLines(data.file)
metainfo.list <- c("Port:", "Site:", "Latitude:", "Longitude:", "Start Date:", "End Date:", "Contributor:", "Datum information:", "Parameter code:")
meta.line.num <- sapply(metainfo.list, function(i) {
    grep(pattern = i, s)
})
res.2 <- s[-meta.line.num]

This approach works well as long as the metainfo.list contains all the lines I'd like to remove. The downside is that I won't able to know I've includes all of them until the whole process is finished. So when I was waiting for the program to finish, I came up with a third approach, a better one.

Third Attempt - Capture Lines (RegExp)

The above two approaches are to discard the unnecessary information, but I may be in the situation that there are other lines that should be discard but I haven't encounter yet, then the process becomes tedious try-error and takes quite long.

Equally, another approach is to select exactly what I am interested in by using regular expression. But first, I have to identify pattern. Each data point was recorded at a certain point, and therefore must be associated with a timestamp, for example, the first data point is recorded at 1926-01-01 00:00:00. They also has an ID values with an closing parentage's, for example 1.

1) 1985/01/01 00:00:00      1.0300      -0.3845  

So the content of my interests are have a common pattern that can be summarised as: the lines that start with a number of spaces, and also have

observation ID
few integers, and an ending parentheses,
observation date
few integers with forward slashes that means year, month and day, and then a space,
observation time
few integers with colons, means hour, minutes and seconds.

The patterns in RegExp can be formulated as the roi.pattern variable and the whole process can be implemented as:

roi.pattern <- "[[:space:]]+[[:digit:]]+\\) [[:digit:]]{4}/[[:digit:]]{2}/[[:digit:]]{2}"
roi.line.num <- grep(pattern = roi.pattern, s)
res.3 <- s[roi.line.num]

To me, there isn't an absolute winner between the second and third approach, but I prefer to use regular expression because it has more fun with it; I am a statistician and like to spot patterns.

Also, it is an direct approach and more flexible. Note I can continue to add components to the regular expression to increase the confidence in selecting the right data matrix. For example, there are spaces and then few integers at the timestamp. But it will presumably increase the run-time.

Code and Sample Data

You can download the exmaple data and run the scripts listed below in R to reproduce all the results.

#### * Path
data.file <- "~/Downloads/1985WIC.txt" ## to the downloaded data file

#### * Approach 1
read.table(data.file, skip = 11) ## error

#### * Approach 2
s <- readLines(data.file)
metainfo.list <- c("Port:", "Site:", "Latitude:", "Longitude:", "Start Date:", "End Date:", "Contributor:", "Datum information:", "Parameter code:")
meta.line.num <- sapply(metainfo.list, function(i) {
    grep(pattern = i, s)
})
res.2 <- s[-meta.line.num]

#### * Approach 3
roi.pattern <- "[[:space:]]+[[:digit:]]+\\) [[:digit:]]{4}/[[:digit:]]{2}/[[:digit:]]{2}"
roi.line.num <- grep(pattern = roi.pattern, s)
res.3 <- s[roi.line.num]

Why I Should Explore Regular Expression and Why I Haven't

Like many R users who are not actually programmer, I am afraid of regular expression (RegExp), whenever I saw something like

I'd told myself I won't be able to understand it and gave up on the sight.

But I've collected few RegExp patterns that do magical jobs. My favourites are the dot (.) and dollar ($) sign and I usually use them with list.files() to filter the file names in a directory. For example,

list.files(pattern = ".RData$")
list.files(pattern = ".text$")

The first line returns all the R image files, which have file names ending with RData, and for the second all the text files which have file names ended with text. Basically in regular expression, dot sign (.) means anything, and dollar sign ($) means the end of a string. By combining these two, I am able to select multiple files with certain patterns, without manually picking one by one.

How powerful is that! It is an inspirational example that motivates myself from time to time to look deeper and get my head on the topic of regular expression. But I just couldn't have a clear picture of how to us it.

I think the main problems for me to understand RegExp in R are

The syntax is content-sensitive

A subtle change can lead to random results. For example, the above pattern can also be \\.RData$, which means file names ended with .RData. The dot (.) sign here literally means ".". Adding two backslashes \\ changes the meaning of the pattern completely, but both gives the same results. It gave me so much frustration when extrapolating a pattern that works in one case to a similar case but get random results.

The syntax is hard to read

The RegExp pattern above are reasonably easy to understand, if one spent 10 minutes reading the manual, but the following is just crazy.

m <- regexec(pattern = "^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)", x)

There are 12 parentheses, 6 square brackets and many other symbols. Even same symbol have different meanings, and it's hard to find out exactly what they means because

There isn't enough learning materials

I've never seen an R book that mentioned regular expression. This topic is certainly not a teaching content in university courses or training workshops.

Even google fails to find any meaningful resource except for the Text Processing in Wiki, which is the best I could find.

Although there are related questions in StackOverflow, most of the answers were set in a very specific situation. It's hard make it applicable to other situations or learn this topic from the discrete Q&As.

It has created a mental barrier that statistician shouldn't teach nor learn RegExp at all, or at least for me. But my limited experience suggests that it is such a powerful feature that I've missed a lot.

But

I believe there will be more chances to process text files, for example, parse the log files of this blog. RegExp can improve the efficiency to a great extent. So I am considering to invest the time to learn it properly.

Are you a R user? What's your experience with regular expression? Do you have good learning materials to recommend? If so, please share your experience on the less-talked area.

Use Emacs's Org-mode to Effectively Manage Small Projects

Table of Contents

DEADLINE: <2015-06-09 Tue 20:00>

Org-mode is great to serve as knowledge management tool, it also has helped me increase my personal effectiveness. Recently I have been exploring org-mode for managing small projects in the business environment, in which collaboration happens occeasionally between me and the project team members.

In this post I summarised my workflow to organise, manage and monitor a project. The implementation of this workflow revolves around the collaboration. I have been practise this workflow for a while and can see my growth in planing and managing skills.

Organising

I use a broad definition of project: as long as a task that requires a series of sub-tasks to be done, then it is a project. Normally I categories any tasks that relates to a project into three groups:

Project Tasks
the major tasks that must to been done in order to deliver the project product.
Tasks
administrative or miscellaneous tasks that keep the project goes on, like sent out the invoice.
Notes
anything that is important to the project and therefore worthy keeping a record, like meeting notes or decision made that that impacts the project progress.

Each category has a corresponding top level section or heading. Once this outline is setup, it is very convenient to view content under these categories, regardless of what tasks I was working on, either reading emails, coding, or writing report. Org-mode can scan all the .org files in a direcotry, and creates a tree-structure, with the file name being the root, and headings being the nodes.

An intuitive way to locate a any node is to start from the beginning, the process is same as finding a section in a text book. It can be summarised as:

  1. first, find the right book by its name,
  2. then find the right part,
  3. then narrow down to the right section,

and continue to the section I am interested in. An more pleasure way is to use fuzzy match supported by Helm package - I can narrow down the selection by random nodes. For example, as the images below shows, to locate headline under this article among 40 org files, I only need to search "Small pro", because there are only three headlines has "Small" in its name, "small changes", "small talk", and "small project", and "pro" narrow down to the unique headline.

It saves me a lot of time in remembering where I saved one notes, and wandering around the files to find something. I only explain a bit of the features of Helm, if you want to try out, you can find my configuration here. I recommend a good tutorial if you want to know more.

nil

Use_org_mode_to_manage_a_small_project.png

Figure 1: Test image

We usually a couple of projects at the same time. Also, create a new tasks or notes is easy. org-capture-mode would create a temporary node and by default it will be saved as a subtree in refile.org, or I can directly re-locate the headline directly to this project using the locating mechanism above.

These two features are most enjoyable to use, and make me away from wandering in multiple directories, trying to find the right files, and therefore increase my productivity. Never under estimate how long you will spent in finding in one file.

Managing

Projects usually come with hard deadlines about the product delivery. Setting and change deadlines in org-mode is pleasurable with org-deadline C-c C-d.

It brings up a mini-calendar buffer (shown below), I can use shift+left and shfit+right to move forward and backward for a day, or shift-up and shift-down to move between weeks, and hit RET to select a deadline. Apart from navigating, I can also choose to type the exact date directly, like "2015-07-25" and hit RET.

nil

Once the deadline is set it will show up in that day's calendar. I don't want to suddenly realise there is a deadline I had on that day. So it makes sense to have an early warning period to show the tasks if it is due in days. This behaviour is governed by the org-deadline-warning-days variable. In my Emacs configuration, I set to 30 days. It gives me plenty of time to do any tasks.

I also set deadlines for sub-tasks since it is quite easy to do in org-mode. But coming up with realistic deadlines is difficult. To me, it must give enough time to do the task properly, to the PM, it must be fit in the whole project plan and resource. Both are likely to have different opinion on how long to implement the new features with documentation. It is quite important skills to have: to me, it reflects my understand on the problem and also my own technical capability, to the manager, it is part of their project plan.

My initial estimation may be far from the actual effort, especially when the problem domain is new to me, or I haven't done similar tasks before. The more I do, the better I am good at estimating. At this stage, I practise this skill seriously, and like to have someone with more experienced to review my estimation.

To make this task easy for them, I'd present an overall view of the project time-lines, which clearly shows the period allocate to the specific tasks. org-timeline will generate a time-sorted view for all the tasks. The recent feedback I received is that I tend to overlook the time spent on documentation and tests. Someone with more than 10 years in software development says they usually takes about 3x times on these two tasks together than actually coding.

time-line view also provides benchmark to the progress and I check it frequently to make sure I am on track. It gives the PM a reference for swapping tasks if some becomes urgent.

Monitor

Additional to have the early warning system to prevent sudden surprise, org-mode provides another way of monitoring the project in terms of resource - the actual time I spent on the project. This feature is quite useful when I am given a quite loose deadline but with limited resource, say 150 hours.

Since the sub-tasks are mostly defined in the early stage, whenever I start to do it, I clock in first by org-clock-in. The clocking will be stopped once I manually clock out, or clock in to another task, or the tasks is completed (marked as DONE.) For each clock entry, it shows start time, end time and duration.

Multiple clocking logs are accumulated, and each entry shows the start time, end time, and duration. The durations can be added up and tells me exactly how much time I spent on each tasks. The whole tasks under the project and aggregated across the whole project, by one single function org-clock-report (C-c C- C-r).

Table 1: Clock summary at [2015-06-14 Sun 11:17]
Headline Time     Effort
Total time 10:41      
TODO Use Emacs's org-mode to Manage a Small Project 10:41      
  TODO Tasks   1:45    
   DONE add example for org-refile     0:35 0:30
   NEXT add example for org-clock-report     0:13 0:15
   NEXT proof read     0:11 0:15
   NEXT proof read - 2     0:46 1:00

It is normal to underestimate the complexity of an tasks, and spent too much time in resolve them, and usually I can catch up the in the later stage, however if I had the feeling the overall progress has been affected, I need require more sources from the PM, and the quote I will give is extra hours I had based on my initial estimation. That's an quick reaction.

Also, the clock-report table tells me the different between my effort estimation and the actual time I spent on that tasks.

Control the Plotting Order in ggplot2

nil

The above two plots show the same data (included below), and if you are going to present one to summarise your findings, which will you choose? It is very likely you are going to pick the right one, because

  1. the linear increasing feature of bars is pleasant to see,
  2. it is easier to compare the categories, the ones on the right has higher value than the ones on the left, and
  3. categories with lowest and highest value are clearly shown,

In this article I am trying to explain how to specify the plotting orders in ggplot to whatever you want and encourage R starters to use ggplot2.

To create a bar plot is dead easy in R, take this dataset as an example,

mode count
ssh-mode 2361
fundamental-mode 4626
git-commit-mode 4869
mu4e-compose-mode 4964
emacs-lisp-mode 6205
shell-mode 10046
minibuffer-inactive-mode 12624
inferior-ess-mode 25774
ess-mode 47115
org-mode 78195

to get the plot on the right side, reorder the table by count (it is already been done), then

with(df, barplot(count, names.arg = mode)) 

will do the job. That's simple and easy, it does what you provide. This is completely different to ggplot() paradigm, which does a lot computation behind the scene.

ggplot(df, aes(mode, count)) + geom_bar()

will give you the first plot; the categories are in alphabetically order. In order to get a pleasant increasing order that depends on the count or any other variable, or even manually specified order, you have to explicitly change the level of factors.

df$mode.ordered <- factor(df$mode, levels = df$mode)

create another variable mode.oredered which looks the same as mode, except for the underlying levels are in different. It is set to the order of counts. Run the same ggplot code again will give you the plot on the right. How does it work?

First, every factor in R is mapped into an integer, and the default mapping algorithm is

  1. sort the factor vector alphabetically,
  2. map the first factor to 1, and last to 10.

So emacs-lisp-mode is mapped to 1 and ssh-mode is mapped to 10.

What the reorder script can do is to sort the factors by count, so that ssh-mode is mapped to 1 and org-mode is mapped to 10, I.e. the factor order which are set to the order of count.

How does this affects ggplot? I presume ggplot do the plotting on the order of levels, or let's say on the integer space, I.e. do the plotting from 1 to 10, and then add the labels for each.

In this example, the default barplot function did the job. Usually we need to do extra data manipulation so that ggplot will do what we want, in exchange for the plot good better and may fits in the other plots. Without considering the time constraints, I would encourage people to stick with ggplot because like many other things in life, once you understand, it becomes easier to do. For example, it is actually very easy to specify the order manually with only two steps:

  • first, sort the whole data.frame to a variable,
  • then change the levels options in factor() to what ever you want.

To show a decreasing trends - the reverse order of increasing, just use levels = rev(mode). How neat!

RExercise - Analyst Your Exercise Data in R

RExercise is a by-product of the ActivityDashboard. It parses your exercise data in .GPX format and for each workout, it returns

location table
a data.frame with longitude, latitude, elevation at a particular recording time,
summary table
a one-row data.frame of summary statistics about the workout, includes duration, distance, speed etc.

It comes with a helper function Parse_GPX_all to do the batch process and combine all data.frame together, also add city and country to the summary tables. Then you can see all the activities summary in one table, and use it to query both location and summary table, for example, how many miles did you run last year? How many cities had you run? It meant to make you feel great by showing you have achieved a lot.

Currently it parsing data from RunKeeper and Strava perfectly. .GPX format is generic data format so applying RExercise to data from other apps shuodn't be a problem. If you do, please feel free to contact me, I am extermely friendly to people who do exercise (:d), or sent me a pull request if you already figure out.

Demo

Suppose you have those .GPX data files,

20150108-170830-Run.gpx 
20150109-171835-Run.gpx 
20150111-113750-Run.gpx 
20150112-171906-Walk.gpx

RExercise will gives you a location table and summary table as follows:

Table 1: A Summary Table
id activity date start.time name duration (h) distance (km) speed (km/h) elevation (m) climb (m)
20150108-170830 Run 2015-01-08 17:08:14 Afternoon 0.13 0.74 5.4 109.0 11.1
20150109-171835 Run 2015-01-09 17:18:14 after work 0.42 3.33 7.9 110.5 60.1
20150111-113750 Run 2015-01-11 11:37:14 Sunday 0.50 4.25 8.4 130.6 136.6
20150112-171906 Run 2015-01-12 17:19:14 after work 0.51 4.08 7.9 110.4 88.6
Table 2: A Location Table
lon lat ele time
-2.019050 53.961909 108.4 2015-01-11 11:37:50
-2.017989 53.961375 109.8 2015-01-11 11:38:27
-2.018019 53.961427 109.8 2015-01-11 11:38:29
-2.018004 53.961536 109.8 2015-01-11 11:38:30
-2.018189 53.962276 110.4 2015-01-11 11:38:33
-2.018141 53.962277 110.4 2015-01-11 11:38:34
-2.018090 53.962276 110.4 2015-01-11 11:38:35

Usage

1. Install

devtools::install_github("yitang/rexercise")

2. Download GPX data

3. Set working directory and app

all.data <- Parse_GPX_all(data.dir = "~/ExerciseData/Strave/",
			 app = "Strava",
			 add.city = TRUE) 

You should have two tables as shown in Demo section.

If you have any questions or comments, please post them below. If you liked this post, you can share it with your followers or follow me on Twitter!