The first step in data analysis is to get the data into the modelling
platform. But this may not be as straightforward as it used to be:
nowadays statisticians are more likely to face data files that are not
in CSV or other formats that can be fed directly to the read.table()
function in R. In those cases, we need to understand the structure of
the data files and pre-process them first. My general strategy is to
discard the unnecessary information in the data files and hopefully be
left with a regular data file.
In last week's post, Why I Should Explore Regular Expression and
Why I Haven't, I expressed my interest in regular expressions, and
luckily I got a chance to use them for getting data into R. They offer
a different strategy: pick only what I am interested in.
The Irregular Data Files
The task is simple: I have about 1,800 .text data files downloaded
from the British Oceanographic Data Centre (BODC). They are historical
tidal data, separated by year and by port. I need to combine all the
data into one giant table in R and save it for modelling later.
One sample data file looks like this:
Port: P035
Site: Wick
Latitude: 58.44097
Longitude: -3.08631
Start Date: 01JAN1985-00.00.00
End Date: 03OCT1985-19.00.00
Contributor: National Oceanography Centre, Liverpool
Datum information: The data refer to Admiralty Chart Datum (ACD)
Parameter code: ASLVZZ01 = Surface elevation (unspecified datum) of the water body
Cycle Date Time ASLVZZ01 Residual
Number yyyy mm dd hh mi ssf f f
1) 1985/01/01 00:00:00 1.0300 -0.3845
2) 1985/01/01 01:00:00 1.0400 -0.3884
3) 1985/01/01 02:00:00 1.2000 -0.3666
The first 9 lines are the metadata, which describe the port ID, the name
and location of the port, and other information about the data. Lines
10 and 11 are the headers of the data matrix.
First Attempt - Skip Lines
After a glimpse of the data sample, my first thought was to
skip the first 11 lines and treat the rest as a regular data file
with spaces as separators. This can easily be done using
read.table() with the skip = 11 option.

read.table(data.file, skip = 11)  ## error
It turned out this approach won't work for some files: when the way of
measuring the tide was changed, the date and port were highlighted,
leaving a second chunk of data matrix, again with metadata and a few
other characters. It looks like this:
;; end of first chunk
########################################
Difference in instrument
########################################
Port: P035
;; other metadata
Second Attempt - Remove Lines
Although the first attempt wasn't a success, I learnt a bit about the
structure of the data files. Based on that, I came up with a second
approach: read the data files into R as a vector of strings, one
element per line, and then remove all the lines that are metadata.
They start with Port:, Site:, Longitude:, etc., or belong to the ###
chunk. This can be done using the grep() function, which tells me
exactly which elements of the vector contain metadata, as in the
sketch below.
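In code, the idea looks like this (a condensed version of Approach 2
from the full listing at the end of the post; data.file is the path to
one downloaded file):

s <- readLines(data.file)    ## one string per line of the file
metainfo.list <- c("Port:", "Site:", "Latitude:", "Longitude:", "Start Date:",
                   "End Date:", "Contributor:", "Datum information:",
                   "Parameter code:")
meta.line.num <- sapply(metainfo.list, function(i) grep(pattern = i, s))
res.2 <- s[-meta.line.num]   ## drop the metadata lines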
This approach works well as long as metainfo.list contains all
the lines I'd like to remove. The downside is that I won't know
whether I've included all of them until the whole process is finished.
So while I was waiting for the program to finish, I came up with a
third approach, a better one.
Third Attempt - Capture Lines (RegExp)
The above two approaches discard the unnecessary information, but
there may be other lines that should be discarded which I haven't
encountered yet, in which case the process becomes a tedious
trial-and-error and takes quite a long time.
Equally, another approach is to select exactly what I am interested in
by using a regular expression. But first, I have to identify the pattern.
Each data point was recorded at a certain time, and therefore must be
associated with a timestamp; for example, the first data point was
recorded at 1985-01-01 00:00:00. Each also has an ID value followed by
a closing parenthesis, for example 1).
1) 1985/01/01 00:00:00 1.0300 -0.3845
So the lines I am interested in share a common pattern that can
be summarised as: they start with a number of spaces, and also have

an observation ID
a few digits followed by a closing parenthesis,
an observation date
a few digits separated by forward slashes, meaning year, month and
day, followed by a space,
an observation time
a few digits separated by colons, meaning hour, minutes and
seconds.
This pattern can be formulated in RegExp as the roi.pattern variable,
and the whole process can be implemented as:
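## Approach 3, as in the full listing at the end of the post
roi.pattern <- "[[:space:]]+[[:digit:]]+\\) [[:digit:]]{4}/[[:digit:]]{2}/[[:digit:]]{2}"
roi.line.num <- grep(pattern = roi.pattern, s)   ## lines that look like data records
res.3 <- s[roi.line.num]                         ## keep only those lines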
To me, there isn't an absolute winner between the second and third
approaches, but I prefer the regular expression because it is more
fun; I am a statistician and like to spot patterns.
It is also a direct approach and more flexible. Note that I can continue
to add components to the regular expression to increase the confidence
in selecting the right data matrix. For example, after the timestamp
there are spaces and then a few digits. But that will presumably
increase the run-time.
Code and Sample Data
You can download the example data and run the script listed below
in R to reproduce all the results.
#### * Path
data.file <- "~/Downloads/1985WIC.txt"  ## to the downloaded data file

#### * Approach 1
read.table(data.file, skip = 11)  ## error

#### * Approach 2
s <- readLines(data.file)
metainfo.list <- c("Port:", "Site:", "Latitude:", "Longitude:", "Start Date:",
                   "End Date:", "Contributor:", "Datum information:",
                   "Parameter code:")
meta.line.num <- sapply(metainfo.list, function(i) {
  grep(pattern = i, s)
})
res.2 <- s[-meta.line.num]

#### * Approach 3
roi.pattern <- "[[:space:]]+[[:digit:]]+\\) [[:digit:]]{4}/[[:digit:]]{2}/[[:digit:]]{2}"
roi.line.num <- grep(pattern = roi.pattern, s)
res.3 <- s[roi.line.num]
Like many R users who are not actually programmers, I am afraid of
regular expressions (RegExp): whenever I saw a complicated pattern,
I told myself I wouldn't be able to understand it and gave up on
sight.
But I've collected a few RegExp patterns that do magical
jobs. My favourites are the dot (.) and dollar ($) signs, and I usually
use them with list.files() to filter the file names in a directory. For
example:
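## two calls along these lines (patterns reconstructed from the description below)
list.files(pattern = ".RData$")  ## R image files, names ending with RData
list.files(pattern = "text$")    ## text files, names ending with text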
The first line returns all the R image files, whose names end
with RData, and the second all the text files, whose names end with
text. Basically, in regular expressions, the dot sign (.) means
anything, and the dollar sign ($) means the end of a string. By
combining these two, I am able to select multiple files matching a
certain pattern, without picking them one by one.
How powerful is that! It is an inspirational example that motivates
me from time to time to look deeper and get my head around the topic
of regular expressions. But I just couldn't get a clear picture of how
to use them.
I think the main problems for me in understanding RegExp in R are:
The syntax is context-sensitive
A subtle change can lead to seemingly random results. For example, the
above pattern can also be written as \\.RData$, which means file names
ending with .RData; the dot (.) here literally means ".". Adding the
two backslashes \\ changes the meaning of the pattern completely, yet
both give the same results. It gave me so much frustration when
extrapolating a pattern that works in one case to a similar case
and getting unpredictable results.
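A tiny example of that subtlety, using grepl() on made-up file names:

grepl(".RData$",  "mydata.RData")   ## TRUE
grepl(".RData$",  "mydataXRData")   ## TRUE  - the unescaped "." matches "X"
grepl("\\.RData$", "mydataXRData")  ## FALSE - "\\." requires a literal dot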
The syntax is hard to read
The RegExp patterns above are reasonably easy to understand if one
spends 10 minutes reading the manual, but some patterns are just crazy:
12 parentheses, 6 square brackets and many other symbols. Even the
same symbol can have different meanings, and it's hard to find out
exactly what they mean because
There aren't enough learning materials
I've never seen an R book that mentions regular expressions. The topic
is certainly not taught in university courses or training workshops.
Even Google fails to find any meaningful resources, except for the Text
Processing page in the Wiki, which is the best I could find.
Although there are related questions on StackOverflow, most of the
answers are set in very specific situations. It's hard to make them
applicable to other situations, or to learn the topic from discrete Q&As.
This has created a mental barrier, as if statisticians shouldn't teach
or learn RegExp at all, or at least that's how it felt for me. But my
limited experience suggests that it is such a powerful feature that
I've been missing out a lot.
I believe there will be more chances to process text files, for example
parsing the log files of this blog, and RegExp can improve efficiency to
a great extent. So I am considering investing the time to learn it properly.
Are you an R user? What's your experience with regular expressions? Do
you have good learning materials to recommend? If so, please share
your experience of this less-talked-about area.
Org-mode is great as a knowledge-management tool, and it has also helped
me increase my personal effectiveness. Recently I have been exploring
org-mode for managing small projects in a business environment,
in which collaboration happens occasionally between me and the project
team members.
In this post I summarise my workflow for organising, managing and
monitoring a project. The implementation of this workflow revolves
around that collaboration. I have been practising this workflow for a
while and can see my growth in planning and managing skills.
Organising
I use a broad definition of project: as long as a task requires a series
of sub-tasks to be done, it is a project. Normally I categorise
any tasks that relate to a project into three groups:
Project Tasks
the major tasks that must be done in order to
deliver the project product.
Tasks
administrative or miscellaneous tasks that keep the project going,
like sending out an invoice.
Notes
anything that is important to the project and therefore
worth keeping a record of, like meeting notes or decisions
that impact the project's progress.
Each category has a corresponding top-level section or heading. Once
this outline is set up, it is very convenient to view the content under
these categories, regardless of what task I am working on, whether
reading emails, coding, or writing a report. Org-mode can scan all the
.org files in a directory and create a tree structure, with the file
name being the root and the headings being the nodes.
An intuitive way to locate any node is to start from the beginning;
the process is the same as finding a section in a textbook:
first, find the right book by its name,
then find the right part,
then narrow down to the right section,
and continue to the subsection I am interested in. A more pleasant way
is to use the fuzzy matching supported by the Helm package: I can narrow
down the selection with fragments of the node names. For example, as the
image below shows, to locate the headline for this article among 40 org
files, I only need to search "Small pro", because only three headlines
have "small" in their names ("small changes", "small talk", and "small
project"), and "pro" narrows it down to the unique headline.
It saves me a lot of time remembering where I saved a note, and
wandering around the files to find something. I have only explained a
small part of Helm's features; if you want to try it out, you can find
my configuration here, and I recommend a good tutorial if you want to
know more.
Figure 1: Test image
We usually run a couple of projects at the same time. Also, creating a
new task or note is easy: org-capture-mode creates a temporary node,
which by default is saved as a subtree in refile.org, or I can relocate
the headline directly to the relevant project using the locating
mechanism above.
These two features are the most enjoyable to use; they keep me from
wandering through multiple directories trying to find the right file,
and therefore increase my productivity. Never underestimate how long
you will spend looking for one file.
Managing
Projects usually come with hard deadlines for the product delivery.
Setting and changing deadlines in org-mode is pleasurable with
org-deadline (C-c C-d). It brings up a mini-calendar buffer (shown
below), where I can use shift+left and shift+right to move backward and
forward by a day, or shift+up and shift+down to move between weeks, and
hit RET to select a deadline. Apart from navigating, I can also type the
exact date directly, like "2015-07-25", and hit RET.
Once the deadline is set, it will show up in that day's calendar. I
don't want to suddenly realise there is a deadline on the day itself,
so it makes sense to have an early-warning period that shows a task
when it is due in a few days. This behaviour is governed by the
org-deadline-warning-days variable; in my Emacs configuration, I set it
to 30 days, which gives me plenty of time to do the task.
I also set deadlines for sub-tasks, since it is quite easy to do in
org-mode. But coming up with realistic deadlines is difficult: to me,
a deadline must give enough time to do the task properly; to the PM, it
must fit in the overall project plan and resources. The two sides are
likely to have different opinions on how long it takes to implement a
new feature with documentation. It is quite an important skill to have:
for me, it reflects my understanding of the problem and my own technical
capability; for the manager, it is part of their project plan.
My initial estimate may be far from the actual effort, especially
when the problem domain is new to me, or I haven't done similar tasks
before. The more I do, the better I become at estimating. At this
stage, I practise this skill seriously, and like to have someone more
experienced review my estimates.
To make this task easy for them, I present an overall view of the
project timeline, which clearly shows the periods allocated to
specific tasks; org-timeline generates a time-sorted view of all
the tasks. The most recent feedback I received is that I tend to
overlook the time spent on documentation and tests. Someone with more
than 10 years in software development says these two together usually
take about three times as long as the actual coding.
The timeline view also provides a benchmark for progress, and I check it
frequently to make sure I am on track. It also gives the PM a reference
for swapping tasks if something becomes urgent.
Monitoring
In addition to the early-warning system that prevents sudden surprises,
org-mode provides another way of monitoring the project in terms of
resources: the actual time I spend on it. This feature is quite useful
when I am given a loose deadline but a limited resource budget,
say 150 hours.
Since the sub-tasks are mostly defined at an early stage, whenever I
start one I clock in first with org-clock-in. The clock stops once I
manually clock out, clock in to another task, or complete the task
(mark it as DONE). Each clock entry shows the start time, end time and
duration.
Multiple clock entries accumulate, and the durations can be added up
to tell me exactly how much time I spent on each task, and aggregated
across the whole project, with a single command, org-clock-report
(C-c C-x C-r).
Table 1: Clock summary at [2015-06-14 Sun 11:17]
| Headline                                             |  Time | Effort |
|------------------------------------------------------+-------+--------|
| Total time                                           | 10:41 |        |
| TODO Use Emacs's org-mode to Manage a Small Project  | 10:41 |        |
| TODO Tasks                                           |  1:45 |        |
| DONE add example for org-refile                      |  0:35 |   0:30 |
| NEXT add example for org-clock-report                |  0:13 |   0:15 |
| NEXT proof read                                      |  0:11 |   0:15 |
| NEXT proof read - 2                                  |  0:46 |   1:00 |
It is normal to underestimate the complexity of a task and spend
too much time resolving it, and usually I can catch up at a later
stage. However, if I feel the overall progress has been affected, I
need to request more resources from the PM, and the quote I give is
the extra hours on top of my initial estimate. That's a quick reaction.
Also, the clock-report table tells me the difference between my effort
estimate and the actual time I spent on a task.
The above two plots show the same data (included below). If you were going to present one to summarise your findings, which would you choose? It is very likely you would pick the right one, because
the steadily increasing bars are pleasant to see,
it is easier to compare the categories, since the ones on the right have higher values than the ones on the left, and
the categories with the lowest and highest values are clearly shown.
In this article I try to explain how to specify the plotting order in ggplot2 to be whatever you want, and to encourage R starters to use ggplot2.
Creating a bar plot is dead easy in R; take this dataset as an example:
| mode                     | count |
|--------------------------+-------|
| ssh-mode                 |  2361 |
| fundamental-mode         |  4626 |
| git-commit-mode          |  4869 |
| mu4e-compose-mode        |  4964 |
| emacs-lisp-mode          |  6205 |
| shell-mode               | 10046 |
| minibuffer-inactive-mode | 12624 |
| inferior-ess-mode        | 25774 |
| ess-mode                 | 47115 |
| org-mode                 | 78195 |
To get the plot on the right-hand side, reorder the table by count (already done here), then
with(df, barplot(count, names.arg = mode))
will do the job. That's simple and easy: it plots exactly what you provide. This is completely different from the ggplot() paradigm, which does a lot of computation behind the scenes.
ggplot(df, aes(mode, count)) + geom_bar(stat = "identity")
will give you the first plot; the categories are in alphabetical order. To get a pleasant increasing order that depends on count or any other variable, or even a manually specified order, you have to explicitly change the levels of the factor.
df$mode.ordered <- factor(df$mode, levels = df$mode)
creates another variable, mode.ordered, which looks the same as mode, except that the underlying levels are different: they are set to the order of count. Run the same ggplot code again with mode.ordered and it will give you the plot on the right. How does it work?
First, every factor in R is mapped to an integer, and the default mapping algorithm is to
sort the factor levels alphabetically,
map the first level to 1, and the last to 10 (the number of levels here).
So emacs-lisp-mode is mapped to 1 and ssh-mode is mapped to 10.
What the reordering does is sort the levels by count, so that ssh-mode is mapped to 1 and org-mode is mapped to 10, i.e. the factor order is set to the order of count.
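A quick way to see the two mappings, assuming the table above is stored in a data frame df that is already sorted by count:

levels(factor(df$mode))                    ## alphabetical: emacs-lisp-mode ... ssh-mode
levels(factor(df$mode, levels = df$mode))  ## by count: ssh-mode ... org-mode
as.integer(factor(df$mode))                ## the default integer codes behind the factor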
How does this affect ggplot? I presume ggplot does the plotting in the order of the levels, or let's say on the integer scale, i.e. it plots from 1 to 10 and then adds the label for each.
In this example, the default barplot() function did the job. Usually we need to do some extra data manipulation so that ggplot will do what we want, in exchange for a plot that looks better and fits with the other plots. Time constraints aside, I would encourage people to stick with ggplot because, like many other things in life, once you understand it, it becomes easier to do. For example, it is actually very easy to specify the order manually, with only two steps (see the sketch below):
first, sort the whole data.frame by a variable,
then set the levels argument in factor() to whatever you want.
To show a decreasing trend, the reverse of the increasing order, just use levels = rev(mode). How neat!
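A minimal sketch of those two steps, assuming the table above is in a data frame df:

library(ggplot2)
df <- df[order(df$count), ]                                    ## step 1: sort by count
df$mode.decreasing <- factor(df$mode, levels = rev(df$mode))   ## step 2: reverse the level order
ggplot(df, aes(mode.decreasing, count)) + geom_bar(stat = "identity")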
I am a search guy: when I want to know something, I use the search functionality to jump to where the keyword is; I don't use my eyes to scan the page, which is too slow and harmful.
Emacs provides powerful search functionality. For example, I use these commands very often (with their key-bindings):
isearch (C-s): search for a string and move the cursor there,
helm-swoop (C-F1): find all the occurrences of a string, pulling the matching lines into another buffer where I can edit and save them,
helm-multi-swoop (M-X): apply helm-swoop to multiple buffers, very handy when I want to know where a function is called across buffers,
projectile-grep or helm-projectile-grep (C-c p s g): find which files in the current project contain a specific string; similar to helm-multi-swoop but limits the search to files in the project directory.
I love searching in Emacs, but the problem is having to remember all the key-bindings for different tasks. Also, sometimes I forget what alternatives I have and go with the one I am most familiar with, which is often not the right one. I sometimes realise I have used isearch several times to do what ace-jump-word-mode could achieve in one go.
The post Org-mode Hydras incoming! gave me the idea of grouping all these functions together and pressing a single key to perform the different tasks, which frees my mind from remembering all the key-bindings. Also, I can add a few lines of text to remind myself when to do what, which potentially solves the second problem.
So next time I want to search for something, I just press F4, and it brings up all the choices I have; I don't need to worry about the key-bindings or which command to use. That's cool!
I am looking forward to simplifying my Emacs workflow using the hydra package; the key challenge is to identify the logical similarities among tasks and then group them together accordingly. For hydra-search(), it is "search something, somewhere".