Why I Should Explore Regular Expressions and Why I Haven't
18 Jun 2015
Like many R users who are not actually programmers, I am afraid of regular expressions (RegExp). Whenever I saw something like
I told myself I wouldn't be able to understand it and gave up on sight.
But I've collected a few RegExp patterns that do magical jobs. My favourites are the dot (.) and dollar ($) signs, and I usually use them with list.files() to filter the file names in a directory. For example,
The first line returns all the R image files, whose names end with RData, and the second returns all the text files, whose names end with text. Basically, in a regular expression the dot (.) matches any single character, and the dollar sign ($) marks the end of a string. By combining the two, I can select multiple files that follow a certain pattern without picking them manually one by one.
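The two list.files() calls described above are not shown here, so the following is only a minimal sketch of what they might look like. The temporary directory, the file names, and the .text$ pattern are assumptions made for illustration; only .RData$ is confirmed by the text.

```r
# Build a throwaway directory with a few hypothetical files.
tmp <- tempfile("regexp-demo")
dir.create(tmp)
invisible(file.create(file.path(
  tmp, c("model.RData", "notes.text", "RData_notes.md", "script.R")
)))

# "." matches any single character; "$" anchors the match to the end of the name.
rdata_files <- list.files(tmp, pattern = ".RData$")
text_files  <- list.files(tmp, pattern = ".text$")

print(rdata_files)  # "model.RData"
print(text_files)   # "notes.text"
```

Note that RData_notes.md is not returned: RData appears in the name, but not at the end of it.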
How powerful is that! It is an inspirational example that motivates me from time to time to look deeper and get my head around regular expressions. But I just couldn't get a clear picture of how to use them.
I think the main obstacles to my understanding RegExp in R are
The syntax is context-sensitive
A subtle change can lead to unexpected results. For example, the above pattern can also be written as \\.RData$, which means file names ending with .RData. The dot (.) here literally means ".". Adding two backslashes \\ changes the meaning of the pattern completely, yet here both give the same results. It was very frustrating to extrapolate a pattern that works in one case to a similar case, only to get unexpected results.
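To make the difference concrete, here is a small sketch using grepl() on a hypothetical vector of file names (the names are made up for illustration). The two patterns agree as long as every match has a real dot before RData, and disagree as soon as one does not.

```r
# Hypothetical file names; "backupRData" has no dot before "RData".
files <- c("model.RData", "backupRData", "notes.text")

# Unescaped dot: any single character may come before "RData".
any_char <- grepl(".RData$", files)
# Escaped dot: a literal "." must come before "RData".
literal_dot <- grepl("\\.RData$", files)
# Inside square brackets the dot is also literal, with no backslashes needed.
bracket_dot <- grepl("[.]RData$", files)

print(any_char)     # TRUE  TRUE FALSE
print(literal_dot)  # TRUE FALSE FALSE
print(bracket_dot)  # TRUE FALSE FALSE

# Two backslashes are needed because R itself consumes one when reading the
# string; the regular expression engine only ever sees a single backslash.
writeLines("\\.RData$")  # prints \.RData$
```

So the same symbol means "any character" in one position and "a literal dot" in another, depending on what surrounds it.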
The syntax is hard to read
The RegExp patterns above are reasonably easy to understand if one spends 10 minutes reading the manual, but the following is just crazy.
There are 12 parentheses, 6 square brackets and many other symbols. Even the same symbol can have different meanings, and it's hard to find out exactly what they mean because
There aren't enough learning materials
I've never seen an R book that mentions regular expressions. This topic is certainly not part of the teaching content in university courses or training workshops.
Even Google fails to find any meaningful resources, except for the Text Processing in Wiki, which is the best I could find.
Although there are related questions on StackOverflow, most of the answers are set in a very specific situation. It's hard to apply them to other situations or to learn the topic from discrete Q&As.
This has created a mental barrier, at least for me: the feeling that statisticians shouldn't teach or learn RegExp at all. But my limited experience suggests that it is such a powerful feature that I've been missing out on a lot.
But
I believe there will be more chances to process text files, for example, parsing the log files of this blog. RegExp can improve the efficiency to a great extent. So I am considering investing the time to learn it properly.
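As a rough idea of what that could look like, here is a minimal sketch that pulls the status code and the requested path out of a few log lines. The log format and the lines themselves are invented for illustration and are not this blog's real logs.

```r
# Hypothetical access-log lines; the format is an assumption.
log_lines <- c(
  '127.0.0.1 [18/Jun/2015:10:12:01] "GET /index.html" 200',
  '10.0.0.7 [18/Jun/2015:10:12:05] "GET /missing.html" 404',
  '10.0.0.7 [18/Jun/2015:10:13:40] "GET /about.html" 200'
)

# "[0-9]{3}$" matches exactly three digits at the end of each line.
status <- regmatches(log_lines, regexpr("[0-9]{3}$", log_lines))

# The parentheses capture the path after GET; "\\1" keeps only that part.
path <- sub('.*"GET ([^"]+)".*', "\\1", log_lines)

print(status)                 # "200" "404" "200"
print(path[status == "404"])  # "/missing.html"
```

Two short patterns replace what would otherwise be manual splitting and counting of characters in every line.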
Are you an R user? What's your experience with regular expressions? Do you have good learning materials to recommend? If so, please share your experience in this less-talked-about area.