Table of Contents
The first step in data analysis is to get the data into the modelling
platform. But it may not be as straightforward as it used to be since
nowadays statistician are more likely face the data files that are not
in CSV or others format that can feed directly to the
function in R, in which cases, we need to understand the data files in
terms of the structure and apply pre-process first. My general
strategy is to discard the unnecessary information in the data files
and hopefully leave a regular data files.
In my last week's post, Why I Should Explore Regular Expression and Why I Haven't, I expressed my interests in Regular Expression and lucky I got a chance to use it for getting the data into R. It provides me a different strategy: pick only what I am interested in.
The Irregular Data Files
The task is simple: I have about 1,800
.text data files downloaded
from British Oceanographic Data Centre (BODC). They are the historical
tidal data and are separated by year and by port. I need to combine all
the data into one giant table in R, and save it later for modelling.
One sample data file looks like this:
Port: P035 Site: Wick Latitude: 58.44097 Longitude: -3.08631 Start Date: 01JAN1985-00.00.00 End Date: 03OCT1985-19.00.00 Contributor: National Oceanography Centre, Liverpool Datum information: The data refer to Admiralty Chart Datum (ACD) Parameter code: ASLVZZ01 = Surface elevation (unspecified datum) of the water body Cycle Date Time ASLVZZ01 Residual Number yyyy mm dd hh mi ssf f f 1) 1985/01/01 00:00:00 1.0300 -0.3845 2) 1985/01/01 01:00:00 1.0400 -0.3884 3) 1985/01/01 02:00:00 1.2000 -0.3666
The first 9 lines are the metadata, which describes the port ID, name and location of the port, and other information about the data. The line 10 and 11 are the headers of the data matrix.
First Attempt - Skip Lines
After the glimpse of the data sample, my first thought was to
skip the first 12 lines and treat the rest as a regular data files
that has space as separator. It can be easily done by using
skip = 12 option.
It turned out this approach won't work for some files because when the way of measuring tidal were changed, the date and port were highlighted, leaving a second chunk of data matrix but again with metadata and few other characters. It looks like this:
;; end of first chunk ######################################## Difference in instrument ######################################## Port: P035 ;; other metadata
Second Attempt - Remove Lines
Although the first attempt isn't success, I've learnt a bit about the
structure of the data files. And based on that, I came up with a
second approach: read the data files into R as a vector of string, one
element for a line, and then remove all the lines which are metadata.
They start with Port:, Site: or Longitude: etc or the ###
chunk. It can be done using
grep function, which tells me exactly
which element of the vector contains the metadata.
This approach works well as long as the metainfo.list contains all the lines I'd like to remove. The downside is that I won't able to know I've includes all of them until the whole process is finished. So when I was waiting for the program to finish, I came up with a third approach, a better one.
Third Attempt - Capture Lines (RegExp)
The above two approaches are to discard the unnecessary information, but I may be in the situation that there are other lines that should be discard but I haven't encounter yet, then the process becomes tedious try-error and takes quite long.
Equally, another approach is to select exactly what I am interested in by using regular expression. But first, I have to identify pattern. Each data point was recorded at a certain point, and therefore must be associated with a timestamp, for example, the first data point is recorded at 1926-01-01 00:00:00. They also has an ID values with an closing parentage's, for example 1.
1) 1985/01/01 00:00:00 1.0300 -0.3845
So the content of my interests are have a common pattern that can be summarised as: the lines that start with a number of spaces, and also have
- observation ID
- few integers, and an ending parentheses,
- observation date
- few integers with forward slashes that means year, month and day, and then a space,
- observation time
- few integers with colons, means hour, minutes and seconds.
The patterns in RegExp can be formulated as the roi.pattern variable and the whole process can be implemented as:
To me, there isn't an absolute winner between the second and third approach, but I prefer to use regular expression because it has more fun with it; I am a statistician and like to spot patterns.
Also, it is an direct approach and more flexible. Note I can continue to add components to the regular expression to increase the confidence in selecting the right data matrix. For example, there are spaces and then few integers at the timestamp. But it will presumably increase the run-time.
Code and Sample Data
You can download the exmaple data and run the scripts listed below in R to reproduce all the results.