[R] Awk and Vilno

Robert Wilkins irishhacker at gmail.com
Wed Jun 13 03:24:41 CEST 2007


In clinical trial data preparation and many other data situations, the
statistical programmer needs to merge and re-merge multiple input
files countless times. A syntax for merging files that is clear and
concise is very important for the statistical programmer's
productivity.

Here is how Vilno does it:

inlist dataset1 dataset2 dataset3 ;
joinby variable1 variable2  where ( var3<=var4 ) ;

Each column in a dataset has a variable name (variable1, variable2,
var3, var4).
You are merging three input datafiles: dataset1, dataset2, and dataset3.
The joinby statement asks for a many-to-many join, rather like the SQL
SELECT statement.
[ The mergeby statement asks for a many-to-one join, which is more efficient. ]
[ The readby statement asks for interleaving of rows: the rows don't
"match up", but each row goes under the preceding one
(100 rows + 100 rows -> 200 output rows). ]
The join (or merge) is done within variable1*variable2 subgroups: a row
from dataset1 where variable1=4 and variable2="Sam" can only match a
row from dataset2 where variable1=4 and variable2="Sam". In addition,
any match-ups where var3<=var4 does not hold are excluded.
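To make the joinby semantics concrete, here is a rough Python sketch (not Vilno's implementation, just an illustration with hypothetical toy rows): every pair of rows that agrees on the key variables is matched, then the where-filter is applied.

```python
from itertools import product

# Toy rows standing in for dataset1 and dataset2 (hypothetical data).
dataset1 = [{"variable1": 4, "variable2": "Sam", "var3": 1},
            {"variable1": 4, "variable2": "Sam", "var3": 9},
            {"variable1": 7, "variable2": "Ann", "var3": 2}]
dataset2 = [{"variable1": 4, "variable2": "Sam", "var4": 5},
            {"variable1": 7, "variable2": "Ann", "var4": 1}]

def joinby(left, right, keys, where):
    """Many-to-many join: match every pair of rows that agrees on the
    key variables, then keep only pairs passing the where-filter."""
    out = []
    for l, r in product(left, right):
        if all(l[k] == r[k] for k in keys) and where({**l, **r}):
            out.append({**l, **r})
    return out

rows = joinby(dataset1, dataset2, ["variable1", "variable2"],
              where=lambda row: row["var3"] <= row["var4"])
for row in rows:
    print(row)
```

Only the (4, "Sam") row with var3=1 survives: the var3=9 row fails the var3<=var4 filter, and the (7, "Ann") pair fails it too.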

Here's how the SAS datastep will do it:

merge dataset1 dataset2 dataset3 ;
by variable1 variable2 ;
if ^( var3<=var4 ) then delete ;

[Actually, the SAS datastep can only do a many-to-one join, but you
can use a PROC SQL paragraph to do an SQL SELECT statement, then pass
the results to a SAS datastep afterwards.]

The point is: there are lots of data preparation scenarios where
large numbers of merges need to be done. This is an example where
Vilno and SAS are easier to use than the competition. I'm sure an Awk
programmer can come up with something, but the result would be
awkward.

You can also find other data preparation problems where the best tool
is Awk. Looking through "Sed & Awk" (O'Reilly) gives a good idea. I'm
no expert Awk-er, to be sure, but I think I can see that Awk and Vilno
are really apples and oranges.

For scanning inconsistently structured ASCII data files, where
different rows have different column specifications, Awk is a better
tool.
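For instance, a file where the first token tags the row type and each type has its own column layout. A hedged sketch (hypothetical file format, and written in Python rather than Awk just for illustration):

```python
# Hypothetical mixed-layout ASCII file: the first field tags the row
# type, and each type has a different column specification.
lines = [
    "H 2007-06-13 trial42",   # header row: date, trial id
    "P 101 Sam 34",           # patient row: id, name, age
    "P 102 Ann 29",
    "M 101 aspirin 50",       # medication row: id, drug, dose
]

patients, meds = [], []
for line in lines:
    fields = line.split()
    if fields[0] == "P":
        patients.append({"id": int(fields[1]), "name": fields[2],
                         "age": int(fields[3])})
    elif fields[0] == "M":
        meds.append({"id": int(fields[1]), "drug": fields[2],
                     "dose": int(fields[3])})

print(len(patients), len(meds))
```

Awk's pattern-action structure expresses exactly this dispatch-on-row-shape style, which is why it wins here.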

For data problems that lend themselves to UNIX-style regular
expressions, Awk, again, is a great tool.
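A classic job of this kind is keeping only the lines whose fields match a pattern. A minimal sketch, with made-up input lines, done here in Python's re module rather than Awk:

```python
import re

# Hypothetical status lines; keep only the completed subjects.
lines = ["101 Sam DROPOUT",
         "102 Ann COMPLETED",
         "103 Joe COMPLETED"]
completed = [ln for ln in lines if re.search(r"\bCOMPLETED\b", ln)]
print(len(completed))
```

In Awk itself this whole program collapses to the one-liner `/COMPLETED/ { print }`.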

If you have a data manipulation problem that is incredibly simple,
then converting an ASCII data file to binary, and then back, may not
seem worth it. Awk, again, wins. But the asciitobinary and
binarytoascii statements (there and back) only take 4 lines or so,
so Vilno is really not that bad.

Certain aspects of Vilno and SAS are a bit more user-friendly:
Each column has a variable name, such as "PatientID".
Awk uses $1, $2, $3 as variable names for columns. Not user-friendly.
In both Vilno and SAS (and SQL) the possibility of "MISSING" ( or
"NULL" ) is built into the data values held in the columns. So you
don't have to use separate boolean variables to track MISSING vs
NOT-MISSING. Very convenient.
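The convenience is easy to see with a built-in missing value. A sketch using Python's None in place of MISSING/NULL, with made-up patient rows:

```python
# With a built-in missing value (None here, standing in for MISSING
# in SAS/Vilno or NULL in SQL), no parallel boolean flag per column
# is needed to track which values were recorded.
rows = [{"PatientID": 101, "weight": 80.5},
        {"PatientID": 102, "weight": None}]   # weight not recorded

# Computations simply skip the missing values:
observed = [r["weight"] for r in rows if r["weight"] is not None]
mean_weight = sum(observed) / len(observed)
print(mean_weight)
```

Without a missing-value convention, every column would need a second "is this present?" column, and every formula would have to consult it.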

Vilno does have a lot of functionality that is much harder to
implement in most other programming languages. (You can implement that
functionality, but it would take a ton of code; the three merge-in
options in Vilno are an example.)

The upshot:

Awk is a hammer.
Vilno is a screwdriver.



More information about the R-help mailing list