[R] Tools For Preparing Data For Analysis

Robert Wilkins irishhacker at gmail.com
Fri Jun 15 02:35:18 CEST 2007


[ Arrggh, not reply, but reply to all; crossing my fingers again. Sorry, Peter! ]

Hmm,

I don't think you need a retain statement.

if first.patientID ;
or
if last.patientID ;

ought to do it.

It's actually better than the Vilno version, I must admit, a bit more concise:

if ( not firstrow(patientID) ) deleterow ;

Ah well.
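
For comparison, here is a rough R sketch of the same first-record / last-record selection, along the lines Peter suggests in the quoted message below. The data frame `visits` and its columns (`patientID`, `visit`, `dose`) are made up for illustration, and the rows are assumed to be already sorted by patient and visit.

```r
# Hypothetical visit-level data, assumed sorted by patient and visit
visits <- data.frame(
  patientID = c(101, 101, 101, 102, 103, 103),
  visit     = c(1, 2, 3, 1, 1, 2),
  dose      = c(10, 20, 20, 15, 5, 10)
)

# First record per patient (the R analogue of "if first.patientID ;")
first_rows <- visits[!duplicated(visits$patientID), ]

# Last record per patient (the R analogue of "if last.patientID ;")
last_rows <- visits[!duplicated(visits$patientID, fromLast = TRUE), ]

# A within-patient functional: running total of dose per patient
visits$cum_dose <- ave(visits$dose, visits$patientID, FUN = cumsum)
```

The `fromLast = TRUE` argument does the same job as the double `rev()` in Peter's version.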

**********************************
For the folks asking for the location of the software (I know I posted it,
but it didn't connect to the thread, and you get a huge number of posts
each day, sorry):

Vilno, find at
http://code.google.com/p/vilno

DAP & PSPP, find at
http://directory.fsf.org/math/stats

Awk, find at lots of places,
http://www.gnu.org/software/gawk/gawk.html

Anything else? DAP & PSPP are hard to find, I'm sure there's more out there!
What about MDX? Nahh, not really the right problem domain.
Nobody uses MDX for this stuff.

******************************************************

If my examples, which use clinical trial data, are boring or hard to
understand for those who asked for examples (and presumably don't work
in clinical trials), let me know. Some of the other examples I'm reading
about are quite interesting.
It doesn't help that clinical trial databases cannot be public. Making
a fake database would take a lot of time.
The irony is that, even with my deep understanding of data preparation in
clinical trials, the pharmas still don't want to give me a job (because
I was gone for many years).

********************************************************
Let's see if this post works: thanks to the folks who gave me advice
on how to properly respond to a post within a thread. (Although the
thread in my gmail account is only a subset of the posts visible in
the archives.) Crossing my fingers ...

On 6/10/07, Peter Dalgaard <p.dalgaard at biostat.ku.dk> wrote:
> Douglas Bates wrote:
> > Frank Harrell indicated that it is possible to do a lot of difficult
> > data transformation within R itself if you try hard enough but that
> > sometimes means working against the S language and its "whole object"
> > view to accomplish what you want and it can require knowledge of
> > subtle aspects of the S language.
> >
> Actually, I think Frank's point was subtly different: It is *because* of
> the differences in view that it sometimes seems difficult to find the
> way to do something in R that is apparently straightforward in SAS.
> I.e. the solutions exist and are often elegant, but may require some
> lateral thinking.
>
> Case in point: Finding the first or the last observation for each
> subject when there are multiple records for each subject. The SAS way
> would be a datastep with IF-THEN-DELETE, and a RETAIN statement so that
> you can compare the subject ID with the one from the previous record,
> working with data that are sorted appropriately.
>
> You can do the same thing in R with a for loop, but there are better
> ways e.g.
> subset(df,!duplicated(ID)), and subset(df, rev(!duplicated(rev(ID)))), or
> maybe
> do.call("rbind",lapply(split(df,df$ID), head, 1)), resp. tail. Or
> something involving aggregate(). (The latter approaches generalize
> better to other within-subject functionals like cumulative doses, etc.).
>
> The hardest cases that I know of are the ones where you need to turn one
> record into many, such as occurs in survival analysis with
> time-dependent, piecewise constant covariates. This may require
> "transposing the problem", i.e. for each  interval you find out which
> subjects contribute and with what, whereas the SAS way would be a
> within-subject loop over intervals containing an OUTPUT statement.
>
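
A minimal sketch of the "transposed" approach described above, in base R. The data frame `subjects`, its columns, and the cut points in `cuts` are all hypothetical; the survival package's survSplit() covers similar ground and is probably the better choice for real work.

```r
# Hypothetical data: one row per subject
subjects <- data.frame(
  id     = c(1, 2, 3),
  time   = c(5, 12, 25),  # follow-up time
  status = c(1, 0, 1)     # 1 = event, 0 = censored
)
cuts <- c(0, 10, 20, Inf) # interval boundaries, assumed for illustration

# For each interval, find which subjects contribute and with what
pieces <- lapply(seq_len(length(cuts) - 1), function(k) {
  lo <- cuts[k]
  hi <- cuts[k + 1]
  at_risk <- subjects[subjects$time > lo, ]
  data.frame(
    id       = at_risk$id,
    interval = rep(k, nrow(at_risk)),
    start    = rep(lo, nrow(at_risk)),
    stop     = pmin(at_risk$time, hi),
    # the event is counted only in the interval where follow-up ends
    status   = ifelse(at_risk$time <= hi, at_risk$status, 0)
  )
})

# Stack the per-interval pieces: one row per subject per interval at risk
long <- do.call("rbind", pieces)
long <- long[order(long$id, long$interval), ]
```

Each subject ends up with as many rows as intervals in which they were at risk, which is the one-record-into-many step without a within-subject loop.
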
> Also, there are some really weird data formats, where e.g. the input
> format is different in different records. Back in the 80's, when
> punched-card input was still common, it was quite popular to have one
> card with background information on a patient plus several cards
> detailing visits, and you'd get a stack of cards containing both kinds.
> In R you would most likely split on the card type using grep() and then
> read the two kinds separately and merge() them later.
>
>
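
The mixed-card case can be sketched in R roughly as follows. The card layout is invented for illustration: a leading "P" marks a patient background card (id, sex, age) and a leading "V" marks a visit card (id, visit number, a measurement called `sbp` here).

```r
# Hypothetical "card deck" mixing two record layouts
cards <- c(
  "P 101 M 54",
  "V 101 1 120",
  "V 101 2 118",
  "P 102 F 61",
  "V 102 1 135"
)

# Split on card type with grep()
patient_lines <- grep("^P", cards, value = TRUE)
visit_lines   <- grep("^V", cards, value = TRUE)

# Read each kind separately with its own column layout
patients <- read.table(
  text = patient_lines,
  col.names  = c("type", "id", "sex", "age"),
  colClasses = c("character", "integer", "character", "integer")
)
visit_recs <- read.table(
  text = visit_lines,
  col.names  = c("type", "id", "visit", "sbp"),
  colClasses = c("character", "integer", "integer", "integer")
)

# Drop the card-type column and merge background info onto the visits
merged <- merge(patients[, -1], visit_recs[, -1], by = "id")
```

With a real file, `cards` would come from readLines() on that file instead of being typed in.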