[R] [r] How to pick columns from a ragged array?

PIKAL Petr petr.pikal at precheza.cz
Tue Oct 23 13:49:17 CEST 2012


Hi

I did not check your code but followed your explanation instead. BTW, thanks for the test data.

A small change to the data frame so that DATE is stored as Date class:

datum<-as.Date(as.character(DATE), format="%Y%m%d")
id.d <- data.frame(ID,datum )

Order by date:

id.d<-id.d[order(id.d$datum),]
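
As a minimal, self-contained sketch of these two steps, using only the first two IDs of the test data from the original post (ID, DATE, datum and id.d are the names used above):

```r
# Subset of the test data from the original post
ID   <- c(58, 58, 58, 58, 167, 167)
DATE <- c(20060821, 20061207, 20080102, 20090904, 20040205, 20040323)

# Convert yyyymmdd numbers to Date class via their character representation
datum <- as.Date(as.character(DATE), format = "%Y%m%d")
id.d  <- data.frame(ID, datum)

# Order by date; the row names keep the original row numbers
id.d <- id.d[order(id.d$datum), ]
```

Note that after ordering, id.d[5, ] is the fifth row in date order, while id.d["5", ] is the row that was fifth in the original data.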


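Two functions to test whether the first two dates, or the last two dates, of one ID are the same. Note that for a data frame length(x) is the number of columns, so the last row has to be found with nrow(x). IDs with a single row give NA.

testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) if (nrow(x) < 2) NA else x[nrow(x),2]==x[nrow(x)-1,2]

Change one ID's last two dates to be the same (rows 35 and 36 of the original data, the last two appointments of ID 910). The data frame has been reordered, so address the rows by name:

id.d["35",2]<-id.d["36",2]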

and here are the results:

sapply(split(id.d, id.d$ID), testlast)
   58   167   323   547   794   814   841   910   999  1019 
FALSE FALSE FALSE    NA    NA FALSE FALSE  TRUE    NA FALSE 

sapply(split(id.d, id.d$ID), testfirst)
   58   167   323   547   794   814   841   910   999  1019 
FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE    NA FALSE

Now you can select the IDs for which the test is TRUE
which(sapply(split(id.d, id.d$ID), testlast))

and use them to subset your data frame (negate the result with ! to remove those rows). With %in% instead of == this also works when more than one ID is flagged:
id.d$ID %in% as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
[37]  TRUE  TRUE  TRUE  TRUE
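
Put together, the removal step might look like the sketch below. It runs on hypothetical toy data, not your real data; the names groups, flag, bad.ids and clean are mine, and the guarded test functions are just one way to write them.

```r
# Hypothetical toy data: ID 2 has a duplicated last date, ID 3 a duplicated first date
id.d <- data.frame(
  ID    = c(1, 1, 2, 2, 2, 3, 3, 3, 4),
  datum = as.Date(c("2001-01-01", "2001-02-01",
                    "2002-01-01", "2002-03-01", "2002-03-01",
                    "2003-01-01", "2003-01-01", "2003-05-01",
                    "2004-01-01"))
)
id.d <- id.d[order(id.d$ID, id.d$datum), ]

# TRUE when the two earliest / two latest rows of one ID share a date,
# NA for IDs with a single row
testfirst <- function(x) if (nrow(x) < 2) NA else x$datum[1] == x$datum[2]
testlast  <- function(x) {
  n <- nrow(x)
  if (n < 2) NA else x$datum[n] == x$datum[n - 1]
}

groups <- split(id.d, id.d$ID)
flag   <- sapply(groups, testfirst) | sapply(groups, testlast)

# which() ignores NA, so single-appointment IDs are kept
bad.ids <- as.numeric(names(which(flag)))
clean   <- id.d[!(id.d$ID %in% bad.ids), ]
```

Here clean keeps IDs 1 and 4 only. Whether IDs with a single appointment should stay is your decision; in this sketch they do, since their first and last diagnosis is unambiguous.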

However I am not sure if this is exactly what you want.
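
With ~1 million rows, split() plus sapply() over data frames may be slow. An alternative, which I have not timed on data of that size and offer only as a sketch, is to count with ave() how many rows of each ID fall on that ID's first or last date. The toy data and the names d, first, last, n.first, n.last, bad and clean are assumptions for illustration.

```r
# Hypothetical toy data: ID 2 has a duplicated last date, ID 3 a duplicated first date
id.d <- data.frame(
  ID    = c(1, 1, 2, 2, 2, 3, 3, 3, 4),
  datum = as.Date(c("2001-01-01", "2001-02-01",
                    "2002-01-01", "2002-03-01", "2002-03-01",
                    "2003-01-01", "2003-01-01", "2003-05-01",
                    "2004-01-01"))
)

# First and last date of each row's ID, repeated for every row of that ID
d     <- as.numeric(id.d$datum)
first <- ave(d, id.d$ID, FUN = min)
last  <- ave(d, id.d$ID, FUN = max)

# Number of rows of each ID that sit on its first / last date
n.first <- ave(as.numeric(d == first), id.d$ID, FUN = sum)
n.last  <- ave(as.numeric(d == last),  id.d$ID, FUN = sum)

# Drop every row of an ID with more than one row on its first or last date
bad   <- n.first > 1 | n.last > 1
clean <- id.d[!bad, ]
```

This does not need the data to be sorted, and duplicated dates in the middle of a patient's history are left alone.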

Regards
Petr

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of Stuart Leask
> Sent: Tuesday, October 23, 2012 11:38 AM
> To: r-help at r-project.org
> Subject: [R] [r] How to pick columns from a ragged array?
> 
> I have a large dataset (~1 million rows) of three variables: ID
> (patient's name), DATE (of appointment) and DIAGNOSIS (given on that
> date).
> Patients may have been assigned more than one diagnosis at any one
> appointment - leading to two rows, same ID and DATE but different
> DIAGNOSIS.
> The diagnoses may change between appointments.
> 
> I want to subset the data in two ways:
> 
> - define groups of patients by the first diagnosis given
> - define groups of patients by the last diagnosis given.
> 
> The problem:
> Unfortunately, a small number of patients have been given more than one
> diagnosis at their first (or last) appointment. These individuals I
> need to identify and remove, as it's not possible to say uniquely what
> their first (or last) diagnosis was. So I need to identify and remove
> these individuals which have pairs of rows with the same ID and (lowest
> or highest) DATE. The size of the dataset precludes the option of doing
> this by eye.
> 
> I suspect there is a very elegant way of doing this in R.
> 
> This is what I've come up with:
> 
> 
> - Sort by DATE then ID
> - Make a ragged array of DATE by ID
> - Remove IDs that only occur once.
> - Subtract the first and second DATEs. Remove IDs for which this = zero,
>   as this will only be true for IDs for which the appointment is recorded
>   twice (because there were two diagnoses recorded on this date).
> - (Then do the same to get the 'last appointment' duplicates, by
>   reversing the initial sort by DATE.)
> 
> I am stuck at the 'Subtract dates' step: I would like to get the data
> out of the ragged array by columns (so e.g. I end up with a matrix of
> ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from
> the ragged array.
> 
> I hope someone can help. My ugly code is below, with some data for
> testing.
> 
> 
> Stuart
> 
> 
> Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer
> and Honorary Consultant Psychiatrist Institute of Mental Health,
> Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44
> 115 82 30419
> stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
> Google 'Dr Stuart Leask'
> 
> 
> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> ,547,794,814,814,814,814,814,814,841,841,841,841,841
> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> ,1019)
> 
> DATE <-
> c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
> ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
> ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
> ,20050421,20060130,20060428,20060602,20060816,20061025,20061129
> ,20070112,20070514,20091105,20091117,20091119,20091120,20091210
> ,20091224,20050503,19870508,19880223,19880330)
> 
> id.d <- cbind(ID, DATE)
> # create ragged array, 1-n DATES for every NAME
> rag.a <- split(id.d[, 2], id.d[, 1])
> 
> # Inelegant attempt to remove IDs that only have one entry:
> 
> # add up the dates per ID
> rag.s <- tapply(id.d[, 2], id.d[, 1], sum)
> # Since DATE is in 'year mo da', if there's only one date, sum will be
> # less than 21000000:
> rag.t <- rag.s[rag.s > 21000000]
> # all the IDs with >1 date
> multi.dates <- rownames(rag.t)
> # rag.am only has IDs with > 1 Date
> rag.am <- rag.a[multi.dates]
> 
> 
> # But now I'm stuck.
> # Each row of the array is rag.am$ID.
> # So I can't pick columns of DATEs from the ragged array.
> 
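
For the question as literally asked (getting the dates out of the ragged array by column), a minimal sketch follows. It assumes the dates within each ID are already sorted, as they are in the test data; the names first.date, second.date and out are made up for illustration.

```r
# A few rows of the test data from the original post
ID   <- c(58, 58, 58, 58, 167, 167, 814, 814, 814)
DATE <- c(20060821, 20061207, 20080102, 20090904,
          20040205, 20040323,
          20020814, 20021125, 20040429)

# Ragged array: a list with one vector of dates per ID
rag.a <- split(DATE, ID)

# "Column" k of the ragged array is the k-th element of every list entry;
# `[` returns NA where an ID has fewer than k dates
first.date  <- sapply(rag.a, `[`, 1)
second.date <- sapply(rag.a, `[`, 2)

# Matrix with one row per ID: ID, 1st DATE, 2nd DATE
out <- cbind(ID = as.numeric(names(rag.a)), first.date, second.date)
```

The last date of each ID can be picked the same way with sapply(rag.a, function(d) d[length(d)]).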