[R] [r] How to pick columns from a ragged array?

Tue Oct 23 13:20:59 CEST 2012

Hello,

Thinking again, if you just want the rows where the first/last DATE of 
an ID is repeated, the following function does the job. Since there were 
no such cases in your data example, I've added 3 rows to the dataset.

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)

id.d <- cbind(ID, DATE)

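getRepeat <- function(x, first = TRUE){
     # x: two-column matrix (ID, DATE), assumed sorted by DATE within each ID
     fun <- if(first) head else tail
     sp <- split(data.frame(x), x[,1])  # one data frame per ID
     # first (or last) DATE of each ID
     first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
     # for each ID, flag the rows that fall on that DATE
     lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
     n <- unlist(lapply(lst, sum))
     # keep only the IDs where that DATE occurs more than once
     sp1 <- sp[n > 1]
     i1 <- lst[n > 1]
     lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ])
}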

getRepeat(id.d)  # defaults to first = TRUE
getRepeat(id.d, first = FALSE)  # to get the last ones
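
If the goal is to drop those patients altogether, as in the original 
question, a vectorised variant built on ave() might be quicker on ~1 
million rows. This is only a sketch and I have not timed it on data of 
that size. The name dropAmbiguous is made up for illustration, and it 
uses min/max instead of head/tail, so it does not assume the rows are 
sorted by DATE within ID.

```r
# Hypothetical helper (illustration only): drop every ID whose earliest
# (first = TRUE) or latest (first = FALSE) DATE occurs more than once.
dropAmbiguous <- function(x, first = TRUE){
     fun <- if(first) min else max
     # earliest/latest DATE of each ID, repeated on every row of that ID
     edge <- ave(x[, 2], x[, 1], FUN = fun)
     # number of rows of each ID that fall on that DATE
     n.edge <- ave(as.numeric(x[, 2] == edge), x[, 1], FUN = sum)
     x[n.edge == 1, , drop = FALSE]
}

# Made-up data: ID 2 repeats its first DATE, ID 3 repeats its last DATE
toy.id <- c(1, 1, 2, 2, 2, 3, 3, 3)
toy.date <- c(20060821, 20061207, 20040429, 20040429, 20071205,
              20091105, 20091224, 20091224)
toy <- cbind(ID = toy.id, DATE = toy.date)

keep.first <- dropAmbiguous(toy)                 # ID 2 is removed
keep.last  <- dropAmbiguous(toy, first = FALSE)  # ID 3 is removed
```

The result keeps the matrix layout of the input, so it can be passed on 
to whatever grouping by diagnosis comes next.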

Hope this helps,

Rui Barradas

On 23-10-2012 10:59, Rui Barradas wrote:
> Hello,
>
> I'm not sure I understand it well. In the solution below, the only 
> ID returned is 814, but its repeated DATE is neither the first nor the 
> last.
>
> how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
> id.d[how.many > 1, ]
>
> See the help page ?ave if the repetition of id.d[,1] is confusing. 
> The first is the vector to average (to apply FUN to) and the second is 
> one of the two vectors defining the groups.
>
> Hope this helps,
>
> Rui Barradas
> On 23-10-2012 10:37, Stuart Leask wrote:
>> I have a large dataset (~1 million rows) of three variables: ID 
>> (patient's name), DATE (of appointment) and DIAGNOSIS (given on that 
>> date).
>> Patients may have been assigned more than one diagnosis at any one 
>> appointment - leading to two rows, same ID and DATE but different 
>> DIAGNOSIS.
>> The diagnoses may change between appointments.
>>
>> I want to subset the data in two ways:
>>
>> - define groups of patients by the first diagnosis given
>>
>> - define groups of patients by the last diagnosis given.
>>
>> The problem:
>> Unfortunately, a small number of patients have been given more than 
>> one diagnosis at their first (or last) appointment. These individuals 
>> I need to identify and remove, as it's not possible to say uniquely 
>> what their first (or last) diagnosis was. So I need to identify and 
>> remove these individuals which have pairs of rows with the same ID 
>> and (lowest or highest) DATE. The size of the dataset precludes the 
>> option of doing this by eye.
>>
>> I suspect there is a very elegant way of doing this in R.
>>
>> This is what I've come up with:
>>
>>
>> - Sort by DATE then ID
>>
>> - Make a ragged array of DATE by ID
>>
>> - Remove IDs that only occur once.
>>
>> - Subtract the first and second DATEs. Remove IDs for which 
>> this = zero, as this will only be true for IDs for which the 
>> appointment is recorded twice (because there were two diagnoses 
>> recorded on this date).
>>
>> - (Then do the same to get the 'last appointment' 
>> duplicates, by reversing the initial sort by DATE.)
>>
>> I am stuck at the 'Subtract dates' step: I would like to get the data 
>> out of the ragged array by columns (so e.g. I end up with a matrix of 
>> ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from 
>> the ragged array.
>>
>> I hope someone can help. My ugly code is below, with some data for 
>> testing.
>>
>>
>> Stuart
>>
>>
>> Dr Stuart John Leask DM FRCPsych MB BChir MA
>> Clinical Senior Lecturer and Honorary Consultant Psychiatrist
>> Institute of Mental Health, Innovation Park
>> Triumph Road, Nottingham, Notts. NG7 2TU. UK
>> Tel. +44 115 82 30419 
>> stuart.leask at nottingham.ac.uk
>> Google 'Dr Stuart Leask'
>>
>>
>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>> ,547,794,814,814,814,814,814,814,841,841,841,841,841
>> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>> ,1019)
>>
>> DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
>> ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>> ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>> ,20050421,20060130,20060428,20060602,20060816,20061025,20061129
>> ,20070112,20070514,20091105,20091117,20091119,20091120,20091210
>> ,20091224,20050503,19870508,19880223,19880330)
>>
>> id.d <- cbind (ID,DATE )
>> # create ragged array, 1-n DATES for every NAME
>> rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])
>>
>> # Inelegant attempt to remove IDs that only have one entry:
>>
>> rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum) # add up the dates per ID
>> # Since DATE is in 'year mo da', if there's only one date, sum will 
>> # be less than 21000000:
>> rag.t <- rag.s [ rag.s > 21000000 ]
>> # all the IDs with >1 date
>> multi.dates <- rownames ( rag.t )
>> # rag.am only has IDs with > 1 Date
>> rag.am <- rag.a [ multi.dates ]
>>
>>
>> # But now I'm stuck.
>> # Each row of the array is rag.am$ID.
>> # So I can't pick columns of DATEs from the ragged array.
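
On the narrower question of taking "columns" out of the ragged array: 
split() returns a list, so one possible approach is to index every 
element in turn with sapply(). A sketch on made-up data, assuming the 
DATEs are already sorted within each ID. The helper name nth is 
hypothetical. Indexing past the end of a vector gives NA, so IDs with 
a single DATE need no special treatment.

```r
toy.id <- c(1, 1, 1, 2, 3, 3)
toy.date <- c(20060821, 20061207, 20080102, 20041005, 20040429, 20040429)
rag <- split(toy.date, toy.id)   # ragged array: one DATE vector per ID

# hypothetical helper: k-th DATE of every ID (NA if an ID has fewer than k)
nth <- function(lst, k) sapply(lst, function(v) v[k])

# matrix of ID, 1st DATE, 2nd DATE
first.two <- cbind(ID = as.numeric(names(rag)),
                   DATE1 = nth(rag, 1),
                   DATE2 = nth(rag, 2))

# IDs whose first two DATEs coincide; which() silently drops the NA rows
same.first <- first.two[which(first.two[, "DATE1"] == first.two[, "DATE2"]), "ID"]
```

Here only ID 3 has two identical leading DATEs, so it is the one that 
ends up in same.first.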
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.