[R] Making tapply code more efficient

jim holtman jholtman at gmail.com
Fri Feb 27 19:59:21 CET 2009


On something the size of your data, it took about 30 seconds to
determine the number of unique teachers per student:

> x <- cbind(sample(326397, 800967, TRUE), sample(20, 800967, TRUE))
> # split the data so you have the number of teachers per student
> system.time(t.s <- split(x[,2], x[,1]))
   user  system elapsed
   0.92    0.01    0.94
> t.s[1:7]  # sample data
$`1`
[1] 16

$`2`
[1] 3

$`3`
[1] 1

$`4`
[1] 17

$`6`
[1]  9  9 19

$`7`
[1] 20

$`9`
[1]  3 16 16 10  8 17

> # count number of unique teachers per student
> system.time(t.a <- sapply(t.s, function(x) length(unique(x))))
   user  system elapsed
  20.17    0.10   20.26
>
> t.a[1:10]
 1  2  3  4  6  7  9 10 11 12
 1  1  1  1  2  1  5  1  1  1
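
From those counts, whether a student always had the same teacher is just a
comparison against 1. A small follow-on sketch (not run or timed here),
reusing the t.a vector computed above:

# a student always had the same teacher exactly when the unique count is 1
t.same <- t.a == 1
# number of students who appear with more than one teacher
sum(!t.same)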


On Fri, Feb 27, 2009 at 9:46 AM, Doran, Harold <HDoran at air.org> wrote:
> Previously, I posed the question pasted down below to the list and
> received some very helpful responses. While the code suggestions
> provided in response indeed work, they seem to work only with *very*
> small data sets, so I wanted to follow up and see if anyone had ideas
> for better efficiency. I was quite embarrassed by this, as our SAS
> programmers cranked out programs that did this in the blink of an eye
> (with a few variables), while R spun for days on my Ubuntu machine and
> ultimately I saw a message that R had been "killed".
>
> The data I am working with has 800967 total rows and 31 total columns.
> The ID variable I use as the index variable in tapply() has 326397
> unique cases.
>
>> length(unique(qq$student_unique_id))
> [1] 326397
>
> To give a sense of what my data look like and the actual problem,
> consider the following:
>
> qq <- data.frame(student_unique_id = factor(c(1,1,2,2,2)),
>                  teacher_unique_id = factor(c(10,10,20,20,25)))
>
> This is a student achievement database where students occupy multiple
> rows in the data and the variable teacher_unique_id denotes the class
> the student was in. What I am doing is looking to see if the teacher is
> the same for each instance of the unique student ID. So, if I implement
> the following:
>
> same <- function(x) length( unique(x) ) == 1
> results <- data.frame(
>        freq = tapply(qq$student_unique_id, qq$student_unique_id, length),
>        tch = tapply(qq$teacher_unique_id, qq$student_unique_id, same)
> )
>
> I get the following results. I can see that student 1 appears in the
> data twice and the teacher is always the same. However, student 2
> appears three times and the teacher is not always the same.
>
>> results
>  freq   tch
> 1    2  TRUE
> 2    3 FALSE
>
> Now, applying this same procedure to a large data set with the
> characteristics described above is where this implementation becomes
> problematic.
>
> Does anyone have thoughts on how this could be made efficient enough to
> run on data as large as I described?
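
One vectorized possibility (an untested sketch, not something proposed in
the thread) is to count distinct student/teacher pairs with duplicated()
instead of calling a function once per student:

# keep one row per distinct student/teacher pair
first.pair <- !duplicated(qq[, c("student_unique_id", "teacher_unique_id")])
# distinct teachers per student, ordered by the levels of student_unique_id
n.teachers <- table(qq$student_unique_id[first.pair])

results <- data.frame(
    freq = as.vector(table(qq$student_unique_id)),  # rows per student
    tch  = as.vector(n.teachers) == 1,              # TRUE if only one teacher
    row.names = names(n.teachers)
)

Both table() calls are ordered by the factor levels of student_unique_id,
so the freq and tch columns line up student by student.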
>
> Harold
>
>> sessionInfo()
> R version 2.8.1 (2008-12-22)
> x86_64-pc-linux-gnu
>
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;
> LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;
> LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
>
>
>
> ##### Original question posted on 1/13/09
> Suppose I have a dataframe as follows:
>
> dat <- data.frame(id = c(1,1,2,2,2), var1 = c(10,10,20,20,25),
>                   var2 = c('foo', 'foo', 'foo', 'foobar', 'foo'))
>
> Now, if I were to subset by id, such as:
>
>> subset(dat, id==1)
>  id var1 var2
> 1  1   10  foo
> 2  1   10  foo
>
> I can see that the elements in var1 are exactly the same and the
> elements in var2 are exactly the same. However,
>
>> subset(dat, id==2)
>  id var1   var2
> 3  2   20    foo
> 4  2   20 foobar
> 5  2   25    foo
>
> shows that the elements are not the same for either variable in this
> instance. So, what I am looking to create is a data frame like this:
>
> id      freq    var1    var2
> 1       2       TRUE    TRUE
> 2       3       FALSE   FALSE
>
> Here freq is the number of times the ID is repeated in the dataframe. A
> TRUE appears in the cell if all elements in the column are the same for
> that ID, and FALSE otherwise. For my problem it does not matter which
> values differ.
>
> The way I am thinking about tackling this is to loop through the ID
> variable and compare the values in the various columns of the dataframe.
> The problem I am encountering is that I don't think all.equal or
> identical are the right functions in this case.
>
> So, say I wanted to compare the elements of var1 for id == 1. I would
> have
>
> x <- c(10,10)
>
> Of course, the following works
>
>> all.equal(x[1], x[2])
> [1] TRUE
>
> As would a similar call to identical. However, what if I only have a
> vector of values (or a column of names) that I want to assess for
> equality while automating the process over thousands of cases? As in
> the example above, the vector may contain only two values or many more,
> and the number of values in the vector differs by id.
>
> Any thoughts?
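
For what it is worth, one way to build exactly that frame (a sketch using
the same length(unique(x)) == 1 idea as the follow-up message above) is to
test each column separately with tapply():

# TRUE when every element of x is identical
all.same <- function(x) length(unique(x)) == 1

out <- data.frame(
    id   = names(table(dat$id)),               # the unique ids
    freq = as.vector(table(dat$id)),           # times each id appears
    var1 = tapply(dat$var1, dat$id, all.same),
    var2 = tapply(dat$var2, dat$id, all.same)
)
out  # should reproduce the id/freq/var1/var2 table described above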
>
> Harold
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?



