[R] Speeding up code?

Ignacio Martinez ignacio82 at gmail.com
Thu Jul 16 19:49:45 CEST 2015


Thank Jim!

This makes a huge difference. Can you explain why are data frame slower
than a matrix? Any other suggestions on how to improve the code would be
greatly appreciated.

Thanks again!

Ignacio

On Thu, Jul 16, 2015 at 1:42 PM jim holtman <jholtman at gmail.com> wrote:

> Actually looking at the result, you don't need the transpose; that was an
> artifact of how you were doing it before.
>
>  xm <- do.call(rbind, str_split(string = AllpairsTmp, pattern = "-"))
>  # convert to dataframe and do transpose on matrix and not dataframe
>  separoPairs <- as.data.frame((xm), stringsAsFactors = FALSE)
>
>
>
>
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.
>
> On Thu, Jul 16, 2015 at 1:37 PM, jim holtman <jholtman at gmail.com> wrote:
>
>> Here is one improvement.  Avoid dataframes in some of these cases.  This
>> create a character matrix and then converts to a dataframe after doing the
>> transpose of the matrix.  This just takes less than 10 seconds on my system:
>>
>>
>> >  library(stringr)
>> >  # create character matrix; avoid dataframes in this case
>> >  print(proc.time())
>>    user  system elapsed
>>   15.52    5.24  587.70
>> >  xm <- do.call(rbind, str_split(string = AllpairsTmp, pattern = "-"))
>> >  # convert to dataframe and do transpose on matrix and not dataframe
>> >  separoPairs <- as.data.frame(t(xm), stringsAsFactors = FALSE)
>> >  print(proc.time()
>> +
>> + )
>>    user  system elapsed
>>   20.90    5.36  596.57
>> >
>>
>>
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>>
>> On Thu, Jul 16, 2015 at 7:56 AM, Ignacio Martinez <ignacio82 at gmail.com>
>> wrote:
>>
>>> Hi Collin,
>>>
>>> The objective of the gen.names function is to generate N *unique *random
>>> names, where N is a *large *number. In my computer `gen.names(n = 50000)`
>>> takes under a second, so is probably not the root problem in my code.
>>> That
>>> said, I would love to improve it. I'm not exactly sure how you propose to
>>> change it using sample. What is the object that I would be sampling? I
>>> would love to run a little benchmark to compare my version with yours.
>>>
>>> What really takes a long time to run is:
>>>
>>>     separoPairs <- as.data.frame(str_split(string = AllpairsTmp, pattern
>>> =
>>> "-"))
>>>
>>> So that and the chunk of code before that is probably where I would get
>>> big
>>> gains in speed. Sadly, I have no clue how to do it differently
>>>
>>> Thanks a lot for the help!!
>>>
>>> Ignacio
>>>
>>>
>>> On Wed, Jul 15, 2015 at 11:34 PM Collin Lynch <cflynch at ncsu.edu> wrote:
>>>
>>> > Hi Ignacio, If I am reading your code correctly then the top while
>>> loop is
>>> > essentially seeking to select a random set of names from the original
>>> set,
>>> > then using unique to reduce it down, you then iterate until you have
>>> built
>>> > your quota.  Ultimately this results in a very inefficient attempt at
>>> > sampling without replacement.  Why not just sample without replacement
>>> > rather than loop iteratively and use unique?  Or if the set of possible
>>> > names are short enough why not just randomize it and then pull the
>>> first n
>>> > items off?
>>> >
>>> >     Best,
>>> >     Collin.
>>> >
>>> > On Wed, Jul 15, 2015 at 11:15 PM, Ignacio Martinez <
>>> ignacio82 at gmail.com>
>>> > wrote:
>>> >
>>> >> Hi R-Help!
>>> >>
>>> >> I'm hoping that some of you may give me some tips that could make my
>>> code
>>> >>
>>> > more efficient. More precisely, I would like to make the answer to my
>>> >> stakoverflow
>>> >> <
>>> >>
>>> http://stackoverflow.com/questions/31137940/randomly-assign-teachers-to-classrooms-imposing-restrictions
>>>
>>> >> >
>>> >
>>> >
>>> >> question more efficient.
>>> >>
>>> >> This is the code:
>>> >>
>>> >> library(dplyr)
>>> >> library(randomNames)
>>> >> library(geosphere)
>>> >>
>>> > set.seed(7142015)# Define Parameters
>>> >
>>> >
>>> >> n.Schools <- 20
>>> >> first.grade<-3
>>> >> last.grade<-5
>>> >> n.Grades <-last.grade-first.grade+1
>>> >> n.Classrooms <- 20 # THIS IS WHAT I WANTED TO BE ABLE TO CHANGE
>>> >> n.Teachers <- (n.Schools*n.Grades*n.Classrooms)/2 #Two classrooms per
>>> >> teacher
>>> >> # Define Random names function:
>>> >> gen.names <- function(n, which.names = "both", name.order =
>>> "last.first"){
>>> >>   names <- unique(randomNames(n=n, which.names = which.names,
>>> >> name.order = name.order))
>>> >>   need <- n - length(names)
>>> >>   while(need>0){
>>> >>     names <- unique(c(randomNames(n=need, which.names = which.names,
>>> >> name.order = name.order), names))
>>> >>     need <- n - length(names)
>>> >>   }
>>> >>   return(names)}
>>> >> # Generate n.Schools names
>>> >> gen.schools <- function(n.schools) {
>>> >>   School.ID <-
>>> >>     paste0(gen.names(n = n.schools, which.names = "last"), ' School')
>>> >>   School.long <- rnorm(n = n.schools, mean = 21.7672, sd = 0.025)
>>> >>   School.lat <- rnorm(n = n.schools, mean = 58.8471, sd = 0.025)
>>> >>   School.RE <- rnorm(n = n.schools, mean = 0, sd = 1)
>>> >>   Schools <-
>>> >>     data.frame(School.ID, School.lat, School.long, School.RE) %>%
>>> >>     mutate(School.ID = as.character(School.ID)) %>%
>>> >>     rowwise() %>%  mutate (School.distance = distHaversine(
>>> >>       p1 = c(School.long, School.lat),
>>> >>       p2 = c(21.7672, 58.8471), r = 3961
>>> >>     ))
>>> >>   return(Schools)}
>>> >>
>>> >> Schools <- gen.schools(n.schools = n.Schools)
>>> >> # Generate Grades
>>> >> Grades <- c(first.grade:last.grade)
>>> >> # Generate n.Classrooms
>>> >>
>>> >> Classrooms <- LETTERS[1:n.Classrooms]
>>> >> # Group schools and grades
>>> >>
>>> >> SchGr <- outer(paste0(Schools$School.ID, '-'), paste0(Grades, '-'),
>>> >> FUN="paste")#head(SchGr)
>>> >> # Group SchGr and Classrooms
>>> >>
>>> >> SchGrClss <- outer(SchGr, paste0(Classrooms, '-'),
>>> >> FUN="paste")#head(SchGrClss)
>>> >> # These are the combination of  School-Grades-Classroom
>>> >> SchGrClssTmp <- as.matrix(SchGrClss, ncol=1, nrow=length(SchGrClss) )
>>> >> SchGrClssEnd <- as.data.frame(SchGrClssTmp)
>>> >> # Assign n.Teachers (2 classroom in a given school-grade)
>>> >> Allpairs <- as.data.frame(t(combn(SchGrClssTmp, 2)))
>>> >> AllpairsTmp <- paste(Allpairs$V1, Allpairs$V2, sep=" ")
>>> >>
>>> >> library(stringr)
>>> >> separoPairs <- as.data.frame(str_split(string = AllpairsTmp, pattern =
>>> >> "-"))
>>> >> separoPairs <- as.data.frame(t(separoPairs))
>>> >> row.names(separoPairs) <- NULL
>>> >> separoPairs <- separoPairs %>% select(-V7)  %>%  #Drops empty column
>>> >>   mutate(V1=as.character(V1), V4=as.character(V4), V2=as.numeric(V2),
>>> >> V5=as.numeric(V5)) %>% mutate(V4 = trimws(V4, which = "both"))
>>> >>
>>> >> separoPairs[120,]$V4#Only the rows with V1=V4 and V2=V5 are valid
>>> >
>>> >
>>> >> validPairs <- separoPairs %>% filter(V1==V4 & V2==V5) %>% select(V1,
>>> V2,
>>> >> V3, V6)
>>> >> # Generate n.Teachers
>>> >>
>>> >> gen.teachers <- function(n.teachers){
>>> >>   Teacher.ID <- gen.names(n = n.teachers, name.order = "last.first")
>>> >>   Teacher.exp <- runif(n = n.teachers, min = 1, max = 30)
>>> >>   Teacher.Other <- sample(c(0,1), replace = T, prob = c(0.5, 0.5),
>>> >> size = n.teachers)
>>> >>   Teacher.RE <- rnorm(n = n.teachers, mean = 0, sd = 1)
>>> >>   Teachers <- data.frame(Teacher.ID, Teacher.exp, Teacher.Other,
>>> >> Teacher.RE)
>>> >>   return(Teachers)}
>>> >> Teachers <- gen.teachers(n.teachers = n.Teachers) %>%
>>> >>   mutate(Teacher.ID = as.character(Teacher.ID))
>>> >> # Randomly assign n.Teachers teachers to the "ValidPairs"
>>> >> TmpAssignments <- validPairs[sample(1:nrow(validPairs), n.Teachers), ]
>>> >> Assignments <- cbind.data.frame(Teachers$Teacher.ID, TmpAssignments)
>>> >> names(Assignments) <- c("Teacher.ID", "School.ID", "Grade", "Class_1",
>>> >> "Class_2")
>>> >> # Tidy Data
>>> >> library(tidyr)
>>> >> TeacherClassroom <- Assignments %>%
>>> >>   gather(x, Classroom, Class_1,Class_2) %>%
>>> >>   select(-x) %>%
>>> >>   mutate(Teacher.ID = as.character(Teacher.ID))
>>> >> # Merge
>>> >> DF_Classrooms <- TeacherClassroom %>% full_join(Teachers,
>>> >> by="Teacher.ID") %>% full_join(Schools, by="School.ID")
>>> >> rm(list=setdiff(ls(), "DF_Classrooms")) # Clean the work space!
>>> >>
>>> >> *I want to end up with the same*  'DF_Classrooms *data frame* but
>>> getting
>>> >
>>> >
>>> >> there in a more efficient way. In particular, when is use n.Classrooms
>>> >> <-4 the
>>> >>
>>> > code run fast, but *if I increase it to something like 20 it is
>>> painfully
>>> >> slow.*
>>> >>
>>> >> Thanks!!!
>>> >>
>>> >>         [[alternative HTML version deleted]]
>>> >>
>>> >> ______________________________________________
>>> >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>>> >> PLEASE do read the posting guide
>>> >> http://www.R-project.org/posting-guide.html
>>> >> and provide commented, minimal, self-contained, reproducible code.
>>> >>
>>> >
>>> >
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>

	[[alternative HTML version deleted]]



More information about the R-help mailing list