[R] speeding up likelihood computation

Mon Dec 3 19:29:57 CET 2007

Jim and others,

As far as I can see the computation of certain conditional probabilities
(required for computing my likelihood function) is what is slowing down
the evaluation of the likelihood function. Let me explain what these
conditional probabilities are and how I am (no doubt inefficiently)
calculating them. I apologize for the long post but I could not explain
the whole thing without some detailed examples, etc. 

For every family, we are given a completed birth sequence (call it S_i)
and the desired maximum number of children (call it N_i); for instance
S_i might be GBB (where G stands for a girl and B stands for a boy) and
N_i might be 4. For each family, we want to compute the the probability
of observing the birth sequence, S_i, given that the family is
"targeting" sons. Since, a priori, we do not know the desired target
(for sons) for family i, we need to allow for all the feasible
possibilities. So, when a family states that the maximum number of
children that it desires is N_i, we need to allow for the possibilities
that the family targets 1 son, 2 sons, ... , (N_i-1) sons. Of course the
actual birth sequence might assign zero probability to some of these
possibilities; but we cannot rule out any of these a priori. 

Let us denote a target for sons as k_i. So, for family i with birth
sequence S_i and desired maximum number of children N_i, we need to
compute the following (N_i-1) conditional probabilities: P(S_i|N_i,
k_i=1, T_i=1), P(S_i|N_i, k_i=2, T_i=1), ... , P(S_i|N_i, k_i=N_i-1,
T_i=1). 

Let q be the probability of male birth; it is a parameter in my model
and will be estimated. Now, to compute something like P(S_i|N_i, k_i,
T_i=1), we merely need to observe whether the family has any child after
the k_i^{th} son. If there is a child after the k_i^{th} son, then we
assign zero probability to P(S_i|N_i, k_i, T_i=1); else we assign it a
probability of (q^(nb[j]))*((1-q)^(ng[j])), where q is the probability
of a male birth, nb is the number of boys in the sequence S_i and ng is
the number of girls in the sequence.

An example might clarify matters. Suppose a family reports that the
maximum number of children it desires to have is 4 and the birth
sequence (starting with the first born child) for the family is observed
to be GGBG (where G stands for a girl and B stands for a boy). For such
a family, we need to compute the following probabilities: $(GGBG|N_i=4,
k_i=1, T_i=1), P(GGBG|N_i=4, k_i=2, T_i=1), and P(GGBG|N_i=4, k_i=3,
T_i=1). Since there is a child after the first boy, this family could
not possibly be targeting one son; hence P(GGBG|N_i=4, k_i=1, T_i=1)=0.
But the family could conceivably be targeting two or even three sons;
these possibilities are not ruled out by the observed birth sequence.
Hence P(GGBG|N_i=4, k_i=2, T_i=1)=(q)*((1-q)^3), and similarly P(GGBG|
N_i=4, k_i=3, T_i=1)=(q)*((1-q)^3). 

To clarify matters further, take another example. Suppose the family in
question reports a maximum desired family size (number of children) of 4
and we observe the following completed birth sequence for the same
family: BGB. Since there is a child after the first boy, this family
could not possibly be targeting one son; hence P(BGB|N_i=4, k_i=1,
T_i=1)=0. Since there is no child after the second boy, the family could
possibly be targeting 2 sons; hence P(BGB|N_i=4, k_i=2,
T_i=1)=(q^2)*((1-q)). And since the family stops at three children (with
two sons), it cannot be targeting three sons; to target three sons, the
family should have gone for another child and not stopped at the third
child (and second son). Hence, P(BGB|N_i=4, k_i=3, T_i=1)=0.

In my sample I let N_i run from 2 to 6. So, depending on whether N_i is
2 , 3, 4, 5 or 6, I need to compute these conditional probabilities. For
instance, if N_i is 2, I need to compute only one conditional
probability; if N_i is 6, I need to compute five of these conditional
probabilities. This is how I proceed.

I start by creating a 6x6 matrix of zeros and a 1x15 vector of zeros.

   S <- matrix(0, 6, 6)  # CREATE A 6x6 MATRIX OF ZEROS
   y <- numeric(15)      # CREATE A 1x15 VECTOR OF ZEROS 

Then the following part of the code computes the conditional
probabilities as rows of the matrix S. The code picks up the jth row if
the family has N_i=j. Once this is done, I store the lower triangle of
the matrix (i.e., entries below the principal diagonal) in the y vector.

--------------- begin code fragment -------------------------

    q <- 1/(1+exp(delta))   # PROBABILITY OF MALE BIRTH

 ###### N1 IS "DESIRED FAMILY SIZE" ######################
 ##### alive IS THE NUMBER OF CHILDREN ALIVE IN THE FAMILY
 ##### nb IS NUMBER OF BOYS IN THE FAMILY
 ##### ng IS THE NUMBER OF GIRLS IN THE FAMILY

    if (alive[j]==N1) {

        for (i in 1:(N1-1)) {
                S[N1,i] <- (q^(nb[j]))*((1-q)^(ng[j]))
        }
    }

    else {
        for (i in 1:(N1-1)) {
                S[N1,i] <- 0
        }

    }

 ######### CREATE A 1x6 VECTOR WITH POSITION OF BOYS WITHIN FAMILY
 # boy1 GIVES BIRTH POSITION OF FIRST BOY (ZERO IF THE FAMILY HAS NO
BOYS)
 # boy2 GIVES BIRTH POSITION OF FIRST BOY (ZERO IF THE FAMILY HAS ONLY 1
BOY)
 # boy3 GIVES BIRTH POSITION OF FIRST BOY (ZERO IF THE FAMILY HAS ONLY 2
BOYS)
 # boy4 GIVES BIRTH POSITION OF FIRST BOY (ZERO IF THE FAMILY HAS ONLY 3
BOYS)
 # boy5 GIVES BIRTH POSITION OF FIRST BOY (ZERO IF THE FAMILY HAS ONLY 4
BOYS)
 # boy6 GIVES BIRTH POSITION OF FIRST BOY (ZERO IF THE FAMILY HAS ONLY 5
BOYS)

      x <- c(boy1[j], boy2[j], boy3[j], boy4[j], boy5[j])

      if (N1>1) {
                     for (i in 1:(N1-1)) {
                               if (alive[j]>x[i] & x[i]>0) {
                                   S[N1,i] <- 0
                               }
                               if (x[i] == alive[j] ) {
                                   S[N1,i] <-
(q^(nb[j]))*((1-q)^(ng[j]))
                               }
                     }
      }

   y <-
c(S[2,1],S[3,1],S[3,2],S[4,1],S[4,2],S[4,3],S[5,1],S[5,2],S[5,3],S[5,4],S[6,1],S[6,2],S[6,3],S[6,4],S[6,5])

------------------ end code fragment ------------------------------

Later, I use the entries in the y vector for computing the likelihood.
Any suggestions on how to rework this part of the code will, I think,
improve overall efficiency.

Thanks in advance for your time.

Deepankar

On Sun, 2007-12-02 at 20:54 -0500, jim holtman wrote:
> One thing that I would suggest that you do is to use Rprof on a subset
> of the data that runs for 10-15 minutes and see where some of the hot
> spots are.  Since you have not provided commented, minimal,
> self-contained, reproducible code, it is hard to determine where the
> inefficiencies are since we don't have any data to run against it.
> Some of the loop look like you are just assigning a value to a vector,
> e.g.,
> 
> 
> >    if (alive[j]==N1) {
> >
> >        for (i in 1:(N1-1)) {
> >                S[N1,i] <- (q^(nb[j]))*((1-q)^(ng[j]))
> >        }
> >    }
> >
> >    else {
> >        for (i in 1:(N1-1)) {
> >                S[N1,i] <- 0
> >        }
> >
> >    }
> 
> that can be done without loops, but without data, it is hard to
> determine.  Run Rprof and see what summary.Rprof shows to indicate
> where to focus on.
> 
> On Dec 2, 2007 12:49 PM, DEEPANKAR BASU <basu.15 at osu.edu> wrote:
> > R Users:
> >
> > I am trying to estimate a model of fertility behaviour using birth history data with maximum likelihood. My code works but is extremely slow (because of several for loops and my programming inefficiencies); when I use the genetic algorithm to optimize the likelihood function, it takes several days to complete (on a machine with Intel Core 2 processor [2.66GHz] and 2.99 GB RAM). Computing the hessian and using it to calculate the standard errors takes a large chunk of this time.
> >
> > I am copying the code for my likelihood function below; it would be great if someone could suggest any method to speed up the code (by avoiding the for loops or by any other method).
> >
> > I am not providing details of my model or what exactly I am trying to do in each step of the computation below; i would be happy to provide these details if they are deemed necessary for re-working the code.
> >
> > Thanks.
> > Deepankar
> >
> >
> > --------- begin code -----------------------
> >
> > LLK1 <- function(paramets, data.frame, ...) {  # DEFINING THE LOGLIKELIHOOD FUNCTION
> >
> > # paramets IS A 1x27 VECTOR OF PARAMETERS OVER WHICH THE FUNCTION WILL BE MAXIMISED
> > # data.frame IS A DATA FRAME. THE DATA FRAME CONTAINS OBSERVATIONS ON SEVERAL VARIABLES
> > # (LIKE EDUCATION, AGE, ETC.) FOR EACH RESPONDENT. COLUMNS REFER TO VARIABLES AND ROWS REFER
> > # TO OBSERVATIONS.
> >
> > ########## PARAMETERS ###############################
> >
> > # alpha: interaction between son targeting and family size
> > # beta : son targeting
> > # gamma : family size
> > # delta : a 1x6 vector of probabilities of male birth at various parities (q1, q2, q3, q4, q5, q6)
> > # zeta : a 1x11 vector of conditional probabilities with zeta[1]=1 always
> >
> > alpha <- paramets[1]      # FIRST PARAMETER
> > beta <- paramets[2:9]     # SECOND TO SEVENTH PARAMETER
> > gamma <- paramets[10:16]
> > delta <- paramets[17]
> > zeta <- paramets[18:27]   # LAST 10 PARAMETERS
> >
> > ################# VARIABLES ###############################
> > # READING IN THE VARIABLES IN THE DATA FRAME
> > # AND RENAMING THEM FOR USE IN THE LIKELIHOOD FUNCTION
> >
> > everborn <- data.frame$v201
> > alive <- data.frame$alive
> > age <- data.frame$age
> > edu <- data.frame$edu
> > rural <- data.frame$rur
> > rich <- data.frame$rich
> > middle <- data.frame$middle
> > poor <- data.frame$poor
> > work <- data.frame$work
> > jointfam <- data.frame$jfam
> > contracep <- data.frame$contra
> > hindu <- data.frame$hindu
> > muslim <- data.frame$muslim
> > scaste <- data.frame$scaste
> > stribe <- data.frame$stribe
> > obc <- data.frame$obc
> > ucaste <- data.frame$ucaste
> > N <- data.frame$dfsize
> > indN <- data.frame$dfsize1  # INDICATOR FUNCTION THAT dfsize==alive
> > nb <- data.frame$nboy
> > ng <- data.frame$ngirl
> > ncord1 <- data.frame$ncord1  # FIRST CHILD: BOY=0; GIRL=1
> > ncord2 <- data.frame$ncord2  #SECOND CHILD: BOY=0; GIRL=1
> > ncord3 <- data.frame$ncord3
> > ncord4 <- data.frame$ncord4
> > ncord5 <- data.frame$ncord5
> > ncord6 <- data.frame$ncord6  # SIXTH CHILD: BOY=0; GIRL=1
> >
> >
> >
> > ######### POSITION OF i^th BOY ################################################
> > boy1 <- data.frame$boy1     # BIRTH POSITION OF FIRST BOY (ZERO IF THE FAMILY HAS NO BOYS)
> > boy2 <- data.frame$boy2     # BIRTH POSITION OF SECOND BOY (ZERO IF THE FAMILY HAS ONLY ONE BOY)
> > boy3 <- data.frame$boy3
> > boy4 <- data.frame$boy4
> > boy5 <- data.frame$boy5
> > boy6 <- data.frame$boy6     # BIRTH POSITION OF SIXTH BOY (ZERO IF THE FAMILY HAS ONLY FIVE BOYS)
> >
> >
> > ######################## CONDITIONAL PROBABILITIES ##########################
> > qq21 <- 1
> >
> > qq31 <- 1/(1+exp(zeta[1]))
> > qq32 <- exp(zeta[1])/(1+exp(zeta[1]))
> >
> > qq41 <- 1/(1+exp(zeta[2])+exp(zeta[3]))
> > qq42 <- exp(zeta[2])/(1+exp(zeta[2])+exp(zeta[3]))
> > qq43 <- exp(zeta[3])/(1+exp(zeta[2])+exp(zeta[3]))
> >
> > qq51 <- 1/(1+exp(zeta[4])+exp(zeta[5])+exp(zeta[6]))
> > qq52 <- exp(zeta[4])/(1+exp(zeta[4])+exp(zeta[5])+exp(zeta[6]))
> > qq53 <- exp(zeta[5])/(1+exp(zeta[4])+exp(zeta[5])+exp(zeta[6]))
> > qq54 <- exp(zeta[6])/(1+exp(zeta[4])+exp(zeta[5])+exp(zeta[6]))
> >
> > qq61 <- 1/(1+exp(zeta[7])+exp(zeta[8])+exp(zeta[9])+exp(zeta[10]))
> > qq62 <- exp(zeta[7])/(1+exp(zeta[7])+exp(zeta[8])+exp(zeta[9])+exp(zeta[10]))
> > qq63 <- exp(zeta[8])/(1+exp(zeta[7])+exp(zeta[8])+exp(zeta[9])+exp(zeta[10]))
> > qq64 <- exp(zeta[9])/(1+exp(zeta[7])+exp(zeta[8])+exp(zeta[9])+exp(zeta[10]))
> > qq65 <- exp(zeta[10])/(1+exp(zeta[7])+exp(zeta[8])+exp(zeta[9])+exp(zeta[10]))
> >
> > zeta1 <- c(qq21,qq31,qq32,qq41,qq42,qq43,qq51,qq52,qq53,qq54,qq61,qq62,qq63,qq64,qq65)
> >
> > #########################################################################
> >
> > n <- length(N)         # LENGTH OF SAMPLE; SIZE OF THE MAIN LOOP
> >
> > lglik <- numeric(n)    # INITIALIZING THE LOGLIKELIHOOD FUNCTION
> >                       # CREATES A 1xn VECTOR OF ZEROS
> >
> >  for (j in 1:n) {      # START OF MAIN LOOP
> >
> >    S <- matrix(0, 6, 6)  # CREATE A 6x6 MATRIX OF ZEROS
> >    y <- numeric(15)      # CREATE A 1x15 VECTOR OF ZEROS
> >    N1 <- N[j]       # DESIRED FAMILY SIZE
> >
> >
> >      q <- 1/(1+exp(delta))   # PROBABILITY OF MALE BIRTH
> >
> >
> >    if (alive[j]==N1) {
> >
> >        for (i in 1:(N1-1)) {
> >                S[N1,i] <- (q^(nb[j]))*((1-q)^(ng[j]))
> >        }
> >    }
> >
> >    else {
> >        for (i in 1:(N1-1)) {
> >                S[N1,i] <- 0
> >        }
> >
> >    }
> >
> > ######### CREATE A 1x6 VECTOR WITH POSITION OF BOYS WITHIN FAMILY
> >      x <- c(boy1[j], boy2[j], boy3[j], boy4[j], boy5[j])
> >
> >      if (N1>1) {
> >                     for (i in 1:(N1-1)) {
> >                               if (alive[j]>x[i] & x[i]>0) {
> >                                   S[N1,i] <- 0
> >                               }
> >                               if (x[i] == alive[j] ) {
> >                                   S[N1,i] <- (q^(nb[j]))*((1-q)^(ng[j]))
> >                               }
> >                     }
> >      }
> >
> >   y <- c(S[2,1],S[3,1],S[3,2],S[4,1],S[4,2],S[4,3],S[5,1],S[5,2],S[5,3],S[5,4],S[6,1],S[6,2],S[6,3],S[6,4],S[6,5])
> >
> >
> >   z1 <- c(age[j],edu[j],work[j],rural[j],poor[j],middle[j],hindu[j])         # DETERMINANTS OF FAMILY SIZE
> >   z2 <- c(1,age[j],edu[j],work[j],contracep[j],rural[j],jointfam[j],hindu[j])  # DETERMINANTS OF SON TARGETING
> >
> >   t1 <- (indN[j])*((q^(nb[j]))*((1-q)^(ng[j])))*(exp(-exp(sum(z1*gamma)))*((exp(sum(z1*gamma)))^N1)*pnorm(-sum(z2*beta)))/factorial(N1)
> >   t2 <- (sum(y*zeta1))*(exp(-exp(sum(z1*gamma) + alpha))*((exp(sum(z1*gamma) + alpha))^N1)*(1-pnorm(-sum(z2*beta)))/factorial(N1))
> >   lglik[j] <- log(t1+t2)
> >  }
> >
> >  return(-sum(lglik)) # RETURNING THE NEGATIVE OF THE LOGLIKELIHOOD
> >                     # SUMMED OVER ALL OBSERVATIONS
> >
> >
> > }
> >
> > ------------ end code ----------------------
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> 
> 
>