[R] debugging R code and dealing with dependencies

Mike Miller mbmiller+l at gmail.com
Thu Dec 25 10:04:09 CET 2014


I just wanted to put this out there.  It's just some of my observations 
about things that happen with R, or happened in this particular 
investigation.  There were definitely some lessons for me in this, and 
maybe that will be true of someone else.  The main thing I picked up is 
that it is good to put plenty of checks into our code -- if we expect 
input of a certain type or class, then I should either coerce input into 
that structure or test the input and throw an error.  If the function 
works very differently for different kinds of input, this should be 
documented.  The more people are doing this, the better things will go for 
everyone.


I was working with a CRAN package called RFGLS...

http://cran.r-project.org/web/packages/RFGLS/index.html

...and I was getting an error.  After a few rounds of testing I realized 
that the error was caused by a FAMID variable that was of character type.

The problem seemed to be that gls.batch() expected FAMID to be integers, 
but the default ought to be character type because family and individual 
IDs in nearly all genetic-analysis software are character strings (they 
might even be people's names).

This was the error:

Error in sum(blocksize) : invalid 'type' (character) of argument
Calls: gls.batch -> bdsmatrix

To figure out more about it, I spent a bunch of time to go from CMD BATCH 
mode to an interactive session so that I could look at traceback().  That 
got me this additional info:

> traceback()
2: bdsmatrix(sizelist, lme.out$sigma at blocks, dimnames = list(id, id))

bdsmatrix() is from a package on which RFGLS depends:

http://cran.r-project.org/web/packages/bdsmatrix/index.html

The problem is that RFGLS's gls.batch() function is sending something to 
bdsmatrix's bdsmatrix() that it can't handle.  So I look at the code for 
bdsmatrix() and I see this:

      if (any(blocksize <= 0))
          stop("Block sizes must be >0")
      if (any(as.integer(blocksize) != blocksize))
          stop("Block sizes must be integers")
      n1 <- as.integer(sum(blocksize))

The condition any(as.integer(blocksize) != blocksize)) fails (is TRUE) 
only if blocksize contains one or more noninteger numeric values.  It 
doesn't fail if blocksize is character or logical if the character strings 
are integers.  Example:

> 4=="4"
[1] TRUE

That's an interesting feature of R, but I guess that's how it works. 
Also this:

> 1=="1"
[1] TRUE
> 1==TRUE
[1] TRUE
> "1"==TRUE
[1] FALSE

bdsmatrix() has no test that blocksize is numeric, so it fails when 
sum(blocksize) cannot sum character strings.

Next I had to figure out where RFGLS's gls.batch() is going wrong in 
producing sizelist.  It is created in a number of steps, but I identified 
this line as especially suspicious:

test.dat$famsize[test.dat$FTYPE!=6]=ave(test.dat$FAMID[test.dat$FTYPE!=6],test.dat$FAMID[test.dat$FTYPE!=6],FUN=length)

famsize was later converted to sizelist, and this line also includes 
FAMID, so this is likely where the problem originates.  Of course this is 
the big problem with debugging -- it's hard to find the source of an error 
that occurs far downstream in another function from a different package. 
I see that ave() is used, so I have to understand ave().

William Dunlap provided some guidance:

"ave() uses its first argument, 'x', to set the length of its output and 
to make an initial guess at the type of its output.  The return value of 
FUN can alter the type, but only in an 'upward' direction where 
logical<integer<numeric<complex<character<list.  (This is the same rule 
that x[i]<-newvalue uses.)"

In other words, if x is of character type, the output cannot be of integer 
or numeric type even if the output of FUN is always of integer or numeric 
type.  Looking at the ave() code, I can understand that choice:

function (x, ..., FUN = mean)
{
     if (missing(...))
         x[] <- FUN(x)
     else {
         g <- interaction(...)
         split(x, g) <- lapply(split(x, g), FUN)
     }
     x
}

If the factor is missing an element, then the corresponding element of X 
is not changed in the output:

> fact <- gl(2,2)
> fact[3] <- NA
> fact
[1] 1    1    <NA> 2
Levels: 1 2
> ave(1:4, fact)
[1] 1.5 1.5 3.0 4.0

That's a reasonable plan, but it isn't the documented functioning of 
ave().  From the document...

https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ave.html

...you get next to nothing about what the function actually does.  It does 
say that x is "a numeric," but the function does not throw an error when x 
is not numeric.  So if someone writes code expecting numeric x, but a user 
provides a non-numeric x, there may be trouble.

I suspect that the programmer saw that the code worked in her examples and 
she went on to other things.  I can't blame the documentation for that, 
but it is possible that if it said something about the relation between 
the type of the input and the type of the output she might have written it 
differently.  In addition, I probably would have caught it sooner and I 
would have understood the problem.

This is how I'll recommend they fix the bug in the code (thanks to those 
of you who helped with this):

temp.vec <- as.character( test.dat$FAMID[ test.dat$FTYPE != 6 ] )
test.dat$famsize[ test.dat$FTYPE != 6 ] <- as.vector( table( temp.vec )[ temp.vec ] )
rm(temp.vec)

I think we should force FAMID to be character from the beginning, though.

Best,
Mike

FYI -- RFGLS code that fails in RFGLS version 1.1:

library(RFGLS)

data(pheno)
data(geno)
data(map)
data(pedigree)
data(rescovmtx)
  #comment out the following two lines and it will run correctly
pedigree$FAMID <- as.character(pedigree$FAMID)
pheno$FAMID <- as.character(pheno$FAMID)
minigwas <- gls.batch(phenfile=pheno,genfile=data.frame(t(geno)),
   pedifile=pedigree,
   snp.names=map[,2],input.mode=c(1,2,3),pediheader=FALSE,
   pedicolname=c("FAMID","ID","PID","MID","SEX"),
   phen="Zscore",covars="IsFemale",
   outfile=NULL,col.names=TRUE,return.value=TRUE)
str(minigwas)

Mike



More information about the R-help mailing list