[R] Speeding up lots of calls to GLM

Tue Mar 13 01:56:07 CET 2012

On Tue, Mar 13, 2012 at 9:39 AM, Davy <davykavanagh at gmail.com> wrote:
> Thanks for the reply. Sorry to be a pain, but could you perhaps explain
> what you mean by
> "you can center each SNP variable at its mean to make the interaction
> term uncorrelated with the main effects".

Suppose rs1 and rs2 are your SNPs

rs1cent <- rs1 - mean(rs1)
rs2cent <- rs2 - mean(rs2)

rs12interaction <- rs1cent*rs2cent

Now you can approximately test for interaction by testing for
correlation between phenotype and rs12interaction.  The approximation
isn't good enough to be relied on for final results, but it is good
enough to screen out, say, the bottom 99% of the models in settings
where there is not strong linkage disequilibrium (correlation between
SNPs).
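
As a minimal sketch of that screening step, the code below simulates
two SNPs and a phenotype (the simulated data, the name pheno, and the
choice of cor.test() are assumptions for illustration, not part of the
original advice) and tests the centered product against the phenotype.

```r
set.seed(1)
n <- 500

# Simulated genotypes coded 0/1/2 and a phenotype with no interaction
# (illustrative data only)
rs1 <- rbinom(n, 2, 0.3)
rs2 <- rbinom(n, 2, 0.4)
pheno <- rnorm(n) + 0.3 * rs1

# Center each SNP at its mean, then form the product term
rs1cent <- rs1 - mean(rs1)
rs2cent <- rs2 - mean(rs2)
rs12interaction <- rs1cent * rs2cent

# Approximate interaction screen: correlation with the phenotype
screen <- cor.test(pheno, rs12interaction)
screen$p.value
```

Pairs with a small p-value here would then go on to a proper
interaction model; the rest can be dropped.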

The advantage of this is not just the lack of glms, but the fact that
rs12interaction can be computed for a lot of pairs at once, allowing
efficient vectorized code.  Perhaps even for all pairs at once, if you
have enough memory.
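
One way the all-pairs version might look, assuming the genotypes sit
in an n x m matrix (called G here, a hypothetical name) and a
quantitative phenotype with no missing values.  The correlations for
every pair come out of a few matrix products rather than a loop; this
is only a sketch, with simulated data.

```r
set.seed(2)
n <- 300   # subjects
m <- 20    # SNPs

# Simulated genotype matrix and phenotype (illustrative data only)
maf <- runif(m, 0.1, 0.5)
G <- sapply(maf, function(p) rbinom(n, 2, p))
pheno <- rnorm(n)

Gc <- sweep(G, 2, colMeans(G))   # center each SNP at its mean
yc <- pheno - mean(pheno)        # centered phenotype

# Entry [j, k] is sum_i yc[i] * Gc[i, j] * Gc[i, k]
num <- crossprod(Gc, yc * Gc)

# Sum of squares of each product term about its own mean
ssp <- crossprod(Gc^2) - crossprod(Gc)^2 / n

# Correlation of the phenotype with every pairwise product
r <- num / sqrt(sum(yc^2) * ssp)
diag(r) <- NA                    # a SNP with itself is not a pair

# Usual t-test p-values for a correlation, for all pairs at once
tstat <- r * sqrt((n - 2) / (1 - r^2))
pval <- 2 * pt(-abs(tstat), df = n - 2)
```

The matrices are m x m, so for genome-wide data this would have to be
done in blocks of SNPs rather than in one go.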

> Also, I have never heard of a score test before but some googling has
> turned up the Lagrange multiplier test. Is this the one you mentioned?

I meant the efficient score or Rao score test; the Lagrange multiplier
test is another name for the same test.  It's based on fitting the
model without the interaction and testing whether the efficient score,
the derivative of the loglikelihood with respect to the interaction
coefficient, is zero at the null model.  This doesn't require fitting
the interaction model, which is why it saves time.
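
To make that concrete, here is a rough sketch of a one-degree-of-
freedom score test for the interaction in a logistic model, computed
by hand from the null fit only.  The binary phenotype y, the simulated
data, and the variable names are assumptions for illustration.

```r
set.seed(3)
n <- 1000

# Simulated genotypes and a binary phenotype (illustrative data only)
rs1 <- rbinom(n, 2, 0.3)
rs2 <- rbinom(n, 2, 0.4)
y <- rbinom(n, 1, plogis(-1 + 0.3 * rs1 + 0.2 * rs2))

# Fit only the null model: main effects, no interaction
fit0 <- glm(y ~ rs1 + rs2, family = binomial)
mu <- fitted(fit0)
w <- mu * (1 - mu)          # logistic variance weights
X <- model.matrix(fit0)
z <- rs1 * rs2              # candidate interaction term

# Score: derivative of the loglikelihood in the direction of z,
# evaluated at the null fit
U <- sum(z * (y - mu))

# Information for z, adjusted for the terms already in the model
XtWX <- crossprod(X, w * X)
XtWz <- crossprod(X, w * z)
V <- sum(w * z^2) - drop(crossprod(XtWz, solve(XtWX, XtWz)))

score_stat <- U^2 / V
score_p <- pchisq(score_stat, df = 1, lower.tail = FALSE)
```

Base R gives the same statistic from anova(fit0, fit1, test = "Rao"),
but that call needs the interaction model fit1 to have been fitted,
which is the cost being avoided here.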

Getting large-scale SNP association tests to run fast does require
some reasonable familiarity with what is actually going on in the
internals of the tests.  Or, as many people eventually decide is
easier, brute force computing power.

   -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland