[R] qr with missing dependent variables
Richard.Mott at well.ox.ac.uk
Thu Dec 8 17:52:29 CET 2005
We have a regression problem which could be solved elegantly if we could
figure out how to get the R residuals() function to accept missing
values.
We have ~20000 gene-expression vectors y, each being measured on the
same set of individuals, but each having a small random number of
missing values.
For each expression vector we wish to search across the genome looking
for quantitative trait loci - i.e. chromosomal regions g where the local
genetic structure, represented by the design matrix X(g), gives a
significant linear regression relationship. Depending on the complexity
of the genetic model being investigated, X(g) typically has either 7 or
32 columns, i.e. it is of non-trivial size. The number of loci g to be
investigated is ~13000, so we have to do 13000*20000 = 260,000,000
multiple regressions. Therefore computational efficiency is important.
We thought of one way to do this: for each design matrix X(g), compute
the qr decomposition once, then work out the residual sum of squares for
each of the expression phenotypes using residuals() on the qr object
applied to the expression vector. That way we would only need to do the
hard part of the linear regression once.
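
As a rough illustration of the idea above (not necessarily the exact code
in use), the sketch below decomposes a design matrix once and reuses it
for several phenotypes. It assumes complete data and uses base R's
qr.resid(), which accepts a matrix of right-hand sides. The sizes n, p, m
and the simulated X and Y are made-up stand-ins for X(g) and the
expression vectors.

```r
set.seed(1)
n <- 50; p <- 7; m <- 5          # individuals, design columns, phenotypes (toy sizes)
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # hypothetical stand-in for X(g)
Y <- matrix(rnorm(n * m), n, m)  # one column per expression vector, no NAs here

qrX <- qr(X)                     # decompose once per locus
res <- qr.resid(qrX, Y)          # residuals for all phenotypes in one call
rss <- colSums(res^2)            # residual sum of squares per phenotype

# cross-check the first phenotype against an ordinary lm() fit
rss_lm <- sum(resid(lm(Y[, 1] ~ X - 1))^2)
```

Here rss[1] and rss_lm should agree up to rounding error; the saving comes
from calling qr() once per locus rather than once per regression.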
The problem with this approach is the missing values, which are not
allowed by residuals(). Unfortunately we can't just eliminate all rows
containing a missing value because we would throw away too much data.
Is there a way round this? Can we set the missing values to 0 and then
sort out the discrepancies in the residual SS? More generally, is it
consistent to compute a qr decomposition including rows for which there
are no dependent observations?
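
To make the "set to 0 and sort out the discrepancies" idea concrete, here
is one possible sketch, offered as an illustration rather than a settled
answer. It relies on the standard case-deletion identity: if e is the
residual vector from the zero-filled fit and M indexes the missing rows,
then the residual SS on the observed rows is e'e - e_M' A^{-1} e_M, where
A is the M-by-M block of I - H and H = QQ' is the hat matrix. The
simulated data and the names miss, y0, Qm, A, rss_fix and rss_ref are
hypothetical. The sketch assumes X still has full column rank once the
missing rows are dropped.

```r
set.seed(2)
n <- 40; p <- 7
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # hypothetical stand-in for X(g)
y <- rnorm(n)
miss <- c(3, 17, 29)             # rows where this phenotype is missing (toy choice)
y[miss] <- NA

qrX <- qr(X)                     # computed once, on all rows
Q <- qr.Q(qrX)                   # thin Q factor, n x p

y0 <- y
y0[miss] <- 0                    # fill the missing entries with 0
e <- qr.resid(qrX, y0)           # residuals of the zero-filled fit

Qm <- Q[miss, , drop = FALSE]
A <- diag(length(miss)) - Qm %*% t(Qm)   # block of (I - H) for the missing rows
rss_fix <- sum(e^2) - drop(t(e[miss]) %*% solve(A, e[miss]))

# reference value: refit from scratch on the observed rows only
rss_ref <- sum(resid(lm(y[-miss] ~ X[-miss, ] - 1))^2)
```

Under these assumptions rss_fix and rss_ref should match up to rounding,
and the residual degrees of freedom become n - length(miss) - p. The
correction only needs a small linear solve per phenotype, so it stays
cheap when the number of missing values is small.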
As far as I can see, this problem has not been addressed in R-help, but
my apologies if it has!
Richard Mott | Wellcome Trust Centre
tel 01865 287588 | for Human Genetics
fax 01865 287697 | Roosevelt Drive, Oxford OX3 7BN