[R] prcomp - surprising structure

peter dalgaard pdalgd at gmail.com
Fri Oct 4 14:10:03 CEST 2013


On Oct 3, 2013, at 16:30 , Hermann Norpois wrote:

> Thanks for answering.
> 
> I already started hunting. But my first doubt was if I used prcomp correctly (and this is in the moment my most important point). So far as I understood your answer is yes. Is that correct? 

Yes. There are a couple of slightly dubious aspects: The missing value imputation, the scaling could be wrong, and the data are highly discrete, but I don't see how either of those would explain the effect. 

On the off chance that it could be a bug triggered by the large number of columns, you could consider disregarding SNP columns (say) 1:50000 and 250000:300000, rerun the prcomp(), and see if you still get the "columns" in the same positions. 

> 
> I am puzzled by the fact that these "columns" are more or less in the middle of my snp-data.

...or to be precise, there are narrow _ranges_ of SNPs in which the eigenvectors have many extreme coordinate values (positive or negative). I suspect that you need to consider what these ranges represent, both in biological terms and in technological terms. Are they parts of chromosomes, or maybe something to do with the SNP-chip layout?

(Unless you come up with an actual bug, we probably should not continue the discussion on this list.)

-pd

> 
> > However, it could also be a biological effect. Are your ids by any chance from the same pedigree? If so, you might be seeing something like  the effect of a crossover event in a distant ancestor. 
> 
> No there is no such pedigree scheme. Things like this are ruled out by IBD-measurement. (Further, the data is checked by an EIGENSTRAT analysis.)
> 
> > (b) the scaling by sqrt(pi*(1-pi)) implicitly requiring Hardy-Weinberg  equilibrium, so if your data are all 0 or 2 (aa or AA) there will be overdispersion.
> 
> This is a good point. But why do find such effects in the "middle" of my data?
> 
> Thanks
> Hermann
> 
> 
> 
> 
> 2013/10/3 peter dalgaard <pdalgd at gmail.com>
> It's not so obvious to me that this is an artifact. What prcomp() says is that some of the eigenvectors have a lot of "activity" in some relatively narrow ranges of SNPs (on the same chromosome, perhaps?). If something artificial is going on, I could imagine effects not so much of centering columns but maybe one of
> (a) imputing zero for missing values
> (b) the scaling by sqrt(pi*(1-pi)) implicitly requiring Hardy-Weinberg equilibrium, so if your data are all 0 or 2 (aa or AA) there will be overdispersion.
> 
> However, it could also be a biological effect. Are your ids by any chance from the same pedigree? If so, you might be seeing something like the effect of a crossover event in a distant ancestor. (Talk to a geneticist, I just "play one on TV".)
> 
> To investigate further, you could go looking at the individual scores and see who is having extreme values on component 2-4 and then go back and see if there is something peculiar about their SNPs in the "strange" region.
> 
> Of course, you might have stumbled upon a bug in R, but I doubt so.
> 
> Happy hunting!
> 
> -pd
> 
> 
> On Oct 3, 2013, at 11:41 , Hermann Norpois wrote:
> 
> > Hello,
> >
> > I did a pca with over 200000 snps for 340 observations (ids). If I plot the
> > eigenvectors (called rotation in prcomp) 2,3 and 4 (e.g. plot
> > (rotation[,2]) I see a strange "column" in my data (see attachment). I
> > suggest it is an artefact (but of what?).
> >
> > Suggestion:
> > I used prcomp this way: prcomp (mat), where mat is a matrix with the column
> > means already substracted followed by a normalisation procedure (see below
> > for details). Is that okay? Or does prcomp repeat substraction steps?
> >
> > Originally my approach was driven by the idea to compute a covariation
> > matrix followed by the use of eigen, but the covariation matrix was to huge
> > to handle. So I switched to prcomp.
> >
> > As I guess that the "columns" in my plots reflect some artefact production
> > I hope to get some help. For the case that my use of prcomp was not okay,
> > could you please give me instructions how to use it - including with the
> > normalisation procedure that I need to include before doing a pca.
> >
> > Thanks
> > Hermann
> >
> > #
> > # mat: matrix with genotypes coded as 0,1 and 2 (columns); IDs
> > (observations) as rows.
> > #
> > prcomp.snp <- function (mat)
> >  {
> >    m <- ncol (mat)
> >    n <- nrow (mat)
> >    snp.namen <- colnames (mat)
> >    for (i in 1:m)
> >                   {
> >                     # snps in columns
> >                     ui <- mat[,i]
> >                     n <- length (which (!is.na(ui)))
> >                     # see methods Price et al. as correction
> >                     pi <- (1+ sum(ui, na.rm=TRUE))/(2+2*n)
> >
> >                     # substract mean
> >                     ui <- ui - mean (ui, na.rm=TRUE)
> >                     # NAs set to zero
> >                     ui[is.na(ui)] <- 0
> >                     # normalisation of the genotype for each ID
> > important normalisation step
> >                     ui <- ui/ (sqrt (pi*(1-pi)))
> >                     # fill matrix with ui
> >                     mat[,i] <- ui
> >                   }
> >    mat <- prcomp (mat)
> >    return (mat)
> >   }
> > <rotplot.png>______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> 
> --
> Peter Dalgaard, Professor
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
> 
> 

-- 
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com



More information about the R-help mailing list