[R] prcomp - surprising structure

peter dalgaard pdalgd at gmail.com
Thu Oct 3 15:27:22 CEST 2013


It's not so obvious to me that this is an artifact. What prcomp() says is that some of the eigenvectors have a lot of "activity" in some relatively narrow ranges of SNPs (on the same chromosome, perhaps?). If something artificial is going on, I could imagine effects not so much of centering columns but maybe one of
(a) imputing zero for missing values 
(b) the scaling by sqrt(pi*(1-pi)) implicitly requiring Hardy-Weinberg equilibrium, so if your data are all 0 or 2 (aa or AA) there will be overdispersion.

However, it could also be a biological effect. Are your ids by any chance from the same pedigree? If so, you might be seeing something like the effect of a crossover event in a distant ancestor. (Talk to a geneticist, I just "play one on TV".) 

To investigate further, you could go looking at the individual scores and see who is having extreme values on component 2-4 and then go back and see if there is something peculiar about their SNPs in the "strange" region.

Of course, you might have stumbled upon a bug in R, but I doubt so. 

Happy hunting!

-pd


On Oct 3, 2013, at 11:41 , Hermann Norpois wrote:

> Hello,
> 
> I did a pca with over 200000 snps for 340 observations (ids). If I plot the
> eigenvectors (called rotation in prcomp) 2,3 and 4 (e.g. plot
> (rotation[,2]) I see a strange "column" in my data (see attachment). I
> suggest it is an artefact (but of what?).
> 
> Suggestion:
> I used prcomp this way: prcomp (mat), where mat is a matrix with the column
> means already substracted followed by a normalisation procedure (see below
> for details). Is that okay? Or does prcomp repeat substraction steps?
> 
> Originally my approach was driven by the idea to compute a covariation
> matrix followed by the use of eigen, but the covariation matrix was to huge
> to handle. So I switched to prcomp.
> 
> As I guess that the "columns" in my plots reflect some artefact production
> I hope to get some help. For the case that my use of prcomp was not okay,
> could you please give me instructions how to use it - including with the
> normalisation procedure that I need to include before doing a pca.
> 
> Thanks
> Hermann
> 
> #
> # mat: matrix with genotypes coded as 0,1 and 2 (columns); IDs
> (observations) as rows.
> #
> prcomp.snp <- function (mat)
>  {
>    m <- ncol (mat)
>    n <- nrow (mat)
>    snp.namen <- colnames (mat)
>    for (i in 1:m)
>                   {
>                     # snps in columns
>                     ui <- mat[,i]
>                     n <- length (which (!is.na(ui)))
>                     # see methods Price et al. as correction
>                     pi <- (1+ sum(ui, na.rm=TRUE))/(2+2*n)
> 
>                     # substract mean
>                     ui <- ui - mean (ui, na.rm=TRUE)
>                     # NAs set to zero
>                     ui[is.na(ui)] <- 0
>                     # normalisation of the genotype for each ID
> important normalisation step
>                     ui <- ui/ (sqrt (pi*(1-pi)))
>                     # fill matrix with ui
>                     mat[,i] <- ui
>                   }
>    mat <- prcomp (mat)
>    return (mat)
>   }
> <rotplot.png>______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com



More information about the R-help mailing list