[BioC] PCA or concordance

Wed Mar 3 17:54:53 CET 2010

Hi,

On Wed, Mar 3, 2010 at 11:41 AM, Johnny H <ukfriend22 at googlemail.com> wrote:
> Dear Bioconductors,
> I have some proteomics data for several tissues:
>
> Heart x 3 replicates
> Lung x 3 replicates
>
> Each data set has a gene symbol and the number of peptides for that gene (a
> rough measure of protein expression).
>
> I want to make a data structure like:
>
>         heart1  heart2  heart3  lung1  lung2  lung3
> Gene1   2       4       3       7      9      20
> Gene2   50      45      33      0      1      0
> Gene3   ...... etc
> Gene4
>
> Each number in the data frame corresponds to number of peptides for that
> gene.
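
For what it's worth, here is a rough sketch of how a table like that could be put together in R. It assumes each sample starts out as its own data frame with a gene symbol column and a peptide count column; the column names (`gene`, `peptides`), the reduced set of samples, and the Gene3 value are made up for illustration. A gene that isn't seen in a sample is given a count of 0, which is itself an assumption.

```r
# Hypothetical per-sample tables: gene symbol + peptide count.
# Gene1/Gene2 values follow the example table; Gene3 is invented.
heart1 <- data.frame(gene = c("Gene1", "Gene2"), peptides = c(2, 50))
heart2 <- data.frame(gene = c("Gene1", "Gene2"), peptides = c(4, 45))
lung1  <- data.frame(gene = c("Gene1", "Gene3"), peptides = c(7, 5))

samples <- list(heart1 = heart1, heart2 = heart2, lung1 = lung1)

# Union of all gene symbols seen in any sample.
genes <- sort(unique(unlist(lapply(samples, function(s) as.character(s$gene)))))

# One column per sample; a gene not detected in a sample gets a count of 0.
counts <- sapply(samples, function(s) {
  x <- s$peptides[match(genes, s$gene)]
  x[is.na(x)] <- 0
  x
})
rownames(counts) <- genes

print(counts)
```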

I've never worked with proteomics data, but one quick point: since
you say you want to "show something" based on the number of peptides
found per protein, I guess you'll have to normalize somehow for the
size of the protein itself (i.e. the number of peptides you'd expect
it to yield)?
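
To make that concrete, here is a minimal sketch of one way such a normalization could look: divide each count by the protein's length, then rescale each sample so its values sum to 1 (along the lines of what proteomics people call NSAF, as far as I understand it). The protein lengths below are invented for illustration, and whether length is the right thing to normalize by for your data is an open question.

```r
# The two complete rows from the example table.
counts <- matrix(c(2, 4, 3, 7, 9, 20,
                   50, 45, 33, 0, 1, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Gene1", "Gene2"),
                                 c("heart1", "heart2", "heart3",
                                   "lung1", "lung2", "lung3")))

# Hypothetical protein lengths in amino acids (NOT from the original data).
prot_len <- c(Gene1 = 200, Gene2 = 1000)

# Counts per unit length. Dividing a matrix by a vector of length nrow()
# recycles down the columns, so each row is divided by its own length.
per_len <- counts / prot_len[rownames(counts)]

# Rescale each sample (column) so that its values sum to 1.
norm <- sweep(per_len, 2, colSums(per_len), "/")

print(round(norm, 3))
```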

> Is a Principal Component Analysis useful for this data set?

What are you trying to show?

> What would a PCA  tell me?

There are lots and lots of tutorials and things about PCA on the
intertubes. Here's a quote from the Wikipedia article that, I think,
gives a decent "intuition" for what it tries to do:

"""PCA is the simplest of the true eigenvector-based multivariate
analyses. Often, its operation can be thought of as revealing the
internal structure of the data in a way which best explains the
variance in the data. If a multivariate dataset is visualised as a set
of coordinates in a high-dimensional data space (1 axis per variable),
PCA supplies the user with a lower-dimensional picture, a "shadow" of
this object when viewed from its (in some sense) most informative
viewpoint."""

I guess the last sentence, in particular, is useful.
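
If you just want to see what that "shadow" looks like for a table like yours, a minimal sketch using base R's `prcomp` follows. It treats samples as the observations (hence the transpose), and the `log2(count + 1)` transform is my own assumption to tame the large counts, not something your data necessarily calls for. With only the two complete genes from your example there are only two components, so this is purely illustrative.

```r
# The two complete rows from the example table.
counts <- matrix(c(2, 4, 3, 7, 9, 20,
                   50, 45, 33, 0, 1, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Gene1", "Gene2"),
                                 c("heart1", "heart2", "heart3",
                                   "lung1", "lung2", "lung3")))

# prcomp() expects observations in rows, so transpose: samples x genes.
# The log2(x + 1) transform is an assumption, used here to damp large counts.
pca <- prcomp(t(log2(counts + 1)), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each component.
print(summary(pca))

# Plot the samples on the first two components, coloured by tissue.
tissue <- sub("[0-9]+$", "", colnames(counts))
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2", pch = 19,
     col = ifelse(tissue == "heart", "red", "blue"))
text(pca$x[, 1], pca$x[, 2], labels = rownames(pca$x), pos = 3)
```

If replicates of the same tissue land close together and the tissues separate along the first component, that is about as much as a plot like this can tell you.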

> What function would I use to make a nice graphical representation of the data?

What are you trying to show?

> Or should I use a concordance function, something like this?
>
> con<-function(y1,y2){
>  d<-(mean(y1) - mean(y2))
>  v1<-var(y1)
>  v2<-var(y2)
>  cov<-cov(y1,y2)
>  con<-(2*cov)/(v1+v2+d^2)
>  return(con)};
>
> This will tell me if two samples have concordance but I don't know how to
> involve all samples. Basically, I want to summarise the data.

Summarize it like how?
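
If the concordance measure is what you're after, one way to involve all samples is simply to compute it for every pair of columns and look at the resulting matrix. The sketch below uses the same formula as your `con` function (with the locals renamed so they don't shadow `cov`); the `counts` matrix only holds the two complete genes from your example, so the numbers themselves mean very little.

```r
# Same formula as the con() function quoted above.
con <- function(y1, y2) {
  d <- mean(y1) - mean(y2)
  (2 * cov(y1, y2)) / (var(y1) + var(y2) + d^2)
}

# The two complete rows from the example table.
counts <- matrix(c(2, 4, 3, 7, 9, 20,
                   50, 45, 33, 0, 1, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Gene1", "Gene2"),
                                 c("heart1", "heart2", "heart3",
                                   "lung1", "lung2", "lung3")))

# Concordance for every pair of samples (columns).
n  <- ncol(counts)
cc <- matrix(NA_real_, n, n,
             dimnames = list(colnames(counts), colnames(counts)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    cc[i, j] <- con(counts[, i], counts[, j])
  }
}

print(round(cc, 2))
```

Something like `heatmap(cc)` would then give a picture of which samples agree with which, though that still leaves open what you want the picture to say.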

Summing up the number of peptides found in each sample is one type of
summary, but it might not tell you what you'd like to know (you
haven't been clear on what that is). It could be informative in other
ways that aren't immediately obvious, though: e.g. it can tell you
whether you have roughly the same amount of "input" in each of your
replicates.
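
As a minimal sketch, with the two complete genes from your example standing in for the full table, the per-sample totals are just the column sums:

```r
# The two complete rows from the example table.
counts <- matrix(c(2, 4, 3, 7, 9, 20,
                   50, 45, 33, 0, 1, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Gene1", "Gene2"),
                                 c("heart1", "heart2", "heart3",
                                   "lung1", "lung2", "lung3")))

# Total number of peptides observed in each sample.
totals <- colSums(counts)
print(totals)

# Totals relative to the mean across samples; values far from 1 hint at
# unequal "input" between samples.
print(round(totals / mean(totals), 2))
```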

So ... what are you trying to show?

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact