[BioC] genereating correlation matrix from gene expression data

Robert Castelo robert.castelo at upf.edu
Thu Nov 19 17:31:16 CET 2009


hi Pankaj,

since you mentioned you're a new user of R and Bioconductor let me tell
you that you should take a look to the posting guide at
http://www.bioconductor.org/docs/postingGuide.html if you have not done
so and try to follow the guidelines which, among other things, require
to respond to everyone in the list in order to ensure that the response
is archived so that the emails work as a knowledge base for all of us
and future users.

regarding you question below, tab-separated .txt format is fine to
import the data into R and for that purpose you don't need any specific
function of the 'qpgraph' package but the general-purpose reading
function 'read.table'. depending on whether you have row and column
headers you might need to search for appropriate values of its arguments
which you can do by looking at its help page with:

help(read.table)

but probably a call like this one should work for you:

expData <- read.table("filename.txt", header=TRUE, sep="\t")

a simple QC for the previous operation is to call the function 'head'
which will allow you to check whether you have read what you expect:

head(expData)

then you can do the calls specified below, like:

myPCCs <- qpPCC(expData)

regarding the particular functions 'set.seed', 'matrix', 'rnorm' and
'round' i put them there in order to give you an example that you can
reproduce in your computer. that is, by copying and pasting the example
below (or the example that Steve gave you) into your R session, you
should get the same results in your computer and that should help you in
understanding the solution.

look at the help pages of these functions in order to understand what
they do and you should be able then to figure out why are they given in
that particular order. let me know if after looking to the help pages
you still have problems in understanding the example.

cheers,
robert.

> Hi Robert,
> 
> Thanks for your reply. I have my data as follows :
> 
> At2g14610.1    5.66901    5.13719    5.7854    5.59656
> At2g14610_41.45    5.82303    4.88138    5.7313    5.23123
> At3g26830.1    4.76497    4.5498    4.96243    4.85919
> At2g24850_37.1    4.08244    4.22297    5.39891    4.9186
> At4g10500.1    4.3768    4.15227    4.76488    5.05476
> At1g15520.1    4.42941    3.92527    4.60306    5.17746
> At3g28510.1    4.67999    3.61874    4.75607    4.94384
> At1g57630.1    4.44181    4.20935    4.28739    4.50546
> At3g22600.1    4.64839    3.89768    4.48499    4.3105
> At2g24850.1    3.72857    4.09308    4.85082    4.57857
> At3g57260.1    4.40517    3.80811    4.52704    4.4601
> 
> Here 20 genes in rows and 4 expression values in columns. I have this
> file in xls as well as tab separated .txt format. How can import this
> matrix to  'qpgraph'?  and what these function mean: 
> 
> set.seed(123)
> x <- matrix(rnorm(20), 4, 5)
> 
> round(x, digits=3)
> 
> 
> Thanks and regards,
> 
> Pankaj Barah
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> --- On Thu, 19/11/09, Robert Castelo <robert.castelo at upf.edu> wrote:
>         
>         From: Robert Castelo <robert.castelo at upf.edu>
>         Subject: Re: [BioC] genereating correlation matrix from gene
>         expression data
>         To: "pankaj borah" <pankajborah2k3 at yahoo.co.in>
>         Cc: bioconductor at stat.math.ethz.ch
>         Date: Thursday, 19 November, 2009, 2:33 PM
>         
>         Pankaj,
>         
>         in addition to what Steve provided below, if you are dealing
>         with
>         expression data, it might be useful for you the function qpPCC
>         from the
>         bioconductor package 'qpgraph'. it performs the same
>         calculation as the
>         'cor' function illustrated in Steve's message but in addition
>         it accepts
>         as input parameter not only a matrix or a data frame but also
>         an
>         ExpressionSet object (not a big deal) and, more interestingly,
>         it
>         calculates for you the P-values of a two sided test for the
>         null
>         hypothesis of zero Pearson correlation. this latter feature
>         might be
>         useful since as far as i know if one wants to test every pair
>         of genes
>         and get the P-values one needs to call 'cor.test' for each
>         pair of genes
>         within a two nested for loops or some 'apply' call and this
>         function
>         'qpPCC' does it using matrix calculations and thus is faster
>         for that
>         purpose. of course, if you are interested in determining
>         significance
>         through the P-values you will have to deal with the multiple
>         testing
>         problem.
>         
>         library(qpgraph)
>         
>         set.seed(123)
>         x <- matrix(rnorm(20), 4, 5)
>         
>         round(x, digits=3)
>                [,1]   [,2]   [,3]   [,4]   [,5]
>         [1,] -0.560  0.129 -0.687  0.401  0.498
>         [2,] -0.230  1.715 -0.446  0.111 -1.967
>         [3,]  1.559  0.461  1.224 -0.556  0.701
>         [4,]  0.071 -1.265  0.360  1.787 -0.473
>         
>         lapply(qpPCC(x), round, digits=3)
>         $R
>                1      2      3      4      5
>         1  1.000 -0.016  0.958 -0.490  0.437
>         2 -0.016  1.000 -0.271 -0.753 -0.462
>         3  0.958 -0.271  1.000 -0.218  0.431
>         4 -0.490 -0.753 -0.218  1.000 -0.198
>         5  0.437 -0.462  0.431 -0.198  1.000
>         
>         $P
>               1     2     3     4     5
>         1 0.000 0.984 0.042 0.510 0.563
>         2 0.984 0.000 0.729 0.247 0.538
>         3 0.042 0.729 0.000 0.782 0.569
>         4 0.510 0.247 0.782 0.000 0.802
>         5 0.563 0.538 0.569 0.802 0.000
>         
>         cor.test(x[,1], x[,2])
>         
>                 Pearson's product-moment correlation
>         
>         data:  x[, 1] and x[, 2] 
>         t = -0.0231, df = 2, p-value = 0.9837
>         alternative hypothesis: true correlation is not equal to 0 
>         95 percent confidence interval:
>         -0.962  0.960 
>         sample estimates:
>             cor 
>         -0.0163 
>         
>         as a final shameless remark :) if you are interested in trying
>         to sort
>         out genuine from spurious correlations you might want to
>         calculate the
>         so-called "partial correlations" (see
>         http://en.wikipedia.org/wiki/Partial_correlation) which in
>         fact one
>         cannot calculate with gene expression data because typically
>         you will
>         many more genes than samples. instead, as a sort of surrogate
>         for
>         partial correlations, you might find useful to calculate the
>         so-called
>         non-rejection rates through the functions 'qpNrr' or
>         'qpAvgNrr' from the
>         'qpgraph' package (look at the corresponding help pages if
>         you're
>         interested).
>         
>         best,
>         robert.
>         
>         On Wed, 2009-11-18 at 14:05 -0500, Steve Lianoglou wrote:
>         > Hi Pankaj,
>         > 
>         > On Nov 18, 2009, at 1:30 PM, pankaj borah wrote:
>         > 
>         > > Hi all,
>         > >
>         > > i am a new user of R and Bioconductor package. i deal with
>         gene  
>         > > expression data. i was wondering if there is a way to
>         generate a  
>         > > correlation matrix (all to all square symmetric matrix)
>         for  a set  
>         > > of genes and their expression values. is there any
>         available function?
>         > 
>         > Try ?cor to get the pairwise correlation between columns of
>         your  
>         > matrix, eg:
>         > 
>         > R> x <- matrix(rnorm(20), 4, 5)
>         > R> x
>         >             [,1]        [,2]       [,3]       [,4]
>            [,5]
>         > [1,]  0.5818808  0.58781709  2.5511157 -1.5332180  0.6905731
>         > [2,] -0.7640311  0.25960974  0.7246655 -1.5539226 -0.5459625
>         > [3,] -1.7141619 -0.25808091 -0.0868366 -0.6547804  0.4629494
>         > [4,] -2.2906217  0.04932864  0.5694895  0.7736206  0.4078665
>         > 
>         > R> cor(x)
>         >              [,1]         [,2]       [,3]       [,4]
>         [,5]
>         > [1,]  1.00000000  0.851982381  0.8715055 -0.8404848
>         0.074770380
>         > [2,]  0.85198238  1.000000000  0.9429553 -0.5338964
>         0.004627258
>         > [3,]  0.87150545  0.942955253  1.0000000 -0.4719936
>         0.325584972
>         > [4,] -0.84048477 -0.533896414 -0.4719936  1.0000000
>         0.309414781
>         > [5,]  0.07477038  0.004627258  0.3255850  0.3094148
>         1.000000000
>         > 
>         > If your genes are in rows, you'll just need to pass in the
>         transpose  
>         > of your matrix.
>         > 
>         > HTH,
>         > 
>         > -steve
>         > 
>         > --
>         > Steve Lianoglou
>         > Graduate Student: Computational Systems Biology
>         >    |  Memorial Sloan-Kettering Cancer Center
>         >    |  Weill Medical College of Cornell University
>         > Contact Info: http://cbio.mskcc.org/~lianos/contact
>         > 
>         > _______________________________________________
>         > Bioconductor mailing list
>         > Bioconductor at stat.math.ethz.ch
>         > https://stat.ethz.ch/mailman/listinfo/bioconductor
>         > Search the archives:
>         http://news.gmane.org/gmane.science.biology.informatics.conductor
>         > 
>         
>         
> 
> 
> ______________________________________________________________________
> The INTERNET now has a personality. YOURS! See your Yahoo! Homepage.



More information about the Bioconductor mailing list