[BioC] Read single channel GenePix in limma [was: Analyze miRNA experiment in Bioconductor]

Paul Geeleher paulgeeleher at gmail.com
Mon May 19 15:46:52 CEST 2008


Ah yes thanks Daniel, I think thats problem solved. Anyway as I said,
here's what I've what I've come up with in analyzing genepix miRNA
data in bioconductor. I doubt this code is perfect so comments and
improvements are encouraged...


# A script for calculating differential expression of miRNA from
single channel genepix data
# Paul Geeleher

library(limma)
library(RColorBrewer)

# create a color pallete for boxplots
cols <- brewer.pal(12, "Set3")

# This defines the column name of the mean Cy3 foreground intensites
Cy3 <- "F532 Mean"

# This defines the column name of the mean Cy3 background intensites
Cy3b <- "B532 Mean"

# Read the targets file (see limmaUsersGuide for info on how to create this)
targets <- readTargets("targets.csv")

# read values from gpr file
# because there is no red channel, read Rb & R to be the same values as G and Gb
# this is a type of hack that allows the function to work
RG <- read.maimages( targets$FileName,
source="genepix",columns=list(R=Cy3,G=Cy3, Rb=Cy3b, Gb=Cy3b))

# remove the extraneous red channel values
RG$R <- NULL
RG$Rb <- NULL


# create a pData for the data just read
# this indicates which population each array belongs to
pData <- data.frame(population = c('a', 'a', 'a', 'a', 'b', 'b', 'b'))
rownames(pData) <-  RG$targets$FileName

# create design matrix
design <- model.matrix(~factor(pData$population))

# In my .gpr files all miRNAs contain the string "miR" in their name
# so the grep function can be used to extract all of these, removing
# all control signals and printing buffers etc.
# You need to check your .gpr files to find which signals you should extract.
miRs <- grep("miR", RG$genes$Name)
RG.final <- RG[miRs, ]


# boxplot of foreground and background intensities, useful for quality control
boxplot(data.frame(log2(RG.final$Gb)),main="Background", col=cols)
boxplot(data.frame(log2(RG.final$G)),main="Foreground", col=cols)

# load vsn library, this contains the normalization functions
library('vsn')

# Do VSN normalization and output as vns object
mat <- vsnMatrix(RG.final$G)

# view a boxplot of the normalized data
boxplot(as.data.frame(mat at hx), main="Normalized intensities", col=cols)

# insert rownames into matrix with normalized data
# this will mean that the gene names will appear in our final output
rownames(mat at hx) <- RG.final$genes$Name

# my .gpr files contain 4 "within-array replicates" of each miRNA.
# We need to make full use of this information by calculating the
duplicate correlation
# in order to use the duplicateCorrelation() function below,
# we need to make sure that the replicates of each gene appear in
sequence in this matrix,
# so we sort the normalized values so the replicate groups of 4 miRs
appear in sequence
mat at hx <- mat at hx[order(rownames(mat at hx)), ]

# calculate duplicate correlation between the 4 replicates on each array
corfit <- duplicateCorrelation(mat, design, ndups=4)

# show a boxplot of the correlations
boxplot(tanh(corfit$atanh.correlations))


# fit the linear model, including info on duplicates and correlation
fit <- lmFit(mat, design, ndups=4, correlation=corfit$consensus)

# calculate values using ebayes
ebayes <- eBayes(fit)

# output a list of top differnetially expressed genes...
topTable(ebayes, coef = 2, adjust = "BH", n = 100)






On Mon, May 19, 2008 at 1:35 PM, Daniel Brewer <daniel.brewer at icr.ac.uk> wrote:
> Not the most experienced in this area, but I think your design is pretty
> much like section 8.5 in the Limma's user guide.
>
>       "There are two major ways in which this comparison can be made.
> We can
>  1. create a design matrix which includes a coefficient for the mutant
> vs wild type difference,
>     or
>  2. create a design matrix which includes separate coefficients for wild
> type and mutant mice and then extract the difference as a contrast."
>
> So in your situation you would have either
>> design
>   plus   plusvsminus
> a  1      0
> b  1      0
> c  1      0
> d  1      0
> e  1      1
> f  1      1
> g  1      1
>> fit <- lmFit(eset, design)
>> fit <- eBayes(fit)
>> topTable(fit, coef="plusvsminus", adjust="BH")
>
> or
>   plus   minus
> a  1      0
> b  1      0
> c  1      0
> d  1      0
> e  0      1
> f  0      1
> g  0      1
>> fit <- lmFit(eset, design)
>> cont.matrix <- makeContrasts(plusvsminus=plus-minus, levels=design)
>> fit2 <- contrasts.fit(fit, cont.matrix)
>> fit2 <- eBayes(fit2)
>> topTable(fit2, adjust="BH")
>
> Hope that helps.
>
> Paul Geeleher wrote:
>> Ok I'm close to having this all sorted and have the
>> duplicateCorrelation() function working. For posterity I'd be happy to
>> post a the code and detailed explanation from everything I've done to
>> the mailing list. It should be useful for anyone doing similar MiRNA
>> analysis in future. My last question is regarding the design matrix,
>> I'm not sure if I'm creating it properly.
>>
>> I have 7 arrays, a, b, c, d, e, f and g. I want to measure
>> differential expression of arrays a, b, c & d against e, f & g.
>>
>> In some of the tutorials in the limmaUsersGuide, this design matrix is
>> simply created as follows:
>>
>> design <- c(-1, -1, -1, -1, 1, 1, 1)
>>
>> I've also seen it created using methods like this:
>>
>> pData <- data.frame(population = c('HER2+', 'HER2+', 'HER2+', 'HER2+',
>> 'HER2-', 'HER2-', 'HER2-'))
>> rownames(pData) <-  RG$targets$FileName
>> design <- model.matrix(~factor(pData$population))
>>
>>
>> These two different methods give me very different p-values come the
>> end of analysis and I'm wondering what exactly I should be doing?
>>
>> Thanks for any advice,
>>
>> -Paul.
>>
>> On Thu, May 15, 2008 at 1:40 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>> Dear Paul,
>>>
>>> I have no experience with miRNA arrays, so cannot give you any specific
>>> advice.
>>>
>>> With ordinary expression arrays, my practice is to keep the probes separate,
>>> especially if they actually have differing sequences.  There's no good
>>> summarisation method which feeds into the topTable format.  I convert to
>>> genes or transcript only at the interpretation stage.  The only exception
>>> are within array replicates which satisfy the (restrictive) assumptions of
>>> the duplicateCorrelation() function.
>>>
>>> Best wishes
>>> Gordon
>>>
>>> On Wed, 14 May 2008, Paul Geeleher wrote:
>>>
>>>> Hi Gordon,
>>>>
>>>> Thanks for you email. I've followed your steps and am getting some output
>>>> now.
>>>>
>>>> One problem though. When should the summarization step occur? What I
>>>> mean is that, between miRNA and control signals, my GPR file contains
>>>> about 3000 entries and when I am done with analysis topTable will
>>>> return all of these individually. But many of the miRNAs have multiple
>>>> entries in the ".gpr" file. So how, and when, should I go about
>>>> combining these into one value?
>>>>
>>>> Thanks in advance,
>>>> -Paul
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> --
> **************************************************************
>
> Daniel Brewer
>
> Institute of Cancer Research
> Molecular Carcinogenesis
> MUCRC
> 15 Cotswold Road
> Sutton, Surrey SM2 5NG
> United Kingdom
>
> Tel: +44 (0) 20 8722 4109
> Fax: +44 (0) 20 8722 4141
>
> Email: daniel.brewer at icr.ac.uk
>
> **************************************************************
>
> The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP.
>
> This e-mail message is confidential and for use by the...{{dropped:3}}



More information about the Bioconductor mailing list