[BioC] stuck with affy/ limma

James MacDonald jmacdon at med.umich.edu
Wed Mar 31 19:44:05 CEST 2010



Maxim wrote:
> Hi,
> 
> I have a question concerning the analysis of some Affymetrix chips. I
> downloaded some of the data from GEO GSE11324 (see below). I'm stuck
> after identifying the probesets with significant changes: I have
> problems assigning probeset-specific gene names as well as getting the
> genomic coordinates. Furthermore, I have no clue how to deal with the fact
> that most genes have several probesets with differing transcriptional
> outcomes.
> 
> 
> I did this based on the affy and limma manuals like:
> 
> targets file:
> Name FileName Target
> 0h1 GSM286031.CEL control
> 0h2 GSM286032.CEL control
> 0h3 GSM286033.CEL control
> 3h1 GSM286034.CEL three
> 3h2 GSM286035.CEL three
> 3h3 GSM286036.CEL three
> 6h1 GSM286037.CEL six
> 6h2 GSM286038.CEL six
> 6h3 GSM286039.CEL six
> 
> 
> library(affy)
> library(limma)
> library(vsn)
> 
> pd <- read.AnnotatedDataFrame("er_for_affy.txt", header = TRUE, row.names = 2)
> pData(pd)
> #### load
> a <- ReadAffy(filenames = rownames(pData(pd)), phenoData = pd, verbose = TRUE)
> #### normalize
> x <- expresso(a, bg.correct = FALSE, normalize.method = "vsn",
>               normalize.param = list(subsample = 1000),
>               pmcorrect.method = "pmonly", summary.method = "medianpolish")
> ### genes with highest variation
> library(hgu133plus2.db)
> rsd <- apply(exprs(x), 1, sd)
> sel <- order(rsd, decreasing = TRUE)[1:250]
> 
> 
> u<-mget(row.names(exprs(x)[sel,]),hgu133plus2SYMBOL)
> heatmap(exprs(x)[sel, ], labRow=u)
> ### in this case it works to extract the gene symbol
> 
> 
> ### limma contrasts
> design <- model.matrix(~ -1+factor(c(1,1,1,2,2,2,3,3,3)))
> colnames(design) <- c("control", "three", "six")
> fit <- lmFit(x, design)
> meanSdPlot(x)
> contrast.matrix <- makeContrasts(three-control, six-control, levels=design)
> fit2 <- contrasts.fit(fit, contrast.matrix)
> fit2 <- eBayes(fit2)
> #### top list
> topTable(fit2, coef=1, adjust="BH", number=20, sort.by="M")
> library(hgu133plus2.db)
> u<-mget(row.names(fit2),hgu133plus2SYMBOL)
> 
> How can I produce a topTable result with the corresponding gene names? Somehow
> I do not understand the genelist argument.

The genelist argument expects you to supply a vector of gene names, one 
per probeset and in the same order as the rows of the fit (which is what 
it says in the help page for this function).

genelist <- unlist(mget(featureNames(x), hgu133plus2SYMBOL))

topTable(fit2, coef = 1, number = 20, sort.by = "M", genelist = genelist)
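
One thing to watch with unlist() here: if any probeset has more than one entry in the list, the flattened vector becomes longer than the number of probesets and no longer lines up with the rows of the fit. A minimal sketch of a safer flattening, using a made-up list in place of the real mget() result (the probeset IDs and symbols below are invented for illustration):

```r
# Toy stand-in for the list returned by mget(); IDs and symbols are made up.
probe_ids <- c("p1_at", "p2_at", "p3_at")
sym_list <- list(
  p1_at = "GENEA",
  p2_at = c("GENEB", "GENEC"),   # a probeset with two entries
  p3_at = NA_character_          # a probeset with no annotation
)

# Collapse each entry to one string so there is exactly one element per probeset.
collapse_one <- function(s) {
  s <- s[!is.na(s)]
  if (length(s) == 0) NA_character_ else paste(s, collapse = ";")
}
genelist <- vapply(sym_list[probe_ids], collapse_one, character(1))

# unlist() would give 4 elements here, genelist has 3 (one per probeset).
length(unlist(sym_list))
length(genelist)
```

The collapsed vector can then be passed as genelist in the same way as above.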

> 
> Next, I would like to produce a standard clustering of the "fold changes"
> observed within (averaged) contrasts 1 (three - control) and 2 (six -
> control) and a heatmap presentation of the results. How to extract for
> example all fold-changes of those genes with a p-value<0.001 in at least one
> of the two contrasts?

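# decideTests() thresholds BH-adjusted p-values by default;
# use adjust.method = "none" if you want to cut on raw p-values
rslt <- decideTests(fit2, p.value = 0.001)
ind <- apply(rslt, 1, function(x) any(x != 0))
fit2$coefficients[ind, ]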
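
To illustrate the selection and the clustering step that was asked about, here is a minimal sketch on toy matrices. The matrix 'rslt' only mimics the -1/0/1 layout of a decideTests() result and 'coefs' stands in for the coefficient matrix of the fit; all IDs and numbers are invented.

```r
# Toy stand-ins: 'rslt' mimics decideTests() output, 'coefs' mimics the
# coefficient matrix of the fit. All values are invented.
ids <- c("p1_at", "p2_at", "p3_at", "p4_at", "p5_at")
cn  <- c("three - control", "six - control")
rslt  <- matrix(c(0, 1, 0, -1, 0,
                  0, 0, 0,  1, -1),
                ncol = 2, dimnames = list(ids, cn))
coefs <- matrix(c(0.1, 2.0, -0.2, -1.5,  0.2,
                  0.0, 0.3,  0.1,  1.8, -2.2),
                ncol = 2, dimnames = list(ids, cn))

# Keep probesets called significant in at least one contrast.
ind <- rowSums(rslt != 0) > 0
sig <- coefs[ind, , drop = FALSE]

# Hierarchical clustering of the selected log fold changes.
hc <- hclust(dist(sig))
rownames(sig)[hc$order]

# heatmap(sig) would draw the same selection as a heatmap.
```

The drop = FALSE keeps 'sig' a matrix even if only one probeset passes the cutoff.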

> 
> I seem to get the coordinates of the genes with
> v_start <- mget(row.names(fit2), hgu133plus2CHRLOC)
> v_end <- mget(row.names(fit2), hgu133plus2CHRLOCEND)
> But again I do not know how to attach them to my fit2 object or topTable
> results. Furthermore, there are many probesets with multiple mappings; should
> these not be excluded from the analysis?

That's up to you.
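
If you keep them, one option for the "maximum region" is to take the smallest start and the largest end over all mappings of a probeset. A minimal sketch on made-up lists shaped like the output of mget() on the CHRLOC and CHRLOCEND maps; it assumes that all mappings of a probeset lie on the same chromosome, which you would need to check on real data. The helper name widest_span is hypothetical.

```r
# Toy stand-ins for mget() results on the CHRLOC / CHRLOCEND maps; values invented.
# Negative positions denote the antisense strand, as in the annotation packages.
starts <- list(p1_at = c(1000, 1200), p2_at = c(-5000, -5300), p3_at = NA)
ends   <- list(p1_at = c(2500, 2400), p2_at = c(-6100, -6000), p3_at = NA)

# Hypothetical helper: widest interval covered by any mapping of one probeset.
# Assumes every mapping of the probeset is on the same chromosome.
widest_span <- function(s, e) {
  pos <- abs(c(s, e))                      # drop the strand sign
  if (all(is.na(pos))) {
    return(c(start = NA_real_, end = NA_real_))
  }
  c(start = min(pos, na.rm = TRUE), end = max(pos, na.rm = TRUE))
}

# One row per probeset, columns 'start' and 'end'.
spans <- t(mapply(widest_span, starts, ends))
spans
```

Because 'spans' has one row per probeset, it can be combined with a coefficient matrix or a topTable result by matching on the probeset IDs.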

> 
> Actually, in the end I'd like to get the corresponding genes' coordinates in
> a way that gives the maximum region size, e.g. in the case of genes with
> multiple TSSs.
> 
> As mentioned above, I do not know how to deal with the fact that many genes
> are represented by multiple probesets, often with different fold changes.
> Is there a general recipe to deal with this question? Furthermore, there are
> many probesets with multiple mappings; should these not be excluded from the
> analysis?

I believe the BioC case studies cover some of these points. One method 
you might consider is to select the reporters with the largest value for 
whatever test statistic you are using. See the findLargest() function in 
the genefilter package.
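
The idea behind that selection can be sketched in base R on toy data. This is only an illustration of the principle, not the implementation of findLargest(); the IDs, symbols and statistics are invented, and using the absolute value of the statistic is an assumption of this sketch.

```r
# Toy data: probeset IDs, gene symbols and t-statistics are all invented.
stats <- c(p1_at = 2.1, p2_at = -6.3, p3_at = 4.0, p4_at = 0.5)
genes <- c(p1_at = "GENEA", p2_at = "GENEA", p3_at = "GENEB", p4_at = NA)

# Ignore probesets without a gene symbol.
keep <- !is.na(genes)

# For each gene, keep the probeset with the largest absolute statistic.
best <- vapply(
  split(names(stats)[keep], genes[keep]),
  function(ids) ids[which.max(abs(stats[ids]))],
  character(1)
)
best
```

The resulting vector of probeset IDs, named by gene, can be used to subset the expression set or the fit to one row per gene.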

Another alternative is to use the MBNI remapped CDFs, which essentially 
remove these redundancies.

Best,

Jim


> 
> I know it's a lot of questions, so is there a general source of information
> that may help me to overcome the hurdles?
> 
> Maxim
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Hildebrandt Lab
8220D MSRB III
1150 W. Medical Center Drive
Ann Arbor MI 48109-5646
734-936-8662