[BioC] Reading Affy CEL files

James W. MacDonald jmacdon at uw.edu
Mon Jun 3 17:37:59 CEST 2013


Hi Ranjani,

On 6/3/2013 10:52 AM, Ranjani R wrote:
> Ranjani R [guest]<guest at ...>  writes:
>
>>
>> I am a newbie to Affy. Thanks for your help.
>>
>> I am processing CEL files through R (Affy package) and am having some
> basic issues that I am not finding
>> satisfactory answers to (have googled).
>> The chip used is hugene11stv1. I also am using the hugene11stprobeset.db
> to try to do probeset –>
>> Symbol translation.
>> Essentially, I want to create a file with gene expression data, with
> genes * samples as my final matrix.
>> Code:
>> setwd(wDir);
>> Data<- ReadAffy();
>> eset<- rma(Data);
>> write.exprs(eset,file="geneExpData.txt", sep="\t", quote = F);
>>
>> When I analyze the file written, I see that the number of columns is as I
> expect(number samples) but there are
>> 33,297 genes.
>> Please help me understand a few fundamental aspects here:
>>
>> 1. I tried translating these Affy IDs to gene symbols to see if that would
> make my analysis easier.
>>      Here are some things I tried
>>
>>      Try 1:
>>      symbols<- getSYMBOL(as.character(expr.matrix[,1]),
> "hugene11stprobeset"); –>   Not quite
>> working. Only ~175 of the probeset IDs are getting translated.
>>      Try 2:
>>      symbs<- mget(featureNames(eset), hugene11stprobesetSYMBOL, ifnotfound
> =NA);
>>      symbs<- unlist(symbs)
>>      mat<- eset; # make a copy
>>      featureNames(mat)<- ifelse(!is.na(symbs), symbs, featureNames(mat))
>>
>>      Many NAs.
>>
>> Can you please help me understand what is happening here.
>>
>>   -- output of sessionInfo():
>>
>> R version 2.15.3 (2013-03-01)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>   [7] LC_PAPER=C                 LC_NAME=C
>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] hugene11stv1cdf_2.3.0 affy_1.36.1           Biobase_2.18.0
>> [4] BiocGenerics_0.4.0
>>
>> loaded via a namespace (and not attached):
>> [1] affyio_1.26.0         BiocInstaller_1.8.3   preprocessCore_1.20.0
>> [4] tools_2.15.3          zlibbioc_1.4.0
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at ...
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> Hi,
> Thanks for the response. I tried using oligo +
> hugene11sttranscriptcluster.db to annotate my data. Here is what I've tried
> to do (after some googling).
>
> With the code I executed, 33,197 genes got collapsed to about ~21,000+
> genes.
>
> I had some questions:
> I found no vignette for how to do this 'summing up of probesets'. Is this
> approach taken right (code attached)?
> If anybody has a better approach/code snippet they can share - that would be
> great too.

I don't have a better approach, as 'better' is dependent on the goals at 
hand, and what assumptions you are willing to make.

I personally tend not to do what you have done, for the simple reason 
that the array is attempting to measure transcripts (which are different 
from genes), and two different probesets may well measure differentially 
spliced transcripts. If you naively collapse to genes, you may well be 
averaging two very different things together. Or maybe not. I just don't 
know, and am not willing to make the assumption that it is OK.

Anyway, if you really want to collapse to genes (and genes with gene 
symbols at that), there is already some infrastructure in place that 
will do something similar. The only difference is that the existing 
infrastructure will take duplicate Entrez Gene IDs and then select that 
one with the largest statistic (where you have to decide what that 
statistic is, but usually one would use a t-stat if you have two groups, 
or an F-stat if you have more than that).

library(genefilter)
eset <- nsFilter(eset)$eset
ind <- findLargest(eset, <some statistic here>, 
"hugene11sttranscriptcluster")
eset <- eset[ind,]

see ?nsFilter and ?findLargest for more information.

Best,

Jim



>
> Essentially, I need to take the data, run RMA on it - so that I can then do
> my differential expression analysis (amongst other approaches)
>
> Can you please guide me here. I have attached my code snippets below.
>
> Thanks much!
>
> Ranjani
> _________________________________________
>
> Code executed:
> --------------
>
> 1. Convert the probeset to gene Names
>
> gMap<- hugene11sttranscriptclusterSYMBOL
> mProbes<- mappedkeys(gMap)
> probNames<- rownames(expr.matrix)
> gMap1<- as.list(gMap[mProbes])
>
> 2. Get an expression matrix of mapped probes
>
> mappedProbes<- sort(which( (probNames %in% names(gMap1)) == T))
> any( mappedProbes != sort(mappedProbes)) # FALSE
> notMappedProbes<- which( (probNames %in% names(gMap1)) == F)
> mappedNames<- as.character(gMap1[ which(names(gMap1) %in%
> rownames(expr.matrix)[mappedProbes]) ])
> rownames(expr.matrix)[mappedProbes]<- mappedNames
> expr.matrix<- expr.matrix[-notMappedProbes,,drop=F]
>
> 3. Get unique genes. Here, if a gene occurs more than once, I take the mean
> expression value (is this right)?
>
> # get unique genes matrix by taking mean expression of gene names that
> appear more than one time
> gn.nms<- rownames(expr.matrix)<- toupper(rownames(expr.matrix))
> expr.matrix.unique<- matrix(0, nrow = length(unique(gn.nms)), ncol =
> ncol(expr.matrix))
> seen.gns<- character();
> cnt<- 0;
> for (i in 1:dim(expr.matrix)[1]){
> 	cr.gn<- gn.nms[i]
> 	if (cr.gn %in% seen.gns){
> 		next	
> 	} else{
> 		seen.gns<- c(seen.gns,cr.gn)
> 		ix<- which(gn.nms %in% cr.gn)
> 		if(length(ix)==0) {
> 			stop("length(ix) = 0 at i=",ix[1]," gene name is:",
> cr.gn)
> 		}
> 		if(length(ix)>1){
> 			cnt<- cnt + 1;
> 			print("Taking column means for gene");
> 			print(cr.gn);
> 			expr.matrix.unique[cnt,]<- colMeans(d[ix,])
> 		} else {
> 			cnt<- cnt + 1
> 			expr.matrix.unique[cnt,]<- d[ix,]
> 		}
> 	}
> }
> rownames(expr.matrix.unique)<- seen.gns
>
>
> 4. Now expr.matrix.uniqe should have what I want. Genes * Samples.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099



More information about the Bioconductor mailing list