[BioC] Reading Affy CEL files

Mon Jun 3 18:00:20 CEST 2013

The oligo package has it's own implementation of RMA (?rma) that takes an
argument ("target") which allows you to change the level at which it
summarizes probesets on an exon array by either "probeset" (exon level) or
"core" (transcript level), as long as you have annotation to support it.

Ben

On 03/06/2013 15:52, "Ranjani R" <ranjanir at uw.edu> wrote:

>
>Ranjani R [guest] <guest at ...> writes:
>
>> 
>> 
>> I am a newbie to Affy. Thanks for your help.
>> 
>> I am processing CEL files through R (Affy package) and am having some
>basic issues that I am not finding
>> satisfactory answers to (have googled).
>> The chip used is hugene11stv1. I also am using the
>>hugene11stprobeset.db
>to try to do probeset â€“>
>> Symbol translation.
>> Essentially, I want to create a file with gene expression data, with
>genes * samples as my final matrix.
>> 
>> Code:
>> setwd(wDir);
>> Data <- ReadAffy();
>> eset <- rma(Data);
>> write.exprs(eset,file="geneExpData.txt", sep="\t", quote = F);
>> 
>> When I analyze the file written, I see that the number of columns is as
>>I 
>expect(number samples) but there are
>> 33,297 genes. 
>> Please help me understand a few fundamental aspects here:
>> 
>> 1. I tried translating these Affy IDs to gene symbols to see if that
>>would 
>make my analysis easier.
>>     Here are some things I tried
>> 
>>     Try 1:
>>     symbols <- getSYMBOL(as.character(expr.matrix[,1]),
>"hugene11stprobeset"); â€“>  Not quite
>> working. Only ~175 of the probeset IDs are getting translated.
>>     Try 2:
>>     symbs <- mget(featureNames(eset), hugene11stprobesetSYMBOL,
>>ifnotfound 
>=NA);
>>     symbs <- unlist(symbs)
>>     mat <- eset; # make a copy
>>     featureNames(mat) <- ifelse(!is.na(symbs), symbs, featureNames(mat))
>> 
>>     Many NAs.
>> 
>> Can you please help me understand what is happening here.
>> 
>>  -- output of sessionInfo():
>> 
>> R version 2.15.3 (2013-03-01)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>> 
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=C                 LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>> other attached packages:
>> [1] hugene11stv1cdf_2.3.0 affy_1.36.1           Biobase_2.18.0
>> [4] BiocGenerics_0.4.0
>> 
>> loaded via a namespace (and not attached):
>> [1] affyio_1.26.0         BiocInstaller_1.8.3   preprocessCore_1.20.0
>> [4] tools_2.15.3          zlibbioc_1.4.0
>> 
>> --
>> Sent via the guest posting facility at bioconductor.org.
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at ...
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
>> 
>
>
>Hi,
>Thanks for the response. I tried using oligo +
>hugene11sttranscriptcluster.db to annotate my data. Here is what I've
>tried 
>to do (after some googling).
>
>With the code I executed, 33,197 genes got collapsed to about ~21,000+
>genes.
>
>I had some questions:
>I found no vignette for how to do this 'summing up of probesets'. Is this
>approach taken right (code attached)?
>If anybody has a better approach/code snippet they can share - that would
>be 
>great too.
>
>Essentially, I need to take the data, run RMA on it - so that I can then
>do 
>my differential expression analysis (amongst other approaches)
>
>Can you please guide me here. I have attached my code snippets below.
>
>Thanks much!
>
>Ranjani
>_________________________________________
>
>Code executed:
>--------------
>
>1. Convert the probeset to gene Names
>
>gMap <- hugene11sttranscriptclusterSYMBOL
>mProbes <- mappedkeys(gMap)
>probNames <- rownames(expr.matrix)
>gMap1 <- as.list(gMap[mProbes])
>
>2. Get an expression matrix of mapped probes
>
>mappedProbes <- sort(which( (probNames %in% names(gMap1)) == T))
>any( mappedProbes != sort(mappedProbes)) # FALSE
>notMappedProbes <- which( (probNames %in% names(gMap1)) == F)
>mappedNames <- as.character(gMap1[ which(names(gMap1) %in%
>rownames(expr.matrix)[mappedProbes]) ])
>rownames(expr.matrix)[mappedProbes] <- mappedNames
>expr.matrix <- expr.matrix[-notMappedProbes,,drop=F]
>
>3. Get unique genes. Here, if a gene occurs more than once, I take the
>mean 
>expression value (is this right)?
>
># get unique genes matrix by taking mean expression of gene names that
>appear more than one time
>gn.nms <- rownames(expr.matrix) <- toupper(rownames(expr.matrix))
>expr.matrix.unique <- matrix(0, nrow = length(unique(gn.nms)), ncol =
>ncol(expr.matrix))
>seen.gns <- character();
>cnt <- 0;
>for (i in 1:dim(expr.matrix)[1]){
>	cr.gn <- gn.nms[i]
>	if (cr.gn %in% seen.gns){
>		next	
>	} else{
>		seen.gns <- c(seen.gns,cr.gn)
>		ix <- which(gn.nms %in% cr.gn)
>		if(length(ix)==0) {
>			stop("length(ix) = 0 at i=",ix[1]," gene name is:",
>cr.gn)
>		}
>		if(length(ix)>1){
>			cnt <- cnt + 1;
>			print("Taking column means for gene");
>			print(cr.gn);
>			expr.matrix.unique[cnt,] <- colMeans(d[ix,])
>		} else {
>			cnt <- cnt + 1
>			expr.matrix.unique[cnt,] <- d[ix,]
>		}
>	}
>}
>rownames(expr.matrix.unique) <- seen.gns
>
>
>4. Now expr.matrix.uniqe should have what I want. Genes * Samples.
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at r-project.org
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives:
>http://news.gmane.org/gmane.science.biology.informatics.conductor