[BioC] Reading Affy CEL files

Ranjani R ranjanir at uw.edu
Mon Jun 3 16:52:03 CEST 2013


Ranjani R [guest] <guest at ...> writes:

> 
> 
> I am a newbie to Affy. Thanks for your help.
> 
> I am processing CEL files through R (Affy package) and am having some 
basic issues that I am not finding
> satisfactory answers to (have googled).
> The chip used is hugene11stv1. I also am using the hugene11stprobeset.db 
to try to do probeset –>
> Symbol translation.
> Essentially, I want to create a file with gene expression data, with  
genes * samples as my final matrix.  
> 
> Code:
> setwd(wDir);
> Data <- ReadAffy();
> eset <- rma(Data);
> write.exprs(eset,file="geneExpData.txt", sep="\t", quote = F);
> 
> When I analyze the file written, I see that the number of columns is as I 
expect(number samples) but there are
> 33,297 genes. 
> Please help me understand a few fundamental aspects here:
> 
> 1. I tried translating these Affy IDs to gene symbols to see if that would 
make my analysis easier.
>     Here are some things I tried
> 
>     Try 1:
>     symbols <- getSYMBOL(as.character(expr.matrix[,1]), 
"hugene11stprobeset"); –>  Not quite
> working. Only ~175 of the probeset IDs are getting translated.
>     Try 2:
>     symbs <- mget(featureNames(eset), hugene11stprobesetSYMBOL, ifnotfound 
=NA);
>     symbs <- unlist(symbs)
>     mat <- eset; # make a copy
>     featureNames(mat) <- ifelse(!is.na(symbs), symbs, featureNames(mat))
> 
>     Many NAs.
> 
> Can you please help me understand what is happening here. 
> 
>  -- output of sessionInfo(): 
> 
> R version 2.15.3 (2013-03-01)
> Platform: x86_64-unknown-linux-gnu (64-bit)
> 
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] hugene11stv1cdf_2.3.0 affy_1.36.1           Biobase_2.18.0
> [4] BiocGenerics_0.4.0
> 
> loaded via a namespace (and not attached):
> [1] affyio_1.26.0         BiocInstaller_1.8.3   preprocessCore_1.20.0
> [4] tools_2.15.3          zlibbioc_1.4.0
> 
> --
> Sent via the guest posting facility at bioconductor.org.
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at ...
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 


Hi,
Thanks for the response. I tried using oligo + 
hugene11sttranscriptcluster.db to annotate my data. Here is what I've tried 
to do (after some googling).

With the code I executed, 33,197 genes got collapsed to about ~21,000+ 
genes.

I had some questions:
I found no vignette for how to do this 'summing up of probesets'. Is this 
approach taken right (code attached)? 
If anybody has a better approach/code snippet they can share - that would be 
great too.

Essentially, I need to take the data, run RMA on it - so that I can then do 
my differential expression analysis (amongst other approaches)

Can you please guide me here. I have attached my code snippets below.

Thanks much!

Ranjani
_________________________________________

Code executed:
--------------

1. Convert the probeset to gene Names

gMap <- hugene11sttranscriptclusterSYMBOL
mProbes <- mappedkeys(gMap)
probNames <- rownames(expr.matrix)
gMap1 <- as.list(gMap[mProbes]) 

2. Get an expression matrix of mapped probes

mappedProbes <- sort(which( (probNames %in% names(gMap1)) == T))
any( mappedProbes != sort(mappedProbes)) # FALSE
notMappedProbes <- which( (probNames %in% names(gMap1)) == F)
mappedNames <- as.character(gMap1[ which(names(gMap1) %in% 
rownames(expr.matrix)[mappedProbes]) ])
rownames(expr.matrix)[mappedProbes] <- mappedNames
expr.matrix <- expr.matrix[-notMappedProbes,,drop=F] 

3. Get unique genes. Here, if a gene occurs more than once, I take the mean 
expression value (is this right)?

# get unique genes matrix by taking mean expression of gene names that 
appear more than one time
gn.nms <- rownames(expr.matrix) <- toupper(rownames(expr.matrix))
expr.matrix.unique <- matrix(0, nrow = length(unique(gn.nms)), ncol = 
ncol(expr.matrix))
seen.gns <- character();
cnt <- 0;
for (i in 1:dim(expr.matrix)[1]){
	cr.gn <- gn.nms[i]
	if (cr.gn %in% seen.gns){
		next	
	} else{
		seen.gns <- c(seen.gns,cr.gn)
		ix <- which(gn.nms %in% cr.gn)
		if(length(ix)==0) {
			stop("length(ix) = 0 at i=",ix[1]," gene name is:", 
cr.gn)
		}
		if(length(ix)>1){
			cnt <- cnt + 1;
			print("Taking column means for gene");
			print(cr.gn);
			expr.matrix.unique[cnt,] <- colMeans(d[ix,])
		} else {
			cnt <- cnt + 1
			expr.matrix.unique[cnt,] <- d[ix,]
		}
	}
}
rownames(expr.matrix.unique) <- seen.gns


4. Now expr.matrix.uniqe should have what I want. Genes * Samples.



More information about the Bioconductor mailing list