[BioC] affy 2.0 (fwd)

Fri, 23 Aug 2002 00:36:45 -0400 (EDT)

Forwarded on request of Rafael ....

Given the heavy usage of affy by members of this list, it might be of
interest.

---------- Forwarded message ----------
Date: Fri, 23 Aug 2002 00:07:09 -0400 (EDT)
From: Rafael A. Irizarry <ririzarr@jhsph.edu>
Reply-To: rafa@jhu.edu
To: biocore@stat.math.ethz.ch
Subject: affy 2.0

hi! for the next version of affy i would like to have just one main class.
because the pkg is a merge of two, we have redundancy: there are two
approaches for storing probe level data. this is extra work because we
have to make sure methods work for both. regardless of the approach we
decide on, we will have the same methods, so the user should not see the
difference. i need help deciding which approach is more convenient.

i'll use "chips" instead of "arrays" so that we don't get confused with
what R calls arrays.

approach 1: for each chip we keep a matrix (Cel) where the row 10, column
12 entry represents the probe intensity read from the physical row 10,
column 12 position on the chip. we then keep three dimensional arrays to
represent multiple chip experiments. to know what position goes with what
gene, a separate class (Cdf) is defined that contains a matrix with the
gene names for each entry in the probe intensity matrix. so the row 10,
column 12 entry in the Cdf matrix gives the gene name for the probe in
the row 10, column 12 entry in the Cel matrix. the Cdf class also contains
the necessary info to know what's PM and what's MM.
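
a minimal sketch of the approach 1 layout, written in python rather than R
just to show the shape of the data. everything here is an assumption made
for illustration: the 2x3 chip, the intensities, and the names (experiment,
cdf_names, cdf_is_pm, intensities_for_gene) are hypothetical and are not
affy classes or functions.

```python
# approach 1 sketch: one intensity matrix per chip, plus a separate layout (Cdf).
# all names and numbers are hypothetical; this is not the affy API.

# Cel: intensities indexed by physical (row, column) position on one chip
cel_chip1 = [
    [110.0, 95.0, 300.0],
    [80.0, 70.0, 210.0],
]
cel_chip2 = [
    [120.0, 101.0, 280.0],
    [85.0, 66.0, 190.0],
]

# several chips: a three dimensional array indexed [chip][row][column]
experiment = [cel_chip1, cel_chip2]

# Cdf: same shape as a Cel matrix; entry (r, c) names the gene probed at (r, c)
cdf_names = [
    ["geneA", "geneA", "geneB"],
    ["geneA", "geneA", "geneB"],
]

# Cdf also records which positions are PM (True) and which are MM (False)
cdf_is_pm = [
    [True, True, True],
    [False, False, False],
]


def intensities_for_gene(experiment, cdf_names, cdf_is_pm, gene, want_pm=True):
    """For each chip, return the intensities of the gene's PM (or MM) probes."""
    positions = [
        (r, c)
        for r, row in enumerate(cdf_names)
        for c, name in enumerate(row)
        if name == gene and cdf_is_pm[r][c] == want_pm
    ]
    return [[cel[r][c] for (r, c) in positions] for cel in experiment]


pm_gene_a = intensities_for_gene(experiment, cdf_names, cdf_is_pm, "geneA")
mm_gene_b = intensities_for_gene(experiment, cdf_names, cdf_is_pm, "geneB",
                                 want_pm=False)
```

note that every lookup by gene has to go through the Cdf matrices first to
find the positions, and only then into the intensity data.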

approach 2: we keep the pm data in a matrix with rows representing probes
and columns representing chips, and similarly for mm. to know what row goes
with what gene we keep a vector with the gene names. to know what gene is
in row, say, 10 we simply look at the 10th entry in the name vector.
similarly we have vectors with the probe numbers, x positions, and y
positions.
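
a minimal sketch of the approach 2 layout, again in python rather than R.
the data and the names (gene_names, probe_numbers, x_pos, y_pos,
rows_for_gene, subset_by_gene) are hypothetical, and treating x_pos and
y_pos as the positions of the pm probes is an assumption of the sketch.

```python
# approach 2 sketch: rows are probes, columns are chips.
# all names and numbers are hypothetical; this is not the affy API.

pm = [
    [110.0, 120.0],  # probe 1 of geneA on chip 1, chip 2
    [95.0, 101.0],   # probe 2 of geneA
    [300.0, 280.0],  # probe 1 of geneB
]
mm = [
    [80.0, 85.0],
    [70.0, 66.0],
    [210.0, 190.0],
]

# parallel vectors: entry i describes row i of pm and mm
gene_names = ["geneA", "geneA", "geneB"]
probe_numbers = [1, 2, 1]
x_pos = [0, 1, 2]  # assumed: physical column of the pm probe
y_pos = [0, 0, 0]  # assumed: physical row of the pm probe


def rows_for_gene(gene_names, gene):
    """Indices of the rows that belong to the given gene."""
    return [i for i, name in enumerate(gene_names) if name == gene]


def subset_by_gene(matrix, gene_names, gene):
    """Keep only the rows of a probe-by-chip matrix that belong to the gene."""
    return [matrix[i] for i in rows_for_gene(gene_names, gene)]


gene_in_third_row = gene_names[2]
pm_gene_a = subset_by_gene(pm, gene_names, "geneA")
mm_gene_b = subset_by_gene(mm, gene_names, "geneB")
```

here subsetting by gene is a plain row filter, as long as the same filter is
applied to every parallel vector.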

an advantage of approach 1 is that we don't need to keep the x,y (physical
position on the chip) information separately, because the matrix indices
are the positions. a disadvantage is that subsetting by genes and creating
"fake" instances can be confusing because we need to control 2 classes
(Cel, Cdf).

an advantage of approach 2 is that the pms and mms are readily available
and subsetting by genes is easy. as a consequence, creating "fake"
instances is easy. a disadvantage is that we need extra slots to keep the
physical position information and that we are a bit farther away from
the raw data.
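
to make the "extra slots" point concrete, here is a rough sketch (python,
not R) of getting back to the physical layout of one chip from approach 2
data. the names (chip_image, x_pos, y_pos), the 2x3 chip size, and the
convention that x is the column and y is the row are all assumptions of the
sketch.

```python
# rebuilding a chip "image" from probe-by-chip data plus position vectors.
# hypothetical data and names; this is not the affy API.

pm = [
    [110.0, 120.0],  # rows are probes, columns are chips
    [95.0, 101.0],
    [300.0, 280.0],
]
x_pos = [0, 1, 2]  # assumed: physical column of each probe
y_pos = [0, 0, 0]  # assumed: physical row of each probe


def chip_image(values, x_pos, y_pos, chip, n_rows, n_cols):
    """Put one chip's probe values back at their physical positions.

    Cells with no probe in `values` are left as None.
    """
    image = [[None] * n_cols for _ in range(n_rows)]
    for probe, (x, y) in enumerate(zip(x_pos, y_pos)):
        image[y][x] = values[probe][chip]
    return image


image_chip1 = chip_image(pm, x_pos, y_pos, chip=0, n_rows=2, n_cols=3)
```

without x_pos and y_pos this reconstruction is not possible, which is why
approach 2 has to carry them along.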

at first i was leaning toward approach 1 because it's closer to the raw
data... now i'm a bit worried about difficulties with subsetting by genes,
and how it affects "genes for hire".

any opinions? suggestions?

rafael
