[BioC] normalization of a microarray like dataframe and removing missing data by % missing

ALAN SMITH alansmith2 at gmail.com
Tue Feb 13 22:10:50 CET 2007


Hello,
I have several questions about data normalization of a large matrix of
intensity data (21269,72) (non-microarray data).

summary(MYdata) #### example of data NOTE many NAs ########
       ID              a                  b                  c
 Min.   :    1   Min.   :    2003   Min.   :    2008   Min.   :    2001
 1st Qu.: 5318   1st Qu.:    4027   1st Qu.:    4155   1st Qu.:    4331
 Median :10635   Median :    7635   Median :    7570   Median :    8006
 Mean   :10635   Mean   :   57586   Mean   :   73246   Mean   :  101309
 3rd Qu.:15952   3rd Qu.:   17191   3rd Qu.:   18076   3rd Qu.:   18843
 Max.   :21269   Max.   :20335320   Max.   :30073282   Max.   :27649912
                 NA's   :   18323   NA's   :   18467   NA's   :   18471
##########################################################

What would be the best way to normalize or preprocess this type of
data (80%+ missing)? A log2 transformation gives similarly shaped
distributions with different medians. Currently I normalize like this:
  1. Divide each column by (column median / minimum of the column medians).
  2. Divide each row by (row median / minimum of the row medians).
  3. Repeat steps 1 and 2 one more time.
*The method I am using turns the distribution into a spike shape.*
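
To make the procedure concrete, here is a minimal sketch of the steps as described, on a small made-up matrix. The function name median_scale, the toy values, and the choice to ignore NAs when taking medians are assumptions for illustration, not the exact code used on the real data.

```r
# Hypothetical toy matrix standing in for the intensity columns (4 rows x 3 columns)
x <- matrix(c(2003, 4027,   NA,  7635,
              2008,   NA, 4155,  7570,
                NA, 4331, 8006, 18843), nrow = 4)

# Assumed helper: alternate column and row median scaling, n_iter times
median_scale <- function(m, n_iter = 2) {
  for (i in seq_len(n_iter)) {
    # divide each column by (column median / smallest column median)
    col_med <- apply(m, 2, median, na.rm = TRUE)
    m <- sweep(m, 2, col_med / min(col_med, na.rm = TRUE), "/")
    # divide each row by (row median / smallest row median)
    row_med <- apply(m, 1, median, na.rm = TRUE)
    m <- sweep(m, 1, row_med / min(row_med, na.rm = TRUE), "/")
  }
  m
}

x_norm <- median_scale(x)   # two passes, i.e. "repeat 1 more time"
```

In this sketch the last row step leaves every row with the same median, which may be part of why the distribution collapses toward a spike.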

Is the above method acceptable for future statistical
applications?  Are there better normalization methods I can use?

NA question
I would like to remove all of the rows with more than 30% missing
values before continuing with normalization, but I cannot figure out
how.  Is there a way to remove all rows that have, say, more than 30%
missing data?  If I could just count the number of NAs in a row and
divide it by the number of columns I would be in good shape, but I
can't figure out how to do this.  na.omit is too harsh and removes most
of the rows.
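
For illustration, the row-wise count described above (number of NAs divided by number of columns) might be sketched like this. The toy data frame, the 0.30 cutoff variable, and the decision to leave the ID column out of the count are assumptions, not something tested on the real data.

```r
# Hypothetical toy data frame; the real MYdata has 21269 rows and 72 columns
MYdata <- data.frame(ID = 1:4,
                     a  = c(2003,   NA, NA, 7635),
                     b  = c(2008, 4155, NA,   NA),
                     c  = c(2001, 4331, NA, 8006))

vals <- MYdata[, -1]              # assumption: ID is not counted as a data column
na_frac <- rowMeans(is.na(vals))  # fraction of missing values in each row
max_missing <- 0.30               # assumed cutoff
filtered <- MYdata[na_frac <= max_missing, ]  # keep rows at or below the cutoff
```

Here is.na() gives a logical matrix and rowMeans() treats TRUE as 1, so na_frac is the NA count divided by the number of columns.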


Thanks much,
Alan


