[BioC] normalization of a microarray-like dataframe and removing missing data by % missing

michael watson (IAH-C) michael.watson at bbsrc.ac.uk
Tue Feb 13 22:59:41 CET 2007


This is a quick-and-dirty example of subsetting by the proportion of NA's:
 
# create some dummy missing data: log() of the negative values gives NaN,
# which is.na() treats just like NA
mat <- matrix(rnorm(50), nrow=10, ncol=5)
lmat <- log(mat)

# convert the output of is.na() to integers, keeping the same dimensions as the data
naint <- matrix(as.integer(is.na(lmat)), nrow=10, ncol=5)

# sum each row (counting the NA's, now 1's) and divide by the number of columns;
# if that fraction is less than or equal to 0.3, we keep the row
rows.want <- rowSums(naint) / ncol(naint) <= 0.3

# subset the data
lmat[rows.want, ]
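
The same filter can be written more compactly, relying on rowMeans() coercing
the logical matrix returned by is.na() to 0/1 (a minimal equivalent of the
above, not a different method):

keep <- rowMeans(is.na(lmat)) <= 0.3   # fraction of NA's per row
lmat[keep, , drop = FALSE]             # drop = FALSE keeps a matrix even if only one row survives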
 

________________________________

From: bioconductor-bounces at stat.math.ethz.ch on behalf of ALAN SMITH
Sent: Tue 13/02/2007 9:10 PM
To: bioconductor at stat.math.ethz.ch
Subject: [BioC] normalization of a microarray-like dataframe and removing missing data by % missing



Hello,
I have several questions about normalizing a large matrix of intensity
data (21269 rows x 72 columns; non-microarray data).

summary(MYdata)   #### example of data -- NOTE many NAs ####
       ID              a                    b                    c
 Min.   :    1   Min.   :    2003    Min.   :    2008    Min.   :    2001
 1st Qu.: 5318   1st Qu.:    4027    1st Qu.:    4155    1st Qu.:    4331
 Median :10635   Median :    7635    Median :    7570    Median :    8006
 Mean   :10635   Mean   :   57586    Mean   :   73246    Mean   :  101309
 3rd Qu.:15952   3rd Qu.:   17191    3rd Qu.:   18076    3rd Qu.:   18843
 Max.   :21269   Max.   :20335320    Max.   :30073282    Max.   :27649912
                 NA's   :   18323    NA's   :   18467    NA's   :   18471
##########################################################

What would be the best way to normalize or preprocess this type of
data (80%+ missing)? A log2 transformation creates nice "similar"-shaped
distributions with different medians. Currently I normalize like this
(a rough sketch in code follows this list):
 - divide the column data by (column median / minimum column median)
 - divide the row data by (row median / minimum row median)
 - repeat both steps one more time
*The method I am using turns the distribution into a spike shape.*
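
A minimal sketch of that scaling, assuming MYdata holds the intensities in
columns 2 onward (column 1 being the ID) and that NAs are simply ignored when
taking medians; the helper name scale_by_medians is made up for illustration:

# hypothetical helper: scale columns, then rows, by median ratios, repeated twice
scale_by_medians <- function(x, iterations = 2) {
  for (i in seq_len(iterations)) {
    col.med <- apply(x, 2, median, na.rm = TRUE)
    x <- sweep(x, 2, col.med / min(col.med, na.rm = TRUE), FUN = "/")
    row.med <- apply(x, 1, median, na.rm = TRUE)
    x <- sweep(x, 1, row.med / min(row.med, na.rm = TRUE), FUN = "/")
  }
  x
}

# norm <- scale_by_medians(as.matrix(MYdata[, -1]))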

Is the above method acceptable for future statistical
applications?  Are there better normalization methods I can use?
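
One alternative often used for intensity data is to log2-transform and then
quantile normalize the columns; a minimal sketch, assuming the limma package
is available and that its quantile normalization copes with this many missing
values:

library(limma)

# log2-transform the intensity columns (assuming column 1 is the ID),
# then put every column onto a common distribution
log.int  <- log2(as.matrix(MYdata[, -1]))
norm.int <- normalizeBetweenArrays(log.int, method = "quantile")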

NA question
I would like to remove all of the rows with more than 30% missing
values before continuing with normalization, but I cannot figure out
how.  Is there a way to remove all rows that have, say, more than 30%
missing data?  If I could just count the number of NAs in a row and
divide it by the number of columns I would be in good shape, but I
can't figure out how to do this.  na.omit() is too harsh and removes
most of the rows.


Thanks much,
Alan

_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


