[BioC] Subsetting expression sets for mass spec data - second ask

McGee, Monnie mmcgee at mail.smu.edu
Tue Nov 27 04:55:18 CET 2012


Dear BioC Users,

I would like to subset a mass spectrometry data set to the peaks that were selected as 
important biomarkers. I followed the code in the PROcess vignette to obtain the biomarkers as follows: 

testNorm is a normalized matrix of intensities for 253 samples, with the m/z values as row names.
> bmkfile <- paste(getwd(), "testbiomarker.csv", sep = "/")
> testBio = pk2bmkr(peakfile, testNorm, bmkfile)
> mzs = as.numeric(rownames(testNorm))
> bks = getMzs(testBio) ## Should be "important" biomarkers for the Mass Spec data
> bks
 [1]  308.497  350.487  378.092  396.084  676.031 3994.780 4597.540 7046.840 7965.760 8128.160 8351.810 9184.330

I created the expression set in the following way:
> treat = ifelse(colnames(testNorm) < 300,"Control","Cancer")
> treatdf = as.data.frame(treat)
> rownames(treatdf)=colnames(testNorm)
> pdt = new("AnnotatedDataFrame",treatdf)
> mzdf = as.data.frame(rownames(testNorm))
> rownames(mzdf)=rownames(testNorm)
> mzfeat = new("AnnotatedDataFrame",mzdf)
> testES = new("ExpressionSet",exprs=testNorm,phenoData=pdt,featureData=mzfeat)
> varLabels(testES)
[1] "treat"
> table(pData(testES))
 Cancer Control 
    162      91 
> featureData(testES)
An object of class "AnnotatedDataFrame"
  featureNames: 300.033 300.356 ... 19995.5 (13297 total)
  varLabels: V1
  varMetadata: labelDescription

Figuring out how to build the eSet took at least an hour. The purpose of the eSet is to have an object 
that can be passed to MLearn for classification, for example: 
dldFS = MLearn(treat ~ ., testES2, dldaI, ...)
where testES2 is the eSet containing only the rows for the important biomarkers. I can't realistically 
run MLearn (especially with CV) on all 13K features in testES. Therefore, I would like to run MLearn on 
the biomarkers alone to determine whether they actually discriminate between the cancer and control 
samples. And, yes, this is the Petricoin ovarian cancer data set, for those of you who know 
your mass spec data.
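
To make the goal concrete, here is a minimal base-R sketch of estimating a leave-one-out 
misclassification rate with a small feature set. It is a hand-rolled diagonal LDA, not MLearn or 
dldaI from MLInterfaces, and the data are simulated. The names x, grp and dldaPredict are 
hypothetical and exist only for this illustration.

```r
set.seed(1)
# Simulated stand-in for a reduced data set: 6 features x 20 samples.
# Only feature 1 differs between the two classes (an assumption of this toy).
grp <- factor(rep(c("Control", "Cancer"), each = 10))
x <- matrix(rnorm(6 * 20), nrow = 6)
x[1, grp == "Cancer"] <- x[1, grp == "Cancer"] + 8

# Diagonal LDA: class means plus a pooled per-feature variance,
# then assign the new sample to the class with the smallest scaled distance.
dldaPredict <- function(train, labels, newx) {
  classes <- levels(labels)
  mu <- sapply(classes, function(k) rowMeans(train[, labels == k, drop = FALSE]))
  resid <- train - mu[, as.character(labels), drop = FALSE]
  s2 <- rowSums(resid^2) / (ncol(train) - length(classes))
  score <- colSums((newx - mu)^2 / s2)
  classes[which.min(score)]
}

# Leave-one-out CV: hold out each sample in turn and predict its class.
pred <- sapply(seq_len(ncol(x)), function(i) {
  dldaPredict(x[, -i, drop = FALSE], grp[-i], x[, i])
})
errRate <- mean(pred != as.character(grp))
print(errRate)
```

With real data, x would be the rows of the intensity matrix for the selected biomarkers and grp 
the Control/Cancer labels.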

Now I have an eSet with the rows labeled by the mass-to-charge ratios and the columns labeled by the samples. 
I would like to obtain a subset of testES using the 12 biomarkers (bks) found above. Ideally, the following 
would work: 
> testES2 = testES[featureData(testES) == bks,]

But I get the following error:
Error in testES[featureData(testES) == bks, ] : 
  error in evaluating the argument 'i' in selecting a method for function '[': Error in featureData(testES) == bks : 
  comparison (1) is possible only for atomic and list types

I tried making bks a character vector, but to no avail.  I also tried the following:
> testES2 =  testES[featureData(testES) %in% bks,]  ##(where bks is a character vector or not)
Error in testES[featureData(testES) %in% bks, ] : 
  error in evaluating the argument 'i' in selecting a method for function '[': Error in match(x, table, nomatch = 0L) : 
  'match' requires vector arguments
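
Both errors seem to arise because featureData(testES) is an AnnotatedDataFrame object rather than a 
plain vector, so == and %in% have nothing atomic to compare. A minimal sketch of the same idea on a 
toy matrix, comparing against the row names instead (toyNorm and toyBks are made-up names and 
values). For an ExpressionSet, featureNames(testES) should give the equivalent character vector; 
that part is an assumption and is not exercised here.

```r
# Toy stand-in for testNorm: rows are m/z values, columns are samples.
toyNorm <- matrix(1:12, nrow = 4,
                  dimnames = list(c("300.033", "300.356", "308.497", "350.487"),
                                  c("s1", "s2", "s3")))
toyBks <- c(308.497, 350.487)   # hypothetical biomarker m/z values

# Compare against the row names (a character vector), not an annotation object.
keep <- rownames(toyNorm) %in% as.character(toyBks)

# Logical row index; drop = FALSE keeps the result a matrix.
toySub <- toyNorm[keep, , drop = FALSE]
print(rownames(toySub))
```

This only works when the biomarker values match the row names exactly, which is not the case for 
aligned peaks.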

Part of the problem is (probably) that I am not using the correct syntax for subsetting an eSet on the 
basis of featureData. Another part is that the biomarkers do not have exact matches in 
featureData(testES), because they were obtained using a peak-finding algorithm that is supposed to 
align peaks across all 253 samples. So, how do I extract the important features (the biomarkers) from 
this eSet by their m/z ratios? Is there another (saner) way to use the biomarkers in a classification 
algorithm in order to determine the misclassification rate with this particular set of biomarkers?
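
Since the aligned peak locations need not coincide with any m/z value on the grid, one possible 
workaround is to pick, for each biomarker, the nearest m/z value. A minimal sketch under that 
assumption, using made-up values (toyMzs and toyPeaks are hypothetical, not taken from the data):

```r
# Toy m/z grid and peak locations (both hypothetical).
toyMzs <- c(300.033, 300.356, 308.41, 308.62, 350.40, 350.52)
toyPeaks <- c(308.497, 350.487)

# For each peak, the index of the closest m/z value on the grid.
nearest <- sapply(toyPeaks, function(p) which.min(abs(toyMzs - p)))

# Distance from each peak to its chosen grid point; worth checking
# against a tolerance so that a poor match does not go unnoticed.
offset <- abs(toyMzs[nearest] - toyPeaks)

print(toyMzs[nearest])
print(offset)
```

With mzs and bks as defined earlier, the same integer index could presumably be used as the row 
index of the eSet, as in testES[nearest, ]. Whether a single nearest point or a small window 
around each peak is the better representation depends on how the peaks were aligned.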

And, finally, the session info:
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] tools     grid      splines   stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] PROcess_1.32.0        Icens_1.28.0          survival_2.36-14      flowStats_1.14.0      flowWorkspace_1.2.0  
 [6] hexbin_1.26.0         IDPmisc_1.1.16        flowViz_1.20.0        XML_3.95-0            RBGL_1.32.1          
[11] graph_1.34.0          Cairo_1.5-2           cluster_1.14.2        mvoutlier_1.9.8       sgeostat_1.0-24      
[16] robCompositions_1.6.0 car_2.0-15            nnet_7.3-4            compositions_1.20-1   energy_1.4-0         
[21] MASS_7.3-21           boot_1.3-5            tensorA_0.36          rgl_0.92.892          fda_2.3.2            
[26] RCurl_1.95-0.1.2      bitops_1.0-4.1        Matrix_1.0-9          lattice_0.20-10       zoo_1.7-9            
[31] flowCore_1.22.3       rrcov_1.3-02          pcaPP_1.9-48          mvtnorm_0.9-9992      robustbase_0.9-4     
[36] Biobase_2.16.0        BiocGenerics_0.2.0   

loaded via a namespace (and not attached):
[1] feature_1.2.8       KernSmooth_2.23-8   ks_1.8.10           latticeExtra_0.6-24 RColorBrewer_1.0-5 
[6] stats4_2.15.1      


Thank you!
Monnie

Monnie McGee, PhD
Associate Professor
Statistical Science
Southern Methodist University
Office: 214-768-2462
Fax: 214-768-4035
Website: http://faculty.smu.edu/mmcgee

