[BioC] Biobase ExpressionSet: metadata on assayData

Martin Morgan mtmorgan at fhcrc.org
Fri Dec 14 17:41:04 CET 2007


Hi Eric --

* ExpressionSet

ExpressionSet itself is meant for gene expression data.

The 'assay' data is essentially a matrix of 'features' (genes /
probes) x 'phenotypes' (samples). The assay data is annotated on both
features and phenotypes.

The phenotypes are annotated with the AnnotatedDataFrame in the
phenoData slot. This would typically include all the information about
experimental design relevant to the samples.

The features _can_ be annotated with the AnnotatedDataFrame in the
featureData slot.  However, for expression data, features and their
annotations are usually common across chips. For this reason the
annotations are usually stored independently of the assay data, in the
so-called 'annotation' packages named after the chip and referenced by
the 'annotation' slot in the expression set.

Finally, information about the overall experiment summarized in the
assay data can be stored in the container in the experimentData slot.
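
As a rough illustration of how these slots fit together, here is a
minimal sketch that fills each of them at construction. The probe,
sample and gene names, the compound / dose columns and the MIAME
entries are all made up for the example; "hgu95av2" is used only as a
label in the annotation slot, and nothing is looked up in the
annotation package itself.

```r
library(Biobase)

# Hypothetical toy data: 4 features (probes) x 3 samples.
mat <- matrix(c(7.1, 8.3, 6.4, 9.0,
                7.4, 8.1, 6.9, 9.2,
                7.0, 8.8, 6.2, 9.5),
              nrow = 4,
              dimnames = list(paste0("probe", 1:4), paste0("sample", 1:3)))

# Sample annotation: one row per sample, row names matching colnames(mat).
pd <- data.frame(compound = c("vehicle", "A", "A"),
                 dose = c(0, 10, 30),
                 row.names = colnames(mat))
# varMetadata describes the columns of pd (one row per column of pd).
pmeta <- data.frame(labelDescription = c("compound applied",
                                         "dose level (made-up units)"),
                    row.names = colnames(pd))
phenoData <- new("AnnotatedDataFrame", data = pd, varMetadata = pmeta)

# Optional feature annotation: one row per feature.
fd <- data.frame(symbol = c("geneA", "geneB", "geneC", "geneD"),
                 row.names = rownames(mat))
featureData <- new("AnnotatedDataFrame", data = fd)

# Experiment-level description is held in a MIAME object.
experimentData <- new("MIAME", name = "A. Analyst", lab = "Example lab",
                      title = "Toy dose-response experiment")

eset <- new("ExpressionSet", exprs = mat, phenoData = phenoData,
            featureData = featureData, experimentData = experimentData,
            annotation = "hgu95av2")

dim(eset)          # features x samples
pData(eset)        # sample annotation
fData(eset)        # feature annotation
annotation(eset)   # name of the chip annotation package

# Subsetting on a phenoData column keeps assay data and annotation in sync.
treated <- eset[, eset$dose > 0]
```

The point of the design is the last line: because the sample metadata
travels with the assay data, selecting samples by compound or dose
cannot get out of step with the expression matrix.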

* Actually used?

A typical single-channel microarray work flow starts with ReadAffy
followed by pre-processing. The output is an ExpressionSet. The main
downstream analytic pathways either expect or work with
ExpressionSet. Many users probably rely implicitly or explicitly on
ExpressionSet, and there are dozens of data sets from actual analyses
on the Bioconductor web site. So yes, they're actually used.

It is not hard to create a rudimentary expression set starting from
scratch:

> library(Biobase)
> m <- matrix(runif(100000), ncol=10)
> e <- new("ExpressionSet", exprs=m)

Of course there is no metadata, but that can be added either at
construction or subsequently (as described in one of the Biobase
vignettes, An Introduction to Biobase and Expression Sets).
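
For the 'subsequently' route, one possible sketch follows. It assumes
the matrix is given dimnames so that samples can be matched by name;
the treatment column and the feature / sample names are hypothetical,
and "hgu95av2" is again just a label.

```r
library(Biobase)

# Same kind of bare object as above, but with explicit feature and sample names.
m <- matrix(runif(100000), ncol = 10,
            dimnames = list(paste0("feature", 1:10000),
                            paste0("sample", 1:10)))
e <- new("ExpressionSet", exprs = m)

# Hypothetical design: 5 control and 5 treated samples.
pd <- data.frame(treatment = rep(c("control", "treated"), each = 5),
                 row.names = sampleNames(e))
meta <- data.frame(labelDescription = "treatment group (hypothetical)",
                   row.names = "treatment")

# Attach the sample annotation and the chip label after construction.
phenoData(e) <- new("AnnotatedDataFrame", data = pd, varMetadata = meta)
annotation(e) <- "hgu95av2"

# Check that sample names still agree between assay data and phenoData.
validObject(e)

table(e$treatment)
```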

* Data other than microarrays

ExpressionSet is meant for summarized gene expression
data. ExpressionSet is derived from an underlying class eSet. Projects
interested in other types of data have used eSet (and
AnnotatedDataFrame) as a basis for packaging other data types (e.g.,
the flowCore projects looking at flow cytometry). This is great,
because the adoption of common data structures can greatly facilitate
interoperability.
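
The class relationship can be checked directly. This is only a small
sketch on a made-up 5 x 4 matrix; it shows that an ExpressionSet is an
eSet and that the generic eSet accessors apply to it, not how any
particular project builds its own derived class.

```r
library(Biobase)

# ExpressionSet inherits from eSet; eSet itself is a virtual class.
extends("ExpressionSet", "eSet")
isVirtualClass("eSet")

# Hypothetical toy object with 5 features and 4 samples.
m <- matrix(runif(20), ncol = 4,
            dimnames = list(paste0("f", 1:5), paste0("s", 1:4)))
e <- new("ExpressionSet", exprs = m)

# Accessors defined for eSet work on any derived class.
is(e, "eSet")
sampleNames(e)
featureNames(e)
```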

Hope that helps,

Martin

"Eric Lecoutre" <ericlecoutre at gmail.com> writes:

> Hi,
>
> I am new to Bioconductor and am studying both Biobase and biostatistics for
> a small project.
> My client wants to know whether he should use ExpressionSet for part of his
> assay R&D process.
> For an experiment, I understand there is a lot of common metadata like
> compound, dose level, replicate,...
> I have seen the pheno and feature data frame class AnnotatedDataFrame and
> already told the client he could use that.
> Fact is that those metadata (if I have understood well) also could be used
> for gene expression (so assayData).
> What is the standard Bioconductor way to handle those metadata? There is
> no metadata argument associated with assayData.
> Should I use an AnnotatedDataFrame for feature, repeating gene expression
> with such metadata?
>
> btw, are there people here who really use ExpressionSet in their processes?
>
> Thanks for any insight.
>
>
> Eric
>
>
> PS: as I looked at AnnotatedDataFrame class, I missed a helper function to
> exploit metadata.
> Here is such a little function and a sample use, where one requests for
> variables in AnnotatedDataFrame with conditions on metadata (arbitrary ones,
> handled by dots ...)
>
>
>
>
> selectVariables <- function(x, logic=all, drop=FALSE, ...){
>   # criteria are name=value pairs; names refer to varMetadata columns
>   listCriteria <- list(...)
>   metadata <- varMetadata(x)
>   retainedCriteria <- list()
>   for (critname in names(listCriteria)) {
>     if (!critname %in% colnames(metadata)) {
>       cat("\n Dropped criteria:", critname, "not in AnnotatedDataFrame\n")
>     } else {
>       # a NULL criterion keeps every value of that metadata column
>       wanted <- listCriteria[[critname]]
>       if (is.null(wanted)) wanted <- unique(metadata[, critname])
>       retainedCriteria[[critname]] <- metadata[, critname] %in% wanted
>     }
>   }
>   # one row per variable, one column per retained criterion
>   criteriaValues <- do.call("cbind", retainedCriteria)
>   selectedColumns <- apply(criteriaValues, 1, logic)
>   cat('\n', sum(selectedColumns), ' columns selected.\n', sep='')
>   return(selectedColumns)
> }
>
>
>
>
> library(Biobase)
> # preparing metadata
> treatment=c("D","192","233","192","233")
> control=c(1,0,0,0,0)
> dose=c(NA,30,10,10,0.3)
> replicate=rep(1,5)
> metadata <- data.frame(
>   cbind(treatment=treatment,control=control,dose=dose,replicate=replicate,
>   labelDescription=paste("treatment: ",treatment,
>     ifelse(control==1, " [control]","")," dose:",dose,
>     "(",replicate,")",sep='')))
>
>   data1=data.frame(cbind(v1=1:2,v2=2:3,v3=3:4,v4=4:5,v5=5:6))
> anData1 = new("AnnotatedDataFrame",data=data1,varMetadata=metadata)
>
>
> # use little function to create a subset data.frame
>
> anData1[,selectVariables(anData1,dose=10, dummy=0)]
>
>
>
>
>
> -- 
> Eric Lecoutre
> Consultant - Business & Decision
> Business Intelligence & Customer Intelligence

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793


