[BioC] Combining expressionSets from GEO

Wed Jan 30 20:57:16 CET 2008

On Jan 30, 2008 2:44 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> Francois Pepin <fpepin at cs.mcgill.ca> writes:
>
> > Hi Martin,
> >
> > Thanks for the help. I managed to fix the issue by resetting all of the
> > levels on both sides (having everything as characters should work too):
> >
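> > for (i in seq_along(pData(tmp[[1]]))) {
> >   # rebuild each column on the union of values; assigning to levels()
> >   # directly would rename existing levels by position, not add new ones
> >   a <- as.character(pData(tmp[[1]])[,i])
> >   b <- as.character(pData(tmp[[2]])[,i])
> >   pData(tmp[[1]])[,i] <- factor(a, levels=union(a, b))
> >   pData(tmp[[2]])[,i] <- factor(b, levels=union(a, b))
> > }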
> >
> > The next question is where this would best be taken care of.
> > I really don't see why this should not be handled behind the
> > scenes.
> >
> > The two main options I see are for getGEO() to return phenoData columns
> > as characters instead of factors, or for combine() to know how to deal
> > with factors properly for ExpressionSets.

I have thought about doing just this.  However, downstream analyses
based on ExpressionSets will probably rely on having factors for
grouping, so I haven't done so.  This could also be done at the level
of the ExpressionSet by enforcing that all factors are converted to
character on creation of new ExpressionSets.  Again, I don't think
this is an optimal solution.
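
As a rough illustration of the character-conversion option, the sketch
below converts every factor column of a plain data.frame to character
using base R only. The helper name factorsToCharacter is hypothetical
(it is not part of Biobase or GEOquery); for an ExpressionSet one would
presumably apply it to the pData and assign the result back.

```r
# Hypothetical helper (not part of Biobase or GEOquery): convert every
# factor column of a data.frame to a character vector, leaving other
# columns and the row names untouched.
factorsToCharacter <- function(df) {
    df[] <- lapply(df, function(col) {
        if (is.factor(col)) as.character(col) else col
    })
    df
}

# stringsAsFactors=TRUE reproduces the factor columns discussed in this thread
x <- data.frame(x=1:5, y=letters[1:5], row.names=letters[1:5],
                stringsAsFactors=TRUE)
xc <- factorsToCharacter(x)
class(xc$y)   # "character"
```

The cost is the one mentioned above: any downstream code that expects
factors for grouping would have to convert the columns back.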

> combine does know how to deal with factors properly -- the levels are
> different, so the columns (usually) can't be combined. But I
> appreciate the sentiment, and the issue has come up on the mailing
> list three times since 2.1, so it is a common occurrence. I've made
> another attempt at improving the documentation, and will work on a
> better set of warnings for the next release of Bioconductor.

It seems like a compromise solution might be to generate warnings on
differing factor levels, but to go ahead and rectify the differences
within combine() by creating a new factor based on the combined
character representations of the offending columns.  Are there any
intrinsic problems with doing that?  This would maintain factor
columns as factors, but allow combine to "do the right thing" with
regard to those columns.  And, in case "do the right thing" isn't the
"right thing", warnings will be generated, alerting the user of the
issue.  The alternative is to ask the user to do all this manually.
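
A minimal sketch of what that compromise could look like on plain
data.frames, using base R only. This is not the Biobase implementation;
the helper name harmonizeFactorLevels is made up for illustration, and
it only handles columns that are factors in both inputs.

```r
# Hypothetical helper, not the Biobase implementation: give the shared
# factor columns of two data.frames a common set of levels, warning for
# each column that has to be recoded.
harmonizeFactorLevels <- function(x, y) {
    shared <- intersect(names(x), names(y))
    for (nm in shared) {
        if (is.factor(x[[nm]]) && is.factor(y[[nm]]) &&
            !identical(levels(x[[nm]]), levels(y[[nm]]))) {
            warning("column '", nm, "': factor levels differ; recoding")
            lv <- union(levels(x[[nm]]), levels(y[[nm]]))
            # Rebuild from the character values; assigning to levels()
            # would rename the existing levels instead of adding new ones.
            x[[nm]] <- factor(as.character(x[[nm]]), levels=lv)
            y[[nm]] <- factor(as.character(y[[nm]]), levels=lv)
        }
    }
    list(x=x, y=y)
}

x <- data.frame(x=1:5, y=factor(letters[1:5]), row.names=letters[1:5])
y <- data.frame(z=3:7, y=factor(letters[3:7]), row.names=letters[3:7])
fixed <- harmonizeFactorLevels(x, y)   # warns about column 'y'
levels(fixed$x$y)                      # "a" "b" "c" "d" "e" "f" "g"
```

After this step the two data.frames are in the same state as the
"factors have the same levels" example quoted further down, so the
columns stay factors and the user still sees a warning.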

> > If the former is chosen, I think it would probably be worth adjusting
> > the documentation about combine to mention this issue. As an unrelated
> > note, the ExpressionSet documentation refers to the eSet's. Since eSet
> > is going away at some point, that might be worth changing.
>
> Actually, 'eSet' is a class that 'ExpressionSet' extends; 'eSet' is
> not going away, and many of the data slots and methods on
> ExpressionSet are inherited from eSet so it's appropriate to
> reference the eSet documentation for these. The 'exprSet' class is no
> longer supported.
>
> Thanks for your input,
>
> Martin
>
>
> > Francois
> >
> > On Wed, 2008-01-30 at 10:54 -0800, Martin Morgan wrote:
> >> So part of the bug fix was an attempt to make the error message more
> >> informative, and it's not really clear that I've done that!
> >>
> >> The traceback makes it clear that the problem is with the pData (and
> >> not, for instance, varMetadata or featureData) of the two arrays.
> >>
> >> Some hints are provided by the warnings, by the ?combine help page,
> >>
> >>      'combine(data.frame, data.frame)' Combines two 'data.frame'
> >>           objects so that the resulting 'data.frame' contains all rows
> >>           and columns of the original objects. Rows and columns in the
> >>           returned value are unique, that is, a row or column
> >>           represented in both arguments is represented only once in the
> >>           result. To perform this operation, 'combine' makes sure that
> >>           data in shared rows and columns is identical in the two
> >>           data.frames. Data differences in shared rows and columns cause
> >>           an error. 'combine' issues a warning when a column is a
> >>           'factor' and the levels of the factor in the two
> >>           'data.frame's are different; the returned value may be
> >>           recoded.
> >>
> >> and by the results of
> >>
> >> > example(combine)
> >>
> >> particularly the last lines which are trying to illustrate your
> >> problem:
> >>
> >> combin>   # y is converted to 'factor' with different levels
> >> combin>   x <- data.frame(x=1:5,y=letters[1:5], row.names=letters[1:5])
> >>
> >> combin>   y <- data.frame(z=3:7,y=letters[3:7], row.names=letters[3:7])
> >>
> >> combin>   try(combine(x,y))
> >> Error in combine(x, y) : data.frames contain conflicting data:
> >>      non-conforming colname(s): y
> >> In addition: Warning messages:
> >> 1: In alleq(levels(x[[nm]]), levels(y[[nm]])) : 5 string mismatches
> >> 2: In switch(class(x[[nm]])[[1]], factor = { :
> >>   data frame column 'y' levels not all.equal
> >>
> >> The data.frame column 'y' is a 'factor' (rather than a character
> >> vector) and combine doesn't know how to resolve a column that has 'c'
> >> encoded as level 3 of a factor with one that has 'c' encoded as level
> >> 1.
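
The encoding mismatch described here can be seen directly in base R. A
small sketch; the variable names f1 and f2 are only for illustration:

```r
# The same value 'c' is stored under a different integer code in each
# factor, because each factor has its own set of levels.
f1 <- factor(letters[1:5])   # levels a, b, c, d, e
f2 <- factor(letters[3:7])   # levels c, d, e, f, g
as.integer(f1)[f1 == "c"]    # 3
as.integer(f2)[f2 == "c"]    # 1
```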
> >>
> >> One solution is to ensure that columns that are really character
> >> vectors are stored as such
> >>
> >> > x <- data.frame(x=1:5,y=I(letters[1:5]), row.names=letters[1:5])
> >> > y <- data.frame(z=3:7,y=I(letters[3:7]), row.names=letters[3:7])
> >> > combine(x,y)
> >>    x y  z
> >> a  1 a NA
> >> b  2 b NA
> >> c  3 c  3
> >> d  4 d  4
> >> e  5 e  5
> >> f NA f  6
> >> g NA g  7
> >>
> >> or that factors have the same levels
> >>
> >> > y1 <- factor(letters[1:5], levels=letters[1:7])
> >> > y2 <- factor(letters[3:7], levels=letters[1:7])
> >> > x <- data.frame(x=1:5, y=y1, row.names=letters[1:5])
> >> > y <- data.frame(z=3:7, y=y2, row.names=letters[3:7])
> >> > combine(x,y)
> >>    x y  z
> >> a  1 a NA
> >> b  2 b NA
> >> c  3 c  3
> >> d  4 d  4
> >> e  5 e  5
> >> f NA f  6
> >> g NA g  7
> >>
> >> Martin
> >>
> >> Francois Pepin <fpepin at cs.mcgill.ca> writes:
> >>
> >> > Hi Martin,
> >> >
> >> > I think it is related, as I now have a different error message along
> >> > with a series of warnings. 255 and 98 refer to the number of samples in
> >> > each ExpressionSet. 66 and 21 refer to the number of unique elements in
> >> > source_name_ch1 in the phenodata.
> >> >
> >> >> tmp2<-combine(tmp[[1]],tmp[[2]])
> >> > Error in .local(x, y, ...) :
> >> >   data.frames contain conflicting data:
> >> >         non-conforming colname(s): title, geo_accession,
> >> > source_name_ch1, description, supplementary_file
> >> > In addition: Warning messages:
> >> > 1: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
> >> >   Lengths (255, 98) differ (string compare on first 98)98 string
> >> > mismatches
> >> > 2: In switch(class(x[[nm]])[[1]], factor = { :
> >> >   data frame column 'title' levels not all.equal
> >> > 3: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
> >> >   Lengths (255, 98) differ (string compare on first 98)98 string
> >> > mismatches
> >> > 4: In switch(class(x[[nm]])[[1]], factor = { :
> >> >   data frame column 'geo_accession' levels not all.equal
> >> > 5: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
> >> >   Lengths (66, 21) differ (string compare on first 21)21 string
> >> > mismatches
> >> > 6: In switch(class(x[[nm]])[[1]], factor = { :
> >> >   data frame column 'source_name_ch1' levels not all.equal
> >> > 7: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
> >> >   Lengths (255, 98) differ (string compare on first 98)98 string
> >> > mismatches
> >> > 8: In switch(class(x[[nm]])[[1]], factor = { :
> >> >   data frame column 'description' levels not all.equal
> >> > 9: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
> >> >   Lengths (255, 98) differ (string compare on first 98)98 string
> >> > mismatches
> >> > 10: In switch(class(x[[nm]])[[1]], factor = { :
> >> >   data frame column 'supplementary_file' levels not all.equal
> >> >
> >> >> traceback()
> >> > 9: stop("data.frames contain conflicting data:", "\n\tnon-conforming
> >> > colname(s): ",
> >> >        paste(sharedCols[!ok], collapse = ", "))
> >> > 8: .local(x, y, ...)
> >> > 7: combine(pDataX, pDataY)
> >> > 6: combine(pDataX, pDataY)
> >> > 5: .local(x, y, ...)
> >> > 4: combine(phenoData(x), phenoData(y))
> >> > 3: combine(phenoData(x), phenoData(y))
> >> > 2: combine(tmp[[1]], tmp[[2]])
> >> > 1: combine(tmp[[1]], tmp[[2]])
> >> >
> >> >> sessionInfo()
> >> > R version 2.6.0 (2007-10-03)
> >> > x86_64-unknown-linux-gnu
> >> >
> >> > locale:
> >> > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> >> >
> >> > attached base packages:
> >> > [1] tools     stats     graphics  grDevices utils     datasets  methods
> >> > [8] base
> >> >
> >> > other attached packages:
> >> > [1] GEOquery_2.2.0 RCurl_0.8-1    Biobase_1.16.2
> >> >
> >> > loaded via a namespace (and not attached):
> >> > [1] rcompgen_0.1-15
> >> >
> >> > Francois
> >> >
> >> > On Wed, 2008-01-30 at 10:03 -0800, Martin Morgan wrote:
> >> >> Hi Francois -- this might be related to a bug in Biobase that has been
> >> >> fixed. Can you try to update your Biobase, either biocLite('Biobase')
> >> >> or following the directions at http://bioconductor.org/download ? If
> >> >> not, can you provide the output of traceback() after the error occurs?
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Martin
> >> >>
> >> >> Francois Pepin <fpepin at cs.mcgill.ca> writes:
> >> >>
> >> >> > Hi everyone,
> >> >> >
> >> >> > I'm getting an error message when trying to combine two parts of a GSE
> >> >> > object:
> >> >> >
> >> >> >> tmp<-getGEO('GSE3526',GSEMatrix=TRUE)
> >> >> >> tmp2<-combine(tmp[[1]],tmp[[2]])
> >> >> > Error in alleq(levels(x[[nm]]), levels(y[[nm]])) && alleq(x
> >> >> > [sharedRows,  :
> >> >> >   invalid 'x' type in 'x && y'
> >> >> >
> >> >> > Checking to make sure that I should be able to combine them (from the
> >> >> > eSet documentation):
> >> >> >
> >> >> > #eSets must have identical numbers of 'featureNames'
> >> >> >> all(featureNames(tmp[[1]])==featureNames(tmp[[2]]))
> >> >> > [1] TRUE
> >> >> >
> >> >> > #must have distinct 'sampleNames'
> >> >> >> any(sampleNames(tmp[[1]])%in%sampleNames(tmp[[2]]))
> >> >> > [1] FALSE
> >> >> >
> >> >> > #and must have identical 'annotation'.
> >> >> >> annotation(tmp[[1]])==annotation(tmp[[2]])
> >> >> > [1] TRUE
> >> >> >
> >> >> >> sessionInfo()
> >> >> > R version 2.6.0 (2007-10-03)
> >> >> > x86_64-unknown-linux-gnu
> >> >> >
> >> >> > locale:
> >> >> > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> >> >> >
> >> >> > attached base packages:
> >> >> > [1] tools     stats     graphics  grDevices utils     datasets  methods
> >> >> > [8] base
> >> >> >
> >> >> > other attached packages:
> >> >> > [1] GEOquery_2.2.0 RCurl_0.8-1    Biobase_1.16.0
> >> >> >
> >> >> > loaded via a namespace (and not attached):
> >> >> > [1] rcompgen_0.1-15
> >> >> >
> >> >> > Does anyone know why that is happening and if there would be any way
> >> >> > around it?
> >> >> >
> >> >> > Francois
> >> >> >
> >> >> > _______________________________________________
> >> >> > Bioconductor mailing list
> >> >> > Bioconductor at stat.math.ethz.ch
> >> >> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> >> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M2 B169
> Phone: (206) 667-2793