[BioC] Combining expressionSets from GEO

Martin Morgan mtmorgan at fhcrc.org
Wed Jan 30 20:44:25 CET 2008


Francois Pepin <fpepin at cs.mcgill.ca> writes:

> Hi Martin,
>
> Thanks for the help. I managed to fix the issue by resetting all of the
> levels on both sides (having everything as characters should work too):
>
> for (i in seq_along(pData(tmp[[1]]))) {
>   a <- as.character(pData(tmp[[1]])[[i]])
>   b <- as.character(pData(tmp[[2]])[[i]])
>   lv <- union(a, b)  # shared level set; factor() re-encodes by label
>   pData(tmp[[1]])[[i]] <- factor(a, levels = lv)
>   pData(tmp[[2]])[[i]] <- factor(b, levels = lv)
> }
>
> The next question is where this would best be handled. I really don't
> see why it should not be taken care of behind the scenes.
>
> The two main options I see are for getGEO() to return phenoData columns
> as characters instead of factors, or for combine() to deal with factors
> properly for ExpressionSet.

combine does know how to deal with factors properly -- the levels are
different, so the columns (usually) can't be combined. But I
appreciate the sentiment, and the issue has come up on the mailing
list three times since 2.1, so it is a common occurrence. I've made
another attempt at improving the documentation, and will work on a
better set of warnings for the next release of Bioconductor.
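
A minimal base-R sketch of why differing levels get in the way, and of one
way to align them before calling combine. This is an illustration only, not
Biobase's implementation; the helper name harmonizeFactors is hypothetical.

```r
# Two factors that share the value "c" but store it under different codes.
f1 <- factor(letters[1:5])   # levels a..e, so "c" has integer code 3
f2 <- factor(letters[3:7])   # levels c..g, so "c" has integer code 1
as.integer(f1[f1 == "c"])    # 3
as.integer(f2[f2 == "c"])    # 1

# Hypothetical helper (not part of Biobase): re-encode the factor columns
# shared by two data.frames against the union of their levels.
harmonizeFactors <- function(x, y) {
  shared <- intersect(names(x), names(y))
  for (nm in shared) {
    if (is.factor(x[[nm]]) && is.factor(y[[nm]])) {
      lv <- union(levels(x[[nm]]), levels(y[[nm]]))
      # factor() matches values by label; assigning with `levels<-` would
      # only relabel the existing codes by position.
      x[[nm]] <- factor(as.character(x[[nm]]), levels = lv)
      y[[nm]] <- factor(as.character(y[[nm]]), levels = lv)
    }
  }
  list(x = x, y = y)
}

x <- data.frame(x = 1:5, y = factor(letters[1:5]), row.names = letters[1:5])
y <- data.frame(z = 3:7, y = factor(letters[3:7]), row.names = letters[3:7])
h <- harmonizeFactors(x, y)
identical(levels(h$x$y), levels(h$y$y))   # TRUE
```

With the levels aligned this way, the shared column is in the same state as
in the "factors have the same levels" example quoted further down, so
combine(h$x, h$y) should be expected to succeed in the same way.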

> If the former is chosen, I think it would probably be worth adjusting
> the documentation about combine to mention this issue. As an unrelated
> note, the ExpressionSet documentation refers to the eSet's. Since eSet
> is going away at some point, that might be worth changing.

Actually, 'eSet' is a class that 'ExpressionSet' extends; 'eSet' is
not going away, and many of the data slots and methods on
ExpressionSet are inherited from eSet, so it's appropriate to
reference the eSet documentation for these. The 'exprSet' class is no
longer supported.
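
The class relationship can be checked from R. A small sketch, assuming the
Biobase package is installed (the guard simply skips the checks if it is
not):

```r
# Assumes Biobase is installed; nothing is run otherwise.
if (requireNamespace("Biobase", quietly = TRUE)) {
  library(Biobase)
  # ExpressionSet is declared as a subclass of eSet ...
  print(extends("ExpressionSet", "eSet"))
  # ... so an (empty) ExpressionSet instance is also an eSet.
  es <- new("ExpressionSet")
  print(is(es, "eSet"))
}
```

Both checks should report TRUE, which is why methods documented for eSet
also apply to ExpressionSet objects.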

Thanks for your input,

Martin

> Francois
>
> On Wed, 2008-01-30 at 10:54 -0800, Martin Morgan wrote:
>> So part of the bug fix was an attempt to make the error message more
>> informative, and it's not really clear that I've done that!
>> 
>> The traceback makes it clear that the problem is with the pData (and
>> not, for instance, varMetadata or featureData) of the two arrays.
>> 
>> Some hints are provided by the warnings, by the ?combine help page, 
>> 
>>      'combine(data.frame, data.frame)' Combines two 'data.frame'
>>           objects so that the resulting 'data.frame' contains all rows
>>           and columns of the original objects. Rows and columns in the
>>           returned value are unique, that is, a row or column
>>           represented in both arguments is represented only once in the
>>           result. To perform this operation, 'combine' makes sure that
>>           data in shared rows and columns is identical in the two
>>           data.frames. Data differences in shared rows and columns cause
>>           an error. 'combine' issues a warning when a column is a
>>           'factor' and the levels of the factor in the two
>>           'data.frame's are different; the returned value may be
>>           recoded.
>> 
>> and by the results of
>> 
>> > example(combine)
>> 
>> particularly the last lines which are trying to illustrate your
>> problem: 
>> 
>> combin>   # y is converted to 'factor' with different levels
>> combin>   x <- data.frame(x=1:5,y=letters[1:5], row.names=letters[1:5])
>> 
>> combin>   y <- data.frame(z=3:7,y=letters[3:7], row.names=letters[3:7])
>> 
>> combin>   try(combine(x,y))
>> Error in combine(x, y) : data.frames contain conflicting data:
>> 	non-conforming colname(s): y
>> In addition: Warning messages:
>> 1: In alleq(levels(x[[nm]]), levels(y[[nm]])) : 5 string mismatches
>> 2: In switch(class(x[[nm]])[[1]], factor = { :
>>   data frame column 'y' levels not all.equal
>> 
>> The data.frame column 'y' is a 'factor' (rather than a character
>> vector) and combine doesn't know how to resolve a column that has 'c'
>> encoded as level 3 of a factor with one that has 'c' encoded as level
>> 1.
>> 
>> One solution is to ensure that columns that are really character
>> vectors are stored as such
>> 
>> > x <- data.frame(x=1:5,y=I(letters[1:5]), row.names=letters[1:5])
>> > y <- data.frame(z=3:7,y=I(letters[3:7]), row.names=letters[3:7])
>> > combine(x,y)
>>    x y  z
>> a  1 a NA
>> b  2 b NA
>> c  3 c  3
>> d  4 d  4
>> e  5 e  5
>> f NA f  6
>> g NA g  7
>> 
>> or that factors have the same levels
>> 
>> > y1 <- factor(letters[1:5], levels=letters[1:7])
>> > y2 <- factor(letters[3:7], levels=letters[1:7])
>> > x <- data.frame(x=1:5, y=y1, row.names=letters[1:5])
>> > y <- data.frame(z=3:7, y=y2, row.names=letters[3:7])
>> > combine(x,y)
>>    x y  z
>> a  1 a NA
>> b  2 b NA
>> c  3 c  3
>> d  4 d  4
>> e  5 e  5
>> f NA f  6
>> g NA g  7
>> 
>> Martin
>> 
>> Francois Pepin <fpepin at cs.mcgill.ca> writes:
>> 
>> > Hi Martin,
>> >
>> > I think it is related, as I now have a different error message along
>> > with a series of warnings. 255 and 98 refer to the number of samples in
>> > each ExpressionSet. 66 and 21 refer to the number of unique elements in
>> > source_name_ch1 in the phenodata.
>> >
>> >> tmp2<-combine(tmp[[1]],tmp[[2]])
>> > Error in .local(x, y, ...) :
>> >   data.frames contain conflicting data:
>> >         non-conforming colname(s): title, geo_accession,
>> > source_name_ch1, description, supplementary_file
>> > In addition: Warning messages:
>> > 1: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
>> >   Lengths (255, 98) differ (string compare on first 98)98 string
>> > mismatches
>> > 2: In switch(class(x[[nm]])[[1]], factor = { :
>> >   data frame column 'title' levels not all.equal
>> > 3: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
>> >   Lengths (255, 98) differ (string compare on first 98)98 string
>> > mismatches
>> > 4: In switch(class(x[[nm]])[[1]], factor = { :
>> >   data frame column 'geo_accession' levels not all.equal
>> > 5: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
>> >   Lengths (66, 21) differ (string compare on first 21)21 string
>> > mismatches
>> > 6: In switch(class(x[[nm]])[[1]], factor = { :
>> >   data frame column 'source_name_ch1' levels not all.equal
>> > 7: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
>> >   Lengths (255, 98) differ (string compare on first 98)98 string
>> > mismatches
>> > 8: In switch(class(x[[nm]])[[1]], factor = { :
>> >   data frame column 'description' levels not all.equal
>> > 9: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
>> >   Lengths (255, 98) differ (string compare on first 98)98 string
>> > mismatches
>> > 10: In switch(class(x[[nm]])[[1]], factor = { :
>> >   data frame column 'supplementary_file' levels not all.equal
>> >
>> >> traceback()
>> > 9: stop("data.frames contain conflicting data:", "\n\tnon-conforming
>> > colname(s): ",
>> >        paste(sharedCols[!ok], collapse = ", "))
>> > 8: .local(x, y, ...)
>> > 7: combine(pDataX, pDataY)
>> > 6: combine(pDataX, pDataY)
>> > 5: .local(x, y, ...)
>> > 4: combine(phenoData(x), phenoData(y))
>> > 3: combine(phenoData(x), phenoData(y))
>> > 2: combine(tmp[[1]], tmp[[2]])
>> > 1: combine(tmp[[1]], tmp[[2]])
>> >
>> >> sessionInfo()
>> > R version 2.6.0 (2007-10-03)
>> > x86_64-unknown-linux-gnu
>> >
>> > locale:
>> > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>> >
>> > attached base packages:
>> > [1] tools     stats     graphics  grDevices utils     datasets  methods
>> > [8] base
>> >
>> > other attached packages:
>> > [1] GEOquery_2.2.0 RCurl_0.8-1    Biobase_1.16.2
>> >
>> > loaded via a namespace (and not attached):
>> > [1] rcompgen_0.1-15
>> >
>> > Francois
>> >
>> > On Wed, 2008-01-30 at 10:03 -0800, Martin Morgan wrote:
>> >> Hi Francois -- this might be related to a bug in Biobase that has been
>> >> fixed. Can you try to update your Biobase, either biocLite('Biobase')
>> >> or following the directions at http://bioconductor.org/download ? If
>> >> not, can you provide the output of traceback() after the error occurs?
>> >> 
>> >> Thanks,
>> >> 
>> >> Martin
>> >> 
>> >> Francois Pepin <fpepin at cs.mcgill.ca> writes:
>> >> 
>> >> > Hi everyone,
>> >> >
>> >> > I'm getting an error message when trying to combine two parts of a GSE
>> >> > object:
>> >> >
>> >> >> tmp<-getGEO('GSE3526',GSEMatrix=TRUE)
>> >> >> tmp2<-combine(tmp[[1]],tmp[[2]])
>> >> > Error in alleq(levels(x[[nm]]), levels(y[[nm]])) && alleq(x
>> >> > [sharedRows,  :
>> >> >   invalid 'x' type in 'x && y'
>> >> >
>> >> > Checking to make sure that I should be able to combine them (from the
>> >> > eSet documentation):
>> >> >
>> >> > #eSets must have identical numbers of 'featureNames'
>> >> >> all(featureNames(tmp[[2]])==featureNames(tmp[[2]]))
>> >> > [1] TRUE
>> >> >
>> >> > #must have distinct 'sampleNames'
>> >> >> any(sampleNames(tmp[[1]])%in%sampleNames(tmp[[2]]))
>> >> > [1] FALSE
>> >> >
>> >> > #and must have identical 'annotation'.
>> >> >> annotation(tmp[[2]])==annotation(tmp[[2]])
>> >> > [1] TRUE
>> >> >
>> >> >> sessionInfo()
>> >> > R version 2.6.0 (2007-10-03)
>> >> > x86_64-unknown-linux-gnu
>> >> >
>> >> > locale:
>> >> > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>> >> >
>> >> > attached base packages:
>> >> > [1] tools     stats     graphics  grDevices utils     datasets  methods
>> >> > [8] base
>> >> >
>> >> > other attached packages:
>> >> > [1] GEOquery_2.2.0 RCurl_0.8-1    Biobase_1.16.0
>> >> >
>> >> > loaded via a namespace (and not attached):
>> >> > [1] rcompgen_0.1-15
>> >> >
>> >> > Does anyone know why that is happening and if there would be any way
>> >> > around it?
>> >> >
>> >> > Francois
>> >> >
>> >> > _______________________________________________
>> >> > Bioconductor mailing list
>> >> > Bioconductor at stat.math.ethz.ch
>> >> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793


