[BioC] Combining expressionSets from GEO

Martin Morgan mtmorgan at fhcrc.org
Wed Jan 30 19:54:28 CET 2008


So part of the bug fix was an attempt to make the error message more
informative, and it's not really clear that I've done that!

The traceback makes it clear that the problem is with the pData (and
not, for instance, varMetadata or featureData) of the two arrays.

Some hints are provided by the warnings, by the ?combine help page, 

     'combine(data.frame, data.frame)' Combines two 'data.frame'
          objects so that the resulting 'data.frame' contains all rows
          and columns of the original objects. Rows and columns in the
          returned value are unique, that is, a row or column
          represented in both arguments is represented only once in the
          result. To perform this operation, 'combine' makes sure that
          data in shared rows and columns is identical in the two
          data.frames. Data differences in shared rows and columns cause
          an error. 'combine' issues a warning when a column is a
          'factor' and the levels of the factor in the two
          'data.frame's are different; the returned value may be
          recoded.

and by the results of

> example(combine)

particularly the last lines, which are meant to illustrate your
problem: 

combin>   # y is converted to 'factor' with different levels
combin>   x <- data.frame(x=1:5,y=letters[1:5], row.names=letters[1:5])

combin>   y <- data.frame(z=3:7,y=letters[3:7], row.names=letters[3:7])

combin>   try(combine(x,y))
Error in combine(x, y) : data.frames contain conflicting data:
	non-conforming colname(s): y
In addition: Warning messages:
1: In alleq(levels(x[[nm]]), levels(y[[nm]])) : 5 string mismatches
2: In switch(class(x[[nm]])[[1]], factor = { :
  data frame column 'y' levels not all.equal

The data.frame column 'y' is a 'factor' (rather than a character
vector), and combine doesn't know how to reconcile a column that has
'c' encoded as level 3 of a factor with one that has 'c' encoded as
level 1.
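
To make the encoding mismatch concrete, here is a small sketch in base R
only (no Biobase needed). The variable names code_in_x and code_in_y are
just illustrative, and stringsAsFactors = TRUE is spelled out as an
assumption because newer versions of R no longer create factors by
default.

```r
# Two data.frames whose 'y' columns are factors built from different strings.
# stringsAsFactors = TRUE is explicit: newer R versions default to FALSE.
x <- data.frame(y = letters[1:5], stringsAsFactors = TRUE)
y <- data.frame(y = letters[3:7], stringsAsFactors = TRUE)

# Each factor gets its own set of levels, derived from its own values.
levels_x <- levels(x$y)   # "a" "b" "c" "d" "e"
levels_y <- levels(y$y)   # "c" "d" "e" "f" "g"

# The same string 'c' is therefore stored under different integer codes.
code_in_x <- as.integer(x$y[x$y == "c"])   # 3
code_in_y <- as.integer(y$y[y$y == "c"])   # 1
```

The labels agree but the underlying codes do not, which is the situation
the warnings about levels not being all.equal are pointing at.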

One solution is to ensure that columns that are really character
vectors are stored as such

> x <- data.frame(x=1:5,y=I(letters[1:5]), row.names=letters[1:5])
> y <- data.frame(z=3:7,y=I(letters[3:7]), row.names=letters[3:7])
> combine(x,y)
   x y  z
a  1 a NA
b  2 b NA
c  3 c  3
d  4 d  4
e  5 e  5
f NA f  6
g NA g  7

or that factors have the same levels

> y1 <- factor(letters[1:5], levels=letters[1:7])
> y2 <- factor(letters[3:7], levels=letters[1:7])
> x <- data.frame(x=1:5, y=y1, row.names=letters[1:5])
> y <- data.frame(z=3:7, y=y2, row.names=letters[3:7])
> combine(x,y)
   x y  z
a  1 a NA
b  2 b NA
c  3 c  3
d  4 d  4
e  5 e  5
f NA f  6
g NA g  7
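
For a data.frame with many factor columns (such as the pData of an
ExpressionSet from GEO), the first approach could be applied to all
columns at once. The following is only a sketch: factors_to_character
is a hypothetical helper, not a Biobase function, and the commented-out
lines at the end are an untested suggestion for how it might be used on
the objects from this thread.

```r
# Hypothetical helper (not part of Biobase): convert every factor column
# of a data.frame to character, leaving all other columns untouched.
factors_to_character <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  if (any(is_fac)) {
    df[is_fac] <- lapply(df[is_fac], as.character)
  }
  df
}

# Small base-R check; stringsAsFactors = TRUE forces 'y' to be a factor.
x <- data.frame(x = 1:5, y = letters[1:5], row.names = letters[1:5],
                stringsAsFactors = TRUE)
x2 <- factors_to_character(x)

class(x$y)    # "factor"
class(x2$y)   # "character"

# With Biobase loaded, one might then try (untested here):
#   pData(tmp[[1]]) <- factors_to_character(pData(tmp[[1]]))
#   pData(tmp[[2]]) <- factors_to_character(pData(tmp[[2]]))
#   tmp2 <- combine(tmp[[1]], tmp[[2]])
```

Converting to character discards the factor levels, so this only makes
sense for columns like title or geo_accession that are really free text.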

Martin

Francois Pepin <fpepin at cs.mcgill.ca> writes:

> Hi Martin,
>
> I think it is related, as I now have a different error message along
> with a series of warnings. 255 and 98 refer to the number of samples in
> each ExpressionSet. 66 and 21 refer to the number of unique elements in
> source_name_ch1 in the phenodata.
>
>> tmp2<-combine(tmp[[1]],tmp[[2]])
> Error in .local(x, y, ...) :
>   data.frames contain conflicting data:
>         non-conforming colname(s): title, geo_accession,
> source_name_ch1, description, supplementary_file
> In addition: Warning messages:
> 1: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
>   Lengths (255, 98) differ (string compare on first 98)98 string
> mismatches
> 2: In switch(class(x[[nm]])[[1]], factor = { :
>   data frame column 'title' levels not all.equal
> 3: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
>   Lengths (255, 98) differ (string compare on first 98)98 string
> mismatches
> 4: In switch(class(x[[nm]])[[1]], factor = { :
>   data frame column 'geo_accession' levels not all.equal
> 5: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
>   Lengths (66, 21) differ (string compare on first 21)21 string
> mismatches
> 6: In switch(class(x[[nm]])[[1]], factor = { :
>   data frame column 'source_name_ch1' levels not all.equal
> 7: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
>   Lengths (255, 98) differ (string compare on first 98)98 string
> mismatches
> 8: In switch(class(x[[nm]])[[1]], factor = { :
>   data frame column 'description' levels not all.equal
> 9: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
>   Lengths (255, 98) differ (string compare on first 98)98 string
> mismatches
> 10: In switch(class(x[[nm]])[[1]], factor = { :
>   data frame column 'supplementary_file' levels not all.equal
>
>> traceback()
> 9: stop("data.frames contain conflicting data:", "\n\tnon-conforming
> colname(s): ",
>        paste(sharedCols[!ok], collapse = ", "))
> 8: .local(x, y, ...)
> 7: combine(pDataX, pDataY)
> 6: combine(pDataX, pDataY)
> 5: .local(x, y, ...)
> 4: combine(phenoData(x), phenoData(y))
> 3: combine(phenoData(x), phenoData(y))
> 2: combine(tmp[[1]], tmp[[2]])
> 1: combine(tmp[[1]], tmp[[2]])
>
>> sessionInfo()
> R version 2.6.0 (2007-10-03)
> x86_64-unknown-linux-gnu
>
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] tools     stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] GEOquery_2.2.0 RCurl_0.8-1    Biobase_1.16.2
>
> loaded via a namespace (and not attached):
> [1] rcompgen_0.1-15
>
> Francois
>
> On Wed, 2008-01-30 at 10:03 -0800, Martin Morgan wrote:
>> Hi Francois -- this might be related to a bug in Biobase that has been
>> fixed. Can you try to update your Biobase, either biocLite('Biobase')
>> or following the directions at http://bioconductor.org/download ? If
>> not, can you provide the output of traceback() after the error occurs?
>> 
>> Thanks,
>> 
>> Martin
>> 
>> Francois Pepin <fpepin at cs.mcgill.ca> writes:
>> 
>> > Hi everyone,
>> >
>> > I'm getting an error message when trying to combine two parts of a GSE
>> > object:
>> >
>> >>tmp<-getGEO('GSE3526',GSEMatrix=T)
>> >> tmp2<-combine(tmp[[1]],tmp[[2]])
>> > Error in alleq(levels(x[[nm]]), levels(y[[nm]])) && alleq(x
>> > [sharedRows,  :
>> >   invalid 'x' type in 'x && y'
>> >
>> > Checking to make sure that I should be able to combine them (from the
>> > eSet documentation):
>> >
>> > #eSets must have identical numbers of 'featureNames'
>> >> all(featureNames(tmp[[2]])==featureNames(tmp[[2]]))
>> > [1] TRUE
>> >
>> > #must have distinct 'sampleNames'
>> >> any(sampleNames(tmp[[1]])%in%sampleNames(tmp[[2]]))
>> > [1] FALSE
>> >
>> > #and must have identical 'annotation'.
>> >> annotation(tmp[[2]])==annotation(tmp[[2]])
>> > [1] TRUE
>> >
>> >> sessionInfo()
>> > R version 2.6.0 (2007-10-03)
>> > x86_64-unknown-linux-gnu
>> >
>> > locale:
>> > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>> >
>> > attached base packages:
>> > [1] tools     stats     graphics  grDevices utils     datasets  methods
>> > [8] base
>> >
>> > other attached packages:
>> > [1] GEOquery_2.2.0 RCurl_0.8-1    Biobase_1.16.0
>> >
>> > loaded via a namespace (and not attached):
>> > [1] rcompgen_0.1-15
>> >
>> > Does anyone know why that is happening and if there would be any way
>> > around it?
>> >
>> > Francois
>> >
>> 
>

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793


