[BioC] Combining expressionSets from GEO

Francois Pepin fpepin at cs.mcgill.ca
Wed Jan 30 20:31:11 CET 2008


Hi Martin,

Thanks for the help. I managed to fix the issue by resetting all of the
levels on both sides (having everything as characters should work too):

# Caution: `levels<-` relabels existing levels by position, so check the
# recoded columns against the originals before combining.
for (i in seq_along(pData(tmp[[1]]))) {
  lv <- c(unique(as.character(pData(tmp[[1]])[, i])),
          unique(as.character(pData(tmp[[2]])[, i])))
  levels(pData(tmp[[1]])[, i]) <- levels(pData(tmp[[2]])[, i]) <- lv
}
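
For the character route mentioned above, a minimal base R sketch follows.
It uses plain data.frames standing in for the pData of the two
ExpressionSets. The helper name factors_to_character and the toy columns
are assumptions made up for illustration; neither is part of Biobase or
GEOquery.

```r
# Hypothetical helper (not part of Biobase or GEOquery): convert every
# factor column of a data.frame to character, leaving other columns alone.
factors_to_character <- function(df) {
  for (nm in names(df)) {
    if (is.factor(df[[nm]])) df[[nm]] <- as.character(df[[nm]])
  }
  df
}

# Toy stand-ins for the two pData data.frames (made-up values).
p1 <- data.frame(title = factor(c("s1", "s2")), n = 1:2,
                 row.names = c("GSM1", "GSM2"))
p2 <- data.frame(title = factor(c("s3", "s4")), n = 3:4,
                 row.names = c("GSM3", "GSM4"))

p1c <- factors_to_character(p1)
p2c <- factors_to_character(p2)

# Character columns carry no level sets, so there is nothing left to
# disagree between the two data.frames.
str(p1c)
str(p2c)
```

On the real objects the same idea would presumably be applied as
pData(tmp[[1]]) <- factors_to_character(pData(tmp[[1]])) (and likewise for
tmp[[2]]) before calling combine(); that step is untested against GSE3526.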

The next question is where this would best be handled. I really don't see
why it should not be taken care of behind the scenes.

The two main options I see are for getGEO() to return the phenoData
columns as characters instead of factors, or for combine() to know how to
deal with factors properly for ExpressionSet.
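
What "dealing with factors properly" could look like is sketched below in
base R: rebuild each shared factor column on the union of the levels from
both data.frames, so that a given string gets the same integer code on both
sides. The name harmonize_levels is hypothetical, and this is only an
illustration of the idea, not how Biobase's combine() is implemented.

```r
# Hypothetical helper: give the factor columns shared by two data.frames
# a common level set. Returns the adjusted pair as a list.
harmonize_levels <- function(a, b) {
  for (nm in intersect(names(a), names(b))) {
    if (is.factor(a[[nm]]) && is.factor(b[[nm]])) {
      lv <- union(levels(a[[nm]]), levels(b[[nm]]))
      # factor(<character>, levels = lv) re-maps values by label; assigning
      # to levels() would instead relabel the existing codes by position.
      a[[nm]] <- factor(as.character(a[[nm]]), levels = lv)
      b[[nm]] <- factor(as.character(b[[nm]]), levels = lv)
    }
  }
  list(a = a, b = b)
}

x <- data.frame(y = factor(letters[1:5]), row.names = letters[1:5])
y <- data.frame(y = factor(letters[3:7]), row.names = letters[3:7])

# Before: the string "c" has a different integer code in each data.frame.
code_x_before <- as.integer(x["c", "y"])
code_y_before <- as.integer(y["c", "y"])

h <- harmonize_levels(x, y)

# After: both columns share one level set, so the codes for "c" agree.
code_x_after <- as.integer(h$a["c", "y"])
code_y_after <- as.integer(h$b["c", "y"])
print(c(code_x_before, code_y_before, code_x_after, code_y_after))
```

Rebuilding the factor from its character values keeps every sample's label
intact, which is the property to check whichever place ends up doing the
conversion.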

If the former is chosen, I think it would probably be worth adjusting
the documentation about combine to mention this issue. As an unrelated
note, the ExpressionSet documentation refers to the eSet's. Since eSet
is going away at some point, that might be worth changing.

Francois

On Wed, 2008-01-30 at 10:54 -0800, Martin Morgan wrote:
> So part of the bug fix was an attempt to make the error message more
> informative, and it's not really clear that I've done that!
> 
> The traceback makes it clear that the problem is with the pData (and
> not, for instance, varMetadata or featureData) of the two arrays.
> 
> Some hints are provided by the warnings, by the ?combine help page, 
> 
>      'combine(data.frame, data.frame)' Combines two 'data.frame'
>           objects so that the resulting 'data.frame' contains all rows
>           and columns of the original objects. Rows and columns in the
>           returned value are unique, that is, a row or column
>           represented in both arguments is represented only once in the
>           result. To perform this operation, 'combine' makes sure that
>           data in shared rows and columns is identical in the two
>           data.frames. Data differences in shared rows and columns cause
>           an error. 'combine' issues a warning when a column is a
>           'factor' and the levels of the factor in the two
>           'data.frame's are different; the returned value may be
>           recoded.
> 
> and by the results of
> 
> > example(combine)
> 
> particularly the last lines which are trying to illustrate your
> problem: 
> 
> combin>   # y is converted to 'factor' with different levels
> combin>   x <- data.frame(x=1:5,y=letters[1:5], row.names=letters[1:5])
> 
> combin>   y <- data.frame(z=3:7,y=letters[3:7], row.names=letters[3:7])
> 
> combin>   try(combine(x,y))
> Error in combine(x, y) : data.frames contain conflicting data:
> 	non-conforming colname(s): y
> In addition: Warning messages:
> 1: In alleq(levels(x[[nm]]), levels(y[[nm]])) : 5 string mismatches
> 2: In switch(class(x[[nm]])[[1]], factor = { :
>   data frame column 'y' levels not all.equal
> 
> The data.frame column 'y' is a 'factor' (rather than a character
> vector) and combine doesn't know how to resolve a column that has 'c'
> encoded as level 3 of a factor with one that has 'c' encoded as level
> 1.
> 
> One solution is to ensure that columns that are really character
> vectors are stored as such
> 
> > x <- data.frame(x=1:5,y=I(letters[1:5]), row.names=letters[1:5])
> > y <- data.frame(z=3:7,y=I(letters[3:7]), row.names=letters[3:7])
> > combine(x,y)
>    x y  z
> a  1 a NA
> b  2 b NA
> c  3 c  3
> d  4 d  4
> e  5 e  5
> f NA f  6
> g NA g  7
> 
> or that factors have the same levels
> 
> > y1 <- factor(letters[1:5], levels=letters[1:7])
> > y2 <- factor(letters[3:7], levels=letters[1:7])
> > x <- data.frame(x=1:5, y=y1, row.names=letters[1:5])
> > y <- data.frame(z=3:7, y=y2, row.names=letters[3:7])
> > combine(x,y)
>    x y  z
> a  1 a NA
> b  2 b NA
> c  3 c  3
> d  4 d  4
> e  5 e  5
> f NA f  6
> g NA g  7
> 
> Martin
> 
> Francois Pepin <fpepin at cs.mcgill.ca> writes:
> 
> > Hi Martin,
> >
> > I think it is related, as I now have a different error message along
> > with a series of warnings. 255 and 98 refer to the number of samples in
> > each ExpressionSet. 66 and 21 refer to the number of unique elements in
> > source_name_ch1 in the phenodata.
> >
> >> tmp2<-combine(tmp[[1]],tmp[[2]])
> > Error in .local(x, y, ...) :
> >   data.frames contain conflicting data:
> >         non-conforming colname(s): title, geo_accession,
> > source_name_ch1, description, supplementary_file
> > In addition: Warning messages:
> > 1: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
> >   Lengths (255, 98) differ (string compare on first 98)98 string
> > mismatches
> > 2: In switch(class(x[[nm]])[[1]], factor = { :
> >   data frame column 'title' levels not all.equal
> > 3: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
> >   Lengths (255, 98) differ (string compare on first 98)98 string
> > mismatches
> > 4: In switch(class(x[[nm]])[[1]], factor = { :
> >   data frame column 'geo_accession' levels not all.equal
> > 5: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
> >   Lengths (66, 21) differ (string compare on first 21)21 string
> > mismatches
> > 6: In switch(class(x[[nm]])[[1]], factor = { :
> >   data frame column 'source_name_ch1' levels not all.equal
> > 7: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
> >   Lengths (255, 98) differ (string compare on first 98)98 string
> > mismatches
> > 8: In switch(class(x[[nm]])[[1]], factor = { :
> >   data frame column 'description' levels not all.equal
> > 9: In alleq(levels(x[[nm]]), levels(y[[nm]])) :
> >   Lengths (255, 98) differ (string compare on first 98)98 string
> > mismatches
> > 10: In switch(class(x[[nm]])[[1]], factor = { :
> >   data frame column 'supplementary_file' levels not all.equal
> >
> >> traceback()
> > 9: stop("data.frames contain conflicting data:", "\n\tnon-conforming
> > colname(s): ",
> >        paste(sharedCols[!ok], collapse = ", "))
> > 8: .local(x, y, ...)
> > 7: combine(pDataX, pDataY)
> > 6: combine(pDataX, pDataY)
> > 5: .local(x, y, ...)
> > 4: combine(phenoData(x), phenoData(y))
> > 3: combine(phenoData(x), phenoData(y))
> > 2: combine(tmp[[1]], tmp[[2]])
> > 1: combine(tmp[[1]], tmp[[2]])
> >
> >> sessionInfo()
> > R version 2.6.0 (2007-10-03)
> > x86_64-unknown-linux-gnu
> >
> > locale:
> > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> >
> > attached base packages:
> > [1] tools     stats     graphics  grDevices utils     datasets  methods
> > [8] base
> >
> > other attached packages:
> > [1] GEOquery_2.2.0 RCurl_0.8-1    Biobase_1.16.2
> >
> > loaded via a namespace (and not attached):
> > [1] rcompgen_0.1-15
> >
> > Francois
> >
> > On Wed, 2008-01-30 at 10:03 -0800, Martin Morgan wrote:
> >> Hi Francois -- this might be related to a bug in Biobase that has been
> >> fixed. Can you try to update your Biobase, either biocLite('Biobase')
> >> or following the directions at http://bioconductor.org/download ? If
> >> not, can you provide the output of traceback() after the error occurs?
> >> 
> >> Thanks,
> >> 
> >> Martin
> >> 
> >> Francois Pepin <fpepin at cs.mcgill.ca> writes:
> >> 
> >> > Hi everyone,
> >> >
> >> > I'm getting an error message when trying to combine two parts of a GSE
> >> > object:
> >> >
> >> >>tmp<-getGEO('GSE3526',GSEMatrix=T)
> >> >> tmp2<-combine(tmp[[1]],tmp[[2]])
> >> > Error in alleq(levels(x[[nm]]), levels(y[[nm]])) && alleq(x
> >> > [sharedRows,  :
> >> >   invalid 'x' type in 'x && y'
> >> >
> >> > Checking to make sure that I should be able to combine them (from the
> >> > eSet documentation):
> >> >
> >> > #eSets must have identical numbers of 'featureNames'
> >> >> all(featureNames(tmp[[2]])==featureNames(tmp[[2]]))
> >> > [1] TRUE
> >> >
> >> > #must have distinct 'sampleNames'
> >> >> any(sampleNames(tmp[[1]])%in%sampleNames(tmp[[2]]))
> >> > [1] FALSE
> >> >
> >> > #and must have identical 'annotation'.
> >> >> annotation(tmp[[2]])==annotation(tmp[[2]])
> >> > [1] TRUE
> >> >
> >> >> sessionInfo()
> >> > R version 2.6.0 (2007-10-03)
> >> > x86_64-unknown-linux-gnu
> >> >
> >> > locale:
> >> > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> >> >
> >> > attached base packages:
> >> > [1] tools     stats     graphics  grDevices utils     datasets  methods
> >> > [8] base
> >> >
> >> > other attached packages:
> >> > [1] GEOquery_2.2.0 RCurl_0.8-1    Biobase_1.16.0
> >> >
> >> > loaded via a namespace (and not attached):
> >> > [1] rcompgen_0.1-15
> >> >
> >> > Does anyone know why that is happening and if there would be any way
> >> > around it?
> >> >
> >> > Francois
> >> >
> >> > _______________________________________________
> >> > Bioconductor mailing list
> >> > Bioconductor at stat.math.ethz.ch
> >> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >> 
> >
>


