[BioC] WARNING: difference in sorting order depending on computer platform?!?

Thu Jan 28 22:14:09 CET 2010

Jenny,

Say you run some stuff on your Linux machine and some other stuff on a
server in Denmark... You're likely to get something like the
following:

x <- c("a", "aaa", "z", "aa")
Sys.setlocale(locale="C") ## C locale: plain byte-order collation
sort(x) ## a aa aaa z
Sys.setlocale(locale="da_DK") ## Danish locale (the name is platform-dependent)
sort(x) ## a z aa aaa

There may be something more elegant, but when I had to handle this, I
started using match() a lot to ensure the probesets were properly
aligned.
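
Here is a minimal sketch of what I mean by using match(). The IDs and
values are made up for illustration; the point is only that you look up
each ID's position rather than trust that both vectors are in the same
order.

```r
# Toy data: the same probe set IDs stored in two different orders.
ids_pc    <- c("177_at", "1773_at", "2028_s_at", "202800_at")
ids_linux <- c("1773_at", "177_at", "202800_at", "2028_s_at")

# Made-up values that were computed in the Linux order.
expr_linux <- c(10, 20, 30, 40)
names(expr_linux) <- ids_linux

# For each ID in the PC order, find its position in the Linux order.
idx <- match(ids_pc, ids_linux)

# Stop early if any ID is missing on the other side.
stopifnot(!anyNA(idx))

# Reorder the Linux values so they line up with the PC IDs.
expr_aligned <- expr_linux[idx]
expr_aligned
```

After this, names(expr_aligned) is identical to ids_pc, whatever order
either machine sorted the IDs in.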

You could force the locale to be the same on whatever machine you use, but
I'm not sure this is a good idea.
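
If you do want to try it, a less invasive variant is to change only the
collation setting, and only for the duration of one sort. The sketch
below assumes that is acceptable for your analysis; sort_c() is a
hypothetical helper name, not something from base R or Bioconductor.

```r
# Hypothetical helper: sort using C collation, then restore the
# collation setting the session had before the call.
sort_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

x <- c("a", "aaa", "z", "aa")
before <- Sys.getlocale("LC_COLLATE")
sorted <- sort_c(x)
after <- Sys.getlocale("LC_COLLATE")
sorted
```

This leaves everything else that depends on the locale (messages,
dates, character classification) untouched.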

You can check the locales you get on both machines using Sys.getlocale().
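
For sorting, the category that matters is LC_COLLATE. A small sketch of
the check, together with a hypothetical helper (compare_ids() is my own
name) that tells you whether two ID vectors differ only in order:

```r
# Collation setting of the current session; compare this across machines.
collation <- Sys.getlocale("LC_COLLATE")

# Hypothetical helper: do two ID vectors agree, and if not,
# is the order the only difference?
compare_ids <- function(a, b) {
  if (identical(a, b)) {
    "identical"
  } else if (length(a) == length(b) && setequal(a, b)) {
    "same IDs, different order"
  } else {
    "different IDs"
  }
}

res_same  <- compare_ids(c("177_at", "1773_at"), c("177_at", "1773_at"))
res_order <- compare_ids(c("177_at", "1773_at"), c("1773_at", "177_at"))
res_diff  <- compare_ids(c("177_at", "1773_at"), c("177_at", "2028_s_at"))
```

Running something like this right after load() on the second machine
would flag a reordering before it can do any damage.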

cheers,

b

On Thu, Jan 28, 2010 at 8:16 PM, Jenny Drnevich <drnevich at illinois.edu> wrote:
> Hi all,
>
> I just found a problem/discrepancy in running R on PC vs. Unix/Linux server.
> Maybe it's widely known, but I didn't know about it and it caused me big
> problems. I mostly use my desktop PC for running microarray analyses, but
> occasionally I have projects that require more memory. Then I run some of
> the memory-intensive steps on our Linux server (which has a lot more memory
> but is REALLY slow), save the objects, and go back to my PC to finish the
> analysis. Well, it turns out that the order of probe set IDs as returned by
> featureNames() is slightly different between the computer platforms. I first
> thought it might be due to a difference in the chipnamecdf library Windows
> binary vs. *nix compilation of the source file, but I think it's just a
> difference in the way the computer platforms sort character data that have
> numbers. I've put a full, reproducible example below (our sys admin hasn't
> upgraded R on the server yet, but I doubt that's the problem), but in short,
> my PC puts 177_at before 1773_at, but the server puts 1773_at before 177_at.
>
> I guess this really isn't a "bug" that can be fixed, and I know it's not a
> good idea to run part of your R code on one computer and part on another
> computer, but don't you agree that this is undesirable behavior?  Maybe I'm
> not computer-literate enough to have known that this is a well-known issue,
> so in part I'm posting this as a warning to others like me - I don't
> remember seeing anything like this in the 4+ years I've been following the
> BioC list. I'm also wondering, in addition to however many of my analyses that
> may have been messed up slightly (ARRRGGHH!!), would this possibly cause
> problems in things like public repositories? I know databases don't depend
> on order, but I'd be surprised if it hasn't caused problems somewhere else.
> In this case, there are only 117 probe sets out of 22,277 that don't match up,
> so it would be hard to notice!
>
> Thanks,
> Jenny
>
>
>> library(affy)
> Loading required package: Biobase
>
> Welcome to Bioconductor
>
>  Vignettes contain introductory material. To view, type
>  'openVignette()'. To cite Bioconductor, see
>  'citation("Biobase")' and for packages 'citation(pkgname)'.
>
>> library(ArrayExpress)
>>
>> rawset = ArrayExpress("E-MEXP-1422")
> trying URL
> 'http://www.ebi.ac.uk/microarray-as/ae/files/E-MEXP-1422/index.html'
> Content type 'text/html;charset=ISO-8859-1' length unknown
> opened URL
> downloaded 7746 bytes
>
> trying URL
> 'http://www.ebi.ac.uk/microarray-as/ae/files/E-MEXP-1422/E-MEXP-1422.raw.1.zip'
> Content type 'application/zip' length 11200346 bytes (10.7 Mb)
> opened URL
> downloaded 10.7 Mb
>
> Read 1 item
> trying URL
> 'http://www.ebi.ac.uk/microarray-as/ae/files/E-MEXP-1422/E-MEXP-1422.sdrf.txt'
> Content type 'text/plain' length 6679 bytes
> opened URL
> downloaded 6679 bytes
>
> trying URL
> 'http://www.ebi.ac.uk/microarray-as/ae/files/A-AFFY-37/A-AFFY-37.adf.txt'
> Content type 'text/plain' length 3590863 bytes (3.4 Mb)
> opened URL
> downloaded 3.4 Mb
>
> trying URL
> 'http://www.ebi.ac.uk/microarray-as/ae/files/E-MEXP-1422/E-MEXP-1422.idf.txt'
> Content type 'text/plain' length 5378 bytes
> opened URL
> downloaded 5378 bytes
>
> Read 49 items
>
>  The object containing experiment  E-MEXP-1422  has been built.
>
>> rawset
> AffyBatch object
> size of arrays=732x732 features (8499 kb)
> cdf=HG-U133A_2 (22277 affyids)
> number of samples=6
> number of genes=22277
> annotation=hgu133a2
> notes=E-MEXP-1422
>        E-MEXP-1422
>        RNAi
>        c("cellular_modification_design", "co-expression_design",
> "in_vitro_design", "RNAi")
>        NULL
>>
>> PSnames.PC <- featureNames(rawset)
>>
>> all.equal(PSnames.PC, featureNames(rawset))
> [1] TRUE
>>
>> save.image("NameOrderTest.RData")
>>
>> sessionInfo()
> R version 2.10.1 (2009-12-14)
> i386-pc-mingw32
>
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] hgu133a2cdf_2.5.0  ArrayExpress_1.6.1 affy_1.24.2        Biobase_2.6.1
>
> loaded via a namespace (and not attached):
> [1] affyio_1.14.0        limma_3.2.1          preprocessCore_1.8.0
> [4] tools_2.10.1         XML_2.6-0
>>
>> q()
>
>
> # now move to Linux server:
>
>
>> library(affy)
> Loading required package: Biobase
>
> Welcome to Bioconductor
>
>  Vignettes contain introductory material. To view, type
>  'openVignette()'. To cite Bioconductor, see
>  'citation("Biobase")' and for packages 'citation(pkgname)'.
>
>>
>>
>>
>> load("NameOrderTest.RData")
>>
>>
>>
>> all.equal(PSnames.PC, featureNames(rawset))
> [1] "117 string mismatches"
>>
>>
>> x <- data.frame(PC=PSnames.PC, Linux=featureNames(rawset),
>> stringsAsFactors=F)
>>
>> x[ x[,1] != x[,2] , ][ 1:5 , ]
>            PC     Linux
> 17      177_at   1773_at
> 18     1773_at    177_at
> 2328 2028_s_at 202800_at
> 2329 202800_at 202801_at
> 2330 202801_at 202802_at
>>
>>
>> all.equal(sort(PSnames.PC), featureNames(rawset))
> [1] TRUE
>>
>>
>> PSnames.linux <- featureNames(rawset)
>>
>> save.image("NameOrderTest.RData")
>>
>> sessionInfo()
> R version 2.9.0 (2009-04-17)
> x86_64-unknown-linux-gnu
>
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] hgu133a2cdf_2.4.0 affy_1.22.0       Biobase_2.4.0
>
> loaded via a namespace (and not attached):
> [1] affyio_1.8.1         preprocessCore_1.6.0 tools_2.9.0
>>
>> q()
>
>
> # now move back to PC:
>
>> library(affy)
> Loading required package: Biobase
>
> Welcome to Bioconductor
>
>  Vignettes contain introductory material. To view, type
>  'openVignette()'. To cite Bioconductor, see
>  'citation("Biobase")' and for packages 'citation(pkgname)'.
>
>> load("NameOrderTest.RData")
>>
>> all.equal(PSnames.PC, featureNames(rawset))
> [1] TRUE
>>
>> all.equal(PSnames.linux, featureNames(rawset))
> [1] "117 string mismatches"
>>
>> all.equal(sort(PSnames.linux), featureNames(rawset))
> [1] TRUE
>
> Jenny Drnevich, Ph.D.
>
> Functional Genomics Bioinformatics Specialist
> W.M. Keck Center for Comparative and Functional Genomics
> Roy J. Carver Biotechnology Center
> University of Illinois, Urbana-Champaign
>
> 330 ERML
> 1201 W. Gregory Dr.
> Urbana, IL 61801
> USA
>
> ph: 217-244-7355
> fax: 217-265-5066
> e-mail: drnevich at illinois.edu