[BioC] GEOquery returns error "scan() expected 'an integer'"

Sean Davis sdavis2 at mail.nih.gov
Mon Oct 3 13:18:42 CEST 2011


2011/10/2 Timothée Flutre <timflutre at gmail.com>:
> Hello,
>
> I downloaded a dataset from the GEO at the NCBI and launched the following
> commands:
>> library(GEOquery)
>> gse <- getGEO(filename="GSE25935_family.soft.gz")
>
> Here is the error message I got:
> Parsing....
> Found 465 entities...
> GPL4133 (1 of 465 entities)
> GSM636943 (2 of 465 entities)
> ...
> GSM637180 (239 of 465 entities)
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,
>  :
>  scan() expected 'an integer', got '5.845752745'
> Calls: getGEO ... .parseGSMWithLimits -> fastTabRead -> read.delim ->
> read.table -> scan
>
> Is the input file badly formatted?

Sorry for the bug.  In order to read some of the larger files in GEO,
I borrowed a trick from the limma package: read just the first part of
the file to get the column types, then read the entire file after
telling R about those column types.  This can speed up reading large
files by as much as an order of magnitude.  That is the background.
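
For anyone curious, the two-pass idea can be sketched roughly as
follows.  This is only an illustration, not the actual GEOquery or
limma code: the helper name read_with_guessed_types and its n_guess
argument are made up for this example, and the input is a small
synthetic table written to a temporary file.

```r
# Hypothetical helper illustrating the two-pass read; not GEOquery's code.
read_with_guessed_types <- function(path, n_guess = 100) {
  # Pass 1: read only the first n_guess rows and let R infer column types.
  head_rows <- read.delim(path, nrows = n_guess, stringsAsFactors = FALSE)
  col_types <- unname(vapply(head_rows, function(x) class(x)[1], character(1)))
  # Pass 2: read the whole file with the column types fixed up front,
  # so R does not have to re-infer them while scanning every row.
  read.delim(path, colClasses = col_types, stringsAsFactors = FALSE)
}

# Small synthetic table standing in for a GEO sample table.
tmp <- tempfile(fileext = ".txt")
write.table(data.frame(ID_REF = paste0("probe", 1:500), VALUE = rnorm(500)),
            tmp, sep = "\t", quote = FALSE, row.names = FALSE)

tab <- read_with_guessed_types(tmp)
str(tab)
```

The speed-up comes from the second pass, which no longer needs to work
out the type of each column; the cost is that the guess is only as good
as the rows that were sampled.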

In this case, the problem arises from a sample (GSM637180) that
contains 178 missing values as the first records.  Since I read only
the first 100, R assumes that this column is full of integers.  I'll
need to fix the code for table reading, but in the meantime, I would
suggest this as the workaround:

gse = getGEO('GSE25935',destdir='.')
gse = combine(gse[[1]],gse[[2]])

Using destdir in the getGEO call will allow you to reuse the
downloaded files (they are cached in the current directory, in other
words) in the case of having to run the code more than once.  The
combine() call is needed because NCBI GEO built the original series
matrix format to have at most 255 columns per file, so two such files
are needed to capture all the samples.
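
For reference, the kind of failure described above (types guessed from
the first 100 rows, then contradicted further down) can be reproduced
on a toy file.  The exact contents of the GSM637180 table are not shown
here, so this sketch assumes whole-number values in the leading rows as
a stand-in; the widening step at the end is just one possible defensive
option, not necessarily the fix that will go into GEOquery.

```r
# Toy file: 150 whole-number values first, decimals only afterwards.
tmp <- tempfile(fileext = ".txt")
vals <- c(rep("0", 150), "5.845752745", "6.1")
writeLines(c("ID_REF\tVALUE", paste(seq_along(vals), vals, sep = "\t")), tmp)

# Guess column types from the first 100 rows only: VALUE looks like integer.
guess <- unname(vapply(read.delim(tmp, nrows = 100),
                       function(x) class(x)[1], character(1)))

# Reading the whole file with those types fails at the first decimal value;
# the error message is captured instead of stopping the script.
res <- tryCatch(read.delim(tmp, colClasses = guess),
                error = function(e) conditionMessage(e))
print(res)

# One defensive option (an assumption, not the package's actual fix):
# widen every guessed "integer" column to "numeric" before the full read.
safe <- guess
safe[safe == "integer"] <- "numeric"
tab <- read.delim(tmp, colClasses = safe)
```

Widening integer to numeric costs very little and avoids this class of
error, since every integer is also a valid numeric value.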

Hope that helps,
Sean


> Thanks for any help,
> TF
>
>> sessionInfo()
> R version 2.13.1 (2011-07-08)
> Platform: x86_64-redhat-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] GEOquery_2.19.4 Biobase_2.10.0
>
> loaded via a namespace (and not attached):
> [1] RCurl_1.5-0 XML_3.2-0
>


