[BioC] GEOquery: getGEO() doesn't work (error "invalid 'nlines' argument")

James W. MacDonald jmacdon at uw.edu
Tue May 29 17:03:35 CEST 2012


Hi Simone,

On 5/29/2012 10:25 AM, ecsi at gmx.net wrote:
> Hi Jim,
>
>> Why are you using system.file() in this context? 
>
> Because there is an example in the GEOquery vignette ("2 Getting 
> Started using GEOquery") which does it like this.

I see. That is one of the downsides of the vignette system - in order to 
have a vignette work correctly, using some external data, those data 
have to be parked somewhere in the package directory. An alternative 
would be to have a separate data package, but that means end users have 
to download one additional thing.

So the reason the vignette uses that paradigm is because the data being 
used are in the package directory. However, as you note below, you 
_haven't_ downloaded data to the package directory, so system.file() 
isn't the way to go. In other words, system.file() is only designed to 
locate files within an installed package's directory, wherever a given 
install of R keeps it - it is not intended for finding files in general.
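
To make the distinction concrete, here is a minimal sketch. It uses the 
stats package only because it ships with every R install, and the 
directory and file name for the soft file are hypothetical placeholders 
(substitute wherever your file really lives). The getGEO(filename=) call 
is guarded so that it only runs if that file exists and GEOquery is 
installed.

```r
# system.file() resolves paths *inside an installed package*.
pkg_dir <- system.file(package = "stats")   # install directory of 'stats'

# If the requested file is not in that package, the result is "".
not_found <- system.file("no_such_file.soft", package = "stats")

# A file stored anywhere else is addressed by its ordinary path.
# Both the directory and the file name here are hypothetical placeholders.
soft_file <- file.path("/path/to/local/data", "GSE19711_family.soft.gz")

# Only attempt to parse when the file is really there and GEOquery is installed.
if (file.exists(soft_file) && requireNamespace("GEOquery", quietly = TRUE)) {
  gse <- GEOquery::getGEO(filename = soft_file)
}
```

The point is that an ordinary path (built with file.path() or typed out 
directly) is all that is needed for a file outside the package tree.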

>
>> Did you really download the soft file to your GEOquery library 
>> directory? That seems odd to me.
>
> I downloaded it to a local data repository in our network (it is 
> obligatory to do it this way in this case).
>
> Why does it seem odd to you? Because I downloaded the soft file? 

No, not that you downloaded the file, what seemed odd was that you were 
using system.file(), which implies that you had downloaded the soft file 
to a very specific place. Let me give you an example:

On my Linux box
 > system.file(package="GEOquery")
[1] "/misc/staff/jmacdon/R-devel/library/GEOquery"

On my Windows box
 > system.file(package="GEOquery")
[1] "C:/Users/bioinf_admin/R/win-library/2.14/GEOquery"

So when you use system.file() you are specifically telling GEOquery to 
look for a file that is in your GEOquery library directory, rather than 
telling GEOquery the actual directory. That is what Sean was getting at 
in his response to you.

> This was a recommendation of a colleague who works a lot with GEO, we 
> thought the soft files would be the best option because they contain 
> all the information available and furthermore they are available for 
> all the GEO series I have to analyze. As I already wrote in reply to the 
> answer of Sean, if there is any better way to do it, I will be happy 
> to hear about it!

Sean already gave it to you. To further elaborate:

 > mypath <- "C:/Users/bioinf_admin/Desktop/"
 > GSE19711 <- getGEO('GSE19711',destdir=mypath)

This will result in a list of ExpressionSets

 > length(GSE19711)
[1] 3
 > GSE19711[[1]]
ExpressionSet (storageMode: lockedEnvironment)
assayData: 27578 features, 255 samples
   element names: exprs
protocolData: none
phenoData
   sampleNames: GSM491937 GSM491938 ... GSM492191 (255 total)
   varLabels: title geo_accession ... data_row_count (44 total)
   varMetadata: labelDescription
featureData
   featureNames: cg00000292 cg00002426 ... cg27665659 (27578 total)
   fvarLabels: ID Name ... ORF (38 total)
   fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL8490

I doubt you will be able to automate too much of this, as the phenoData 
slots for these ExpressionSets can contain whatever the experimenter 
thought was interesting, in addition to what is required by GEO:

 > names(pData(phenoData(GSE19711[[1]])))
 [1] "title"                   "geo_accession"
 [3] "status"                  "submission_date"
 [5] "last_update_date"        "type"
 [7] "channel_count"           "source_name_ch1"
 [9] "organism_ch1"            "characteristics_ch1"
[11] "characteristics_ch1.1"   "characteristics_ch1.2"
[13] "characteristics_ch1.3"   "characteristics_ch1.4"
[15] "characteristics_ch1.5"   "characteristics_ch1.6"
[17] "characteristics_ch1.7"   "characteristics_ch1.8"
[19] "characteristics_ch1.9"   "characteristics_ch1.10"
[21] "characteristics_ch1.11"  "characteristics_ch1.12"
[23] "characteristics_ch1.13"  "molecule_ch1"
[25] "extract_protocol_ch1"    "label_ch1"
[27] "label_protocol_ch1"      "taxid_ch1"
[29] "hyb_protocol"            "scan_protocol"
[31] "description"             "data_processing"
[33] "platform_id"             "contact_name"
[35] "contact_email"           "contact_phone"
[37] "contact_department"      "contact_institute"
[39] "contact_address"         "contact_city"
[41] "contact_zip/postal_code" "contact_country"
[43] "supplementary_file"      "data_row_count"

And we can then see what the characteristics are:

 > head(pData(phenoData(GSE19711[[1]])), 2)[,11:23]
                    characteristics_ch1.1 characteristics_ch1.2
GSM491937 agegroupatsampledraw: 65 to 70  ageatrecruitment: 68
GSM491938  agegroupatsampledraw: Over 75  ageatrecruitment: 81
           characteristics_ch1.3     characteristics_ch1.4 characteristics_ch1.5
GSM491937    ageatdiagnosis: 68   histology: Endometrioid             stage: Ic
GSM491938    ageatdiagnosis: 80 histology: Carcinosarcoma           stage: IIIb
           characteristics_ch1.6     characteristics_ch1.7
GSM491937        grade: Grade 2 pre-treatment sample: Yes
GSM491938        grade: Grade 3  pre-treatment sample: No
                characteristics_ch1.8 characteristics_ch1.9
GSM491937  post-treatment sample: No           ca125: 1717
GSM491938 post-treatment sample: Yes          ca125: 32.89
           characteristics_ch1.10      characteristics_ch1.11
GSM491937               batch: 1 beadchip_well: 4447820175_A
GSM491938               batch: 1 beadchip_well: 4447820175_B
               characteristics_ch1.12     characteristics_ch1.13
GSM491937 bs conversion c1: Grn 5706 bs conversion c2: Grn 5538
GSM491938 bs conversion c1: Grn 6861 bs conversion c2: Grn 6141
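
Since each of these columns holds "key: value" strings, the part that 
can be automated is splitting them. The following is only a sketch in 
base R: it works on a toy data.frame built from two of the columns shown 
above rather than on the real phenoData, and split_characteristic() is a 
hypothetical helper of my own, not a GEOquery function. It also assumes 
that every row of a column uses the same key, which is up to the 
submitter and should be checked for each series.

```r
# Toy stand-in for two of the characteristics columns printed above.
pd <- data.frame(
  characteristics_ch1.2 = c("ageatrecruitment: 68", "ageatrecruitment: 81"),
  characteristics_ch1.5 = c("stage: Ic", "stage: IIIb"),
  row.names = c("GSM491937", "GSM491938"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split "key: value" strings at the first colon.
split_characteristic <- function(x) {
  x <- as.character(x)
  list(
    key   = unique(sub(":.*$", "", x)),            # text before the first colon
    value = sub("^[^:]*:[[:space:]]*", "", x)      # text after it, trimmed
  )
}

parsed <- lapply(pd, split_characteristic)

# One column per characteristic, named by its key instead of its position.
clean <- data.frame(lapply(parsed, `[[`, "value"),
                    row.names = rownames(pd),
                    stringsAsFactors = FALSE)
names(clean) <- vapply(parsed, function(p) p$key[1], character(1))

# Values are still character; convert the ones that are really numeric.
clean$ageatrecruitment <- as.numeric(clean$ageatrecruitment)
```

Which keys exist, and which of them are numeric, still has to be decided 
by looking at each series, so this does not remove the manual step.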

Does that help?

Best,

Jim

>
> Best,
> Simone
>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099


