[BioC] getGEO vs GEOmetadb

Jack Zhu zhujack at mail.nih.gov
Thu Jun 13 16:12:46 CEST 2013


Hi Thomas,

Sorry that I missed your posts on the Bioconductor mailing list. We
did have issues with updating recent GEO data, and that seems to have
been fixed:

-----------------------
> con <- dbConnect(SQLite(), "GEOmetadb.sqlite")
> dat <- dbGetQuery(con, "select * from gds where gds = 'GDS4252'")
> dat
ID                        3354
gds                       GDS4252
title                     Cystic fibrosis bronchial epithelial cells
                          exposure to Pseudomonas aeruginosa PA01
                          biofilms
description               Analysis of cystic fibrosis (CF) bronchial
                          epithelial CFBE41o- cells exposed to
                          Pseudomonas aeruginosa PA01 biofilms. Cells
                          overexpressing 508del-CFTR and cells rescued
                          with wild type CFTR were examined. CFTR
                          mutations enhance the inflammatory response
                          in the lung to PA01 infection.
type                      Expression profiling by array
pubmed_id                 22821996
gpl                       GPL570
platform_organism         Homo sapiens
platform_technology_type  in situ oligonucleotide
feature_count             54675
sample_organism           Homo sapiens
sample_type               RNA
channel_count             1
sample_count              16
value_type                transformed count
gse                       GSE30439
order                     none
update_date               2013-04-23
----------------------------

Could you re-download GEOmetadb.sqlite.gz and try again? Please
don't hesitate to contact me directly if you still see any problems.
Thanks.
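
In case it is useful, here is a rough sketch of how the refresh and
re-check could be scripted. The helper names gds_present() and
refresh_and_check() are made up for illustration and are not part of
GEOmetadb; the sketch also assumes the unpacked file lands in destdir
as GEOmetadb.sqlite, as in your original call.

```r
library(DBI)
library(RSQLite)

# Hypothetical helper (not part of GEOmetadb): TRUE if the gds table
# contains at least one row for the given accession.
gds_present <- function(con, accession) {
  res <- dbGetQuery(con,
                    "select count(*) as n from gds where gds = ?",
                    params = list(accession))
  res$n[1] > 0
}

# Hypothetical wrapper: download a fresh copy of the database, then
# run the check. Nothing is downloaded until this function is called.
refresh_and_check <- function(accession, destdir = getwd()) {
  # getSQLiteFile() is from the GEOmetadb package; the file name below
  # is assumed from the call used earlier in this thread.
  GEOmetadb::getSQLiteFile(destdir = destdir,
                           destfile = "GEOmetadb.sqlite.gz")
  con <- dbConnect(SQLite(), file.path(destdir, "GEOmetadb.sqlite"))
  on.exit(dbDisconnect(con))
  gds_present(con, accession)
}
```

Calling refresh_and_check("GDS4252") should return TRUE once the
updated database is in place. Note that the download is large.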


Jack


On Thu, Jun 13, 2013 at 9:05 AM, Thomas H. Hampton
<Thomas.H.Hampton at dartmouth.edu> wrote:
> Hi Sean and Jack,
>
> Sorry to pester you with this. I posted it to BioC twice and got no response so I thought I should try contacting you more directly.
>
>
> The following getGEO query retrieves data files and meta data for a recent GEO submission of mine,
> one that has been curated:
>
> GDS4252 <- getGEO("GDS4252")
> Columns(GDS4252)
>> str(Columns(GDS4252))
> 'data.frame': 16 obs. of  4 variables:
> $ sample            : Factor w/ 16 levels "GSM754979","GSM754980",..: 5 6 7 8 1 2 3 4 13 14 ...
> $ genotype/variation: Factor w/ 2 levels "CFTR  mutant",..: 1 1 1 1 1 1 1 1 2 2 ...
> $ agent             : Factor w/ 2 levels "PA01","unexposed": 1 1 1 1 2 2 2 2 1 1 ...
>
> The folks at NCBI have correctly created two factors with two levels to describe the 16 samples in my experiment.
>
> I am interested in retrieving similar information using GEOmetadb, but this has proved problematic.
>
> getSQLiteFile(destdir = getwd(), destfile = "GEOmetadb.sqlite.gz")
>
> con <- dbConnect(SQLite(), "GEOmetadb.sqlite")
> dat <- dbGetQuery(con, "select * from gds where gds = 'GDS4252'")
>
>> dat
> [1] ID                       gds                      title
> [4] description              type                     pubmed_id
> [7] gpl                      platform_organism        platform_technology_type
> [10] feature_count            sample_organism          sample_type
> [13] channel_count            sample_count             value_type
> [16] gse                      order                    update_date
> <0 rows> (or 0-length row.names)
>
> It seems, for starters, that this GDS identifier for my particular submission isn't accounted for in the current
> database.
>
> Others are, so it looks like my syntax and so forth is ok:
>
>> dat <- dbGetQuery(con, "select gds from gds limit 10")
>> dat
>     gds
> 1   GDS5
> 2   GDS6
> 3  GDS10
> 4  GDS12
> 5  GDS15
> 6  GDS16
> 7  GDS17
> 8  GDS18
> 9  GDS19
> 10 GDS20
>
>
> There is also the question of where a set of fields (variable in number) describing sample factors and their levels would actually "live"
> in the SQLite database.
>
> This information does not seem to be an attribute of the GDS in any case:
>
>> dat <- dbGetQuery(con, "select fieldname from geodb_column_desc where TableName = 'gds'")
>> dat
>                  FieldName
> 1                        ID
> 2             channel_count
> 3               description
> 4             feature_count
> 5                       gds
> 6                     order
> 7                  platform
> 8         platform_organism
> 9  platform_technology_type
> 10                pubmed_id
> 11         reference_series
> 12             sample_count
> 13          sample_organism
> 14              sample_type
> 15                    title
> 16                     type
> 17              update_date
> 18               value_type
>
> Nor does it seem to be a feature stored in the samples:
>
>> dat <- dbGetQuery(con, "select fieldname from geodb_column_desc where TableName = 'gsm'")
>> dat
>                FieldName
> 1                      ID
> 2           channel_count
> 3     characteristics_ch1
> 4     characteristics_ch2
> 5                 contact
> 6         data_processing
> 7          data_row_count
> 8             description
> 9    extract_protocol_ch1
> 10   extract_protocol_ch2
> 11                    gpl
> 12                    gse
> 13                    gsm
> 14           hyb_protocol
> 15              label_ch1
> 16              label_ch2
> 17     label_protocol_ch1
> 18     label_protocol_ch2
> 19       last_update_date
> 20           molecule_ch1
> 21           molecule_ch2
> 22           organism_ch1
> 23           organism_ch2
> 24        source_name_ch1
> 25        source_name_ch2
> 26                 status
> 27        submission_date
> 28     supplementary_file
> 29                  title
> 30 treatment_protocol_ch1
> 31 treatment_protocol_ch2
> 32                   type
>
>
> Any advice greatly appreciated.
>
>
> Tom
>


