[BioC] Re: How to use GEOquery to extract more than the default information from a GSE

Mon Jul 27 10:16:27 CEST 2009

Dear Sean,

Good morning.

Thank you for your patience and support here.

May I assume that the command you suggested,

gse <- getGEO('GSE9820')[[1]]

is somewhat similar to (but more condensed than) the one James gave me?

Indeed, I now have the annotations in my ExprSet, but I am still somewhat dissatisfied. Each GEO object comes with plenty of interesting metadata, and I would like to learn how to include or exclude them when downloading and creating the ExprSet object.

For example, GSE9820 of course contains the metadata of every individual GSM (I am pasting one example between << >>):

<<
GSM247855
An object of class "GSM"
channel_count 
[1] "1"
characteristics_ch1 
[1] "control"            "patient ID_REF: A4" "age:44"            
[4] "sex:M"             
contact_address 
[1] "Meibergdreef 9"
contact_city 
[1] "Amsterdam"
contact_country 
[1] "Netherlands"
contact_department 
[1] "Cardiology"
contact_email 
[1] "stephan.schirmer at uks.eu"
contact_institute 
[1] "Academic Medical Center"
contact_name 
[1] "Stephan,Henrik,Schirmer"
contact_zip/postal_code 
[1] "1105AZ"
data_processing 
[1] "Array data were extracted using Illumina's BeadStudio software. From 13 controls and 18 patients we analyzed CD14+ monocyte, CD4+ T-cell, LPS-stimulated monocytes and macrophage samples, in total 130 arrays (including 6 technical replicates). From the CD34+ cell samples, only 23 passed quality control and were analyzed by array, giving a grand total of 153 arrays. Normalization and statistical analysis of the bead summary data from the arrays was carried out using the limma package14 and in-house scripts in R/Bioconductor. Bead summary intensities were log2-transformed and then normalized using quantile normalization. To find differentially expressed genes, we performed a linear model analysis. Technical replicates were handled by estimating a common value for the intra-replicate correlation and including it in the linear model. Differential expression between the treatments of interest was assessed using a moderated t-test. This test is similar to a standard t-test for each probe except that the standard errors are moderated across genes to ensure more stable inference for each gene. Resulting p-values were corrected for multiple testing using the Benjamini-Hochberg false discovery rate."
data_row_count 
[1] "20589"
description 
[1] "A4_20h_C"
extract_protocol_ch1 
[1] "Positively isolated monocytes, T-cells and stem cells as well as cultured stimulated monocytes and macrophages were lysed and total RNA was isolated (Absolutely RNA Microprep Kit, Stratagene, La Jolla, CA)."
geo_accession 
[1] "GSM247855"
growth_protocol_ch1 
[1] "Using immunomagnetic beads (Dynabeads, Invitrogen, Carlsbad, CA), CD14+ monocytes, CD4+ T-cells and CD34+ stem cells were positively isolated for direct cell lysis, while negatively isolated monocytes were split into two fractions for stimulation with 10 ng/ml lipopolysaccharide (LPS) for 3h, or for 20h cell culture towards macrophages."
hyb_protocol 
[1] "According to beadchip array manufacturer's protocol"
label_ch1 
[1] "biotin"
label_protocol_ch1 
[1] "Total RNA samples were amplified and biotinylated using the Illumina TotalPrep RNA amplification Kit (Ambion, Austin, TX)."
last_update_date 
[1] "Dec 01 2008"
molecule_ch1 
[1] "total RNA"
organism_ch1 
[1] "Homo sapiens"
platform_id 
[1] "GPL6255"
scan_protocol 
[1] "According to beadchip array manufacturer's protocol"
series_id 
[1] "GSE9820"
source_name_ch1 
[1] "macrophages"
status 
[1] "Public on Dec 01 2008"
submission_date 
[1] "Dec 07 2007"
supplementary_file 
[1] "NONE"
title 
[1] "A4_20h_C"
type 
[1] "RNA"
An object of class "GEODataTable"
****** Column Descriptions ******
  Column                      Description
1 ID_REF                                 
2  VALUE log2 normalized signal intensity
****** Data Table ******
      ID_REF       VALUE
1 ILMN_10000 6.724885805
2 ILMN_10001 11.24398853
3 ILMN_10002   2.4373955
4 ILMN_10004 5.248193616
5 ILMN_10005 2.891016346
20583 more rows ...

Slot "gpls":
$GPL6255
An object of class "GPL"
contact_address 
[1] "9000 Rockville Pike"
contact_city 
[1] "Bethesda"
contact_country 
[1] "MD"
contact_email 
[1] "geo at ncbi.nlm.nih.gov"
contact_institute 
[1] "NCBI/NLM/NIH"
contact_name 
[1] "GEO,,admin"
contact_zip/postal_code 
[1] "20892"
data_row_count 
[1] "20589"
distribution 
[1] "commercial"
geo_accession 
[1] "GPL6255"
last_update_date 
[1] "Feb 10 2009"
manufacture_protocol 
[1] "see manufacturer's website"
manufacturer 
[1] "Illumina Inc."
organism 
[1] "Homo sapiens"
status 
[1] "Public on Dec 07 2007"
submission_date 
[1] "Dec 07 2007"
technology 
[1] "oligonucleotide beads"
title 
[1] "Illumina humanRef-8 v2.0 expression beadchip"
An object of class "GEODataTable"
****** Column Descriptions ******
      Column
1         ID
2     GB_ACC
3     SYMBOL
4 DEFINITION
5   ONTOLOGY
6    SYNONYM
                                                                                                       Description
1                                                           Search_Key; Internal id useful for custom design array
2 GenBank Accession Number LINK_PRE:"http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&db=Nucleotide&term="
3                                                                             Gene symbol from the source database
4                                                                                 Gene description from the source
5                                                                           annotations from Gene Ontology project
6                                                                                 Gene symbol synonyms from Refseq
****** Data Table ******
          ID      GB_ACC  SYMBOL
1 ILMN_10000 NM_007112.3   THBS3
2 ILMN_10001 NM_018976.3 SLC38A2
3 ILMN_10002 NM_175569.1      XG
4 ILMN_10004 NM_001954.3    DDR1
5 ILMN_10005 NM_031966.2   CCNB1
                                                                                   DEFINITION
1                                                Homo sapiens thrombospondin 3 (THBS3), mRNA.
2                            Homo sapiens solute carrier family 38, member 2 (SLC38A2), mRNA.
3                                                     Homo sapiens Xg blood group (XG), mRNA.
4 Homo sapiens discoidin domain receptor family, member 1 (DDR1), transcript variant 2, mRNA.
5                                                       Homo sapiens cyclin B1 (CCNB1), mRNA.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ONTOLOGY
1                                                                                                                                                                                                                                                                                              cell-matrix adhesion [goid 7160] [pmid 8468055] [evidence TAS]; cell motility [goid 6928] [evidence NR ]; calcium ion binding [goid 5509] [pmid 8288588] [evidence TAS]; structural molecule activity [goid 5198] [evidence IEA]; protein binding [goid 5515] [evidence IEA]; heparin binding [goid 8201] [evidence NR ]; extracellular matrix (sensu Metazoa) [goid 5578] [evidence NR ]
2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      transport [goid 6810] [evidence IEA]; amino acid transport [goid 6865] [evidence IEA]; amino acid-polyamine transporter activity [goid 5279] [evidence IEA]; membrane [goid 16020] [evidence IEA]
3                                                                                                                                                                                                                                                                                                                                                                                                                                                                       biological process unknown [goid 4] [evidence ND ]; molecular function unknown [goid 5554] [pmid 8054981] [evidence ND ]; membrane [goid 16020] [evidence NAS]; integral to membrane [goid 16021] [evidence IEA]
4 cell adhesion [goid 7155] [pmid 8302582] [evidence TAS]; transmembrane receptor protein tyrosine kinase signaling pathway [goid 7169] [evidence IEA]; protein amino acid phosphorylation [goid 6468] [evidence IEA]; nucleotide binding [goid 166] [evidence IEA]; transmembrane receptor protein tyrosine kinase activity [goid 4714] [pmid 9659899] [evidence TAS]; receptor activity [goid 4872] [evidence IEA]; transferase activity [goid 16740] [evidence IEA]; ATP binding [goid 5524] [evidence IEA]; protein-tyrosine kinase activity [goid 4713] [evidence IEA]; membrane [goid 16020] [evidence IEA]; integral to plasma membrane [goid 5887] [pmid 8390675] [evidence TAS]
5                                                                                                                                                                                                                                                                                                                                                cell division [goid 51301] [evidence IEA]; mitosis [goid 7067] [evidence IEA]; regulation of cell cycle [goid 74] [evidence IEA]; G2/M transition of mitotic cell cycle [goid 86] [evidence NAS]; cell cycle [goid 7049] [evidence IEA]; protein binding [goid 5515] [pmid 10373560] [evidence IPI]; nucleus [goid 5634] [evidence IEA]
                                                             SYNONYM
1                                                               TSP3
2                               ATA2; SAT2; SNAT2; PRO1068; KIAA1382
3                   PBDX; MGC118758; MGC118759; MGC118760; MGC118761
4 CAK; DDR; NEP; PTK3; RTK6; TRKE; CD167; EDDR1; MCK10; NTRK4; PTK3A
5                                                               CCNB
20583 more rows ...

>>

...so I could be interested, for instance, in including this information:

characteristics_ch1 
[1] "control"            "patient ID_REF: A4" "age:44"            
[4] "sex:M"   

source_name_ch1
[1] "macrophages"

together with the expression levels and the annotations.

Is there any resource, other than the vignette, from which I can learn how to use your very powerful GEOquery in GSEMatrix=FALSE mode?
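
[Editorial sketch, not part of the original message.] One possible way to collect fields such as characteristics_ch1 and source_name_ch1 into a per-sample table is outlined below in R. It assumes that GEOquery's GSMList() and Meta() return, for each sample, a list whose elements match the field names printed above. parse_characteristics() and sample_table() are hypothetical helper names introduced only for this illustration, and the positional name given to entries without a colon (such as "control") is an arbitrary choice.

```r
# Sketch only. parse_characteristics() and sample_table() are hypothetical
# helpers written for this example; they are not part of GEOquery.

# Turn entries such as "age:44" into a named character vector.
# Entries without a colon (e.g. "control") get a positional name.
parse_characteristics <- function(x) {
  has_key <- grepl(":", x, fixed = TRUE)
  keys <- ifelse(has_key,
                 trimws(sub(":.*$", "", x)),
                 paste0("characteristic_", seq_along(x)))
  vals <- ifelse(has_key, trimws(sub("^[^:]*:", "", x)), x)
  setNames(vals, keys)
}

# meta_list: named list with one metadata list per sample, i.e. the shape
# assumed here for lapply(GSMList(gse), Meta).
sample_table <- function(meta_list) {
  rows <- lapply(meta_list, function(m) {
    c(source_name_ch1 = m$source_name_ch1,
      parse_characteristics(m$characteristics_ch1))
  })
  cols <- unique(unlist(lapply(rows, names)))
  # Align every sample to the union of fields; missing fields become NA.
  mat <- do.call(rbind, lapply(rows, function(r) unname(r[cols])))
  colnames(mat) <- cols
  as.data.frame(mat, stringsAsFactors = FALSE)
}

# Intended use (needs network access, so left commented out):
# library(GEOquery)
# gse   <- getGEO("GSE9820", GSEMatrix = FALSE)
# pheno <- sample_table(lapply(GSMList(gse), Meta))
```

The resulting data frame has one row per GSM accession, so it could be matched by row name against the columns of an expression matrix before building an ExpressionSet.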

Thank you in advance. Best regards,

Marco

--
Marco Manca, MD
University of Maastricht
Faculty of Health, Medicine and Life Sciences (FHML)
Cardiovascular Research Institute (CARIM)
E-mail: m.manca at path.unimaas.nl
Mobile: +31626441205
Twitter: @markomanka
________________________________________
From: seandavi at gmail.com [seandavi at gmail.com] on behalf of Sean Davis [sdavis2 at mail.nih.gov]
Sent: Friday, 24 July 2009 15:54
To: Manca Marco (PATH)
Cc: bioconductor mailing list
Subject: Re: [BioC] How to use GEOquery to extract more than the default information from a GSE

On Fri, Jul 24, 2009 at 9:11 AM, Manca Marco (PATH) <m.manca at path.unimaas.nl> wrote:

Dear Sean and dear bioconductors,

I am writing to ask for pointers (code snippets, notes, references, whatever you think appropriate) on how to import array annotation and other data from the GSE I am trying to work with (namely GSE9820) into my eset.

I have read in GEOquery's vignette that this is actually possible, although a bit tricky:

"So, using a combination of lapply on the GSMList, one can extract as many columns of interest as necessary to build the data structure of choice. Because the GSM data from the GEO website are fully downloaded and included in the GSE object, one can extract foreground and background as well as quality for two-channel arrays, for example. Getting array annotation is also a bit more complicated, but by replacing "platform" in the lapply call to get platform information for each array, one can get other information associated with each array. Future work with this package will likely focus on better tools for manipulating GSE data" (from http://www.bioconductor.org/packages/2.4/bioc/vignettes/GEOquery/inst/doc/GEOquery.pdf, page 22 of 22)

...but I can't find any hints anywhere.
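
[Editorial sketch, not part of the original message.] The lapply-over-GSMList approach described in the vignette passage could look roughly like the R code below. It assumes that every sample table has an ID_REF column and a VALUE column, and that all samples share one platform, as the GSM and GPL printouts in this thread suggest for GSE9820. values_matrix() is a hypothetical helper name, not a GEOquery function.

```r
# Sketch only. values_matrix() is a hypothetical helper written for this
# example; it is not part of GEOquery.

# tables:    named list of data frames, one per sample, each with an ID_REF
#            column (the shape assumed for lapply(GSMList(gse), Table)).
# probe_ids: probe identifiers that define the row order of the result.
# column:    which per-sample column to extract.
values_matrix <- function(tables, probe_ids, column = "VALUE") {
  cols <- lapply(tables, function(tab) {
    idx <- match(probe_ids, tab$ID_REF)       # align rows to probe_ids
    as.numeric(as.character(tab[[column]][idx]))
  })
  mat <- do.call(cbind, cols)                 # one column per sample
  rownames(mat) <- probe_ids
  mat
}

# Intended use (needs network access, so left commented out):
# library(GEOquery)
# gse       <- getGEO("GSE9820", GSEMatrix = FALSE)
# gsms      <- GSMList(gse)
# platforms <- sapply(gsms, function(x) Meta(x)$platform_id)
# probes    <- Table(GPLList(gse)[[1]])$ID
# expr      <- values_matrix(lapply(gsms, Table), probes)
```

Matching on probe identifiers rather than binding columns directly guards against sample tables that list probes in different orders; probes missing from a sample end up as NA.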

Thank you in advance for your patience and support.

Hi, Marco.

Have you tried:

gse <- getGEO('GSE9820')[[1]]

This should get you all the annotation and the normalized (according to the original submitter's methods) data in an ExpressionSet.  If that isn't what you need, could you please provide the output of sessionInfo(), the code you have tried, and what shortcomings your code has?

Thanks,
Sean
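
[Editorial sketch, not part of the original message.] With the ExpressionSet returned by the getGEO() call suggested above, sample metadata, probe annotation and expression values are usually reached through the Biobase accessors pData(), fData() and exprs(). The R code below is a minimal sketch of narrowing the sample metadata to the columns of interest. pick_columns() is a hypothetical helper, and the assumption that the relevant column names start with "characteristics" or "source_name" should be checked against colnames(pData(gse)).

```r
# Sketch only. pick_columns() is a hypothetical helper written for this
# example; it is not part of GEOquery or Biobase.

# pd: data frame of per-sample metadata, such as pData() of an ExpressionSet.
# pattern: regular expression selecting the columns to keep (an assumption).
pick_columns <- function(pd, pattern = "^(characteristics|source_name)") {
  pd[, grepl(pattern, colnames(pd)), drop = FALSE]
}

# Intended use (needs network access, so left commented out):
# library(GEOquery)
# library(Biobase)
# gse <- getGEO("GSE9820")[[1]]
# head(pick_columns(pData(gse)))   # sample metadata of interest
# head(fData(gse))                 # probe annotation from the platform
# exprs(gse)[1:5, 1:3]             # expression values as submitted
```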