[BioC] Adding annotations to GSE datasets

Thu May 8 16:15:12 CEST 2014

On Thu, May 8, 2014 at 8:21 AM, Marcelo Pereira <marcelops at gmail.com> wrote:
> That is all because I am interested in the expression values for some pairs
> of genes.
>
> If I had something like this:
>
>       GSM278765 GSM278766 GSM278767 ...
> A1BG   5.459950  5.548725  5.477436 ...
> NAT2   6.728919  6.329578  6.570104 ...
> ADA    6.861095  7.005730  7.235361 ...
> CDH2   9.660035  9.189507  9.740223 ...
> ...    5.644313  5.898675  5.475838 ...
> ...    7.838040  7.564335  8.397569 ...
>
> Then I could extract lines for the genes of interest (for example, 'A1BG'
> and 'ADA'), and then plot scatterplots, compute correlation coefficients,
> etc...

Something like this might work:

plot(exprs(gset[[1]])[fData(gset[[1]])$Gene=='A1BG',])

Sean
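
For pairs of genes, one option is to relabel the rows of the expression
matrix with the symbols from fData() and then index by name. What follows
is only a sketch on toy data copied from the values quoted in this thread.
label_by_symbol() is a hypothetical helper (not a GEOquery or Biobase
function), and keeping the first row when a symbol is repeated is an
assumption made for brevity, not a recommendation.

```r
# Toy stand-ins for exprs(gset[[1]]) and fData(gset[[1]]), built from the
# values and ID-to-symbol pairs quoted in this thread.
expr <- matrix(
  c(5.459950, 5.548725, 5.477436,
    6.728919, 6.329578, 6.570104,
    6.861095, 7.005730, 7.235361,
    9.660035, 9.189507, 9.740223),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("1", "10", "100", "1000"),
                  c("GSM278765", "GSM278766", "GSM278767"))
)
feat <- data.frame(
  ID   = c("1", "10", "100", "1000"),
  Gene = c("A1BG", "NAT2", "ADA", "CDH2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of GEOquery or Biobase): returns the
# expression matrix with gene symbols as row names.
label_by_symbol <- function(expr, feat, id_col = "ID", symbol_col = "Gene") {
  # Align annotation rows to the matrix rows by feature ID.
  idx <- match(rownames(expr), as.character(feat[[id_col]]))
  symbols <- as.character(feat[[symbol_col]])[idx]
  # Drop features without a symbol; keep the first row of a repeated symbol
  # (an assumption; averaging the rows would be another choice).
  keep <- !is.na(symbols) & symbols != "" & !duplicated(symbols)
  out <- expr[keep, , drop = FALSE]
  rownames(out) <- symbols[keep]
  out
}

labelled <- label_by_symbol(expr, feat)

# Pull out a pair of genes: one value per sample for each gene.
x <- labelled["A1BG", ]
y <- labelled["ADA", ]
pair_cor <- cor(x, y)                     # Pearson correlation across samples
plot(x, y, xlab = "A1BG", ylab = "ADA")   # one point per sample
```

With the real data, expr and feat would come from exprs(gset[[1]]) and
fData(gset[[1]]). The Gene column is a factor in the output quoted below,
hence the as.character() calls.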

> The name of the genes for each line is the only detail that is not present
> in my dataset.
>
> What am I missing here?
>
> Thanks,
> Marcelo
>
>
>
> On Thu, May 8, 2014 at 7:42 AM, Marcelo Pereira <marcelops at gmail.com> wrote:
>>
>> Hello Sean,
>>
>> Thanks for your replies.
>>
>> I used to download all the CEL files, and then load, normalize and
>> generate the ExpressionSet output.  All manually, and it was working fine!
>>
>> Then I found out about doing it automatically using the GEOquery library.
>> And this is what has been taking up my hours lately.
>>
>> The output of exprs(gset[[1]]) is the initial point where I got stuck
>> after a few minutes using the GEOquery library, because I have the
>> expression values, but not the gene names.
>>
>>       GSM278765 GSM278766 GSM278767 ...
>> 1      5.459950  5.548725  5.477436 ...
>> 10     6.728919  6.329578  6.570104 ...
>> 100    6.861095  7.005730  7.235361 ...
>> 1000   9.660035  9.189507  9.740223 ...
>> 10000  5.644313  5.898675  5.475838 ...
>> 10001  7.838040  7.564335  8.397569 ...
>>
>> After that, I tried to manipulate the output in order to translate 1, 10,
>> 100, 1000, to the actual names of the genes.  And my last resort was to
>> ask here at the forum.
>>
>> It is looking good already.  I only need to have an extra column, with the
>> names of the genes.
>>
>> Thanks,
>> Marcelo
>>
>>
>> On Thu, May 8, 2014 at 7:14 AM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>>>
>>> On Thu, May 8, 2014 at 6:58 AM, Marcelo Pereira <marcelops at gmail.com>
>>> wrote:
>>> > Hi Sean,
>>> >
>>> > Thanks for your answer!
>>> >
>>> > That is great already.
>>> >
>>> > I can see the gene names now:
>>> >
>>> >> library(GEOquery)
>>> >> gset <- getGEO("GSE11024", GSEMatrix=TRUE, AnnotGPL=TRUE)
>>> >> head(fData(gset[[1]]))$Gene
>>> > [1] A1BG NAT2 ADA  CDH2 AKT3 MED6
>>> > 17098 Levels:  A1BG ABCB6 ABCC5 ABCC9 ABCF2 ABI1 ACOT8 ACTR2 ACTR3 ADA
>>> > ADAM8 AKT3 ... ZNF254
>>> >
>>> > But the data frame only contains these columns.
>>> >
>>> >> names(fData(gset[[1]]))
>>> >  [1] "ID"           "Gene"         "UniGene"      "Description"  "Ensembl* Chr" "Start (bp)"
>>> >  [7] "End (bp)"     "Strand"       "ORF"          "SPOT_ID"
>>> >
>>> > Where is the expression information for each gene?
>>>
>>> exprs(gset[[1]])
>>>
>>> gset[[1]] is an ExpressionSet, so you should read a bit about
>>> ExpressionSets in the Biobase vignette as well as the help page.
>>>
>>> Sean
>>>
>>>
>>> >
>>> > Thanks,
>>> > Marcelo
>>> >
>>> >
>>> >
>>> > On Thu, May 8, 2014 at 6:24 AM, Sean Davis <sdavis2 at mail.nih.gov>
>>> > wrote:
>>> >
>>> >> Hi, Marcelo.
>>> >>
>>> >>
>>> >> On Wed, May 7, 2014 at 8:01 PM, Marcelo Pereira <marcelops at gmail.com>
>>> >> wrote:
>>> >> > Quick question:
>>> >> >
>>> >> > I am trying to import some GEO datasets, and having some issues with
>>> >> > the annotations:
>>> >> >
>>> >> > I can download the GSE dataset using:
>>> >> >
>>> >> > gset <- getGEO("GSE11024", GSEMatrix=TRUE, AnnotGPL=TRUE)
>>> >> >
>>> >> >
>>> >> > However, it returns an ExpressionSet with the following format:
>>> >> >
>>> >> >               X1    X10    X100   X1000 ...
>>> >> > GSM278765
>>> >> > GSM278766
>>> >> > GSM278767
>>> >> > GSM278768
>>> >> > GSM278769
>>> >> > ...
>>> >>
>>> >> This is not what GEOquery returns, so it seems you have done some
>>> >> manipulation (it looks like you transposed the expression matrix).
>>> >>
>>> >> > This is pretty much what I need, but I still need to translate (X1,
>>> >> > X10, X100, X1000, etc...) to the actual names of the genes.
>>> >>
>>> >> library(GEOquery)
>>> >> gset <- getGEO("GSE11024", GSEMatrix=TRUE, AnnotGPL=TRUE)[[1]]
>>> >> head(fData(gset))
>>> >>
>>> >> The gene symbols are in the "Gene" column:
>>> >>
>>> >> genesymbols = fData(gset)$Gene
>>> >>
>>> >> Sean
>>> >>
>>> >>
>>> >> >
>>> >> > Any suggestions?
>>> >> >
>>> >> > Thanks,
>>> >> > Marcelo
>>> >> >
>>> >> > _______________________________________________
>>> >> > Bioconductor mailing list
>>> >> > Bioconductor at r-project.org
>>> >> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> >> > Search the archives:
>>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> >>
>>> >
>>
>>
>