[BioC] Adding annotations to GSE datasets

Thu May 8 17:30:28 CEST 2014

On Thu, May 8, 2014 at 11:22 AM, Marcelo Pereira <marcelops at gmail.com> wrote:
> One last question:
>
>       GSM278765 GSM278766 GSM278767 ...
> A1BG   5.459950  5.548725  5.477436 ...
> NAT2   6.728919  6.329578  6.570104 ...
> ADA    6.861095  7.005730  7.235361 ...
> CDH2   9.660035  9.189507  9.740223 ...
> ...    5.644313  5.898675  5.475838 ...
> ...    7.838040  7.564335  8.397569 ...
>
> Each CEL file has a description, telling which kind of tissue that sample is
> related to.
>
> Is there a direct way of translating the column names  from (GSM278765,
> GSM278766, ...) to the description of the tissue (CC_KIDNEY_1, CC_KIDNEY_2,
> CC_KIDNEY_3, ...) ?
>
>       CC_KIDNEY_1  CC_KIDNEY_2  CC_KIDNEY_3 ...
> A1BG     5.459950     5.548725     5.477436 ...
> NAT2     6.728919     6.329578     6.570104 ...
> ADA      6.861095     7.005730     7.235361 ...
> CDH2     9.660035     9.189507     9.740223 ...
> ...      5.644313     5.898675     5.475838 ...
> ...      7.838040     7.564335     8.397569 ...
>
> Thanks,
> Marcelo

You'll need to do a little work using sub(), but this information is
typically in one of the columns of:

pData(gset[[1]])
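
As a minimal sketch of that renaming step, assuming the tissue label sits in
a "title" column of pData() (an assumption; inspect the table for your own
series to find the right column) and carries a prefix that sub() can strip.
The toy matrix and data frame are made-up stand-ins for exprs(gset[[1]]) and
pData(gset[[1]]), and the "Sample: " prefix is hypothetical:

```r
# Toy stand-in for exprs(gset[[1]]): rows are features, columns are GSM samples.
# Values are copied from the example table quoted above.
expr <- matrix(
  c(5.459950, 6.728919, 6.861095,
    5.548725, 6.329578, 7.005730,
    5.477436, 6.570104, 7.235361),
  nrow = 3,
  dimnames = list(c("A1BG", "NAT2", "ADA"),
                  c("GSM278765", "GSM278766", "GSM278767"))
)

# Toy stand-in for pData(gset[[1]]): one row per sample, row names are GSM IDs.
# The "title" column and its "Sample: " prefix are hypothetical.
pheno <- data.frame(
  title = c("Sample: CC_KIDNEY_1", "Sample: CC_KIDNEY_2", "Sample: CC_KIDNEY_3"),
  row.names = c("GSM278765", "GSM278766", "GSM278767"),
  stringsAsFactors = FALSE
)

# Index by column name so each label is matched to its own sample,
# whatever order the rows of the sample table are in.
labels <- sub("^Sample: ", "", pheno[colnames(expr), "title"])

# Labels must be unique and non-missing to serve as column names.
stopifnot(!anyNA(labels), !anyDuplicated(labels))

renamed <- expr
colnames(renamed) <- labels
print(renamed)
```

With real data, exprs(gset[[1]]) would take the place of expr and
pData(gset[[1]]) the place of pheno; the pattern given to sub() depends on
how the submitter wrote the sample descriptions.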

This blog post by Rafa Irizarry might be helpful to understand how an
ExpressionSet works:

http://simplystatistics.org/2014/02/03/the-three-tables-for-genomics-collaborations/
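
For what it's worth, the layout can be sketched with plain R objects. The
three toy tables below are stand-ins for exprs(), fData() and pData(); the
feature IDs, gene symbols and values come from this thread, while the "title"
column is an assumption. The point is only how the row and column names line
up across the tables:

```r
# Expression table: one row per feature ID, one column per sample.
expr <- matrix(
  c(5.459950, 6.728919, 6.861095, 9.660035,
    5.548725, 6.329578, 7.005730, 9.189507,
    5.477436, 6.570104, 7.235361, 9.740223),
  nrow = 4,
  dimnames = list(c("1", "10", "100", "1000"),
                  c("GSM278765", "GSM278766", "GSM278767"))
)

# Feature table: one row per feature, keyed by the same IDs as the rows of expr.
features <- data.frame(
  Gene = c("A1BG", "NAT2", "ADA", "CDH2"),
  row.names = c("1", "10", "100", "1000"),
  stringsAsFactors = FALSE
)

# Sample table: one row per sample, keyed by the same IDs as the columns of expr.
# The "title" column name is hypothetical.
samples <- data.frame(
  title = c("CC_KIDNEY_1", "CC_KIDNEY_2", "CC_KIDNEY_3"),
  row.names = colnames(expr),
  stringsAsFactors = FALSE
)

# The three tables are tied together by these two alignments.
stopifnot(identical(rownames(expr), rownames(features)),
          identical(colnames(expr), rownames(samples)))

# Relabel rows with gene symbols. This is only safe when the symbols are
# unique and non-empty, which is checked here for the toy data.
symbols <- features[rownames(expr), "Gene"]
stopifnot(!anyDuplicated(symbols), all(nzchar(symbols)))
by_gene <- expr
rownames(by_gene) <- symbols

# Correlation between two genes of interest across the samples.
r <- cor(by_gene["A1BG", ], by_gene["ADA", ])
print(r)
```

The same two rows could be passed to plot() for a scatterplot. On a real
ExpressionSet the alignment checks are already guaranteed by the class, which
is much of the reason for keeping the three tables in one object.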

Sean

>
> On Thu, May 8, 2014 at 10:21 AM, Marcelo Pereira <marcelops at gmail.com>
> wrote:
>>
>> Thanks Sean,
>>
>> That is exactly what I was looking for!
>>
>> Cheers,
>> Marcelo
>>
>>
>> On Thu, May 8, 2014 at 10:15 AM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>>>
>>> On Thu, May 8, 2014 at 8:21 AM, Marcelo Pereira <marcelops at gmail.com>
>>> wrote:
>>> > That is all because I am interested in the expression values for some
>>> > pairs of genes.
>>> >
>>> > If I had something like this:
>>> >
>>> >       GSM278765 GSM278766 GSM278767 ...
>>> > A1BG   5.459950  5.548725  5.477436 ...
>>> > NAT2   6.728919  6.329578  6.570104 ...
>>> > ADA    6.861095  7.005730  7.235361 ...
>>> > CDH2   9.660035  9.189507  9.740223 ...
>>> > ...    5.644313  5.898675  5.475838 ...
>>> > ...    7.838040  7.564335  8.397569 ...
>>> >
>>> > Then I could extract the rows for the genes of interest (for example,
>>> > 'A1BG' and 'ADA'), and then plot scatterplots, compute correlation
>>> > coefficients, etc.
>>>
>>> Something like this might work:
>>>
>>> plot(exprs(gset[[1]])[fData(gset[[1]])$Gene=='A1BG',])
>>>
>>> Sean
>>>
>>>
>>> > The gene name for each row is the only detail that is not present in
>>> > my dataset.
>>> >
>>> > What am I missing here?
>>> >
>>> > Thanks,
>>> > Marcelo
>>> >
>>> >
>>> >
>>> > On Thu, May 8, 2014 at 7:42 AM, Marcelo Pereira <marcelops at gmail.com>
>>> > wrote:
>>> >>
>>> >> Hello Sean,
>>> >>
>>> >> Thanks for your replies.
>>> >>
>>> >> I used to download all the CEL files, and then load, normalize and
>>> >> generate the ExpressionSet output.  All manually, and it was working
>>> >> fine!
>>> >>
>>> >> Then I found out about doing it automatically using the GEOquery
>>> >> library. And this is what has been taking up my hours lately.
>>> >>
>>> >> The output of exprs(gset[[1]]) is where I got stuck after a few
>>> >> minutes of using the GEOquery library, because I have the expression
>>> >> values, but not the gene names.
>>> >>
>>> >>       GSM278765 GSM278766 GSM278767 ...
>>> >> 1      5.459950  5.548725  5.477436 ...
>>> >> 10     6.728919  6.329578  6.570104 ...
>>> >> 100    6.861095  7.005730  7.235361 ...
>>> >> 1000   9.660035  9.189507  9.740223 ...
>>> >> 10000  5.644313  5.898675  5.475838 ...
>>> >> 10001  7.838040  7.564335  8.397569 ...
>>> >>
>>> >> After that, I tried to manipulate the output in order to translate 1,
>>> >> 10, 100, 1000 to the actual names of the genes. As a last resort, I
>>> >> asked here on the list.
>>> >>
>>> >> It is looking good already. I only need an extra column with the
>>> >> names of the genes.
>>> >>
>>> >> Thanks,
>>> >> Marcelo
>>> >>
>>> >>
>>> >> On Thu, May 8, 2014 at 7:14 AM, Sean Davis <sdavis2 at mail.nih.gov>
>>> >> wrote:
>>> >>>
>>> >>> On Thu, May 8, 2014 at 6:58 AM, Marcelo Pereira <marcelops at gmail.com>
>>> >>> wrote:
>>> >>> > Hi Sean,
>>> >>> >
>>> >>> > Thanks for your answer!
>>> >>> >
>>> >>> > That is great already.
>>> >>> >
>>> >>> > I can see the gene names now:
>>> >>> >
>>> >>> >> library(GEOquery)
>>> >>> >> gset <- getGEO("GSE11024", GSEMatrix=TRUE, AnnotGPL=TRUE)
>>> >>> >> head(fData(gset[[1]]))$Gene
>>> >>> > [1] A1BG NAT2 ADA  CDH2 AKT3 MED6
>>> >>> > 17098 Levels:  A1BG ABCB6 ABCC5 ABCC9 ABCF2 ABI1 ACOT8 ACTR2 ACTR3
>>> >>> > ADA
>>> >>> > ADAM8 AKT3 ... ZNF254
>>> >>> >
>>> >>> > But the data frame only contains these columns.
>>> >>> >
>>> >>> >> names(fData(gset[[1]]))
>>> >>> >  [1] "ID"           "Gene"         "UniGene"      "Description"
>>> >>> > "Ensembl*
>>> >>> > Chr" "Start (bp)"
>>> >>> >  [7] "End (bp)"     "Strand"       "ORF"          "SPOT_ID"
>>> >>> >
>>> >>> > Where is the expression information for each gene?
>>> >>>
>>> >>> exprs(gset[[1]])
>>> >>>
>>> >>> gset is an ExpressionSet, so you should read a bit about
>>> >>> ExpressionSets in the Biobase vignette as well as the help page.
>>> >>>
>>> >>> Sean
>>> >>>
>>> >>>
>>> >>> >
>>> >>> > Thanks,
>>> >>> > Marcelo
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > On Thu, May 8, 2014 at 6:24 AM, Sean Davis <sdavis2 at mail.nih.gov>
>>> >>> > wrote:
>>> >>> >
>>> >>> >> Hi, Marcelo.
>>> >>> >>
>>> >>> >>
>>> >>> >> On Wed, May 7, 2014 at 8:01 PM, Marcelo Pereira
>>> >>> >> <marcelops at gmail.com>
>>> >>> >> wrote:
>>> >>> >> > Quick question:
>>> >>> >> >
>>> >>> >> > I am trying to import some GEO datasets, and having some issues
>>> >>> >> > with the annotations:
>>> >>> >> >
>>> >>> >> > I can download the GSE dataset using:
>>> >>> >> >
>>> >>> >> > gset <- getGEO("GSE11024", GSEMatrix=TRUE, AnnotGPL=TRUE)
>>> >>> >> >
>>> >>> >> >
>>> >>> >> > However, it returns an ExpressionSet with the following format:
>>> >>> >> >
>>> >>> >> >               X1    X10    X100   X1000 ...
>>> >>> >> > GSM278765
>>> >>> >> > GSM278766
>>> >>> >> > GSM278767
>>> >>> >> > GSM278768
>>> >>> >> > GSM278769
>>> >>> >> > ...
>>> >>> >>
>>> >>> >> This is not what GEOquery returns, so it seems you have done some
>>> >>> >> manipulation (it looks like you transposed the expression matrix).
>>> >>> >>
>>> >>> >> > This is pretty much what I need, but I still need to translate
>>> >>> >> > (X1, X10, X100, X1000, etc.) to the actual names of the genes.
>>> >>> >>
>>> >>> >> library(GEOquery)
>>> >>> >> gset <- getGEO("GSE11024", GSEMatrix=TRUE, AnnotGPL=TRUE)[[1]]
>>> >>> >> head(fData(gset))
>>> >>> >>
>>> >>> >> The gene symbols are in the "Gene" column:
>>> >>> >>
>>> >>> >> genesymbols = fData(gset)$Gene
>>> >>> >>
>>> >>> >> Sean
>>> >>> >>
>>> >>> >>
>>> >>> >> >
>>> >>> >> > Any suggestions?
>>> >>> >> >
>>> >>> >> > Thanks,
>>> >>> >> > Marcelo
>>> >>> >> >
>>> >>> >> > _______________________________________________
>>> >>> >> > Bioconductor mailing list
>>> >>> >> > Bioconductor at r-project.org
>>> >>> >> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> >>> >> > Search the archives:
>>> >>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> >>> >>