[BioC] SQLForge and probes that map to multiple genes

Mon Jul 14 18:16:55 CEST 2008

On Mon, Jul 14, 2008 at 12:07 PM, Cei Abreu-Goodger <cei at sanger.ac.uk> wrote:
> Hi Sean,
>
> Ok, so my example was even worse than I thought. And I had forgotten to mention
> that the otherSrc parameter wasn't what I needed. So, to return to my bad
> example, I now have two separate files, the first column in the first file,
> the second in the second file:
>
>> refseqs <- "gnf1m.test.tab"
>> refseqs2 <- "gnf1m.test2.tab"
>>
>> read.table(refseqs)
>              V1        V2
> 1   gnf1m00050_at NM_008929
> 2 gnf1m00051_a_at NM_007487
> 3 gnf1m00052_a_at NM_178939
> 4 gnf1m00053_a_at NM_181666
> 5 gnf1m00054_a_at NM_026430
> 6 gnf1m00055_a_at NM_029916
> 7 gnf1m00056_a_at NM_181666
>> read.table(refseqs2)
>              V1        V2
> 1   gnf1m00050_at NM_172283
> 2 gnf1m00051_a_at NM_172283
> 3 gnf1m00052_a_at NM_172283
> 4 gnf1m00053_a_at NM_172283
> 5 gnf1m00054_a_at NM_172283
> 6 gnf1m00055_a_at NM_172283
> 7 gnf1m00056_a_at NM_172283
>
> I now add the second file as an otherSrc:
>
>> makeMOUSECHIP_DB(affy=FALSE, prefix="test", fileName=refseqs,
>                   baseMapType="refseq", otherSrc=c(refseqs2),
>                   outputDir=".", version="0.9", manufacturer="GNF-Affymetrix",
>                   chipName="gnf1m")
>
>
> But this still doesn't add the second gene's annotation to all the probes
> (the resulting package's annotation is exactly the same as in the first
> case). Is there any other way?

I think that the way SQLForge works now, it will only use the
additional annotation if the first ID is not successfully mapped.
(Someone else should probably confirm my assertion about this).  Since
it appears that your first column contains all RefSeq IDs, you will
never get to the second column.  So, in short, I don't know how to
make SQLForge do what you want.
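
As a toy illustration of that suspected behaviour, here is a minimal
sketch in plain R. It is not SQLForge's actual code: known_ids, pick_id
and the probe_x row are hypothetical names and data made up for the
example, and the rule it encodes (consult the second source only when
the first ID fails to map) is the unconfirmed assumption stated above.

```r
# Toy model of "use otherSrc only if the primary ID does not map".
# NOT SQLForge code; all helper names here are hypothetical.
primary <- c(gnf1m00050_at   = "NM_008929",
             gnf1m00051_a_at = "NM_007487",
             probe_x         = "UNMAPPED_ID")   # made-up row for illustration
other   <- c(gnf1m00050_at   = "NM_172283",
             gnf1m00051_a_at = "NM_172283",
             probe_x         = "NM_172283")

# Stand-in for the IDs the organism database is able to map to a gene.
known_ids <- c("NM_008929", "NM_007487", "NM_172283")

pick_id <- function(probe) {
  # Keep the primary ID when it maps; otherwise fall back to the other source.
  if (primary[[probe]] %in% known_ids) primary[[probe]] else other[[probe]]
}

chosen <- vapply(names(primary), pick_id, character(1))
print(chosen)
```

Under this model the two real probes keep their first RefSeq ID and
never pick up NM_172283; only the made-up probe_x, whose first ID does
not map, falls through to the second source.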

Sean

>
> Sean Davis wrote:
>>
>> On Mon, Jul 14, 2008 at 11:20 AM, Cei Abreu-Goodger <cei at sanger.ac.uk> wrote:
>>>
>>> Hi all,
>>>
>>> I'm trying to generate an annotation package for a custom mouse Affy chip
>>> (GNF1M). I'm a bit confused about how the package deals with probes that
>>> are mapped to multiple genes. Sure, when I have a single column of
>>> identifiers everything works nicely, but what exactly happens when I have
>>> more than one gene per probe?
>>>
>>> I tried a mock annotation, code below:
>>>
>>> # Running code to build the annotation package
>>>
>>>> library(AnnotationDbi)
>>>> library(mouse.db0)
>>>>
>>>> refseqs <- "gnf1m.test.tab"
>>>> read.table(refseqs)
>>>             V1        V2        V3
>>> 1   gnf1m00050_at NM_008929 NM_172283
>>> 2 gnf1m00051_a_at NM_007487 NM_172283
>>> 3 gnf1m00052_a_at NM_178939 NM_172283
>>> 4 gnf1m00053_a_at NM_181666 NM_172283
>>> 5 gnf1m00054_a_at NM_026430 NM_172283
>>> 6 gnf1m00055_a_at NM_029916 NM_172283
>>> 7 gnf1m00056_a_at NM_181666 NM_172283
>>>
>>>> makeMOUSECHIP_DB(affy=FALSE, prefix="test", fileName=refseqs,
>>> +                  baseMapType="refseq", outputDir=".", version="0.9",
>>> +                  manufacturer="GNF-Affymetrix", chipName="gnf1m")
>>>
>>>
>>> After installing, though, it seems to me that I have something strange.
>>> Although I added the refseq "NM_172283" to all of the probes, in the
>>> annotation it only went to two of them, the last one and another that was
>>> identical (see below). This might not be the best example, but if I do
>>> have probes that map to different genes, what's the best way of making
>>> SQLForge aware of this?
>>>
>>> Thanks!
>>>
>>> Cei
>>>
>>>
>>> # loading and accessing the annotation package
>>>
>>>> library(test.db)
>>>> as.list(testREFSEQ)
>>>
>>> $gnf1m00050_at
>>> [1] "NM_008929" "NP_032955"
>>>
>>> $gnf1m00051_a_at
>>> [1] "NM_001039515" "NM_007487"    "NP_001034604" "NP_031513"
>>>
>>> $gnf1m00052_a_at
>>> [1] "NM_178939" "NP_849270"
>>>
>>> $gnf1m00053_a_at
>>> [1] "NM_172283" "NM_181666" "NP_758487" "NP_858052"
>>>
>>> $gnf1m00054_a_at
>>> [1] "NM_026430" "NP_080706"
>>>
>>> $gnf1m00055_a_at
>>> [1] "NM_029916" "NP_084192"
>>>
>>> $gnf1m00056_a_at
>>> [1] "NM_172283" "NM_181666" "NP_758487" "NP_858052"
>>>
>>
>> If you look up NM_181666 and NM_172283, they are transcripts for the
>> same gene, so one would expect that they will always be included
>> together in the *REFSEQ lookup.  The reason this is important is that,
>> despite the fact that it appears that the third column in your data is
>> being used, it is not.
>>
>> You probably want to look at the otherSrc parameter for specifying
>> additional IDs to map.
>>
>> Sean
>>
>>
>>
>>>> sessionInfo()
>>> R version 2.7.0 (2008-04-22)
>>> i386-apple-darwin8.10.1
>>>
>>> locale:
>>> C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices datasets  tools     utils     methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] test.db_0.9         mouse.db0_2.1.4     AnnotationDbi_1.2.0
>>> [4] RSQLite_0.6-8       DBI_0.2-4           Biobase_2.0.0
>>>
>>>
>>> --
>>> The Wellcome Trust Sanger Institute is operated by Genome Research
>>> Limited, a charity registered in England with number 1021457 and a
>>> company registered in England with number 2742969, whose registered
>>> office is 215 Euston Road, London, NW1 2BE.
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>
>
> --
> Cei Abreu-Goodger, PhD
>
> Wellcome Trust Sanger Institute
> Computational and Functional Genomics
> Wellcome Trust Genome Campus
> Hinxton, Cambridge, CB10 1SA, UK
>