[BioC] SQLForge and probes that map to multiple genes

Sean Davis sdavis2 at mail.nih.gov
Tue Jul 15 19:01:45 CEST 2008

On Mon, Jul 14, 2008 at 8:42 PM, Mark Cowley <m.cowley0 at gmail.com> wrote:
> Hi Marc, Sean and list.
> If I can follow up on Marc's comment:
> "The thing that has me scratching my head is why you would want to map
> multiple genes onto a single probe in your annotation package?"
> The genomics annotation problem (what does this ProbeSet detect, and which
> ProbeSets detect my gene of interest) is inherently many to many, that is,
> one ProbeSet can map to many 'genes' (or at least many different accessions
> that point to the same gene), and that 1 'gene' can map to multiple
> ProbeSets (perhaps different isoforms).

This is true, but the extent to which it needs to be "modeled" is up
to the user.  Our approach is to do everything based on probe
(differential expression, etc.) and, for those probes that look VERY
interesting but have unclear annotation, BLAST them against all known
transcript databases for hints as to what they represent.  I do not
think the vast majority of probes/probesets need this special
treatment on a daily basis.

> Does SQLforge handle these inevitable situations nicely?

It doesn't sound like SQLForge will handle the situation that you
describe.  I would suggest a custom SQL database for your mappings.
Of course, that will not be useful as an annotation package, but
algorithms that use the annotation packages generally cannot handle
many-to-many mappings anyway.
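
A minimal sketch of what I mean by a custom SQL database, assuming the
DBI and RSQLite packages are installed.  The table name probe2refseq
and its column names are made up for illustration; the rows are taken
from Cei's two example files.  It stores one row per (probe, accession)
pair, so the mapping can be queried in either direction:

```r
# Sketch only: "probe2refseq", "probe_id" and "refseq" are hypothetical
# names chosen for this example, not part of any annotation package.
library(DBI)

# In-memory SQLite database; use a file path instead to keep it on disk.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# One row per (probe, accession) pair, so a probe may appear many times
# and an accession may appear many times.
probe2refseq <- data.frame(
  probe_id = c("gnf1m00050_at", "gnf1m00050_at",
               "gnf1m00053_a_at", "gnf1m00056_a_at"),
  refseq   = c("NM_008929", "NM_172283", "NM_181666", "NM_181666"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "probe2refseq", probe2refseq)

# All accessions recorded for one probe (one probe -> many accessions).
acc_for_probe <- dbGetQuery(con,
  "SELECT refseq FROM probe2refseq
   WHERE probe_id = 'gnf1m00050_at' ORDER BY refseq")$refseq

# All probes recorded for one accession (one accession -> many probes).
probes_for_acc <- dbGetQuery(con,
  "SELECT probe_id FROM probe2refseq
   WHERE refseq = 'NM_181666' ORDER BY probe_id")$probe_id

dbDisconnect(con)
```

Both results are ordinary character vectors, so they can be used
directly to subset a probe-level results table.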

Hope that helps, at least practically speaking.


> On 15/07/2008, at 6:57 AM, Marc Carlson wrote:
>> Sean Davis wrote:
>>> On Mon, Jul 14, 2008 at 12:07 PM, Cei Abreu-Goodger <cei at sanger.ac.uk>
>>> wrote:
>>>> Hi Sean,
>>>> Ok, so my example was even worse than I thought. And I had forgotten to
>>>> mention that the otherSrc parameter wasn't what I needed. So, to return
>>>> to my bad example, I now have two separate files, the first column in
>>>> the first file, the second in the second file:
>>>>> refseqs <- "gnf1m.test.tab"
>>>>> refseqs2 <- "gnf1m.test2.tab"
>>>>> read.table(refseqs)
>>>>                V1        V2
>>>> 1   gnf1m00050_at NM_008929
>>>> 2 gnf1m00051_a_at NM_007487
>>>> 3 gnf1m00052_a_at NM_178939
>>>> 4 gnf1m00053_a_at NM_181666
>>>> 5 gnf1m00054_a_at NM_026430
>>>> 6 gnf1m00055_a_at NM_029916
>>>> 7 gnf1m00056_a_at NM_181666
>>>>> read.table(refseqs2)
>>>>                V1        V2
>>>> 1   gnf1m00050_at NM_172283
>>>> 2 gnf1m00051_a_at NM_172283
>>>> 3 gnf1m00052_a_at NM_172283
>>>> 4 gnf1m00053_a_at NM_172283
>>>> 5 gnf1m00054_a_at NM_172283
>>>> 6 gnf1m00055_a_at NM_172283
>>>> 7 gnf1m00056_a_at NM_172283
>>>> I now add the second file as an otherSrc:
>>>>> makeMOUSECHIP_DB(affy=FALSE, prefix="test", fileName=refseqs,
>>>>                  baseMapType="refseq", otherSrc=c(refseqs2),
>>>>                  outputDir=".", version="0.9",
>>>>                  manufacturer="GNF-Affymetrix", chipName="gnf1m")
>>>> But this still doesn't add the second gene's annotation to all the probes
>>>> (the resulting package's annotation is exactly the same as in the first
>>>> case). Is there any other way?
>>> I think that the way SQLForge works now, it will only use the
>>> additional annotation if the first ID is not successfully mapped.
>>> (Someone else should probably confirm my assertion about this).  Since
>>> it appears that your first column contains all RefSeq IDs, you will
>>> never get to the second column.  So, in short, I don't know how to
>>> make SQLForge do what you want.
>>> Sean
>> Hi Guys,
>> Sean is correct about the purpose of the otherSrc parameter, and about
>> the way that SQLforge currently works.  The thing that has me scratching my
>> head is why you would want to map multiple genes onto a single probe in your
>> annotation package?
>>  Marc