[BioC] SQLForge and probes that map to multiple genes

Fri Jul 18 02:10:52 CEST 2008

Thanks Marc, that does clarify things.
I completely agree with you about the ambiguous mapping problem, and  
that ignoring probesets that may stick to multiple different genes is  
probably the way to go.
However, it is exceedingly difficult to determine when this is the
case. For instance, if you trawl through the latest transcript.csv
files for, say, the HuGene 1.0 ST array, each transcript cluster ID is
annotated to map to many different things. Most of the time, these are
just the different annotation databases' names for the 'same gene', e.g.
RefSeq, ENSEMBL, ... In the rare cases, though, these heterogeneous
identifiers from heterogeneous databases will be referring to
different genes. The problem then is identifying these cases.
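
One rough way to flag such cases is to collapse every accession for a
transcript cluster down to a single gene-level ID and count how many
distinct IDs remain. The sketch below is only an illustration under
assumptions: it assumes the file has columns named
transcript_cluster_id and gene_assignment, that assignments are
separated by "///" with fields separated by "//", and that the fifth
field is a gene-level ID. The sample rows and helper names are made up
for the example.

```python
import csv
import io

# Made-up sample rows; the column names and the "//" and "///" layout
# are assumptions about the annotation file, not a verified spec.
SAMPLE = '''# comment lines such as this one are skipped
transcript_cluster_id,gene_assignment
7900001,"NM_000001 // GENEA // gene A // 1p36 // 101 /// ENST000001 // GENEA // gene A // 1p36 // 101"
7900002,"NM_000002 // GENEB // gene B // 2q11 // 202 /// NM_000003 // GENEC // gene C // 2q11 // 303"
7900003,---
'''


def read_rows(handle):
    """Yield one dict per data row, skipping lines that start with '#'."""
    return csv.DictReader(line for line in handle if not line.startswith("#"))


def genes_per_cluster(rows, id_col="transcript_cluster_id",
                      ga_col="gene_assignment"):
    """Map each transcript cluster ID to its set of distinct gene IDs."""
    result = {}
    for row in rows:
        genes = set()
        assignment = (row.get(ga_col) or "").strip()
        if assignment and assignment != "---":
            for entry in assignment.split("///"):
                fields = [f.strip() for f in entry.split("//")]
                # Assumed layout: accession, symbol, title, cytoband, gene ID
                if len(fields) >= 5 and fields[4] not in ("", "---"):
                    genes.add(fields[4])
        result[row[id_col]] = genes
    return result


def ambiguous_clusters(per_cluster):
    """Keep only clusters annotated with more than one distinct gene ID."""
    return {cid: sorted(genes)
            for cid, genes in per_cluster.items() if len(genes) > 1}


per_cluster = genes_per_cluster(read_rows(io.StringIO(SAMPLE)))
print(ambiguous_clusters(per_cluster))
```

Here the first cluster has two accessions that share one gene ID, so it
counts as unambiguous, while the second carries two different gene IDs
and gets flagged. Whether "different gene ID" really means "different
gene" is of course the judgment call in question.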

I'm just making an AffyGenePDInfoPkg for the HuGene and MoGene arrays,
so I'll see how I go there.

cheers,
Mark

On 17/07/2008, at 3:11 AM, Marc Carlson wrote:

> Mark Cowley wrote:
>> Hi Marc, Sean and list.
>>
>> If I can follow up on Marc's comment:
>> "The thing that has me scratching my head is why you would want to  
>> map multiple genes onto a single probe in your annotation package?"
>>
>> The genomics annotation problem (what does this ProbeSet detect,  
>> and which ProbeSets detect my gene of interest) is inherently many  
>> to many, that is, one ProbeSet can map to many 'genes' (or at least  
>> many different accessions that point to the same gene), and that 1  
>> 'gene' can map to multiple ProbeSets (perhaps different isoforms).
>>
>> Does SQLForge handle these inevitable situations nicely?
>> Having read the SQLForge PDF documentation, and this post, it seems  
>> that you can only provide at most 2 accessions for each ProbeSet,  
>> perhaps a RefSeq accession, and if that is not known, a GenBank  
>> accession.
>>
>> If this has been discussed elsewhere, can someone please point me  
>> in the right direction?
>>
>> Cheers,
>>
>> Mark
>> -----------------------------------------------------
>> Mark Cowley, BSc (Bioinformatics)(Hons)
>>
>> Peter Wills Bioinformatics Centre
>> Garvan Institute of Medical Research, Sydney, Australia
>> -----------------------------------------------------
>> On 15/07/2008, at 6:57 AM, Marc Carlson wrote:
>>
> Hi Mark,
>
> In its current form, SQLForge takes as many IDs as you want to give  
> it, but it assumes that you only intended to assign one  
> kind of gene to a given probe at a time.  That is, it assumes that  
> when you made the probe that you really only meant to measure one  
> thing.  It is well understood by all of us who make annotation  
> packages that in practice this may not always work out as you  
> intended.  But what was confusing me was why you would want to deal  
> with ambiguous probes by creating an ambiguous database?  It seems  
> to me that it might really be better to just not make a gene  
> assignment if you really don't know what your probe is measuring.   
> If a probe is known to be sticking to more than one thing, then the  
> interpretation of any measurement from that probe really becomes  
> very speculative since you will have no way of knowing what  
> proportion of the signal belongs to what.  I agree with Sean that in  
> the rare case like this you will really want to look at a recent  
> blast alignment for your mystery probe.  But since a case like that  
> really is (ultimately) a mystery probe, I feel quite hesitant to  
> assign multiple identities to it inside of an annotation package...
>
> Just for the sake of clarification, it is not the case that SQLForge  
> will only take two kinds of IDs at a time for mapping.  One of the  
> parameters (otherSrc) takes a vector of filenames so you can pass  
> several different mappings into that parameter at once if desired.   
> Many major ID types are supported as a way to tell SQLForge what  
> gene to assign, but once it has an assignment it will then go and  
> get all the data for the database from public sources.  So all your  
> mapping files are just a hook to let SQLForge find the rest of the  
> information.  In most cases, your initial mapping will probably be  
> complete enough to render the extra data that is passed into the  
> otherSrc parameter as redundant.
>
> I hope this clarifies things,
>
> Marc