[BioC] How to get position for each gene ID/gene symbol instead of position for each transcript

shirley zhang shirley0818 at gmail.com
Wed Aug 25 03:07:06 CEST 2010


Hi Marc,

Thanks a lot for your detailed reply. It is very helpful.

I tried library "GenomicFeatures" and your code. It works great. I did
get position for each Transcript.  However I am only interested in
obtaining position for each gene ID/gene symbol instead of position
for each transcript. For example, if one gene has multiple
transcripts, I only want the biggest boundary of each gene which
includes all of its transcripts. Does GenomicFeatures or other package
can do this? Or I have to compute the gene position by combining all
of its corresponding transcripts?

Thanks again for your suggestions.
Shirley

On Thu, Aug 19, 2010 at 2:50 PM, Marc Carlson <mcarlson at fhcrc.org> wrote:
> Hi Shirly,
>
> That is a very good question!  Looking at the latest builds of the
> annotations (which is what we normally would recommend), the manual page
> will now tell you that the source for the latest CHRLOC and CHRLOCEND
> information is from here:
>
>     Mappings were based on data provided by: UCSC Genome
>     Bioinformatics (Homo sapiens)
>     ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19
>
> That means that this information corresponds to the more recent "hg19"
> build which in turn will correspond to "build 37".  So if you want older
> builds you could use older versions of the org packages, but I since I
> would never in good conscience recommend using stale annotations, I
> think that in that case you would be much better off to do the initial
> steps I showed you (to get the entrez gene IDs), and THEN using the
> GenomicFeatures package, which should allow you to get positions for the
> hg18 build mapped onto entrez gene IDs.
>
> You can see a vignette for GenomicFeatures here:
>
> http://www.bioconductor.org/packages/release/bioc/html/GenomicFeatures.html
>
> You would basically just need to use makeTranscriptDbFromUCSC(),
> serialize the result for safekeeping, and then call transcripts() on the
> object it gave you.  Once you had done so, you could simply ask for the
>
> library(GenomicFeatures)
> foo = makeTranscriptDbFromUCSC(genome="hg18",tablename="knownGene")
> bar = transcriptsBy(foo, by="gene")
> ## At this point you will have  GRanges object which is really useful
> for a lot of great stuff,
> ## but since you are only interested in matching the gene_ids to the
> positions,
> ## you can just pare it down to a much simpler object like so:
>
> frm = as.data.frame(bar)
>
> ## The values in "element" will be the names of the elements from the
> GRangesList object bar.
> ## If you check them you will see that for this UCSC track, they are the
> entrez gene IDs that you want to merge with.
>
> Hope this helps,
>
>
>  Marc
>
>
>
> On 08/18/2010 06:26 PM, shirley zhang wrote:
>> Hi Marc,
>>
>> Thanks a lot for your information. It is very very helpful. May I ask
>> one more question?
>>
>> In my case,  the snp_position is based on Human Build 36.  The package
>> org.Hs.eg.db was updated in October 2009, so can I assume it is on
>> Build 36?
>>
>> Thanks again,
>> Shirely
>>
>> On Wed, Aug 18, 2010 at 12:56 PM, Marc Carlson <mcarlson at fhcrc.org> wrote:
>>
>>> Hi Shirley,
>>>
>>> You can find this information in an organism package.  org.Hs.eg.db for
>>> humans (for example).
>>>
>>> What will be harder will be getting all the information from the mixed
>>> bag of IDs that you describe.  But you can still retrieve it for each ID
>>> type separately like this example for refseq IDs:
>>>
>>> ## So for refseq you can start by getting the entrez gene ID
>>> library(org.Hs.eg.db)
>>> a = c("NM_000014", "NM_000015", "foo")
>>> gene_id = unlist2(mget(a, revmap(org.Hs.egREFSEQ), ifnotfound=NA))
>>> gene_id = gene_id[!is.na(gene_id)]
>>>
>>> ## And then once you have the IDs, you can map them to a position
>>> chrloc = toTable(org.Hs.egCHRLOC[gene_id])
>>> chrlocend = toTable(org.Hs.egCHRLOCEND[gene_id])
>>>
>>> ## And now because we were careful to make sure that we always have an
>>> entrez gene ID,
>>> ## you can just merge all the results together to make this easier to
>>> look at:
>>> merge(merge(data.frame(gene_id=gene_id, refseq=names(gene_id)), chrloc),
>>> chrlocend)
>>>
>>>
>>> ## This process can then be repeated for other kinds of IDs etc:
>>> b = c("Hs.100217", "Hs.100299")
>>> gene_id = unlist2(mget(b, revmap(org.Hs.egUNIGENE), ifnotfound=NA))
>>> gene_id = gene_id[!is.na(gene_id)]
>>>
>>> chrloc = toTable(org.Hs.egCHRLOC[gene_id])
>>> chrlocend = toTable(org.Hs.egCHRLOCEND[gene_id])
>>> ## etc.
>>>
>>> Does that help?
>>>
>>>
>>>  Marc
>>>
>>>
>>>
>>>
>>>
>>> On 08/17/2010 07:26 PM, shirley zhang wrote:
>>>
>>>> Dear list,
>>>>
>>>> I have a list of human SNPs and genes/transcripts which are obtained
>>>> from different studies.  I would like to calculate the distance
>>>> between these SNPs and genes/transcripts on the same chromosome.
>>>> However, the ID for the gene/transcript across studies  is not
>>>> consistent. For example, the ID could be Gene Symbol/Alias (TRIM5l),
>>>> RNA nucleotide accession (NM_013387), XM_173221 (its gene symbol is
>>>> LOC254266, but its record is an obsolete version), Contig22780_RC,
>>>> HSS00001378, Hs.445401 (UniGene entry Hs.445401), etc.
>>>>
>>>> I would like to retrieve the start and end position for those
>>>> gene/transcript. Can anybody help me how to get the position
>>>> information based on these different gene/transcript IDs. If I also
>>>> need to match all of other IDs to gene symbol/entrezID, what kind of
>>>> database or bioConductor package will allow me to do the mapping as
>>>> perfect as possible? How about biomaRt package?
>>>>
>>>> Thanks
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>
>>
>>
>
>



More information about the Bioconductor mailing list