[BioC] getting gene ranges from a TranscriptDb (GenomicFeatures package)

Wed Sep 15 04:18:02 CEST 2010

On Tue, Sep 14, 2010 at 7:53 PM, Elizabeth Purdom
<epurdom at stat.berkeley.edu> wrote:
>
>>
>>
>>     Before I go further down this route, is there not a simple way to
>>     get this information?
>>
>>
>> Stating the obvious, probably, but this is not a question with a
>> simple answer.  You may have to settle on some sort of heuristic
>> (choose a random region, the longest region, the one containing the
>> most transcripts, etc.) to sort out such cases or simply ignore them
>> altogether (throw them away).  Unfortunately, these oddities are
>> present in the human genome (and other expanded genomes) and it is not
>> possible to totally avoid them.
>>
> I understand the possible pitfalls (and the biology for why they would
> happen), but at the same time, it is still a pretty standard question
> that many databases give some answer to. Generally, I understand gene
> boundaries as the largest region that contains the transcripts annotated
> to be in that gene. This is the case for Ensembl, I believe (where you
> can definitely query the database and get out gene start and stop points

You could use org.Hs.eg.db for example and acquire the CHRLOC and CHRLOCEND data
for each Gene ID.  Note, however

> mk = mappedkeys(org.Hs.egCHRLOC)
> table(sapply(mget(mk, org.Hs.egCHRLOC), length))

    1     2     3     4     5     6     7     8     9    10    12    13    14
17626  2753   622   196    72    56    65    55    14    20     2     1    11
   15    16    17    18    19    24
    6     5     2     2     3     1

you will need to define a policy for dealing with the multiplicities, e.g.

> which(sapply(mget(mk, org.Hs.egCHRLOC), length)==16)
259197   5696   6891   8859   9374
  6510  14049  16604  20056  20884
> get("5696", org.Hs.egSYMBOL)
[1] "PSMB8"
> get("5696", org.Hs.egCHRLOC)
          6           6  6_apd_hap1  6_apd_hap1  6_cox_hap2  6_cox_hap2
  -32808496   -32808496    -4095502    -4095502    -4253026    -4253026
 6_dbb_hap3  6_dbb_hap3 6_mann_hap4 6_mann_hap4  6_mcf_hap5  6_mcf_hap5
   -4089876    -4089876    -4265694    -4265694    -4145372    -4145372
 6_qbl_hap6  6_qbl_hap6 6_ssto_hap7 6_ssto_hap7
   -4040605    -4040605    -4239268    -4239268
> get("5696", org.Hs.egCHRLOCEND)
          6           6  6_apd_hap1  6_apd_hap1  6_cox_hap2  6_cox_hap2
  -32812712   -32811816    -4099716    -4098820    -4257240    -4256344
 6_dbb_hap3  6_dbb_hap3 6_mann_hap4 6_mann_hap4  6_mcf_hap5  6_mcf_hap5
   -4094090    -4093194    -4269908    -4269012    -4149579    -4148693
 6_qbl_hap6  6_qbl_hap6 6_ssto_hap7 6_ssto_hap7
   -4044819    -4043923    -4243485    -4242589

it is not hard to think of code that will implement the policy you
allude to.  i think it is unwise for
bioconductor core packages to devise implementations of such policies
-- these policies need to be explicitly selected and
justified by the investigators using them

> for a gene id). If there are multiple chromosomes for the same gene id,
> these could be returned as separate entries. I would like to have a
> database query that accomplishes this, like the transcripts and exons
> functions, because my naive implementation (in addition to the problems
> with the multiple chrX and chrY mappings) is extremely slow.

Perhaps it is slow, but it is clear what is being done, and it is a
one-off computation.
You get the GRanges structure you want, and save it until versions
change, and that's that.

Now if we look at the different approaches suggested so far, we see

> library(GenomicFeatures)
> hg19 = makeTranscriptDbFromUCSC("hg19")
Download the knownGene table ... OK
Download the knownToLocusLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TranscriptDb object ... OK
> transcriptsBy(hg19, "gene")$"100"
GRanges with 2 ranges and 2 elementMetadata values
    seqnames               ranges strand |     tx_id     tx_name
       <Rle>            <IRanges>  <Rle> | <integer> <character>
[1]    chr20 [43248164, 43280376]      - |     70158  uc002xmj.2
[2]    chr20 [43248164, 43280376]      - |     70159  uc010ggt.2

seqlengths
                  chr1                  chr2 ... chr18_gl000207_random
             249250621             243199373 ...                  4262
> get("100", org.Hs.egUCSCKG)
[1] "uc002xmj.2" "uc010ggt.2"
> get("100", org.Hs.egCHRLOC)
       20
-43248163
> get("100", org.Hs.egCHRLOCEND)
       20
-43280376

For this example, the transcript table and the org.Hs.egCHRLOC data
seem consistent.  i believe
that programming the gene boundary GRanges that you want from the
org.Hs.eg will be easier than
the ranges manipulations you find slow, and multiplicity handling can
be factored out from range manipulation.

>
> The reason I want this is that I want to pull up all reads from a bam
> file associated with this gene, regardless of the transcript. If I give
> per transcript information, I get reads coming up multiple times,
> because the transcripts from the same gene substantially overlap and the
> reads are pulled up per transcript. This might be useful in some
> setting, but generally I'm going to want to then post-process the reads
> per gene to deal with the multiple transcript problem (yes I know genes
> can overlap too, but this is a problem of smaller magnitude, and I think
> I can deal with that issue separately). Moreover, the number of genes is
> much smaller than the number of transcripts, and I want to reduce the
> number of queries of the bam file, as well as the number of reads I pull
> into memory at a time.
>
> Thanks,
> Elizabeth
>
>
>
>
>
>        [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>