[BioC] GENEID is missing when LOCATION is non-intergenic in VariantAnnotation package

Wed Feb 20 17:39:10 CET 2013

Dear Valerie, thank you again for your patience and the explanations.

Regards, Adai

On Tue, Feb 19, 2013 at 8:55 PM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
> Yes, evidently C1orf105 is not a 'known gene'. All Bioconductor annotations
> can be found here,
>
> http://bioconductor.org/packages/devel/BiocViews.html#___AnnotationData
>
> The TxDb packages found on this page were made from specific tracks at UCSC
> or elsewhere. The details of how a package was made can be found on the
> package man page,
>
> ?TxDb.Hsapiens.UCSC.hg19.knownGene
>
> If none of the TxDb's meet your need you can create your own. In the
> GenomicFeatures package there are several functions available for creating
> custom annotations.
>
> ?makeTranscriptDbFromBiomart
> ?makeTranscriptDbFromGFF
> ?makeTranscriptDbFromUCSC
>
> I've also cc'd Marc, who handles our annotations, in case he has something
> more to add.
>
> Valerie
>
>
>
> On 02/19/2013 10:46 AM, Adaikalavan Ramasamy wrote:
>>
>> Dear Valerie,
>>
>> Thanks once again for the quick reply. I understand there the
>> different the annotation databases are not consistent and I am trying
>> to understand the reason for some of the noise. If I look up
>> chr1:172410869-172411762 (i.e. res[9, ]) on ucsc and also ensembl, I
>> see that it overlaps with C1orf105 and PIGC
>>
>>
>> http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr1:172410869-172411762&hgsid=326905233&pubs=pack
>>
>> http://www.ensembl.org/Homo_sapiens/Location/View?r=1%3A172410869-172411762
>>
>> so why does the GeneID is blank in this case? On the other hand res[8,
>> ] calls it PIGC only. Is it because C1orf105 is not a "known gene"?
>> BTW, the examples I sent were all non-intergenics. Thank you.
>>
>> Regards, Adai
>>
>>
>>
>> On Tue, Feb 19, 2013 at 5:21 PM, Valerie Obenchain <vobencha at fhcrc.org>
>> wrote:
>>>
>>> Hello,
>>>
>>> All values in the output (GENEID, TXID, etc.) are taken from the
>>> annotation
>>> you are using. If the annotaion does not have a GENEID, TXID, etc. for a
>>> particular range, then none will be reported.
>>>
>>> To take a closer look at the annotation we can extract the transcripts
>>> and
>>> ask for gene_id, cds_id and tx_id as columns.
>>>
>>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>> tx <- transcripts(txdb, columns=c("tx_id", "gene_id", "cds_id"))
>>>
>>>
>>> Looking at the first 3 rows, we see none of these have a gene_id,
>>>
>>>>> tx[1:3]
>>>>
>>>>
>>>> GRanges with 3 ranges and 3 metadata columns:
>>>>        seqnames         ranges strand |     tx_id
>>>> gene_id
>>>>           <Rle>      <IRanges>  <Rle> | <integer>
>>>> <CompressedCharacterList>
>>>>    [1]     chr1 [11874, 14409]      + |         1
>>>>    [2]     chr1 [11874, 14409]      + |         2
>>>>    [3]     chr1 [11874, 14409]      + |         3
>>>>                         cds_id
>>>>        <CompressedIntegerList>
>>>>    [1]                      NA
>>>>    [2]                   1,2,3
>>>>    [3]                      NA
>>>
>>>
>>>
>>> To isolate the portion of the annotation that overlaps with your ranges
>>> you
>>> can use subsetByOverlaps(),
>>>
>>> gr <- GRanges(c("chr1", "chr1", "chr2", "chr17", "chrX"),
>>>                IRanges(c(23803138, 172410967, 60890373,
>>>                        44061025, 153627145), width=1))
>>> res <- subsetByOverlaps(tx, gr)
>>>
>>> Here is an example of a transcript with no gene_id
>>>
>>>>> res[8:10]
>>>>
>>>>
>>>> GRanges with 3 ranges and 3 metadata columns:
>>>>        seqnames                 ranges strand |     tx_id
>>>>           <Rle>              <IRanges>  <Rle> | <integer>
>>>>    [1]     chr1 [172410597, 172413230]      - |      6937
>>>>    [2]     chr1 [172410869, 172411762]      - |      6938
>>>>    [3]    chr17 [ 43971748,  44105699]      + |     59914
>>>>                          gene_id                  cds_id
>>>>        <CompressedCharacterList> <CompressedIntegerList>
>>>>    [1]                      5279                NA,20697
>>>>    [2]                                             20697
>>>>    [3]                      4137    NA,178155,178156,...
>>>
>>>
>>>
>>>
>>> Also remember that if a range is 'intergenic' that it will not have a
>>> GENEID. It will have a PRECEDEID and FOLLOWID, but no GENEID.
>>>
>>> Valerie
>>>
>>>
>>>
>>>
>>> On 02/19/2013 06:41 AM, Adaikalavan Ramasamy wrote:
>>>>
>>>>
>>>> Dear all,
>>>>
>>>> I am finding some unexpected results (to me anyway) with the
>>>> VariantAnnotation package. Basically, there are situations where the
>>>> GENEID is missing when LOCATION is either coding, promoter, intron,
>>>> threeUTR or fiveUTR. Here is an example with five SNPs (among many
>>>> more). I have marked the unexpected results with "##".
>>>>
>>>>
>>>> library(VariantAnnotation); library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>>
>>>> tmp <- rbind.data.frame(c("rs10917388",  "chr1",  23803138),
>>>>                                     c("rs1063412",   "chr1", 172410967),
>>>>                                     c("rs78291220",  "chr2",  60890373),
>>>>                                     c("rs116917239", "chr17", 44061025),
>>>>                                     c("rs11593",     "chrX",  153627145)
>>>> )
>>>> colnames(tmp) <- c("rsid", "chr", "pos")
>>>> tmp$pos <- as.numeric( as.character(tmp$pos) )
>>>>
>>>> target <- with(tmp, GRanges(seqnames = Rle(chr),
>>>>                                             ranges   = IRanges(pos,
>>>> end=pos, names=rsid),
>>>>                                             strand   = Rle(strand("*"))
>>>> ) )
>>>>
>>>> loc <- locateVariants(target, TxDb.Hsapiens.UCSC.hg19.knownGene,
>>>> AllVariants())
>>>> names(loc) <- NULL
>>>> out <- as.data.frame(loc)
>>>> out$rsid <- names(target)[ out$QUERYID ]
>>>> out <- out[ , c("rsid", "seqnames", "start", "LOCATION", "GENEID",
>>>> "PRECEDEID", "FOLLOWID")]
>>>> out <- unique(out)
>>>> rownames(out) <- NULL
>>>> out
>>>>
>>>>              rsid  seqnames     start   LOCATION GENEID PRECEDEID
>>>> FOLLOWID
>>>> 1   rs10917388     chr1  23803138     intron  55616      <NA>     <NA>
>>>> 2   rs10917388     chr1  23803138   promoter   <NA>      <NA>     <NA>
>>>> ##
>>>>
>>>> 3    rs1063412     chr1 172410967     intron  92346      <NA>     <NA>
>>>> 4    rs1063412     chr1 172410967     intron   5279      <NA>     <NA>
>>>> 5    rs1063412     chr1 172410967     coding   5279      <NA>     <NA>
>>>> 6    rs1063412     chr1 172410967     coding   <NA>      <NA>     <NA>
>>>> ##
>>>>
>>>> 7   rs78291220     chr2  60890373   promoter   <NA>      <NA>     <NA>
>>>> ##
>>>> 8   rs78291220     chr2  60890373 intergenic   <NA>     64895   400957
>>>>
>>>> 9  rs116917239    chr17  44061025     coding   4137      <NA>     <NA>
>>>> 10 rs116917239    chr17  44061025     intron   4137      <NA>     <NA>
>>>> 11 rs116917239    chr17  44061025     coding   <NA>      <NA>     <NA>
>>>> ##
>>>>
>>>> 12     rs11593     chrX 153627145     intron   6134      <NA>     <NA>
>>>> 13     rs11593     chrX 153627145   promoter   6134      <NA>     <NA>
>>>> 14     rs11593     chrX 153627145   promoter  26778      <NA>     <NA>
>>>> 15     rs11593     chrX 153627145   promoter   <NA>      <NA>     <NA>
>>>> ##
>>>> 16     rs11593     chrX 153627145    fiveUTR    <NA>      <NA>     <NA>
>>>> ##
>>>> 17     rs11593     chrX 153627145   threeUTR   <NA>      <NA>     <NA>
>>>> ##
>>>>
>>>> Can anyone help explain what is happening please? Is this to be
>>>> expected? Thank you.
>>>>
>>>> Regards, Adai
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>