[BioC] GENEID is missing when LOCATION is non-intergenic in VariantAnnotation package

Tue Feb 19 21:55:58 CET 2013

Yes, evidently C1orf105 is not a 'known gene'. All Bioconductor 
annotations can be found here,

http://bioconductor.org/packages/devel/BiocViews.html#___AnnotationData

The TxDb packages found on this page were made from specific tracks at 
UCSC or elsewhere. The details of how a package was made can be found on 
the package man page,

?TxDb.Hsapiens.UCSC.hg19.knownGene

If none of the TxDb's meet your need you can create your own. In the 
GenomicFeatures package there are several functions available for 
creating custom annotations.

?makeTranscriptDbFromBiomart
?makeTranscriptDbFromGFF
?makeTranscriptDbFromUCSC

I've also cc'd Marc, who handles our annotations, in case he has 
something more to add.

Valerie

On 02/19/2013 10:46 AM, Adaikalavan Ramasamy wrote:
> Dear Valerie,
>
> Thanks once again for the quick reply. I understand there the
> different the annotation databases are not consistent and I am trying
> to understand the reason for some of the noise. If I look up
> chr1:172410869-172411762 (i.e. res[9, ]) on ucsc and also ensembl, I
> see that it overlaps with C1orf105 and PIGC
>
> http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr1:172410869-172411762&hgsid=326905233&pubs=pack
> http://www.ensembl.org/Homo_sapiens/Location/View?r=1%3A172410869-172411762
>
> so why does the GeneID is blank in this case? On the other hand res[8,
> ] calls it PIGC only. Is it because C1orf105 is not a "known gene"?
> BTW, the examples I sent were all non-intergenics. Thank you.
>
> Regards, Adai
>
>
>
> On Tue, Feb 19, 2013 at 5:21 PM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
>> Hello,
>>
>> All values in the output (GENEID, TXID, etc.) are taken from the annotation
>> you are using. If the annotaion does not have a GENEID, TXID, etc. for a
>> particular range, then none will be reported.
>>
>> To take a closer look at the annotation we can extract the transcripts and
>> ask for gene_id, cds_id and tx_id as columns.
>>
>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>> tx <- transcripts(txdb, columns=c("tx_id", "gene_id", "cds_id"))
>>
>>
>> Looking at the first 3 rows, we see none of these have a gene_id,
>>
>>>> tx[1:3]
>>>
>>> GRanges with 3 ranges and 3 metadata columns:
>>>        seqnames         ranges strand |     tx_id                   gene_id
>>>           <Rle>      <IRanges>  <Rle> | <integer> <CompressedCharacterList>
>>>    [1]     chr1 [11874, 14409]      + |         1
>>>    [2]     chr1 [11874, 14409]      + |         2
>>>    [3]     chr1 [11874, 14409]      + |         3
>>>                         cds_id
>>>        <CompressedIntegerList>
>>>    [1]                      NA
>>>    [2]                   1,2,3
>>>    [3]                      NA
>>
>>
>> To isolate the portion of the annotation that overlaps with your ranges you
>> can use subsetByOverlaps(),
>>
>> gr <- GRanges(c("chr1", "chr1", "chr2", "chr17", "chrX"),
>>                IRanges(c(23803138, 172410967, 60890373,
>>                        44061025, 153627145), width=1))
>> res <- subsetByOverlaps(tx, gr)
>>
>> Here is an example of a transcript with no gene_id
>>
>>>> res[8:10]
>>>
>>> GRanges with 3 ranges and 3 metadata columns:
>>>        seqnames                 ranges strand |     tx_id
>>>           <Rle>              <IRanges>  <Rle> | <integer>
>>>    [1]     chr1 [172410597, 172413230]      - |      6937
>>>    [2]     chr1 [172410869, 172411762]      - |      6938
>>>    [3]    chr17 [ 43971748,  44105699]      + |     59914
>>>                          gene_id                  cds_id
>>>        <CompressedCharacterList> <CompressedIntegerList>
>>>    [1]                      5279                NA,20697
>>>    [2]                                             20697
>>>    [3]                      4137    NA,178155,178156,...
>>
>>
>>
>> Also remember that if a range is 'intergenic' that it will not have a
>> GENEID. It will have a PRECEDEID and FOLLOWID, but no GENEID.
>>
>> Valerie
>>
>>
>>
>>
>> On 02/19/2013 06:41 AM, Adaikalavan Ramasamy wrote:
>>>
>>> Dear all,
>>>
>>> I am finding some unexpected results (to me anyway) with the
>>> VariantAnnotation package. Basically, there are situations where the
>>> GENEID is missing when LOCATION is either coding, promoter, intron,
>>> threeUTR or fiveUTR. Here is an example with five SNPs (among many
>>> more). I have marked the unexpected results with "##".
>>>
>>>
>>> library(VariantAnnotation); library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>
>>> tmp <- rbind.data.frame(c("rs10917388",  "chr1",  23803138),
>>>                                     c("rs1063412",   "chr1", 172410967),
>>>                                     c("rs78291220",  "chr2",  60890373),
>>>                                     c("rs116917239", "chr17", 44061025),
>>>                                     c("rs11593",     "chrX",  153627145) )
>>> colnames(tmp) <- c("rsid", "chr", "pos")
>>> tmp$pos <- as.numeric( as.character(tmp$pos) )
>>>
>>> target <- with(tmp, GRanges(seqnames = Rle(chr),
>>>                                             ranges   = IRanges(pos,
>>> end=pos, names=rsid),
>>>                                             strand   = Rle(strand("*")) ) )
>>>
>>> loc <- locateVariants(target, TxDb.Hsapiens.UCSC.hg19.knownGene,
>>> AllVariants())
>>> names(loc) <- NULL
>>> out <- as.data.frame(loc)
>>> out$rsid <- names(target)[ out$QUERYID ]
>>> out <- out[ , c("rsid", "seqnames", "start", "LOCATION", "GENEID",
>>> "PRECEDEID", "FOLLOWID")]
>>> out <- unique(out)
>>> rownames(out) <- NULL
>>> out
>>>
>>>              rsid  seqnames     start   LOCATION GENEID PRECEDEID FOLLOWID
>>> 1   rs10917388     chr1  23803138     intron  55616      <NA>     <NA>
>>> 2   rs10917388     chr1  23803138   promoter   <NA>      <NA>     <NA> ##
>>>
>>> 3    rs1063412     chr1 172410967     intron  92346      <NA>     <NA>
>>> 4    rs1063412     chr1 172410967     intron   5279      <NA>     <NA>
>>> 5    rs1063412     chr1 172410967     coding   5279      <NA>     <NA>
>>> 6    rs1063412     chr1 172410967     coding   <NA>      <NA>     <NA> ##
>>>
>>> 7   rs78291220     chr2  60890373   promoter   <NA>      <NA>     <NA> ##
>>> 8   rs78291220     chr2  60890373 intergenic   <NA>     64895   400957
>>>
>>> 9  rs116917239    chr17  44061025     coding   4137      <NA>     <NA>
>>> 10 rs116917239    chr17  44061025     intron   4137      <NA>     <NA>
>>> 11 rs116917239    chr17  44061025     coding   <NA>      <NA>     <NA> ##
>>>
>>> 12     rs11593     chrX 153627145     intron   6134      <NA>     <NA>
>>> 13     rs11593     chrX 153627145   promoter   6134      <NA>     <NA>
>>> 14     rs11593     chrX 153627145   promoter  26778      <NA>     <NA>
>>> 15     rs11593     chrX 153627145   promoter   <NA>      <NA>     <NA> ##
>>> 16     rs11593     chrX 153627145    fiveUTR    <NA>      <NA>     <NA> ##
>>> 17     rs11593     chrX 153627145   threeUTR   <NA>      <NA>     <NA> ##
>>>
>>> Can anyone help explain what is happening please? Is this to be
>>> expected? Thank you.
>>>
>>> Regards, Adai
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>