[BioC] GENEID is missing when LOCATION is non-intergenic in VariantAnnotation package

Tue Feb 19 18:21:32 CET 2013

Hello,

All values in the output (GENEID, TXID, etc.) are taken from the 
annotation you are using. If the annotaion does not have a GENEID, TXID, 
etc. for a particular range, then none will be reported.

To take a closer look at the annotation we can extract the transcripts 
and ask for gene_id, cds_id and tx_id as columns.

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
tx <- transcripts(txdb, columns=c("tx_id", "gene_id", "cds_id"))

Looking at the first 3 rows, we see none of these have a gene_id,

>> tx[1:3]
> GRanges with 3 ranges and 3 metadata columns:
>       seqnames         ranges strand |     tx_id                   gene_id
>          <Rle>      <IRanges>  <Rle> | <integer> <CompressedCharacterList>
>   [1]     chr1 [11874, 14409]      + |         1
>   [2]     chr1 [11874, 14409]      + |         2
>   [3]     chr1 [11874, 14409]      + |         3
>                        cds_id
>       <CompressedIntegerList>
>   [1]                      NA
>   [2]                   1,2,3
>   [3]                      NA

To isolate the portion of the annotation that overlaps with your ranges 
you can use subsetByOverlaps(),

gr <- GRanges(c("chr1", "chr1", "chr2", "chr17", "chrX"),
               IRanges(c(23803138, 172410967, 60890373,
                       44061025, 153627145), width=1))
res <- subsetByOverlaps(tx, gr)

Here is an example of a transcript with no gene_id

>> res[8:10]
> GRanges with 3 ranges and 3 metadata columns:
>       seqnames                 ranges strand |     tx_id
>          <Rle>              <IRanges>  <Rle> | <integer>
>   [1]     chr1 [172410597, 172413230]      - |      6937
>   [2]     chr1 [172410869, 172411762]      - |      6938
>   [3]    chr17 [ 43971748,  44105699]      + |     59914
>                         gene_id                  cds_id
>       <CompressedCharacterList> <CompressedIntegerList>
>   [1]                      5279                NA,20697
>   [2]                                             20697
>   [3]                      4137    NA,178155,178156,...

Also remember that if a range is 'intergenic' that it will not have a 
GENEID. It will have a PRECEDEID and FOLLOWID, but no GENEID.

Valerie

On 02/19/2013 06:41 AM, Adaikalavan Ramasamy wrote:
> Dear all,
>
> I am finding some unexpected results (to me anyway) with the
> VariantAnnotation package. Basically, there are situations where the
> GENEID is missing when LOCATION is either coding, promoter, intron,
> threeUTR or fiveUTR. Here is an example with five SNPs (among many
> more). I have marked the unexpected results with "##".
>
>
> library(VariantAnnotation); library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>
> tmp <- rbind.data.frame(c("rs10917388",  "chr1",  23803138),
>                                    c("rs1063412",   "chr1", 172410967),
>                                    c("rs78291220",  "chr2",  60890373),
>                                    c("rs116917239", "chr17", 44061025),
>                                    c("rs11593",     "chrX",  153627145) )
> colnames(tmp) <- c("rsid", "chr", "pos")
> tmp$pos <- as.numeric( as.character(tmp$pos) )
>
> target <- with(tmp, GRanges(seqnames = Rle(chr),
>                                            ranges   = IRanges(pos,
> end=pos, names=rsid),
>                                            strand   = Rle(strand("*")) ) )
>
> loc <- locateVariants(target, TxDb.Hsapiens.UCSC.hg19.knownGene, AllVariants())
> names(loc) <- NULL
> out <- as.data.frame(loc)
> out$rsid <- names(target)[ out$QUERYID ]
> out <- out[ , c("rsid", "seqnames", "start", "LOCATION", "GENEID",
> "PRECEDEID", "FOLLOWID")]
> out <- unique(out)
> rownames(out) <- NULL
> out
>
>             rsid  seqnames     start   LOCATION GENEID PRECEDEID FOLLOWID
> 1   rs10917388     chr1  23803138     intron  55616      <NA>     <NA>
> 2   rs10917388     chr1  23803138   promoter   <NA>      <NA>     <NA> ##
>
> 3    rs1063412     chr1 172410967     intron  92346      <NA>     <NA>
> 4    rs1063412     chr1 172410967     intron   5279      <NA>     <NA>
> 5    rs1063412     chr1 172410967     coding   5279      <NA>     <NA>
> 6    rs1063412     chr1 172410967     coding   <NA>      <NA>     <NA> ##
>
> 7   rs78291220     chr2  60890373   promoter   <NA>      <NA>     <NA> ##
> 8   rs78291220     chr2  60890373 intergenic   <NA>     64895   400957
>
> 9  rs116917239    chr17  44061025     coding   4137      <NA>     <NA>
> 10 rs116917239    chr17  44061025     intron   4137      <NA>     <NA>
> 11 rs116917239    chr17  44061025     coding   <NA>      <NA>     <NA> ##
>
> 12     rs11593     chrX 153627145     intron   6134      <NA>     <NA>
> 13     rs11593     chrX 153627145   promoter   6134      <NA>     <NA>
> 14     rs11593     chrX 153627145   promoter  26778      <NA>     <NA>
> 15     rs11593     chrX 153627145   promoter   <NA>      <NA>     <NA> ##
> 16     rs11593     chrX 153627145    fiveUTR    <NA>      <NA>     <NA> ##
> 17     rs11593     chrX 153627145   threeUTR   <NA>      <NA>     <NA> ##
>
> Can anyone help explain what is happening please? Is this to be
> expected? Thank you.
>
> Regards, Adai
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>