[BioC] GENEID is missing when LOCATION is non-intergenic in VariantAnnotation package
Adaikalavan Ramasamy
adaikalavan.ramasamy at gmail.com
Wed Feb 20 17:39:10 CET 2013
Dear Valerie, thank you again for your patience and the explanations.
Regards, Adai
On Tue, Feb 19, 2013 at 8:55 PM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
> Yes, evidently C1orf105 is not a 'known gene'. All Bioconductor annotations
> can be found here,
>
> http://bioconductor.org/packages/devel/BiocViews.html#___AnnotationData
>
> The TxDb packages found on this page were made from specific tracks at UCSC
> or elsewhere. The details of how a package was made can be found on the
> package man page,
>
> ?TxDb.Hsapiens.UCSC.hg19.knownGene
>
> If none of the TxDb's meet your need you can create your own. In the
> GenomicFeatures package there are several functions available for creating
> custom annotations.
>
> ?makeTranscriptDbFromBiomart
> ?makeTranscriptDbFromGFF
> ?makeTranscriptDbFromUCSC
>
> I've also cc'd Marc, who handles our annotations, in case he has something
> more to add.
>
> Valerie
>
>
>
> On 02/19/2013 10:46 AM, Adaikalavan Ramasamy wrote:
>>
>> Dear Valerie,
>>
>> Thanks once again for the quick reply. I understand there the
>> different the annotation databases are not consistent and I am trying
>> to understand the reason for some of the noise. If I look up
>> chr1:172410869-172411762 (i.e. res[9, ]) on ucsc and also ensembl, I
>> see that it overlaps with C1orf105 and PIGC
>>
>>
>> http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr1:172410869-172411762&hgsid=326905233&pubs=pack
>>
>> http://www.ensembl.org/Homo_sapiens/Location/View?r=1%3A172410869-172411762
>>
>> so why does the GeneID is blank in this case? On the other hand res[8,
>> ] calls it PIGC only. Is it because C1orf105 is not a "known gene"?
>> BTW, the examples I sent were all non-intergenics. Thank you.
>>
>> Regards, Adai
>>
>>
>>
>> On Tue, Feb 19, 2013 at 5:21 PM, Valerie Obenchain <vobencha at fhcrc.org>
>> wrote:
>>>
>>> Hello,
>>>
>>> All values in the output (GENEID, TXID, etc.) are taken from the
>>> annotation
>>> you are using. If the annotaion does not have a GENEID, TXID, etc. for a
>>> particular range, then none will be reported.
>>>
>>> To take a closer look at the annotation we can extract the transcripts
>>> and
>>> ask for gene_id, cds_id and tx_id as columns.
>>>
>>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>> tx <- transcripts(txdb, columns=c("tx_id", "gene_id", "cds_id"))
>>>
>>>
>>> Looking at the first 3 rows, we see none of these have a gene_id,
>>>
>>>>> tx[1:3]
>>>>
>>>>
>>>> GRanges with 3 ranges and 3 metadata columns:
>>>> seqnames ranges strand | tx_id
>>>> gene_id
>>>> <Rle> <IRanges> <Rle> | <integer>
>>>> <CompressedCharacterList>
>>>> [1] chr1 [11874, 14409] + | 1
>>>> [2] chr1 [11874, 14409] + | 2
>>>> [3] chr1 [11874, 14409] + | 3
>>>> cds_id
>>>> <CompressedIntegerList>
>>>> [1] NA
>>>> [2] 1,2,3
>>>> [3] NA
>>>
>>>
>>>
>>> To isolate the portion of the annotation that overlaps with your ranges
>>> you
>>> can use subsetByOverlaps(),
>>>
>>> gr <- GRanges(c("chr1", "chr1", "chr2", "chr17", "chrX"),
>>> IRanges(c(23803138, 172410967, 60890373,
>>> 44061025, 153627145), width=1))
>>> res <- subsetByOverlaps(tx, gr)
>>>
>>> Here is an example of a transcript with no gene_id
>>>
>>>>> res[8:10]
>>>>
>>>>
>>>> GRanges with 3 ranges and 3 metadata columns:
>>>> seqnames ranges strand | tx_id
>>>> <Rle> <IRanges> <Rle> | <integer>
>>>> [1] chr1 [172410597, 172413230] - | 6937
>>>> [2] chr1 [172410869, 172411762] - | 6938
>>>> [3] chr17 [ 43971748, 44105699] + | 59914
>>>> gene_id cds_id
>>>> <CompressedCharacterList> <CompressedIntegerList>
>>>> [1] 5279 NA,20697
>>>> [2] 20697
>>>> [3] 4137 NA,178155,178156,...
>>>
>>>
>>>
>>>
>>> Also remember that if a range is 'intergenic' that it will not have a
>>> GENEID. It will have a PRECEDEID and FOLLOWID, but no GENEID.
>>>
>>> Valerie
>>>
>>>
>>>
>>>
>>> On 02/19/2013 06:41 AM, Adaikalavan Ramasamy wrote:
>>>>
>>>>
>>>> Dear all,
>>>>
>>>> I am finding some unexpected results (to me anyway) with the
>>>> VariantAnnotation package. Basically, there are situations where the
>>>> GENEID is missing when LOCATION is either coding, promoter, intron,
>>>> threeUTR or fiveUTR. Here is an example with five SNPs (among many
>>>> more). I have marked the unexpected results with "##".
>>>>
>>>>
>>>> library(VariantAnnotation); library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>>
>>>> tmp <- rbind.data.frame(c("rs10917388", "chr1", 23803138),
>>>> c("rs1063412", "chr1", 172410967),
>>>> c("rs78291220", "chr2", 60890373),
>>>> c("rs116917239", "chr17", 44061025),
>>>> c("rs11593", "chrX", 153627145)
>>>> )
>>>> colnames(tmp) <- c("rsid", "chr", "pos")
>>>> tmp$pos <- as.numeric( as.character(tmp$pos) )
>>>>
>>>> target <- with(tmp, GRanges(seqnames = Rle(chr),
>>>> ranges = IRanges(pos,
>>>> end=pos, names=rsid),
>>>> strand = Rle(strand("*"))
>>>> ) )
>>>>
>>>> loc <- locateVariants(target, TxDb.Hsapiens.UCSC.hg19.knownGene,
>>>> AllVariants())
>>>> names(loc) <- NULL
>>>> out <- as.data.frame(loc)
>>>> out$rsid <- names(target)[ out$QUERYID ]
>>>> out <- out[ , c("rsid", "seqnames", "start", "LOCATION", "GENEID",
>>>> "PRECEDEID", "FOLLOWID")]
>>>> out <- unique(out)
>>>> rownames(out) <- NULL
>>>> out
>>>>
>>>> rsid seqnames start LOCATION GENEID PRECEDEID
>>>> FOLLOWID
>>>> 1 rs10917388 chr1 23803138 intron 55616 <NA> <NA>
>>>> 2 rs10917388 chr1 23803138 promoter <NA> <NA> <NA>
>>>> ##
>>>>
>>>> 3 rs1063412 chr1 172410967 intron 92346 <NA> <NA>
>>>> 4 rs1063412 chr1 172410967 intron 5279 <NA> <NA>
>>>> 5 rs1063412 chr1 172410967 coding 5279 <NA> <NA>
>>>> 6 rs1063412 chr1 172410967 coding <NA> <NA> <NA>
>>>> ##
>>>>
>>>> 7 rs78291220 chr2 60890373 promoter <NA> <NA> <NA>
>>>> ##
>>>> 8 rs78291220 chr2 60890373 intergenic <NA> 64895 400957
>>>>
>>>> 9 rs116917239 chr17 44061025 coding 4137 <NA> <NA>
>>>> 10 rs116917239 chr17 44061025 intron 4137 <NA> <NA>
>>>> 11 rs116917239 chr17 44061025 coding <NA> <NA> <NA>
>>>> ##
>>>>
>>>> 12 rs11593 chrX 153627145 intron 6134 <NA> <NA>
>>>> 13 rs11593 chrX 153627145 promoter 6134 <NA> <NA>
>>>> 14 rs11593 chrX 153627145 promoter 26778 <NA> <NA>
>>>> 15 rs11593 chrX 153627145 promoter <NA> <NA> <NA>
>>>> ##
>>>> 16 rs11593 chrX 153627145 fiveUTR <NA> <NA> <NA>
>>>> ##
>>>> 17 rs11593 chrX 153627145 threeUTR <NA> <NA> <NA>
>>>> ##
>>>>
>>>> Can anyone help explain what is happening please? Is this to be
>>>> expected? Thank you.
>>>>
>>>> Regards, Adai
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>
More information about the Bioconductor
mailing list