[BioC] VariantAnnotation and distance to genes

Valerie Obenchain vobencha at fhcrc.org
Mon May 13 18:05:34 CEST 2013


Hello,

In the devel branch I've added a distance() method that takes a GRanges 
and TxDb for this exact purpose. An example is at the bottom of the 
?locateVariants man page, I've pasted the main steps below.

If you're using devel, you may want to give the new distance() a try. If 
in release then Jim's solution should work for you.

Valerie

## start example from man page:

   ## Using the 'vcf' and 'txdb' from ?locateVariants man page:
   region <- IntergenicVariants(upstream=70000, downstream=70000)
   loc_int <- locateVariants(vcf, txdb, region)

   ## Each variant can have multiple flanking id's so first
   ## expand PRECEDEID and the corresponding variant ranges.
   p_ids <- unlist(loc_int$PRECEDEID, use.names=FALSE)
   exp_ranges <- rep(loc_int,  elementLengths(loc_int$PRECEDEID))

   ## Provide the variant GRanges as 'x' and the TxDb annotation as 'y'
   ## to distance(). The 'id' and 'type' arguments describe an id found
   ## in the TxDb. See ?'nearest-methods' for details on the method.
   p_dist <- distance(exp_ranges, txdb, id=p_ids, type="gene")
   head(p_dist)

   ## Expanded view of ranges, gene id and distance:
   exp_ranges$PRECEDE_DIST <- p_dist
   exp_ranges

   ## Collapsed view of ranges, gene id and distance:
   loc_int$PRECEDE_DIST <- relist(p_dist, loc_int$PRECEDEID)
   loc_int

## end example


On 05/13/2013 08:48 AM, James W. MacDonald wrote:
> Hi Gad,
>
> On 5/12/2013 11:41 PM, Gad Abraham wrote:
>> Hi,
>>
>> I'm using VariantAnnotation, which is a very nice package, to annotate
>> some
>> SNPs, based on
>> http://adairama.wordpress.com/2013/02/15/functionally-annotate-snps-and-indels-in-bioconductor
>>
>> .
>>
>> For SNPs annotated as intergenic and therefore falling between two genes
>> (PRECEDEID and FOLLOWID), is there a way of getting the actual distance
>> between the SNPs and these two genes?
>>
>> I'm using:
>>
>> head(m, 3)
>>           RS  Chr      Pos
>> 1 rs2187668 chr6 32713862
>> 2 rs9357152 chr6 32772938
>> 3 rs3099844 chr6 31556955
>>
>> target<- with(m,
>>        GRanges(seqnames=Rle(Chr),
>>           ranges=IRanges(Pos, end=Pos, names=RS),
>>           strand=Rle(strand("*"))
>>        )
>> )
>>
>> loc<- locateVariants(target, TxDb.Hsapiens.UCSC.hg18.knownGene,
>> AllVariants())
>> names(loc)<- NULL
>> out<- as.data.frame(loc)
>>
>> head(out, 3)
>>        names seqnames    start      end   LOCATION GENEID PRECEDEID
>> FOLLOWID
>> GeneSymbol PrecedeSymbol FollowSymbol
>> 1 rs2187668     chr6 32713862 32713862     intron   3117<NA>
>> <NA>    HLA-DQA1<NA>          <NA>
>> 4 rs9357152     chr6 32772938 32772938 intergenic<NA>       3119
>> 3117<NA>       HLA-DQB1     HLA-DQA1
>> 5 rs3099844     chr6 31556955 31556955 intergenic<NA>       4277
>> 352961<NA>           MICB        HCG26
>>
>> I'd like to get the distance from rs9357152 to HLA-DQB1 and to HLA-DQA1.
>
> There might be a far superior way to do this, but one way is to assume
> that cds == genes and go from there:
>
>  > gns <- cds(TxDb.Hsapiens.UCSC.hg19.knownGene,
> columns=c("cds_id","gene_id"))
>  > ind <- sapply(values(gns)$gene_id, function(x) any(x %in%
> c("3117","3119")))
>  > sum(ind)
> [1] 28
>  > gns[ind,]
> GRanges with 28 ranges and 2 metadata columns:
>              seqnames               ranges strand   |    cds_id gene_id
> <Rle> <IRanges> <Rle>   | <integer> <CharacterList>
>     [1]          chr6 [32605236, 32605317]      +   | 73476            3117
>     [2]          chr6 [32609087, 32609335]      +   | 73477            3117
>     [3]          chr6 [32609749, 32610030]      +   | 73478            3117
>     [4]          chr6 [32609749, 32610164]      +   | 73479            3117
>     [5]          chr6 [32610387, 32610541]      +   | 73480            3117
>     ...           ...                  ...    ... ... ...             ...
>    [24] chr6_qbl_hap6   [3860872, 3860982]      -   | 233860
> 3119
>    [25] chr6_qbl_hap6   [3861407, 3861780]      -   | 233861
> 3119
>    [26] chr6_qbl_hap6   [3861499, 3861780]      -   | 233862
> 3119
>    [27] chr6_qbl_hap6   [3864670, 3864939]      -   | 233863
> 3119
>    [28] chr6_qbl_hap6   [3866378, 3866486]      -   | 233864
> 3119
>
> Note that these two genes also map to the extra Chr6 haplotypes.
>
>  > z <- cbind(as.data.frame(gns[ind,]), distance(GRanges("chr6",
> IRanges(start=32772938, end=32772938)), gns[ind,]))[,c(1,2,4,6,7,8)]
> Warning message:
> In .local(x, y, ...) :
>    The behavior of distance() has changed in Bioconductor 2.12. See
> ?distance for details.
>  > colnames(z) <- c(colnames(z)[1:5], "Distance")
>  > z
>          seqnames    start width cds_id gene_id Distance
> 1           chr6 32605236    82  73476    3117   167620
> 2           chr6 32609087   249  73477    3117   163602
> 3           chr6 32609749   282  73478    3117   162907
> 4           chr6 32609749   416  73479    3117   162773
> 5           chr6 32610387   155  73480    3117   162396
> 6           chr6 32628013    14  78928    3119   144911
> 7           chr6 32628636    22  78929    3119   144280
> 8           chr6 32629124   111  78930    3119   143703
> 9           chr6 32629125   110  78931    3119   143703
> 10          chr6 32629744   282  78932    3119   142912
> 11          chr6 32629951    44  78933    3119   142943
> 12          chr6 32632575   244  78934    3119   140119
> 13          chr6 32632575   270  78935    3119   140093
> 14          chr6 32634276   109  78936    3119   138553
> 15 chr6_cox_hap2  4073284    14 227891    3119       NA
> 16 chr6_cox_hap2  4074344   185 227892    3119       NA
> 17 chr6_cox_hap2  4074418   111 227893    3119       NA
> 18 chr6_cox_hap2  4074953   374 227894    3119       NA
> 19 chr6_cox_hap2  4075045   282 227895    3119       NA
> 20 chr6_cox_hap2  4078216   270 227896    3119       NA
> 21 chr6_cox_hap2  4079924   109 227897    3119       NA
> 22 chr6_qbl_hap6  3859738    14 233858    3119       NA
> 23 chr6_qbl_hap6  3860798   185 233859    3119       NA
> 24 chr6_qbl_hap6  3860872   111 233860    3119       NA
> 25 chr6_qbl_hap6  3861407   374 233861    3119       NA
> 26 chr6_qbl_hap6  3861499   282 233862    3119       NA
> 27 chr6_qbl_hap6  3864670   270 233863    3119       NA
> 28 chr6_qbl_hap6  3866378   109 233864    3119       NA
>
> Best,
>
> Jim
>
>
>
>
>>
>> Thanks,
>> Gad
>>
>>     [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list