[BioC] overlap GRangesList and vcf

Valerie Obenchain vobencha at fhcrc.org
Sat Aug 11 22:10:34 CEST 2012


Hi Georg,

The readVcf() function in the VariantAnnotation package reads data from 
a VCF file into a VCF-class object,

library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
 > vcf
class: VCF
dim: 10376 5
genome: hg19
exptData(1): header
fixed(4): REF ALT QUAL FILTER
info(22): LDAF AVGPOST ... VT SNPSOURCE
geno(3): GT DS GL
rownames(10376): rs7410291 rs147922003 ... rs144055359 rs114526001
rowData values names(1): paramRangeID
colnames(5): HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1): Samples

For details of the class slots and data accessors,
?'VCF-class'

Before counting, make sure the chromosome names are consistent between 
the vcf and annotation. Here the annotation precedes chromosome numbers 
with 'chr' and the vcf file does not.
 > intersect(seqlevels(txdb), seqlevels(vcf))
character(0)

  We modify the vcf seqlevels to match the txdb using renameSeqlevels(),
old <- seqlevels(vcf)
new <- paste("chr", old, sep="")
names(new) <- old
vcf <- renameSeqlevels(vcf, new)
 > intersect(seqlevels(txdb), seqlevels(vcf))
[1] "chr22"

The rowData slot contains a GRanges of positions which can be overlapped 
with your exons by genes,
rd <- rowData(vcf)

You wanted a list of exons hit by variants so we will unlist the 
GRangesList. If instead you want the hits grouped by gene, don't 
unlist(exbygn).
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
exbygn <- exonsBy(txdb, "gene")
exons <- unlist(exbygn)

Counting can be done with findOverlaps() or summarizeOverlaps(). See the 
man pages for details on how the counting is performed. 
summarizedOverlaps() is written with read gaps in mind and therefore 
requires the 'reads' argument to be a Bam file or GappedAlignements 
object. Since you are interested in SNPs for this example we'll use 
findOverlaps().
fo <- findOverlaps(exons, rd)

Exons hit by variants are extracted from 'exons' with the 'queryHits' 
column,
exonsHit <- exons[queryHits(fo)]

and the corresponding variants are subset from the VCF-class object with 
the 'subjectHits' column.
vcfHit <- vcf[subjectHits(fo)]

The positions of the SNPs are in the GRanges in the rowData slot,
rowData(vcfHit)


If you wanted your hits grouped by gene, don't unlist(exbygn) before 
counting.

Valerie

On 08/10/12 09:55, Georg Otto wrote:
> Hi,
>
>
> I apologise if this is trivial, but I haven't found a straight forward
> way to do this so far.
>
> Given a GRangesList with exons
>
> ## exons by gene
> range.exon<- exonsBy(txdb.ensgene, "gene")
>
> and a vcf file with SNP data (positions, alleles, etc), how can I
> generate a list of exons that contain SNPs (along with SNP positions)?
>
> Any hint will be appreciated!
>
> Georg
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list