[BioC] Bioc Interface for 1,000 Genomes SNP Frequencies

Valerie Obenchain vobencha at fhcrc.org
Tue Jul 17 17:36:47 CEST 2012


Hi Tim,

1000 Genomes data can be read in with readVcf(). The 'param' argument 
allows you to specify chromosomes and/or genome positions you are 
interested in
Using a sample file from 1000 genomes,

   library(VariantAnnotation)
   fl <- 
"ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/AFR.2of4intersection_allele_freq.20100804.genotypes.vcf.gz"
   genomes <- readVcf(TabixFile(fl), "hg19", param=GRanges("12", 
IRanges(101266000, 101422000)))

The data are read into a VCF obejct. See ?VCF for class details. If a 
snp has an rsid, it will appear as the rowname. If not, the rowname is a 
concatenation of chromosome:position.

 > genomes
class: VCF
dim: 1325 174
genome: hg19
exptData(1): header
fixed(4): REF ALT QUAL FILTER
info(6): AF DP ... AFR_R2 ASN_R2
geno(7): AD DP ... GD OG
rownames(1325): rs75462666 rs78597949 ... 12:101421722 rs67962959
rowData values names(1): paramRangeID
colnames(174): HG00553 HG00554 ... NA19982 NA20414
colData names(1): Samples

We don't have a function that computes the frequencies but that can be 
accomplished fairly easily. The genotype data can be accessed with the 
'geno' function.

 > geno(genomes)
SimpleList of length 7
names(7): AD DP GL GQ GT GD OG

 > geno(genomes)$GT[1:5,1:3]
              HG00553 HG00554 HG00637
rs75462666   "0|0"   "0|0"   "0|0"
rs78597949   "0|0"   "0|0"   "0|0"
rs78693793   "0|0"   "0|0"   "0|0"
12:101266725 "0|0"   "0|0"   "0|0"
12:101266729 "0|0"   "0|0"   "0|0"

The frequency of the alternate allele is computed as  [ 0|1 + 2*(1|1)] / 
2*(total number of individuals).  Maybe something like,

   heterozyg <- apply(geno(genomes)$GT, 2, function(x) x %in% c("0|1", 
"1|0"))
   homozyg <- apply(geno(genomes)$GT, 2, function(x) x %in% "1|1")
   nsub <- ncol(geno(genomes)$GT)
    freq <- (heterozyg + 2*homozyg) / 2*nsub

Valerie

On 07/17/2012 08:09 AM, Vincent Carey wrote:
> Val Obenchain can reply more definitively, but, briefly, VariantAnnotation
> package has a scanVcf capability that
> allows survey of vcf archives with managed memory consumption.  If you do
> not see enough details in the vignette
> to solve this use case efficiently, let us know.
>
> On Tue, Jul 17, 2012 at 10:53 AM, Timothy Duff<timothy.duff at ncf.edu>  wrote:
>
>> Hi. I have a set of rsIDs for which I'm interested in obtaining allele
>> frequencies for (in a particular population surveyed by 1kGP.) The
>> corresponding .vcf files are pretty memory consumptive. Do there currently
>> exist basic BC tools that could help with this sort of thing? Thank you all
>> for your consideration.
>>
>> --
>> Tim
>>
>>          [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list