[BioC] Complementing alleles in a VCF

Alex Gutteridge alexg at ruggedtextile.com
Thu Apr 19 11:39:12 CEST 2012


Hi All,

I'm having some difficulty with VCF handling and feeding 1000 genome 
VCFs into predictCoding from the VariantAnnotation package. Can anyone 
help?

When I read 1000 genomes VCFs the SNPs come through with no strand 
specified e.g (where vcf.file and param are a 1000 genomes VCF file and 
a GRanges respectively):

> vcf = readVcf(vcf.file, "hg19", param)
> fixed(vcf)
GRanges with 2610 ranges and 5 elementMetadata cols:
               seqnames                 ranges strand   | paramRangeID
                  <Rle>              <IRanges>  <Rle>   |     <factor>
   rs186838828     chr2 [167051756, 167051756]      *   |        SCN9A
   rs191667986     chr2 [167051865, 167051865]      *   |        SCN9A
   rs139483482     chr2 [167051900, 167051900]      *   |        SCN9A
   rs182687583     chr2 [167052080, 167052080]      *   |        SCN9A
   rs115766730     chr2 [167052144, 167052144]      *   |        SCN9A
   rs186613025     chr2 [167052168, 167052168]      *   |        SCN9A
    rs73017538     chr2 [167052328, 167052328]      *   |        SCN9A
   rs114327563     chr2 [167052375, 167052375]      *   |        SCN9A
   rs191401619     chr2 [167052418, 167052418]      *   |        SCN9A
           ...      ...                    ...    ... ...          ...
    rs73025590     chr2 [167231750, 167231750]      *   |        SCN9A
   rs141453198     chr2 [167231812, 167231812]      *   |        SCN9A
    rs16852069     chr2 [167231890, 167231890]      *   |        SCN9A
   rs181276399     chr2 [167231932, 167231932]      *   |        SCN9A
   rs185839773     chr2 [167232251, 167232251]      *   |        SCN9A
   rs191091185     chr2 [167232439, 167232439]      *   |        SCN9A
   rs148362057     chr2 [167232446, 167232446]      *   |        SCN9A
   rs141521157     chr2 [167232450, 167232450]      *   |        SCN9A
     rs1881440     chr2 [167232463, 167232463]      *   |        SCN9A
                          REF                ALT      QUAL      FILTER
               <DNAStringSet> <DNAStringSetList> <numeric> <character>
   rs186838828              T           ########       100        PASS
   rs191667986              T           ########       100        PASS
   rs139483482              T           ########       100        PASS
   rs182687583              G           ########       100        PASS
   rs115766730              G           ########       100        PASS
   rs186613025              A           ########       100        PASS
    rs73017538              A           ########       100        PASS
   rs114327563              A           ########       100        PASS
   rs191401619              C           ########       100        PASS
           ...            ...                ...       ...         ...
    rs73025590              A           ########       100        PASS
   rs141453198              C           ########       100        PASS
    rs16852069              A           ########       100        PASS
   rs181276399              G           ########       100        PASS
   rs185839773              A           ########       100        PASS
   rs191091185              C           ########       100        PASS
   rs148362057              A           ########       100        PASS
   rs141521157              A           ########       100        PASS
     rs1881440              C           ########       100        PASS
   ---
   seqlengths:
    chr2
      NA

When I feed these to predictCoding(), genes on the complement strand 
are not dealt with correctly - the ALT alleles given in the VCF are not 
complemented first, so the variant codons are not translated right.

> txdb = TxDb.Hsapiens.UCSC.hg19.knownGene
> aa = predictCoding(vcf.filt, txdb, Hsapiens)

It seems like predictCoding should be able to deal with genes on the 
complement strand, but maybe it requires the strand to be set correctly 
in the VCF? I've tried and failed to find a way of editing the VCF - if 
nothing else to just complement the ALT alleles before running 
predictcoding, but it doesn't seem to work. Though the manual seems to 
suggest it is possible. I didn't find a way to actually set the strand 
directly either:

‘alt(x)’, ‘alt(x) <- value’: Returns or sets the alternate allele data
           from the ALT column of the VCF file. ‘value’ can be a
           ‘DNAStringSet’ or a ‘CharacterList’ (for a structural VCF
           file).

> tmp.alt = complement(unlist(values(alt(vcf))[["ALT"]]))
> tmp.alt
   A DNAStringSet instance of length 2610
        width seq
    [1]     1 T
    [2]     1 G
    [3]     1 G
    [4]     1 T
    [5]     1 T
    [6]     1 C
    [7]     1 C
    [8]     1 C
    [9]     1 C
    ...   ... ...
[2602]     1 A
[2603]     1 T
[2604]     1 C
[2605]     1 T
[2606]     1 C
[2607]     1 C
[2608]     1 C
[2609]     1 C
[2610]     1 T
> alt(vcf) = tmp.alt
Error in function (classes, fdef, mtable)  :
   unable to find an inherited method for function "alt<-", for 
signature "VCF", "DNAStringSet"

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=C                 LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
  [1] RColorBrewer_1.0-5
  [2] caTools_1.12
  [3] bitops_1.0-4.1
  [4] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1
  [5] GenomicFeatures_1.8.1
  [6] org.Hs.eg.db_2.7.1
  [7] RSQLite_0.11.1
  [8] DBI_0.2-5
  [9] AnnotationDbi_1.18.0
[10] Biobase_2.16.0
[11] VariantAnnotation_1.2.5
[12] Rsamtools_1.8.3
[13] BSgenome.Hsapiens.UCSC.hg19_1.3.17
[14] BSgenome_1.24.0
[15] Biostrings_2.24.1
[16] GenomicRanges_1.8.3
[17] IRanges_1.14.2
[18] BiocGenerics_0.2.0

loaded via a namespace (and not attached):
  [1] biomaRt_2.12.0     grid_2.15.0        lattice_0.20-6     
Matrix_1.0-6
  [5] RCurl_1.91-1       rtracklayer_1.16.1 snpStats_1.6.0     
splines_2.15.0
  [9] stats4_2.15.0      survival_2.36-12   tools_2.15.0       XML_3.9-4
[13] zlibbioc_1.2.0

-- 
Alex Gutteridge



More information about the Bioconductor mailing list