[BioC] Problem locating SNP by rsID for SNPlocs.Hsapiens.dbSNP.20120608 package

Hervé Pagès hpages at fhcrc.org
Wed Jan 16 08:00:56 CET 2013


Hi Christina,

According to the official announcement:

 
http://www.ncbi.nlm.nih.gov/mailman/pipermail/dbsnp-announce/2012q2/000122.html

there are 53,558,214 rs ids in dbSNP 137 for Human.

But in SNPlocs.Hsapiens.dbSNP.20120608:

   > library(SNPlocs.Hsapiens.dbSNP.20120608)
   > sum(getSNPcount())
   [1] 45416711
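
To put the two counts side by side, here is a small arithmetic sketch. It uses only the numbers quoted above and assumes each SNP counted by getSNPcount() corresponds to one rs id; the variable names are just for illustration.

```r
# Counts quoted above; no package is needed for this arithmetic.
n_dbsnp   <- 53558214   # rs ids announced for dbSNP 137 (Human)
n_snplocs <- 45416711   # sum(getSNPcount()) in SNPlocs.Hsapiens.dbSNP.20120608

# rs ids present in dbSNP 137 but absent from the package
n_dropped <- n_dbsnp - n_snplocs
n_dropped                              # 8141503

# Share of the announced rs ids that are absent, in percent
round(100 * n_dropped / n_dbsnp, 1)
```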

As explained in ?SNPlocs.Hsapiens.dbSNP.20120608, the package (like
all other SNPlocs packages) was curated:

      SNPs from dbSNP were filtered to keep only those satisfying the 3
      following criteria:

         • The SNP is a single-base substitution i.e. its type is "snp".
           Other types used by dbSNP are: "in-del", "mixed",
           "microsatellite", "named-locus",
           "multinucleotide-polymorphism", etc... All those SNPs were
           dropped.

         • The SNP is marked as notwithdrawn.

         • A *single* location on the reference genome (GRCh37.p5) is
           reported for the SNP, and this location is on chromosomes
           1-22, X, Y, MT.
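
To make the three criteria concrete, here is a rough base-R sketch of such a filter applied to a toy table. The data frame, its column names, and the rs ids in it are all made up for illustration; this is not the actual code or schema used to build the package.

```r
# Toy records; column names and values are hypothetical, not the dbSNP schema.
snps <- data.frame(
  rsid      = c("rsA", "rsB", "rsC", "rsD", "rsE"),
  type      = c("snp", "in-del", "snp", "snp", "snp"),
  withdrawn = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  n_ref_loc = c(1L, 1L, 1L, 6L, 1L),   # locations reported on the reference
  chrom     = c("6", "6", "X", "6", "Un"),
  stringsAsFactors = FALSE
)

allowed_chroms <- c(as.character(1:22), "X", "Y", "MT")

keep <- snps$type == "snp" &               # criterion 1: single-base substitution
        !snps$withdrawn &                  # criterion 2: not withdrawn
        snps$n_ref_loc == 1L &             # criterion 3: a single location...
        snps$chrom %in% allowed_chroms     # ...on chromosomes 1-22, X, Y, MT

snps$rsid[keep]   # "rsA"
```

In this toy table, rsD plays the role of a SNP like rs7775397: it passes the first two criteria but reports more than one location on the reference.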

rs7775397 was dropped for this last reason.
More precisely, the record in ds_flat_ch6.flat for this SNP contains
the following CTG lines:

CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=32261252 | NT_007592.15 | ctg-start=32201252 | ctg-end=32201252 | loctype=2 | orient=+
CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_113891.2 | ctg-start=3732030 | ctg-end=3732030 | loctype=2 | orient=+
CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167245.1 | ctg-start=3540499 | ctg-end=3540499 | loctype=2 | orient=+
CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167246.1 | ctg-start=3604088 | ctg-end=3604088 | loctype=2 | orient=+
CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167248.1 | ctg-start=3522471 | ctg-end=3522471 | loctype=2 | orient=+
CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167249.1 | ctg-start=3609047 | ctg-end=3609047 | loctype=2 | orient=+

That is, the record contains more than 1 CTG line corresponding to the
reference assembly (GRCh37.p5), which is why the SNP was dropped.

I realize now that maybe I could keep those SNPs that have more than
1 CTG line corresponding to the reference assembly as long as exactly
1 of them actually provides a value for the chr-pos field. Would that
be reasonable?
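
To illustrate the difference between the current rule and the relaxed one, here is a minimal base-R sketch run on the six CTG lines shown above. The helper get_field() and the variable names are hypothetical, written only for this example; this is not the code used to build the SNPlocs packages.

```r
# CTG lines for rs7775397, copied from the record above (one per string).
ctg <- c(
  "CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=32261252 | NT_007592.15 | ctg-start=32201252 | ctg-end=32201252 | loctype=2 | orient=+",
  "CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_113891.2 | ctg-start=3732030 | ctg-end=3732030 | loctype=2 | orient=+",
  "CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167245.1 | ctg-start=3540499 | ctg-end=3540499 | loctype=2 | orient=+",
  "CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167246.1 | ctg-start=3604088 | ctg-end=3604088 | loctype=2 | orient=+",
  "CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167248.1 | ctg-start=3522471 | ctg-end=3522471 | loctype=2 | orient=+",
  "CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167249.1 | ctg-start=3609047 | ctg-end=3609047 | loctype=2 | orient=+"
)

# Hypothetical helper: extract the value of a "key=value" field from each line.
get_field <- function(lines, key) {
  fields <- strsplit(lines, " | ", fixed = TRUE)
  prefix <- paste0(key, "=")
  vapply(fields, function(f) {
    hit <- f[startsWith(f, prefix)]
    if (length(hit) == 1L) substring(hit, nchar(prefix) + 1L) else NA_character_
  }, character(1))
}

on_ref  <- get_field(ctg, "assembly") == "GRCh37.p5"
chr_pos <- get_field(ctg, "chr-pos")
has_pos <- on_ref & chr_pos != "?"

sum(on_ref)    # 6: more than one reference line, so the current rule drops the SNP
sum(has_pos)   # 1: exactly one line gives chr-pos, so the relaxed rule would keep it

# Position that the relaxed rule would report
if (sum(has_pos) == 1L) as.integer(chr_pos[has_pos])   # 32261252
```

Under the relaxed rule, rs7775397 would be reported at position 32261252 on chromosome 6.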

Thanks,
H.


On 01/15/2013 05:19 PM, Christina Chaivorapol wrote:
> Hi,
>
> Has anyone ever had a case where a SNP was not found in
> SNPlocs.Hsapiens.dbSNP.20120608, but is found in dbSNP 137?  I am having
> this problem with SNP rs7775397.
>
>> library(SNPlocs.Hsapiens.dbSNP.20120608)
>> rsidsToGRanges('rs7775397')
> Error in .snpid2rowidx(x, snpid) : SNP id(s) not found: 7775397
>
> Thanks,
> Christina
>
>> sessionInfo()
> R version 2.15.2 (2012-10-26)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] datasets  utils     grDevices graphics  stats     methods   base
>
> other attached packages:
> [1] SNPlocs.Hsapiens.dbSNP.20120608_0.99.8
> [2] BSgenome_1.26.1
> [3] Biostrings_2.26.2
> [4] GenomicRanges_1.10.5
> [5] IRanges_1.16.4
> [6] BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
> [1] parallel_2.15.2 stats4_2.15.2
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


