[BioC] Problem locating SNP by rsID for SNPlocs.Hsapiens.dbSNP.20120608 package

Hervé Pagès hpages at fhcrc.org
Thu Jan 17 21:53:51 CET 2013


Hi Christina,

On 01/16/2013 10:15 AM, Christina Chaivorapol wrote:
> Thanks for your help Tim and Herve.
>
> It would be very useful to include the SNPs that have a value for the
> chr-pos field even if they have more than 1 CTG line for my purposes
> since I deal with a lot of immune-related genes that tend to be
> difficult to map.  Would it be possible to include these types of SNPs,
> but flag them as having more than 1 CTG line?

So I've included them in version 0.99.9 of SNPlocs.Hsapiens.dbSNP.20120608.
They're not flagged though. Note that there is still a *single*
location on the reference genome reported for those SNPs, because
the other "locations" are reported as ? (question mark) and it seems
fair not to consider ? as a location.
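
As a rough illustration of that rule (not the code actually used to build
the package), here is a minimal base-R sketch. The helpers get_field() and
single_location() are hypothetical names invented for this example, and the
CTG lines are the ones quoted below for rs7775397, with each record rejoined
onto one line and only three of the six shown:

```r
# Three of the CTG lines for rs7775397, one record per string.
ctg_lines <- c(
  "CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=32261252 | NT_007592.15 | ctg-start=32201252 | ctg-end=32201252 | loctype=2 | orient=+",
  "CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_113891.2 | ctg-start=3732030 | ctg-end=3732030 | loctype=2 | orient=+",
  "CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167245.1 | ctg-start=3540499 | ctg-end=3540499 | loctype=2 | orient=+"
)

# Hypothetical helper: extract the value of one "key=value" field per line.
# Returns NA for a line that does not carry the field exactly once.
get_field <- function(lines, key) {
  prefix <- paste0(key, "=")
  vapply(strsplit(lines, " | ", fixed = TRUE), function(fields) {
    hit <- fields[startsWith(fields, prefix)]
    if (length(hit) == 1L) sub("^[^=]*=", "", hit) else NA_character_
  }, character(1))
}

# Hypothetical helper: keep the SNP only if exactly one CTG line on the
# reference assembly gives an actual chr-pos value ("?" does not count).
# Returns that position, or NA if there are zero or several such lines.
single_location <- function(lines, assembly = "GRCh37.p5") {
  on_ref <- lines[which(get_field(lines, "assembly") == assembly)]
  pos <- get_field(on_ref, "chr-pos")
  known <- pos[!is.na(pos) & pos != "?"]
  if (length(known) == 1L) as.integer(known) else NA_integer_
}

single_location(ctg_lines)       # 32261252
single_location(ctg_lines[2:3])  # NA: no line provides a chr-pos value
```

Under the previous rule the SNP would have been dropped as soon as more
than 1 line matched the reference assembly, whatever the chr-pos values.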

With this new version of the package:

   > library(SNPlocs.Hsapiens.dbSNP.20120608)
   > sum(getSNPcount())
   [1] 45697775

that is, 281064 more SNPs (i.e. 0.6%) compared to the previous version
(i.e. 0.99.8). rs7775397 is one of them now:

   > rsidsToGRanges("rs7775397")
   GRanges with 1 range and 2 metadata columns:
         seqnames               ranges strand |   RefSNP_id alleles_as_ambig
            <Rle>            <IRanges>  <Rle> | <character>      <character>
     [1]      ch6 [32261252, 32261252]      + |     7775397                K
     ---
     seqlengths:
            ch1       ch2       ch3       ch4 ...       chX       chY      chMT
      249250621 243199373 198022430 191154276 ... 155270560  59373566     16569
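
The figures quoted above (281064 more SNPs, about 0.6%) can be checked
with a few lines of plain R. This is only a worked version of the
arithmetic using the two sum(getSNPcount()) totals reported in this thread;
the variable names are made up for the example:

```r
old_count <- 45416711  # total reported in this thread for version 0.99.8
new_count <- 45697775  # total reported in this thread for version 0.99.9

added <- new_count - old_count
added                              # 281064
round(100 * added / old_count, 1)  # 0.6 (percent increase)
```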

SNPlocs.Hsapiens.dbSNP.20120608 version 0.99.9 will be available in
Bioc devel (requires devel version of R i.e. R 3.0) thru biocLite() in
about 45 min. Only the source package for now, which you should be
able to install on Windows or Mac with biocLite( , type="source").

Let me know if you have questions about this.

Cheers,
H.

>
> Thanks for your help,
> Christina
>
>
> On Tue, Jan 15, 2013 at 11:00 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>     Hi Christina,
>
>     According to the official announcement:
>
>
>     http://www.ncbi.nlm.nih.gov/mailman/pipermail/dbsnp-announce/2012q2/000122.html
>
>     there are 53,558,214 rs ids in dbSNP 137 for Human.
>
>     But in SNPlocs.Hsapiens.dbSNP.20120608:
>
>        > library(SNPlocs.Hsapiens.dbSNP.20120608)
>        > sum(getSNPcount())
>        [1] 45416711
>
>     As explained in ?SNPlocs.Hsapiens.dbSNP.20120608, the package (like
>     all other SNPlocs packages) was curated:
>
>           SNPs from dbSNP were filtered to keep only those satisfying the 3
>           following criteria:
>
>              • The SNP is a single-base substitution i.e. its type is "snp".
>                Other types used by dbSNP are: "in-del", "mixed",
>                "microsatellite", "named-locus",
>                "multinucleotide-polymorphism", etc... All those SNPs were
>                dropped.
>
>              • The SNP is marked as notwithdrawn.
>
>              • A *single* location on the reference genome (GRCh37.p5) is
>                reported for the SNP, and this location is on chromosomes
>                1-22, X, Y, MT.
>
>     In the case of rs7775397, it was dropped because of this last reason.
>     More precisely, the record in ds_flat_ch6.flat for this SNP contains
>     the following CTG lines:
>
>     CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=32261252 | NT_007592.15 |
>     ctg-start=32201252 | ctg-end=32201252 | loctype=2 | orient=+
>     CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_113891.2 |
>     ctg-start=3732030 | ctg-end=3732030 | loctype=2 | orient=+
>     CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167245.1 |
>     ctg-start=3540499 | ctg-end=3540499 | loctype=2 | orient=+
>     CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167246.1 |
>     ctg-start=3604088 | ctg-end=3604088 | loctype=2 | orient=+
>     CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167248.1 |
>     ctg-start=3522471 | ctg-end=3522471 | loctype=2 | orient=+
>     CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167249.1 |
>     ctg-start=3609047 | ctg-end=3609047 | loctype=2 | orient=+
>
>     That is, more than 1 CTG line corresponding to the reference assembly
>     (GRCh37.p5). This is the reason why the SNP was dropped.
>
>     I realize now that maybe I could keep those SNPs that have more than
>     1 CTG line corresponding to the reference assembly as long as exactly
>     1 of them actually provides a value for the chr-pos field. Would that
>     be reasonable?
>
>     Thanks,
>     H.
>
>
>
>     On 01/15/2013 05:19 PM, Christina Chaivorapol wrote:
>
>         Hi,
>
>         Has anyone ever had a case where a SNP was not found in
>         SNPlocs.Hsapiens.dbSNP.20120608, but is found in dbSNP 137?
>         I am having this problem with SNP rs7775397.
>
>             library(SNPlocs.Hsapiens.dbSNP.20120608)
>             rsidsToGRanges('rs7775397')
>
>         Error in .snpid2rowidx(x, snpid) : SNP id(s) not found: 7775397
>
>         Thanks,
>         Christina
>
>             sessionInfo()
>
>         R version 2.15.2 (2012-10-26)
>         Platform: x86_64-unknown-linux-gnu (64-bit)
>
>         locale:
>            [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>            [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>            [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>            [7] LC_PAPER=C                 LC_NAME=C
>            [9] LC_ADDRESS=C               LC_TELEPHONE=C
>         [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
>         attached base packages:
>         [1] datasets  utils     grDevices graphics  stats     methods   base
>
>         other attached packages:
>         [1] SNPlocs.Hsapiens.dbSNP.20120608_0.99.8
>         [2] BSgenome_1.26.1
>         [3] Biostrings_2.26.2
>         [4] GenomicRanges_1.10.5
>         [5] IRanges_1.16.4
>         [6] BiocGenerics_0.4.0
>
>         loaded via a namespace (and not attached):
>         [1] parallel_2.15.2 stats4_2.15.2
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org
>     Phone: (206) 667-5791
>     Fax: (206) 667-1319
>
>
>
>
> --
> Christina Chaivorapol, Ph.D.
> Genentech, Inc.
> Bioinformatics & Computational Biology
> phone: 650-225-6903
> chrichai at gene.com

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319