[BioC] get chr position for a batch of human SNPs

Thu Sep 29 03:24:16 CEST 2011

Hi Shirley,

On 11-09-22 02:24 PM, Hervé Pagès wrote:
> Hi Shirley,
>
> On 11-09-22 11:29 AM, shirley zhang wrote:
>> Dear All,
>>
>> I am planning to map SNP ids to hg18 positions (chr and position)
>> for a huge list of human SNPs. I've tried the package
>> "SNPlocs.Hsapiens.dbSNP.20090506" and have 2 questions regarding this
>> package:
>>
>> 1. Do the SNPs in this package map to the hg18 genome (NCBI Build 36.3
>> with Group Label "reference" instead of "Celera" or "HuRef")?
>
> Yes, they are mapped to hg18. See:
>
> http://bioconductor.org/packages/release/data/annotation/html/SNPlocs.Hsapiens.dbSNP.20090506.html
>
> and the man page of the package for additional details:
>
>  > library(SNPlocs.Hsapiens.dbSNP.20090506)
>  > ?SNPlocs.Hsapiens.dbSNP.20090506
>
>>
>> 2. If I don't know the chr information (seqname), can I obtain the
>> position with dbSNP Id only?
>
> Unfortunately, because SNPs are stored in one data frame per
> chromosome, if you don't know the chr then you need to load and
> query each data frame individually.
>
> With more recent SNPlocs packages (e.g.
> SNPlocs.Hsapiens.dbSNP.20100427), provision was added to
> let the user load SNPs from more than 1 chromosome in a single
> GRanges object, so you can do something like:
>
> ## Load all the SNPs in a big GRanges object (takes about
> ## 13 minutes and requires 6GB of RAM!):
> all_snps <- getSNPlocs(names(getSNPcount()), as.GRanges=TRUE)
>
> ## Use the rs ids to set the names (takes about 6 minutes):
> names(all_snps) <- paste("rs", elementMetadata(all_snps)$RefSNP_id,
> sep="")
>
> ## Then extract your SNPs from the big GRanges object (again,
> ## this can take a long time, depending on how many SNPs you
> ## extract):
> my_rs_ids <- sample(names(all_snps), 1000)
> my_snps <- all_snps[my_rs_ids]
>
> However, please note that, starting with
> SNPlocs.Hsapiens.dbSNP.20100427 (i.e. dbSNP Build 131),
> SNPs are mapped to GRCh37 (UCSC hg19) instead of hg18.

I've made some improvements to the SNPlocs packages. The improved
packages are in BioC 2.9 (current devel, soon to be released) and
had their versions bumped to 0.99.6. They are:

   SNPlocs.Hsapiens.dbSNP.20090506: dbSNP Build 130, based on hg18
   SNPlocs.Hsapiens.dbSNP.20100427: dbSNP Build 131, based on hg19
   SNPlocs.Hsapiens.dbSNP.20101109: dbSNP Build 132, based on hg19
   SNPlocs.Hsapiens.dbSNP.20110815: dbSNP Build 134, based on hg19

Only the source packages are available at the moment but they should
be installable on Windows/Mac from R-2.14 with

   biocLite(..., type="source").

The data in those packages have not changed but are organized more
efficiently. There is a new function rsidsToGRanges() to extract
SNP information for a set of rs ids. Using rsidsToGRanges() takes
about 1/10 of the memory and is about 40x faster than the code I
provided above. See ?rsidsToGRanges for the details.
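
A minimal sketch of that call, assuming the 0.99.6 version of the
hg18-based package is installed. The rs ids below are placeholders
picked for illustration and may not all be present in this build
(as far as I know, an rs id that cannot be found raises an error):

```r
library(GenomicRanges)
library(SNPlocs.Hsapiens.dbSNP.20090506)

## Placeholder rs ids (hypothetical input, substitute your own list).
my_rs_ids <- c("rs2639606", "rs10932221", "rs3734153")

## One range per supplied rs id, no need to know the chromosome.
my_snps <- rsidsToGRanges(my_rs_ids)

## Chromosome and position for each SNP.
chr <- as.character(seqnames(my_snps))
pos <- start(my_snps)
```

Note that the chromosome names come back in the naming style used by
the package, so they may need adjusting before being compared with
UCSC-style names.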

Let me know if you have questions.

Cheers,
H.

>
> Hope this helps,
> H.
>
>>
>> Further, I find dbSNP batch queries a little more difficult to work
>> with because they map to different versions of hg18 like Celera,
>> HumanRef, etc. Can anybody suggest a better way to get hg18 chr
>> positions from the most widely used or most reliable version of dbSNP?
>>
>> Thanks in advance
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319