[BioC] pulling functional information for SNPs

James W. MacDonald jmacdon at med.umich.edu
Tue May 4 20:56:41 CEST 2010


Hi Kay,

Please don't take things off-list. The list archives are intended to be 
a useful repository of questions that were asked and answered, and if we 
go off-list, that function is invalidated.


Kay Jaja wrote:
> Hi James,
>  
> Thank you very much for the response, I have tried to use BiomaRt and 
> about 700,000 SNPs on the Affy 6.0 chip to get the chromosome and the 
> base pair position but the script is taking forever (more than 24 hours) 
> , I have tried a test on a small set of SNPs and it works. below is my 
> script

Getting a query for 700K things will likely take a long time. Had you 
mentioned that you were using the Affy 6.0 chip, we could have gone in a 
different direction.

biocLite("pd.genomewidesnp.6") ## this will take a while
library(pd.genomewidesnp.6)
con <- pd.genomewidesnp.6 at getdb()
## there might be a better way of setting up the connection...
## If so, Benilton will correct me very soon ;-P

## check things out
 > dbListTables(con)
[1] "featureSet"    "featureSetCNV" "pmfeature"     "pmfeatureCNV"
[5] "sequence"      "sequenceCNV"   "sqlite_stat1"  "table_info"
 > dbListFields(con, "featureSet")
  [1] "fsetid"           "man_fsetid"       "affy_snp_id" 
"dbsnp_rs_id"
  [5] "chrom"            "physical_pos"     "strand" 
"cytoband"
  [9] "allele_a"         "allele_b"         "gene_assoc" 
"fragment_length"
[13] "fragment_length2" "dbsnp"            "cnv"

## try a 70K query - I won't show how I made the snp vector...

 > length(snps)
[1] 70000
 > system.time(dbGetQuery(con, paste("select dbsnp_rs_id, chrom, 
physical_pos from featureSet where dbsnp_rs_id in ('", paste(snps, 
collapse = "','"), "');", sep = "")))
    user  system elapsed
    4.89    1.09  119.09

So about 2 min for 70K query. Not bad.

 > dbGetQuery(con, "select dbsnp_rs_id, chrom, physical_pos from 
featureSet limit 10;")
    dbsnp_rs_id chrom physical_pos
1    rs2887286     1      1145994
2    rs1496555     1      2224111
3   rs41477744     1      2319424
4    rs3890745     1      2543484
5   rs10492936     1      2926730
6   rs10489588     1      2941694
7    rs2376495     1      3084986
8    rs4648462     1      3155127
9   rs10492939     1      3292731
10   rs9424283     1      3695086

Best,

Jim



>  
> ########################################################
> # snps is a vector of  about 700,000 rs numbers from the Affy 6.0 #####
> ######################################################
>  library(biomaRt)
>  mart <- useMart("snp")
>  mart<-useDataset("hsapiens_snp",mart)
> Checking attributes and filters ... ok
>  
>  out=getBM(attributes=("refsnp_id","chr_name","chrom_start"),
> + filters=c("refsnp_id"), values=snps, mart=mart)
>  
> ############################################
>  
> is there a reason why you prefer to use RMySQL over biomaRt?
> Do you know why my script is taking too long?
>  
> Thanks for your help,
> 
> ------------------------------------------------------------------------
> *From:* James W. MacDonald <jmacdon at med.umich.edu>
> *To:* Kay Jaja <kjaja27 at yahoo.com>
> *Cc:* bioconductor at stat.math.ethz.ch
> *Sent:* Wed, April 28, 2010 1:42:28 PM
> *Subject:* Re: [BioC] pulling functional information for SNPs
> 
> Hi Kay,
> 
> Kay Jaja wrote:
>  > Hi ,
>  >
>  > I have a list of SNPS (rs numbers ) and I am interested in pulling 
> the functional data corresponding to each SNP from a data base like 
> ensemble, i.e.( is the gene name if the snp i sin a gene, intron, exon, 
> non_ synonymous snp, or synonymous snp). is it possible to do this in R 
> using BioMart or any other packages?
> 
> Do you mean to ask if it is possible, or is it easy? It is certainly 
> possible, although it depends on exactly what you want. Your question is 
> not as complete as it could be. In the future, you should try to explain 
> exactly what you are trying to do rather than asking open-ended questions.
> 
> You can get information about SNPs using biomaRt, but the available 
> information looks pretty sparse to me when compared to the small list of 
> interests you seem to have. But you can look to see what is available 
> easily enough:
> 
> library(biomaRt)
> mart <- useMart("snp","hsapiens_snp")
> listAttributes(mart)
> 
> There are one or two vignettes that come with biomaRt that should help 
> you get started if you like what you see.
> 
> I generally don't use biomaRt for this sort of thing, instead preferring 
> to hit the UCSC database directly. Note that what I show below might be 
> done as easily using the rtracklayer package; you might explore the 
> vignettes for that package as well. Anyway, I would use the RMySQL 
> package and query directly:
> 
> library(RMySQL)
> con <- dbConnect("MySQL", host = "genome-mysql.cse.ucsc.edu 
> <http://genome-mysql.cse.ucsc.edu/>", dbname = "hg18", user = "genome")
> 
> ## what type of info is available?
> 
>  > dbGetQuery(con, "select * from snp129 where name='rs25';")
>   bin chrom chromStart chromEnd name score strand refNCBI refUCSC observed
> 1 673  chr7  11550666 11550667 rs25    0      -      T      T      A/G
>   molType  class                            valid    avHet  avHetSE  func
> 1 genomic single by-cluster,by-frequency,by-hapmap 0.499586 0.014383 intron
>   locType weight
> 1  exact      1
> 
> Note two things here. First, you don't know the return order, so you 
> should always ask for the database to return what you are querying on 
> (this is true of biomaRt as well). Second, if you are querying lots of 
> SNPs, just do it in one big query instead of one by one. Repeatedly 
> querying an online database will get you banned. So say your rs IDs are 
> in a vector rsid, and you want the chromosome, the position, the bases, 
> and the function (intron, exon, intragenic, etc).
> 
> sql <- paste("select name, chrom, chromEnd, observed, func from snp129 
> where name in ('", paste(rsid, collapse = "','"), "');", sep = "")
> 
> there are a lot of ' and " in there, because we want something that 
> looks like this:
> 
> select name, chrom, chromEnd, observed, func from snp129 where name in 
> ('rs25','rs26','rs27','rs28');
> 
> so you want to make sure the sql statement looks OK first. Then just do
> 
> dat <- dbGetQuery(con, sql)
> 
>  > rsid <- c("rs25","rs26","rs27","rs28")
>  > rsid
> [1] "rs25" "rs26" "rs27" "rs28"
>  > sql <- paste("select name, chrom, chromEnd, observed, func from 
> snp129 where name in ('", paste(rsid, collapse = "','"), "');", sep = "")
>  > sql
> [1] "select name, chrom, chromEnd, observed, func from snp129 where name 
> in ('rs25','rs26','rs27','rs28');"
>  > z <- dbGetQuery(con, sql)
>  > z
>   name chrom chromEnd observed  func
> 1 rs25  chr7 11550667      A/G intron
> 2 rs26  chr7 11549996    -/A/G intron
> 3 rs27  chr7 11549750      C/G intron
> 4 rs28  chr7 11562590      A/G intron
> 
> Best,
> 
> Jim
> 
> 
> 
>  >
>  > I appreciate your help,
>  > thanks
>  >
>  >
>  >          [[alternative HTML version deleted]]
>  >
>  >
>  >
>  > ------------------------------------------------------------------------
>  >
>  > _______________________________________________
>  > Bioconductor mailing list
>  > Bioconductor at stat.math.ethz.ch <mailto:Bioconductor at stat.math.ethz.ch>
>  > https://stat.ethz.ch/mailman/listinfo/bioconductor
>  > Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> -- James W. MacDonald, M.S.
> Biostatistician
> Douglas Lab
> University of Michigan
> Department of Human Genetics
> 5912 Buhl
> 1241 E. Catherine St.
> Ann Arbor MI 48109-5618
> 734-615-7826
> **********************************************************
> Electronic Mail is not secure, may not be read every day, and should not 
> be used for urgent or sensitive issues
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues 



More information about the Bioconductor mailing list