[BioC] GGtools: transScores function returning cis eQTL

Fri Oct 5 14:53:18 CEST 2012

Dear Prof. Carey,

Thanks for such a swift response.

Yes you are quite right re: the error in the script I posted. Apologies- I 
somehow managed to copy and paste part of another bit of code from the 
analysis of the dataset I'm using- although in fact since upgrading to the 
latest version of GGtools the 'chr' prefix now creates an error with that 
too (rectified by specifying the argument chrnames as a numeric vector).

Thanks also for highlighting the transTab function- this saves some of the 
manual construction of the results data frame I had been doing.

I ended up using the NCBI2R package as I was having problems with the 
rsid2loc function in the SNPlocs.Hsapiens.dbSNP.20111119 package (I realise 
there is now a newer version of the SNPlocs package); specifically the 
genotyping of the dataset I'm using was done on the Affy SNP 6.0 array, and 
some of the rs ids of my SNPs are not recognised by the function rsid2loc 
resulting in an error message:

# mysnps is a character vector of eSNP ids:
> head(mysnps)
[1] "rs17117873" "rs17117877" "rs11590757" "rs2372309"  "rs12118488"
[6] "rs2274226"

> rsid2loc(mysnps)
Error in .rsid2rowidx(rsids) : 
  rs id(s) not found: 387642, 6581161, 3129934, 3129900, 17651213, 3129868, 
17563800, 12150111, 17688944, 453997, 241033, 241031, 17688056, 17688434, 
17563501, 2902662, 17571718, 17572169, 17572851, 41437445, 7221390, 
17687667, 17564780, 8070723, 17574228, 413778, 418891, 17653162, 17688391, 
17649553, 17689882, 17650063, 916793, 17649641, 242559, 4074462, 41374248, 
17575556, 2532292, 12150672, 17763086, 17688922, 17652502, 17575850, 
11079729, 17650872, 7350980, 17563787, 41382552, 8072451, 17573175, 
17688002, 17572893, 17760733, 17762769, 17652449, 4597358, 11079724, 
17563599, 17689824, 41399444, 17761207, 3129768, 2732706, 17762165, 
17660907, 41384744, 2732675, 1406068, 17662235, 17563827, 17659881, 
17426106, 17688773, 2532297, 2532316, 1467969, 17660132, 17660065, 
17688068, 17660595, 17769490, 17763533, 34097347, 17649954, 17689471, 
12150090, 2696526, 2732589, 17691328, 3129063, 4277389, 2732647, 2532259, 
2532275, 760293, 2532257, 12150320, 1611350, 8067056

I think these are SNPs where the rs id has changed over time. The NCBI2R 
package seems to be able to handle these older rs ids.

Perhaps these and similar SNPs are reflected in the NAs you demonstrated 
below? I will try the approaches you suggested.

Kind regards, and once again many thanks

James

PS sessionInfo for the above code snippet:

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] splines   stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] NCBI2R_1.4.3                          
 [2] SNPlocs.Hsapiens.dbSNP.20111119_0.99.6
 [3] hugene10stensembl.db_2.1.0            
 [4] hugene10stensembl68.db_1.18.1         
 [5] org.Hs.eg.db_2.8.0                    
 [6] RSQLite_0.11.2                        
 [7] DBI_0.2-5                             
 [8] AnnotationDbi_1.20.0                  
 [9] oligo_1.22.0                          
[10] oligoClasses_1.20.0                   
[11] Biobase_2.18.0                        
[12] GGtools_4.6.0                         
[13] Rsamtools_1.10.1                      
[14] Biostrings_2.26.0                     
[15] GenomicRanges_1.10.0                  
[16] IRanges_1.16.2                        
[17] BiocGenerics_0.4.0                    
[18] GGBase_3.20.0                         
[19] snpStats_1.8.0                        
[20] Matrix_1.0-9                          
[21] lattice_0.20-10                       
[22] survival_2.36-14                      

loaded via a namespace (and not attached):
 [1] affxparser_1.30.0       affyio_1.26.0           annotate_1.36.0        
 [4] BiocInstaller_1.8.1     biomaRt_2.14.0          bit_1.1-8              
 [7] bitops_1.0-4.1          BSgenome_1.26.0         codetools_0.2-8        
[10] ff_2.2-7                foreach_1.4.0           genefilter_1.40.0      
[13] GenomicFeatures_1.10.0  grid_2.15.1             iterators_1.0.6        
[16] preprocessCore_1.20.0   RCurl_1.95-0.1.2        rtracklayer_1.18.0     
[19] tools_2.15.1            VariantAnnotation_1.4.0 XML_3.95-0.1           
[22] xtable_1.7-0            zlibbioc_1.4.0     

On Oct 4 2012, Vincent Carey wrote:

>First, I want to thank you for making a detailed and reproducible report.
> I have only done enough to exonerate the current version.
>
>Second, there was a problem in the script you gave -- I see no way for
>chrnames="chr1" to work with current annotation specifications as the
>chrnames is used to query the geneannopk package.  for illuminaHumanv1.db
>you have to use "1".
>
>Third, the following additional code shows that the trans scores retrieved
>involve SNP-gene distances > 100000 bp, USING THE BIOCONDUCTOR ANNOTATION
>PACKAGES SPECIFIED IN THE CALL.
>
>> tt = transTab(trS)  # get a data frame of the scores and some location
>information
>> dim(tt)
>[1] 2853880       8
>
># that gives 20 per-SNP association scores, for the 20 largest trans
>associations on chr1 to SNP on chr1, that is, the best among SNP-gene pairs
>satisfying
># your distance restriction
>
>> library(SNPlocs.Hsapiens.dbSNP.20111119)
>> sl1 = getSNPlocs("ch1", as.GRanges=TRUE)
>> rsn1 = paste("rs", values(sl1)$RefSNP_id, sep="")
>>
>> sst = start(sl1)
>> names(sst) = rsn1
>>
>> MM <- match(as.character(tt[,1]), names(sst), nomatch=0)
>> sst2 =sst[MM]
>>
>> tt2 = tt[which(MM>0),]
>>
>> summary(abs(abs(tt2$geneloc)-sst2))
>     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's
>   100100  35100000  76200000  88130000 137300000 248300000     53284
>
>this shows that the SNP-gene pairs on which scores were returned are
>separated by at least 100100 b
>
>it also shows that some gene/snp pairs couldn't be related by location, and
>that needs to be rationalized.  it may be something
>simple.
>
>I think the problem that you report comes from using NCBI2R SNP annotation
>to characterize SNP
>locations ... I'm not familiar with that package but I assume that it
>yields SNP locations not completely consistent with the
>SNPlocs packaged used in the filltering for association testing.
>
>It would be possible to use the NCBI2R locations for filtering if you made
>GRanges with the names and locations corresponding
>to the genotyped variants and put them in a package with a getSNPlocs
>function that would emulate the SNPlocs
>package behaviors.
>
>Another approach would work from passing the "cis map" in as a list,
>indicating which SNPs are excluded from
>testing for each gene.  If these become compelling use cases I can
>introduce this capability.
>
>Let me know if there are further issues.
>
>On Thu, Oct 4, 2012 at 12:09 PM, Vincent Carey
><stvjc at channing.harvard.edu>wrote:
>
>> Thank you for this report. I will get back to you ASAP. This is a tough 
>> time with versions changing, but I wonder if you could try with the new 
>> release. The robustification of location handling has been a key concern 
>> of development in the past couple of months.
>>
>> On Thu, Oct 4, 2012 at 11:56 AM, James Peters [guest] <
>> guest at bioconductor.org> wrote:
>>
>>>
>>> Hi there,
>>>
>>> I'm using the excellent GGtools to package to look for eQTLs. When I use
>>> the transScore function to look for trans eQTL I find that some of the
>>> eQTL-gene pairs I discover are closer than the distance specified by the
>>> argument 'radius' (used to define what is cis; hence eSNPs closer to the
>>> target gene than radius should not be returned by transScores).
>>>
>>> Any suggestions on what I'm doing wrong would be greatly appreciated.
>>>
>>> Best wishes
>>>
>>> James
>>>
>>> An example using GGdata looking only at genes and SNPs on chromosome 1.
>>> Here I define trans as > 100,000 bps.
>>>
>>> library(GGtools)
>>> library(GGdata)
>>>
>>> # get radius used to define cis
>>> rad <- 1e+05
>>>
>>> trS <-transScores("GGdata", snpchr = as.character(1), rhs = ~1, K = 
>>> 20, targdirpref = "tsco", geneApply = lapply, chrnames = paste("chr", 
>>> as.character(1), sep = ""), radius = rad, renameChrs = NULL, 
>>> probesToKeep = NULL, batchsize = 200, genegran = 50, shortfac = 10, 
>>> wrapperEndo= function(x) MAFfilter(x, low=0.1), geneannopk = 
>>> "illuminaHumanv1.db", snpannopk = "SNPlocs.Hsapiens.dbSNP.20111119", 
>>> gchrpref = "", schrpref = "ch")
>>>
>>> # get gene indexes
>>> TOPind <- geneIndcol(trS, 1)
>>>
>>> # construct results table using probe ids, chi squared #association 
>>> scores and snp loci TOP <- data.frame(cbind(probeid= 
>>> featureNames(getSS("GGdata",1))[TOPind], score= topScores(trS), snpids 
>>> = locusNames(trS)))
>>>
>>> TOP$score <- as.numeric(as.character(TOP$score))
>>>
>>> # set chi squared score threshold for significance for trans eQTL
>>> (equivalent to p val < 1e-11)
>>> threshold <- 46.328476
>>>
>>> tophits <- TOP[which(TOP$score > threshold),]
>>>
>>> # map probe to gene symbol, entrez id and chromosome
>>> PROBE2SYMS <- select(illuminaHumanv1.db, keys =
>>> as.character(tophits$probeid), cols = c("SYMBOL","ENTREZID","CHR"))
>>>
>>> tophits <- cbind(PROBE2SYMS, tophits)
>>>
>>> # get snp details
>>> library(NCBI2R)
>>> mysnps <- as.character(tophits$snpids)
>>> annotSNPs <- AnnotateSNPList(mysnps, file="")
>>>
>>> selectSNPinfo <- annotSNPs[,c("marker","chr","chrpos")]
>>>
>>> # get annotation on the genes using NCBI
>>> geneinfo <- GetGeneInfo(tophits$ENTREZID)
>>> # ori gives info on whether the gene is on sense or #anti-sense strand
>>> geneinfo <- geneinfo[,c("locusID","GeneLowPoint","GeneHighPoint","ori")]
>>>
>>> y <- cbind(tophits, selectSNPinfo)
>>> #check
>>> if(identical(y$marker, as.character(y$snpids))==FALSE){stop("The SNP ids
>>> don't match up")}
>>>
>>> # merge gene info with snp info and trans eQTL results
>>> final <- cbind(geneinfo, y)
>>> # check
>>> if(identical(final$locusID, final$ENTREZID)==FALSE){stop("The Entrez ids
>>> don't match up")}
>>>
>>> # clean up the dataframe (remove duplicate cols, make col names clearer)
>>> colnames(final)[1] <- 'EntrezID'
>>> colnames(final)[8] <- 'geneCHR'
>>> colnames(final)[13] <- 'snpCHR'
>>> colnames(final)[14] <- 'snpPos'
>>> final <- final[,-c(7,9,12)]
>>>
>>> # calculate distance from snp to gene start position
>>> diff <- rep(NA, nrow(final))
>>> for (i in 1:nrow(final)){
>>>   if(final$ori[i] == '+'){
>>>     diff[i] <- abs(final$GeneLowPoint[i] - final$snpPos[i])
>>>   } else if
>>>   (final$ori[i] == '-'){
>>>     diff[i] <- abs(final$GeneHighPoint[i] - final$snpPos[i])
>>>   }
>>> }
>>>
>>> # eQTL which are in fact cis
>>> final[which(diff < rad),]
>>>
>>>
>>>
>>>  -- output of sessionInfo():
>>>
>>> > sessionInfo()
>>> R version 2.15.1 (2012-06-22)
>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>
>>> locale:
>>> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>>>
>>> attached base packages:
>>> [1] splines   stats4    stats     graphics  grDevices utils     datasets
>>>  methods   base
>>>
>>> other attached packages:
>>>  [1] SNPlocs.Hsapiens.dbSNP.20111119_0.99.6
>>> SNPlocs.Hsapiens.dbSNP.20110815_0.99.6 GGdata_1.0.19
>>>    illuminaHumanv1.db_1.12.2              org.Hs.eg.db_2.7.1
>>>  [6] RSQLite_0.11.2                         DBI_0.2-5
>>>          AnnotationDbi_1.18.4                   Biobase_2.16.0
>>>             GGtools_4.4.0
>>> [11] Rsamtools_1.8.6                        Biostrings_2.24.1
>>>          GenomicRanges_1.8.13                   IRanges_1.14.4
>>>             BiocGenerics_0.2.0
>>> [16] GGBase_3.18.0                          snpStats_1.6.0
>>>           Matrix_1.0-9                           lattice_0.20-10
>>>              survival_2.36-14
>>> [21] NCBI2R_1.4.3
>>>
>>> loaded via a namespace (and not attached):
>>>  [1] annotate_1.34.1          biomaRt_2.12.0           bit_1.1-8
>>>        bitops_1.0-4.1           BSgenome_1.24.0          ff_2.2-7
>>>       genefilter_1.38.0
>>>  [8] GenomicFeatures_1.8.3    grid_2.15.1              RCurl_1.95-0.1.2
>>>       rtracklayer_1.16.3       tools_2.15.1
>>> VariantAnnotation_1.2.11 XML_3.95-0
>>> [15] xtable_1.7-0
>>>
>>> --
>>> Sent via the guest posting facility at bioconductor.org.
>>>
>>
>>
>