[BioC] find gene symbols immediately flanking before and after a SNP position

Fri Feb 15 22:18:23 CET 2013

Dear Valerie,

Apologies for the delay. Your suggestions solves my problem perfectly.
Thank you for your help and also for writing the package. I have
blogged on how to do this at
http://adairama.wordpress.com/2013/02/15/functionally-annotate-snps-and-indels-in-bioconductor/

And here are R codes for posterity and if anyone is interested.

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

#########################
## 1. Prepare the query file ##
#########################

input <- rbind.data.frame( c("rs3753344", "chr1", 1142150),
c("rs12191877", "chr6", 31252925),  c("rs881375", "chr9", 123652898) )
colnames(input) <- c("rsid", "chr", "pos")
input$pos          <- as.numeric(as.character(input$pos))
input
#               rsid   chr       pos
# 1  rs3753344  chr1   1142150
# 2 rs12191877 chr6  31252925
# 3   rs881375   chr9 123652898

target <- with(input, GRanges( seqnames = Rle(chr),
                                             ranges   = IRanges(pos,
end=pos, names=rsid),
                                            strand   = Rle(strand("*")) ) )

###############
## 2. Annotate ##
###############

loc <- locateVariants(target, TxDb.Hsapiens.UCSC.hg19.knownGene, AllVariants())
names(loc) <- NULL
out <- as.data.frame(loc)
out$names <- names(target)[ out$QUERYID ]
out <- out[ , c("names", "seqnames", "start", "end", "LOCATION",
"GENEID", "PRECEDEID", "FOLLOWID")]
out <- unique(out)
out
#         names seqnames         start         end   LOCATION GENEID
PRECEDEID FOLLOWID
# 1  rs3753344        chr1     1142150   1142150    promoter      8784
            <NA>     <NA>
# 5  rs3753344        chr1     1142150   1142150   intergenic
<NA>             8784     7293
# 6 rs12191877       chr6   31252925  31252925        intron      3106
              <NA>     <NA>
# 7   rs881375         chr9 123652898 123652898   intergenic     <NA>
          26147     7185

##############################################
## 3. Convert from Entrez Gene ID to Gene Symbols ##
##############################################

library(org.Hs.eg.db)
Symbol2id <- as.list( org.Hs.egSYMBOL2EG )
id2Symbol <- rep( names(Symbol2id), sapply(Symbol2id, length) )
names(id2Symbol) <- unlist(Symbol2id)

x <- unique( with(out, c(levels(GENEID), levels(PRECEDEID), levels(FOLLOWID))) )
table( x %in% names(id2Symbol) ) # good, all found

out$GENESYMBOL <- id2Symbol[ as.character(out$GENEID) ]
out$PRECEDESYMBOL <- id2Symbol[ as.character(out$PRECEDEID) ]
out$FOLLOWSYMBOL <- id2Symbol[ as.character(out$FOLLOWID) ]
out

#         names seqnames     start        end    LOCATION
# 1  rs3753344     chr1   1142150     1142150   promoter
# 5  rs3753344     chr1   1142150     1142150 intergenic
# 6 rs12191877     chr6  31252925    31252925     intron
# 7   rs881375     chr9 123652898   123652898 intergenic
#
#   GENEID PRECEDEID FOLLOWID GENESYMBOL PRECEDESYMBOL FOLLOWSYMBOL
# 1     8784           <NA>         <NA>       TNFRSF18
       <NA>         <NA>
# 5    <NA>           8784           7293             <NA>
    TNFRSF18      TNFRSF4
# 6    3106           <NA>          <NA>           HLA-B
          <NA>         <NA>
# 7    <NA>         26147           7185              <NA>
         PHF19        TRAF1

Many thanks. I hope this helps someone in the future!

Regards, Adai

On Wed, Feb 13, 2013 at 11:53 PM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
>
> Hi Adai,
>
> When using precede() and follow() the 'x' and 'subject' arguments can be any of the following combinations,
>
> > showMethods("precede")
> Function: precede (package IRanges)
> x="ANY", subject="SummarizedExperiment"
> x="GenomicRanges", subject="GenomicRanges"
> x="GenomicRanges", subject="missing"
> x="Ranges", subject="RangesORmissing"
> x="SummarizedExperiment", subject="ANY"
> x="SummarizedExperiment", subject="SummarizedExperiment"
>
>
> The function transcriptsBy() returns a GRangesList. Instead try using the transcripts() function which will return a GRanges,
>
>     tx <- transcripts(txdb)
>
>
> Another function worth exploring is locateVariants() in the VariantAnnotation package. See the examples on the ?locateVariants man page to make sure the seqnames (chromosome names) in your data and the txdb match. You can try using the AllVariants() region
>
>     loc <- locateVariants(query, subject, AllVariants())
>
> or IntergenicVariants() if you are sure the snp is intergenic.
>
>     loc <- locateVariants(query, subject, IntergenicVariants())
>
> In these examples, 'query' is a GRanges of your data and 'subject' is the txdb you made from UCSC.
>
> Valerie
>
>
>
>
>
>
> On 02/13/2013 02:38 PM, Adaikalavan Ramasamy wrote:
>>
>> Dear all,
>>
>> I have a list of several hundred SNP that I would like to annotate
>> functionally and am able to do this via websites such as SeattleSeq.
>> However, for intergenic SNPs it does not give me the neighbouring
>> genes. Therefore, I have tried to find genes immediately flanking a
>> SNP (one left and right) in R. I note that this question has been
>> asked previously. I am trying to follow one of the previous
>> suggestions (https://stat.ethz.ch/pipermail/bioconductor/2010-December/037185.html).
>> I been struggling with this for the last two days but I think I am
>> getting something fundamentally wrong.
>>
>> I have chosen the following two SNPs (among several thousands). I am
>> expecting to see the following kind of output:
>>     rs881375    (chr9:123652898)  is located between PHF19 and TRAF1
>>     rs12191877 (chr6:31252925)   is located between RPL3P2 and WASF5P
>>
>>
>> First, I code the query up as a GRange object:
>>
>>     rsid <- c("rs881375", "rs12191877")
>>     chr  <- c("chr9",     "chr6")
>>     pos  <- c(123652898,  31252925)
>>
>>     library(GenomicFeatures)
>>
>>     target <- GRanges(
>>        seqnames = Rle( chr ),
>>        ranges   = IRanges(pos, end=pos, names=rsid),
>>        strand   = Rle(strand( rep("*", length(chr)) ))
>>     )
>>
>>     # GRanges with 2 ranges and 0 metadata columns:
>>     #              seqnames                 ranges strand
>>     #                <Rle>              <IRanges>  <Rle>
>>     #        rs881375     chr9 [123652898, 123652898]      *
>>     #    rs12191877     chr6 [ 31252925,  31252925]      *
>>     # ---
>>     #     seqlengths:
>>     #   chr6 chr9
>>     #       NA   NA
>>
>>     txdb <- makeTranscriptDbFromUCSC("hg19")   # about 5 min
>>     tx  <- transcriptsBy(txdb)
>>
>>
>> But when I try
>>     precede( target, tx )
>>     follow( target, tx )
>>
>> I get the following message:
>>     Error in function (classes, fdef, mtable)  :
>>       unable to find an inherited method for function ‘precede’ for
>> signature ‘"GRanges", "GRangesList"’
>>
>> Any help would be very much appreciated. I am happy to try other
>> packages or websites if available. Many thanks.
>>
>> Regards, Adai
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>