[BioC] From location to get the SNP rs ID

Fabrice Tourre fabrice.ciup at gmail.com
Mon Jun 3 20:55:05 CEST 2013


Hervé Pagès,

When I get rsid from position, I think some of them cannot mapped to a
rsid. How can we let they also return a value, otherwise we will have
a problem to map the position to rsid.

For example:

posi <- matrix(unlist(strsplit(as.character(dat[,1]),":")),ncol=2,byrow=T)
chrs <- paste("ch",posi[,1],sep="")
mylocs <- GRanges(chrs,
        IRanges(as.integer(posi[,2]), width=1))
rsids <- loc2rsid(mylocs)

dat$SNP <- unlist(rsids)
Error in `$<-.data.frame`(`*tmp*`, "SNP", value = c("rs1374570",
"rs145939433",  :
  replacement has 92428 rows, data has 92489
> length(chrs)
[1] 92489
> length(unlist(rsids))
[1] 92428

On Fri, May 31, 2013 at 7:01 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
> On 05/31/2013 03:31 AM, Fabrice Tourre wrote:
>>
>> Hervé Pagès,
>>
>> It helps. Thank you very much.
>>
>> But loc2rsid seems does not in version SNPlocs.Hsapiens.dbSNP.2010042
>> and SNPlocs.Hsapiens.dbSNP.20120608.
>
>
> Nope. That's why I sent you the code ;-)
>
>
> H.
>
>>
>> On Fri, May 31, 2013 at 3:42 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>
>>> Hi Fabrice,
>>>
>>> With the loc2rsid() function (see code at the bottom) and using
>>> SNPlocs.Hsapiens.dbSNP.20100427 (I don't have
>>> SNPlocs.Hsapiens.dbSNP.20120608 installed at the moment):
>>>
>>>
>>>    mylocs <- GRanges(Rle(c("ch22", "chM", "ch21", "ch22"), c(2, 2, 1,
>>> 1)),
>>>                      IRanges(c(200, 16369861, 146, 73, 9411609,
>>> 16051107),
>>> width=1))
>>>
>>>    > mylocs
>>>    GRanges with 6 ranges and 0 metadata columns:
>>>          seqnames               ranges strand
>>>             <Rle>            <IRanges>  <Rle>
>>>      [1]     ch22 [     200,      200]      *
>>>      [2]     ch22 [16369861, 16369861]      *
>>>      [3]      chM [     146,      146]      *
>>>      [4]      chM [      73,       73]      *
>>>      [5]     ch21 [ 9411609,  9411609]      *
>>>      [6]     ch22 [16051107, 16051107]      *
>>>      ---
>>>      seqlengths:
>>>       ch21 ch22  chM
>>>         NA   NA   NA
>>>
>>> Then:
>>>
>>>    library(SNPlocs.Hsapiens.dbSNP.20100427)
>>>
>>>    > loc2rsid(mylocs)
>>>    CharacterList of length 6
>>>    [[1]] character(0)
>>>    [[2]] rs75258394 rs78180314
>>>    [[3]] character(0)
>>>    [[4]] character(0)
>>>    [[5]] rs76676778
>>>    [[6]] rs6518357
>>>
>>> You need to be careful to use the same chromosome naming convention as
>>> the SNPlocs package (which uses dbSNP naming convention). For example
>>> in the 'mylocs' object, the mitochondrial chromosome is named chM but
>>> in the SNPlocs package it's chMT:
>>>
>>>    > getSNPcount()
>>>        ch1     ch2     ch3     ch4     ch5     ch6     ch7     ch8 ch9
>>> ch10
>>>    1369185 1422439 1186807 1191984 1064540 1040203  977275  919207 740516
>>> 859433
>>>       ch11    ch12    ch13    ch14    ch15    ch16    ch17    ch18 ch19
>>> ch20
>>>     855225  822825  615382  543041  499101  556394  464686  482359 375259
>>> 450685
>>>       ch21    ch22     chX     chY    chMT
>>>     253480  244482  445596   53908     667
>>>
>>> This is why the locations on chM are not mapped to anything. But
>>> after renaming:
>>>
>>>    > seqlevels(mylocs)[3] <- "chMT"
>>>    > seqlevels(mylocs)
>>>    [1] "ch21" "ch22" "chMT"
>>>
>>>    > loc2rsid(mylocs)
>>>    CharacterList of length 6
>>>    [[1]] character(0)
>>>    [[2]] rs75258394 rs78180314
>>>    [[3]] rs72619361
>>>    [[4]] rs3087742
>>>    [[5]] rs76676778
>>>    [[6]] rs6518357
>>>
>>> they get mapped.
>>>
>>> The CharacterList object returned by loc2rsid() can be set as a
>>> metadata column of the input GRanges object:
>>>
>>>    > mcols(mylocs)$rsid <- loc2rsid(mylocs)
>>>    > mylocs
>>>    GRanges with 6 ranges and 1 metadata column:
>>>          seqnames               ranges strand |                  rsid
>>>             <Rle>            <IRanges>  <Rle> |       <CharacterList>
>>>      [1]     ch22 [     200,      200]      * |
>>>      [2]     ch22 [16369861, 16369861]      * | rs75258394,rs78180314
>>>      [3]     chMT [     146,      146]      * |            rs72619361
>>>      [4]     chMT [      73,       73]      * |             rs3087742
>>>      [5]     ch21 [ 9411609,  9411609]      * |            rs76676778
>>>      [6]     ch22 [16051107, 16051107]      * |             rs6518357
>>>      ---
>>>      seqlengths:
>>>       ch21 ch22 chMT
>>>         NA   NA   NA
>>>
>>> Hope this helps,
>>> H.
>>>
>>>
>>> ## Maps the genomic locations in 'locs' to the corresponding rs ids.
>>> ## 'locs' must be a GRanges object where all ranges are of with 1.
>>> ## Because dbSNP can assign distinct ids to SNPs located at the same
>>> ## position, the mapping is a 1-to-many relationship.
>>> ## Returns a CharacterList of the same length as 'locs'.
>>> loc2rsid <- function(locs)
>>> {
>>>      if (!is(locs, "GRanges"))
>>>          stop("'locs' must be a GRanges object")
>>>      if (!all(width(locs) == 1L))
>>>          stop("all ranges in 'locs' must be of width 1")
>>>      common_seqlevels <- intersect(seqlevels(locs), names(getSNPcount()))
>>>      if (length(common_seqlevels) == 0L)
>>>          stop("chromosome names (a.k.a. seqlevels) in 'locs' don't seem
>>> to ",
>>>               "be\n  compatible with the chromosome names in the SNPlocs
>>> ",
>>>               "package. Maybe they\n  use a different naming convention?
>>> ",
>>>               "If that's the case then you first need\n  to rename the ",
>>>               "seqlevels in 'locs'. See '?seqlevels' for how to do
>>> this.")
>>>      f <- as.factor(seqnames(locs))
>>>      locs_by_chrom <- split(start(locs), f)
>>>      rsids_by_chrom <- lapply(seq_along(locs_by_chrom),
>>>             function(i)
>>>             {
>>>                 seqname <- levels(f)[i]
>>>                 locs2 <- locs_by_chrom[[i]]
>>>                 ans2 <- vector("list", length=length(locs2))
>>>                 if (length(locs2) == 0L || !(seqname %in%
>>> common_seqlevels))
>>>                     return(ans2)
>>>                 snplocs <- getSNPlocs(seqname, as.GRanges=FALSE)
>>>                 hits <- findMatches(locs2, snplocs$loc)
>>>                 rsids <- paste0("rs",
>>> snplocs$RefSNP_id[subjectHits(hits)])
>>>                 q_hits <- queryHits(hits)
>>>                 tmp <- split(rsids, q_hits)
>>>                 ans2[as.integer(names(tmp))] <- tmp
>>>                 ans2
>>>             })
>>>      CharacterList(unsplit(rsids_by_chrom, f))
>>> }
>>>
>>> PS: This functionality will make it to the SNPlocs packages (in this
>>> form or another).
>>>
>>>
>>>
>>> On 05/30/2013 09:54 PM, Fabrice Tourre wrote:
>>>>
>>>>
>>>> Dear list,
>>>>
>>>> I have a list of genome positions. Is there is a package to get the
>>>> snp name at this position?
>>>>
>>>> It is the reverse function as rsid2loc in
>>>> SNPlocs.Hsapiens.dbSNP.20120608
>>>>
>>>> Thank you.
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319



More information about the Bioconductor mailing list