[BioC] From location to get the SNP rs ID

Fabrice Tourre fabrice.ciup at gmail.com
Mon Jun 3 23:21:24 CEST 2013


Hervé Pagès,

It helps. Thank you very much.

On Mon, Jun 3, 2013 at 4:58 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
> Hi Fabrice,
>
>
> On 06/03/2013 11:55 AM, Fabrice Tourre wrote:
>>
>> Hervé Pagès,
>>
>> When I get rsid from position, I think some of them cannot mapped to a
>> rsid. How can we let they also return a value, otherwise we will have
>> a problem to map the position to rsid.
>
>
> Any genomic position can be mapped to 0, 1, or more than 1 rs ids (the
> "more than 1 rs ids" case can happen because dbSNP sometimes assign
> distinct ids to SNPs located at the same position, don't ask me why).
> This is summarized by saying that the mapping is a 1-to-many
> relationship.
>
>>
>> For example:
>>
>> posi <- matrix(unlist(strsplit(as.character(dat[,1]),":")),ncol=2,byrow=T)
>> chrs <- paste("ch",posi[,1],sep="")
>> mylocs <- GRanges(chrs,
>>          IRanges(as.integer(posi[,2]), width=1))
>> rsids <- loc2rsid(mylocs)
>>
>> dat$SNP <- unlist(rsids)
>> Error in `$<-.data.frame`(`*tmp*`, "SNP", value = c("rs1374570",
>> "rs145939433",  :
>>    replacement has 92428 rows, data has 92489
>>>
>>> length(chrs)
>>
>> [1] 92489
>>>
>>> length(unlist(rsids))
>>
>> [1] 92428
>
>
> THE PROBLEM: Because the relationship is 1-to-many, loc2rsid() cannot
> in general return a character vector of the same length as the input
> vector (i.e. the GRanges object holding the genomic positions). This
> is why the function returns a *CharacterList* object. This CharacterList
> is guaranteed to have the length of the input vector. However the list
> elements of that CharacterList can be of length 0, 1, or more than 1.
> Because of that, you should not expect that the result of unlisting this
> CharacterList will give you a character vector of the length of the
> input vector.
>
> THE SOLUTION: The general advice is to not use unlist() if you want to
> preserve length and positioning (i.e. if you want the i-th element in
> the result to correspond to the i-th in the input):
>
>   > x <- CharacterList("a", "bb", character(), "ccc")
>
>   > x
>   CharacterList of length 4
>   [[1]] a
>   [[2]] bb
>   [[3]] character(0)
>   [[4]] ccc
>
>   > unlist(x)
>   [1] "a"   "bb"  "ccc"
>
> Use as.character() instead:
>
>   > as.character(x)
>   [1] "a"   "bb"  NA    "ccc"
>
> as.character() is guaranteed to preserve length and positioning... or
> to fail. It will fail if a list element has a length >= 2. Because that
> means a choice needs to be made and as.character() won't make that
> choice for you:
>
>   > x2 <- CharacterList(letters[1:3], "bb", character(), "ccc")
>
>   > x2
>   CharacterList of length 4
>   [[1]] a b c
>   [[2]] bb
>   [[3]] character(0)
>   [[4]] ccc
>
>   > as.character(x2)
>   Error in as.vector(x, mode = "character") :
>     coercing an AtomicList object to an atomic vector is supported only for
>     objects with top-level elements of length <= 1
>
> Possible choices are (the goal being to end up with a CharacterList
> object where all the list elements have a length <= 1):
>
>   (1) Keep 1 arbitrary element of any list element of length >= 2.
>       For example, to keep the 1st element:
>
>         idx <- elementLengths(x2) >= 2
>         x2_subset <- x2[idx]
>         x2_subset <- sapply(x2_subset, `[`, 1L)
>         x2[idx] <- x2_subset
>
>       Equivalently, you could choose to keep the last element:
>
>         idx <- elementLengths(x2) >= 2
>         x2_subset <- x2[idx]
>         x2_subset <- sapply(x2_subset, function(x) x[length(x)])
>         x2[idx] <- x2_subset
>
>       Note that the above sapply() can be replaced by much faster
>
>         x2_subset <- unlist(x2_subset)[cumsum(elementLengths(x2_subset))]
>
>   (2) Keep nothing:
>
>         idx <- elementLengths(x2) >= 2
>         x2[idx] <- NA
>
>   (3) Keep an element randomly...
>
> Then you should be able to do 'as.character(x2)'.
>
> Genomic positions mapped to more than 1 rs id are rare though so it
> could be that most of the times you won't run into that problem.
> However if the code you're writing is going to be re-used by other
> people, they might supply those problematic genomic positions in
> the input so it's good if your code can handle them gracefully...
>
> Cheers,
>
> H.
>
>>
>> On Fri, May 31, 2013 at 7:01 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>
>>> On 05/31/2013 03:31 AM, Fabrice Tourre wrote:
>>>>
>>>>
>>>> Hervé Pagès,
>>>>
>>>> It helps. Thank you very much.
>>>>
>>>> But loc2rsid seems does not in version SNPlocs.Hsapiens.dbSNP.2010042
>>>> and SNPlocs.Hsapiens.dbSNP.20120608.
>>>
>>>
>>>
>>> Nope. That's why I sent you the code ;-)
>>>
>>>
>>> H.
>>>
>>>>
>>>> On Fri, May 31, 2013 at 3:42 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>>>
>>>>>
>>>>> Hi Fabrice,
>>>>>
>>>>> With the loc2rsid() function (see code at the bottom) and using
>>>>> SNPlocs.Hsapiens.dbSNP.20100427 (I don't have
>>>>> SNPlocs.Hsapiens.dbSNP.20120608 installed at the moment):
>>>>>
>>>>>
>>>>>     mylocs <- GRanges(Rle(c("ch22", "chM", "ch21", "ch22"), c(2, 2, 1,
>>>>> 1)),
>>>>>                       IRanges(c(200, 16369861, 146, 73, 9411609,
>>>>> 16051107),
>>>>> width=1))
>>>>>
>>>>>     > mylocs
>>>>>     GRanges with 6 ranges and 0 metadata columns:
>>>>>           seqnames               ranges strand
>>>>>              <Rle>            <IRanges>  <Rle>
>>>>>       [1]     ch22 [     200,      200]      *
>>>>>       [2]     ch22 [16369861, 16369861]      *
>>>>>       [3]      chM [     146,      146]      *
>>>>>       [4]      chM [      73,       73]      *
>>>>>       [5]     ch21 [ 9411609,  9411609]      *
>>>>>       [6]     ch22 [16051107, 16051107]      *
>>>>>       ---
>>>>>       seqlengths:
>>>>>        ch21 ch22  chM
>>>>>          NA   NA   NA
>>>>>
>>>>> Then:
>>>>>
>>>>>     library(SNPlocs.Hsapiens.dbSNP.20100427)
>>>>>
>>>>>     > loc2rsid(mylocs)
>>>>>     CharacterList of length 6
>>>>>     [[1]] character(0)
>>>>>     [[2]] rs75258394 rs78180314
>>>>>     [[3]] character(0)
>>>>>     [[4]] character(0)
>>>>>     [[5]] rs76676778
>>>>>     [[6]] rs6518357
>>>>>
>>>>> You need to be careful to use the same chromosome naming convention as
>>>>> the SNPlocs package (which uses dbSNP naming convention). For example
>>>>> in the 'mylocs' object, the mitochondrial chromosome is named chM but
>>>>> in the SNPlocs package it's chMT:
>>>>>
>>>>>     > getSNPcount()
>>>>>         ch1     ch2     ch3     ch4     ch5     ch6     ch7     ch8 ch9
>>>>> ch10
>>>>>     1369185 1422439 1186807 1191984 1064540 1040203  977275  919207
>>>>> 740516
>>>>> 859433
>>>>>        ch11    ch12    ch13    ch14    ch15    ch16    ch17    ch18
>>>>> ch19
>>>>> ch20
>>>>>      855225  822825  615382  543041  499101  556394  464686  482359
>>>>> 375259
>>>>> 450685
>>>>>        ch21    ch22     chX     chY    chMT
>>>>>      253480  244482  445596   53908     667
>>>>>
>>>>> This is why the locations on chM are not mapped to anything. But
>>>>> after renaming:
>>>>>
>>>>>     > seqlevels(mylocs)[3] <- "chMT"
>>>>>     > seqlevels(mylocs)
>>>>>     [1] "ch21" "ch22" "chMT"
>>>>>
>>>>>     > loc2rsid(mylocs)
>>>>>     CharacterList of length 6
>>>>>     [[1]] character(0)
>>>>>     [[2]] rs75258394 rs78180314
>>>>>     [[3]] rs72619361
>>>>>     [[4]] rs3087742
>>>>>     [[5]] rs76676778
>>>>>     [[6]] rs6518357
>>>>>
>>>>> they get mapped.
>>>>>
>>>>> The CharacterList object returned by loc2rsid() can be set as a
>>>>> metadata column of the input GRanges object:
>>>>>
>>>>>     > mcols(mylocs)$rsid <- loc2rsid(mylocs)
>>>>>     > mylocs
>>>>>     GRanges with 6 ranges and 1 metadata column:
>>>>>           seqnames               ranges strand |                  rsid
>>>>>              <Rle>            <IRanges>  <Rle> |       <CharacterList>
>>>>>       [1]     ch22 [     200,      200]      * |
>>>>>       [2]     ch22 [16369861, 16369861]      * | rs75258394,rs78180314
>>>>>       [3]     chMT [     146,      146]      * |            rs72619361
>>>>>       [4]     chMT [      73,       73]      * |             rs3087742
>>>>>       [5]     ch21 [ 9411609,  9411609]      * |            rs76676778
>>>>>       [6]     ch22 [16051107, 16051107]      * |             rs6518357
>>>>>       ---
>>>>>       seqlengths:
>>>>>        ch21 ch22 chMT
>>>>>          NA   NA   NA
>>>>>
>>>>> Hope this helps,
>>>>> H.
>>>>>
>>>>>
>>>>> ## Maps the genomic locations in 'locs' to the corresponding rs ids.
>>>>> ## 'locs' must be a GRanges object where all ranges are of with 1.
>>>>> ## Because dbSNP can assign distinct ids to SNPs located at the same
>>>>> ## position, the mapping is a 1-to-many relationship.
>>>>> ## Returns a CharacterList of the same length as 'locs'.
>>>>> loc2rsid <- function(locs)
>>>>> {
>>>>>       if (!is(locs, "GRanges"))
>>>>>           stop("'locs' must be a GRanges object")
>>>>>       if (!all(width(locs) == 1L))
>>>>>           stop("all ranges in 'locs' must be of width 1")
>>>>>       common_seqlevels <- intersect(seqlevels(locs),
>>>>> names(getSNPcount()))
>>>>>       if (length(common_seqlevels) == 0L)
>>>>>           stop("chromosome names (a.k.a. seqlevels) in 'locs' don't
>>>>> seem
>>>>> to ",
>>>>>                "be\n  compatible with the chromosome names in the
>>>>> SNPlocs
>>>>> ",
>>>>>                "package. Maybe they\n  use a different naming
>>>>> convention?
>>>>> ",
>>>>>                "If that's the case then you first need\n  to rename the
>>>>> ",
>>>>>                "seqlevels in 'locs'. See '?seqlevels' for how to do
>>>>> this.")
>>>>>       f <- as.factor(seqnames(locs))
>>>>>       locs_by_chrom <- split(start(locs), f)
>>>>>       rsids_by_chrom <- lapply(seq_along(locs_by_chrom),
>>>>>              function(i)
>>>>>              {
>>>>>                  seqname <- levels(f)[i]
>>>>>                  locs2 <- locs_by_chrom[[i]]
>>>>>                  ans2 <- vector("list", length=length(locs2))
>>>>>                  if (length(locs2) == 0L || !(seqname %in%
>>>>> common_seqlevels))
>>>>>                      return(ans2)
>>>>>                  snplocs <- getSNPlocs(seqname, as.GRanges=FALSE)
>>>>>                  hits <- findMatches(locs2, snplocs$loc)
>>>>>                  rsids <- paste0("rs",
>>>>> snplocs$RefSNP_id[subjectHits(hits)])
>>>>>                  q_hits <- queryHits(hits)
>>>>>                  tmp <- split(rsids, q_hits)
>>>>>                  ans2[as.integer(names(tmp))] <- tmp
>>>>>                  ans2
>>>>>              })
>>>>>       CharacterList(unsplit(rsids_by_chrom, f))
>>>>> }
>>>>>
>>>>> PS: This functionality will make it to the SNPlocs packages (in this
>>>>> form or another).
>>>>>
>>>>>
>>>>>
>>>>> On 05/30/2013 09:54 PM, Fabrice Tourre wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Dear list,
>>>>>>
>>>>>> I have a list of genome positions. Is there is a package to get the
>>>>>> snp name at this position?
>>>>>>
>>>>>> It is the reverse function as rsid2loc in
>>>>>> SNPlocs.Hsapiens.dbSNP.20120608
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>
>>>>> --
>>>>> Hervé Pagès
>>>>>
>>>>> Program in Computational Biology
>>>>> Division of Public Health Sciences
>>>>> Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N, M1-B514
>>>>> P.O. Box 19024
>>>>> Seattle, WA 98109-1024
>>>>>
>>>>> E-mail: hpages at fhcrc.org
>>>>> Phone:  (206) 667-5791
>>>>> Fax:    (206) 667-1319
>>>
>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319



More information about the Bioconductor mailing list