[BioC] Finding mismatches between reads and reference sequence

Fri Apr 5 09:43:09 CEST 2013

Hi David,

Sorry for the delay in answering this. Was too busy trying to get
a few things done before the new BioC release.

On 03/26/2013 05:50 AM, David Greber wrote:
> Hello,
>
> How can I find the mismatching positions between a read (e.g. in a
> "GappedAlignments" object) and the reference sequence (a "BSgenome"
> object)? In general, I am looking for an operation that maps the read
> sequence against the reference genome (taking cigar operation into account)
> and compares the DNAString objects.
>
> I tried this, but due to the different cigar string operations, this seems
> to become difficult for complex alignments. The "Rsamtools" package offers
> with "cigarToIRangesListByAlignment", but does not take soft clips into
> account.
>
> Is there some functionality in bioconductor for this? I assume that it is a
> common task, but could not find anything like it.

I don't think there is an easy way at the moment to find the mismatches
between the reads and reference sequences.

Below is some code that solves this problem granted that (1) you have
a BSgenome object containing the reference sequences, and (2) there
are no insertions in your alignments (i.e. no I's in the cigars).
Like in the following example:

   library(pasillaBamSubset)
   library(Rsamtools)

   ## Query sequences are in the SEQ field in the BAM file.
   param <- ScanBamParam(what="seq", tag=c("MD", "NM"))
   reads <- readGappedAlignments(untreated1_chr4(), param=param)

Then:

   > colSums(cigarOpTable(cigar(reads)))
          M        I        D        N        S        H        P
   15326625        0        0 21682582        0        0        0

No insertions (I), no deletions (D). What's important for the following
code to work is that you have no insertions. Having deletions, or soft
clipping (S) or hard clipping (H) is OK.

It would probably not be too hard to adapt the code below to actually
support insertions, but that would make it a little bit more complicated.

We'll proceed in 3 steps:
   (a) Extract the query sequences, clip them, and flip them if needed.
   (b) Extract the reference sequences from the BSgenome object.
   (c) Compare query sequences to reference sequences.

(a) Extract the query sequences, clip them, and flip them if needed.

       qseqs <- mcols(reads)$seq

     Clip them. Only soft-clipping needs to be performed. The
     query sequences stored in BAM file are already hard-clipped:

       softClip <- function(qseq, cigar)
       {
           if (length(qseq) != length(cigar))
               stop("'qseq' and 'cigar' must have the same length")
           split_cigar <- splitCigar(cigar)
           no_clipping_idx <- unlist(unname(lapply(split_cigar,
                                function(x) {
                                    x1 <- x[[1L]]
                                    c(rawToChar(x1[1L]),
                                      rawToChar(x1[length(x1)])) != "S"
                                })))
           clipping <- unlist(unname(lapply(split_cigar,
                                function(x) {
                                    x2 <- x[[2L]]
                                    x2[c(1L, length(x2))]
                                })))
           clipping[no_clipping_idx] <- 0L
           left_clipping <- clipping[c(TRUE, FALSE)]
           right_clipping <- clipping[c(FALSE, TRUE)]
           narrow(qseq, start=left_clipping+1L, end=-right_clipping-1L)
       }

       qseqs <- softClip(qseqs, cigar(reads))

     Because the BAM format imposes that the read sequence stored in
     the SEQ field is reverse complemented when the read is aligned
     to the minus strand, we reverse complement it again to get it in
     the original orientation:

       is_on_minus <- as.logical(strand(reads) == "-")
       qseqs[is_on_minus] <- reverseComplement(qseqs[is_on_minus])

(b) Extract corresponding reference sequences from BSgenome object.

       library(BSgenome.Dmelanogaster.UCSC.dm3)

     A trick here is to extract the ranges of the reads in a GRangesList
     object (each read will typically produce 1 range or more), and to
     treat that GRangesList object as if it was representing exons
     grouped by transcript:

       grl <- grglist(reads, order.as.in.query=TRUE, drop.D.ranges=TRUE)

       library(GenomicFeatures)
       ## Warning can be ignored.
       rseqs <- extractTranscriptsFromGenome(Dmelanogaster, grl)

(c) Compare 'qseqs' and 'rseqs'.

     At this point 'qseqs' and 'rseqs' should have the same shape (i.e.
     same length and widths). Furthermore, if there was no mismatch
     in any of the alignments, 'qseqs' and 'rseqs' would contain
     exactly the same sequences. The mismatches can be found with:

       mismatches <- function(x, y)
       {
           if (!is(x, "DNAStringSet") || !is(y, "DNAStringSet"))
               stop("'x' and 'y' must be DNAStringSet objects")
           x_width <- width(x)
           y_width <- width(y)
           if (!identical(x_width, y_width))
               stop("'x' and 'y' must have the same shape ",
                    "(i.e. same length and widths)")
           unlisted_ans <- which(as.raw(unlist(x)) != as.raw(unlist(y)))
           breakpoints <- cumsum(x_width)
           ans_eltlens <- tabulate(findInterval(unlisted_ans - 1L,
                                                breakpoints) + 1L,
                                   nbins=length(x))
           skeleton <- PartitioningByEnd(cumsum(ans_eltlens))
           ans <- relist(unlisted_ans, skeleton)
           offsets <- c(0L, breakpoints[-length(breakpoints)])
           ans <- ans - offsets
           ans
       }

       mm <- mismatches(qseqs, rseqs)

     The result is an IntegerList:

       > mm
       IntegerList of length 204355
       [[1]] 27
       [[2]] 9 13 54
       [[3]] 13 45
       [[4]] 48
       [[5]] 37 57
       [[6]] 56
       [[7]] integer(0)
       [[8]] integer(0)
       [[9]] 35 73
       [[10]] 23 41 48
       ...
       <204345 more elements>

     Nb of mismatches per alignment:

       elementLengths(mm)

     In the case of pasillaBamSubset, since there are not indels in the
     alignments, this should be identical to the edit distance reported
     in the NM tag of the BAM file:

       stopifnot(identical(elementLengths(mm), mcols(reads)$NM))

     Unfortunately, using 'mm' to subset 'qseqs' or 'rseqs' is not
     supported at the moment (we should add it). Here is a temporary
     workaround:

       subsetByIntegerList <- function(x, i)
       {
           breakpoints <- cumsum(width(x))
           offsets <- c(0L, breakpoints[-length(breakpoints)])
           i <- i + offsets
           unlisted_ans <- unlist(x)[unlist(i)]
           as(successiveViews(unlisted_ans, elementLengths(i)), class(x))
       }

     Subsetting 'qseqs' and 'rseqs' produces 2 DNAStringSet objects
     with the shape of 'mm':

       > subsetByIntegerList(qseqs, mm)
         A DNAStringSet instance of length 204355
                width seq
            [1]     1 G
            [2]     3 GAG
            [3]     2 AG
            [4]     1 A
            [5]     2 CA
            ...   ... ...
       [204351]     1 A
       [204352]     1 A
       [204353]     2 AG
       [204354]     0
       [204355]     0

       > subsetByIntegerList(rseqs, mm)
         A DNAStringSet instance of length 204355
                width seq
            [1]     1 A
            [2]     3 TGA
            [3]     2 GC
            [4]     1 G
            [5]     2 AC
            ...   ... ...
       [204351]     1 G
       [204352]     1 G
       [204353]     2 GT
       [204354]     0
       [204355]     0

Lightly tested only. Let me know if you run into problems.

Lot of the above stuff will be added soon in one form or another to
the basic infrastructure. Maybe a mismatches(x, y) generic with methods
for x=GappedAlignments,y=BSgenome ; x=DNAStringSet,y=DNAStringSet ; and
other useful combinations...

Cheers,
H.

>
> Cheers
> David
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319