[BioC] Finding mismatches between reads and reference sequence

Sean Davis sdavis2 at mail.nih.gov
Fri Apr 5 11:58:02 CEST 2013


On Fri, Apr 5, 2013 at 5:43 AM, David Greber <dav.greber at gmail.com> wrote:
> Hi Sean,
>
> Thanks for the hint. In my case, the pileup functions do not help much, as
> I'm interested in finding the mismatches per read/alignment, whereas the
> pileup functions only return the mismatches by positions.

In that case, you might want to use the MD tag rather than the CIGAR
string.  If your aligner does not generate an MD tag, you could use
samtools calmd to generate one and then read the MD tag with the
reads.  From the SAM format specification:

"The MD eld aims to achieve SNP/indel calling without looking at the
reference. For example, a string `10A5^AC6'
means from the leftmost reference base in the alignment, there are 10
matches followed by an A on the reference which
is dierent from the aligned read base; the next 5 reference bases are
matches followed by a 2bp deletion from the
reference; the deleted sequence is AC; the last 6 bases are matches.
The MD eld ought to match the CIGAR string"

Sean


> Best
> David
>
>
> 2013/4/5 Sean Davis <sdavis2 at mail.nih.gov>
>
>> On Tue, Mar 26, 2013 at 8:50 AM, David Greber <dav.greber at gmail.com>
>> wrote:
>> > Hello,
>> >
>> > How can I find the mismatching positions between a read (e.g. in a
>> > "GappedAlignments" object) and the reference sequence (a "BSgenome"
>> > object)? In general, I am looking for an operation that maps the read
>> > sequence against the reference genome (taking cigar operation into
>> account)
>> > and compares the DNAString objects.
>> >
>> > I tried this, but due to the different cigar string operations, this
>> seems
>> > to become difficult for complex alignments. The "Rsamtools" package
>> offers
>> > with "cigarToIRangesListByAlignment", but does not take soft clips into
>> > account.
>> >
>> > Is there some functionality in bioconductor for this? I assume that it
>> is a
>> > common task, but could not find anything like it.
>>
>> Hi, David.
>>
>> This doesn't answer your question directly, but you may want to look
>> at pileup functionality in Rsamtools.  In particular, check out the
>> applyPileups() function.
>>
>> Sean
>>
>>
>> > Cheers
>> > David
>> >
>> >         [[alternative HTML version deleted]]
>> >
>> > _______________________________________________
>> > Bioconductor mailing list
>> > Bioconductor at r-project.org
>> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>> > Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list