[BioC] Reading Paired End Native Report Format in ShortRead

Mon Jul 16 00:31:07 CEST 2012

On 07/13/2012 11:17 AM, Murli wrote:
> Hi Martin,
> Thanks for your suggestions. I have tried for some time to parse the
> novoAlign output using read.table, with no success yet. I have just
> started using shortRead and have not worked with GenomicRanges yet. I
> have created two files of the paired end data containing  ~500 lines
> of data that may be downloaded from
> http://bioinformatics.iusb.edu/seqSubset.tar . Would it be possible
> for your take a look at the format and guide me a little here?  I
> would greatly appreciate it.
> Cheers../Murli

Really I think the most straight-forward approach is to get novoAlign to 
generate SAM or BAM output; if SAM then use Rsamtools::asBam to convert 
to BAM. You can then read the data in with Rsamtools::scanBam, 
GenomicRanges::readGappedAlignments, and in devel 
GenomicRanges::readGappedAlignmentPairs. Otherwise you'll spend a lot of 
time working through the appropriate interpretation of the novoAlign fields.

I'm attaching a script that might at least provide some inspiration if 
you really want to parse the current text format; there are some 
significant problems (see the FIXME's, for example) and some aspects of 
your files look weird, e.g., the second column contains 'S' indicating 
single-end alignments, but the intention seems to have been to align 
paired end data?

Martin

>
>
>
> On Thu, Jul 12, 2012 at 2:58 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> On 07/12/2012 11:42 AM, Murli Nair [guest] wrote:
>>>
>>>
>>> Hi,
>>>
>>> I am trying to read the alignments generated using NovoAlign. The format I
>>> have the data is Paired End Native Report
>>> Format(http://computing.bio.cam.ac.uk/local/doc/NovoCraftV2.06.pdf).
>>> What is the most efficient way to read this data into ShortRead? Since it
>>> is paired end data I have two files corresponding to the two sides.
>>> I tried without success using the different formats using readAligned(). I
>>> also read an earlier posting about it which suggests to convert it to SAM
>>> format.
>>> I would appreciate your suggestions.
>>
>>
>>  From the document you reference
>>
>> Three output formats are provided.
>>
>> 1. Native
>>
>> 2. Extended Native
>>
>> 3. Pairwise
>>
>> 4. SAM
>>
>> If Paired End Native Report Format is 1 or 2 with a single record per line
>> then I believe the only support for input would be as tab-delimited files
>> (read.table and friends; these are flexible and could easily be used to
>> iterate through a large file in a memory efficient way); you would then use
>> an appropriate constructor, e.g., GenomicRanges::GappedAlignmentPairs, to
>> create an object that you could manipulate. Format 3 looks challenging to
>> parse.
>>
>> Generally, for aligned reads aim for BAM files, which is output format 4
>> followed by using Rsamtools or other with asBam, sortBam, indexBam to create
>> a sorted bam file and index. use GenomicRanges::readGappedAlignmentPairs for
>> many paired-end tasks.
>>
>> It might help to think a little further ahead about what you want to do,
>> e.g., GenomicRanges::summarizeOverlaps would be useful in RNAseq
>> differential expression to count reads in regions of interest, and would
>> need bam files but would manage data input for you.
>>
>> Martin
>>
>>> Cheers../Murli
>>>
>>>
>>>    -- output of sessionInfo():
>>>
>>> R version 2.15.0 (2012-03-30)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>    [7] LC_PAPER=C                 LC_NAME=C
>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] ShortRead_1.14.4    latticeExtra_0.6-19 RColorBrewer_1.0-5
>>> [4] Rsamtools_1.8.5     lattice_0.20-6      Biostrings_2.24.1
>>> [7] GenomicRanges_1.8.7 IRanges_1.14.4      BiocGenerics_0.2.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0    hwriter_1.3
>>> stats4_2.15.0
>>> [6] tools_2.15.0   zlibbioc_1.2.0
>>>
>>> --
>>> Sent via the guest posting facility at bioconductor.org.
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
>>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

-------------- next part --------------
library(GenomicRanges)

.novoAlign_input <-
    function(fl, filt, gal, ...)
{
    ## input
    col.names <- c("qname", "type", "qseq", "qual", "status", "maps",
                   "mapq", "rname", "pos", "strand", "mrnm", "mpos",
                   "mstrand", "mismatches")
    df <- read.table(fl, sep="\t", fill=TRUE, col.names=col.names,
                     stringsAsFactors=FALSE, ...)
    ## filter and coerce to GappedAlignments object
    gal(filt(df))
}

.novoAlign_filter <-
    function(df, ...)
{
    ## keep only uniquely aligning pairs on the forward or reverse strand
    keep <- (df$status == "U") & (df$strand %in% c("F", "R"))
    df[keep,]
}

.novoAlign_GappedAlignments <-
    function(df, ...)
{
    ## coerce some information to GappedAlignments object
    with(df, {
        GappedAlignments(seqnames=rname, pos=pos,
                         ## FIXME: not correct for indels
                         cigar=paste0(nchar(qseq), "M"),
                         strand=strand(c(F="+", R="-")[strand]),
                         names=qname,
                         qseq=DNAStringSet(qseq),
                         qual=PhredQuality(qual))
    })
}

.novoAlign_GappedAlignmentPairs <-
    function(first, last, ...)
{
    ## create GappedAlignmentPairs
    nms <- union(as.character(seqnames(first)),
                 as.character(seqnames(last)))
    seqlevels(first) <- seqlevels(last) <- nms
    isProperPair <- rep(TRUE, length(first))
    GappedAlignmentPairs(first, last, isProperPair)
}

fls <- dir(pattern="novoOutput$")
pairs <- lapply(fls, .novoAlign_input, .novoAlign_filter,
                .novoAlign_GappedAlignments)
## FIXME: is this matching correct?
o <- match(names(pairs[[1]]), names(pairs[[2]]))

galp <- .novoAlign_GappedAlignmentPairs(pairs[[1]][!is.na(o)],
                                        pairs[[2]][o[!is.na(o)]])