[BioC] Reading Paired End Native Report Format in ShortRead

Mon Jul 16 19:54:30 CEST 2012

Martin Thanks a lot for your time. I contacted the developers of
Novoalign and they have a perl script to convert the novo output to
sam called novo2sam.pl. I was able to convert the data I have to sam
format using that. I shall try to use Rsamtools to convert it to Bam.
I tried using samtools and the reference index file to convert it to
Bam format which I did achieve. But for some reason it did not have
the chromosome information when I tried to read it using
readAligned().  Let me play with the suggestions you have just send me
and see how it goes.
Also, the my data while it is paired end data, my alignments are
single read alignments. This is because I am looking at the structure
of DNA in chromatin so the contacts are from different chromosomes. I
presume it probably is inefficient to align paired-end data if they
involve different chromosomes. I am not sure of this but that is the
only reason I can think of. You may comment if you have any thoughts
on this.
Thanks again for script, I greatly appreciate it.

Cheers../Murli

On Sun, Jul 15, 2012 at 6:31 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 07/13/2012 11:17 AM, Murli wrote:
>>
>> Hi Martin,
>> Thanks for your suggestions. I have tried for some time to parse the
>> novoAlign output using read.table, with no success yet. I have just
>> started using shortRead and have not worked with GenomicRanges yet. I
>> have created two files of the paired end data containing  ~500 lines
>> of data that may be downloaded from
>> http://bioinformatics.iusb.edu/seqSubset.tar . Would it be possible
>> for your take a look at the format and guide me a little here?  I
>> would greatly appreciate it.
>> Cheers../Murli
>
>
> Really I think the most straight-forward approach is to get novoAlign to
> generate SAM or BAM output; if SAM then use Rsamtools::asBam to convert to
> BAM. You can then read the data in with Rsamtools::scanBam,
> GenomicRanges::readGappedAlignments, and in devel
> GenomicRanges::readGappedAlignmentPairs. Otherwise you'll spend a lot of
> time working through the appropriate interpretation of the novoAlign fields.
>
> I'm attaching a script that might at least provide some inspiration if you
> really want to parse the current text format; there are some significant
> problems (see the FIXME's, for example) and some aspects of your files look
> weird, e.g., the second column contains 'S' indicating single-end
> alignments, but the intention seems to have been to align paired end data?
>
> Martin
>
>
>
>>
>>
>>
>> On Thu, Jul 12, 2012 at 2:58 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>>
>>> On 07/12/2012 11:42 AM, Murli Nair [guest] wrote:
>>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I am trying to read the alignments generated using NovoAlign. The format
>>>> I
>>>> have the data is Paired End Native Report
>>>> Format(http://computing.bio.cam.ac.uk/local/doc/NovoCraftV2.06.pdf).
>>>> What is the most efficient way to read this data into ShortRead? Since
>>>> it
>>>> is paired end data I have two files corresponding to the two sides.
>>>> I tried without success using the different formats using readAligned().
>>>> I
>>>> also read an earlier posting about it which suggests to convert it to
>>>> SAM
>>>> format.
>>>> I would appreciate your suggestions.
>>>
>>>
>>>
>>>  From the document you reference
>>>
>>> Three output formats are provided.
>>>
>>> 1. Native
>>>
>>> 2. Extended Native
>>>
>>> 3. Pairwise
>>>
>>> 4. SAM
>>>
>>> If Paired End Native Report Format is 1 or 2 with a single record per
>>> line
>>> then I believe the only support for input would be as tab-delimited files
>>> (read.table and friends; these are flexible and could easily be used to
>>> iterate through a large file in a memory efficient way); you would then
>>> use
>>> an appropriate constructor, e.g., GenomicRanges::GappedAlignmentPairs, to
>>> create an object that you could manipulate. Format 3 looks challenging to
>>> parse.
>>>
>>> Generally, for aligned reads aim for BAM files, which is output format 4
>>> followed by using Rsamtools or other with asBam, sortBam, indexBam to
>>> create
>>> a sorted bam file and index. use GenomicRanges::readGappedAlignmentPairs
>>> for
>>> many paired-end tasks.
>>>
>>> It might help to think a little further ahead about what you want to do,
>>> e.g., GenomicRanges::summarizeOverlaps would be useful in RNAseq
>>> differential expression to count reads in regions of interest, and would
>>> need bam files but would manage data input for you.
>>>
>>> Martin
>>>
>>>> Cheers../Murli
>>>>
>>>>
>>>>    -- output of sessionInfo():
>>>>
>>>> R version 2.15.0 (2012-03-30)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>    [7] LC_PAPER=C                 LC_NAME=C
>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>> [1] ShortRead_1.14.4    latticeExtra_0.6-19 RColorBrewer_1.0-5
>>>> [4] Rsamtools_1.8.5     lattice_0.20-6      Biostrings_2.24.1
>>>> [7] GenomicRanges_1.8.7 IRanges_1.14.4      BiocGenerics_0.2.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0    hwriter_1.3
>>>> stats4_2.15.0
>>>> [6] tools_2.15.0   zlibbioc_1.2.0
>>>>
>>>> --
>>>> Sent via the guest posting facility at bioconductor.org.
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>>
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>>
>>>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
>