[BioC] Reading Paired End Native Report Format in ShortRead

Murli murlinair at gmail.com
Fri Jul 13 20:17:59 CEST 2012


Hi Martin,
Thanks for your suggestions. I have tried for some time to parse the
NovoAlign output using read.table, with no success yet. I have just
started using ShortRead and have not worked with GenomicRanges yet. I
have created two files of the paired-end data containing ~500 lines
of data, which may be downloaded from
http://bioinformatics.iusb.edu/seqSubset.tar . Would it be possible
for you to take a look at the format and guide me a little here? I
would greatly appreciate it.
Cheers../Murli



On Thu, Jul 12, 2012 at 2:58 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 07/12/2012 11:42 AM, Murli Nair [guest] wrote:
>>
>>
>> Hi,
>>
>> I am trying to read the alignments generated using NovoAlign. The format I
>> have the data is Paired End Native Report
>> Format(http://computing.bio.cam.ac.uk/local/doc/NovoCraftV2.06.pdf).
>> What is the most efficient way to read this data into ShortRead? Since it
>> is paired end data I have two files corresponding to the two sides.
>> I tried without success using the different formats using readAligned(). I
>> also read an earlier posting about it which suggests to convert it to SAM
>> format.
>> I would appreciate your suggestions.
>
>
> From the document you reference
>
> Three output formats are provided.
>
> 1. Native
>
> 2. Extended Native
>
> 3. Pairwise
>
> 4. SAM
>
> If Paired End Native Report Format is 1 or 2 with a single record per line
> then I believe the only support for input would be as tab-delimited files
> (read.table and friends; these are flexible and could easily be used to
> iterate through a large file in a memory efficient way); you would then use
> an appropriate constructor, e.g., GenomicRanges::GappedAlignmentPairs, to
> create an object that you could manipulate. Format 3 looks challenging to
> parse.
>
> Generally, for aligned reads aim for BAM files: produce output format 4
> (SAM), then use Rsamtools (or another tool) with asBam, sortBam, and
> indexBam to create a sorted BAM file and index. Use
> GenomicRanges::readGappedAlignmentPairs for many paired-end tasks.
>
> It might help to think a little further ahead about what you want to do,
> e.g., GenomicRanges::summarizeOverlaps would be useful in RNAseq
> differential expression to count reads in regions of interest, and would
> need bam files but would manage data input for you.
>
> Martin
>
>> Cheers../Murli
>>
>>
>>   -- output of sessionInfo():
>>
>> R version 2.15.0 (2012-03-30)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>   [7] LC_PAPER=C                 LC_NAME=C
>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] ShortRead_1.14.4    latticeExtra_0.6-19 RColorBrewer_1.0-5
>> [4] Rsamtools_1.8.5     lattice_0.20-6      Biostrings_2.24.1
>> [7] GenomicRanges_1.8.7 IRanges_1.14.4      BiocGenerics_0.2.0
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0    hwriter_1.3    stats4_2.15.0
>> [6] tools_2.15.0   zlibbioc_1.2.0
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.
>>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
>
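
The read.table route suggested in the quoted reply (tab-delimited input, read a piece at a time so a large file never has to fit in memory) might look roughly like the sketch below. It uses base R only. The function name read_in_chunks, its arguments, and the chunk size are illustrative assumptions, not part of ShortRead or NovoAlign. The sketch also assumes that non-data lines in the report start with '#', and that the column names are supplied by the caller from the NovoAlign documentation.

```r
# Read a tab-delimited file chunk by chunk from an open connection.
# `col_names` (assumed to be known from the format documentation) fixes the
# number of columns; `fun` is applied to each chunk (a data.frame).
read_in_chunks <- function(path, col_names, chunk_size = 10000L,
                           fun = identity) {
  con <- file(path, open = "r")
  on.exit(close(con))
  results <- list()
  repeat {
    # Once the input is exhausted read.table may signal an error or return
    # zero rows; this sketch treats both as end of file. A stricter version
    # would inspect the error instead of swallowing it.
    chunk <- tryCatch(
      read.table(con, sep = "\t", header = FALSE, nrows = chunk_size,
                 col.names = col_names, colClasses = "character",
                 fill = TRUE, quote = "", comment.char = "#"),
      error = function(e) NULL
    )
    if (is.null(chunk) || nrow(chunk) == 0L) break
    results[[length(results) + 1L]] <- fun(chunk)
    # A short chunk means the end of the file was reached.
    if (nrow(chunk) < chunk_size) break
  }
  results
}
```

Passing a summarising function as `fun` (for example `nrow`, or a filter that keeps only aligned records) keeps memory use bounded by the chunk size. All columns are read as character here so that types do not vary between chunks; convert offsets and scores to numeric inside `fun` as needed.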

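The BAM route recommended in the quoted reply (SAM output, then a sorted and indexed BAM file, then reading the pairs) can be sketched as follows. The wrapper name sam_to_pairs is hypothetical. It assumes the Bioconductor packages Rsamtools and GenomicAlignments are installed; in current Bioconductor releases the paired-end reader is GenomicAlignments::readGAlignmentPairs, whereas at the time of this thread it was GenomicRanges::readGappedAlignmentPairs.

```r
# Convert a SAM file to a sorted, indexed BAM file and read it as pairs.
# Hypothetical helper; requires Rsamtools and GenomicAlignments.
sam_to_pairs <- function(sam_path) {
  stopifnot(file.exists(sam_path))
  # asBam() writes <destination>.bam; with indexDestination = TRUE the BAM
  # file is sorted and a .bai index is created alongside it.
  bam_path <- Rsamtools::asBam(sam_path,
                               destination = sub("\\.sam$", "", sam_path),
                               overwrite = TRUE,
                               indexDestination = TRUE)
  # Returns a GAlignmentPairs object with one element per read pair.
  GenomicAlignments::readGAlignmentPairs(bam_path)
}
```

The file check comes first so that a wrong path fails early with a clear error, before any conversion is attempted.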

