[BioC] creating GRanges and reading BAM files

Thu Mar 8 22:41:11 CET 2012

Hi again,

The latest version of the SAM spec (v1.4-r994), not-yet-released but
accessible here:

   https://samtools.svn.sourceforge.net/svnroot/samtools/trunk/sam-spec

has a whole new and very enlightening section about the "padded SAM
format" (3. Guide for Describing Assembly Sequences in SAM) where
the P operation plays a central role.

After reading it and also having a quick look at this thread:

http://sourceforge.net/mailarchive/forum.php?thread_name=CAKVJ-_6FpuinBN0-c9VtjaCBdiBWg%2Byzdn5mHTGGty0K__k0DQ%40mail.gmail.com&forum_name=samtools-devel

my understanding is that the only purpose of the padding CIGAR
operation 'P' is to allow conversions between the "padded SAM"
and "unpadded SAM" formats with no loss of information.

More precisely, and IIUC, P's can only show up in the "unpadded SAM"
format (which is the most commonly used format) and allow the
conversion from "unpadded SAM" to "padded SAM".

Inversely, a "padded SAM" (typically used when dealing with assembly
sequences) will never contain I's or P's and can easily be converted
into "unpadded SAM" (there is a new pad2unpad tool in samtools for
doing this).

So to summarize, yes it looks like the cigar utils in GenomicRanges
can safely ignore the P operations. I'll do that change.

Cheers,
H.

On 03/08/2012 12:11 PM, Hervé Pagès wrote:
> Hi Michael, Dave,
>
> On 03/07/2012 09:23 AM, Michael Lawrence wrote:
>> Herve, do you think we could support the P in the CIGAR strings? Or
>> does it
>> complicate things too much?
>
> According to the SAM spec:
>
> P: padding (silent deletion from padded reference)
>
> Not clear to me what they mean exactly. In particular, what's
> the "padded reference"?
>
> So I looked at the example provided in section 1.1 of the spec
> (please use a fixed-width font to display this message):
>
> Coor 12345678901234 5678901234567890123456789012345
> ref AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
> +r001/1 TTAGATAAAGGATA*CTG
> +r002 aaaAGATAA*GGATA
> +r003 gcctaAGCTAA
> +r004 ATAGCT..............TCAGC
> -r003 ttagctTAGGC
> -r001/2 CAGCGCCAT
>
> The corresponding SAM format is:
>
> @HD VN:1.3 SO:coordinate
> @SQ SN:ref LN:45
> r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
> r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
> r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
> r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
> r003 16 ref 29 30 6H5M * 0 0 TAGGC * NM:i:0
> r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT *
>
> As you can see, P is used in the cigar of read r002.
>
> So yes the reference sequence is padded with 2 consecutive * but
> that's just a displaying trick here that makes it easier to display
> the 2-letter insertion in r001/1.
>
> So I'm confused about how relevant the 1P operation in the r002
> cigar is. The same cigar but without the 1P operation would be
> 3S6M1I4M, and it tells me all I need to know about how the read
> is aligned to the reference (not to the *padded* reference, why
> should I care?).
>
> Am I missing something or are there situations where the P
> actually carries more interesting/important meaning? Otherwise,
> adapting the cigar utils in GenomicRanges will be easy: they'll
> just ignore the P's :-)
>
> But I'd really like to have a closer look at those BAM files
> produced by the Roche's software first. Dave do you think you
> can put this file (the one with P's), or a subset of it,
> somewhere online where I can access it?
>
> Thanks,
> H.
>
>>
>> On Wed, Mar 7, 2012 at 9:03 AM, David A.<dasolexa at hotmail.com> wrote:
>>
>>> Thanks a lot Michael.
>>>
>>> I will modify my pseudo-bed file accordingly.
>>>
>>>
>> If you do actually get your BED file into spec, then you could use
>> rtracklayer::import to load it directly as a GRanges. You'll still
>> need to
>> set the seqlengths on it.
>>
>>
>>> Regarding padding not being supported, and while waiting to see if it
>>> ever
>>> gets support so that I can continue with analysis, is there any tool
>>> that
>>> will automatically unpad the alignment? or do people create their own
>>> scripts to remove the P flags? That is all it needs right?
>>>
>>>
>>>
>>> Dave
>>>
>>> ------------------------------
>>> Date: Wed, 7 Mar 2012 07:21:31 -0800
>>> Subject: Re: [BioC] creating GRanges and reading BAM files
>>> From: lawrence.michael at gene.com
>>> To: dasolexa at hotmail.com
>>> CC: bioconductor at r-project.org
>>>
>>>
>>>
>>>
>>> On Wed, Mar 7, 2012 at 4:51 AM, David A.<dasolexa at hotmail.com> wrote:
>>>
>>>
>>> Hi list,
>>>
>>> sorry for a simple question, but I am a newbie a bit lost reading all
>>> the
>>> information on how to handle NGS data using R tools. I have a set of BAM
>>> files from Junior sequencer, one BAM per amplicon per sample (Roche's
>>> software does not output one single BAM file per sample including all
>>> amplicons). I also have a bed file with the features sequenced. At the
>>> moment I am only testing with two amplicons for one sample. My final
>>> aim is
>>> to get the coverage per amplicon per sample
>>>
>>> I am using R version 2.14.2 (2012-02-29)
>>>
>>> I read in the bed file and tried to create a GRanges object:
>>>
>>>> bed.frame=read.table("bed.frame",sep="\t",as.is=TRUE, header=TRUE)
>>>> bed.frame
>>> chromosome start end
>>> 1 APOBex26_M13 1 478
>>> 2 APOBex29_M13 1 448
>>>
>>>
>>> This does not appear to be a valid BED file. There should not be any
>>> column names, and the intervals should be 0-based. But anyway, let's
>>> just
>>> say it is BED-like.
>>>
>>>
>>>>
>>> gr<-GRanges(seqnames=bed.frame[,1],ranges=IRanges(bed.frame[,2],end=bed.frame[,3]))
>>>
>>>> gr
>>> GRanges with 2 ranges and 0 elementMetadata values:
>>> seqnames ranges strand
>>> <Rle> <IRanges> <Rle>
>>> [1] APOBex26_M13 [1, 478] *
>>> [2] APOBex29_M13 [1, 448] *
>>> ---
>>> seqlengths:
>>> APOBex26_M13 APOBex29_M13
>>> NA NA
>>>
>>>
>>> I also read in the list of BAM files vand convert to BamFileList:
>>>
>>>> fls = list.files(pattern="*bam",full=TRUE)
>>>> fls
>>> [1] "./Sample_Mult_1_MID-151_vs_APOBex26_M13.bam"
>>> [2] "./Sample_Mult_1_MID-151_vs_APOBex29_M13.bam"
>>> bfs<- BamFileList(fls)
>>>
>>> I then try to summarizeOverlaps but gives me the following error.
>>>
>>>> olap<-summarizeOverlaps(gr, bfs)
>>> Error in .Call2("cigar_to_width", cigar, PACKAGE = "GenomicRanges") :
>>> in 'cigar' element 1: CIGAR operation 'P' (at char 7) is not supported
>>> yet, sorry!
>>>
>>> What is the meaning of this error and how can I overcome it?
>>>
>>>
>>> Well I would just take it at face value: the P cigar operation is not
>>> supported yet. Maybe it is time to start supporting it.. :)
>>>
>>>
>>> Is the gr object not created properly? Is Metadata necessary? Why is not
>>> seqlengths filled automatically using the IRanges?
>>>
>>>
>>> There is no way to know the seqlengths automatically. You've
>>> constructed a
>>> GRanges with features that lay on some typically longer sequence, but
>>> the
>>> sequence length was never specified. You can pass your "end" values
>>> as the
>>> "seqlengths" to the GRanges constructor.
>>>
>>> Michael
>>>
>>>
>>> Thanks for your help,
>>>
>>> Dave
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319