[BioC] [devteam-bioc] Create consensus fasta file from indexed Bam file

Wed Jun 19 21:45:11 CEST 2013

Hi Jacques,

I just added a tool to the Rsamtools package, that makes this easy.
It's the sequenceLayer() function. Here is how to use it:

(1) Load the BAM file into a GAlignments object:

       library(Rsamtools)
       bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
       param <- ScanBamParam(what="seq")
       gal <- readGAlignmentsFromBam(bamfile, param=param)
       qseq <- mcols(gal)$seq  # the query sequences

(2) Use sequenceLayer() to "lay" the query sequences on the reference
     space. This will remove the parts from the query sequences that
     correspond to insertions and soft clipping, and it will fill them
     with - where deletions and/or skipped regions occurred:

       qseq_on_ref <- sequenceLayer(qseq, cigar(gal),
                                    from="query", to="reference")

(3) Compute 1 consensus matrix per chromosome:

       qseq_on_ref_by_chrom <- splitAsList(qseq_on_ref, seqnames(gal))
       qseq_pos_by_chrom <- splitAsList(start(gal), seqnames(gal))

       cm_by_chrom <- lapply(names(qseq_pos_by_chrom),
           function(seqname)
               consensusMatrix(qseq_on_ref_by_chrom[[seqname]],
                               as.prob=TRUE,
                               shift=qseq_pos_by_chrom[[seqname]]-1,
                               width=seqlengths(gal)[[seqname]]))
       names(cm_by_chrom) <- names(qseq_pos_by_chrom)

       ## 'cm_by_chrom' is a list of consensus matrices. Each matrix
       ## has 17 rows (1 per letter in the DNA alphabet) and 1 column
       ## per chromosome position.

(4) Compute the consensus string from each consensus matrix. We'll
     put "+" ins the strings wherever there is no coverage for that
     position, and "N" where there is coverage but no consensus.

       cs_by_chrom <- lapply(cm_by_chrom,
           function(cm) {
               ## Because consensusString() doesn't like consensus
               ## matrices with columns that contain only zeroes (and
               ## you will have columns like for chromosome positions
               ## that don't receive any coverage), we need to "fix"
               ## 'cm' first.
               idx <- colSums(cm) == 0
               cm["+", idx] <- 1
               DNAString(consensusString(cm, ambiguityMap="N"))
           })

(5) Write 'cs_by_chrom' to a FASTA file:

       writeXStringSet(DNAStringSet(cs_by_chrom), "consensus.fa")

Please let us know how this goes with your 454 data.

Note that you will need the 1.13.18 version of Rsamtools for this,
which is the latest devel version of the package. This means you need
to install BioC devel. See our website for how to do this.
Rsamtools 1.13.18 will be available in the next 24 hours or so thru
biocLite().

Cheers,
H.

On 06/19/2013 02:54 AM, Maintainer wrote:
>
> Hello,
> I have an indexed bam files from 454 sequencing on short reference sequences. I would like to create a consensus sequence in fasta format from this bam file using R Bioconductor. Is there a way to do that simply ?
> Thanks in advance
> Jacques
>
>   -- output of sessionInfo():
>
> R version 2.15.1 (2012-06-22)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252
> [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
> [5] LC_TIME=French_France.1252
>
> attached base packages:
> [1] grDevices datasets  splines   graphics  stats     tcltk     utils
> [8] methods   base
>
> other attached packages:
> [1] Rsamtools_1.10.2     Biostrings_2.26.3    GenomicRanges_1.10.7
> [4] IRanges_1.16.6       BiocGenerics_0.4.0   TinnR_1.0-5
> [7] R2HTML_2.2.1         Hmisc_3.10-1.1       survival_2.36-14
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-5    cluster_1.14.4  grid_2.15.1     lattice_0.20-15
> [5] parallel_2.15.1 stats4_2.15.1   tools_2.15.1    zlibbioc_1.4.0
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> ________________________________________________________________________
> devteam-bioc mailing list
> To unsubscribe from this mailing list send a blank email to
> devteam-bioc-leave at lists.fhcrc.org
> You can also unsubscribe or change your personal options at
> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319