[BioC] [devteam-bioc] Create consensus fasta file from indexed Bam file

Jacques LEGOUIS Jacques.LEGOUIS at clermont.inra.fr
Sat Jun 22 10:30:28 CEST 2013


Hi Hervé,
     Just tested what you proposed on my files. It worked very fine.
Thanks a lot for the help.
Regards


> Hi Jacques,
>
> I just added a tool to the Rsamtools package, that makes this easy.
> It's the sequenceLayer() function. Here is how to use it:
>
> (1) Load the BAM file into a GAlignments object:
>
>       library(Rsamtools)
>       bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
>       param <- ScanBamParam(what="seq")
>       gal <- readGAlignmentsFromBam(bamfile, param=param)
>       qseq <- mcols(gal)$seq  # the query sequences
>
> (2) Use sequenceLayer() to "lay" the query sequences on the reference
>     space. This will remove the parts from the query sequences that
>     correspond to insertions and soft clipping, and it will fill them
>     with - where deletions and/or skipped regions occurred:
>
>       qseq_on_ref <- sequenceLayer(qseq, cigar(gal),
>                                    from="query", to="reference")
>
> (3) Compute 1 consensus matrix per chromosome:
>
>       qseq_on_ref_by_chrom <- splitAsList(qseq_on_ref, seqnames(gal))
>       qseq_pos_by_chrom <- splitAsList(start(gal), seqnames(gal))
>
>       cm_by_chrom <- lapply(names(qseq_pos_by_chrom),
>           function(seqname)
>               consensusMatrix(qseq_on_ref_by_chrom[[seqname]],
>                               as.prob=TRUE,
> shift=qseq_pos_by_chrom[[seqname]]-1,
>                               width=seqlengths(gal)[[seqname]]))
>       names(cm_by_chrom) <- names(qseq_pos_by_chrom)
>
>       ## 'cm_by_chrom' is a list of consensus matrices. Each matrix
>       ## has 17 rows (1 per letter in the DNA alphabet) and 1 column
>       ## per chromosome position.
>
> (4) Compute the consensus string from each consensus matrix. We'll
>     put "+" ins the strings wherever there is no coverage for that
>     position, and "N" where there is coverage but no consensus.
>
>       cs_by_chrom <- lapply(cm_by_chrom,
>           function(cm) {
>               ## Because consensusString() doesn't like consensus
>               ## matrices with columns that contain only zeroes (and
>               ## you will have columns like for chromosome positions
>               ## that don't receive any coverage), we need to "fix"
>               ## 'cm' first.
>               idx <- colSums(cm) == 0
>               cm["+", idx] <- 1
>               DNAString(consensusString(cm, ambiguityMap="N"))
>           })
>
> (5) Write 'cs_by_chrom' to a FASTA file:
>
>       writeXStringSet(DNAStringSet(cs_by_chrom), "consensus.fa")
>
> Please let us know how this goes with your 454 data.
>
> Note that you will need the 1.13.18 version of Rsamtools for this,
> which is the latest devel version of the package. This means you need
> to install BioC devel. See our website for how to do this.
> Rsamtools 1.13.18 will be available in the next 24 hours or so thru
> biocLite().
>
> Cheers,
> H.
>
>
> On 06/19/2013 02:54 AM, Maintainer wrote:
>>
>> Hello,
>> I have an indexed bam files from 454 sequencing on short reference 
>> sequences. I would like to create a consensus sequence in fasta 
>> format from this bam file using R Bioconductor. Is there a way to do 
>> that simply ?
>> Thanks in advance
>> Jacques
>>
>>   -- output of sessionInfo():
>>
>> R version 2.15.1 (2012-06-22)
>> Platform: i386-pc-mingw32/i386 (32-bit)
>>
>> locale:
>> [1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252
>> [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
>> [5] LC_TIME=French_France.1252
>>
>> attached base packages:
>> [1] grDevices datasets  splines   graphics  stats     tcltk utils
>> [8] methods   base
>>
>> other attached packages:
>> [1] Rsamtools_1.10.2     Biostrings_2.26.3 GenomicRanges_1.10.7
>> [4] IRanges_1.16.6       BiocGenerics_0.4.0   TinnR_1.0-5
>> [7] R2HTML_2.2.1         Hmisc_3.10-1.1       survival_2.36-14
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-5    cluster_1.14.4  grid_2.15.1 lattice_0.20-15
>> [5] parallel_2.15.1 stats4_2.15.1   tools_2.15.1 zlibbioc_1.4.0
>>
>> -- 
>> Sent via the guest posting facility at bioconductor.org.
>>
>> ________________________________________________________________________
>> devteam-bioc mailing list
>> To unsubscribe from this mailing list send a blank email to
>> devteam-bioc-leave at lists.fhcrc.org
>> You can also unsubscribe or change your personal options at
>> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>>
>


-- 
Jacques Le Gouis
INRA, UMR INRA/UBP 1095
Génétique, Diversité et Ecophysiologie des Céréales

Domaine de Crouelle
5 Chemin de Beaulieu
63039 Clermont Ferrand Cedex 2

Tel : +33 (0)4-73-62-43-11
Fax : +33 (0)4-73-67-18-39
e-mail : jlegouis at clermont.inra.fr

http://www4.clermont.inra.fr/umr1095/abc



More information about the Bioconductor mailing list