[BioC] Problems Counting Reads with summarizeOverlaps

Valerie Obenchain vobencha at fhcrc.org
Tue Dec 24 15:54:11 CET 2013


On 12/24/13 04:24, Sam McInturf wrote:
> Valerie,
>    Sorry for the lame error, was reading your comments about countOverlaps.
>
> I found my problem in that the features data had strand information,
> while my sequencing was not stranded, so by setting the strand
> information to "*" in the GRanges object and then reduce(split()) I was
> able to map succesfully

Great! Glad it worked.

There is an 'ignore.strand' argument in many of the count functions. 
Setting 'ignore.strand=TRUE' is equivalent to setting strand to "*" in 
the GRanges/GAlignments etc. The benefit of using the argument is that 
strand is ignored during counting but that information is retained in 
the original GRanges.

Valerie
>
>
> Thanks for the help!
>
> On Saturday, December 21, 2013, Valerie Obenchain wrote:
>
>     On 12/21/13 05:14, Sam McInturf wrote:
>
>         Valerie and Reema,
>         I am using single end reads.
>
>         I did as Valerie suggested
>
>            so <- summarizeOverlaps(tx, reads, mode="Union",type = "any")
>
>
>     This is not the example I used below. You did not include
>     'inter.feature=FALSE'. 'type' is not an argument to
>     summarizeOverlaps(). In other words, do not use 'type', use
>     'inter.feature'. Please see my previous explanation and the man page
>     for summarizeOverlaps().
>
>
>     Valerie
>
>
>         colSums(assays(so)$counts
>         #185,940
>
>         which feels silly, allow for double counts and return fewer
>         counts than
>         prior
>
>         P.S. thank you for your prompt reply, my email "hid" it from me
>         Best
>
>         Sam
>
>
>
>         On Mon, Dec 9, 2013 at 6:08 PM, Valerie Obenchain
>         <vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>> wrote:
>
>              Reema,
>
>              Thank you for pointing out the error in the summarizeOverlaps
>              vignette. That sentence was a leftover from a previous
>         version and
>              should have been removed. I have updated the vignette in both
>              release (GenomicRanges 1.14.4) and devel (GenomicAlignments
>         0.99.8).
>
>              When the 'reads' argument is a GAlignmentPairs object the
>              stand-alone functions of Union(), IntersectionStrict() and
>              IntersectionNotEmpty() count the data as paired-end. If
>         'reads' is a
>              GAlignments object they are counted as single-end.
>              summarizeOverlaps() reads the data into a GAlignments or
>              GAlignmentPairs object and calls these functions
>         internally; they
>              are exactly the same.
>
>              Valerie
>
>
>
>              On 12/09/2013 02:46 AM, Reema Singh wrote:
>
>                  Hi all,
>
>                  Indeed summarizeOverlaps() works on paired end data.
>         But different
>                  counting modes [Union, Intersection and
>         IntersectionNotEmpty]
>                  they do
>                  not handle paired end reads. Here the reference
>
>         [http://www.bioconductor.org/____packages/release/bioc/____vignettes/GenomicRanges/inst/____doc/summarizeOverlaps.pdf]-
>         <http://www.bioconductor.org/__packages/release/bioc/__vignettes/GenomicRanges/inst/__doc/summarizeOverlaps.pdf%5D->
>
>         <http://www.bioconductor.org/__packages/release/bioc/__vignettes/GenomicRanges/inst/__doc/summarizeOverlaps.pdf%5D-
>         <http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps.pdf%5D->>
>                  Page 2 [under Counting Modes].
>
>                  Kind Regards
>
>
>
>                  On Sun, Dec 8, 2013 at 6:31 PM, Valerie Obenchain
>                  <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>                  <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>>> wrote:
>
>                       summarizeOverlaps() does count paired-end reads.
>         Use option
>                       singleEnd=FALSE. Details under 'singleEnd' in the
>         'Arguments'
>                       section on the man page.
>
>                       library(GenomicAlignments) ## devel
>                       library(GenomicRanges) ## release
>                       ?summarizeOverlaps
>
>                       Valerie
>
>
>                       On 12/08/13 09:42, Reema Singh wrote:
>
>                           Hi Sam,
>
>                           It depends on whether you are using single read or
>                  paired end reads
>                           files. As far as I known summarizeOverlaps
>         function
>                  only working
>                           in case
>                           of single read. Anyone please correct me if I
>         am wrong.
>
>                           Kind Regards
>
>
>                           On Sun, Dec 8, 2013 at 8:25 PM, Valerie Obenchain
>                           <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>                  <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
>                           <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>
>                  <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>>>> wrote:
>
>                                Hi Sam,
>
>                                Are you getting any error messages related to
>                  seqlevels
>                           when you
>                                count? If you have confirmed that
>         annotation seqlevels
>                           match the bam
>                                files next look at the maximum possible
>         overlap in
>                  a single
>                           bam file.
>
>                                   bf <- singleBamFile
>                                   reads <- readGAlignments(bf)  ## assuming
>                  single-end reads
>                                   so <- summarizeOverlaps(tx, reads,
>         mode=Union,
>                           inter.feature=FALSE)
>
>                                'inter.feature=FALSE' with 'mode=Union' is
>                  countOverlaps with
>                                'type=ANY'. Reads that hit multiple
>         features will
>                  be double
>                           counted
>                                in this case. The idea to to see if there
>         are any
>                  common
>                           regions at all.
>
>                                If there are still no counts, you could
>         look more
>                  closely at a
>                                smaller chromosome region in 'reads'
>         where you
>                  would expect
>                           counts.
>                                This visual inspection might give you
>         some insight
>                  as to
>                           why there
>                                is no overlap.
>
>
>                                Valerie
>
>
>                                On 12/05/13 18:51, Sam McInturf wrote:
>
>                                    Hello Bioconductors,
>                                         I am working on an Arabidopsis
>         RNA seq
>                  set, and after
>                                    executing my count
>                                    reads script, I get a count table with no
>                  reads counted (ie
>                                    sum(colSums(mat)) = 0).
>
>                                    I mapped my reads using Tophat2 (without
>                  supplying a
>                           gtf/gff
>                                    file) and used
>                                    samtools to sort and index my
>                  accepted_hits.bam files.
>                             I loaded
>                                    these bam
>                                    files into IGV (integrated Genome
>         Browser) and the
>                           reads appear
>                                    to have
>                                    been mapped (additionally the
>                  align_summary.txt says I
>                           have good
>                                    mapping).
>                                       Script and session info below.
>
>                                    Essentially I use Rsamtools to load in my
>                  bamfiles, and
>                           then
>                                    makeTranscriptsFromGFF to make a txdb
>         object,
>                           transcriptsBy to get
>                                    transcripts, and then
>         summarizeOverlaps to
>                  count the
>                           reads.  My
>                                    gff file is
>                                    TAIR10.  I am relatively sure that
>         the genome
>                  I mapped
>                           to was
>                                    the TAIR10
>                                    release, but I am not 100% on that.
>
>                                    I have also used biomaRt to do this
>         (commented
>                  out),
>                           but I had
>                                    the same
>                                    results.
>
>                                    I am currently upgrading to R 3.0.2 and
>                  re-installing
>                                    bioconductor (
>           "/data/smcintur/Annotation/________TAIR10_GFF3_genes.gff";
>                                    library(ShortRead)
>                                    library(GenomicFeatures)
>                                    library(Rsamtools)
>                                    #library(biomaRt)
>
>
>
>         setwd("/data/smcintur/RNASeq/________NMseq/newTophat/BamFiles/__")
>
>
>                                    bf <- list.files(pattern = ".bam$")
>                                    bi <- list.files(pattern = "bam.bai$")
>                                    bfl <- BamFileList(bf, bi)
>
>                                    #mart <-
>                  makeTranscriptDbFromBiomart(________biomart =
>
>
>                                    "Public_TAIRV10", dataset=
>                                    "tairv10")
>                                    tx <- transcriptsBy(mart)
>                                    gff <-
>         makeTranscriptDbFromGFF(________gffdir,
>                  format = "gff")
>
>
>                                    gff
>
>                                    cds <- cdsBy(gff)
>                                    tx <- transcriptsBy(gff)
>
>                                    olapTx <- summarizeOverlaps(tx, bfl)
>                                    olapCds <- summarizeOverlaps(cds, bfl)
>
>                                    txMat <- assays(olapTx)$counts
>                                    cdsMat <- assays(olapCds)$counts
>
>                                    write.table(txMat, file =
>         "countMatTxGff.txt",
>                  sep = "\t")
>                                    write.table(cdsMat, file =
>                  "countMatCdsGff.txt", sep =
>                           "\t")
>
>                                    sessionInfo
>                                    R version 3.0.0 (2013-04-03)
>                                    Platform: x86_64-unknown-linux-gnu
>         (64-bit)
>
>                                    locale:
>                                       [1] LC_CTYPE=en_US.UTF-8
>         LC_NUMERIC=C
>                                       [3] LC_TIME=en_US.UTF-8
>                    LC_COLLATE=en_US.UTF-8
>                                       [5] LC_MONETARY=en_US.UTF-8
>                    LC_MESSAGES=en_US.UTF-8
>                                       [7] LC_PAPER=C
>         LC_NAME=C
>                                       [9] LC_ADDRESS=C
>         LC_TELEPHONE=C
>                                    [11] LC_MEASUREMENT=en_US.UTF-8
>                  LC_IDENTIFICATION=C
>
>                                    attached base packages:
>                                    [1] stats     graphics  grDevices utils
>                  datasets
>                             methods   base
>
>                                    other attached packages:
>                                    [1] BiocInstaller_1.12.0
>
>                                    loaded via a namespace (and not
>         attached):
>                                       [1] AnnotationDbi_1.24.0
>         Biobase_2.22.0
>                                    BiocGenerics_0.8.0
>                                       [4] biomaRt_2.18.0
>         Biostrings_2.30.1
>                             bitops_1.0-6
>                                       [7] BSgenome_1.30.0        DBI_0.2-7
>                                      GenomicFeatures_1.14.2
>                                    [10] GenomicRanges_1.14.3
>         IRanges_1.20.6
>                           parallel_3.0.0
>                                    [13] RCurl_1.95-4.1
>         Rsamtools_1.14.2
>                           RSQLite_0.11.4
>                                    [16] rtracklayer_1.22.0     stats4_3.0.0
>                           tools_3.0.0
>                                    [19] XML_3.98-1.1           XVector_0.2.0
>                             zlibbioc_1.8.0
>
>
>
>         _______________________________________________________
>
>                                Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>                  <mailto:Bioconductor at r-__project.org
>                  <mailto:Bioconductor at r-project.org>>
>                           <mailto:Bioconductor at r-____project.org
>                  <mailto:Bioconductor at r-__project.org>
>                           <mailto:Bioconductor at r-__project.org
>                  <mailto:Bioconductor at r-project.org>>>
>         https://stat.ethz.ch/mailman/________listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/______listinfo/bioconductor>
>
>         <https://stat.ethz.ch/mailman/______listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/____listinfo/bioconductor>>
>
>           <https://stat.ethz.ch/mailman/______listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/____listinfo/bioconductor>
>                  <https://stat.ethz.ch/mailman/____listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/__listinfo/bioconductor>>>
>
>
>
>
>         <https://stat.ethz.ch/mailman/______listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/____listinfo/bioconductor>
>                  <https://stat.ethz.ch/mailman/____listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/__listinfo/bioconductor>>
>
>           <https://stat.ethz.ch/mailman/____listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>                  <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>>>>
>                                Search the archives:
>         http://news.gmane.org/gmane.________science.biology.__informatics.______conductor
>         <http://news.gmane.org/gmane.______science.biology.informatics.______conductor>
>
>         <http://news.gmane.org/gmane.______science.biology.__informatics.____conductor
>         <http://news.gmane.org/gmane.____science.biology.informatics.____conductor>>
>
>
>         <http://news.gmane.org/gmane.______science.biology.__informatics.____conductor
>         <http://news.gmane.org/gmane.____science.biology.informatics.____conductor>
>
>         <http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>>>
>
>
>
>         <http://news.gmane.org/gmane.______science.biology.__informatics.____conductor
>         <http://news.gmane.org/gmane.____science.biology.informatics.____conductor>
>
>         <http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>>
>
>
>         <http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>>>>
>
>
>
>
>
>                           --
>                           Reema Singh
>                           PhD Scholar
>                           Computational Biology and Bioinformatics
>                           School of Computational and Integrative Sciences
>                           Jawaharlal Nehru University
>                           New Delhi-110067
>                           INDIA
>
>
>
>
>
>                  --
>                  Reema Singh
>                  Postdoctoral Research Assistant
>                  College of Life Sciences
>                  University of Dundee,
>                  Dundee DD1 4HN, Scotland
>                  United Kingdom
>
>
>
>              --
>              Valerie Obenchain
>
>              Program in Computational Biology
>              Division of Public Health Sciences
>              Fred Hutchinson Cancer Research Center
>              1100 Fairview Ave. N, M1-B155
>              P.O. Box 19024
>              Seattle, WA 98109-1024
>
>              E-mail: vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>              Phone: (206) 667-3158 <tel:%28206%29%20667-3158>
>              Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>
>
>         --
>         Sam McInturf
>
>
>
>
> --
> Sam McInturf



More information about the Bioconductor mailing list