[BioC] perspective on differential expression counting - unmapped reads?

Thu Oct 31 21:43:47 CET 2013

Hi Jonathan,

I believe that even with the binomTest function, you could still use 
calcNormFactors and feed in the normalized library sizes (= library 
size * normfactor) for the n1 and n2 arguments to binomTest. However, 
it is certainly correct to exclude those special counts from binomTest 
in any case. Remember that ninomTest was written for SAGE data, where 
there is no such thing as "unmapped reads" or reads that don't map to a 
feature. For what it's worth, I always take the (un-normalized) library 
size to be the sum of reads that were counted toward a feature.

-Ryan

On Thu 31 Oct 2013 11:59:36 AM PDT, Jon BR wrote:
> Hi,
>      Ryan, thanks for your thoughts!
>
> I noticed in the manual the calcNormFactors step, and I've used it when
> applying EdgeR to detect genes di erentially expressed between tumor and
> normal tissue, adjusting for any di fferences between the patients (as in
> the oral carcinoma case study).
>
> That analysis provides a single p-value, looking for the genes most up, or
> down as a result of the disease across patients.
>
> Now I'm trying to look for differences on a patient-by-patient basis.  For
> example, we have a handful of genes that we think can help explain some of
> the variability between patients (they all have the same disease, but they
> are heterogeneous, and we have some gene-specific hypotheses).  We'd like
> to query whether expression changes coincide with some other phenotypes,
> looking patient-by-patient.
>
> In other words, I want to answer: Is gene x differentially expressed in
> patient y?  What about z?  Is there a gene that's up in y but down in z?
>
> I realize that there are no replicates, as we have one tumor and one normal
> for each patient, so we'll need to be careful drawing any major
> conclusions...  but I'd still like something, probably I can feel OK about
> the most dramatically different genes in a single patient.
>
> To this end, my reading of the manual pointed me towards simply using
> the binomTest function, which reads in a vector of counts for each of two
> samples, and produces a vector of p-values, one for each gene.  Under that
> scenario, I notice big (order of magnitude) differences of p-values
>   depending on whether I kept the unmapped reads or filtered them out of the
> counts vectors.
>
>
> If anyone has an alternative/better idea for a way to handle this, I would
> love to hear it!
>
> Presuming the binomial test is the best way to go, my current opinion is to
> remove the unmapped/ambiguously mapped before passing the counts vectors in.
>
> Any thoughts welcome!  (does it make sense?)
>
> Thanks,
> Jonathan
>
>
>
>
> On Thu, Oct 31, 2013 at 1:55 PM, Devon Ryan <dpryan at dpryan.com> wrote:
>
>> You can just remove those lines (in fact, that's what DESeq2 does
>> internally), they'll just needlessly increase the number of tests performed.
>>
>> ____________________________________________
>> Devon Ryan, Ph.D.
>> Email: dpryan at dpryan.com
>> Tel: +49 (0)178 298-6067
>> Molecular and Cellular Cognition Lab
>> German Centre for Neurodegenerative Diseases (DZNE)
>> Ludwig-Erhard-Allee 2
>> 53175 Bonn, Germany
>>
>> On Oct 31, 2013, at 6:30 PM, Jon BR wrote:
>>
>>> Hello,
>>>     I'm interested in calculating differential expression from some paired
>>> RNAseq samples.
>>>
>>> I've used htseq-count after mapping; quite happy with how easy that was.
>>>
>>> My question is with regard to whether or not to trip the last five rows
>>> from htseq-count output.
>>>
>>> Those rows look like this:
>>> no_feature 152030
>>> ambiguous 4876
>>> too_low_aQual 0
>>> not_aligned 0
>>> alignment_not_unique 0
>>>
>>> I can dream of reasons supporting either side of this question.. The
>> number
>>> of unmapped or ambiguously-mapping reads do contribute to the total
>> library
>>> size.  However, I'm also interested in quantifying the difference between
>>> what's human in both samples, so intuition would tell me to remove those
>>> reads.
>>>
>>> Because the counts are big, this matters a great deal.  I'm using EdgeR
>>> (again, very happy with that software), and the manual cites htseq-count
>> as
>>> a  viable methodology, but doesn't comment on their preferred treatment
>> of
>>> the unmapped reads.
>>>
>>> My first (somewhat careless) utilization of EdgeR gave us results that
>>> appeared to make sense, but upon digging a little deeper, I noticed that
>>> this question affects the p-values quite a lot because the unmapped
>> counts
>>> are so big.
>>>
>>> I would appreciate any comments/opinions!
>>>
>>> Thanks,
>>> Jonathan
>>>
>>>        [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor