[BioC] Question about interpretation of CHARM results

Mon Aug 20 19:42:08 CEST 2012

On Sun, Aug 19, 2012 at 2:21 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
> (a bit late, but better late than never)
>
> Dr. Coombes is right, of course, and I ought to have mentioned this
> earlier.  It is a thorny issue, especially with epidemiological studies.
>  Cancer, less so.
>
> For differences that are real, usually they are significant on the beta
> value scale *and* the logistic scale.  Changes that are only significant on
> the logit scale are often artifacts.  The logistic scale is much more
> sensitive to technical artifacts, even if you damp the extremes by using a
> huge offset.  There are intermediate transformations (probit,
> arcsin(sqrt(x)), etc.) which are intermediate between proportional and
> logistic, but they haven't found much traction.
>
> There are a couple of things that do help and can be used to filter an
> unbiased fashion.  I'm hoping to get them submitted in the very near
> future.  I do not know if they will be as useful for CHARM data as they
> seem to be for Illumina arrays, eRRBS, and WGSBS sequence data, but I guess
> we shall find out.
>
> regarding TCGA data:
>
> TCGA methylation data is background corrected and dye bias equalized (for
> the 450k samples, at least, and as batches are updated, 27k as well) but no
> batch correction is done for the level 3 data.  In the case of multi-batch
> tumors it is a good idea to run ComBat or (if you must) SVA to compensate.

Sorry to hijack the thread, but, what is the reason to prefer ComBat over SVA?

>  It's run through normalizeMethyLumiSet(methylumi.bgcorr(the.data)) for
> level 2 and 3 data, and the IDAT files are provided as level 1, so anyone
> who wants to reproduce things from scratch with a different preprocessing
> strategy is welcome to do so.  Methylumi and minfi happen to have the same
> IDAT parsing code these days, and I would not be surprised if they
> eventually merged.  I wouldn't bother using anything else, especially for
> large volumes of samples.
>
> switching from 0.1% methylated to 99.9% methylated is probably a real
> effect.  Switching from 1% to 3% across the board is probably technical
> artifacts.

I'm guessing this to be true only for tumor/normal comparisons or
"pure" samples.
What about peripheral blood where one may be measuring a signal from a
variety of cell types or tissues?

> You will see this all the time on 450k data that hasn't been
> (ahem) properly background corrected,  and to a significant degree in 27k
> data as well.  It's not limited to TCGA; I've seen plenty of data from
> other centers that benefited significantly from being re-processed
> sensibly.  In any event, I pushed for, and got, changes to policy so that
> Illumina methylation data for TCGA is provided as raw IDAT files, and all
> of the code used for processing is available either from the Bioconductor
> package repository or from GitHub (in the case of the packaging pipeline
> itself).  It's not perfect but at least it is 100% transparent.
>
> one last thing:
>
> The Illumina annotations have been updated to a FeatureDb (one for hg19,
> and another available soon for hg18) which I would like to propose as the
> standard for these arrays, as they cover both 27k and 450k features (along
> with some information that very few people seem to know about each).  I
> think it's as fast as what minfi uses and as comprehensive as the .db0
> packages, while less confusing than either.  So anyone who wants to, please
> try them.
>
>
>
> On Fri, Aug 17, 2012 at 1:05 PM, Kevin R. Coombes <kevin.r.coombes at gmail.com
>> wrote:
>
>>  I understand all the statistical reasons for converting from methylation
>> "beta values" to something logistic, and am frequently tempted to do this
>> myself.
>>
>> But I think in the context of methylation that this advice should come
>> with a warning: changes in levels near 0 and 1 may have a lot of leverage
>> on the final results.  For example, we have done analyses on some of the
>> TCGA data where we find "statistically significant differences in
>> methylation between normal and tumor" where the mean beta values are 0.03
>> and 0.08.  I find it hard to believe that this level of change in
>> methylation has any kind of biological meaning.  In fact, I'm not even
>> convinced that we can accurately measure this amount of change using the
>> technology that TCGA is using (although I might well believe that such a
>> change could result from batch effects, whether in the assay or in the data
>> processing).
>>
>> I don't have any magic solution to fix this issue; it is intrinsic in the
>> shape of the logistic curve. One might want to explore shrinking the beta
>> values toward 0.5 (i.e., away from 0 and 1), but I can't offer any concrete
>> advice on how well this might work in practice.
>>
>> Best,
>>     Kevin
>>
>>
>> On 8/17/2012 12:36 PM, Tim Triche, Jr. wrote:
>>
>> The reason to switch from a proportion (%, beta-value, whichever; anything
>> measuring M / (M+U) where M and U are surrogates for methylated and
>> unmethylated cytosines) to a fold-change (logit(proportion.methylated) or
>> log2(M/U)) is that the latter is far more amenable to linear models, and
>> roughly parallels the expected behavior in terms of expression changes on a
>> log2 or log-fold-change scale.
>>
>> Furthermore, the range for logit(M/U) is -Infinity to +Infinity, which is
>> appropriate when you are modeling something as having Gaussian error.
>>  Something with a range of 0 to 1 is neither homoskedastic (which is to
>> say, such a 0-1 measurement will have a variance that depends on the mean)
>> nor unbounded (this turns out to be an issue when computing maximum
>> likelihood estimates, for example, as values close to the boundary will
>> cause problems).
>>
>> In any event, logit(% methylation) is equivalent to log(M/U) which is where
>> I veered off course this morning.  My brain seems to have been a bit slow.
>>
>>
>> On Fri, Aug 17, 2012 at 9:26 AM, zeynep özkeserli <zeynep.ozkeserli at gmail.com> wrote:
>>
>>
>>  Dear Tim,
>>
>> Thank you for your answer. But to my understanding, if I could get this
>> answer by undoing the logit function (I tought you were doing this), we
>> should use inverse logit function. Which is exp(x)/(1+exp(x))
>>
>> And in my case it gives:
>>
>>
>>  exp(-0.30427)/(1+exp(-0.30427))
>>
>>  [1] 0.424514
>>
>> Ok, this seems reasonable. And it makes sense how you put this into words.
>> But if we could use this one as a methylation measure, why would the
>> creators make things more complicated and convert the value to a logit
>> value? So, again, to my understanding, I shall learn how to interpret the
>> diff thing.
>>
>> Thank you again,
>>
>> Best :)
>>
>> Zeynep
>>
>> On Fri, Aug 17, 2012 at 6:29 PM, Tim Triche, Jr. <tim.triche at gmail.com> <tim.triche at gmail.com>wrote:
>>
>>
>>  Perhaps "on average this region has an
>>
>> R> 1 - exp(-0.347)
>> [1] 0.2931947
>>
>> approximately 29.3% relative decrease in cytosine methylation after
>> treatment?"
>>
>>
>>
>> On Fri, Aug 17, 2012 at 1:56 AM, zeynep özkeserli <zeynep.ozkeserli at gmail.com> wrote:
>>
>>
>>  Dear All, Dear Dr. Aryee and Dr. Carvalho,
>>
>> I have a question on interpreting the results of dmrFinder function.
>>
>> We have performed a CHARM analysis on the data we got from NimbleGen
>> Promoter Medip Arrays. The data is obtained from each patient before and
>> after treatment. And after performing CHARM analysis, we got some
>> differentially methylated regions (DMRs).
>>
>> As the samples are before and after treatment results of the same
>> patient,
>> the samples are treated as paired samples.
>>
>> My question is about interpretation of the results:
>>
>> After running this:
>>
>> dmr1_2 <- dmrFinder(rawData, p = p, groups = grp,compare = c("to", "ts"),
>> cutoff=0.995,paired=TRUE,pairs=pairs)
>>
>> to: before treatment
>> ts: after treatment
>>
>> - For example I have found a DMR like this (I summerized the result for
>> my
>> question):
>>
>> chr 8, diff= -0.30427 and maxdiff=0.47935
>>
>> As the diff value is calculated like this:   average l (logit(percentage)
>> methylation if l=NULL) difference within the DMR if paired=TRUE
>>
>> Is it true to say that: "The region has 0.30427 times the risk of being
>> methylated in samples of after treatment compared to samples of before
>> treatment."
>>
>> I know that it does not look meaningful to use the word "risk" when
>> talking
>> about something like that but I can not find a better way to say it
>> truely. Is it possible to express it like a "0.30427 fold difference in
>> methylation"? And also am I interpreting the "-" sign truely?
>>
>> Thank you for your help in advance,
>>
>> Best Regards,
>>
>> Zeynep
>>
>>         [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing listBioconductor at r-project.orghttps://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>  --
>> *A model is a lie that helps you see the truth.*
>> *
>> *
>> Howard Skipper<http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>
>>
>>
>> _______________________________________________
>> Bioconductor mailing listBioconductor at r-project.orghttps://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>
>
> --
> *A model is a lie that helps you see the truth.*
> *
> *
> Howard Skipper<http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>
>         [[alternative HTML version deleted]]
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor