[BioC] pm summarization method

Wed Apr 18 15:10:15 CEST 2012

Hi Assa,

On 4/18/2012 2:33 AM, Assa Yeroslaviz wrote:
> Hi Jim,
>
> On Tue, Apr 17, 2012 at 20:04, James W. MacDonald <jmacdon at uw.edu 
> <mailto:jmacdon at uw.edu>> wrote:
>
>     Hi Assa,
>
>
>     On 4/17/2012 3:09 AM, Assa Yeroslaviz wrote:
>
>         Hi Jim and bioC-Users,
>
>
>
>         On Mon, Apr 16, 2012 at 17:54, James W. MacDonald
>         <jmacdon at uw.edu <mailto:jmacdon at uw.edu> <mailto:jmacdon at uw.edu
>         <mailto:jmacdon at uw.edu>>> wrote:
>
>            Hi Assa,
>
>
>            On 4/16/2012 9:05 AM, Assa Yeroslaviz wrote:
>
>                Hi everybody,
>
>                I have a question about the behavior of the expresso
>         command when
>                extracting the raw data from an affyBatch.
>
>                I wanted to evaluate the raw intensities values of a
>         specific
>                gene from my
>                data set and tried to extract it like that:
>                rawdata<- expresso(totalExpressionData, bg.correct=FALSE,
>                                       normalize=FALSE,
>                                       pmcorrect.method="pmonly",
>                summary.method="avgdiff",
>                                       verbose=TRUE)
>
>                I've got the result I wanted:
>                             wt1    wt2    wt3    treat1    treat2  
>          treat3
>                gene_at    125.5    101    123.5    52.5    63.5    58
>
>                The problem was that i expected them to be the other
>         way around.
>
>
>            Why did you expect it to be the other way around? What
>         exactly did
>            you expect this call to expresso() to do?
>
>
>         Sorry I didn't explain a bit more. The gene I was
>         investigating is the 'white' gene in Drosophila, which sets
>         the eye color of the flies. It was used in our experiment to
>         test the flies if they have a certain
>         knockout/mutant/treatment. By 'expect it to be the other way
>         around' I meant that only the treated flies should'veh ad a
>         high level of the white gene. In this data it is the other way
>         around.
>         We are not sure, if the error is in the data or maybe it was a
>         error made in the lab who analyzed the arrays and made an
>         erroneous identification symbols for the arrays.
>
>         That was a little background information about the experiment.
>         I hope it clarify the problem better.
>
>            Since affy is primarily based on S4 methods, it can be a bit
>            difficult to figure out what a given function is going to
>         do, so I
>            can understand you not knowing what this call to expresso() is
>            going to end up doing. However, what you are doing is pretty
>            weird, no? The avgdiff method implies pm-mm, so what do you
>         expect
>            to happen if you then specify pmonly?
>
>            Given that the pmcorrect.method controls how we correct the PM
>            probes, and there is a subtractmm option, one would normally
>            assume that the 'difference' part of avgdiff might happen
>         in that
>            step. But you said not to compute that, so all you are left
>         with
>            is the 'avg' part of avgdiff.
>
>
>         Now for the point of my command. To see if I did wrongly
>         classify the experiments or it was given to me as such, I
>         wanted to extract the raw data from the arrays. As I said
>         before, I know one can extract the PM and MM values for each
>         of the probes of a probe-set. What I really would like to have
>         is _one value for each probe-set and array each_.
>
>
>         For that reason I used the summarization of the values with
>         the expresso command. I will try to answer your questions as I
>         understand them.
>         I used the pmonly correction method, because if I understood
>         it the right way, this is what the RMA algorithm is using when
>         running the normalization procedure on the data. Am I wrong
>         about that?
>
>         Which way is the best way to extract the 'real' raw values
>         from the arrays?
>         I know that asking such question get you never a
>         straightforward answer, but a near one will be good :-)
>
>
>     Why do you need a raw value from the array to see if you mixed
>     samples? To put the question a different way, if normalized,
>     background corrected expression values are good enough for doing
>     statistics, why are they not good enough to determine if you mixed
>     samples up?
>
>
> I basically wanted to show the raw values to the biologists in the 
> lab. The problem was, that they suggested that I somehow might have 
> switched the way of analyzing the data. I mean they thought, that 
> instead of doing Treatment- Control I did Control-Treatment, which 
> would have easily explained the results of the higher expression of 
> this gene in the control.
>
> By showing them the raw data, I can give them a proof, that the switch 
> wasn't made by me, but the arrays were switched, or that the gene is 
> somehow truly down-regulated in the treatment.
>
>
>
>
>         If my logical thinking is wrong, will this cmmand will present
>         me with a more accurate results?
>         rawdatasubtractmm <- expresso(totalExpressionData,
>         bg.correct=FALSE,
>                                        normalize=FALSE,
>                                        pmcorrect.method="subtractmm",
>         summary.method="avgdiff",
>                                        verbose=TRUE)
>
>         I will run it now and look for the differences.
>
>
>            But let's set the logical assumptions aside and look at the
>         actual
>            code. Going through the code to expresso() is a bit like
>         following
>            Alice down the rabbit hole, so I will cut to the chase. In the
>            end, you will be calling two pieces of code that will
>         handle the
>            pm adjustment and the summary statistic calculation. In
>         your call
>            to expresso() these will be (respectively):
>
>         > pmcorrect.pmonly
>            function (object)
>            {
>               return(pm(object))
>            }
>
>            and
>
>         > generateExprVal.method.avgdiff
>            function (probes, ...)
>            {
>               list(exprs = apply(probes, 2, median), se.exprs =
>         apply(probes,
>                   2, sd)/sqrt(nrow(probes)))
>            }
>
>            So the pmonly will just give you the pm probes, and then
>         avgdiff
>            will give you the column medians of the pm data. Therefore you
>            have basically told expresso() that you want the median
>         value for
>            the non-background corrected, unnormalized pm probes.
>
>
>         In the vignette from affy it said to give back the average of
>         the values, so I expected it to give me back the mean, not the
>         median. May be this is why I didn't really understood the
>         results. Now it make more sense.
>
>            Was that your intention?
>
>
>         As I said earlier, I wanted to extract the 'real' values for
>         the probe-sets and I thought that extracting the PM values
>         with no correction or normalization will be the best method to
>         achieve it.
>         I ran now several other possibilities:
>         pm.correction    summary.method    gene    wt1    wt2    wt3  
>          treat1    treat2    treat3
>         mas    mas    gene_at    52.01982773    34.16503645  
>          46.87094917    4.081420059    9.505471139    3.538333392
>         pmonly    avgdiff    gene_at    125.5    101    123.5    52.5
>            63.5    58
>         pmonly    medianpolish    gene_at    6.862247387  
>          6.564551864    6.926547933    6.093298307    6.355810999  
>          6.105348101
>         subtractmm    avgdiff    gene_at 36 15 22 2 8 2.5
>         <tel:36%20%20%20%2015%20%20%20%2022%20%20%20%202%20%20%20%208%20%20%20%202.5>
>
>         This are the results.
>         The mas and the subtractmm gives me the differences between PM
>         and MM. Though the results are quite different, both of them
>         gives me the same behavior. the expression in higher in the
>         wild-type.
>         The medianpolish gives me the log2 values of the PM values.
>          This results I don't understand, a I can clearly see a
>         difference between the wt and the treatemnt, but the log-scale
>         values show almost no difference at all.
>
>
>     Here is where our interpretations of the raw data diverge. I only
>     see a few probes in your raw data that have very different
>     expression levels between the sample types. The only probes that
>     look to be binding at any appreciable level are the first two, and
>     there really isn't any difference between them. In addition, if
>     you believe that MM probes measure background, it doesn't really
>     look like you have much if any of this transcript being expressed
>     in either sample.
>
>
> Sorry,but this here I don't understand.
> I didn't present any probes from my data, but just the results of _one 
> gene_ after running the expresso command with _different options_.
> How can you say that: "I only see a few probes in your raw data that 
> have very different expression levels between the sample types"?
> I mean after analyzing the data , I think you are also right, it just 
> wonder me how could you have known that.

What are the data you show just below? You said they were the PM and MM 
probe data for the gene in question.

>
> I was just wondering here, why the medianpolish option shows basically 
> no differences at all, while the others three with the mas or 
> subtractmm options shows such a huge difference.
> Can it be explained?

First, these aren't huge differences. Sure, if you do fold changes they 
are, but remember that these data range from 0 to 2^14 or so 
(hypothetically 2^16, but it is rare to get much over 2^14). A 
difference of 40 at the very low end of that range is indistinguishable 
from noise. Second, both mas5 and avgdiff suffer from the same problems; 
they assume MM probes accurately measure just background (not true), and 
they ignore probe-specific binding, which adds nuisance data to the 
result. Once you account for probe-specific binding, the differences 
largely disappear.

Best,

Jim

>
> Thanks,
> Assa
>
>
>
>     Best,
>
>     Jim
>
>
>
>         I hope it make some more sense and I would be happy to hear
>         some more suggestions/comments on that topic
>
>         thanks a lot,
>         Assa
>
>
>            Best,
>
>            Jim
>
>
>
>                So I decided to look into the specific probe values of the
>                probes for this
>                probe-set.
>                This are the values I've got from the PM and MM
>         respectively:
>                    wt1    wt2    wt3    treat1    treat2    treat3
>                probe1    403    379    220    420    530    316
>                probe2    117    84    104    52    57    54
>                probe3    49    49    73    38    58    52
>                probe4    87    67    110    55    43    49
>                probe5    66    61    51    46    72    62
>                probe6    118    100    104    69    87    74
>                probe7    180    142    170    45    46    45
>                probe8    133    102    137    95    132    81
>                probe9    80    71    65    52    54    46
>                probe10    63    45    56    53    53    54
>                probe11    293    321    260    444    618    408
>                probe12    171    167    169    49    75    72
>                probe13    198    197    307    40    67    50
>                probe14    247    265    348    53    60    62
>
>                probe1    533    519    294    507    739    404
>                probe2    1789    1271    1468    1430    1666    1552
>                probe3    56    66    59    51    45    48
>                probe4    49    52    64    47    45    47
>                probe5    54    47    33    49    65    55
>                probe6    84    72    90    53    92    71
>                probe7    73    72    65    40    53    54
>                probe8    83    108    115    81    111    94
>                probe9    49    56    43    52    41    53
>                probe10    56    46    62    68    77    57
>                probe11    54    83    55    47    64    46
>                probe12    106    98    76    52    66    53
>                probe13    43    48    37    36    52    39
>                probe14    94    92    99    43    43    49
>
>                When I calculate the average of these two tables for each
>                array I don't get
>                the same values as presented in the top table.
>                I would like to understand how from the values on the
>         last two
>                tables I
>                come to a summarized value I get. Even if I ignore the
>         MM values
>                completely, which I think it does, I still don't see how it
>                comes to these
>                values. The two probes (Nr. 1 and 11 of the PM values) are
>                strongly differ
>                from the rest of the probes for this probe-set. Are
>         they being
>                ignored in
>                the summarization?
>
>                Thanks in advance
>
>                Assa
>
>                       [[alternative HTML version deleted]]
>
>                _______________________________________________
>                Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         <mailto:Bioconductor at r-project.org
>         <mailto:Bioconductor at r-project.org>>
>
>         https://stat.ethz.ch/mailman/listinfo/bioconductor
>                Search the
>              
>          archives:http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>            --     James W. MacDonald, M.S.
>            Biostatistician
>            University of Washington
>            Environmental and Occupational Health Sciences
>            4225 Roosevelt Way NE, # 100
>            Seattle WA 98105-6099
>
>
>
>     -- 
>     James W. MacDonald, M.S.
>     Biostatistician
>     University of Washington
>     Environmental and Occupational Health Sciences
>     4225 Roosevelt Way NE, # 100
>     Seattle WA 98105-6099
>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099