[BioC] Pairs plots in lumi, plots look different?

Thu May 8 22:42:40 CEST 2008

Hi Kasper,

It is a good idea to clearly indicate that the plots are based on random
sampling when subsetting is used, so I do not need to set the random seed. I
will also add smoothScatter as an option in the functions. Thanks for your
comments!
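
A minimal sketch of what such labelling might look like. The function name
pairsSubsample and its arguments are hypothetical and are not lumi's actual
code; the point is only that the title states the subsampling and that no
seed is set:

```r
# Hypothetical helper, not part of lumi: draw a pairs plot on a random
# subset of rows and say so in the plot title. No seed is set anywhere.
pairsSubsample <- function(x, n = 5000, ...) {
  x <- as.matrix(x)
  total <- nrow(x)
  if (total > n) {
    idx <- sample(total, n)  # uses the caller's RNG state as it is
    main <- sprintf("Based on %d of %d randomly sampled rows", n, total)
  } else {
    idx <- seq_len(total)
    main <- sprintf("All %d rows", total)
  }
  pairs(x[idx, , drop = FALSE], main = main, ...)
  invisible(main)  # return the title so callers can inspect it
}

# Simulated data standing in for an expression matrix (an assumption).
m <- matrix(rnorm(4 * 20000), ncol = 4)
pairsSubsample(m)
```

Repeated calls will show different subsets, which the title makes explicit.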

Best,

Pan

On 5/8/08 2:19 PM, "Kasper Daniel Hansen" <khansen at stat.Berkeley.EDU> wrote:

> I have two comments on this, one general and one specific.
> 
> I'll start with the specific: in this case you are providing a pairs
> plot. Presumably to avoid overplotting you subsample the data points.
> Depending on what you want to use the plot for this may be quite ok -
> but the users need to know this! Clearly in this case the user was
> surprised to see it (perhaps it is highlighted on the help page, I
> don't know). For certain things - especially for QC I would say - I
> would personally prefer to plot all points (perhaps using a smoother
> like Wolfgang suggested). If users start interpreting these plots
> without knowing that it is only a fraction of the data they see, it is
> likely that they will misinterpret them. Setting the seed just
> addresses the symptom - that the plots are not "reproducible" - not the
> underlying problem that these plots may not be suitable for whatever
> the original poster had in mind (otherwise he would not care that they
> look different). What in my opinion should be done instead is
> 1) highlight it in the help page
> 2) provide some title on the plot like "based on 5000 samples" so that
> people do not get confused.
> 3) not set the seed
> 
> And now for the general comment (I guess there are two points in the
> following): I believe it is very misleading to set the seed in
> essentially any package (see below for one special case though). The
> seed is essentially a global variable and when you mess with it, other
> parts of the analysis may get affected. If an analysis method depends
> on random sampling, the conclusions (or the method) should take this
> into account. That means that the conclusions should be completely
> unaffected by whatever random numbers were generated. If that is not
> the case the analysis is flawed. It can be fixed by fixing the method,
> increasing the number of samples, or finally by adjusting the conclusions
> of the analysis. In most cases setting the seed for reproducibility
> (as was done in gcrma, see older post on the email list) just hides
> the problem and worse - typically makes users unaware of the fact that
> they need to take the effect of the randomness into account. So my
> points are
> 1) any conclusion based on random sampling should be invariant to this
> sampling.
> 2) setting the seed changes a global variable, which you should never do.
> 
> Now, some people have a seed parameter in their function. If this
> parameter has a default value like
>    ..., seed = 123, ...
> I believe it is very dangerous, for the reasons above. If the
> default for the seed parameter is to not set a seed (perhaps by
> doing something like)
>    ..., seed = NULL, ... or ..., seed = FALSE, ...
> you might as well not include it. There is not much difference between
>    set.seed(123)
>    myFunc()
> and
>    myFunc(seed = 123)
> 
> Finally I can only think of one case where a package might have a good
> reason to play with the seed: if you are trying to provide an update
> method for a resampling-based method, like
>    update(bootstrapObject, additionalSample = 1000)
> and even then it needs to be done with great care.
> 
> Kasper
>
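
The point about the seed being global state can be illustrated with a short
sketch. myFunc and myFuncLocal are hypothetical names. The second variant,
which restores the caller's RNG state on exit, is only one assumption about
what handling the seed "with great care" could look like; it is not taken
from the thread or from any package:

```r
# Hypothetical function with an optional seed argument.
myFunc <- function(n = 3, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # side effect: resets the global RNG
  runif(n)
}

# The two calling styles draw exactly the same numbers.
set.seed(123)
a <- myFunc()
b <- myFunc(seed = 123)
sameDraws <- identical(a, b)

# Variant that uses a fixed seed internally but puts the caller's RNG
# state back on exit, so the rest of the analysis is not affected.
myFuncLocal <- function(n = 3, seed = 123) {
  if (!exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
    runif(1)  # make sure the RNG state exists before saving it
  }
  old <- get(".Random.seed", envir = globalenv())
  on.exit(assign(".Random.seed", old, envir = globalenv()))
  set.seed(seed)
  runif(n)
}

# The caller's random stream is the same with or without the call.
set.seed(1)
x1 <- runif(2)
set.seed(1)
invisible(myFuncLocal())
x2 <- runif(2)
streamUntouched <- identical(x1, x2)
```

Even with the state restored, a fixed default seed still hides the sampling
variability from the user, which is the main objection raised above.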