[BioC] M values; and dist functions

Wed Apr 27 17:31:07 CEST 2011

Hi John,

As an aside -- when replying to bioc emails, make sure you hit "reply
all" so the email is sent back to the list.

Comments inline:

On Wed, Apr 27, 2011 at 11:17 AM, john herbert <arraystruggles at gmail.com> wrote:
> Hello Steve,
> Thank you very much for answering the questions.
>
> "That is correct, with the (obvious) exception that there are no hard
> and fast rules for what type of samples are labeled with cy5 or cy3
> ... and often times there are dye swaps done to control for bias ...
> those scenarios are (I think) covered in the limma manual, btw."
>
> Yes, I understand about dye swapping and how they are portrayed in a design
> matrix; though I am far from fully understanding notation and design
> matrices.
>
> For heatmap, clustering, do you have to separate out your dye intensities to
> make the columns of data (separate out samples)? Most example data
> generating heatmaps is affy, which is like one colour.

I think you generally want to be plotting fold-change in your
heatmaps, so just make sure your numerators/denominators are what you
expect them to be.
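
For a two-colour array that could look roughly like the sketch below. The
intensities are made-up numbers, and `red`, `green` and `dye_swapped` are
names I invented for illustration; with real data you'd take the M values
from your normalised object rather than compute them by hand.

```r
# Toy background-corrected intensities: rows = probes, columns = arrays.
# All values are invented for illustration.
red   <- matrix(c(200, 400, 100, 800,
                  150, 300, 100, 100),
                nrow = 4,
                dimnames = list(paste("probe", 1:4, sep = ""),
                                c("array1", "array2")))
green <- matrix(c(100, 400, 200, 100,
                  150, 150, 400, 100),
                nrow = 4, dimnames = dimnames(red))

# M = log2(Cy5 / Cy3); Cy5 (red) is assumed to be the case sample here.
M <- log2(red) - log2(green)

# On a dye-swapped array numerator and denominator are reversed, so flip
# the sign to keep "case / control" in every column.
dye_swapped <- c(array1 = FALSE, array2 = TRUE)   # hypothetical layout
M[, dye_swapped] <- -M[, dye_swapped]

# One column per array, one row per probe: the shape heatmap() expects.
heatmap(M, scale = "none")
```

So you don't split the two dye channels into separate columns; each array
contributes one column of log-ratios.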

>  "Hmmm ...
>
> Well, R does have a limit on vector length that is a result of it
> using 32bit integers (for indexing, I guess) -- so, I think the max
> size of a vector is 2^31 - 1 == .Machine$integer.max == 2147483647.
>
> But that's still bigger than 582309001. This works on my cpu, for instance:
>
> R> x <- integer(582309001)
>
> It takes a ton (~2gb) of memory, but it still works (I don't have
> enough free memory to make a "numeric" (aka double) vector of that
> size, though). Maybe you're hitting the memory limits of your machine?
>
> How much RAM do you have? Are you running R in 32-bit or 64-bit mode?
> What's the result of:
>
> R> sessionInfo()"
>
> My machine is medium powerful, with CoreI7 and 12Gb of RAM.

Nice!

> My session info
> is at the bottom (machine is 64 bit) and R is 32 bit (as some
> packages are not 64-bit friendly).

I'm a bit surprised that you've found some packages to be non 64-bit
friendly ... I haven't used 32-bit R in years and haven't had any
problems that I can really remember (related to 64-bit, that is).

I'm also not running on windows, so maybe there are some 64bit
problems with windows that I'm not aware of (maybe others can
comment).

Anyway, I'd try again using 64-bit R -- I think it should work for you.
Try to run the minimal amount of code necessary so you don't have
to load any of the packages that are problematic under 64-bit.
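
To confirm which build a given session is actually running, something
like this should do (a small sketch using base R only):

```r
# 4 on a 32-bit build of R, 8 on a 64-bit build.
pointer_bytes <- .Machine$sizeof.pointer
cat("This R build is", 8 * pointer_bytes, "bit\n")

# The architecture string is also part of the sessionInfo() platform line,
# e.g. "i386" for 32-bit or "x86_64" for 64-bit on Intel/AMD machines.
cat("Architecture:", R.version$arch, "\n")
```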

As far as I know, you shouldn't have any problems with any
bioconductor packages breaking in R-64bit, and `dist` is in base-R, so
... everything should be OK.
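
One more thing worth checking, independent of the 32/64-bit question:
`dist` works on the rows of a matrix, so on `exprs(...)` it compares
probes, not arrays. A toy sketch (random data standing in for your
expression matrix; the sizes and the cutoff of 100 are arbitrary
assumptions on my part):

```r
set.seed(1)
# Stand-in for exprs(exampleSet): 1000 probes x 6 arrays of random values.
m <- matrix(rnorm(1000 * 6), nrow = 1000)

# dist() compares ROWS, so this is probe-vs-probe: 1000 * 999 / 2 values.
length(dist(m))         # 499500

# Transposing gives array-vs-array distances: 6 * 5 / 2 values.
length(dist(t(m)))      # 15

# If probe-level distances are what you want, keeping only the most
# variable probes first shrinks the problem considerably.
v   <- apply(m, 1, var)
top <- order(v, decreasing = TRUE)[1:100]
length(dist(m[top, ]))  # 4950
```

If you are after sample clustering, the transposed version is tiny no
matter how many probes are on the array.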

Might be as good a time as any to upgrade to R-2.13 and run a
'minimal' bioconductor install so you're not tempted to run anything
"extraneous".

> I am off a few days now but I could try and regenerate the R and analyses
> that broke dist, it could be a code/concept problem.

Dollars to donuts it's the 32-bit R build and not your code ... :-)
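
A quick back-of-the-envelope check of why a 32-bit R would choke on this
(just arithmetic on the number in your error message; the variable names
are mine):

```r
# dist() on n rows stores the lower triangle: n * (n - 1) / 2 doubles.
n_pairs <- 582309001          # length reported in the error message

# Solve n * (n - 1) / 2 = n_pairs for n (positive root of the quadratic).
n_rows <- (1 + sqrt(1 + 8 * n_pairs)) / 2
n_rows                        # 34127

# Each double takes 8 bytes.
bytes <- n_pairs * 8
bytes / 2^30                  # about 4.34 GiB

# A 32-bit process can address at most 2^32 bytes (4 GiB) in total.
bytes > 2^32                  # TRUE
```

So the distance object alone is bigger than anything a 32-bit process can
address, regardless of how much RAM is installed.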

-steve

>
> Best wishes,
>
> John.
>
>
>
> # Session information.
> R version 2.12.1 (2010-12-16)
> Platform: i386-pc-mingw32/i386 (32-bit)
> locale:
> [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
> [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United Kingdom.1252
> attached base packages:
> [1] stats4    grid      stats     graphics  grDevices utils     datasets  methods   base
> other attached packages:
>  [1] ALL_1.4.7                 gplots_2.8.0              caTools_1.11              bitops_1.0-4.1
>  [5] gdata_2.8.1               gtools_2.6.2              DESeq_1.2.1               locfit_1.5-6
>  [9] akima_0.5-4               arrayQualityMetrics_3.2.4 affyPLM_1.26.0            preprocessCore_1.12.0
> [13] gcrma_2.22.0              affy_1.28.0               UsingR_0.1-13             MASS_7.3-9
> [17] ggplot2_0.8.9             proto_0.3-8               reshape_0.8.3             plyr_1.4
> [21] bioDist_1.22.0            impute_1.24.0             flexclust_1.3-1           modeltools_0.2-17
> [25] scatterplot3d_0.3-33      pvclust_1.2-2             cluster_1.13.2            ks_1.8.1
> [29] misc3d_0.7-1              rgl_0.92.798              mvtnorm_0.9-95            KernSmooth_2.23-4
> [33] Agi4x44PreProcess_1.10.0  genefilter_1.32.0         convert_1.26.0            marray_1.28.0
> [37] geneplotter_1.28.0        lattice_0.19-13           annotate_1.28.0           AnnotationDbi_1.12.0
> [41] CCl4_1.0.9                limma_3.6.9               vsn_3.18.0                Biobase_2.10.0
> loaded via a namespace (and not attached):
>  [1] affyio_1.18.0       beadarray_2.0.6     Biostrings_2.18.2   DBI_0.2-5           hwriter_1.3
>  [6] IRanges_1.8.8       latticeExtra_0.6-14 RColorBrewer_1.0-2  RSQLite_0.9-4       simpleaffy_2.26.1
> [11] splines_2.12.1      survival_2.36-2     SVGAnnotation_0.9-0 tools_2.12.1        XML_3.2-0.2
> [16] xtable_1.5-6
>
>
> On Tue, Apr 26, 2011 at 2:22 PM, Steve Lianoglou
> <mailinglist.honeypot at gmail.com> wrote:
>>
>> Hi John,
>>
>>
>> On Tue, Apr 26, 2011 at 6:24 AM, john herbert <arraystruggles at gmail.com>
>> wrote:
>> > It would be helpful to get some clarification on some, supposedly,
>> > simple array facts;
>> >
>> > Part1)
>> >
>> > For 2 colour arrays, Mvalues!
>> >
>> > Am I correct in thinking that M values from a two colour array are the
>> > same as log2 fold change?
>> > Cy5 = case, Cy3 = control and M = log2(case/control), so a log fold
>> > change of -1 is 2 fold down-regulated etc?
>>
>> That is correct, with the (obvious) exception that there are no hard
>> and fast rules for what type of samples are labeled with cy5 or cy3
>> ... and often times there are dye swaps done to control for bias ...
>> those scenarios are (I think) covered in the limma manual, btw.
>>
>> > For a 1 colour array, M values will arise from array 1 = case and
>> > array 2 = control.
>> > So log2(array1/array2) is again the equivalent of log2 fold change.
>>
>> Yup ... just make sure you normalize your arrays together.
>>
>> > With both these scenarios, am I right in stating that the raw
>> > fluorescent signals are themselves log2 transformed to make plot
>> > distributions close to normal?
>>
>> Yes.
>>
>> > Part2)
>> >
>> > I use the marray package to extract and normalise the array data, and the
>> > impute package to replace missing M values.
>> > I make myself an expression set object using "new".
>> >
>> > I then want to generate a dist object
>> >
>> >> dd = dist(exprs(exampleSet))
>> > Error in vector("double", length) :
>> >  cannot allocate vector of length 582309001
>> >
>> > or
>> >
>> >> dd = dist(exampleSet)
>> > Error in vector("double", length) :
>> >  cannot allocate vector of length 582309001
>> >
>> > It is probable I need to reduce the data set first as most genes are not
>> > differentially expressed (as is the assumption with microarrays).
>> > It would be great to understand these types of things more.
>>
>> Hmmm ...
>>
>> Well, R does have a limit on vector length that is a result of it
>> using 32bit integers (for indexing, I guess) -- so, I think the max
>> size of a vector is 2^31 - 1 == .Machine$integer.max == 2147483647.
>>
>> But that's still bigger than 582309001. This works on my cpu, for
>> instance:
>>
>> R> x <- integer(582309001)
>>
>> It takes a ton (~2gb) of memory, but it still works (I don't have
>> enough free memory to make a "numeric" (aka double) vector of that
>> size, though). Maybe you're hitting the memory limits of your machine?
>>
>> How much RAM do you have? Are you running R in 32-bit or 64-bit mode?
>> What's the result of:
>>
>> R> sessionInfo()
>>
>> -steve
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>  | Memorial Sloan-Kettering Cancer Center
>>  | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
>

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact