[R] R Studio v3.0.3 for Windows 32bits is too slow
Phan, Truong Q
Troung.Phan at team.telstra.com
Fri Jul 11 03:46:43 CEST 2014
Thanks for all the comments and suggestions.
I would like to clarify a few things in response to some of your questions and doubts.
1) Matt Peeples' K-Clusters (k-means) R script covers some of the prerequisite steps for the K-Clusters algorithm.
I recommend it to anyone who has not used K-Clusters before.
a) Convert count data to percentages
b) Optionally standardize the data with Z-scores when variables differ greatly in range or standard deviation, or are not directly comparable measures
2) The K-Clusters algorithm copes well with datasets that have fewer than 5000 features.
3) Our original dataset has more than 9000 parameters. I used Hadoop MapReduce streaming via Python to reduce them to around 2000 parameters, and then used R to cleanse the data further, down to 1426 parameters.
4) I have tried Mahout and Cloudera's Oryx, but integrating different tools to perform each task (preparing data, building models, cross-validating models, testing models and presenting the data product) is far too complicated for this small POC.
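For readers who have not seen the two preparation steps a) and b) under point 1, here is a minimal sketch of what they might look like. It is written in plain Python rather than R, it is not taken from Matt Peeples' script, and it assumes that "percent" means each row's share of its own row total. The function names and the tiny count table are made up for illustration.

```python
from statistics import mean, stdev

def to_row_percent(rows):
    """Convert each row of counts to percentages of that row's total."""
    out = []
    for row in rows:
        total = sum(row)
        # Rows that sum to zero are left as zeros to avoid dividing by zero.
        out.append([100.0 * v / total if total else 0.0 for v in row])
    return out

def zscore_columns(rows):
    """Standardize each column to mean 0 and sample standard deviation 1.

    Needs at least two rows; constant columns are mapped to 0.
    """
    cols = list(zip(*rows))
    stats = [(mean(c), stdev(c)) for c in cols]
    return [
        [(v - m) / s if s else 0.0 for v, (m, s) in zip(row, stats)]
        for row in rows
    ]

# Made-up counts: 3 observations (rows) x 3 variables (columns).
counts = [[2, 6, 2], [10, 5, 5], [0, 3, 1]]
percent = to_row_percent(counts)
scaled = zscore_columns(percent)

for row in percent:
    print(row)
```

After step b) every column has the same spread, so no single variable dominates the distance calculation merely because of its range.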
Thanks and Regards,
P + 61 2 8576 5771
M + 61 4 1463 7424
E troung.phan at team.telstra.com
From: peter dalgaard [mailto:pdalgd at gmail.com]
Sent: Thursday, 10 July 2014 10:08 AM
To: Jeff Newmiller
Cc: Bert Gunter; Phan, Truong Q; r-help at r-project.org
Subject: Re: [R] R Studio v3.0.3 for Windows 32bits is too slow
Grumpy today, Jeff?
For the concrete issue, I'd conjecture that the base problem is that there are way too many columns in the data and that the nature of the method is not properly understood. It is not obvious that k-means clustering based on Euclidean distance makes sense in 1426-dimensional space. It is quite possible that the data set does not even consist of columns measured in the same units. Even if the method does fit the problem, it is quite computationally intensive. Some sort of feature extraction or data reduction technique is likely to be required.
So basically, further study of the methodology, or contact with a machine learning expert (which I am not) seems advisable.
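The concern about units can be illustrated with a small sketch. It is in plain Python rather than R, and the three records, their column meanings and the helper name `nearest` are all invented for illustration. It shows that the nearest neighbour under Euclidean distance, which is what k-means relies on, can change when one column is merely expressed in different units.

```python
import math

def nearest(points, name):
    """Return the name of the point closest to `name` by Euclidean distance."""
    target = points[name]
    others = {k: math.dist(target, v) for k, v in points.items() if k != name}
    return min(others, key=others.get)

# Hypothetical records: (a length in km, a unitless score).
in_km = {"a": (1.0, 5.0), "b": (1.5, 5.2), "c": (1.01, 9.0)}

# The same records with the first column expressed in metres instead.
in_m = {k: (km * 1000.0, score) for k, (km, score) in in_km.items()}

print(nearest(in_km, "a"))  # b: the score column decides
print(nearest(in_m, "a"))   # c: the length column now swamps the score
```

Nothing about the data changed between the two calls, only the unit of one column, which is why standardizing columns or reducing the data first is usually worth considering before clustering.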
On 09 Jul 2014, at 18:24 , Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
> Grumpy today, Bert?
> While it is a fact that RStudio is a separate tool from R, it is clear from the question that the OP is interested in capabilities that R is providing and he simply cannot tell the difference.
> 1) "Better" is a word that leads to pointless arguments. You will have to be the judge of what works for you. I caution you that Open Source tools almost always achieve success by interoperating with other OS tools, and much of the success you have already obtained is the result of many contributions, of which R and its contributed packages deserve the lion's share of credit. RStudio is a very convenient editor that makes using R and LaTeX and Markdown and version control easier, but it is unlikely that either the blame for your dissatisfaction or the credit for your success should be attributed to RStudio.
> I have successfully used all sorts of plain text editors and command line interfaces with R, and if you plan to scale up your projects then you will likely want to be very clear on this distinction between editors and computing tools so you can distribute your work on multiple parallel servers (where editors may not necessarily even be helpful) even if you choose to use RStudio as your controlling environment for launching such tasks.
> 2) and 3) I know that R has contributed packages that can manage Hadoop data processing, but I have no personal experience with them. Google is your friend... especially if you keep in mind that these tools are not all found in one monolithic package.
> For future reference: this is a plain text mailing list, so please adjust your mail client appropriately when sending to this list. Also, there are considerable resources mentioned in the Posting Guide that you should be aware of... see the link below.
> Jeff Newmiller
> DCN:<jdnewmil at dcn.davis.ca.us>
> Research Engineer (Solar/Batteries/Software/Embedded Controllers)
> ----- Sent from my phone. Please excuse my brevity.
> On July 9, 2014 7:10:00 AM PDT, Bert Gunter <gunter.berton at gene.com> wrote:
>> RStudio is a separate product with its own support. Post there, not here.
>> -- Bert
>> Bert Gunter
>> Genentech Nonclinical Biostatistics
>> (650) 467-7374
>> "Data is not information. Information is not knowledge. And knowledge
>> is certainly not wisdom."
>> Clifford Stoll
>> On Tue, Jul 8, 2014 at 7:34 PM, Phan, Truong Q
>> <Troung.Phan at team.telstra.com> wrote:
>>> Hi R'er,
>>> I have a dataset which has a matrix of 7502 x 1426 (rows x columns).
>>> The data is in a CSV format which has a size around 68Mb. This dataset is less than 10% of our dataset.
>>> I have been adopting the Anomaly detection method as described by http://www.mattpeeples.net/kmeans.html .
>>> It has been running more than 24hrs and still haven't completed the
>>> I did manage to run it with a smaller dataset (ie, 2100 rows x 1426 columns). It took around 12hrs to run.
>>> I have a few questions and need your expertise guidance.
>>> 1) Is there any better Open source tools to use to do in one tool (eg, R Studio): prepare data, build models, validate models, test models and present data. I am looking a tool which will allow me to do the same as per the above link (Matt Peeples' blog).
>>> 2) Is there an Open source tools to perform the above which will allow me to run on top of Hadoop eco-system?
>>> 3) Can we use R Studio for windows as a client to run on top of Hadoop eco-system? If yes, please point me to the site where they have a use cases or samples.
>>> Thanks and Regards,
>>> Truong Phan
>>> [[alternative HTML version deleted]]
>>> R-help at r-project.org mailing list
>>> PLEASE do read the posting guide
>>> and provide commented, minimal, self-contained, reproducible code.
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com