[R] how to use large data set ?

Dylan Beaudette dylan.beaudette at gmail.com
Wed Jul 19 23:11:31 CEST 2006


Hi,

While R is generally flexible enough for just about anything you can throw at 
it, detailed analysis of imagery might be better accomplished in a 
specialized piece of software. One option might be GRASS, which would allow 
you to do further processing on a subset of the original data in R.
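
On the out-of-memory question quoted below, one database-free approach is to stream the data from disk in chunks and keep only running totals in RAM. Here is a minimal sketch in base R, assuming the pixel values have first been exported to a plain CSV file. The function name chunked_mean_sd, the column name band1 and the chunk size are made up for illustration and are not part of rgdal or GRASS:

```r
# Stream a large CSV file in fixed-size chunks and accumulate the mean and
# standard deviation of one column without holding all rows in memory.
# 'path', 'column' and 'chunk_size' are illustrative names (assumptions).
chunked_mean_sd <- function(path, column, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # The first line holds the (unquoted) column names.
  header <- strsplit(readLines(con, n = 1L), ",", fixed = TRUE)[[1]]

  n <- 0; s <- 0; ss <- 0
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break          # end of file reached
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    x <- chunk[[column]]
    x <- x[!is.na(x)]
    n  <- n + length(x)                     # running count
    s  <- s + sum(x)                        # running sum
    ss <- ss + sum(x^2)                     # running sum of squares
  }
  m <- s / n
  # Sum-of-squares formula: simple, but can lose precision for huge values.
  list(n = n, mean = m, sd = sqrt((ss - n * m^2) / (n - 1)))
}

# Demo on a small synthetic file; a real image export would be far larger.
tmp <- tempfile(fileext = ".csv")
set.seed(1)
demo <- data.frame(band1 = rnorm(1000, mean = 50, sd = 10))
write.csv(demo, tmp, row.names = FALSE, quote = FALSE)
res <- chunked_mean_sd(tmp, "band1", chunk_size = 100L)
print(res)
```

Only one chunk is in memory at a time, so peak usage is set by chunk_size rather than by the size of the file. The same pattern extends to any statistic that can be updated incrementally.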

Cheers,

Dylan

On Wednesday 19 July 2006 13:22, mahesh r wrote:
> Hi,
> I would like to extend the query posted earlier on using large
> databases. I am trying to use rgdal to mine remote sensing imagery. I
> don't have problems bringing the images into the R environment.
> But when I try to convert the images to a data.frame I receive a warning
> message from R saying "1: Reached total allocation of 510Mb: see
> help(memory.size)" and the process terminates. Due to project constraints I
> have been given a very old 2.4 GHz computer with only 512 MB RAM. I think R
> is currently trying to store the results in RAM, and since the image is very
> big (some 9 million pixels), it runs out of memory.
>
> My questions are:
> 1. Is there any possibility to dump the temporary variables to a temp
> folder on the hard disk (as many software packages do) instead of letting R
> store them in RAM?
> 2. Could this be done without creating a connection to a back-end
> database like Oracle?
>
> Thanks,
>
> Mahesh
>
> On 7/19/06, Greg Snow <Greg.Snow at intermountainmail.org> wrote:
> > You did not say what analysis you want to do, but many common analyses
> > can be done as special cases of regression models and you can use the
> > biglm package to do regression models.
> >
> > Here is an example that worked for me to get the mean and standard
> > deviation by day from an Oracle database with over 23 million rows (I
> > had previously set up 'edw' as an ODBC connection to the database under
> > Windows; any of the database connection packages should work for you,
> > though):
> >
> > library(RODBC)
> > library(biglm)
> >
> > con <- odbcConnect('edw',uid='glsnow',pwd=pass)
> >
> > # Submit the query; rows are then fetched in blocks with sqlGetResults()
> > odbcQuery(con, "select ADMSN_WEEKDAY_CD, LOS_DYS from CM.CASEMIX_SMRY")
> >
> > t1 <- Sys.time()
> >
> > # First block of 100,000 rows: clean it and use it to initialise the fit
> > tmp <- sqlGetResults(con, max=100000)
> >
> > names(tmp) <- c("Day","LoS")
> > tmp$Day <- factor(tmp$Day, levels=as.character(1:7))
> > tmp <- na.omit(tmp)
> > tmp <- subset(tmp, LoS > 0)
> >
> > ff <- log(LoS) ~ Day
> >
> > fit <- biglm(ff, tmp)
> >
> > # Remaining blocks: once no data frame is returned, nrow() is NULL
> > # and the loop ends
> > i <- nrow(tmp)
> > while( !is.null(nrow( tmp <- sqlGetResults(con, max=100000) ) ) ){
> >         names(tmp) <- c("Day","LoS")
> >         tmp$Day <- factor(tmp$Day, levels=as.character(1:7))
> >         tmp <- na.omit(tmp)
> >         tmp <- subset(tmp, LoS > 0)
> >
> >         # Fold this block into the existing fit
> >         fit <- update(fit,tmp)
> >
> >         i <- i + nrow(tmp)
> >         cat(format(i,big.mark=',')," rows processed\n")
> > }
> >
> > summary(fit)
> >
> > t2 <- Sys.time()
> >
> > t2-t1
> >
> > odbcClose(con)
> >
> > Hope this helps,
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > greg.snow at intermountainmail.org
> > (801) 408-8111
> >
> >
> > -----Original Message-----
> > From: r-help-bounces at stat.math.ethz.ch
> > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Yohan CHOUKROUN
> > Sent: Wednesday, July 19, 2006 9:42 AM
> > To: 'r-help at stat.math.ethz.ch'
> > Subject: [R] how to use large data set ?
> >
> > Hello R users,
> >
> > Sorry for my English, I'm French.
> >
> > I want to use a large dataset (3 million rows and 70 variables) but I
> > don't know how to do it because my computer crashes quickly (P4 2.8 GHz,
> > 1 GB).
> >
> > I also have a dual Xeon with 2 GB, so I want to do the computation on that
> > machine and show the results on mine. Both of them run Windows XP.
> >
> > In short, I have:
> >
> > 1 server with a MySQL database
> > 1 computer
> >
> > and I want to use them with a large dataset.
> >
> > I'm trying to use RDCOM to connect to the database and to install Rpad
> > (but it's hard for me).
> >
> > Are there other solutions?
> >
> > Thanks in advance,
> >
> > Yohan C.
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.

-- 
Dylan Beaudette
Soils and Biogeochemistry Graduate Group
University of California at Davis
530.754.7341


