[R] Fast update of a lot of records in a database?

Duncan Murdoch murdoch at stats.uwo.ca
Sat May 20 17:22:54 CEST 2006


On 5/20/2006 8:15 AM, Steve Miller wrote:
> Though I'd question the prudence of doing mass database changes through R: 
> 
> Other things equal, single updates in a loop should be slower than a
> correlated update due to the performance costs of bind and execute (and
> perhaps even parse). Also, updates are generally slower than inserts because
> of rollback and logging, and the potential of introducing chained blocks to
> hold the updates. The correlated temporary table update should be faster
> than the original, provided indexing is appropriate (i.e. an index on
> bigtable.id).
> 
> Any chance you could create a new bigtable as a join select of old bigtable
> and updates? This would be faster still. 

Some context would probably be helpful here:

The table will eventually record the results of a long simulation.  This 
step is just initializing some columns of the table that will be used to 
target the following simulations.  That is, there are 400000 records in 
the table, and we need to select which ones to work on.  The criterion 
we're using is relatively straightforward to calculate (takes about a 
minute on an in-memory copy of the full table).  The simulations 
themselves are slow and memory intensive, so we need to do them in small 
batches (approx 100 records at a time); after each batch, the criterion 
for that batch will be updated in order to target the next batch.
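
In outline the loop is something like the skeleton below. It is only a sketch: run_simulation() stands in for the real (slow) work, 'criterion' is a hypothetical numeric column on the table, and 'con' is a DBI connection as in the sketch above.

  # Skeleton of the targeting loop; assumes bigtable has an id column and a
  # numeric criterion column.  run_simulation() is a placeholder.
  run_simulation <- function(ids) rnorm(length(ids))

  for (i in 1:5) {                      # a few batches, for illustration
    # pick the ~100 records that currently look most promising
    batch <- dbGetQuery(con,
      "SELECT id FROM bigtable ORDER BY criterion DESC LIMIT 100")
    new_crit <- run_simulation(batch$id)
    # ... write new_crit back for these ids (see the write-back sketch below) ...
  }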

So the 400000 updates I'm making now are for values calculated in R 
(though the values themselves aren't particularly valuable), and I think 
it makes sense to use R to do the database updates.

It turns out that the batch method of updating is at least 20 times 
faster than the original.  With the original, we gave up after a day of 
waiting; the new version takes about an hour.  Since we only need to do 
this part once, this is good enough.
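
One way to do that kind of bulk write-back from R is the temporary-table approach suggested earlier in the thread: push the values computed in R into a scratch table with dbWriteTable() and issue a single correlated UPDATE. A rough sketch, with made-up names and the correlated-subquery form so it also runs on SQLite:

  # 'vals' holds the newly computed criterion for every id (dummy values here).
  vals <- data.frame(id = 1:400000, criterion = runif(400000))

  dbWriteTable(con, "crit_updates", vals, overwrite = TRUE)
  dbExecute(con, "
    UPDATE bigtable
       SET criterion = (SELECT criterion FROM crit_updates
                         WHERE crit_updates.id = bigtable.id)
     WHERE id IN (SELECT id FROM crit_updates)")

As Steve notes, the gain depends on having an index on bigtable.id so the correlated lookup is cheap.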

Thanks to everyone for their help and suggestions.

Duncan Murdoch

> 
> Steve Miller
> 
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of hadley wickham
> Sent: Friday, May 19, 2006 2:20 PM
> To: Duncan Murdoch
> Cc: Robert.McGehee at geodecapital.com; r-help at stat.math.ethz.ch;
> br44114 at gmail.com
> Subject: Re: [R] Fast update of a lot of records in a database?
> 
>> put the updates into a temporary table called updates
>>
>> UPDATE bigtable AS a
>>    SET col1 = b.col1
>>    FROM updates AS b
>>    WHERE a.id = b.id
> 
> I don't think this will be any faster - why would creating a new table
> be faster than updating existing rows?
> 
> I've never had a problem with using large numbers of SQL update
> statements (in the order of hundreds of thousands) to update a table
> and having them complete in a reasonable time (a few minutes).  How
> heavily indexed is the field you are updating?  You may be able to get
> some speed improvements by turning off indices before the update and
> back on again afterwards (highly dependent on your database system
> though).
> 
> I would strongly suspect your bottleneck lies elsewhere (e.g. when
> generating the statements in R, or using ODBC to send them)
> 
> Hadley
> 
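
For what it's worth, the index on/off trick Hadley mentions above amounts to something like this (index and column names are made up, and the DDL syntax varies a bit between databases):

  # Drop the index on the column being updated, rebuild it afterwards;
  # keep the index on id, which the UPDATE's WHERE clause needs.
  dbExecute(con, "DROP INDEX IF EXISTS bigtable_col1_idx")
  # ... run the mass UPDATE here ...
  dbExecute(con, "CREATE INDEX bigtable_col1_idx ON bigtable (col1)")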


