[R] exporting tables from an access database using parallel foreach

Vivek Sutradhara viveksutra at gmail.com
Sat Nov 21 18:55:13 CET 2015


Hi John and Jeff,
Thanks a lot for your help. I agree that row numbers are not a standard
feature in SQL. What I am looking for is some kind of a hack. After all,
the sqlFetch command is able to return a specific number of rows. And the
sqlFetchMore command is able to take up the baton from that row onwards to
futher return rows corresponding to the max parameter.

I wonder if it is possible to straight away hop to a certain row number
(without going through sqlfetch and sqlFetchMore and without loading any
data into memory) and then return the contents corresponding to a certain
number of rows. The question is : is there a "catch" for accessing a row
location, and what could be the "hook" for that? I am interested in the the
recent updated rows to a table after a certain date. Is it possible to
identify them in a quick way? Running sql queries on such large tables
appears to take too long a time.

 I understand that there is no provision to do this by available methods.
But, is it possible to get under the hood and find some hack?

Jeff, I will take your suggestion and try my luck at the R-sig-db mailing
list.
Thanks,
Vivek

2015-11-20 20:25 GMT+01:00 Jeff Newmiller <jdnewmil at dcn.davis.ca.us>:

> Row numbers are not a standard feature in SQL, and as far as I know the
> Access Jet engine does not support them. You are supposed to use the key
> columns to partition your data, but that may require knowing how many
> records fall within convenient bin sizes if the data are not uniformly
> distributed. You can find that out using SQL group by queries.
>
> Note that you the resource you appear to be limited by is the database
> engine. Parallel processing (more CPUs) is unlikely to yield any
> improvement, and is in fact likely to slow you down.
>
> This looks like a good topic for the R-sig-db mailing list if you have
> further questions about R and databases, or find a SQL support forum if you
> need to learn more about using SQL in general.
>
> On November 20, 2015 10:32:31 AM PST, Vivek Sutradhara <
> viveksutra at gmail.com> wrote:
>
>> Hi John,
>> Thanks a lot for your quick reply. And thanks for drawing my attention to
>> the openslsx package. I will certainly look into it when I work with Excel.
>> But right now, my problems are with Microsoft Access.
>>
>> There are huge tables there which I am not able to export to excel, csv or
>> text files with native access methods. The only solution that has worked so
>> far is to incrementally extract data with the the help of RODBC. This was a
>> huge leap in my attempts to export the tables. Once I have the data in form
>> of rds files (which are compressed as well), I have found that it is much
>> easier to work with them.
>>
>> But my wishes have suddenly expanded and I want to find out if it is
>> possible to go beyond the normal capabilities of RODBC (the sqlFetch
>> command does not have a provision for specifying the row number range). I
>> am a newbie with parallel methods (using the 4 cores on my pc) but I
>> am
>> hoping to progress with that for processing the data from the multiple
>> chunks of data (the first step will be just to filter and gather the data
>> of relevance).
>>
>> I hope that I have explained what I am looking for.
>> Thanks,
>> Vivek
>>
>> 2015-11-20 19:09 GMT+01:00 John McKown <john.archie.mckown at gmail.com>:
>>
>>  A possibility could be to not use ODBC, but the CRAN package openslsx (
>>>  https://cran.revolutionanalytics.com/web/packages/openxlsx/index.html ).
>>>  Then use the read.xlsx() function.
>>>  <quote>
>>>  Description Read data from an Excel file or Workbook object into a
>>>  data.frame
>>>
>>>  Usage read.xlsx(xlsxFile, sheet = 1, startRow = 1, colNames = TRUE,
>>>  rowNames = FALSE, detectDates = FALSE,
>>> skipEmptyRows = TRUE, rows = NULL,
>>>  cols = NULL, check.names = FALSE, namedRegion = NULL)
>>>
>>>  Arguments xlsxFile An xlsx file or Workbook object sheet The name or index
>>>  of the sheet to read data from.
>>>  startRow first row to begin looking for data. Empty rows at the top of a
>>>  file are always skipped, regardless of the value of startRow.
>>>  colNames If TRUE, the first row of data will be used as column names.
>>>  rowNames If TRUE, first column of data will be used as row names.
>>>  detectDates If TRUE, attempt to recognise dates and perform conversion.
>>>  skipEmptyRows If TRUE, empty rows are skipped else empty rows after the
>>>  first row containing data will return a row of NAs.
>>>  rows A numeric vector specifying which rows in the Excel file to read. If
>>>  NULL, all rows are read.
>>>  cols A numeric vector specifying which columns in the Excel file to read.
>>>  If NULL, all columns are read.
>>>  check.names logical. If TRUE then
>>> the names of the variables in the data
>>>  frame are checked to ensure that they are syntactically valid variable
>>>  names
>>>  namedRegion A named region in the Workbook. If not NULL startRow, rows and
>>>  cols paramters are ignored.
>>>  </quote>
>>>
>>>  On Fri, Nov 20, 2015 at 11:38 AM, Vivek Sutradhara <viveksutra at gmail.com>
>>>  wrote:
>>>
>>>  Hi
>>>>  I want to extract data from a Microsoft access database having many tables
>>>>  with more than 1e7 rows. I find that the following code works to export a
>>>>  table to a rds file :
>>>>  #####################
>>>>  setwd('C:/sFolder')
>>>>  library(RODBC);library(DBI)
>>>>  ch<-odbcConnect("sample")
>>>>
>>>>  #No. of rows in the table not known
>>>>  rowN<-1e6  # no. of rows defined
>>>>  db<-sqlFetch(ch,"Table1",max=rowN,as.is=TRUE)
>>>>
>>>> file<-paste0('Table1',1,'.rds')
>>>>  df1<-saveRDS(db,file1)
>>>>
>>>>  rm(db);gc()   # garbage collection to free up the memory
>>>>
>>>>  # To successively obtain more chunks from the access database
>>>>  for (i in 2:10) {
>>>>    rm(df);gc()
>>>>    df<-sqlFetchMore(ch,"Table1",max=rowN,as.is=TRUE)
>>>>    file<-paste0('Table1',i,'.rds')
>>>>    df1<-saveRDS(df,file)
>>>>    if (dim(df)[1]<rowN)
>>>>      break
>>>>  }
>>>>  rm(df);gc()
>>>>  odbcCloseAll()
>>>>  ##############################
>>>>
>>>>  I would like to know the following :
>>>>  1. Is there any way to extract data from a table by just specifying the
>>>>  row
>>>>  number range. I have extracted data before. Instead of repeating the
>>>>  operations, I would just like to obtain data from, let's say, 8e6 to 9e6
>>>>  row range. I cannot do this now. I have to successively use the
>>>>  sqlfetchMore command. I would like to know if it is possible to straight
>>>>
>>>> away go to the 8e6 to 9e6 row range.
>>>>  2. Is it possible to use the foreach package in the extraction step (in
>>>>  place of the for loop above). I am planning to use the foreach command in
>>>>  parallel later for processing the data in the multiple files. I just
>>>>  wonder
>>>>  if it is possible to do parallel processing for the data extraction also.
>>>>  Thanks,
>>>>  Vivek Sutradhara
>>>>
>>>>          [[alternative HTML version deleted]]
>>>>
>>>> ------------------------------
>>>>
>>>>  R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>  https://stat.ethz.ch/mailman/listinfo/r-help
>>>>  PLEASE do read the posting guide
>>>>  http://www.R-project.org/posting-guide.html
>>>>  and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>>
>>>
>>>
>>>  --
>>>
>>>  Schrodinger's backup: The condition of any backup is unknown until
>>> a
>>>  restore is attempted.
>>>
>>>  Yoda of Borg, we are. Futile, resistance is, yes. Assimilated, you will be.
>>>
>>>  He's about as useful as a wax frying pan.
>>>
>>>  10 to the 12th power microphones = 1 Megaphone
>>>
>>>  Maranatha! <><
>>>  John McKown
>>
>>
>>
>>  [[alternative HTML version deleted]]
>>
>> ------------------------------
>>
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>

	[[alternative HTML version deleted]]



More information about the R-help mailing list