[R] duckdb table from multiple csv files

John Kane jrkr|de@u @end|ng |rom gm@||@com
Mon May 25 15:45:46 CEST 2026


If Naresh only needs to work with a few files at a time, it is possible to
load (concatenate) them by substituting

read_csv(['data/**/file1.csv', 'data/**/file2.csv', 'data/**/file3.csv'])

(and so on, with the file names passed as a list in square brackets) for

read_csv('data/**/*.csv')

Listing files by hand is unreasonable for all ~10,000 files and looks
likely to be error-prone, but if only a fairly small number of subsets of
those files is needed, it should work.
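
To cut down on typing errors, the list can be built from an R character
vector instead of being written out by hand. What follows is only a rough
sketch, assuming the DBI and duckdb packages are installed. The toy CSV
files, the view name flights_subset and the variable names are made up for
illustration and are not from Naresh's data.

```r
library(DBI)

# Toy data so the sketch runs on its own: three small CSV files
# with identical columns (hypothetical stand-ins for the real files).
csv_dir <- file.path(normalizePath(tempdir(), winslash = "/"), "subset_demo")
dir.create(csv_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(id = i, value = i * 10),
            file.path(csv_dir, paste0("file", i, ".csv")),
            row.names = FALSE)
}

con <- dbConnect(duckdb::duckdb())

# The subset wanted for this session: file1 and file3 only.
files <- file.path(csv_dir, c("file1.csv", "file3.csv"))

# Quote each path and wrap them in a DuckDB list literal: ['a.csv', 'b.csv']
quoted <- as.character(dbQuoteString(con, files))
file_list <- paste0("[", paste(quoted, collapse = ", "), "]")

# Same view definition as in Jan's code, but over the chosen files only.
sql <- paste0("CREATE OR REPLACE VIEW flights_subset AS ",
              "SELECT * FROM read_csv(", file_list, ");")
dbExecute(con, sql)

subset_rows <- dbGetQuery(con, "SELECT * FROM flights_subset ORDER BY id;")
dbDisconnect(con, shutdown = TRUE)
```

The vector of paths could just as well come from list.files() with a
pattern, which is probably less error-prone than typing names.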

There is nothing like brute force and a larger hammer.
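
On Jan's point below about re-saving the data in another format: one
possible reading is a one-time conversion of the CSV files to Parquet, so
that later queries do not have to re-parse the CSV text. Parquet is my
assumption here, Jan did not name a format. Again only a sketch with toy
files; the directory, file and view names are hypothetical.

```r
library(DBI)

# Toy CSV files with identical columns, standing in for the real data.
base_dir <- normalizePath(tempdir(), winslash = "/")
csv_dir <- file.path(base_dir, "parquet_demo")
dir.create(csv_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(id = i, value = i * 10),
            file.path(csv_dir, paste0("file", i, ".csv")),
            row.names = FALSE)
}

con <- dbConnect(duckdb::duckdb())
parquet_file <- file.path(base_dir, "flights_demo.parquet")

# One-time conversion: read every CSV in the directory, write one Parquet file.
dbExecute(con, sprintf(
  "COPY (SELECT * FROM read_csv('%s/*.csv')) TO '%s' (FORMAT PARQUET);",
  csv_dir, parquet_file))

# Later sessions can point the view at the Parquet file instead of the CSVs.
dbExecute(con, sprintf(
  "CREATE OR REPLACE VIEW flights AS SELECT * FROM read_parquet('%s');",
  parquet_file))

n_rows <- dbGetQuery(con, "SELECT count(*) AS n FROM flights;")$n
dbDisconnect(con, shutdown = TRUE)
```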

On Mon, 25 May 2026 at 07:45, Jan van der Laan <rhelp using eoos.dds.nl> wrote:

> On 5/25/26 04:46, Naresh Gurbuxani wrote:
>
> >>
> >> "If all the data were in a few files, then in memory duckdb would
> >> work."
> >>
> > I only need a subset of the data at any time.  Duckdb allows a virtual
> > table for each file.  This is not practical with thousands of files.
> > With a few large files, this can work.  Here the goal is to establish
> > a connection, not to load all data at once.
>
> If the files have the same columns, you can also open all files
> into one virtual database using duckdb. The code below creates a virtual
> table view called 'flights' with the data from all csv files in data/.
>
> con <- duckdb::dbConnect(duckdb::duckdb())
>
> sql <- paste0("CREATE OR REPLACE VIEW flights AS ",
>    "SELECT * FROM read_csv('data/**/*.csv');")
> DBI::dbExecute(con, sql)
>
> DBI::dbListTables(con)
>
> DBI::dbGetQuery(con, "SELECT * FROM flights;")
>
>
> duckdb is fast and will do things in parallel, but for every query it
> will have to go through all files. Going through 200GB of data will take
> time. So, if you have to query the data repeatedly it is probably going
> to speed up your code significantly if you resave your data in another
> format.
>
> HTH,
>
> Jan
>


-- 
John Kane
Kingston ON Canada
