[Rd] readLines() segfaults on large file & question on how to work around

Jennifer Lyon jennifer.s.lyon at gmail.com
Sun Sep 3 20:50:49 CEST 2017


Jeroen:

Thank you for pointing me to ndjson, which I had not heard of and is
exactly my case.

My experience:
jsonlite::stream_in - segfaults
ndjson::stream_in - my fault: I am running Ubuntu 14.04, which is too old
      for the package to compile
corpus::read_ndjson - works!!! Of course it does a different simplification
     than jsonlite::fromJSON, so I have to change some code, but it works
     beautifully, at least in simple tests. The memory-map option may be of
     use in the future (a sketch follows this list).

Another correspondent said that strings in R can be at most 2^31 - 1 bytes
long, which is why any "solution" that tries to load the whole file into R
as a single string will fail. A chunked workaround is sketched below.
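
A minimal sketch of that workaround, processing the file in chunks so no
single string ever approaches the limit (file name and chunk size are
illustrative):

    con <- file("big.ndjson", open = "r")
    while (length(lines <- readLines(con, n = 10000)) > 0) {
      # each line is one JSON record; parse the lines individually
      records <- lapply(lines, jsonlite::fromJSON)
      # ... process records here ...
    }
    close(con)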

Thanks for suggesting a path forward for me!

Jen

On Sun, Sep 3, 2017 at 2:15 AM, Jeroen Ooms <jeroenooms at gmail.com> wrote:

> On Sat, Sep 2, 2017 at 8:58 PM, Jennifer Lyon <jennifer.s.lyon at gmail.com>
> wrote:
> > I have a 2.1GB JSON file. Typically I use readLines() and
> > jsonlite::fromJSON() to extract data from a JSON file.
>
> If your data consists of one json object per line, this is called
> 'ndjson'. There are several packages specialized for reading ndjson files:
>
>  - corpus::read_ndjson
>  - ndjson::stream_in
>  - jsonlite::stream_in
>
> In particular, the 'corpus' package handles large files really well
> because it has an option to memory-map the file instead of reading all
> of its data into memory.
>
> If the data is too large to read, you can preprocess it using
> https://stedolan.github.io/jq/ to extract the fields that you need.
>
> You really don't need hadoop/spark/etc for this.
>
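
Not from the thread itself, but here is a minimal sketch of the jq
preprocessing step Jeroen suggests, run from R; the field names "id" and
"text" are hypothetical placeholders:

    # keep only the fields you need, one compact JSON object per line
    system2("jq",
            args = c("-c", shQuote("{id: .id, text: .text}"), "big.ndjson"),
            stdout = "small.ndjson")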
