[Rd] readLines() segfaults on large file & question on how to work around

Suzen, Mehmet msuzen at gmail.com
Sun Sep 3 01:27:43 CEST 2017


Jennifer, why don't you try SparkR?

https://spark.apache.org/docs/1.6.1/api/R/read.json.html
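
Something along these lines might work (an untested sketch, assuming a
local Spark installation and the 1.6 API; "file.json" stands in for your
file):

  library(SparkR)
  sc <- sparkR.init(master = "local[*]")
  sqlContext <- sparkRSQL.init(sc)
  # read.json returns a distributed DataFrame rather than an in-memory R object
  df <- read.json(sqlContext, "file.json")
  head(df)
  sparkR.stop()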

On 2 September 2017 at 23:15, Jennifer Lyon <jennifer.s.lyon at gmail.com> wrote:
> Thank you for your suggestion. Unfortunately, while R doesn't segfault
> calling readr::read_file() on the test file I described, I get the error
> message:
>
> Error in read_file_(ds, locale) : negative length vectors are not allowed
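>
> (Presumably read_file fails because the whole file would have to become a
> single R string, and one string cannot exceed 2^31 - 1 bytes; as an
> illustration, on the test file:
>
>   file.size("file.txt") > .Machine$integer.max
>   # [1] TRUE
> )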
>
> Jen
>
> On Sat, Sep 2, 2017 at 1:38 PM, Ista Zahn <istazahn at gmail.com> wrote:
>
>> As a work-around I suggest readr::read_file.
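>>
>> Something along these lines (a minimal sketch; the file name is just a
>> placeholder):
>>
>>   txt <- readr::read_file("yourfile.json")
>>   dat <- jsonlite::fromJSON(txt)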
>>
>> --Ista
>>
>>
>> On Sep 2, 2017 2:58 PM, "Jennifer Lyon" <jennifer.s.lyon at gmail.com> wrote:
>>
>>> Hi:
>>>
>>> I have a 2.1GB JSON file. Typically I use readLines() and
>>> jsonlite::fromJSON() to extract data from a JSON file.
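>>> (Roughly this pattern, just for context; "input.json" is a placeholder:
>>>   lines <- readLines("input.json")
>>>   dat <- jsonlite::fromJSON(paste(lines, collapse = ""))
>>> )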
>>>
>>> When I try to read in this file using readLines(), R segfaults.
>>>
>>> I believe the two salient issues with this file are:
>>> 1) its size
>>> 2) it is a single line (no line breaks)
>>>
>>> I can reproduce this issue as follows:
>>> # Generate a big file with no line breaks
>>> # In R
>>> > writeLines(paste0(c(letters, 0:9), collapse=""), "alpha.txt", sep="")
>>>
>>> # in unix shell
>>> cp alpha.txt file.txt
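>>> # double the file 26 times: 36 bytes * 2^26 = 2,415,919,104 bytes, all on one line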
>>> for i in {1..26}; do cat file.txt file.txt > file2.txt && mv -f file2.txt file.txt; done
>>>
>>> This generates a 2.3GB file with no line breaks
>>>
>>> in R:
>>> > moo <- readLines("file.txt")
>>>
>>>  *** caught segfault ***
>>> address 0x7cffffff, cause 'memory not mapped'
>>>
>>> Traceback:
>>>  1: readLines("file.txt")
>>>
>>> Possible actions:
>>> 1: abort (with core dump, if enabled)
>>> 2: normal R exit
>>> 3: exit R without saving workspace
>>> 4: exit R saving workspace
>>> Selection: 3
>>>
>>> I conclude:
>>> I am likely running up against a limit in R (a single character string
>>> cannot exceed 2^31 - 1 bytes, and this file is a single line larger than
>>> that), which should give a reasonable error, but currently just segfaults.
>>>
>>> My question:
>>> Most of the content of the JSON is roughly the equivalent of a 100K x 6K
>>> data frame, and I know R can handle data much bigger than that. I am
>>> expecting these JSON files to get even larger. My R code lives
>>> in a bigger system, and the JSON comes in via stdin, so I have absolutely
>>> no control over the data format. I can imagine trying to incrementally
>>> parse the JSON so I don't bump up against the limit, but I am eager for
>>> suggestions of simpler solutions.
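>>>
>>> (For illustration only, an untested sketch of the simplest alternative I
>>> can think of: hand the path straight to jsonlite and skip readLines()
>>> entirely,
>>>
>>>   dat <- jsonlite::fromJSON("file.json")
>>>
>>> though I have not verified whether that hits the same limit internally.)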
>>>
>>> Also, I apologize for the timing of this bug report, as I know folks are
>>> working to get out the next release of R, but like so many things, I
>>> have no control over when bugs leap up.
>>>
>>> Thanks.
>>>
>>> Jen
>>>
>>> > sessionInfo()
>>> R version 3.4.1 (2017-06-30)
>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>> Running under: Ubuntu 14.04.5 LTS
>>>
>>> Matrix products: default
>>> BLAS: R-3.4.1/lib/libRblas.so
>>> LAPACK: R-3.4.1/lib/libRlapack.so
>>>
>>> locale:
>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> loaded via a namespace (and not attached):
>>> [1] compiler_3.4.1
>>>
>>>
>>
>


