[R] readLines without skipNul=TRUE causes crash

Anthony Damico ajdamico at gmail.com
Sat Jul 15 17:33:50 CEST 2017


hi, i realized that the segfault happens on the text file in a new R
session.  so, creating the segfault-generating text file requires a
contributed package, but prompting the actual segfault does not -- pretty
sure that means this is a base R bug?  submitted here:
https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i am
not doing something remarkably stupid.  the text file itself is 4GB so
cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the
previous message, i think most or all of it needs to be there to trigger
the segfault.  thanks!
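
since the file is too big to share, one way to describe it to others is to count its nul bytes and find its longest line without ever calling readLines.  below is a minimal sketch of that idea, reading the file in raw chunks.  the names `scan_bytes` and `chunk_size` are made up for illustration and are not from lodown, archive, or base R, and the tiny demo file stands in for the real 4GB one.  line lengths are counted in bytes between "\n" characters, so a trailing "\r" would be included on windows-style files.

```r
# Sketch only: summarize nul bytes and line lengths from raw chunks.
# `scan_bytes` and `chunk_size` are hypothetical names for this example.
scan_bytes <- function( path , chunk_size = 1e6 ) {
    con <- file( path , "rb" )
    on.exit( close( con ) )

    # doubles rather than integers, so counts can exceed 2^31 on huge files
    n_nul <- 0 ; n_newline <- 0 ; longest <- 0 ; current <- 0

    repeat {
        chunk <- readBin( con , what = "raw" , n = chunk_size )
        if ( length( chunk ) == 0 ) break

        n_nul <- n_nul + sum( chunk == as.raw( 0 ) )

        # positions of newline bytes ("\n" is 0x0a) within this chunk
        nl <- which( chunk == as.raw( 10 ) )

        if ( length( nl ) > 0 ) {
            # bytes between consecutive newlines; the first line also
            # includes whatever was carried over from earlier chunks
            lens <- diff( c( 0 , nl ) ) - 1
            lens[ 1 ] <- lens[ 1 ] + current
            longest <- max( longest , lens )
            current <- length( chunk ) - nl[ length( nl ) ]
            n_newline <- n_newline + length( nl )
        } else {
            current <- current + length( chunk )
        }
    }

    # account for a final line with no trailing newline
    longest <- max( longest , current )

    list( nul_bytes = n_nul , newlines = n_newline , longest_line = longest )
}

# tiny demo file: "abc", then "d<nul>e<nul>f", then "gh"
demo_file <- tempfile()
writeBin(
    c(
        charToRaw( "abc\n" ) ,
        as.raw( c( 0x64 , 0x00 , 0x65 , 0x00 , 0x66 ) ) ,
        charToRaw( "\ngh\n" )
    ) ,
    demo_file
)

# a deliberately small chunk size so lines span several chunks
res <- scan_bytes( demo_file , chunk_size = 3 )
print( res )
```

on the real file this would be called as `scan_bytes( infile )`; the three numbers it returns could go into the bug report in place of the file itself.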


On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <ajdamico at gmail.com> wrote:

> hi, thanks Dr. Murdoch
>
>
> i'd appreciate if anyone on r-help could help me narrow this down?  i
> believe the segfault occurs because there's a single line with 4GB and also
> embedded nuls, but i am not sure how to artificially construct that?
>
>
> the lodown package can be removed from my example..  it is just for file
> download caching, so `lodown::cachaca` can be replaced with
> `download.file`  my current example requires a huge download, so sort of
> painful to repeat but i'm pretty confident that's not the issue.
>
>
> the archive::archive_extract() function unzips a (probably corrupt) .RAR
> file and creates a text file with 80,937 lines.  this file is 4GB:
>
>     > file.size(infile)
>     [1] 4078192743
>
>
> i am pretty sure that nearly all of that 4GB is contained on a single line
> in the file.  here's what happens when i create a file connection and scan
> through..
>
>     > w <- file( infile , 'r' )
>     >
>     > first_80936_lines <- readLines( w , n = 80936 )
>     > scan( w , n = 1 , what = character() )
>     Read 1 item
>     [1] "1000023930632009"
>     > scan( w , n = 1 , what = character() )
>     Read 1 item
>     [1] "36F2924009PAULO"
>     > scan( w , n = 1 , what = character() )
>     Read 1 item
>     [1] "AFONSO"
>     > scan( w , n = 1 , what = character() )
>     Read 1 item
>     [1] "BA11"
>     > scan( w , n = 1 , what = character() )
>     Read 1 item
>     [1] "00000"
>     > scan( w , n = 1 , what = character() )
>     Read 1 item
>     [1] "00"
>     > scan( w , n = 1 , what = character() )
>     Read 1 item
>     [1] "2924009PAULO"
>     > scan( w , n = 1 , what = character() )
>     Read 1 item
>     [1] "AFONSO"
>     > scan( w , n = 1 , what = character() )
>     Read 1 item
>     [1] "BA1111"
>     > scan( w , n = 1 , what = character() )
>     Read 1 item
>     [1] "467.20"
>     > scan( w , n = 1 , what = character() )
>     Read 1 item
>     [1] "346.10"
>     > scan( w , n = 1 , what = character() )
>     Read 1 item
>     [1] "414.40"
>     > scan( w , n = 1 , what = character() )
>     Error in scan(w, n = 1, what = character()) :
>       could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
>
>
>
> making a huge single-line file does not reproduce the problem, i think the
> embedded nuls have something to do with it--
>
>
>     # WARNING do not run with less than 64GB RAM
>     tf <- tempfile()
>     a <- rep( "a" , 1000000000 )
>     b <- paste( a , collapse = '' )
>     writeLines( b , tf ) ; rm( b ) ; gc()
>     d <- readLines( tf )
>
>
>
> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <murdoch.duncan at gmail.com>
> wrote:
>
>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>>
>>> hello, the last line of the code below causes a segfault for me on 3.4.1.
>>> i think i should submit to https://bugs.r-project.org/  unless others have
>>> advice?  thanks
>>>
>>
>> Segfaults are usually worth reporting as bugs.  Try to come up with a
>> self-contained example, not using the lodown and archive packages.  I
>> imagine you can do this by uploading the file you downloaded, or enough of
>> a subset of it to trigger the segfault.  If you can't do that, then likely
>> the bug is with one of those packages, not with R.
>>
>> Duncan Murdoch
>>
>>
>>>
>>>
>>>
>>>
>>> install.packages( "devtools" )
>>> devtools::install_github("ajdamico/lodown")
>>> devtools::install_github("jimhester/archive")
>>>
>>>
>>> file_folder <- file.path( tempdir() , "file_folder" )
>>>
>>> tf <- tempfile()
>>>
>>> # large download!  cachaca saves on your local disk if already downloaded
>>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )
>>>
>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>>>
>>> unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )
>>>
>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>>>
>>> # works
>>> R.utils::countLines( infile )
>>>
>>> # works with warning
>>> my_file <- readLines( infile , skipNul = TRUE )
>>>
>>> # crash
>>> my_file <- readLines( infile )
>>>
>>>
>>> # run just before crash
>>> sessionInfo()
>>> # R version 3.4.1 (2017-06-30)
>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>>> # Running under: Windows 10 x64 (build 15063)
>>>
>>> # Matrix products: default
>>>
>>> # locale:
>>> # [1] LC_COLLATE=English_United States.1252
>>> # [2] LC_CTYPE=English_United States.1252
>>> # [3] LC_MONETARY=English_United States.1252
>>> # [4] LC_NUMERIC=C
>>> # [5] LC_TIME=English_United States.1252
>>>
>>> # attached base packages:
>>> # [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> # loaded via a namespace (and not attached):
>>> #  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
>>> #  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
>>> #  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
>>> # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
>>> # [17] archive_0.0.0.9000
>>>
>>>
>>
>
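
on the question above of how to artificially construct a file with embedded nuls: a small-scale sketch is below.  it writes raw bytes with `writeBin` so the nul survives, then reads the file back both ways.  this is an assumption-laden toy, only a few bytes long, and it is not expected to reproduce the segfault, since the thread suggests the 4GB line length matters too.  the variable names are made up for illustration.

```r
# Sketch only: a tiny file whose first line contains an embedded nul.
nul_file <- tempfile()

# bytes on disk: "ab" <nul> "cd" newline "ef" newline
writeBin(
    c( charToRaw( "ab" ) , as.raw( 0 ) , charToRaw( "cd\nef\n" ) ) ,
    nul_file
)

# skipNul = TRUE drops the nul byte and keeps the rest of the line
with_skip <- readLines( nul_file , skipNul = TRUE )

# the default is documented to end the current line at the nul and warn;
# the warning is suppressed here only to keep the demo output quiet
without_skip <- suppressWarnings( readLines( nul_file ) )

print( with_skip )
print( without_skip )
```

scaling the same idea up would mean writing the raw vector in pieces through a binary connection rather than building one giant vector in memory.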
