[R] readLines without skipNul=TRUE causes crash

Jeff Newmiller jdnewmil at dcn.davis.ca.us
Sun Jul 16 00:29:15 CEST 2017


I am not able to reproduce your segfault on a Windows 7 platform either:

##########################
fn1 <- "d:/DADOS_ENEM_2009.txt"
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.4.1
tools::md5sum( fn1 )
##             d:/DADOS_ENEM_2009.txt
## "83e61c96092285b60d7bf6b0dbc7072e"
dat <- readLines( fn1 )
length( dat )
## [1] 4148721


On Sat, 15 Jul 2017, Jeff Newmiller wrote:

> I am not able to reproduce this on a Linux platform:
>
> #######################
> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
> sessionInfo()
> ## R version 3.4.1 (2017-06-30)
> ## Platform: x86_64-pc-linux-gnu (64-bit)
> ## Running under: Ubuntu 14.04.5 LTS
> ##
> ## Matrix products: default
> ## BLAS: /usr/lib/libblas/libblas.so.3.0
> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
> ##
> ## locale:
> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> ##
> ## attached base packages:
> ## [1] stats     graphics  grDevices utils     datasets  methods   base
> ##
> ## loaded via a namespace (and not attached):
> ## [1] compiler_3.4.1
> tools::md5sum( fn1 )
> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
> ##                                                "83e61c96092285b60d7bf6b0dbc7072e"
> dat <- readLines( fn1 )
> length( dat )
> ## [1] 4148721
>
> No segfault occurs.
>
> On Sat, 15 Jul 2017, Anthony Damico wrote:
>
>> hi, i realized that the segfault happens on the text file in a new R
>> session.  so, creating the segfault-generating text file requires a
>> contributed package, but prompting the actual segfault does not -- pretty
>> sure that means this is a base R bug?  submitted here:
>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i am
>> not doing something remarkably stupid.  the text file itself is 4GB so
>> cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the
>> previous message, i think most or all of it needs to be there to trigger
>> the segfault.  thanks!
>> 
>> 
>> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <ajdamico at gmail.com> 
>> wrote:
>> 
>>> hi, thanks Dr. Murdoch
>>> 
>>> 
>>> i'd appreciate if anyone on r-help could help me narrow this down?  i
>>> believe the segfault occurs because there's a single line with 4GB and
>>> also embedded nuls, but i am not sure how to artificially construct that?
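
One way to experiment at small scale: the sketch below writes a two-line file whose second line contains a single embedded NUL byte, then reads it back with and without skipNul. The file contents and variable names are made up for illustration, and a file this small is not expected to reproduce the segfault; it only shows how embedded NULs can be produced with writeBin().

```r
# Hypothetical scaled-down test file: two lines, the second with one NUL byte.
tf <- tempfile(fileext = ".txt")
con <- file(tf, "wb")
writeBin(charToRaw("first line\n"), con)
writeBin(c(charToRaw("abc"), as.raw(0), charToRaw("def\n")), con)
close(con)

# Count NUL bytes by looking at the raw contents of the file.
raw_bytes <- readBin(tf, what = "raw", n = file.size(tf))
n_nul <- sum(raw_bytes == as.raw(0))

# With skipNul = TRUE the NUL byte is dropped from the line.
with_skip <- readLines(tf, skipNul = TRUE)

# Without it, readLines() warns about the embedded nul.
without_skip <- suppressWarnings(readLines(tf))
```

According to ?readLines, an embedded nul terminates the line being read (with a warning) unless skipNul = TRUE, so comparing with_skip against without_skip shows what is lost on the affected line.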
>>> 
>>> 
>>> the lodown package can be removed from my example.  it is just for file
>>> download caching, so `lodown::cachaca` can be replaced with
>>> `download.file`.  my current example requires a huge download, so sort of
>>> painful to repeat but i'm pretty confident that's not the issue.
>>> 
>>> 
>>> the archive::archive_extract() function unzips a (probably corrupt) .RAR
>>> file and creates a text file with 80,937 lines.  this file is 4GB:
>>>
>>>    > file.size(infile)
>>>     [1] 4078192743
>>> 
>>> 
>>> i am pretty sure that nearly all of that 4GB is contained on a single line
>>> in the file.  here's what happens when i create a file connection and scan
>>> through..
>>>
>>>    > w <- file( infile , 'r' )
>>>    >
>>>    > first_80936_lines <- readLines( w , n = 80936 )
>>>    > scan( w , n = 1 , what = character() )
>>>     Read 1 item
>>>     [1] "1000023930632009"
>>>    > scan( w , n = 1 , what = character() )
>>>     Read 1 item
>>>     [1] "36F2924009PAULO"
>>>    > scan( w , n = 1 , what = character() )
>>>     Read 1 item
>>>     [1] "AFONSO"
>>>    > scan( w , n = 1 , what = character() )
>>>     Read 1 item
>>>     [1] "BA11"
>>>    > scan( w , n = 1 , what = character() )
>>>     Read 1 item
>>>     [1] "00000"
>>>    > scan( w , n = 1 , what = character() )
>>>     Read 1 item
>>>     [1] "00"
>>>    > scan( w , n = 1 , what = character() )
>>>     Read 1 item
>>>     [1] "2924009PAULO"
>>>    > scan( w , n = 1 , what = character() )
>>>     Read 1 item
>>>     [1] "AFONSO"
>>>    > scan( w , n = 1 , what = character() )
>>>     Read 1 item
>>>     [1] "BA1111"
>>>    > scan( w , n = 1 , what = character() )
>>>     Read 1 item
>>>     [1] "467.20"
>>>    > scan( w , n = 1 , what = character() )
>>>     Read 1 item
>>>     [1] "346.10"
>>>    > scan( w , n = 1 , what = character() )
>>>     Read 1 item
>>>     [1] "414.40"
>>>    > scan( w , n = 1 , what = character() )
>>>     Error in scan(w, n = 1, what = character()) :
>>>       could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
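
To check the guess that nearly all of the file sits on one line without pulling any line into an R string, the bytes can be scanned in chunks. A rough sketch, where scan_file_bytes() and chunk_size are hypothetical names and not part of base R:

```r
# Hypothetical helper: stream a file as raw bytes, counting NUL bytes and
# tracking the length in bytes of the longest line (newline = byte 10).
scan_file_bytes <- function(path, chunk_size = 1e6) {
  con <- file(path, "rb")
  on.exit(close(con))
  n_nul <- 0     # doubles, so counts above 2^31 do not overflow
  longest <- 0
  current <- 0   # bytes seen so far on the line still being read
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    n_nul <- n_nul + sum(chunk == as.raw(0))
    nl <- which(chunk == as.raw(10))
    if (length(nl) == 0L) {
      # no newline in this chunk: the current line just keeps growing
      current <- current + length(chunk)
    } else {
      # lengths of the lines that end inside this chunk
      line_len <- diff(c(0L, nl)) - 1
      line_len[1] <- line_len[1] + current
      longest <- max(longest, line_len)
      # bytes after the last newline start the next line
      current <- length(chunk) - nl[length(nl)]
    }
  }
  list(nul_bytes = n_nul, longest_line_bytes = max(longest, current))
}
```

Calling scan_file_bytes( infile ) on the extracted file would then report how many NUL bytes it holds and how long the longest line is, using only chunk_size bytes of memory at a time.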
>>> 
>>> 
>>> 
>>> making a huge single-line file does not reproduce the problem, i think the
>>> embedded nuls have something to do with it--
>>> 
>>>
>>>     # WARNING do not run with less than 64GB RAM
>>>     tf <- tempfile()
>>>     a <- rep( "a" , 1000000000 )
>>>     b <- paste( a , collapse = '' )
>>>     writeLines( b , tf ) ; rm( b ) ; gc()
>>>     d <- readLines( tf )
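
A long single line with embedded NULs can also be streamed to disk in chunks, which avoids holding the whole line in memory. This is only a sketch: write_long_line() and its arguments are hypothetical, and whether such a file triggers the segfault is untested.

```r
# Hypothetical helper: write one line of `total_bytes` bytes of "a",
# optionally placing a NUL byte at every `nul_every`-th position of each chunk.
write_long_line <- function(path, total_bytes, chunk_bytes = 1e6, nul_every = 0) {
  con <- file(path, "wb")
  on.exit(close(con))
  chunk <- rep(charToRaw("a"), chunk_bytes)
  if (nul_every > 0 && nul_every <= chunk_bytes) {
    chunk[seq(nul_every, chunk_bytes, by = nul_every)] <- as.raw(0)
  }
  remaining <- total_bytes
  while (remaining > 0) {
    n <- min(remaining, chunk_bytes)
    writeBin(chunk[seq_len(n)], con)
    remaining <- remaining - n
  }
  writeBin(as.raw(10), con)  # single terminating newline
  invisible(path)
}

# Small demonstration; scale total_bytes up only if disk space allows.
demo_file <- write_long_line(tempfile(), total_bytes = 2500,
                             chunk_bytes = 1000, nul_every = 100)
```

Only chunk_bytes bytes are held in memory while writing, so the 64GB RAM warning above applies to reading the result back, not to creating it.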
>>> 
>>> 
>>> 
>>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <murdoch.duncan at gmail.com>
>>> wrote:
>>> 
>>>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>>>> 
>>>>> hello, the last line of the code below causes a segfault for me on 3.4.1.
>>>>> i think i should submit to https://bugs.r-project.org/  unless others
>>>>> have advice?  thanks
>>>>> 
>>>> 
>>>> Segfaults are usually worth reporting as bugs.  Try to come up with a
>>>> self-contained example, not using the lodown and archive packages.  I
>>>> imagine you can do this by uploading the file you downloaded, or enough
>>>> of a subset of it to trigger the segfault.  If you can't do that, then
>>>> likely the bug is with one of those packages, not with R.
>>>> 
>>>> Duncan Murdoch
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> install.packages( "devtools" )
>>>>> devtools::install_github("ajdamico/lodown")
>>>>> devtools::install_github("jimhester/archive")
>>>>> 
>>>>> 
>>>>> file_folder <- file.path( tempdir() , "file_folder" )
>>>>> 
>>>>> tf <- tempfile()
>>>>> 
>>>>> # large download!  cachaca saves on your local disk if already downloaded
>>>>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' ,
>>>>>     tf , mode = 'wb' )
>>>>> 
>>>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>>>>> 
>>>>> unzipped_files <- list.files( file_folder , recursive = TRUE ,
>>>>>     full.names = TRUE )
>>>>> 
>>>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>>>>> 
>>>>> # works
>>>>> R.utils::countLines( infile )
>>>>> 
>>>>> # works with warning
>>>>> my_file <- readLines( infile , skipNul = TRUE )
>>>>> 
>>>>> # crash
>>>>> my_file <- readLines( infile )
>>>>> 
>>>>> 
>>>>> # run just before crash
>>>>> sessionInfo()
>>>>> # R version 3.4.1 (2017-06-30)
>>>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>> # Running under: Windows 10 x64 (build 15063)
>>>>> 
>>>>> # Matrix products: default
>>>>> 
>>>>> # locale:
>>>>> # [1] LC_COLLATE=English_United States.1252
>>>>> # [2] LC_CTYPE=English_United States.1252
>>>>> # [3] LC_MONETARY=English_United States.1252
>>>>> # [4] LC_NUMERIC=C
>>>>> # [5] LC_TIME=English_United States.1252
>>>>> 
>>>>> # attached base packages:
>>>>> # [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>> 
>>>>> # loaded via a namespace (and not attached):
>>>>> #  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
>>>>> #  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
>>>>> #  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
>>>>> # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
>>>>> # [17] archive_0.0.0.9000
>>>>>
>>>>> 
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>> 
>>>>> 
>>>> 
>>> 
>>
>> 
>
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>                                      Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>
>



