[R] readLines without skipNul=TRUE causes crash

William Dunlap wdunlap at tibco.com
Sun Jul 16 01:28:03 CEST 2017


I see the problem on Windows 10, R-3.4.0, R.exe.  It is not compiled for
debugging but gdb gives some information when I attach the debugger after
the 'R..has stopped working' popup appears.  I don't know how reliable it
is:

(gdb) info threads
  Id   Target Id         Frame
* 4    Thread 11848.0x1500 0x00007ffe38dc8861 in ntdll!DbgBreakPoint () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
  3    Thread 11848.0x2e90 0x00007ffe38dc87e4 in ntdll!ZwWaitForWorkViaWorkerFactory () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
  2    Thread 11848.0x3618 0x00007ffe38dc5154 in ntdll!ZwWaitForSingleObject () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
  1    Thread 11848.0x1808 0x000000006c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
(gdb) thread 1
[Switching to thread 1 (Thread 11848.0x1808)]
#0  0x000000006c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
(gdb) where
#0  0x000000006c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#1  0x000000006c7d8919 in R_initAssignSymbols () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#2  0x000000006c7ef961 in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#3  0x000000006c7f1b70 in R_cmpfun1 () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#4  0x000000006c7f1ef2 in Rf_applyClosure () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#5  0x000000006c7efaf7 in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#6  0x000000006c7f3816 in R_execMethod () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#7  0x000000006c7efcdf in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#8  0x000000006c81053c in Rf_ReplIteration () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#9  0x000000006c810902 in Rf_ReplIteration () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#10 0x000000006c810992 in run_Rmainloop () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#11 0x000000000040171c in ?? ()
#12 0x000000000040155a in ?? ()
#13 0x00000000004013e8 in ?? ()
#14 0x000000000040151b in ?? ()
#15 0x00007ffe37868102 in KERNEL32!BaseThreadInitThunk () from /cygdrive/c/WINDOWS/system32/KERNEL32.DLL
#16 0x00007ffe38d7c5b4 in ntdll!RtlUserThreadStart () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
#17 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)
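
For anyone who wants to poke at the embedded-nul handling without the 4GB
download, here is a small sketch in R.  It only illustrates how nul bytes
can be written to a file, counted, and skipped by readLines(); I would not
expect it to reproduce the crash, which seems to need the very long line.
count_nuls() is a made-up helper for this illustration, not a base R
function, and the file contents are invented.

```r
# Illustration only: a tiny file with one embedded nul on its first line.
# This is an assumption-laden toy, not the ENEM file from the thread.
tf <- tempfile()
con <- file(tf, "wb")
writeBin(c(charToRaw("abc"), as.raw(0), charToRaw("def\nghi\n")), con)
close(con)

# Hypothetical helper: count nul bytes by reading the file in fixed-size
# raw chunks, so a multi-GB file never has to fit in memory at once.
count_nuls <- function(path, chunk_size = 1e6) {
  con <- file(path, "rb")
  on.exit(close(con))
  total <- 0
  repeat {
    bytes <- readBin(con, what = "raw", n = chunk_size)
    if (length(bytes) == 0L) break          # end of file
    total <- total + sum(bytes == as.raw(0))
  }
  total
}

n_nul <- count_nuls(tf)

# With skipNul = TRUE the nul is dropped and the rest of the line is kept.
kept <- readLines(tf, skipNul = TRUE)
```

Running count_nuls() on the real file before calling readLines() would at
least confirm whether nuls are present and roughly how many.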

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Sat, Jul 15, 2017 at 3:29 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us>
wrote:

> I am not able to reproduce your segfault on a Windows 7 platform either:
>
> ##########################
> fn1 <- "d:/DADOS_ENEM_2009.txt"
> sessionInfo()
> ## R version 3.4.1 (2017-06-30)
> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
> ##
> ## Matrix products: default
> ##
> ## locale:
> ## [1] LC_COLLATE=English_United States.1252
> ## [2] LC_CTYPE=English_United States.1252
> ## [3] LC_MONETARY=English_United States.1252
> ## [4] LC_NUMERIC=C
> ## [5] LC_TIME=English_United States.1252
> ##
> ## attached base packages:
> ## [1] stats     graphics  grDevices utils     datasets  methods   base
> ##
> ## loaded via a namespace (and not attached):
> ## [1] compiler_3.4.1
> tools::md5sum( fn1 )
> ##             d:/DADOS_ENEM_2009.txt
> ## "83e61c96092285b60d7bf6b0dbc7072e"
> dat <- readLines( fn1 )
> length( dat )
> ## [1] 4148721
>
>
> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
>
> I am not able to reproduce this on a Linux platform:
>>
>> ########################
>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
>> sessionInfo()
>> ## R version 3.4.1 (2017-06-30)
>> ## Platform: x86_64-pc-linux-gnu (64-bit)
>> ## Running under: Ubuntu 14.04.5 LTS
>> ##
>> ## Matrix products: default
>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
>> ##
>> ## locale:
>> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>> ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> ##
>> ## attached base packages:
>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>> ##
>> ## loaded via a namespace (and not attached):
>> ## [1] compiler_3.4.1
>> tools::md5sum( fn1 )
>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>> dat <- readLines( fn1 )
>> length( dat )
>> ## [1] 4148721
>>
>> No segfault occurs.
>>
>> On Sat, 15 Jul 2017, Anthony Damico wrote:
>>
>> hi, i realized that the segfault happens on the text file in a new R
>>> session.  so, creating the segfault-generating text file requires a
>>> contributed package, but prompting the actual segfault does not -- pretty
>>> sure that means this is a base R bug?  submitted here:
>>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i am
>>> not doing something remarkably stupid.  the text file itself is 4GB so
>>> cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the
>>> previous message, i think most or all of it needs to be there to trigger
>>> the segfault.  thanks!
>>>
>>>
>>> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <ajdamico at gmail.com>
>>> wrote:
>>>
>>> hi, thanks Dr. Murdoch
>>>>
>>>>
>>>> i'd appreciate if anyone on r-help could help me narrow this down?  i
>>>> believe the segfault occurs because there's a single line with 4GB and also
>>>> embedded nuls, but i am not sure how to artificially construct that?
>>>>
>>>>
>>>> the lodown package can be removed from my example..  it is just for file
>>>> download caching, so `lodown::cachaca` can be replaced with
>>>> `download.file`.  my current example requires a huge download, so sort of
>>>> painful to repeat but i'm pretty confident that's not the issue.
>>>>
>>>>
>>>> the archive::archive_extract() function unzips a (probably corrupt) .RAR
>>>> file and creates a text file with 80,937 lines.  this file is 4GB:
>>>>
>>>>    > file.size(infile)
>>>>     [1] 4078192743
>>>>
>>>>
>>>> i am pretty sure that nearly all of that 4GB is contained on a single line
>>>> in the file.  here's what happens when i create a file connection and scan
>>>> through..
>>>>
>>>>    > file_con <- file( infile , 'r' )
>>>>    >
>>>>    > first_80936_lines <- readLines( file_con , n = 80936 )
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "1000023930632009"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "36F2924009PAULO"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "AFONSO"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "BA11"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "00000"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "00"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "2924009PAULO"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "AFONSO"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "BA1111"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "467.20"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "346.10"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "414.40"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Error in scan(w, n = 1, what = character()) :
>>>>       could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
>>>>
>>>>
>>>>
>>>> making a huge single-line file does not reproduce the problem, i think the
>>>> embedded nuls have something to do with it--
>>>>
>>>>
>>>>     # WARNING do not run with less than 64GB RAM
>>>>     tf <- tempfile()
>>>>     a <- rep( "a" , 1000000000 )
>>>>     b <- paste( a , collapse = '' )
>>>>     writeLines( b , tf ) ; rm( b ) ; gc()
>>>>     d <- readLines( tf )
>>>>
>>>>
>>>>
>>>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>>>>
>>>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>>>>>
>>>>>> hello, the last line of the code below causes a segfault for me on 3.4.1.
>>>>>> i think i should submit to https://bugs.r-project.org/  unless others have
>>>>>> advice?  thanks
>>>>>>
>>>>>>
>>>>> Segfaults are usually worth reporting as bugs.  Try to come up with a
>>>>> self-contained example, not using the lodown and archive packages.  I
>>>>> imagine you can do this by uploading the file you downloaded, or enough of
>>>>> a subset of it to trigger the segfault.  If you can't do that, then likely
>>>>> the bug is with one of those packages, not with R.
>>>>>
>>>>> Duncan Murdoch
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> install.packages( "devtools" )
>>>>>> devtools::install_github("ajdamico/lodown")
>>>>>> devtools::install_github("jimhester/archive")
>>>>>>
>>>>>>
>>>>>> file_folder <- file.path( tempdir() , "file_folder" )
>>>>>>
>>>>>> tf <- tempfile()
>>>>>>
>>>>>> # large download!  cachaca saves on your local disk if already downloaded
>>>>>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )
>>>>>>
>>>>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>>>>>>
>>>>>> unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )
>>>>>>
>>>>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>>>>>>
>>>>>> # works
>>>>>> R.utils::countLines( infile )
>>>>>>
>>>>>> # works with warning
>>>>>> my_file <- readLines( infile , skipNul = TRUE )
>>>>>>
>>>>>> # crash
>>>>>> my_file <- readLines( infile )
>>>>>>
>>>>>>
>>>>>> # run just before crash
>>>>>> sessionInfo()
>>>>>> # R version 3.4.1 (2017-06-30)
>>>>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>> # Running under: Windows 10 x64 (build 15063)
>>>>>>
>>>>>> # Matrix products: default
>>>>>>
>>>>>> # locale:
>>>>>> # [1] LC_COLLATE=English_United States.1252
>>>>>> # [2] LC_CTYPE=English_United States.1252
>>>>>> # [3] LC_MONETARY=English_United States.1252
>>>>>> # [4] LC_NUMERIC=C
>>>>>> # [5] LC_TIME=English_United States.1252
>>>>>>
>>>>>> # attached base packages:
>>>>>> # [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>
>>>>>> # loaded via a namespace (and not attached):
>>>>>> #  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
>>>>>> #  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
>>>>>> #  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
>>>>>> # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
>>>>>> # [17] archive_0.0.0.9000
>>>>>>
>>>>>>         [[alternative HTML version deleted]]
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>> ---------------------------------------------------------------------------
>> Jeff Newmiller                        The     .....       .....  Go Live...
>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>                                       Live:   OO#.. Dead: OO#..  Playing
>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>>
>>
>>
>
>
