[R] readLines without skipNul=TRUE causes crash

Jeff Newmiller jdnewmil at dcn.davis.ca.us
Mon Jul 17 16:23:25 CEST 2017


I'll pass. Just because some non-CRAN "archive" package has bugs or your disk storage is flaky does not mean that the dozens or hundreds of other compression tools (e.g. the built-in Windows "Send to compressed folder" pop-up menu) won't get it right, and we would know if one of them did fail because the md5sum would not match.
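
For reference, the md5sum check mentioned above can be scripted. This is a minimal sketch, assuming a small temporary file stands in for the real extract; `same_md5` is a hypothetical helper written for this illustration, not something from the thread or from any package.

```r
# Hypothetical helper: TRUE if the file's MD5 checksum equals `expected`.
same_md5 <- function(path, expected) {
  actual <- unname(tools::md5sum(path))   # NA if the file cannot be read
  !is.na(actual) && identical(tolower(actual), tolower(expected))
}

# Self-contained demonstration on a small temporary file.
tf <- tempfile(fileext = ".txt")
writeLines(c("line one", "line two"), tf)
reference <- unname(tools::md5sum(tf))

# A byte-for-byte copy should match the reference checksum.
copy <- tempfile(fileext = ".txt")
file.copy(tf, copy)
ok_before <- same_md5(copy, reference)

# After the copy is altered, the checksum should no longer match.
cat("extra\n", file = copy, append = TRUE)
ok_after <- same_md5(copy, reference)
```

In this thread the cleanly extracted file hashed to 83e61c96092285b60d7bf6b0dbc7072e and the corrupted one to 30beb57419486108e98d42ec7a2f8b19, so a comparison of this kind would tell the two apart.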
-- 
Sent from my phone. Please excuse my brevity.

On July 17, 2017 5:00:48 AM PDT, Anthony Damico <ajdamico at gmail.com> wrote:
>hi, thanks again for taking the time.  since corrupted compression prompted
>the segfault for me in the first place, i've just posted the text file
>as-is.  it's a 2.4GB file so to be avoided on a metered internet
>connection.  i've updated the bugzilla report at
>https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311 with more
>relevant info.  these lines of code crash both windows R 3.4.1 and also
>linux R 3.3.3 for me.  thanks again
>
>
>    # consider changing `tempfile()` to a permanent location
>    # so you don't lose the large downloaded file after the crash
>    tf <- tempfile()
>    download.file( "https://sisyphus.project.cwi.nl/r-bug-17311-crash.txt" , tf , mode = 'wb' )
>    sessionInfo()
>    x <- readLines( tf )
>
>
>
>
>On Sun, Jul 16, 2017 at 2:22 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
>
>> I am stuck. The archive package won't compile for me on Ubuntu, and the
>> CRANextra repo seems to be down so I cannot install packages on Windows
>> right now. Perhaps you can zip the corrupt text file and put it online
>> somewhere? Don't use the archive package to pack it since there seem to be
>> issues with that tool on your machine.
>>
>> I would discourage you from harassing the Brazilian government about their
>> RAR file because the RAR file seems fine (no NUL characters appear in the
>> text file) when extracted using the file-roller archive tool on Ubuntu.
>> --
>> Sent from my phone. Please excuse my brevity.
>>
>> On July 16, 2017 9:37:17 AM PDT, Anthony Damico <ajdamico at gmail.com> wrote:
>> >hi, yep, there are two problems -- but i think only the segfault is within
>> >the scope of a base R issue?  i need to look closer at the corrupted
>> >decompression and figure out whether i should talk to the brazilian
>> >government agency that creates that .rar file or open an issue with the
>> >archive package maintainer.  my goal in this thread is only to figure out
>> >how to replicate the goofy text file so the r team can turn it into an
>> >error instead of a segfault.
>> >
>> >the original example i sent stores the .txt file somewhere inside the
>> >tempdir(), but when i copy it over elsewhere on my machine, the md5sum()
>> >gives the same result.  thanks again for looking at this
>> >
>> >    > tools::md5sum(infile)
>> >    C:\\Users\\AnthonyD\\AppData\\Local\\Temp\\RtmpIBy7qt/file_folder/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
>> >    "30beb57419486108e98d42ec7a2f8b19"
>> >
>> >
>> >    > tools::md5sum( "S:/temp/crash.txt" )
>> >                     S:/temp/crash.txt
>> >    "30beb57419486108e98d42ec7a2f8b19"
>> >
>> >
>> >
>> >
>> >On Sun, Jul 16, 2017 at 10:10 AM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
>> >
>> >> So you are saying there are two problems... one that produces a corrupt
>> >> file from a valid compressed file, and one that segfaults when presented
>> >> with that corrupt file? Can you please confirm the file name and run md5sum
>> >> on it and share the result so we can tell when the file problem has been
>> >> reproduced?
>> >> --
>> >> Sent from my phone. Please excuse my brevity.
>> >>
>> >> On July 16, 2017 3:21:21 AM PDT, Anthony Damico <ajdamico at gmail.com> wrote:
>> >> >hi, thank you for attempting this. it looks like your unix machine unzipped
>> >> >the txt file without corruption -- if you copied over the same txt file to
>> >> >windows 7, i don't think that would reproduce the problem?  i think it
>> >> >needs to be the corrupted text file where   R.utils::countLines( txtfile )
>> >> >gives 809367.  i am able to reproduce on two distinct windows machines
>> >> >but no guarantee i'm not doing something dumb
>> >> >
>> >> >On Sat, Jul 15, 2017 at 6:29 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
>> >> >
>> >> >> I am not able to reproduce your segfault on a Windows 7 platform either:
>> >> >>
>> >> >> ##########################
>> >> >> fn1 <- "d:/DADOS_ENEM_2009.txt"
>> >> >> sessionInfo()
>> >> >> ## R version 3.4.1 (2017-06-30)
>> >> >> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
>> >> >> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
>> >> >> ##
>> >> >> ## Matrix products: default
>> >> >> ##
>> >> >> ## locale:
>> >> >> ## [1] LC_COLLATE=English_United States.1252
>> >> >> ## [2] LC_CTYPE=English_United States.1252
>> >> >> ## [3] LC_MONETARY=English_United States.1252
>> >> >> ## [4] LC_NUMERIC=C
>> >> >> ## [5] LC_TIME=English_United States.1252
>> >> >> ##
>> >> >> ## attached base packages:
>> >> >> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>> >> >> ##
>> >> >> ## loaded via a namespace (and not attached):
>> >> >> ## [1] compiler_3.4.1
>> >> >> tools::md5sum( fn1 )
>> >> >> ##             d:/DADOS_ENEM_2009.txt
>> >> >> ## "83e61c96092285b60d7bf6b0dbc7072e"
>> >> >> dat <- readLines( fn1 )
>> >> >> length( dat )
>> >> >> ## [1] 4148721
>> >> >>
>> >> >>
>> >> >> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
>> >> >>
>> >> >> I am not able to reproduce this on a Linux platform:
>> >> >>>
>> >> >>> #######################
>> >> >>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
>> >> >>> sessionInfo()
>> >> >>> ## R version 3.4.1 (2017-06-30)
>> >> >>> ## Platform: x86_64-pc-linux-gnu (64-bit)
>> >> >>> ## Running under: Ubuntu 14.04.5 LTS
>> >> >>> ##
>> >> >>> ## Matrix products: default
>> >> >>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
>> >> >>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
>> >> >>> ##
>> >> >>> ## locale:
>> >> >>> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> >> >>> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> >> >>> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>> >> >>> ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> >> >>> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> >> >>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> >> >>> ##
>> >> >>> ## attached base packages:
>> >> >>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>> >> >>> ##
>> >> >>> ## loaded via a namespace (and not attached):
>> >> >>> ## [1] compiler_3.4.1
>> >> >>> tools::md5sum( fn1 )
>> >> >>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
>> >> >>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>> >> >>> dat <- readLines( fn1 )
>> >> >>> length( dat )
>> >> >>> ## [1] 4148721
>> >> >>>
>> >> >>> No segfault occurs.
>> >> >>>
>> >> >>> On Sat, 15 Jul 2017, Anthony Damico wrote:
>> >> >>>
>> >> >>> hi, i realized that the segfault happens on the text file in a new R
>> >> >>>> session.  so, creating the segfault-generating text file requires a
>> >> >>>> contributed package, but prompting the actual segfault does not -- pretty
>> >> >>>> sure that means this is a base R bug?  submitted here:
>> >> >>>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311 hopefully i am
>> >> >>>> not doing something remarkably stupid.  the text file itself is 4GB so
>> >> >>>> cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the
>> >> >>>> previous message, i think most or all of it needs to be there to trigger
>> >> >>>> the segfault.  thanks!
>> >> >>>>
>> >> >>>>
>> >> >>>> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <ajdamico at gmail.com> wrote:
>> >> >>>>
>> >> >>>> hi, thanks Dr. Murdoch
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> i'd appreciate if anyone on r-help could help me narrow this down?  i
>> >> >>>>> believe the segfault occurs because there's a single line with 4GB and also
>> >> >>>>> embedded nuls, but i am not sure how to artificially construct that?
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> the lodown package can be removed from my example..  it is just for file
>> >> >>>>> download caching, so `lodown::cachaca` can be replaced with
>> >> >>>>> `download.file`  my current example requires a huge download, so sort of
>> >> >>>>> painful to repeat but i'm pretty confident that's not the issue.
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> the archive::archive_extract() function unzips a (probably corrupt) .RAR
>> >> >>>>> file and creates a text file with 80,937 lines.  this file is 4GB:
>> >> >>>>>
>> >> >>>>>    > file.size(infile)
>> >> >>>>>     [1] 4078192743
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> i am pretty sure that nearly all of that 4GB is contained on a single line
>> >> >>>>> in the file.  here's what happens when i create a file connection and scan
>> >> >>>>> through..
>> >> >>>>>
>> >> >>>>>    > file_con <- file( infile , 'r' )
>> >> >>>>>    >
>> >> >>>>>    > first_80936_lines <- readLines( file_con , n = 80936 )
>> >> >>>>>    > scan( w , n = 1 , what = character() )
>> >> >>>>>     Read 1 item
>> >> >>>>>     [1] "1000023930632009"
>> >> >>>>>    > scan( w , n = 1 , what = character() )
>> >> >>>>>     Read 1 item
>> >> >>>>>     [1] "36F2924009PAULO"
>> >> >>>>>    > scan( w , n = 1 , what = character() )
>> >> >>>>>     Read 1 item
>> >> >>>>>     [1] "AFONSO"
>> >> >>>>>    > scan( w , n = 1 , what = character() )
>> >> >>>>>     Read 1 item
>> >> >>>>>     [1] "BA11"
>> >> >>>>>    > scan( w , n = 1 , what = character() )
>> >> >>>>>     Read 1 item
>> >> >>>>>     [1] "00000"
>> >> >>>>>    > scan( w , n = 1 , what = character() )
>> >> >>>>>     Read 1 item
>> >> >>>>>     [1] "00"
>> >> >>>>>    > scan( w , n = 1 , what = character() )
>> >> >>>>>     Read 1 item
>> >> >>>>>     [1] "2924009PAULO"
>> >> >>>>>    > scan( w , n = 1 , what = character() )
>> >> >>>>>     Read 1 item
>> >> >>>>>     [1] "AFONSO"
>> >> >>>>>    > scan( w , n = 1 , what = character() )
>> >> >>>>>     Read 1 item
>> >> >>>>>     [1] "BA1111"
>> >> >>>>>    > scan( w , n = 1 , what = character() )
>> >> >>>>>     Read 1 item
>> >> >>>>>     [1] "467.20"
>> >> >>>>>    > scan( w , n = 1 , what = character() )
>> >> >>>>>     Read 1 item
>> >> >>>>>     [1] "346.10"
>> >> >>>>>    > scan( w , n = 1 , what = character() )
>> >> >>>>>     Read 1 item
>> >> >>>>>     [1] "414.40"
>> >> >>>>>    > scan( w , n = 1 , what = character() )
>> >> >>>>>     Error in scan(w, n = 1, what = character()) :
>> >> >>>>>       could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> making a huge single-line file does not reproduce the problem, i think the
>> >> >>>>> embedded nuls have something to do with it--
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>     # WARNING do not run with less than 64GB RAM
>> >> >>>>>     tf <- tempfile()
>> >> >>>>>     a <- rep( "a" , 1000000000 )
>> >> >>>>>     b <- paste( a , collapse = '' )
>> >> >>>>>     writeLines( b , tf ) ; rm( b ) ; gc()
>> >> >>>>>     d <- readLines( tf )
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>> >> >>>>>
>> >> >>>>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>> >> >>>>>>
>> >> >>>>>> hello, the last line of the code below causes a segfault for me on 3.4.1.
>> >> >>>>>>> i think i should submit to https://bugs.r-project.org/ unless others have
>> >> >>>>>>> advice?  thanks
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>> Segfaults are usually worth reporting as bugs.  Try to come up with a
>> >> >>>>>> self-contained example, not using the lodown and archive packages.  I
>> >> >>>>>> imagine you can do this by uploading the file you downloaded, or enough of
>> >> >>>>>> a subset of it to trigger the segfault.  If you can't do that, then likely
>> >> >>>>>> the bug is with one of those packages, not with R.
>> >> >>>>>>
>> >> >>>>>> Duncan Murdoch
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>> install.packages( "devtools" )
>> >> >>>>>>> devtools::install_github("ajdamico/lodown")
>> >> >>>>>>> devtools::install_github("jimhester/archive")
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>> file_folder <- file.path( tempdir() , "file_folder" )
>> >> >>>>>>>
>> >> >>>>>>> tf <- tempfile()
>> >> >>>>>>>
>> >> >>>>>>> # large download!  cachaca saves on your local disk if already downloaded
>> >> >>>>>>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )
>> >> >>>>>>>
>> >> >>>>>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>> >> >>>>>>>
>> >> >>>>>>> unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE  )
>> >> >>>>>>>
>> >> >>>>>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>> >> >>>>>>>
>> >> >>>>>>> # works
>> >> >>>>>>> R.utils::countLines( infile )
>> >> >>>>>>>
>> >> >>>>>>> # works with warning
>> >> >>>>>>> my_file <- readLines( infile , skipNul = TRUE )
>> >> >>>>>>>
>> >> >>>>>>> # crash
>> >> >>>>>>> my_file <- readLines( infile )
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>> # run just before crash
>> >> >>>>>>> sessionInfo()
>> >> >>>>>>> # R version 3.4.1 (2017-06-30)
>> >> >>>>>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>> >> >>>>>>> # Running under: Windows 10 x64 (build 15063)
>> >> >>>>>>>
>> >> >>>>>>> # Matrix products: default
>> >> >>>>>>>
>> >> >>>>>>> # locale:
>> >> >>>>>>> # [1] LC_COLLATE=English_United States.1252
>> >> >>>>>>> # [2] LC_CTYPE=English_United States.1252
>> >> >>>>>>> # [3] LC_MONETARY=English_United States.1252
>> >> >>>>>>> # [4] LC_NUMERIC=C
>> >> >>>>>>> # [5] LC_TIME=English_United States.1252
>> >> >>>>>>>
>> >> >>>>>>> # attached base packages:
>> >> >>>>>>> # [1] stats     graphics  grDevices utils     datasets  methods   base
>> >> >>>>>>>
>> >> >>>>>>> # loaded via a namespace (and not attached):
>> >> >>>>>>>  # [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
>> >> >>>>>>>  # [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
>> >> >>>>>>>  # [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
>> >> >>>>>>> # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
>> >> >>>>>>> # [17] archive_0.0.0.9000
>> >> >>>>>>>
>> >> >>>>>>>         [[alternative HTML version deleted]]
>> >> >>>>>>>
>> >> >>>>>>> ______________________________________________
>> >> >>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >> >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> >>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> >> >>>>>>> and provide commented, minimal, self-contained, reproducible code.
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>
>> >> >>>>>
>> >> >>> ---------------------------------------------------------------------------
>> >> >>> Jeff Newmiller                        The     .....       .....  Go Live...
>> >> >>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>> >> >>>                                       Live:   OO#.. Dead: OO#..  Playing
>> >> >>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> >> >>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> >> >>>
>> >> >> ---------------------------------------------------------------------------
>> >> >> Jeff Newmiller                        The     .....       .....  Go Live...
>> >> >> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>> >> >>                                       Live:   OO#.. Dead: OO#..  Playing
>> >> >> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> >> >> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> >> >> ---------------------------------------------------------------------------
>> >> >>
>> >>
>>
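
A small-scale illustration of the embedded-NUL behaviour discussed in this thread. This is only a sketch: it writes a handful of bytes rather than a multi-gigabyte line, so it is not expected to reproduce the segfault, and the file contents are invented for the example. It shows one way to construct a text file with an embedded NUL, count the NUL bytes, and compare `readLines()` with and without `skipNul = TRUE`.

```r
# Write a tiny file whose second line contains one embedded NUL byte.
# (Contents are made up for illustration; the real file is several GB.)
tf <- tempfile(fileext = ".txt")
con <- file(tf, "wb")
writeBin(charToRaw("first line\n"), con)
writeBin(c(charToRaw("ab"), as.raw(0), charToRaw("cd\n")), con)
close(con)

# Count NUL bytes by reading the file as raw bytes.
bytes <- readBin(tf, what = "raw", n = file.size(tf))
n_nul <- sum(bytes == as.raw(0))

# skipNul = TRUE drops the NUL and keeps the rest of the line.
with_skip <- readLines(tf, skipNul = TRUE)

# Without skipNul, R warns about the embedded nul; on a small file the
# affected line is typically cut short rather than causing a crash.
without_skip <- suppressWarnings(readLines(tf))

print(n_nul)
print(with_skip)
print(without_skip)
```

Scaling the second `writeBin()` call up to a line of several gigabytes would be the natural next step toward the case described above, but that has not been tried here.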


