[R] readLines without skipNul=TRUE causes crash

Jeff Newmiller jdnewmil at dcn.davis.ca.us
Sun Jul 16 20:22:09 CEST 2017


I am stuck. The archive package won't compile for me on Ubuntu, and the CRANextra repo seems to be down so I cannot install packages on Windows right now. Perhaps you can zip the corrupt text file and put it online somewhere? Don't use the archive package to pack it since there seem to be issues with that tool on your machine. 
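
One way to pack the file without the archive package or any external zip tool is to stream it through a base R gzfile() connection. This is only a sketch: gzip_file() is a made-up helper name, the 16 MB chunk size is an arbitrary assumption, and it produces a .gz rather than a .zip file. Reading and writing raw bytes in chunks means the 4GB file never has to fit in memory and any embedded NUL bytes are copied through untouched.

```r
# Compress a file to .gz in fixed-size chunks using base R only.
# gzip_file() is a hypothetical helper, not part of any package.
gzip_file <- function(src, dest = paste0(src, ".gz"), chunk_bytes = 16 * 1024^2) {
  con_in <- file(src, "rb")
  on.exit(close(con_in), add = TRUE)
  con_out <- gzfile(dest, "wb")
  on.exit(close(con_out), add = TRUE)
  repeat {
    # Raw reads are byte-exact, so NUL bytes are preserved
    chunk <- readBin(con_in, what = "raw", n = chunk_bytes)
    if (length(chunk) == 0L) break
    writeBin(chunk, con_out)
  }
  invisible(dest)
}

# Small demonstration on a throwaway file that contains a NUL byte
src <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x61, 0x00, 0x62, 0x0a)), src)
gz <- gzip_file(src)

# Round trip: the decompressed bytes should equal the original bytes
con <- gzfile(gz, "rb")
restored <- readBin(con, what = "raw", n = 100)
close(con)
identical(restored, readBin(src, what = "raw", n = 100))
```

After decompressing on the receiving side, tools::md5sum() on the result can be compared with the checksum quoted below to confirm that the same corrupt file has arrived.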

I would discourage you from harassing the Brazilian government about their RAR file, because it seems fine when extracted using the file-roller archive tool on Ubuntu: no NUL characters appear in the resulting text file.
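
For anyone who wants to repeat the NUL check on their own copy of the extracted file, here is a rough sketch in base R. count_nul_bytes() is a made-up helper name and the chunk size is an assumption. It reads the file as raw bytes, so it does not go through readLines() or scan() at all.

```r
# Count NUL (0x00) bytes in a file without reading it as text.
# count_nul_bytes() is a hypothetical helper, not part of any package.
count_nul_bytes <- function(path, chunk_bytes = 16 * 1024^2) {
  con <- file(path, "rb")
  on.exit(close(con), add = TRUE)
  total <- 0  # double, so the count cannot overflow on a multi-GB file
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_bytes)
    if (length(chunk) == 0L) break
    total <- total + sum(chunk == as.raw(0L))
  }
  total
}

# Demonstration on a tiny file with two embedded NUL bytes
tf <- tempfile()
writeBin(as.raw(c(0x61, 0x00, 0x00, 0x62, 0x0a)), tf)
n_nul <- count_nul_bytes(tf)
n_nul
```

A result of zero on the real file would match what I see on Ubuntu; a nonzero count would point to the extraction step on the Windows side.
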
-- 
Sent from my phone. Please excuse my brevity.

On July 16, 2017 9:37:17 AM PDT, Anthony Damico <ajdamico at gmail.com> wrote:
>hi, yep, there are two problems -- but i think only the segfault is within
>the scope of a base R issue?  i need to look closer at the corrupted
>decompression and figure out whether i should talk to the brazilian
>government agency that creates that .rar file or open an issue with the
>archive package maintainer.  my goal in this thread is only to figure out
>how to replicate the goofy text file so the r team can turn it into an
>error instead of a segfault.
>
>the original example i sent stores the .txt file somewhere inside the
>tempdir(), but when i copy it over elsewhere on my machine, the md5sum()
>gives the same result.  thanks again for looking at this
>
>    > tools::md5sum(infile)
>
>C:\\Users\\AnthonyD\\AppData\\Local\\Temp\\RtmpIBy7qt/file_folder/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
>    "30beb57419486108e98d42ec7a2f8b19"
>
>
>    > tools::md5sum( "S:/temp/crash.txt" )
>                     S:/temp/crash.txt
>    "30beb57419486108e98d42ec7a2f8b19"
>
>
>
>
>On Sun, Jul 16, 2017 at 10:10 AM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
>
>> So you are saying there are two problems... one that produces a corrupt
>> file from a valid compressed file, and one that segfaults when presented
>> with that corrupt file? Can you please confirm the file name and run md5sum
>> on it and share the result so we can tell when the file problem has been
>> reproduced?
>> --
>> Sent from my phone. Please excuse my brevity.
>>
>> On July 16, 2017 3:21:21 AM PDT, Anthony Damico <ajdamico at gmail.com> wrote:
>> >hi, thank you for attempting this. it looks like your unix machine unzipped
>> >the txt file without corruption -- if you copied over the same txt file to
>> >windows 7, i don't think that would reproduce the problem?  i think it
>> >needs to be the corrupted text file where   R.utils::countLines( txtfile )
>> >gives 809367.  i am able to reproduce on two distinct windows machines
>> >but no guarantee i'm not doing something dumb
>> >
>> >On Sat, Jul 15, 2017 at 6:29 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
>> >
>> >> I am not able to reproduce your segfault on a Windows 7 platform either:
>> >>
>> >> ##########################
>> >> fn1 <- "d:/DADOS_ENEM_2009.txt"
>> >> sessionInfo()
>> >> ## R version 3.4.1 (2017-06-30)
>> >> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
>> >> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
>> >> ##
>> >> ## Matrix products: default
>> >> ##
>> >> ## locale:
>> >> ## [1] LC_COLLATE=English_United States.1252
>> >> ## [2] LC_CTYPE=English_United States.1252
>> >> ## [3] LC_MONETARY=English_United States.1252
>> >> ## [4] LC_NUMERIC=C
>> >> ## [5] LC_TIME=English_United States.1252
>> >> ##
>> >> ## attached base packages:
>> >> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>> >> ##
>> >> ## loaded via a namespace (and not attached):
>> >> ## [1] compiler_3.4.1
>> >> tools::md5sum( fn1 )
>> >> ##             d:/DADOS_ENEM_2009.txt
>> >> ## "83e61c96092285b60d7bf6b0dbc7072e"
>> >> dat <- readLines( fn1 )
>> >> length( dat )
>> >> ## [1] 4148721
>> >>
>> >>
>> >> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
>> >>
>> >> I am not able to reproduce this on a Linux platform:
>> >>>
>> >>> #######################
>> >>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
>> >>> sessionInfo()
>> >>> ## R version 3.4.1 (2017-06-30)
>> >>> ## Platform: x86_64-pc-linux-gnu (64-bit)
>> >>> ## Running under: Ubuntu 14.04.5 LTS
>> >>> ##
>> >>> ## Matrix products: default
>> >>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
>> >>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
>> >>> ##
>> >>> ## locale:
>> >>> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> >>> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> >>> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>> >>> ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> >>> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> >>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> >>> ##
>> >>> ## attached base packages:
>> >>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>> >>> ##
>> >>> ## loaded via a namespace (and not attached):
>> >>> ## [1] compiler_3.4.1
>> >>> tools::md5sum( fn1 )
>> >>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
>> >>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>> >>> dat <- readLines( fn1 )
>> >>> length( dat )
>> >>> ## [1] 4148721
>> >>>
>> >>> No segfault occurs.
>> >>>
>> >>> On Sat, 15 Jul 2017, Anthony Damico wrote:
>> >>>
>> >>>> hi, i realized that the segfault happens on the text file in a new R
>> >>>> session.  so, creating the segfault-generating text file requires a
>> >>>> contributed package, but prompting the actual segfault does not -- pretty
>> >>>> sure that means this is a base R bug?  submitted here:
>> >>>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311 hopefully i am
>> >>>> not doing something remarkably stupid.  the text file itself is 4GB so
>> >>>> cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the
>> >>>> previous message, i think most or all of it needs to be there to trigger
>> >>>> the segfault.  thanks!
>> >>>>
>> >>>>
>> >>>> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico
>> ><ajdamico at gmail.com>
>> >>>> wrote:
>> >>>>
>> >>>> hi, thanks Dr. Murdoch
>> >>>>>
>> >>>>>
>> >>>>> i'd appreciate if anyone on r-help could help me narrow this down?  i
>> >>>>> believe the segfault occurs because there's a single line with 4GB and also
>> >>>>> embedded nuls, but i am not sure how to artificially construct that?
>> >>>>>
>> >>>>>
>> >>>>> the lodown package can be removed from my example..  it is just for file
>> >>>>> download caching, so `lodown::cachaca` can be replaced with
>> >>>>> `download.file`  my current example requires a huge download, so sort of
>> >>>>> painful to repeat but i'm pretty confident that's not the issue.
>> >>>>>
>> >>>>>
>> >>>>> the archive::archive_extract() function unzips a (probably corrupt) .RAR
>> >>>>> file and creates a text file with 80,937 lines.  this file is 4GB:
>> >>>>>
>> >>>>>    > file.size(infile)
>> >>>>>     [1] 4078192743
>> >>>>>
>> >>>>>
>> >>>>> i am pretty sure that nearly all of that 4GB is contained on a single line
>> >>>>> in the file.  here's what happens when i create a file connection and scan
>> >>>>> through..
>> >>>>>
>> >>>>>    > w <- file( infile , 'r' )
>> >>>>>    >
>> >>>>>    > first_80936_lines <- readLines( w , n = 80936 )
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Read 1 item
>> >>>>>     [1] "1000023930632009"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Read 1 item
>> >>>>>     [1] "36F2924009PAULO"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Read 1 item
>> >>>>>     [1] "AFONSO"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Read 1 item
>> >>>>>     [1] "BA11"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Read 1 item
>> >>>>>     [1] "00000"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Read 1 item
>> >>>>>     [1] "00"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Read 1 item
>> >>>>>     [1] "2924009PAULO"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Read 1 item
>> >>>>>     [1] "AFONSO"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Read 1 item
>> >>>>>     [1] "BA1111"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Read 1 item
>> >>>>>     [1] "467.20"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Read 1 item
>> >>>>>     [1] "346.10"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Read 1 item
>> >>>>>     [1] "414.40"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Error in scan(w, n = 1, what = character()) :
>> >>>>>       could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> making a huge single-line file does not reproduce the problem, i think the
>> >>>>> embedded nuls have something to do with it--
>> >>>>>
>> >>>>>
>> >>>>>     # WARNING do not run with less than 64GB RAM
>> >>>>>     tf <- tempfile()
>> >>>>>     a <- rep( "a" , 1000000000 )
>> >>>>>     b <- paste( a , collapse = '' )
>> >>>>>     writeLines( b , tf ) ; rm( b ) ; gc()
>> >>>>>     d <- readLines( tf )
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>> >>>>>
>> >>>>>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>> >>>>>>
>> >>>>>>> hello, the last line of the code below causes a segfault for me on 3.4.1.
>> >>>>>>> i think i should submit to https://bugs.r-project.org/ unless others have
>> >>>>>>> advice?  thanks
>> >>>>>>>
>> >>>>>>>
>> >>>>>> Segfaults are usually worth reporting as bugs.  Try to come up with a
>> >>>>>> self-contained example, not using the lodown and archive packages.  I
>> >>>>>> imagine you can do this by uploading the file you downloaded, or enough of
>> >>>>>> a subset of it to trigger the segfault.  If you can't do that, then likely
>> >>>>>> the bug is with one of those packages, not with R.
>> >>>>>>
>> >>>>>> Duncan Murdoch
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> install.packages( "devtools" )
>> >>>>>>> devtools::install_github("ajdamico/lodown")
>> >>>>>>> devtools::install_github("jimhester/archive")
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> file_folder <- file.path( tempdir() , "file_folder" )
>> >>>>>>>
>> >>>>>>> tf <- tempfile()
>> >>>>>>>
>> >>>>>>> # large download!  cachaca saves on your local disk if already downloaded
>> >>>>>>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )
>> >>>>>>>
>> >>>>>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>> >>>>>>>
>> >>>>>>> unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE  )
>> >>>>>>>
>> >>>>>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>> >>>>>>>
>> >>>>>>> # works
>> >>>>>>> R.utils::countLines( infile )
>> >>>>>>>
>> >>>>>>> # works with warning
>> >>>>>>> my_file <- readLines( infile , skipNul = TRUE )
>> >>>>>>>
>> >>>>>>> # crash
>> >>>>>>> my_file <- readLines( infile )
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> # run just before crash
>> >>>>>>> sessionInfo()
>> >>>>>>> # R version 3.4.1 (2017-06-30)
>> >>>>>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>> >>>>>>> # Running under: Windows 10 x64 (build 15063)
>> >>>>>>>
>> >>>>>>> # Matrix products: default
>> >>>>>>>
>> >>>>>>> # locale:
>> >>>>>>> # [1] LC_COLLATE=English_United States.1252
>> >>>>>>> # [2] LC_CTYPE=English_United States.1252
>> >>>>>>> # [3] LC_MONETARY=English_United States.1252
>> >>>>>>> # [4] LC_NUMERIC=C
>> >>>>>>> # [5] LC_TIME=English_United States.1252
>> >>>>>>>
>> >>>>>>> # attached base packages:
>> >>>>>>> # [1] stats     graphics  grDevices utils     datasets  methods   base
>> >>>>>>>
>> >>>>>>> # loaded via a namespace (and not attached):
>> >>>>>>> #  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
>> >>>>>>> #  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
>> >>>>>>> #  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
>> >>>>>>> # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
>> >>>>>>> # [17] archive_0.0.0.9000
>> >>>>>>>
>> >>>>>>>         [[alternative HTML version deleted]]
>> >>>>>>>
>> >>>>>>> ______________________________________________
>> >>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> >>>>>>> and provide commented, minimal, self-contained, reproducible code.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>> ---------------------------------------------------------------------------
>> >>> Jeff Newmiller                        The     .....       .....  Go Live...
>> >>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>> >>>                                       Live:   OO#.. Dead: OO#..  Playing
>> >>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> >>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> >>>
>> >>>
>> >>>
>> >>
>>


