[R] Can file size affect how na.strings operates in a read.table call?

Jeff Newmiller jdnewmil using dcn.davis.ca.us
Thu Nov 14 17:35:06 CET 2019


Consider the following sample:

#####
s <- "A,B,C
0,0,0
1,-99,-99
2,-99 ,-99
3, -99, -99
"

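# Only fields that are exactly "-99" (or empty) match na.strings here,
# so the padded variants (rows where A is 2 or 3) are not converted to NA.
dta_notok <- read.csv( text = s
                      , header=TRUE
                      , na.strings = c( "-99", "" )
                      )

# Listing the padded variants explicitly catches them as well.
dta_ok <- read.csv( text = s
                   , header=TRUE
                   , na.strings = c( "-99", " -99"
                                   , "-99 ", ""
                                   )
                   )

library(data.table)

# fread strips surrounding white space by default (strip.white = TRUE).
fdt_ok <- fread( text = s, na.strings=c( "-99", "" ) )
fdta_ok <- as.data.frame( fdt_ok )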
#####

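Leading and trailing spaces around the -99 values keep them from matching 
na.strings in read.csv, whose strip.white argument defaults to FALSE. The 
data.table::fread function has a strip.white argument that defaults to 
TRUE, but the resulting object is a data.table, which has different 
semantics than a data.frame (hence the as.data.frame call above).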
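
If you want to stay with read.csv, here is a minimal sketch of two things 
to try, using the same sample string. The object names raw, padded and 
dta_sw are only illustrative. The first part counts, per column, the fields 
that equal "-99" only after trimming, which is one way to check a file for 
extraneous spaces. The second part assumes that setting strip.white = TRUE 
is acceptable for your data, so that white space is stripped before the 
na.strings comparison is made.

```r
s <- "A,B,C
0,0,0
1,-99,-99
2,-99 ,-99
3, -99, -99
"

# Diagnostic: read every column as character so nothing is converted,
# then count the fields that equal "-99" only after trimming.
raw <- read.csv( text = s
               , header = TRUE
               , colClasses = "character"
               )
padded <- sapply( raw
                , function(x) sum( trimws(x) == "-99" & x != "-99" )
                )
print( padded )

# Possible fix: strip white space from unquoted fields on input,
# so that the plain "-99" entry in na.strings is enough.
dta_sw <- read.csv( text = s
                  , header = TRUE
                  , na.strings = c( "-99", "" )
                  , strip.white = TRUE
                  )
print( dta_sw )
```

Note that strip.white only removes leading and trailing white space from 
unquoted fields, so spaces inside text values are left alone.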

On Thu, 14 Nov 2019, Sebastien Bihorel wrote:

> The data file is a csv file. Some text variables contain spaces.
> 
> "Check for extraneous spaces"
> Are there specific locations that would be more critical than others?
> 
> 
> ____________________________________________________________________________
> From: Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
> Sent: Thursday, November 14, 2019 10:52
> To: Sebastien Bihorel <Sebastien.Bihorel using cognigencorp.com>; Sebastien
> Bihorel via R-help <r-help using r-project.org>; r-help using r-project.org
> <r-help using r-project.org>
> Subject: Re: [R] Can file size affect how na.strings operates in a
> read.table call?
> 
> Check for extraneous spaces. You may need more variations of the na.strings.
> 
> On November 14, 2019 7:40:42 AM PST, Sebastien Bihorel via R-help
> <r-help using r-project.org> wrote:
> >Hi,
> >
> >I have this generic function to read ASCII data files. It is
> >essentially a wrapper around the read.table function. My function is
> >used in a large variety of situations and has no a priori knowledge
> >about the data file it is asked to read. Nothing is known about file
> >size, variable types, variable names, or data table dimensions.
> >
> >One argument of my function is na.strings which is passed down to
> >read.table.
> >
> >Recently, a user tried to read a data file of ~ 80 MB (~ 93000 rows by
> >~ 160 columns) using na.strings = c('-99', '.') with the intention of
> >interpreting '.' and '-99'
> >strings as the internal missing data NA. Dots were converted to NA
> >appropriately. However, not all -99 values in the data were interpreted
> >as NA. In some variables, -99 were converted to NA, while in others -99
> >was read as a number. More surprisingly, when the data file was cut in
> >smaller chunks (i.e., by dropping either rows or columns) saved in
> >multiple files, the function calls applied on the new data files
> >resulted in the correct conversion of the -99 values into NAs.
> >
> >In all cases, the data frames produced by read.table contained the
> >expected number of records.
> >
> >While, at face value, it appears that file size affects how the
> >na.strings argument operates, I am wondering if there is something else
> >at play here.
> >
> >Unfortunately, I cannot share the data file for confidentiality reasons
> >but was wondering if you could suggest some checks I could perform to
> >get to the bottom of this issue.
> >
> >Thank you in advance for your help and sorry for the lack of
> >reproducible example.
> >
> >
> 
> --
> Sent from my phone. Please excuse my brevity.
> 
>

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil using dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------

