[R] Can file size affect how na.strings operates in a read.table call?

William Dunlap wdunlap at tibco.com
Thu Nov 14 17:51:51 CET 2019


read.table (and friends) also have the strip.white argument:

> s <- "A,B,C\n0,0,0\n1,-99,-99\n2,-99 ,-99\n3, -99, -99\n"
> read.csv(text=s, header=TRUE, na.strings="-99", strip.white=TRUE)
  A  B  C
1 0  0  0
2 1 NA NA
3 2 NA NA
4 3 NA NA
> read.csv(text=s, header=TRUE, na.strings="-99", strip.white=FALSE)
  A   B   C
1 0   0   0
2 1  NA  NA
3 2 -99  NA
4 3 -99 -99
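
For a file that cannot be shared, one way to check whether padding is the
culprit is to read every field as character and count the fields that change
when trimmed. This is only a sketch run on the sample string above; the names
raw and padded are made up for illustration, and on real data you would pass
the file name instead of text=s.

```r
# Sample from this thread: some -99 sentinels carry leading or trailing spaces.
s <- "A,B,C\n0,0,0\n1,-99,-99\n2,-99 ,-99\n3, -99, -99\n"

# Read all columns as character without stripping, so any padding is preserved.
raw <- read.csv(text = s, header = TRUE, colClasses = "character",
                strip.white = FALSE)

# Per column, count the fields that differ from their whitespace-trimmed form.
padded <- vapply(raw, function(x) sum(x != trimws(x), na.rm = TRUE), integer(1))
print(padded)
```

Columns with a nonzero count are the ones where na.strings="-99" can miss
values unless strip.white=TRUE is set or the padded variants are listed.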

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Thu, Nov 14, 2019 at 8:35 AM Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
wrote:

> Consider the following sample:
>
> #####
> s <- "A,B,C
> 0,0,0
> 1,-99,-99
> 2,-99 ,-99
> 3, -99, -99
> "
>
> dta_notok <- read.csv( text = s
>                       , header=TRUE
>                       , na.strings = c( "-99", "" )
>                       )
>
> dta_ok <- read.csv( text = s
>                    , header=TRUE
>                    , na.strings = c( "-99", " -99"
>                                    , "-99 ", ""
>                                    )
>                    )
>
> library(data.table)
>
> fdt_ok <- fread( text = s, na.strings=c( "-99", "" ) )
> fdta_ok <- as.data.frame( fdt_ok )
> #####
>
> Leading and trailing spaces cause problems. The data.table::fread function
> has a strip.white argument that defaults to TRUE, but the resulting object
> is a data.table which has different semantics than a data.frame.
>
> On Thu, 14 Nov 2019, Sebastien Bihorel wrote:
>
> > The data file is a csv file. Some text variables contain spaces.
> >
> > "Check for extraneous spaces"
> > Are there specific locations that would be more critical than others?
> >
> >
> >
> > ________________________________
> > From: Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
> > Sent: Thursday, November 14, 2019 10:52
> > To: Sebastien Bihorel <Sebastien.Bihorel using cognigencorp.com>; Sebastien
> > Bihorel via R-help <r-help using r-project.org>; r-help using r-project.org
> > <r-help using r-project.org>
> > Subject: Re: [R] Can file size affect how na.strings operates in a
> > read.table call?
> >
> > Check for extraneous spaces. You may need more variations of the
> > na.strings.
> >
> > On November 14, 2019 7:40:42 AM PST, Sebastien Bihorel via R-help
> > <r-help using r-project.org> wrote:
> > >Hi,
> > >
> > >I have this generic function to read ASCII data files. It is
> > >essentially a wrapper around the read.table function. My function is
> > >used in a large variety of situations and has no a priori knowledge
> > >about the data file it is asked to read. Nothing is known about file
> > >size, variable types, variable names, or data table dimensions.
> > >
> > >One argument of my function is na.strings which is passed down to
> > >read.table.
> > >
> > >Recently, a user tried to read a data file of ~ 80 MB (~ 93000 rows by
> > >~ 160 columns) using na.strings = c('-99', '.') with the intention of
> > >interpreting '.' and '-99' strings as the internal missing data NA.
> > >Dots were converted to NA appropriately. However, not all -99 values in
> > >the data were interpreted as NA. In some variables, -99 was converted to
> > >NA, while in others -99 was read as a number. More surprisingly, when the
> > >data file was cut into smaller chunks (i.e., by dropping either rows or
> > >columns) saved in multiple files, the function calls applied to the new
> > >data files resulted in the correct conversion of the -99 values into NAs.
> > >
> > >In all cases, the data frames produced by read.table contained the
> > >expected number of records.
> > >
> > >While, at face value, it appears that file size affects how the
> > >na.strings argument operates, I am wondering if there is something else
> > >at play here.
> > >
> > >Unfortunately, I cannot share the data file for confidentiality reasons
> > >but was wondering if you could suggest some checks I could perform to
> > >get to the bottom of this issue.
> > >
> > >Thank you in advance for your help and sorry for the lack of a
> > >reproducible example.
> > >
> > >
> > >______________________________________________
> > >R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > >https://stat.ethz.ch/mailman/listinfo/r-help
> > >PLEASE do read the posting guide
> > >http://www.R-project.org/posting-guide.html
> > >and provide commented, minimal, self-contained, reproducible code.
> >
> >
> >
>
>
