[R] Sanity check in loading large dataframe

Bert Gunter <bgunter.4567 using gmail.com>
Mon Aug 9 16:19:40 CEST 2021


FWIW:

Yes, thanks for noting that.
My own preference is to always let NAs propagate and then decide manually
how to deal with them, but others may disagree.
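
To make that concrete, here is a minimal sketch using Petr's vectors from the
quoted message below. The names idx, n_undecided and keep are mine, and the
decision to treat an undecidable comparison as "do not select" is only an
assumed example of a manual choice.

```r
x <- 1:10
x[c(2, 5)] <- NA
y <- letters[1:10]

# Step 1: let the NAs propagate and count them before doing anything else.
idx <- x < 5
n_undecided <- sum(is.na(idx))   # comparisons that could not be decided

# Step 2: decide explicitly. Assumed choice here: an NA comparison
# means the element is not selected.
keep <- !is.na(idx) & idx
y[keep]
# [1] "a" "c" "d"
```

This selects the same elements as y[which(x < 5)], but n_undecided is
available to look at before anything is discarded.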

Best,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Sun, Aug 8, 2021 at 11:30 PM PIKAL Petr <petr.pikal using precheza.cz> wrote:
>
> Hi Bert
>
> Yes, in this case which() is not necessary. But when NAs are involved,
> logical indexing is sometimes not the best choice, because NA propagates
> to the result, which may not be wanted.
>
> x <- 1:10
> x[c(2,5)] <- NA
> y<- letters[1:10]
> y[x<5]
> [1] "a" NA  "c" "d" NA
> y[which(x<5)]
> [1] "a" "c" "d"
> dat <- data.frame(x,y)
> dat[x<5,]
>       x    y
> 1     1    a
> NA   NA <NA>
> 3     3    c
> 4     4    d
> NA.1 NA <NA>
>
> dat[which(x<5),]
>   x y
> 1 1 a
> 3 3 c
> 4 4 d
>
> Both results are OK, but one has to consider this NA value propagation.
>
> Cheers
> Petr
>
> From: Bert Gunter <bgunter.4567 using gmail.com>
> Sent: Friday, August 6, 2021 1:29 PM
> To: PIKAL Petr <petr.pikal using precheza.cz>
> Cc: Luigi Marongiu <marongiu.luigi using gmail.com>; r-help <r-help using r-project.org>
> Subject: Re: [R] Sanity check in loading large dataframe
>
> ... but remove the which() and use logical indexing ...  ;-)
>
>
> Bert Gunter
>
> On Fri, Aug 6, 2021 at 12:57 AM PIKAL Petr <petr.pikal using precheza.cz>
> wrote:
> Hi
>
> You already got an answer from Avi. I often use dim(data) to inspect how many
> rows/columns I have.
> After that I check whether some columns contain all or many NA values.
>
> colSums(is.na(data))
> keep <- which(colSums(is.na(data)) < nnn)
> cleaned.data <- data[, keep]
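
A small sketch of that column check on a toy data frame, in case it helps.
The data frame toy and the threshold max_na are made up for illustration
(max_na stands in for Petr's nnn, whose value is not given). Since
colSums(is.na(...)) cannot itself produce NA, the logical vector can be used
for indexing directly, without which().

```r
# Toy data frame standing in for the real one (hypothetical values).
toy <- data.frame(
  id = 1:6,
  a  = c(1, NA, 3, NA, 5, 6),
  b  = rep(NA_real_, 6),             # a column that is entirely NA
  d  = c("u", "v", NA, "x", "y", "z")
)

na_per_col <- colSums(is.na(toy))    # number of NAs in each column
max_na <- 3                          # assumed threshold, plays the role of nnn

keep <- na_per_col < max_na          # named logical vector, one entry per column
cleaned <- toy[, keep, drop = FALSE] # drop = FALSE keeps a data frame
names(cleaned)
# [1] "id" "a"  "d"
```

Looking at na_per_col before choosing the threshold is probably more useful
than the subsetting itself.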
>
> Cheers
> Petr
>
>
> > -----Original Message-----
> > From: R-help <r-help-bounces using r-project.org> On Behalf Of Luigi
> > Marongiu
> > Sent: Friday, August 6, 2021 7:34 AM
> > To: Duncan Murdoch <murdoch.duncan using gmail.com>
> > Cc: r-help <r-help using r-project.org>
> > Subject: Re: [R] Sanity check in loading large dataframe
> >
> > Ok, so nothing to worry about. Yet, are there other checks I can
> > implement?
> > Thank you
> >
> > On Thu, 5 Aug 2021, 15:40 Duncan Murdoch, <murdoch.duncan using gmail.com>
> > wrote:
> >
> > > On 05/08/2021 9:16 a.m., Luigi Marongiu wrote:
> > >  > Hello,
> > >  > I am using a large spreadsheet (over 600 variables).
> > >  > I tried `str` to check the dimensions of the spreadsheet and I got
> > >  > ```
> > >  >> (str(df))
> > >  > 'data.frame': 302 obs. of  626 variables:
> > >  >   $ record_id                 : int  1 1 1 1 1 1 1 1 1 1 ...
> > >  > ....
> > >  > $ v1_medicamento___aceta    : int  1 NA NA NA NA NA NA NA NA NA ...
> > >  >    [list output truncated]
> > >  > NULL
> > >  > ```
> > >  > I understand that `[list output truncated]` means that there are more
> > >  > variables than those allowed by str to be displayed as rows. Thus I
> > >  > increased the row's output with:
> > >  > ```
> > >  >
> > >  >> (str(df, list.len=1000))
> > >  > 'data.frame': 302 obs. of  626 variables:
> > >  >   $ record_id                 : int  1 1 1 1 1 1 1 1 1 1 ...
> > >  > ...
> > >  > NULL
> > >  > ```
> > >  >
> > >  > Does `NULL` mean that some of the variables are not closed? (perhaps a
> > >  > missing comma somewhere)
> > >  > Is there a way to check the sanity of the data and avoid that some
> > >  > separator is not in the right place?
> > >  > Thank you
> > >
> > > The NULL is the value returned by str().  Normally it is not printed,
> > > but when you wrap str in parens as (str(df, list.len=1000)), that
> > > forces the value to print.
> > >
> > > str() is unusual in R functions in that it prints to the console as it
> > > runs and returns nothing.  Many other functions construct a value
> > > which is only displayed if you print it, but something like
> > >
> > > x <- str(df, list.len=1000)
> > >
> > > will print the same as if there was no assignment, and then assign
> > > NULL to x.
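
A short sketch of the behaviour Duncan describes, on a tiny stand-in data
frame (df here is hypothetical, not Luigi's data). It also shows
capture.output() as one way to get hold of what str() prints, and dim() as a
function that returns a value one can actually test.

```r
df <- data.frame(record_id = 1:3, v = c(1, NA, NA))  # hypothetical stand-in

# str() prints its summary as a side effect; the value it returns is NULL.
res <- str(df)
is.null(res)
# [1] TRUE

# To work with the printed summary, capture it as a character vector.
txt <- capture.output(str(df))
length(txt)      # one header line plus one line per column

# dim() returns its result, so it can be used directly in a check.
dim(df)
# [1] 3 2
```

So the NULL in the original output says nothing about the data; it is only
the return value of str() being printed because of the outer parentheses.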
> > >
> > > Duncan Murdoch
> > >
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.