[R] Filtering an Entire Dataset based on Several Conditions

Avi Gross @v|gro@@ @end|ng |rom ver|zon@net
Mon May 9 23:57:58 CEST 2022


Paul,

I read through the public replies you received, and evidently some of us did not fully understand what you were asking. Your subject line was not helpful: my first thought was that you wanted a single column examined for two conditions, as in EITHER more than 3 standard deviations above the mean OR more than 3 standard deviations below the mean. As someone else noted, using abs(whatever) makes it easy to do with a single condition.

You could have simply said you want to remove outliers more than 3 standard deviations from the mean! I note that in normally distributed data, the values you want to exclude may be only about 0.3% of each column, but across 39 such columns quite a few rows will contain at least one.
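
As a rough check on that point, here is a back-of-envelope sketch. It assumes the 39 columns are independent and exactly standard normal, which real data will almost certainly not be, so treat the numbers as an order of magnitude only. Under those assumptions about one row in ten would contain at least one value beyond 3 standard deviations.

```r
# Assumption: each column is standard normal and columns are independent.
# Probability that a single value lies outside [-3, 3]
p_out <- 2 * pnorm(-3)

# Probability that at least one of 39 independent columns is outside
n_cols <- 39
p_row <- 1 - (1 - p_out)^n_cols

round(100 * p_out, 2)   # per-value percentage
round(100 * p_row, 1)   # per-row percentage
```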

The voluminous data you shared makes it clear you have 39 columns of this. So it seems what you meant was you wanted to apply the same logic to all 39 columns and maybe that is what you meant by multiple conditions.

But what result do you want? Do you want to remove only rows in which every value is out of bounds, or any row with even a single value outside? Do you want to remove the row or just mark the outlier, perhaps with an NA?
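
To make those choices concrete, here is a minimal sketch of the three alternatives. The matrix z and its values are made up for illustration and are not your data.

```r
# Hypothetical toy z-score matrix: 4 rows, 3 columns
z <- matrix(c( 0.5, -1.2,  2.0,
               3.5,  0.1, -0.4,
              -3.2,  4.0,  3.1,
               1.0,  2.9, -2.9),
            nrow = 4, byrow = TRUE)

out <- abs(z) > 3   # logical matrix, TRUE where a value is an outlier

# (a) drop every row that has at least one outlier
z_any <- z[rowSums(out) == 0, , drop = FALSE]

# (b) drop only the rows in which every value is an outlier
z_all <- z[rowSums(out) < ncol(z), , drop = FALSE]

# (c) keep all rows but mark the outliers as NA
z_na <- z
z_na[out] <- NA
```

Which of the three is right depends on what the later analysis can tolerate, so it is worth deciding before writing the filter.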

Some have suggested you convert to a matrix where many tools are available, and back to a data.frame if needed. I note your solution becomes fairly trivial if you convert any values whose absolute value exceeds 3.0 to an NA and then use complete.cases() to remove any rows containing an NA. This assumes, of course, you have no NA to start with.
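
If the data might already contain NA values, one possible variation (a sketch only, using a made-up data frame named mydf) is to count outliers per row while ignoring the existing NA entries, so that those rows are not dropped by accident:

```r
# Hypothetical toy data frame of z-scores with one pre-existing NA
mydf <- data.frame(a = c(0.2, NA, 3.4, -1.0),
                   b = c(1.1, 0.3, 0.0, -2.5))

m <- as.matrix(mydf)   # all columns are numeric, so this is a numeric matrix

# Count outliers per row; na.rm = TRUE means an existing NA is not counted
n_out <- rowSums(abs(m) > 3, na.rm = TRUE)

# Row 3 (a = 3.4) is dropped; row 2 with the original NA is kept
kept <- as.data.frame(m[n_out == 0, , drop = FALSE])
```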

There are all kinds of ways to do things and if you were using the dplyr package from the tidyverse, which we are discouraged from talking about here, I can see possibilities including the rowwise() function.

One idea to consider is that you can use the max() and min() functions applied to rows or columns. You can convert your data (perhaps as a matrix) using t() and remove any rows where the max of the absolute value of "all row elements" exceeds 3. 
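
A minimal sketch of the row-maximum idea, using the built-in iris data that appears elsewhere in this thread as a stand-in for your data. Note that apply() with MARGIN = 1 works on rows directly, so in this version no t() is needed.

```r
# Stand-in data: scale the four numeric iris columns to z-scores
z <- scale(iris[-5])

# Largest absolute z-score found in each row
row_max <- apply(abs(z), 1, max)

# Keep only rows whose most extreme value is within 3 standard deviations
z_kept <- z[row_max <= 3, , drop = FALSE]
```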

One thought is to re-scale your data again using a function that is a bit like a truncated normal distribution but, instead of tossing outliers, returns an NA. As noted, complete.cases() then handles your need.

But in your case, all your columns are numeric and of the same kind, so fairly trivial code like:

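# mark every value more than 3 standard deviations from the mean as NA
mydf[abs(mydf) > 3] <- NA

# keep only the rows that contain no NA
mydf <- mydf[complete.cases(mydf), ]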

just might do it for you.

Good luck!

-----Original Message-----
From: Paul Bernal <paulbernal07 using gmail.com>
To: Rui Barradas <ruipbarradas using sapo.pt>
Cc: R <r-help using r-project.org>
Sent: Mon, May 9, 2022 12:44 pm
Subject: Re: [R] Filtering an Entire Dataset based on Several Conditions

Dear Rui,

I was trying to dput() the datasets I am working on, but since it is a bit
large (42,000 rows by 60 columns) I couldn't retrieve all the structure of
the data to include it here, so I am attaching a couple of files. One is
the raw data (called trainFeatures42k), which is the data I need to
normalize, and the other is normalized_Data, which is the data normalized
(or at least I think I got to normalize it).

 Normalized_Data.csv
<https://drive.google.com/file/d/143I1O710gAqWjzx48Gt1bwUbrG0mbpfa/view?usp=drive_web>
 trainFeatures42k.xls
<https://drive.google.com/file/d/1deMzGMkJyeVsnRzTKirmm4VqIBRzbvzV/view?usp=drive_web>

I have tried some of the code you and other friends from the community have
kindly shared, but have not been able to filter values > -3 and < 3.

Thank you all for your valuable help always.
Best,
Paul

On Mon, May 9, 2022 at 4:22, Rui Barradas (<ruipbarradas using sapo.pt>)
wrote:

> Hello,
>
> Something like this?
> First normalize the data.
> Then an apply loop creates a logical matrix giving which numbers are in
> the range -3 to 3.
> If they are all TRUE then their sum by rows is equal to the number of
> columns. This creates a logical index i.
> Use that index i to subset the scaled data set.
>
> # test data set, remove the Species column (not numeric)
> df1 <- iris[-5]
>
> df1_norm <- scale(df1)
> i <- rowSums(apply(df1_norm, 2, \(x) x > -3 & x < 3)) == ncol(df1_norm)
>
> # returns a matrix
> df1_norm[i, ]
>
> # returns a data.frame
> as.data.frame(df1_norm[i,])
>
>
> Hope this helps,
>
> Rui Barradas
>
> At 09:23 on 09/05/2022, Paul Bernal wrote:
> > Dear friends,
> >
> > I have a dataframe in which every single (i,j) entry (i standing for ith
> > row, j for jth column) has been normalized (converted to z-scores).
> >
> > Now I want to filter or subset the dataframe so that I only end up with a
> > dataframe containing only entries greater than -3 and less than 3.
> >
> > How could I accomplish this?
> >
> > Best,
> > Paul
> >
> >      [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
