[R] R how to find outliers and zero mean columns?
Jim Lemon
drjimlemon at gmail.com
Thu Mar 31 04:43:14 CEST 2016
How about:
# names of the all-zero columns, if X is a data frame
names(X)[which_cols]
# and, if X has rownames, the names of the all-zero rows:
rownames(X)[which_rows]
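Put together with the example data from my earlier message, a minimal
sketch might look like the following. The use of vapply() instead of
unlist(lapply()) and the decision to treat NA as "not zero" are
assumptions for illustration, as are the names zero_col_names and
X_reduced:

```r
# Example data frame from earlier in this thread
X <- data.frame(A = c(0, 1:10), B = c(0, 2:10, 99999),
                C = c(0, -1, 3:11), D = rep(0, 11))

# TRUE when a column is numeric and every value is exactly zero.
# Assumption: an NA counts as "not zero", so such a column is kept.
all_zeros <- function(x) is.numeric(x) && all(!is.na(x) & x == 0)

# One logical value per column, named after the columns
which_cols <- vapply(X, all_zeros, logical(1))

# Names of the all-zero columns
zero_col_names <- names(X)[which_cols]

# The data frame without those columns; drop = FALSE keeps a data frame
# even if only one column is left
X_reduced <- X[, !which_cols, drop = FALSE]
```

This should scale to a few hundred columns without trouble, since each
column is only scanned once.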
My note about hackles was that packages generally don't know which
values are "abnormal" unless you specify them. Just like us. So you
have to specify what the range of "normal" values is, or which
specific values are "abnormal". There is a package named "outliers",
and while it would identify the 99999 value in the example I used, it
wouldn't do so for the -1.
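To make "you have to specify" concrete, here is a rough sketch in base
R. The sentinel values and the "normal" range are hypothetical
placeholders that you would replace with whatever applies to your
data, and all the object names are made up for the example:

```r
# Example data frame from earlier in this thread
X <- data.frame(A = c(0, 1:10), B = c(0, 2:10, 99999),
                C = c(0, -1, 3:11), D = rep(0, 11))

# Hypothetical values to treat as "abnormal"; adjust to your data
sentinels <- c(99999, -1)

# Logical matrix (rows x columns) marking cells that match a sentinel
is_sentinel <- vapply(X, function(col) col %in% sentinels,
                      logical(nrow(X)))

# Names of the columns containing at least one sentinel value
cols_with_sentinels <- names(X)[colSums(is_sentinel) > 0]

# "Remove" the flagged values by replacing them with NA in a copy
X_clean <- X
X_clean[is_sentinel] <- NA

# Alternative: a hypothetical "normal" range, counting the values
# that fall outside it in each column
normal_range <- c(0, 1000)
out_of_range <- vapply(X, function(col) {
  sum(col < normal_range[1] | col > normal_range[2], na.rm = TRUE)
}, integer(1))
```

Either way the judgement about what is abnormal stays with you; the
code only applies it to every column.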
Jim
On Thu, Mar 31, 2016 at 1:30 PM, Norman Pat <normanmath1 at gmail.com> wrote:
> Hi Jim,
> Thanks for your reply. I know these basics in R.
>
> But let's say you have a data frame X with 300 features.
> From those 300 features I need to pull out the names of each feature
> that has zero values for all the observations in that sample.
>
> Here I am looking for a package or a function to do that.
>
> And how do I know whether there are abnormal values for each feature?
> Let's say I have 300 features and 100000 observations. It is hard to
> look at everything in the Excel file. Instead, I am looking for a
> package that does the work.
>
> I hope you understand.
>
> Thanks a lot
>
> Cheers
>
>
> On Thu, Mar 31, 2016 at 1:13 PM, Jim Lemon <drjimlemon at gmail.com> wrote:
>>
>> Hi Norman,
>> To check whether all values of an object (say "x") fulfill a certain
>> condition (==0):
>>
>> all(x==0)
>>
>> If your object (X) is indeed a data frame, you can only do this by
>> column, so if you want to get the results:
>>
>> X<-data.frame(A=c(0,1:10),B=c(0,2:10,99999),
>> C=c(0,-1,3:11),D=rep(0,11))
>> all_zeros<-function(x) return(all(x==0))
>> which_cols<-unlist(lapply(X,all_zeros))
>>
>> If your data frame (or a subset) contains all numeric values, you can
>> finesse the problem like this:
>>
>> which_rows<-apply(as.matrix(X),1,all_zeros)
>>
>> What you get from lapply is a list of logical (TRUE/FALSE) values, so
>> it has to be unlisted to get a vector of logical values like the one
>> you get from "apply".
>>
>> You can then use that vector to index (subset) the original data frame
>> by logically inverting it with ! (NOT):
>>
>> X[,!which_cols]
>> X[!which_rows,]
>>
>> Your "outliers" look suspiciously like missing values from certain
>> statistical packages. If you know the values you are looking for, you
>> can do something like:
>>
>> NA99999<-X==99999
>>
>> and then "remove" them by replacing those values with NA:
>>
>> X[NA99999]<-NA
>>
>> Be aware that all these hackles (diminutive of hacks) are pretty
>> specific to this example. Also remember that if this is homework, your
>> karma has just gone down the cosmic sinkhole.
>>
>> Jim
>>
>>
>> On Thu, Mar 31, 2016 at 9:56 AM, Norman Pat <normanmath1 at gmail.com> wrote:
>> > Hi team
>> >
>> > I am new to R so please help me to do this task.
>> >
>> > Please find the attached data sample. But in the original data frame I
>> > have 350 features and 400000 observations.
>> >
>> > I need to carry out these tasks.
>> >
>> > 1. How to identify features (names) that have all zeros?
>> >
>> > 2. How to remove features that have all zeros from the dataset?
>> >
>> > 3. How to identify features (names) that have outliers such as 99999, -1
>> > in the data frame?
>> >
>> > 4. How to remove outliers?
>> >
>> >
>> > Many thanks
>> > ______________________________________________
>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>
>