[R] Need some suggestions for outlier detection in a matrix

arun smartpink111 at yahoo.com
Wed Jan 15 19:07:32 CET 2014


Hi Vivek,

chisq.out.test(as.numeric(mat1[1,]))$alternative
#[1] "highest value 3516 is an outlier"
as.numeric(gsub("[[:alpha:]]","",chisq.out.test(as.numeric(mat1[1,]))$alternative))
#[1] 3516

# gsub() removes the alphabetic characters so that only the number remains.
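
To see what the pattern does step by step, here is a small sketch in base R that uses the alternative string shown above. The names alt_text, stripped and value are made up for this illustration. [[:alpha:]] is the POSIX character class for "any letter", so gsub() replaces every letter with an empty string:

```r
# The alternative string for the first row (copied from the output above)
alt_text <- "highest value 3516 is an outlier"

# [[:alpha:]] matches any single letter; gsub() deletes every match
stripped <- gsub("[[:alpha:]]", "", alt_text)

# Only the digits and the original spaces are left;
# as.numeric() ignores the leading and trailing whitespace
value <- as.numeric(stripped)
value
```

One caveat with this approach: if the number inside the string were written in scientific notation (for example "1e+05"), stripping the letters would also remove the "e" and the result would no longer parse as that number.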

Also, remember that this is only the alternative hypothesis.  If you want to subset the outliers, you have to compare the p-value with a cut-off alpha.  Taking a cut-off of 0.15 (as none of the values are < 0.05):


mat2 <- cbind(mat1, t(apply(mat1, 1, function(x) {
  test <- chisq.out.test(as.numeric(x))
  # value named in the alternative hypothesis
  possible_outLier <- as.numeric(gsub("[[:alpha:]]", "", test$alternative))
  Pval <- test$p.value
  # keep the value only when the p-value is available and below the cut-off
  outLier <- if (!is.na(Pval) && Pval < 0.15) possible_outLier else NA
  c(Possible_outLier = possible_outLier, Pval = Pval, outLier = outLier)
})))
sum(!is.na(mat2[, "outLier"]))
#[1] 5208

head(mat2,6)
            Sample_118z.0 Sample_132z.0 Sample_141z.0 Sample_183z.0
XLOC_000001           626          3516          1277           770
XLOC_000002            82           342           185            72
XLOC_000003           361          2000           867           438
XLOC_000004            30           143            67            37
XLOC_000010             1             7             5             3
XLOC_000011            10            63            19            15
            Possible_outLier      Pval outLier
XLOC_000001             3516 0.1423296    3516
XLOC_000002              342 0.1707215      NA
XLOC_000003             2000 0.1517236      NA
XLOC_000004              143 0.1538803      NA
XLOC_000010                7 0.2452781      NA
XLOC_000011               63 0.1381038      63
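
If you then want only the rows that pass the cut-off, here is a minimal sketch. It rebuilds the last columns of the six rows above by hand so that it runs without the data file; res, alpha and flagged are names made up for this example.

```r
# Possible outlier and p-value for the six rows shown above (values copied from the output)
res <- cbind(
  Possible_outLier = c(3516, 342, 2000, 143, 7, 63),
  Pval = c(0.1423296, 0.1707215, 0.1517236, 0.1538803, 0.2452781, 0.1381038)
)
rownames(res) <- c("XLOC_000001", "XLOC_000002", "XLOC_000003",
                   "XLOC_000004", "XLOC_000010", "XLOC_000011")

alpha <- 0.15  # same cut-off as above

# TRUE for rows whose p-value is available and below the cut-off
flagged <- !is.na(res[, "Pval"]) & res[, "Pval"] < alpha

# drop = FALSE keeps the result a matrix even if only one row is selected
res[flagged, , drop = FALSE]
```

On the full matrix the same idea should be mat2[!is.na(mat2[,"outLier"]), , drop=FALSE], since outLier is NA for every row that did not pass the cut-off.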
A.K.

On Wednesday, January 15, 2014 12:15 PM, Vivek Das <vd4mmind at gmail.com> wrote:

Thanks a lot Arun, 

I understood the function, but I do not understand what pattern matching is happening in gsub("[[:alpha:]]","",test$alternative)


What is the alpha doing here? Can you please let me know why you used :alpha: as the pattern in gsub?



----------------------------------------------------------

Vivek Das


On Wed, Jan 15, 2014 at 5:33 PM, arun <smartpink111 at yahoo.com> wrote:

Hi,
>Try:
>dat1 <- read.table("ZvsPGRT_frag_0filt.txt",sep="\t",header=TRUE,row.names=1)
>dat_Z <- dat1[,1:4] ## unnecessary to do cbind() here
>mat1 <- as.matrix(dat_Z)
> head(mat1,2)
>#            Sample_118z.0 Sample_132z.0 Sample_141z.0 Sample_183z.0
>#XLOC_000001           626          3516          1277           770
>#XLOC_000002            82           342           185            72
>library(outliers)
>ctest_mat1 <- t(apply(mat1, 1, function(x) {
>  test <- chisq.out.test(as.numeric(x))
>  c(outLier = as.numeric(gsub("[[:alpha:]]", "", test$alternative)), Pval = test$p.value)
>}))
>mat2 <- cbind(mat1, ctest_mat1)
>head(mat2,2)
>#            Sample_118z.0 Sample_132z.0 Sample_141z.0 Sample_183z.0 outLier
>#XLOC_000001           626          3516          1277           770    3516
>#XLOC_000002            82           342           185            72     342
>#                 Pval
>#XLOC_000001 0.1423296
>#XLOC_000002 0.1707215
>
>
>A.K.
>
>
>
>
>
>On Wednesday, January 15, 2014 7:12 AM, Vivek Das <vd4mmind at gmail.com> wrote:
>
>Hi Arun,
>
>I was wondering how to use the package ‘outliers’, which can help me identify outliers in each row. I have a matrix with row names in the first column and values in the next 4 columns. For each row I want to find the outlier and its test statistic. The package has a test, chisq.out.test, that performs a chi-squared test for detection of one outlier in a vector. I want to apply this to my matrix and find, for each row, which value is the outlier and the p-value associated with it. I was using the code below:
>
>
>data<-read.table("my_file.txt",sep='\t', header=TRUE)
>## Selecting only the centers
>data_Z<-cbind(data[,1:5])
>mat1<- as.matrix(data_Z[,2:5])
>row.names(mat1)<- data_Z[,1]
>head(mat1)
>
>            Sample_118z.0 Sample_132z.0 Sample_141z.0 Sample_183z.0
>XLOC_000001           626          3516          1277           770
>XLOC_000002            82           342           185            72
>XLOC_000003           361          2000           867           438
>XLOC_000004            30           143            67            37
>XLOC_000010             1             7             5             3
>XLOC_000011            10            63            19            15
>
>ctest_mat1<-c()
>
>for (i in 1:length(mat1[,1]))
>{
>ctest_mat1<-c(ctest_mat1,chisq.out.test(as.numeric(mat1[i,])))
>
>}
>
>But this does not give me the outlier for each row. Ideally it should, but when I try to combine it with the matrix mat1 using the command below, I get this warning:
>
>res <-cbind(mat1,ctest_mat1)
>Warning message:
>In .Method(..., deparse.level = deparse.level) :
>  number of rows of result is not a multiple of vector length (arg 2)
>
>I want a matrix containing mat1 plus columns saying, for each row, which value is the outlier and the p-value associated with it. When I run
>
>head(ctest_mat1)
>$statistic
>X-squared
> 2.152591
>
>$alternative
>[1] "highest value 3516 is an outlier"
>
>$p.value
>[1] 0.1423296
>
>$method
>[1] "chi-squared test for outlier"
>
>$data.name
>[1] "as.numeric(mat1[i, ])"
>
>$statistic
>X-squared
> 1.876596
>
>I get only the above output for the first row. I want it as a matrix for all the rows, combined with my mat1, so that I can then evaluate it. Can you help me with that? I am also attaching the matrix. I hope you understand my point.
>
>
>
>----------------------------------------------------------
>
>Vivek Das
>



