[R] Subsetting problem data, 2

arun smartpink111 at yahoo.com
Fri Jul 20 13:50:46 CEST 2012

Hi,

Just a question regarding the dataset.

Suppose I include two more patients, F and G, with different patterns of missing values, as in the new dataset below, and then run the code.
Patient  Cycle  V1  V2  V3  V4  V5
A  1  0.4  0.1  0.5  1.5  NA
A  2  0.3  0.2  0.5  1.6  NA
A  3  0.3  NA  0.6  1.7  NA
A  4  0.4  NA  0.4  1.8  NA
A  5  0.5  0.2  0.5  1.5  NA
B  1  0.4  NA  NA  NA  NA
B  2  0.4  NA  NA  NA  NA
C  1  0.9  0.9  0.9  NA  NA
C  3  0.3  0.5  0.6  NA  NA
C  4  NA  NA  NA  NA  NA
C  5  0.4  NA  NA  NA  NA
D  1  0.2  0.5  NA  NA  NA
D  2  0.5  0.7  NA  NA  NA
D  4  0.6  0.4  NA  NA  NA
D  5  0.5  0.5  NA  NA  NA
E  1  0.1  NA  NA  NA  NA
E  2  0.5  0.3  NA  NA  NA
E  3  0.4  0.3  NA  NA  NA
F  1  0.2  NA  0.2  0.5  0.1
F  2  0.5  NA  0.4  NA  0.3
F  3  0.6  NA  NA  0.3  0.2
G  1  0.2  0.5  NA  0.5  0.2
G  3  0.4  0.3  0.4  NA  0.3
G  4  0.6  0.2  0.2  0.4  NA
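
In case it helps to reproduce this, here is a minimal sketch of one way to read the table above into R. The name dat1 is the one used in the code in this message; reading from an inline string with read.table(text = ...) is only an assumption about how the data might be loaded, not how the original data were stored.

```r
# Sketch only: load the example table from an inline string.
# "dat1" matches the object name used in the code in this message.
dat1 <- read.table(text = "Patient Cycle V1 V2 V3 V4 V5
A 1 0.4 0.1 0.5 1.5 NA
A 2 0.3 0.2 0.5 1.6 NA
A 3 0.3 NA 0.6 1.7 NA
A 4 0.4 NA 0.4 1.8 NA
A 5 0.5 0.2 0.5 1.5 NA
B 1 0.4 NA NA NA NA
B 2 0.4 NA NA NA NA
C 1 0.9 0.9 0.9 NA NA
C 3 0.3 0.5 0.6 NA NA
C 4 NA NA NA NA NA
C 5 0.4 NA NA NA NA
D 1 0.2 0.5 NA NA NA
D 2 0.5 0.7 NA NA NA
D 4 0.6 0.4 NA NA NA
D 5 0.5 0.5 NA NA NA
E 1 0.1 NA NA NA NA
E 2 0.5 0.3 NA NA NA
E 3 0.4 0.3 NA NA NA
F 1 0.2 NA 0.2 0.5 0.1
F 2 0.5 NA 0.4 NA 0.3
F 3 0.6 NA NA 0.3 0.2
G 1 0.2 0.5 NA 0.5 0.2
G 3 0.4 0.3 0.4 NA 0.3
G 4 0.6 0.2 0.2 0.4 NA",
  header = TRUE, stringsAsFactors = FALSE)

# Quick check of the structure: 24 rows, 7 columns
str(dat1)
```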

nms <- names(dat1)[grep("^V[1-9]$", names(dat1))]
dd <- split(dat1, dat1$Patient)
# TRUE if a column is partly, but not entirely, missing for a patient
fun <- function(x) any(is.na(x)) && any(!is.na(x))
ix <- sapply(dd, function(x) Reduce(`|`, lapply(x[, nms], fun)))

dd[ix]
do.call(rbind, dd[ix])
     Patient Cycle  V1  V2  V3  V4  V5
A.1        A     1 0.4 0.1 0.5 1.5  NA
A.2        A     2 0.3 0.2 0.5 1.6  NA
A.3        A     3 0.3  NA 0.6 1.7  NA
A.4        A     4 0.4  NA 0.4 1.8  NA
A.5        A     5 0.5 0.2 0.5 1.5  NA
C.8        C     1 0.9 0.9 0.9  NA  NA
C.9        C     3 0.3 0.5 0.6  NA  NA
C.10       C     4  NA  NA  NA  NA  NA
C.11       C     5 0.4  NA  NA  NA  NA
E.16       E     1 0.1  NA  NA  NA  NA
E.17       E     2 0.5 0.3  NA  NA  NA
E.18       E     3 0.4 0.3  NA  NA  NA
F.19       F     1 0.2  NA 0.2 0.5 0.1
F.20       F     2 0.5  NA 0.4  NA 0.3
F.21       F     3 0.6  NA  NA 0.3 0.2
G.22       G     1 0.2 0.5  NA 0.5 0.2
G.23       G     3 0.4 0.3 0.4  NA 0.3
G.24       G     4 0.6 0.2 0.2 0.4  NA

Then patients F and G are included in the list. But according to your initial statement, V1 and V2 are the most important variables. If B is not included in the list because B has missing values in both of its cycles, do you think F or G should be included? The only difference is that F and G have missing values in other variables, and these do not behave consistently. Do you have situations like that?
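
If only V1 and V2 should decide, one possibility is to restrict the check to those two columns. This is just a sketch under that assumption; the vector name key is made up for illustration, and the data are repeated so the snippet runs on its own.

```r
# Sketch only: same example data as above, repeated to keep this self-contained.
dat1 <- read.table(text = "Patient Cycle V1 V2 V3 V4 V5
A 1 0.4 0.1 0.5 1.5 NA
A 2 0.3 0.2 0.5 1.6 NA
A 3 0.3 NA 0.6 1.7 NA
A 4 0.4 NA 0.4 1.8 NA
A 5 0.5 0.2 0.5 1.5 NA
B 1 0.4 NA NA NA NA
B 2 0.4 NA NA NA NA
C 1 0.9 0.9 0.9 NA NA
C 3 0.3 0.5 0.6 NA NA
C 4 NA NA NA NA NA
C 5 0.4 NA NA NA NA
D 1 0.2 0.5 NA NA NA
D 2 0.5 0.7 NA NA NA
D 4 0.6 0.4 NA NA NA
D 5 0.5 0.5 NA NA NA
E 1 0.1 NA NA NA NA
E 2 0.5 0.3 NA NA NA
E 3 0.4 0.3 NA NA NA
F 1 0.2 NA 0.2 0.5 0.1
F 2 0.5 NA 0.4 NA 0.3
F 3 0.6 NA NA 0.3 0.2
G 1 0.2 0.5 NA 0.5 0.2
G 3 0.4 0.3 0.4 NA 0.3
G 4 0.6 0.2 0.2 0.4 NA",
  header = TRUE, stringsAsFactors = FALSE)

# Assumption: only V1 and V2 matter for the selection ("key" is a made-up name)
key <- c("V1", "V2")

dd <- split(dat1, dat1$Patient)

# TRUE if a column is partly, but not entirely, missing for a patient
fun <- function(x) any(is.na(x)) && any(!is.na(x))

# Flag a patient if either key column is partly missing
ix <- sapply(dd, function(x) any(sapply(x[, key, drop = FALSE], fun)))

names(dd)[ix]
do.call(rbind, dd[ix])
```

With this data, only A, C and E are selected: F drops out because V2 is missing in all of its cycles, and G because V1 and V2 are complete.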

A.K.

From: Lib Gray <libgray3827 at gmail.com>
I'm still getting the message (if this is what you were suggesting I try).
The data set I'm using has many more columns other than these variables;
could that be a problem? I didn't think it would affect it.

> pattern <- "L[1-8][12]"
> nms <- names(data)[grep(vars, names(data))]
