[R] subsetting a data.frame based on a specific group of columns

Boris Steipe boris.steipe at utoronto.ca
Fri Nov 6 16:45:18 CET 2015


Please learn to use dput() to post example data.
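
In case it helps, here is a minimal sketch of what that looks like. The object name counts is made up for illustration, and I am assuming the data was read in from whitespace-separated text; dput() then prints an R expression that others can paste straight into their session.

```r
# Hypothetical example data, read from text the way a poster might have it
counts <- read.table(text = "X1 X2 X3 Y1 Y2 Y3
1232 357 23 0 9871 72
0 71 9 811 795 743
43 919 1111 0 76 14", header = TRUE)

# dput() prints an R expression that recreates the object;
# paste that printed expression into your message
dput(counts)

# Round trip: evaluating the printed expression restores the object
restored <- eval(parse(text = capture.output(dput(counts))))
all.equal(restored, counts)
```

Readers can then assign the pasted expression to a variable and work with exactly the same object, column types included.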

# This is your data:
data <- structure(c(1232, 0, 43, 357, 71, 919, 23, 9, 1111, 0, 811, 0, 
9871, 795, 76, 72, 743, 14), .Dim = c(3L, 6L), .Dimnames = list(
    NULL, c("X1", "X2", "X3", "Y1", "Y2", "Y3")))

data

# define groups and threshold explicitly
groupA <- c(1, 2, 3)
groupB <- c(4, 5, 6)
thrsh  <- 100


# Here's how you evaluate your condition on the member elements of your group
# (a row's values are kept if the group sum is at least the threshold)
rowSums(data[ , groupA]) >= thrsh

# note that you can convert a logical TRUE/FALSE into a numeric 1/0
as.numeric(rowSums(data[ , groupA]) >= thrsh)

# ... which you can multiply with your data (*)
data[ , groupA] * as.numeric(rowSums(data[ , groupA]) >= thrsh)

# now you could write this into your matrix
data[ , groupA] <- data[ , groupA] * as.numeric(rowSums(data[ , groupA]) >= thrsh)
# data[ , groupB] etc ... 

data

# ... but you would be repeating code, therefore better to write this
# as a function:

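clearReadsBelowThreshold <- function(m, g, t) {
    # m: matrix of counts, g: column indices of one group, t: threshold
    # zero a row's group values when their sum is below t
    # (drop = FALSE keeps this working for a single-column group)
    keep <- rowSums(m[ , g, drop = FALSE]) >= t
    m[ , g] <- m[ , g, drop = FALSE] * as.numeric(keep)
    return(m)
}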

data <- clearReadsBelowThreshold(data, groupA, thrsh)
data <- clearReadsBelowThreshold(data, groupB, thrsh)

data
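
If you have more than two groups, or groups with names like "control" and "dKo", you probably don't want to type the column indices by hand. Here is a rough, self-contained sketch of one way to do it. The column names, the helper clearBelow() and the rule "group label = column name without its trailing _number" are all assumptions for illustration, so adapt the pattern to your real names.

```r
# Made-up counts with hypothetical column names
counts <- matrix(c(1232, 0, 43, 357, 71, 919, 0, 811, 0, 9871, 795, 76),
                 nrow = 3,
                 dimnames = list(NULL,
                                 c("control_1", "control_2", "dKo_1", "dKo_2")))

# Hypothetical helper: zero a row's values in columns g when their sum < minSum
clearBelow <- function(m, g, minSum) {
    keep <- rowSums(m[ , g, drop = FALSE]) >= minSum
    m[ , g] <- m[ , g, drop = FALSE] * as.numeric(keep)
    m
}

# Assumed naming rule: strip the trailing "_<number>" to get the group label
labels <- sub("_[0-9]+$", "", colnames(counts))
groups <- split(seq_len(ncol(counts)), labels)

# Apply the same rule to every group in turn
for (g in groups) {
    counts <- clearBelow(counts, g, 100)
}

counts
```

The loop reuses one function for every group, so adding a new condition only means adding columns that follow the naming rule.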




(*) Note that R would do this conversion implicitly but omitting
    the conversion will cause confusion for those who read the code
    later. 
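
    To see the implicit conversion for yourself, here is a small sketch with made-up values (m and keep are hypothetical names, not part of the data above). Both products should come out the same; the explicit version only makes the intent visible.

```r
m    <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 3)  # made-up 3 x 2 matrix
keep <- c(TRUE, FALSE, TRUE)                         # made-up logical vector

implicit <- m * keep               # the arithmetic operator coerces the logical
explicit <- m * as.numeric(keep)   # the conversion is spelled out

identical(implicit, explicit)
```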



Cheers,
Boris





On Nov 6, 2015, at 8:53 AM, Assa Yeroslaviz <frymor at gmail.com> wrote:

> Sorry for the misunderstanding. Here is a more detailed description of
> what I would like to achieve.
> 
> I have a data set of counts from an RNA-Seq experiment and would like to
> filter reads with low counts. I don't want to set everything to 0
> automatically.
> 
> I would like to set each categorical group (e.g. condition) to 0 if and
> only if all replicates in the group together have fewer than 100 reads.
> In my examples I used X and Y to represent the categories. Usually they
> have more distinct names like "control", "knockout1", "dKo" etc.
> 
> So what I would really like to do is check whether the sum of all the
> "control" samples is lower than 100. If so, set all control samples to 0.
> I would like to check this *for each category* in every row of the data set.
> 
> I hope this is clearer now.
> 
> thanks
> Assa
> 
> 
> On Fri, Nov 6, 2015 at 2:29 PM, jim holtman <jholtman at gmail.com> wrote:
> 
>> Is this what you want:
>> 
>>> x <- read.table(text = "X1    X2    X3    Y1    Y2    Y3
>> + 1232    357    23    0    9871    72
>> + 0    71    9    811    795    743
>> + 43    919    1111    0    76    14", header = TRUE)
>>> x
>>    X1  X2   X3  Y1   Y2  Y3
>> 1 1232 357   23   0 9871  72
>> 2    0  71    9 811  795 743
>> 3   43 919 1111   0   76  14
>>> 
>>> # create indices of columns that start with the same character
>>> indx <- split(seq(ncol(x)), substring(colnames(x), 1, 1))
>>> names(indx) <- NULL  # remove names so output not messed up
>>> 
>>> result <- lapply(indx, function(a){
>> +     row_sum <- rowSums(x[, a])
>> +     x[row_sum < 100, a] <- 0
>> +     x[, a]
>> + })
>>> # combine back together
>>> do.call(cbind, result)
>>    X1  X2   X3  Y1   Y2  Y3
>> 1 1232 357   23   0 9871  72
>> 2    0   0    0 811  795 743
>> 3   43 919 1111   0    0   0
>> 
>> 
>> Jim Holtman
>> Data Munger Guru
>> 
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>> 
>> On Fri, Nov 6, 2015 at 5:40 AM, Assa Yeroslaviz <frymor at gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> I have a data frame with multiple columns that belong to several
>>> groups, like this:
>>> X1    X2    X3    Y1    Y2    Y3
>>> 1232    357    23    0    9871    72
>>> 0    71    9    811    795    743
>>> 43    919    1111    0    76    14
>>> 
>>> I would like to filter rows in which the sum within one group is lower
>>> than a specific value. For example, I would like to set all the values in a
>>> group of columns to zero if the sum in that group is less than 100.
>>> In my example table I would like to set the values in the second row for
>>> the three X-columns to 0, so that the table looks like this:
>>> 
>>> X1    X2    X3    Y1    Y2    Y3
>>> 1232    357    23    0    9871    72
>>> 0    0    0    811    795    743
>>> 43    919    1111    0    0    0
>>> 
>>> The same applies to the Y-values in the last row.
>>> Is there a more efficient way of doing this than going row by row and using
>>> the apply function on each of the subgroups I have in the columns?
>>> 
>>> thanks
>>> Assa
>>> 
>>>        [[alternative HTML version deleted]]
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>> 
>> 
>> 
> 


