[R] Help with Hmisc, cut2, split and quantile

Tue Mar 9 07:41:53 CET 2010

On 2010-03-08 18:00, Guy Green wrote:
>
> Hi Peter & others,
>
> Thanks (Peter) - that gets me really close to what I was hoping for.
>
> The one problem I have is that the "cut" approach breaks the data into
> intervals based on the absolute value of the "Target" data, rather than
> their frequency.  In other words, if the data ranged from 0 to 50, the data
> would be separated into 0-5, 5-10 and so on, regardless of the frequency
> within those categories.  However I want to get the data into deciles.
>
> The code that does this (incorporating Peter's) is:
>
> read_data=read.table("C:/Sample table.txt", head = T)
> read_data$DEC<- with(read_data, cut(Target, breaks=10, labels=1:10))
> L<- split(read_data, read_data$DEC)
>
> This means that I can get separate data frames, such as L$'10', which comes
> out tidy, but only containing 2 data items (the sample has 63 rows, so each
> decile should have 6+ data items):
>       Actual    Target       DEC
> 9   0.572     0.3778386   10
> 31  0.299    0.3546606   10
>
> If I try to adjust this to get deciles using cut2(), I can break the data
> into deciles as follows:
>
> read_data=read.table("C:/Sample table.txt", head = T)
> read_data$DEC<- with(read_data, cut2(read_data$Target, g=10), labels=1:10)
> L<- split(read_data, read_data$DEC)
>
> However this time, while the data is broken into even data frames, the
> labels for the separate data frames are unusable, e.g.:
> $`[ 0.26477, 0.37784]`
>      Actual    Target                 DEC
> 6   0.243   0.2650960    [ 0.26477, 0.37784]
> 9   0.572   0.3778386    [ 0.26477, 0.37784]
> 10 -0.049  0.3212681    [ 0.26477, 0.37784]
> 15  0.780  0.2778518    [ 0.26477, 0.37784]
> 31  0.299  0.3546606    [ 0.26477, 0.37784]
> 33  0.105  0.2647676    [ 0.26477, 0.37784]
>
> Could anyone suggest a way of rearranging this to make the labels useable
> again?  Sample data is reattached (Sample_table.txt):
> http://n4.nabble.com/file/n1585427/Sample_table.txt

I think that the easiest way would be to relabel the levels of DEC:

  read_data$DEC <- factor(read_data$DEC, labels = 1:10)

or, since I would prefer letters as factor levels:

  read_data$DEC <- factor(read_data$DEC, labels = LETTERS[1:10])
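
A minimal sketch of what the relabelling does, on a made-up factor rather than the real Sample_table.txt (the vector `dec` and its interval strings are hypothetical stand-ins for cut2-style levels). Labels are assigned to the existing levels in order, and factor() stops with an error if the number of labels does not match the number of levels, which could happen if ties in Target leave fewer than ten groups. Using seq_len(nlevels(...)) sidesteps that:

```r
# Hypothetical factor with cut2-style interval labels (three groups only)
dec <- factor(c("[0.02,0.12)", "[0.26,0.38]", "[0.12,0.26)", "[0.26,0.38]"))
levels(dec)

# Labels are applied to the levels in their existing order,
# so the lowest interval becomes "1"
dec_num <- factor(dec, labels = seq_len(nlevels(dec)))
as.character(dec_num)         # "1" "3" "2" "3"
```

The same call with labels = LETTERS[seq_len(nlevels(dec))] would give letters instead of numbers.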

Another way would be to use cut2() with onlycuts=TRUE to get the
breaks and then use these with cut() as in my original post:

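  brks <- cut2(read_data$Target, g=10, onlycuts=TRUE)
  # right=FALSE and include.lowest=TRUE mimic cut2's [a, b) intervals;
  # with cut()'s defaults the smallest Target value would become NA
  read_data$DEC <- with(read_data,
                        cut(Target, breaks=brks, labels=1:10,
                            right=FALSE, include.lowest=TRUE))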
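
For anyone without Hmisc, a rough base-R sketch of the same idea takes the breaks from quantile() instead of cut2(). This is an illustration on made-up data (`dat` is a random stand-in with 63 rows, not the poster's file), and quantile()'s default decile boundaries need not coincide exactly with the cuts that cut2() computes:

```r
# Made-up stand-in for the poster's table (63 rows, like the sample)
set.seed(42)
dat <- data.frame(Actual = rnorm(63), Target = runif(63))

# 11 break points: the minimum, the nine interior deciles, the maximum
brks <- quantile(dat$Target, probs = seq(0, 1, by = 0.1))

# include.lowest = TRUE keeps the minimum Target value inside the first bin
dat$DEC <- cut(dat$Target, breaks = brks, labels = 1:10,
               include.lowest = TRUE)

table(dat$DEC)    # six or seven rows per decile for 63 distinct values

# Without include.lowest the minimum sits outside (min, ...] and is lost
sum(is.na(cut(dat$Target, breaks = brks)))    # 1
```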

But I still don't see why you want a list of separate data
frames. For most analyses, it's more convenient to just use the
factor variable to subset the data as needed.
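
As a sketch of what working with the factor directly can look like, again on made-up data (`dat`, `top`, `means` and `counts` are illustrative names, and the deciles here come from base quantile() rather than cut2()):

```r
# Made-up stand-in data with a decile factor DEC
set.seed(7)
dat <- data.frame(Actual = rnorm(63), Target = runif(63))
dat$DEC <- cut(dat$Target,
               breaks = quantile(dat$Target, probs = seq(0, 1, by = 0.1)),
               labels = 1:10, include.lowest = TRUE)

# Pull out one decile when needed, without keeping a list of data frames
top <- subset(dat, DEC == "10")

# Per-decile summaries in a single call
means  <- aggregate(Actual ~ DEC, data = dat, FUN = mean)
counts <- table(dat$DEC)
```

The full table stays in one piece, and any decile can be extracted on demand with subset() or plain indexing.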

  -Peter Ehlers

>
> Thanks,
> Guy
>
>
>
> Peter Ehlers wrote:
>>
>> On 2010-03-08 8:47, Guy Green wrote:
>>>
>>> Hello,
>>> I have a set of data with two columns: "Target" and "Actual".
>>> Sample_table.txt (http://n4.nabble.com/file/n1584647/Sample_table.txt)
>>> is attached but the data looks like this:
>>>
>>> Actual	Target
>>> -0.125	0.016124906
>>> 0.135		0.120799865
>>> ...		...
>>> ...		...
>>>
>>> I want to be able to break the data into tables based on quantiles in the
>>> "Target" column.  I can see (using cut2, and also quantile) how to get the
>>> barrier points between the different quantiles, and I can see how I would
>>> achieve this if I was just looking to split up a vector.  However I am
>>> trying to break up the whole table based on those quantiles, not just the
>>> vector.
>>>
>>> However I would like to be able to break the table into ten separate
>>> tables, each with both "Actual" and "Target" data, based on the "Target"
>>> data deciles:
>>>
>>> top_decile = ...(top decile of "read_data", based on Target data)
>>> next_decile = ...and so on...
>>> bottom_decile = ...
>>
>> I would just add a factor variable indicating to which decile
>> a particular observation belongs:
>>
>>    dat$DEC<- with(dat, cut(Target, breaks=10, labels=1:10))
>>
>> If you really want to have separate data frames you can then
>> split on the decile:
>>
>>    L<- split(dat, dat$DEC)
>>
>>      -Peter Ehlers
>> --
>> Peter Ehlers
>> University of Calgary
>>
>>
>

-- 
Peter Ehlers
University of Calgary