[R] Selecting Variables

jim holtman jholtman at gmail.com
Tue Aug 5 20:07:00 CEST 2008


I think that you have to be a little more explicit with a description
of your data.  I am not clear as to what this means:

> There are lots of variables between each exposure and the values are nominal
> with upto 6 values..

Can you provide a more complete description?  How many Exposure
columns are there in your data?  How many unique IDs?  Depending on
the answers, you can probably read in a portion of your 5GB database
at a time, summarize each portion, and then aggregate the summaries at
the end, since I would expect the length of the aggregated data to be
just the number of unique IDs.
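
As a rough sketch of that chunked approach (not code from this thread, and
untested on anything near 5GB): the function below reads a file a block of
rows at a time, counts the non-missing Exposure_* entries per row, and
accumulates the counts per ID. The names count_exposures and chunk_rows are
hypothetical, and the sketch assumes a whitespace-delimited file with a header
line, an ID column, and "-" marking a missing exposure, as in the small
example quoted below.

```r
# Hypothetical helper: per-ID count of non-missing Exposure_* entries,
# reading the file chunk_rows rows at a time instead of all at once.
count_exposures <- function(path, chunk_rows = 100000, na_strings = "-") {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Read the header line once to get the column names.
  cols <- scan(con, what = "", nlines = 1, quiet = TRUE)
  exp_cols <- grep("^Exposure_", cols, value = TRUE)

  totals <- numeric(0)  # running counts, named by ID
  repeat {
    chunk <- tryCatch(
      read.table(con, header = FALSE, col.names = cols, nrows = chunk_rows,
                 na.strings = na_strings, colClasses = "character"),
      # read.table() raises an error once no lines remain; note that this
      # handler also hides any other read error, which a real script
      # should report instead.
      error = function(e) NULL
    )
    if (is.null(chunk) || nrow(chunk) == 0) break

    # Non-missing exposures in each row, then summed per ID within the chunk.
    n <- rowSums(!is.na(chunk[, exp_cols, drop = FALSE]))
    part <- tapply(n, chunk$ID, sum)

    # Fold the chunk's counts into the running totals.
    ids <- names(part)
    prev <- totals[ids]
    prev[is.na(prev)] <- 0
    totals[ids] <- prev + as.vector(part)

    if (nrow(chunk) < chunk_rows) break
  }

  data.frame(ID = names(totals), Unique_Exposure = as.vector(totals),
             stringsAsFactors = FALSE)
}

# Usage on the three-row example, forcing two chunks.
tmp <- tempfile()
writeLines(c("ID Exposure_1 Exposure_2 Exposure_3",
             "1 y y y",
             "2 y y -",
             "3 y - -"), tmp)
res <- count_exposures(tmp, chunk_rows = 2)
print(res)
unlink(tmp)
```

Only one chunk plus a vector with one entry per unique ID is held in memory
at a time, which is why the number of unique IDs matters for whether this is
practical.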



On Tue, Aug 5, 2008 at 11:54 AM, Michael Pearmain <mpearmain at google.com> wrote:
> Thanks for the help guys,
>
> i think i needed to be a bit more explicit however (sorry)
>
> There are lots of variables between each exposure and the values are nominal
> with upto 6 values..
> And to add to the problem the datasets i deal with range from anything upto
> 5G.
>
> My guess is that the melt function would be inefficient in this situation.
>
> I was looking at the agrep function to count the number Exposures in the
> names() , i wasn't sure of how to count if there was a value in each one but
> the y[complete.cases(y),] looks like a nice function.
>
> Is this a good path to follow?
>
>
>
>
> On Tue, Aug 5, 2008 at 3:09 PM, jim holtman <jholtman at gmail.com> wrote:
>>
>> I am not sure where the "Max" comes from, but this might be a start for
>> you:
>>
>> > x <- read.table(textConnection("ID Exposure_1 Exposure_2 Exposure_3
>> + 1        y                 y                y
>> + 2        y                 y                -
>> + 3        y                 -                 -"), header=TRUE, na.strings='-')
>> > closeAllConnections()
>> > require(reshape)
>> > y <- melt(x, id.var='ID')
>> > # get rid of NAs
>> > y <- y[complete.cases(y),]
>> > y
>>  ID   variable value
>> 1  1 Exposure_1     y
>> 2  2 Exposure_1     y
>> 3  3 Exposure_1     y
>> 4  1 Exposure_2     y
>> 5  2 Exposure_2     y
>> 7  1 Exposure_3     y
>> > cbind(Unique=tapply(y$ID, y$ID, length))
>>  Unique
>> 1      3
>> 2      2
>> 3      1
>> >
>>
>>
>> On Tue, Aug 5, 2008 at 9:21 AM, Michael Pearmain <mpearmain at google.com> wrote:
>> > Hi All,
>> >
>> > i have a dataset that i want to dynamically inspect for the number of
>> > variables that start with "Exposure_"  and then for these count the
>> > entries
>> > across each case i.e
>> >
>> > ID Exposure_1 Exposure_2 Exposure_3
>> > 1        y                 y                y
>> > 2        y                 y                -
>> > 3        y                 -                 -
>> >
>> > So the corresponding new variables that would be created are
>> >
>> > ID Max_Exposure Unique_Exposure
>> > 1           3                       3
>> > 2           3                       2
>> > 3           3                       1
>> >
>> > I know this may seem fairly basic but it will give me the starting point
>> > to
>> > develop more advanced things with loop and nat lang
>> >
>> > Thanks in advance
>> >
>> > Mike
>> >
>> >
>>
>>
>>
>> --
>> Jim Holtman
>> Cincinnati, OH
>> +1 513 646 9390
>>
>> What is the problem that you are trying to solve?
>
>
>
> --
> Michael Pearmain
> Senior Statistical Analyst
>
>
> 1st Floor, 180 Great Portland St. London W1W 5QZ
> t +44 (0) 2032191684
> mpearmain at google.com
> mpearmain at doubleclick.com
>
>
> Doubleclick is a part of the Google group of companies
>
> "If you received this communication by mistake, please don't forward it to
> anyone else (it may contain confidential or privileged information), please
> erase all copies of it, including all attachments, and please let the sender
> know it went to the wrong person. Thanks."
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?


