[R] Data manipulation problem

Wed Apr 7 14:54:05 CEST 2010

So.. here we try again.

##generate dataset
age.cat<-seq(0,100,10)
year<-(1953:(1953+55))
data.vec<-sample(10000:10000,(age.cat*year))
data.matrix<-matrix(data.vec,c(length(age.cat),length(year))
rownames(data.matrix)<-age.cat
colnames(data.matrix)<-year

##divide into 5 year periods
age.div<-cut(year,seq(1950,2010,6),include.lowest=T) ##interval is
beyond my datainterval so I doubt the include.lowest matters

Now what I'd like to do is summarise the rows within the 5-year intervals.

I did read about apply in its different variants and Dahlgaard, but I
do not know understand how it could be applied in this setting.

I tried making an array and summarise by that (used the vector and
applied it into a
length(age.cat)*max(vector(table(age.div)*length(age.div) array. It
worked but required a bit of tweaking (inserting null columns) and I
find myself in this situation quite often whereby I need to add
multiple columns based on another vector so I'd be very interested in
another more general approach.

//M

On Tue, Apr 6, 2010 at 9:41 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Apr 6, 2010, at 3:30 PM, David Winsemius wrote:
>
>>
>> On Apr 6, 2010, at 9:56 AM, moleps islon wrote:
>>
>>> OK... next question.. Which is still a data manipulation problem so I
>>> believe the heading is still OK.
>>>
>>> ##So now I read my population data from excel.
>>
>> No, you read it from a text file and providing the first ten lines of that
>> text file should have been really easy. Read the Posting Guide for advice
>> about offering datasets either as structure() objects with dput or dump or
>> as attached files with "*.txt" extension (not .csv). Just change the file
>> name with your file browser.
>>
>>> pop<-read.csv("pop.csv")
>>>
>>> typeof(pop) ## yields a list
>>
>> Really? I would have guessed it to yield just "list".
>>
>>> where I have age-specific population rows
>>> and a yearly column population, where the years are suffixed by X
>>
>> And had you used class(pop) you would have learned it was a dataframe and
>> even more informative would have been str(pop).
>>>
>>> c<-(1953:2008)
>>
>> No, no, no. Do not use variable names that are important function names.
>> The R interpreter can (usually) keep things straight but it is our brains
>> that experience problems.  Other  function names to avoid: data, df, cut,
>> mean, sd, list, vector, matrix
>>
>>> names(pop)<-c
>>> c.div<-cut(c,break=seq(1950,2010,by=5)
>>
>> (You should have gotten an error here.) After fixing the error, did you
>> you notice that there were only 3 of the first level???
>>
>> Watch out for cut(). It uses the default convention of ( , ] , i.e. open
>> interval at right
>
>                                                                      er,
>        ^left^
>
>> which is backwards to what some (most?) of us think natural. Because of
>> that the lowest level gets dropped unless you take special precautions.
>>  That is undoubtedly why Harrell set up his Hmisc::cut2 to have the default
>> be [ , )
>>
>> Aggregating across columns? Certainly possible, but maybe not as natural a
>> fit to functions like split as would occur with working across rows. I
>> suppose you could use something like this untested (because _still_ no
>> sample dataset provided) code:
>>
>> apply(pop, 1,    # this works a row a time
>>   function(x) tapply(x, list(c.div), sum) ) )  # or use aggregate which
>> uses tapply
>>
>> I'm not sure it will work, since I don't know if the column names would
>> get carried over into "x" by apply(). You might need to create a separate
>> index that used the numeric positions of the columns rather than their
>> names. Perhaps use c.div <-  seq(0,(2008-1953)) %/% 5  or some such inside
>> tapply.
>>
>>>
>>> Now I'd like to sum the agespecific population over the individual
>>> levels of -c.div- and generate a new table for this with agespecific
>>> rows and columns containing the 5-year bins instead of the original
>>> yearly data. Do I have to program this from scratch or is it possible
>>> to use an already existing function?
>>
>> I think you ought to read more introductory material (and the Posting
>> Guide regarding how to offer example datasets). In this case there are many
>> functions that do data aggregation and most of them should be illustrated in
>> a good introductory text.
>>
>> --
>> David.
>>>
>>>
>>> //M
>>>
>>> qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest =
>>> TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE
>>>
>>> On Mon, Apr 5, 2010 at 10:11 PM, moleps <moleps2 at gmail.com> wrote:
>>>>
>>>> Thx Erik,
>>>> I have no idea what went wrong with the other code snippet, but this one
>>>> works.. Appreciate it.
>>>>
>>>> qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest =
>>>> TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE))
>>>>
>>>> M
>>>>
>>>>
>>>> On 5. apr. 2010, at 21.45, Erik Iverson wrote:
>>>>
>>>>> I don't know what your data are like, since you haven't given a
>>>>> reproducible example. I was imagining something like:
>>>>>
>>>>> ## generate fake data
>>>>> age <- sample(20:90, 100, replace = TRUE)
>>>>> year <- sample(1950:2000, 100, replace = TRUE)
>>>>>
>>>>> ##look at big table
>>>>> table(age, year)
>>>>>
>>>>> ## categorize data
>>>>> ## see include.lowest and right arguments to cut
>>>>> age.factor <- cut(age, breaks = seq(20, 90, by = 10),
>>>>>               include.lowest = TRUE)
>>>>>
>>>>> year.factor <- cut(year, breaks = seq(1950, 2000, by = 10),
>>>>>                include.lowest = TRUE)
>>>>>
>>>>> table(age.factor, year.factor)
>>>>>
>>>>> moleps wrote:
>>>>>>
>>>>>> I already did try the regression modeling approach. However the
>>>>>> epidemiologists (referee) turns out to be quite fond of comparing the
>>>>>> incidence rates to different standard populations, hence the need for this
>>>>>> labourius approach. And trying the "cutting" approach I ended up with :
>>>>>>>
>>>>>>> table (age5)
>>>>>>
>>>>>> age5
>>>>>> (0,5]   (5,10]  (10,15]  (15,20]  (20,25]  (25,30]  (30,35]  (35,40]
>>>>>>  (40,45]  (45,50]  (50,55]  (55,60]  (60,65]  (65,70] (70,75]  (75,80]
>>>>>>  (80,85] (85,100]       35       34       33       47       51      109
>>>>>>  157      231      362      511    745      926     1002      866      547
>>>>>>    247       82       18
>>>>>>>
>>>>>>> table (yr5)
>>>>>>
>>>>>> yr5
>>>>>> (1950,1955] (1955,1960] (1960,1965] (1965,1970] (1970,1975]
>>>>>> (1975,1980] (1980,1985] (1985,1990] (1990,1995] (1995,2000] (2000,2005]
>>>>>> (2005,2009]           3           5           5           5           5
>>>>>>       5           5           5         5           5           5
>>>>>> 3
>>>>>>>
>>>>>>> table (yr5,age5)
>>>>>>
>>>>>> Error in table(yr5, age5) : all arguments must have the same length
>>>>>> Sincerely,
>>>>>> M
>>>>>> On 5. apr. 2010, at 20.59, Bert Gunter wrote:
>>>>>>>
>>>>>>> You have tempted, and being weak, I yield to temptation:
>>>>>>>
>>>>>>> "Any good ideas?"
>>>>>>>
>>>>>>> Yes. Don't do this.
>>>>>>>
>>>>>>> (what you probably really want to do is fit a model with age as a
>>>>>>> factor,
>>>>>>> which can be done statistically e.g. by logistic regression; or
>>>>>>> graphically
>>>>>>> using conditioning plots, e.g. via trellis graphics (the lattice
>>>>>>> package).
>>>>>>> This avoids the arbitrariness and discontinuities of binning by age
>>>>>>> range.)
>>>>>>>
>>>>>>> Bert Gunter
>>>>>>> Genentech Nonclinical Biostatistics
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: r-help-bounces at r-project.org
>>>>>>> [mailto:r-help-bounces at r-project.org] On
>>>>>>> Behalf Of moleps
>>>>>>> Sent: Monday, April 05, 2010 11:46 AM
>>>>>>> To: r-help at r-project.org
>>>>>>> Subject: [R] Data manipulation problem
>>>>>>>
>>>>>>> Dear R´ers.
>>>>>>>
>>>>>>> I´ve got a dataset with age and year of diagnosis. In order to
>>>>>>> age-standardize the incidence I need to transform the data into a
>>>>>>> matrix
>>>>>>> with age-groups (divided in 5 or 10 years) along one axis and year
>>>>>>> divided
>>>>>>> into 5 years along the other axis. Each cell should contain the
>>>>>>> number of
>>>>>>> cases for that age group and for that period.
>>>>>>> I.e.
>>>>>>> My data format now is
>>>>>>> ID-age (to one decimal)-year(yearly data).
>>>>>>>
>>>>>>> What I´d like is
>>>>>>>
>>>>>>> age 1960-1965 1966-1970 etc...
>>>>>>> 0-5 3 8 10 15
>>>>>>> 6-10 2 5 8 13
>>>>>>> etc..
>>>>>>>
>>>>>>>
>>>>>>> Any good ideas?
>>>>>>>
>>>>>>> Regards,
>>>>>>> M
>>>
>>
>
> David Winsemius, MD
> West Hartford, CT
>
>

On Tue, Apr 6, 2010 at 9:41 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Apr 6, 2010, at 3:30 PM, David Winsemius wrote:
>
>>
>> On Apr 6, 2010, at 9:56 AM, moleps islon wrote:
>>
>>> OK... next question.. Which is still a data manipulation problem so I
>>> believe the heading is still OK.
>>>
>>> ##So now I read my population data from excel.
>>
>> No, you read it from a text file and providing the first ten lines of that
>> text file should have been really easy. Read the Posting Guide for advice
>> about offering datasets either as structure() objects with dput or dump or
>> as attached files with "*.txt" extension (not .csv). Just change the file
>> name with your file browser.
>>
>>> pop<-read.csv("pop.csv")
>>>
>>> typeof(pop) ## yields a list
>>
>> Really? I would have guessed it to yield just "list".
>>
>>> where I have age-specific population rows
>>> and a yearly column population, where the years are suffixed by X
>>
>> And had you used class(pop) you would have learned it was a dataframe and
>> even more informative would have been str(pop).
>>>
>>> c<-(1953:2008)
>>
>> No, no, no. Do not use variable names that are important function names.
>> The R interpreter can (usually) keep things straight but it is our brains
>> that experience problems.  Other  function names to avoid: data, df, cut,
>> mean, sd, list, vector, matrix
>>
>>> names(pop)<-c
>>> c.div<-cut(c,break=seq(1950,2010,by=5)
>>
>> (You should have gotten an error here.) After fixing the error, did you
>> you notice that there were only 3 of the first level???
>>
>> Watch out for cut(). It uses the default convention of ( , ] , i.e. open
>> interval at right
>
>                                                                      er,
>        ^left^
>
>> which is backwards to what some (most?) of us think natural. Because of
>> that the lowest level gets dropped unless you take special precautions.
>>  That is undoubtedly why Harrell set up his Hmisc::cut2 to have the default
>> be [ , )
>>
>> Aggregating across columns? Certainly possible, but maybe not as natural a
>> fit to functions like split as would occur with working across rows. I
>> suppose you could use something like this untested (because _still_ no
>> sample dataset provided) code:
>>
>> apply(pop, 1,    # this works a row a time
>>   function(x) tapply(x, list(c.div), sum) ) )  # or use aggregate which
>> uses tapply
>>
>> I'm not sure it will work, since I don't know if the column names would
>> get carried over into "x" by apply(). You might need to create a separate
>> index that used the numeric positions of the columns rather than their
>> names. Perhaps use c.div <-  seq(0,(2008-1953)) %/% 5  or some such inside
>> tapply.
>>
>>>
>>> Now I'd like to sum the agespecific population over the individual
>>> levels of -c.div- and generate a new table for this with agespecific
>>> rows and columns containing the 5-year bins instead of the original
>>> yearly data. Do I have to program this from scratch or is it possible
>>> to use an already existing function?
>>
>> I think you ought to read more introductory material (and the Posting
>> Guide regarding how to offer example datasets). In this case there are many
>> functions that do data aggregation and most of them should be illustrated in
>> a good introductory text.
>>
>> --
>> David.
>>>
>>>
>>> //M
>>>
>>> qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest =
>>> TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE
>>>
>>> On Mon, Apr 5, 2010 at 10:11 PM, moleps <moleps2 at gmail.com> wrote:
>>>>
>>>> Thx Erik,
>>>> I have no idea what went wrong with the other code snippet, but this one
>>>> works.. Appreciate it.
>>>>
>>>> qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest =
>>>> TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE))
>>>>
>>>> M
>>>>
>>>>
>>>> On 5. apr. 2010, at 21.45, Erik Iverson wrote:
>>>>
>>>>> I don't know what your data are like, since you haven't given a
>>>>> reproducible example. I was imagining something like:
>>>>>
>>>>> ## generate fake data
>>>>> age <- sample(20:90, 100, replace = TRUE)
>>>>> year <- sample(1950:2000, 100, replace = TRUE)
>>>>>
>>>>> ##look at big table
>>>>> table(age, year)
>>>>>
>>>>> ## categorize data
>>>>> ## see include.lowest and right arguments to cut
>>>>> age.factor <- cut(age, breaks = seq(20, 90, by = 10),
>>>>>               include.lowest = TRUE)
>>>>>
>>>>> year.factor <- cut(year, breaks = seq(1950, 2000, by = 10),
>>>>>                include.lowest = TRUE)
>>>>>
>>>>> table(age.factor, year.factor)
>>>>>
>>>>> moleps wrote:
>>>>>>
>>>>>> I already did try the regression modeling approach. However the
>>>>>> epidemiologists (referee) turns out to be quite fond of comparing the
>>>>>> incidence rates to different standard populations, hence the need for this
>>>>>> labourius approach. And trying the "cutting" approach I ended up with :
>>>>>>>
>>>>>>> table (age5)
>>>>>>
>>>>>> age5
>>>>>> (0,5]   (5,10]  (10,15]  (15,20]  (20,25]  (25,30]  (30,35]  (35,40]
>>>>>>  (40,45]  (45,50]  (50,55]  (55,60]  (60,65]  (65,70] (70,75]  (75,80]
>>>>>>  (80,85] (85,100]       35       34       33       47       51      109
>>>>>>  157      231      362      511    745      926     1002      866      547
>>>>>>    247       82       18
>>>>>>>
>>>>>>> table (yr5)
>>>>>>
>>>>>> yr5
>>>>>> (1950,1955] (1955,1960] (1960,1965] (1965,1970] (1970,1975]
>>>>>> (1975,1980] (1980,1985] (1985,1990] (1990,1995] (1995,2000] (2000,2005]
>>>>>> (2005,2009]           3           5           5           5           5
>>>>>>       5           5           5         5           5           5
>>>>>> 3
>>>>>>>
>>>>>>> table (yr5,age5)
>>>>>>
>>>>>> Error in table(yr5, age5) : all arguments must have the same length
>>>>>> Sincerely,
>>>>>> M
>>>>>> On 5. apr. 2010, at 20.59, Bert Gunter wrote:
>>>>>>>
>>>>>>> You have tempted, and being weak, I yield to temptation:
>>>>>>>
>>>>>>> "Any good ideas?"
>>>>>>>
>>>>>>> Yes. Don't do this.
>>>>>>>
>>>>>>> (what you probably really want to do is fit a model with age as a
>>>>>>> factor,
>>>>>>> which can be done statistically e.g. by logistic regression; or
>>>>>>> graphically
>>>>>>> using conditioning plots, e.g. via trellis graphics (the lattice
>>>>>>> package).
>>>>>>> This avoids the arbitrariness and discontinuities of binning by age
>>>>>>> range.)
>>>>>>>
>>>>>>> Bert Gunter
>>>>>>> Genentech Nonclinical Biostatistics
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: r-help-bounces at r-project.org
>>>>>>> [mailto:r-help-bounces at r-project.org] On
>>>>>>> Behalf Of moleps
>>>>>>> Sent: Monday, April 05, 2010 11:46 AM
>>>>>>> To: r-help at r-project.org
>>>>>>> Subject: [R] Data manipulation problem
>>>>>>>
>>>>>>> Dear R´ers.
>>>>>>>
>>>>>>> I´ve got a dataset with age and year of diagnosis. In order to
>>>>>>> age-standardize the incidence I need to transform the data into a
>>>>>>> matrix
>>>>>>> with age-groups (divided in 5 or 10 years) along one axis and year
>>>>>>> divided
>>>>>>> into 5 years along the other axis. Each cell should contain the
>>>>>>> number of
>>>>>>> cases for that age group and for that period.
>>>>>>> I.e.
>>>>>>> My data format now is
>>>>>>> ID-age (to one decimal)-year(yearly data).
>>>>>>>
>>>>>>> What I´d like is
>>>>>>>
>>>>>>> age 1960-1965 1966-1970 etc...
>>>>>>> 0-5 3 8 10 15
>>>>>>> 6-10 2 5 8 13
>>>>>>> etc..
>>>>>>>
>>>>>>>
>>>>>>> Any good ideas?
>>>>>>>
>>>>>>> Regards,
>>>>>>> M
>>>
>>
>
> David Winsemius, MD
> West Hartford, CT
>
>