[R] Data manipulation problem

David Winsemius dwinsemius at comcast.net
Wed Apr 7 15:11:20 CEST 2010


That code throws multiple errors. Can you at least test your code  
before posting?

(And, again, please avoid using function names as names for your  
objects.)

-- David.

On Apr 7, 2010, at 8:54 AM, moleps islon wrote:

> So.. here we try again.
>
> ##generate dataset
> age.cat<-seq(0,100,10)
> year<-(1953:(1953+55))
> data.vec<-sample(10000:10000,(age.cat*year))
> data.matrix<-matrix(data.vec,c(length(age.cat),length(year))
> rownames(data.matrix)<-age.cat
> colnames(data.matrix)<-year
>
> ##divide into 5 year periods
> age.div<-cut(year,seq(1950,2010,6),include.lowest=T) ##interval is
> beyond my datainterval so I doubt the include.lowest matters
>
> Now what I'd like to do is summarise the rows within the 5-year  
> intervals.
>
> I did read about apply in its different variants and Dahlgaard, but I
> do not know understand how it could be applied in this setting.
>
> I tried making an array and summarise by that (used the vector and
> applied it into a
> length(age.cat)*max(vector(table(age.div)*length(age.div) array. It
> worked but required a bit of tweaking (inserting null columns) and I
> find myself in this situation quite often whereby I need to add
> multiple columns based on another vector so I'd be very interested in
> another more general approach.
>
> //M
>
>
>
> On Tue, Apr 6, 2010 at 9:41 PM, David Winsemius <dwinsemius at comcast.net 
> > wrote:
>>
>> On Apr 6, 2010, at 3:30 PM, David Winsemius wrote:
>>
>>>
>>> On Apr 6, 2010, at 9:56 AM, moleps islon wrote:
>>>
>>>> OK... next question.. Which is still a data manipulation problem  
>>>> so I
>>>> believe the heading is still OK.
>>>>
>>>> ##So now I read my population data from excel.
>>>
>>> No, you read it from a text file and providing the first ten lines  
>>> of that
>>> text file should have been really easy. Read the Posting Guide for  
>>> advice
>>> about offering datasets either as structure() objects with dput or  
>>> dump or
>>> as attached files with "*.txt" extension (not .csv). Just change  
>>> the file
>>> name with your file browser.
>>>
>>>> pop<-read.csv("pop.csv")
>>>>
>>>> typeof(pop) ## yields a list
>>>
>>> Really? I would have guessed it to yield just "list".
>>>
>>>> where I have age-specific population rows
>>>> and a yearly column population, where the years are suffixed by X
>>>
>>> And had you used class(pop) you would have learned it was a  
>>> dataframe and
>>> even more informative would have been str(pop).
>>>>
>>>> c<-(1953:2008)
>>>
>>> No, no, no. Do not use variable names that are important function  
>>> names.
>>> The R interpreter can (usually) keep things straight but it is our  
>>> brains
>>> that experience problems.  Other  function names to avoid: data,  
>>> df, cut,
>>> mean, sd, list, vector, matrix
>>>
>>>> names(pop)<-c
>>>> c.div<-cut(c,break=seq(1950,2010,by=5)
>>>
>>> (You should have gotten an error here.) After fixing the error,  
>>> did you
>>> you notice that there were only 3 of the first level???
>>>
>>> Watch out for cut(). It uses the default convention of ( , ] ,  
>>> i.e. open
>>> interval at right
>>
>>                                                                      
>> er,
>>       ^left^
>>
>>> which is backwards to what some (most?) of us think natural.  
>>> Because of
>>> that the lowest level gets dropped unless you take special  
>>> precautions.
>>> That is undoubtedly why Harrell set up his Hmisc::cut2 to have the  
>>> default
>>> be [ , )
>>>
>>> Aggregating across columns? Certainly possible, but maybe not as  
>>> natural a
>>> fit to functions like split as would occur with working across  
>>> rows. I
>>> suppose you could use something like this untested (because  
>>> _still_ no
>>> sample dataset provided) code:
>>>
>>> apply(pop, 1,    # this works a row a time
>>>  function(x) tapply(x, list(c.div), sum) ) )  # or use aggregate  
>>> which
>>> uses tapply
>>>
>>> I'm not sure it will work, since I don't know if the column names  
>>> would
>>> get carried over into "x" by apply(). You might need to create a  
>>> separate
>>> index that used the numeric positions of the columns rather than  
>>> their
>>> names. Perhaps use c.div <-  seq(0,(2008-1953)) %/% 5  or some  
>>> such inside
>>> tapply.
>>>
>>>>
>>>> Now I'd like to sum the agespecific population over the individual
>>>> levels of -c.div- and generate a new table for this with  
>>>> agespecific
>>>> rows and columns containing the 5-year bins instead of the original
>>>> yearly data. Do I have to program this from scratch or is it  
>>>> possible
>>>> to use an already existing function?
>>>
>>> I think you ought to read more introductory material (and the  
>>> Posting
>>> Guide regarding how to offer example datasets). In this case there  
>>> are many
>>> functions that do data aggregation and most of them should be  
>>> illustrated in
>>> a good introductory text.
>>>
>>> --
>>> David.
>>>>
>>>>
>>>> //M
>>>>
>>>> qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest =
>>>> TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE
>>>>
>>>> On Mon, Apr 5, 2010 at 10:11 PM, moleps <moleps2 at gmail.com> wrote:
>>>>>
>>>>> Thx Erik,
>>>>> I have no idea what went wrong with the other code snippet, but  
>>>>> this one
>>>>> works.. Appreciate it.
>>>>>
>>>>> qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest =
>>>>> TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE))
>>>>>
>>>>> M
>>>>>
>>>>>
>>>>> On 5. apr. 2010, at 21.45, Erik Iverson wrote:
>>>>>
>>>>>> I don't know what your data are like, since you haven't given a
>>>>>> reproducible example. I was imagining something like:
>>>>>>
>>>>>> ## generate fake data
>>>>>> age <- sample(20:90, 100, replace = TRUE)
>>>>>> year <- sample(1950:2000, 100, replace = TRUE)
>>>>>>
>>>>>> ##look at big table
>>>>>> table(age, year)
>>>>>>
>>>>>> ## categorize data
>>>>>> ## see include.lowest and right arguments to cut
>>>>>> age.factor <- cut(age, breaks = seq(20, 90, by = 10),
>>>>>>              include.lowest = TRUE)
>>>>>>
>>>>>> year.factor <- cut(year, breaks = seq(1950, 2000, by = 10),
>>>>>>               include.lowest = TRUE)
>>>>>>
>>>>>> table(age.factor, year.factor)
>>>>>>
>>>>>> moleps wrote:
>>>>>>>
>>>>>>> I already did try the regression modeling approach. However the
>>>>>>> epidemiologists (referee) turns out to be quite fond of  
>>>>>>> comparing the
>>>>>>> incidence rates to different standard populations, hence the  
>>>>>>> need for this
>>>>>>> labourius approach. And trying the "cutting" approach I ended  
>>>>>>> up with :
>>>>>>>>
>>>>>>>> table (age5)
>>>>>>>
>>>>>>> age5
>>>>>>> (0,5]   (5,10]  (10,15]  (15,20]  (20,25]  (25,30]  (30,35]   
>>>>>>> (35,40]
>>>>>>> (40,45]  (45,50]  (50,55]  (55,60]  (60,65]  (65,70] (70,75]   
>>>>>>> (75,80]
>>>>>>> (80,85] (85,100]       35       34       33       47        
>>>>>>> 51      109
>>>>>>> 157      231      362      511    745      926     1002       
>>>>>>> 866      547
>>>>>>>   247       82       18
>>>>>>>>
>>>>>>>> table (yr5)
>>>>>>>
>>>>>>> yr5
>>>>>>> (1950,1955] (1955,1960] (1960,1965] (1965,1970] (1970,1975]
>>>>>>> (1975,1980] (1980,1985] (1985,1990] (1990,1995] (1995,2000]  
>>>>>>> (2000,2005]
>>>>>>> (2005,2009]           3           5           5            
>>>>>>> 5           5
>>>>>>>      5           5           5         5           5           5
>>>>>>> 3
>>>>>>>>
>>>>>>>> table (yr5,age5)
>>>>>>>
>>>>>>> Error in table(yr5, age5) : all arguments must have the same  
>>>>>>> length
>>>>>>> Sincerely,
>>>>>>> M
>>>>>>> On 5. apr. 2010, at 20.59, Bert Gunter wrote:
>>>>>>>>
>>>>>>>> You have tempted, and being weak, I yield to temptation:
>>>>>>>>
>>>>>>>> "Any good ideas?"
>>>>>>>>
>>>>>>>> Yes. Don't do this.
>>>>>>>>
>>>>>>>> (what you probably really want to do is fit a model with age  
>>>>>>>> as a
>>>>>>>> factor,
>>>>>>>> which can be done statistically e.g. by logistic regression; or
>>>>>>>> graphically
>>>>>>>> using conditioning plots, e.g. via trellis graphics (the  
>>>>>>>> lattice
>>>>>>>> package).
>>>>>>>> This avoids the arbitrariness and discontinuities of binning  
>>>>>>>> by age
>>>>>>>> range.)
>>>>>>>>
>>>>>>>> Bert Gunter
>>>>>>>> Genentech Nonclinical Biostatistics
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: r-help-bounces at r-project.org
>>>>>>>> [mailto:r-help-bounces at r-project.org] On
>>>>>>>> Behalf Of moleps
>>>>>>>> Sent: Monday, April 05, 2010 11:46 AM
>>>>>>>> To: r-help at r-project.org
>>>>>>>> Subject: [R] Data manipulation problem
>>>>>>>>
>>>>>>>> Dear R´ers.
>>>>>>>>
>>>>>>>> I´ve got a dataset with age and year of diagnosis. In order to
>>>>>>>> age-standardize the incidence I need to transform the data  
>>>>>>>> into a
>>>>>>>> matrix
>>>>>>>> with age-groups (divided in 5 or 10 years) along one axis and  
>>>>>>>> year
>>>>>>>> divided
>>>>>>>> into 5 years along the other axis. Each cell should contain the
>>>>>>>> number of
>>>>>>>> cases for that age group and for that period.
>>>>>>>> I.e.
>>>>>>>> My data format now is
>>>>>>>> ID-age (to one decimal)-year(yearly data).
>>>>>>>>
>>>>>>>> What I´d like is
>>>>>>>>
>>>>>>>> age 1960-1965 1966-1970 etc...
>>>>>>>> 0-5 3 8 10 15
>>>>>>>> 6-10 2 5 8 13
>>>>>>>> etc..
>>>>>>>>
>>>>>>>>
>>>>>>>> Any good ideas?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> M
>>>>
>>>
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>>
>
> On Tue, Apr 6, 2010 at 9:41 PM, David Winsemius <dwinsemius at comcast.net 
> > wrote:
>>
>> On Apr 6, 2010, at 3:30 PM, David Winsemius wrote:
>>
>>>
>>> On Apr 6, 2010, at 9:56 AM, moleps islon wrote:
>>>
>>>> OK... next question.. Which is still a data manipulation problem  
>>>> so I
>>>> believe the heading is still OK.
>>>>
>>>> ##So now I read my population data from excel.
>>>
>>> No, you read it from a text file and providing the first ten lines  
>>> of that
>>> text file should have been really easy. Read the Posting Guide for  
>>> advice
>>> about offering datasets either as structure() objects with dput or  
>>> dump or
>>> as attached files with "*.txt" extension (not .csv). Just change  
>>> the file
>>> name with your file browser.
>>>
>>>> pop<-read.csv("pop.csv")
>>>>
>>>> typeof(pop) ## yields a list
>>>
>>> Really? I would have guessed it to yield just "list".
>>>
>>>> where I have age-specific population rows
>>>> and a yearly column population, where the years are suffixed by X
>>>
>>> And had you used class(pop) you would have learned it was a  
>>> dataframe and
>>> even more informative would have been str(pop).
>>>>
>>>> c<-(1953:2008)
>>>
>>> No, no, no. Do not use variable names that are important function  
>>> names.
>>> The R interpreter can (usually) keep things straight but it is our  
>>> brains
>>> that experience problems.  Other  function names to avoid: data,  
>>> df, cut,
>>> mean, sd, list, vector, matrix
>>>
>>>> names(pop)<-c
>>>> c.div<-cut(c,break=seq(1950,2010,by=5)
>>>
>>> (You should have gotten an error here.) After fixing the error,  
>>> did you
>>> you notice that there were only 3 of the first level???
>>>
>>> Watch out for cut(). It uses the default convention of ( , ] ,  
>>> i.e. open
>>> interval at right
>>
>>                                                                      er 
>> ,
>>        ^left^
>>
>>> which is backwards to what some (most?) of us think natural.  
>>> Because of
>>> that the lowest level gets dropped unless you take special  
>>> precautions.
>>>  That is undoubtedly why Harrell set up his Hmisc::cut2 to have  
>>> the default
>>> be [ , )
>>>
>>> Aggregating across columns? Certainly possible, but maybe not as  
>>> natural a
>>> fit to functions like split as would occur with working across  
>>> rows. I
>>> suppose you could use something like this untested (because  
>>> _still_ no
>>> sample dataset provided) code:
>>>
>>> apply(pop, 1,    # this works a row a time
>>>   function(x) tapply(x, list(c.div), sum) ) )  # or use aggregate  
>>> which
>>> uses tapply
>>>
>>> I'm not sure it will work, since I don't know if the column names  
>>> would
>>> get carried over into "x" by apply(). You might need to create a  
>>> separate
>>> index that used the numeric positions of the columns rather than  
>>> their
>>> names. Perhaps use c.div <-  seq(0,(2008-1953)) %/% 5  or some  
>>> such inside
>>> tapply.
>>>
>>>>
>>>> Now I'd like to sum the agespecific population over the individual
>>>> levels of -c.div- and generate a new table for this with  
>>>> agespecific
>>>> rows and columns containing the 5-year bins instead of the original
>>>> yearly data. Do I have to program this from scratch or is it  
>>>> possible
>>>> to use an already existing function?
>>>
>>> I think you ought to read more introductory material (and the  
>>> Posting
>>> Guide regarding how to offer example datasets). In this case there  
>>> are many
>>> functions that do data aggregation and most of them should be  
>>> illustrated in
>>> a good introductory text.
>>>
>>> --
>>> David.
>>>>
>>>>
>>>> //M
>>>>
>>>> qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest =
>>>> TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE
>>>>
>>>> On Mon, Apr 5, 2010 at 10:11 PM, moleps <moleps2 at gmail.com> wrote:
>>>>>
>>>>> Thx Erik,
>>>>> I have no idea what went wrong with the other code snippet, but  
>>>>> this one
>>>>> works.. Appreciate it.
>>>>>
>>>>> qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest =
>>>>> TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE))
>>>>>
>>>>> M
>>>>>
>>>>>
>>>>> On 5. apr. 2010, at 21.45, Erik Iverson wrote:
>>>>>
>>>>>> I don't know what your data are like, since you haven't given a
>>>>>> reproducible example. I was imagining something like:
>>>>>>
>>>>>> ## generate fake data
>>>>>> age <- sample(20:90, 100, replace = TRUE)
>>>>>> year <- sample(1950:2000, 100, replace = TRUE)
>>>>>>
>>>>>> ##look at big table
>>>>>> table(age, year)
>>>>>>
>>>>>> ## categorize data
>>>>>> ## see include.lowest and right arguments to cut
>>>>>> age.factor <- cut(age, breaks = seq(20, 90, by = 10),
>>>>>>               include.lowest = TRUE)
>>>>>>
>>>>>> year.factor <- cut(year, breaks = seq(1950, 2000, by = 10),
>>>>>>                include.lowest = TRUE)
>>>>>>
>>>>>> table(age.factor, year.factor)
>>>>>>
>>>>>> moleps wrote:
>>>>>>>
>>>>>>> I already did try the regression modeling approach. However the
>>>>>>> epidemiologists (referee) turns out to be quite fond of  
>>>>>>> comparing the
>>>>>>> incidence rates to different standard populations, hence the  
>>>>>>> need for this
>>>>>>> labourius approach. And trying the "cutting" approach I ended  
>>>>>>> up with :
>>>>>>>>
>>>>>>>> table (age5)
>>>>>>>
>>>>>>> age5
>>>>>>> (0,5]   (5,10]  (10,15]  (15,20]  (20,25]  (25,30]  (30,35]   
>>>>>>> (35,40]
>>>>>>>  (40,45]  (45,50]  (50,55]  (55,60]  (60,65]  (65,70] (70,75]   
>>>>>>> (75,80]
>>>>>>>  (80,85] (85,100]       35       34       33       47        
>>>>>>> 51      109
>>>>>>>  157      231      362      511    745      926     1002       
>>>>>>> 866      547
>>>>>>>    247       82       18
>>>>>>>>
>>>>>>>> table (yr5)
>>>>>>>
>>>>>>> yr5
>>>>>>> (1950,1955] (1955,1960] (1960,1965] (1965,1970] (1970,1975]
>>>>>>> (1975,1980] (1980,1985] (1985,1990] (1990,1995] (1995,2000]  
>>>>>>> (2000,2005]
>>>>>>> (2005,2009]           3           5           5            
>>>>>>> 5           5
>>>>>>>       5           5           5         5            
>>>>>>> 5           5
>>>>>>> 3
>>>>>>>>
>>>>>>>> table (yr5,age5)
>>>>>>>
>>>>>>> Error in table(yr5, age5) : all arguments must have the same  
>>>>>>> length
>>>>>>> Sincerely,
>>>>>>> M
>>>>>>> On 5. apr. 2010, at 20.59, Bert Gunter wrote:
>>>>>>>>
>>>>>>>> You have tempted, and being weak, I yield to temptation:
>>>>>>>>
>>>>>>>> "Any good ideas?"
>>>>>>>>
>>>>>>>> Yes. Don't do this.
>>>>>>>>
>>>>>>>> (what you probably really want to do is fit a model with age  
>>>>>>>> as a
>>>>>>>> factor,
>>>>>>>> which can be done statistically e.g. by logistic regression; or
>>>>>>>> graphically
>>>>>>>> using conditioning plots, e.g. via trellis graphics (the  
>>>>>>>> lattice
>>>>>>>> package).
>>>>>>>> This avoids the arbitrariness and discontinuities of binning  
>>>>>>>> by age
>>>>>>>> range.)
>>>>>>>>
>>>>>>>> Bert Gunter
>>>>>>>> Genentech Nonclinical Biostatistics
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: r-help-bounces at r-project.org
>>>>>>>> [mailto:r-help-bounces at r-project.org] On
>>>>>>>> Behalf Of moleps
>>>>>>>> Sent: Monday, April 05, 2010 11:46 AM
>>>>>>>> To: r-help at r-project.org
>>>>>>>> Subject: [R] Data manipulation problem
>>>>>>>>
>>>>>>>> Dear R´ers.
>>>>>>>>
>>>>>>>> I´ve got a dataset with age and year of diagnosis. In order to
>>>>>>>> age-standardize the incidence I need to transform the data  
>>>>>>>> into a
>>>>>>>> matrix
>>>>>>>> with age-groups (divided in 5 or 10 years) along one axis and  
>>>>>>>> year
>>>>>>>> divided
>>>>>>>> into 5 years along the other axis. Each cell should contain the
>>>>>>>> number of
>>>>>>>> cases for that age group and for that period.
>>>>>>>> I.e.
>>>>>>>> My data format now is
>>>>>>>> ID-age (to one decimal)-year(yearly data).
>>>>>>>>
>>>>>>>> What I´d like is
>>>>>>>>
>>>>>>>> age 1960-1965 1966-1970 etc...
>>>>>>>> 0-5 3 8 10 15
>>>>>>>> 6-10 2 5 8 13
>>>>>>>> etc..
>>>>>>>>
>>>>>>>>
>>>>>>>> Any good ideas?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> M
>>>>
>>>
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>>

David Winsemius, MD
West Hartford, CT



More information about the R-help mailing list