[R] Data manipulation problem

David Winsemius dwinsemius at comcast.net
Tue Apr 6 21:41:48 CEST 2010


On Apr 6, 2010, at 3:30 PM, David Winsemius wrote:

>
> On Apr 6, 2010, at 9:56 AM, moleps islon wrote:
>
>> OK... next question.. Which is still a data manipulation problem so I
>> believe the heading is still OK.
>>
>> ##So now I read my population data from excel.
>
> No, you read it from a text file and providing the first ten lines  
> of that text file should have been really easy. Read the Posting  
> Guide for advice about offering datasets either as structure()  
> objects with dput or dump or as attached files with "*.txt"  
> extension (not .csv). Just change the file name with your file  
> browser.
>
>> pop<-read.csv("pop.csv")
>>
>> typeof(pop) ## yields a list
>
> Really? I would have guessed it to yield just "list".
>
>> where I have age-specific population rows
>> and a yearly column population, where the years are suffixed by X
>
> And had you used class(pop) you would have learned it was a  
> dataframe and even more informative would have been str(pop).
>>
>> c<-(1953:2008)
>
> No, no, no. Do not use variable names that are important function  
> names. The R interpreter can (usually) keep things straight but it  
> is our brains that experience problems.  Other  function names to  
> avoid: data, df, cut, mean, sd, list, vector, matrix
>
>> names(pop)<-c
>> c.div<-cut(c,break=seq(1950,2010,by=5)
>
> (You should have gotten an error here.) After fixing the error, did  
> you you notice that there were only 3 of the first level???
>
> Watch out for cut(). It uses the default convention of ( , ] , i.e.  
> open interval at right
                                                                        
er,            ^left^

> which is backwards to what some (most?) of us think natural. Because  
> of that the lowest level gets dropped unless you take special  
> precautions.  That is undoubtedly why Harrell set up his Hmisc::cut2  
> to have the default be [ , )
>
> Aggregating across columns? Certainly possible, but maybe not as  
> natural a fit to functions like split as would occur with working  
> across rows. I suppose you could use something like this untested  
> (because _still_ no sample dataset provided) code:
>
> apply(pop, 1,    # this works a row a time
>    function(x) tapply(x, list(c.div), sum) ) )  # or use aggregate  
> which uses tapply
>
> I'm not sure it will work, since I don't know if the column names  
> would get carried over into "x" by apply(). You might need to create  
> a separate index that used the numeric positions of the columns  
> rather than their names. Perhaps use c.div <-  seq(0,(2008-1953)) %/ 
> % 5  or some such inside tapply.
>
>>
>> Now I'd like to sum the agespecific population over the individual
>> levels of -c.div- and generate a new table for this with agespecific
>> rows and columns containing the 5-year bins instead of the original
>> yearly data. Do I have to program this from scratch or is it possible
>> to use an already existing function?
>
> I think you ought to read more introductory material (and the  
> Posting Guide regarding how to offer example datasets). In this case  
> there are many functions that do data aggregation and most of them  
> should be illustrated in a good introductory text.
>
> -- 
> David.
>>
>>
>> //M
>>
>> qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest =
>> TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE
>>
>> On Mon, Apr 5, 2010 at 10:11 PM, moleps <moleps2 at gmail.com> wrote:
>>>
>>> Thx Erik,
>>> I have no idea what went wrong with the other code snippet, but  
>>> this one works.. Appreciate it.
>>>
>>> qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest =  
>>> TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE))
>>>
>>> M
>>>
>>>
>>> On 5. apr. 2010, at 21.45, Erik Iverson wrote:
>>>
>>>> I don't know what your data are like, since you haven't given a  
>>>> reproducible example. I was imagining something like:
>>>>
>>>> ## generate fake data
>>>> age <- sample(20:90, 100, replace = TRUE)
>>>> year <- sample(1950:2000, 100, replace = TRUE)
>>>>
>>>> ##look at big table
>>>> table(age, year)
>>>>
>>>> ## categorize data
>>>> ## see include.lowest and right arguments to cut
>>>> age.factor <- cut(age, breaks = seq(20, 90, by = 10),
>>>>                include.lowest = TRUE)
>>>>
>>>> year.factor <- cut(year, breaks = seq(1950, 2000, by = 10),
>>>>                 include.lowest = TRUE)
>>>>
>>>> table(age.factor, year.factor)
>>>>
>>>> moleps wrote:
>>>>> I already did try the regression modeling approach. However the  
>>>>> epidemiologists (referee) turns out to be quite fond of  
>>>>> comparing the incidence rates to different standard populations,  
>>>>> hence the need for this labourius approach. And trying the  
>>>>> "cutting" approach I ended up with :
>>>>>> table (age5)
>>>>> age5
>>>>> (0,5]   (5,10]  (10,15]  (15,20]  (20,25]  (25,30]  (30,35]   
>>>>> (35,40]  (40,45]  (45,50]  (50,55]  (55,60]  (60,65]  (65,70]  
>>>>> (70,75]  (75,80]  (80,85] (85,100]       35       34        
>>>>> 33       47       51      109      157      231      362       
>>>>> 511    745      926     1002      866      547      247        
>>>>> 82       18
>>>>>> table (yr5)
>>>>> yr5
>>>>> (1950,1955] (1955,1960] (1960,1965] (1965,1970] (1970,1975]  
>>>>> (1975,1980] (1980,1985] (1985,1990] (1990,1995] (1995,2000]  
>>>>> (2000,2005] (2005,2009]           3           5            
>>>>> 5           5           5           5           5            
>>>>> 5         5           5           5           3
>>>>>> table (yr5,age5)
>>>>> Error in table(yr5, age5) : all arguments must have the same  
>>>>> length
>>>>> Sincerely,
>>>>> M
>>>>> On 5. apr. 2010, at 20.59, Bert Gunter wrote:
>>>>>> You have tempted, and being weak, I yield to temptation:
>>>>>>
>>>>>> "Any good ideas?"
>>>>>>
>>>>>> Yes. Don't do this.
>>>>>>
>>>>>> (what you probably really want to do is fit a model with age as  
>>>>>> a factor,
>>>>>> which can be done statistically e.g. by logistic regression; or  
>>>>>> graphically
>>>>>> using conditioning plots, e.g. via trellis graphics (the  
>>>>>> lattice package).
>>>>>> This avoids the arbitrariness and discontinuities of binning by  
>>>>>> age range.)
>>>>>>
>>>>>> Bert Gunter
>>>>>> Genentech Nonclinical Biostatistics
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org 
>>>>>> ] On
>>>>>> Behalf Of moleps
>>>>>> Sent: Monday, April 05, 2010 11:46 AM
>>>>>> To: r-help at r-project.org
>>>>>> Subject: [R] Data manipulation problem
>>>>>>
>>>>>> Dear R´ers.
>>>>>>
>>>>>> I´ve got a dataset with age and year of diagnosis. In order to
>>>>>> age-standardize the incidence I need to transform the data into  
>>>>>> a matrix
>>>>>> with age-groups (divided in 5 or 10 years) along one axis and  
>>>>>> year divided
>>>>>> into 5 years along the other axis. Each cell should contain the  
>>>>>> number of
>>>>>> cases for that age group and for that period.
>>>>>> I.e.
>>>>>> My data format now is
>>>>>> ID-age (to one decimal)-year(yearly data).
>>>>>>
>>>>>> What I´d like is
>>>>>>
>>>>>> age 1960-1965 1966-1970 etc...
>>>>>> 0-5 3 8 10 15
>>>>>> 6-10 2 5 8 13
>>>>>> etc..
>>>>>>
>>>>>>
>>>>>> Any good ideas?
>>>>>>
>>>>>> Regards,
>>>>>> M
>>
>

David Winsemius, MD
West Hartford, CT



More information about the R-help mailing list