[R] simple generation of artificial data with defined features

Fri Aug 22 19:56:16 CEST 2008

On the general question of how to create a dataset that matches the
frequencies in a table, the function as.data.frame can be useful.  It takes
as its argument an object of class 'table' and returns a data frame of
frequencies, with one row per cell of the table.

Consider for example table 6.1 of Fleiss et al (3rd Ed):

> birth.weight <- c(10,15,40,135)
> attr(birth.weight, "class") <- "table"
> attr(birth.weight, "dim") <- c(2,2)
> attr(birth.weight, "dimnames") <- list(c("A", "Ab"), c("B", "Bb"))
> birth.weight
     B  Bb
A   10  40
Ab  15 135
> summary(birth.weight)
Number of cases in table: 200 
Number of factors: 2 
Test for independence of all factors:
        Chisq = 3.429, df = 1, p-value = 0.06408
> 
> bw.dt <- as.data.frame(birth.weight)

Observations (rows) in this data frame can then be replicated according to
their corresponding frequencies to yield an expanded dataset that conforms
to the original table.

> bw.dt.exp <- bw.dt[rep(1:nrow(bw.dt), bw.dt$Freq), -ncol(bw.dt)]
> dim(bw.dt.exp)
[1] 200   2
> table(bw.dt.exp)
    Var2
Var1   B  Bb
  A   10  40
  Ab  15 135 
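
As a minimal sketch, the replication step could be wrapped in a small
helper.  The name expand_table and the dimension labels factor1/factor2 are
made up here for illustration; they do not come from Fleiss et al or from
any package.

```r
# Hypothetical helper (not part of base R): expand a contingency table
# into a data frame with one row per observation.
expand_table <- function(tab) {
  df  <- as.data.frame(tab, stringsAsFactors = TRUE)   # one row per cell, counts in Freq
  idx <- rep(seq_len(nrow(df)), times = df$Freq)       # repeat each cell's row Freq times
  out <- df[idx, names(df) != "Freq", drop = FALSE]    # keep only the factor columns
  rownames(out) <- NULL
  out
}

# Same counts as above; factor1/factor2 are placeholder dimension names.
birth.weight <- as.table(matrix(c(10, 15, 40, 135), nrow = 2,
                                dimnames = list(factor1 = c("A", "Ab"),
                                                factor2 = c("B", "Bb"))))
bw.exp <- expand_table(birth.weight)

dim(bw.exp)      # 200 rows, 2 columns
table(bw.exp)    # should reproduce the original counts
```

Using drop = FALSE means the result stays a data frame even when the table
has only one dimension.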

The above approach is not restricted to 2x2 tables, and it should be
straightforward to generate datasets that conform to arbitrary n x m
frequency tables.
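
For instance, the same steps might be applied to the election counts from
the original question, scaled down by 1000 as suggested there.  This is
only a sketch: the object names (votes, votes.tab, votes.dt, votes.exp) and
the column name "party" are chosen here for illustration.

```r
# Vote counts as quoted in the original question.
votes <- c(SPD = 16194665, CDU = 13136740, CSU = 3494309,
           Gruene = 3838326, FDP = 4648144, PDS = 4118194)

# Scale down by 1000 and round to whole "voters"; a named vector
# becomes a one-dimensional table.
votes.tab <- as.table(round(votes / 1000))

votes.dt <- as.data.frame(votes.tab, stringsAsFactors = TRUE)  # columns: Var1, Freq
names(votes.dt)[1] <- "party"

# One row per scaled voter; drop = FALSE keeps a data frame
# even though only one column remains.
votes.exp <- votes.dt[rep(seq_len(nrow(votes.dt)), votes.dt$Freq),
                      "party", drop = FALSE]

nrow(votes.exp)                                        # total number of scaled voters
round(100 * prop.table(table(votes.exp$party)), 1)     # shares among the six parties
```

The shares computed this way are relative to the six listed parties only,
so they will not match the percent column of the original table exactly
(that column sums to 96).  As noted in the quoted reply below, a single
classification per voter is not by itself input for Kappa, which needs two
or more raters per subject.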

-Christos Hatzis

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Greg Snow
> Sent: Friday, August 22, 2008 12:41 PM
> To: drflxms; r-help at r-project.org
> Subject: Re: [R] simple generation of artificial data with 
> defined features
> 
> I don't think that the election data is the right data to 
> demonstrate Kappa, you need subjects that are classified by 2 
> or more different raters/methods.  The election data could be 
> considered classifying the voters into which party they voted 
> for, but you only have 1 rater.  Maybe if you had some survey 
> data that showed which party each voter voted for in 2 or 
> more elections, then that may be a good example dataset.  
> Otherwise you may want to stick with the sample datasets.
> 
> There are other packages that compute Kappa values as well (I 
> don't know if others calculate this particular version), but 
> some of those take the summary data as input rather than the 
> raw data, which may be easier if you just have the summary tables.
> 
> 
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow at imail.org
> (801) 408-8111
> 
> 
> 
> > -----Original Message-----
> > From: r-help-bounces at r-project.org
> > [mailto:r-help-bounces at r-project.org] On Behalf Of drflxms
> > Sent: Friday, August 22, 2008 6:12 AM
> > To: r-help at r-project.org
> > Subject: [R] simple generation of artificial data with defined 
> > features
> >
> > Dear R-colleagues,
> >
> > I am quite a newbie to R fighting my stupidity to solve a probably 
> > quite simple problem of generating artificial data with defined 
> > features.
> >
> > I am conducting a study of inter-observer-agreement in
> > child-bronchoscopy. One of the most important measures is Kappa
> > according to Fleiss, which is very comfortable available in R through
> > the irr-package.
> > Unfortunately medical doctors like me don't really understand much of
> > statistics. Therefore I'd like to give the reader an easy
> > understandable example of Fleiss-Kappa in the Methods part. To achieve
> > this, I obtained a table with the results of the German election from
> > 2005:
> >
> > party     number of votes    percent
> >
> > SPD       16194665           34,2
> > CDU       13136740           27,8
> > CSU        3494309            7,4
> > Gruene     3838326            8,1
> > FDP        4648144            9,8
> > PDS        4118194            8,7
> >
> > I want to show the agreement of voters measured by Fleiss-Kappa. To 
> > calculate this with the kappam.fleiss-function of irr, I need a 
> > data.frame like this:
> >
> >                 (id of 1st voter) (id of 2nd voter)
> >
> > party             spd                         cdu
> >
> > Of course I don't plan to calculate this with the million of cases 
> > mentioned in the table above (I am working on a small laptop). A 
> > division by 1000 would be more than perfect for this example. The 
> > exact format of the table is generally not so important, as I could 
> > reshape nearly every format with the help of the reshape-package.
> >
> > Unfortunately I could not figure out how to create such a
> > fictive/artificial dataset as described above. Any data.frame would be
> > nice, that keeps at least the percentage. String-IDs of parties could
> > be substituted by numbers of course (would be even better for function
> > kappam.fleiss in irr!).
> >
> > I would appreciate any kind of help very much indeed.
> > Greetings from Munich,
> >
> > Felix Mueller-Sarnowski
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> 