[R] merge a list of data frames

Sam Steingold sds at gnu.org
Thu Sep 6 19:53:03 CEST 2012


> * David Winsemius <qjvafrzvhf at pbzpnfg.arg> [2012-09-06 10:30:16 -0700]:
>
>> these are the results of applying a model to the test data.
>> the first column is the ID
>
> In which case you should be using the 'by' argument to `merge`

I already do! See my initial message!

>> 3. sort by the sum/mean of the V3 columns and evaluate the combined
>> model using the lift quality metric
>> (http://dl.acm.org/citation.cfm?id=380995.381018)
>
> That's going to require more background (or more money, since they want $15.00 for a PDF).

:-)
That part I have already implemented, and it works just fine:

proficiency <- function (actual, prediction) {
  proficiency1(ea = entropy(table(actual)),
               ep = entropy(table(prediction)),
               ej = entropy(table(actual,prediction)))
}

proficiency1 <- function (ea, ep, ej) {
  mi <- ea + ep - ej
  list(joint = ej, actual = ea, prediction = ep, mutual = mi,
       proficiency = mi / ea, dependency = mi / ej)
}

detector.statistics <- function (tp,fn,fp,tn) {
  observationCount <- tp + fn + fp + tn
  predictedPositive <- tp + fp
  predictedNegative <- fn + tn
  actualPositive <- tp + fn
  actualNegative <- fp + tn
  correct <- tp + tn
  list(baseRate = actualPositive / observationCount,
       precision = if (tp == 0) 0 else tp / predictedPositive,
       specificity = if (tn == 0) 0 else tn / actualNegative,
       recall = if (tp == 0) 0 else tp / actualPositive,
       accuracy = correct / observationCount,
       lift = (tp * observationCount) / (predictedPositive * actualPositive),
       f1score = if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn),
       proficiency = proficiency1(ej = entropy(c(tp,fn,fp,tn)),
         ea = entropy(c(actualPositive,actualNegative)),
         ep = entropy(c(predictedPositive,predictedNegative))))
}
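
The functions above call entropy(), which is not defined in this message. A minimal base-R sketch of what such a helper might look like, assuming it means the Shannon entropy (natural log) of a vector or table of counts. The names entropy_counts and proficiency_2x2 and the example counts are made up for illustration:

```r
## Hypothetical stand-in for entropy(): Shannon entropy (in nats)
## of a vector or table of counts.  This is an assumption, not
## necessarily the helper used in the code above.
entropy_counts <- function (counts) {
  p <- as.vector(counts) / sum(counts)
  p <- p[p > 0]                 # treat 0 * log(0) as 0
  -sum(p * log(p))
}

## Same arithmetic as proficiency1(), for a 2x2 confusion table.
proficiency_2x2 <- function (tp, fn, fp, tn) {
  ea <- entropy_counts(c(tp + fn, fp + tn))   # actual
  ep <- entropy_counts(c(tp + fp, fn + tn))   # prediction
  ej <- entropy_counts(c(tp, fn, fp, tn))     # joint
  (ea + ep - ej) / ea                         # mutual information / actual
}

## A perfect detector and one whose prediction is independent of the class.
perfect <- proficiency_2x2(tp = 40, fn = 0, fp = 0, tn = 60)
independent <- proficiency_2x2(tp = 10, fn = 30, fp = 15, tn = 45)
```

With these counts the perfect detector scores 1 and the independent one scores 0 up to rounding error, which is what a ratio of mutual information to the entropy of the actual class should give.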

## v should be a vector of 0s and 1s sorted according to some model
## Gregory Piatetsky-Shapiro, Samuel Steingold
## "Measuring Lift Quality in Database Marketing"
## http://sds.podval.org/data/l-quality.pdf
## http://www.sigkdd.org/explorations/issues/2-2-2000-12/piatetsky-shapiro.pdf
## SIGKDD Explorations, Vol. 2:2, (2000), 81-86
## tests: lift.quality(rbinom(10000,size=1,prob=0.1)) ==> ~0
##        lift.quality(rev(round((1:10000)/12000))) ==> 1
lift.quality <- function (v, plot = TRUE, file = NULL, main = "lift curve", thresholds = NULL) {
  target.count <- sum(v)
  total.count <- length(v)
  base.rate <- target.count / total.count
  target.level <- cumsum(v)/target.count
  lq <- ((2*sum(target.level) - 1)/total.count - 1) / (1 - base.rate)
  if (plot) {
    if (!is.null(file)) {
      pdf(file = file)
      on.exit(dev.off())
    }
    plot(x=(1:total.count)/total.count,y=target.level,type="l",
         main=paste(main,"( lift quality ",lq,")"),
         xlab="% cutoff",ylab="cumulative % hit")
  }
  if (is.null(thresholds)) thresholds <- c(base.rate)
  list(lift.quality = lq,
       detector.statistics = sapply(thresholds, function (l) {
         cutoff <- round(l * total.count)
         tp <- round(target.level[cutoff] * target.count) # = sum(v[1:cutoff])
         fn <- target.count - tp
         fp <- cutoff - tp
         tn <- total.count - target.count - cutoff + tp
         detector.statistics(tp, fn, fp, tn)
       }))
}
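
The two tests in the comment above lift.quality can be checked without the plotting and detector parts. A minimal sketch that repeats only the lq formula from lift.quality; the name lift_quality_only and the seed are arbitrary choices for this illustration:

```r
## Only the lift-quality number, same formula as in lift.quality().
lift_quality_only <- function (v) {
  target.count <- sum(v)
  total.count <- length(v)
  base.rate <- target.count / total.count
  target.level <- cumsum(v) / target.count
  ((2 * sum(target.level) - 1) / total.count - 1) / (1 - base.rate)
}

set.seed(1)   # arbitrary seed, only to make the random case repeatable

## All 1s first (perfect ordering), all 1s last (worst ordering),
## and a random ordering.
lq.perfect <- lift_quality_only(rev(round((1:10000) / 12000)))
lq.worst <- lift_quality_only(round((1:10000) / 12000))
lq.random <- lift_quality_only(rbinom(10000, size = 1, prob = 0.1))
```

The perfectly sorted vector gives 1, the same vector in the opposite order gives -1, and the random one should come out close to 0.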



>> I have many more score files (not just 4), so it is not practical for me
>> to rename the column to something unique.
>
> Which column?

the 3rd ("score") column.
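
For the record, the renaming does not have to be done by hand. A minimal sketch, assuming each score file has been read into a data frame whose ID is in V1 and whose score is in V3 (read.table default names; the list names and the toy data are made up). It suffixes each score column with the name of its list element and then merges everything on the ID:

```r
## Toy stand-ins for the score files; V1 = ID, V3 = score.
scores <- list(a = data.frame(V1 = c(3, 1, 2), V3 = c(0.3, 0.1, 0.2)),
               b = data.frame(V1 = c(2, 3, 1), V3 = c(0.5, 0.6, 0.4)),
               c = data.frame(V1 = c(1, 2, 3), V3 = c(0.7, 0.8, 0.9)))

## Keep ID + score and give each score column a unique name.
renamed <- lapply(names(scores), function (n) {
  setNames(scores[[n]][, c("V1", "V3")], c("V1", paste0("V3.", n)))
})

## Fold merge() over the list, always joining on the ID column.
merged <- Reduce(function (x, y) merge(x, y, by = "V1"), renamed)

## Mean of the score columns, usable as the sort key for the combined model.
merged$mean.score <- unname(rowMeans(merged[, -1]))
```

This scales to any number of frames, but each merge() copies the accumulated result, so it may well be slow for many large files.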

Meanwhile I realised that the fastest way is actually the shell:
sort+cut+paste produces a csv file which can be loaded into R much
faster than the individual score files, so this issue is now purely
academic.  However, I appreciate the replies I have got so far; they were
quite educational, thanks!
(I would also appreciate comments on the code above.)
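
The sort+cut+paste approach presumably relies on every file holding the same IDs, so that after sorting the score columns line up and can be placed side by side. A rough R analogue of that idea (not the shell commands themselves), assuming every data frame has the same set of IDs in V1 and the score in V3; the toy data are made up:

```r
## Toy stand-ins for two score files; V1 = ID, V3 = score.
scores <- list(a = data.frame(V1 = c(3, 1, 2), V3 = c(0.3, 0.1, 0.2)),
               b = data.frame(V1 = c(2, 3, 1), V3 = c(0.5, 0.6, 0.4)))

## "sort": order every frame by ID.
sorted <- lapply(scores, function (d) d[order(d$V1), ])

## Refuse to continue if the IDs do not line up row by row.
ids <- sorted[[1]]$V1
stopifnot(all(sapply(sorted, function (d) identical(d$V1, ids))))

## "cut" + "paste": take the score column of each frame and bind them.
combined <- data.frame(V1 = ids, sapply(sorted, function (d) d$V3))
```

Unlike merge(), this does no matching at all, which is why the ID check matters: without it, misaligned rows would be combined silently.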

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://www.memritv.org http://truepeace.org
http://openvotingconsortium.org http://ffii.org http://mideasttruth.com
Save your burned out bulbs for me, I'm building my own dark room.



