[R] Intended use-case for data.matrix

Philip Charles ph|||p@ch@r|e@ @end|ng |rom ndm@ox@@c@uk
Wed Nov 4 12:48:58 CET 2020


Hi R gurus,

We do a lot of work with biological -omics datasets (genomics, proteomics etc).  The text file inputs to R typically contain a mixture of (mostly) character data and numeric data.  The number of columns (both character and numeric data) in the file varies with the number of samples measured (which makes use of colClasses awkward), so a typical approach might be

1) read in the whole file as character data

#simulated result of read.table (with stringsAsFactors=FALSE)
raw <- data.frame(
  Accession=c('P04637','P01375','P00761'),
  Description=c('Cellular tumor antigen p53','Tumor necrosis factor','Trypsin'),
  Species=c('Homo sapiens','Homo sapiens','Sus scrofa'),
  Intensity.SampleA=c('919948','1346170','15870'),
  Intensity.SampleB=c('1625540','710272','83624'),
  Intensity.SampleC=c('1232780','1481040','62548'),
  stringsAsFactors=FALSE)  # explicit, so this behaves the same on R < 4.0.0

2) use grepl to identify numeric columns based on column names and split the raw data frame

QUANT_COLS <- grepl('^Intensity\\.',colnames(raw))
META_COLS <- !QUANT_COLS
quant.df.char <- raw[,QUANT_COLS]
meta.df <- raw[, META_COLS]
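
One caveat with this kind of split, sketched here under the assumption that a file might contain only a single Intensity column: indexing a data frame with [, cols] silently drops to a plain vector when one column is selected, unless drop = FALSE is given. The names raw1, dropped and kept are made up for illustration.

```r
# Hypothetical input with just one quantitation column
raw1 <- data.frame(
  Accession = c('P04637', 'P01375'),
  Intensity.SampleA = c('919948', '1346170'),
  stringsAsFactors = FALSE
)
quant_cols <- grepl('^Intensity\\.', colnames(raw1))

dropped <- raw1[, quant_cols]                # becomes a character vector
kept    <- raw1[, quant_cols, drop = FALSE]  # stays a one-column data frame

str(dropped)
str(kept)
```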

3) convert the quantitation data frame to a numeric matrix

Prior to R version 4, my standard method for step 3 was to use data.matrix().  After recently updating from v3.6.3, I've found that all my workflows using this function were giving wildly incorrect results. I figured out that data.matrix now yields a matrix of integer factor codes rather than the actual numeric values:

> quant.df.char
  Intensity.SampleA Intensity.SampleB Intensity.SampleC
1            919948           1625540           1232780
2           1346170            710272           1481040
3             15870             83624             62548

> data.matrix(quant.df.char)
     Intensity.SampleA Intensity.SampleB Intensity.SampleC
[1,]                 3                 1                 1
[2,]                 1                 2                 2
[3,]                 2                 3                 3

The change in behaviour of this function is documented in the R v4.0.0 changelog, so it is clearly intentional:

"data.matrix() now converts character columns to factors and from this to integers."
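
A minimal sketch of where those integers appear to come from, using only base R: factor() sorts the distinct strings as text, so '1346170' sorts before '15870', which sorts before '919948', and the integer codes reflect that ordering rather than the numeric values. The names x, codes and values are just for illustration.

```r
x <- c('919948', '1346170', '15870')  # Intensity.SampleA, held as strings

codes  <- as.integer(factor(x))  # position of each string among the sorted levels
values <- as.numeric(x)          # the numbers the strings actually represent

print(levels(factor(x)))
print(codes)
print(values)
```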

Now, I know there are other ways to achieve the same conversion, e.g. sapply(quant.df.char, as.numeric). They aren't quite as straightforward to read in the code as data.matrix (with sapply/lapply in particular I have to think through whether there will be a need to transpose the result!), but the fact that this base function has been changed (without a way to replicate the previous behaviour) leads me to suspect that I have probably not previously been using data.matrix in the intended manner - and I may therefore be making similar mistakes elsewhere! I've certainly distributed/handed out R scripting examples in the past that will now give incorrect results when run on v4+ R.
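
For reference, a sketch of two base R conversions that should give the numeric values on both old and new R versions, assuming every quantitation column holds strings that parse as numbers (as.numeric returns NA with a warning for anything that doesn't). The object names quant.mat and quant.mat2 are made up for this example.

```r
quant.df.char <- data.frame(
  Intensity.SampleA = c('919948', '1346170', '15870'),
  Intensity.SampleB = c('1625540', '710272', '83624'),
  stringsAsFactors = FALSE
)

# Option 1: convert column by column; with more than one row the result is
# already samples-in-columns, so no transpose is needed
quant.mat <- sapply(quant.df.char, as.numeric)

# Option 2: build a character matrix first, then change its storage mode,
# which keeps the dimensions and column names
quant.mat2 <- as.matrix(quant.df.char)
storage.mode(quant.mat2) <- "double"

print(quant.mat)
print(quant.mat2)
```

Note that sapply simplifies to a named vector rather than a matrix if the data frame has only one row, so the second option may be the safer of the two in a general-purpose script.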

What is even more confusing to me (but possibly related as regards an answer) is that R v4 broke with long-standing convention by changing default.stringsAsFactors() to FALSE. So on one hand the update took away what was (at least, from our perspective, with our data - I am sure some here may disagree!) a perennial source of confusion/bugs for R learners, by no longer introducing string factorisation during data import, and then on the other hand it changed a base function to explicitly introduce string factorisation.  I can't see when converting a character dataset straight to integer factor codes, rather than to factors, would be that useful (but of course that doesn't mean it isn't!).

I've had a look through r-help and r-devel archives and couldn't spot any discussion of this, so apologies if this has been asked before. I'm also pretty sure my misunderstanding is with the intended use-case of data.matrix and R ethos around strings/factors rather than the rationale for the change, which is why I'm asking here.

Best wishes,

Phil

Philip Charles
Target Discovery Institute, Nuffield Department Of Medicine
University of Oxford



