[Rd] sparse.model.matrix Generates Non-Existent Factor Levels if Ord.factor Columns Present

Dario Strbenac dstr7320 at uni.sydney.edu.au
Thu Feb 8 05:00:13 CET 2018


Good day,

sparse.model.matrix sometimes returns a dgCMatrix whose column names contain factor levels that are not in the original dataset. The first factor appears to be transformed correctly, but the factors after it are not. For example:

> library(Matrix)
> diamonds <- as.data.frame(ggplot2::diamonds)
> colnames(sparse.model.matrix(~ . -1, diamonds))
 [1] "carat"        "cutFair"      "cutGood"      "cutVery Good" "cutPremium"   "cutIdeal"     "color.L"      "color.Q"      "color.C"      "color^4"      "color^5"     
[12] "color^6"      "clarity.L"    "clarity.Q"    "clarity.C"    "clarity^4"    "clarity^5"    "clarity^6"    "clarity^7"    "depth"        "table"        "price"       
[23] "x"            "y"            "z"

The suffixes appended to color and clarity in the transformed matrix (.L, .Q, .C, ^4 and so on) are not levels of those factors. The values in those columns are also wrong. Converting the Ord.factor columns to plain factors fixes the problem.
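
In case it helps to narrow this down, here is a smaller sketch that does not need ggplot2. The data frame toy and its columns shape, grade and value are made up for illustration. It also prints the column names from base model.matrix and the active contrasts option, because the .L/.Q/.C suffixes resemble the column names of polynomial contrasts (contr.poly), which is R's default for ordered factors. Whether that is what happens inside sparse.model.matrix is an assumption on my part.

```r
library(Matrix)

# Hypothetical toy data: one plain factor, one ordered factor, one numeric.
toy <- data.frame(
  shape = factor(c("a", "b", "c", "a")),
  grade = factor(c("low", "mid", "high", "mid"),
                 levels = c("low", "mid", "high"), ordered = TRUE),
  value = c(1.5, 2.0, 0.5, 3.0)
)

# Column names from the sparse and the dense model matrix, for comparison.
sparse_names <- colnames(sparse.model.matrix(~ . - 1, toy))
dense_names  <- colnames(model.matrix(~ . - 1, toy))

print(sparse_names)
print(dense_names)

# Contrasts currently in effect for unordered and ordered factors.
print(getOption("contrasts"))
```

If the dense and sparse names agree, the suffixes would be coming from the contrast coding rather than from anything specific to the sparse code path.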

> diamonds[, "cut"] <- factor(as.character(diamonds[, "cut"]))
> diamonds[, "color"] <- factor(as.character(diamonds[, "color"]))
> diamonds[, "clarity"] <- factor(as.character(diamonds[, "clarity"]))

> colnames(sparse.model.matrix(~ . -1, diamonds)) # No more invented factor levels.
 [1] "carat"        "cutFair"      "cutGood"      "cutIdeal"     "cutPremium"   "cutVery Good" "colorE"       "colorF"       "colorG"       "colorH"      
[11] "colorI"       "colorJ"       "clarityIF"    "claritySI1"   "claritySI2"   "clarityVS1"   "clarityVS2"   "clarityVVS1"  "clarityVVS2"  "depth"       
[21] "table"        "price"        "x"            "y"            "z"
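
Note that factor(as.character(x)) also re-sorts the levels alphabetically, which is why cut now lists Ideal before Premium. A sketch of a conversion that drops the ordering but keeps the level order is below. The helper unorder_factors and the toy data are hypothetical names used only for illustration, not part of Matrix.

```r
library(Matrix)

# Hypothetical helper: convert every ordered factor column of a data frame
# into a plain factor while keeping the existing level order.
unorder_factors <- function(df) {
  is_ord <- vapply(df, is.ordered, logical(1))
  for (nm in names(df)[is_ord]) {
    df[[nm]] <- factor(df[[nm]], levels = levels(df[[nm]]), ordered = FALSE)
  }
  df
}

# Hypothetical toy data with one ordered factor.
toy <- data.frame(
  grade = factor(c("low", "mid", "high", "mid"),
                 levels = c("low", "mid", "high"), ordered = TRUE),
  value = c(1.5, 2.0, 0.5, 3.0)
)

plain <- unorder_factors(toy)
plain_names <- colnames(sparse.model.matrix(~ . - 1, plain))

print(levels(plain$grade))  # level order is kept, not alphabetical
print(plain_names)
```

The same helper could be applied to the diamonds data frame in place of the three factor(as.character(...)) calls.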

Can it be made to work correctly for both plain and ordered factors?

> sessionInfo()
R Under development (unstable) (2018-02-06 r74231)
Platform: i386-w64-mingw32/i386 (32-bit)

other attached packages:
[1] Matrix_1.2-12

loaded via a namespace (and not attached):
 [1] colorspace_1.3-2 scales_0.5.0     compiler_3.5.0   lazyeval_0.2.1  
 [5] plyr_1.8.4       pillar_1.1.0     gtable_0.2.0     tibble_1.4.2    
 [9] Rcpp_0.12.15     ggplot2_2.2.1    grid_3.5.0       rlang_0.1.6     
[13] munsell_0.4.3    lattice_0.20-35

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia


