[R] Preprocess and training model with text data

Fri Apr 15 15:15:34 CEST 2022

Hi everyone,

I am working on text categorization (my first such project, so I am still
learning), and my dataset has several text columns (details about the data
are pasted at the bottom). I have worked with numeric data before, but my
advisor has now asked me to perform predictive modeling on this text data.

I am doing some preprocessing such as tokenizing, lowercasing, stemming, etc.

The following code is used for tokenization, then lowercasing, then stemming:

train.tokens <- tokens(train$DESCRIPTION, what = "word",
                       remove_numbers = TRUE, remove_punct = TRUE,
                       remove_symbols = TRUE, remove_hyphens = TRUE)

train.tokens <- tokens_tolower(train.tokens)

train.tokens <- tokens_wordstem(train.tokens, language = "english")

I have two questions:

(1) If we have more text features (apart from DESCRIPTION), do I need to
repeat these steps for each feature? I tried the following, but it does not
work:

train.tokens <- tokens(c(train$DESCRIPTION,,train$NAME), what = "word",
                       remove_numbers = TRUE, remove_punct = TRUE,
                       remove_symbols = TRUE, remove_hyphens = TRUE)
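
For reference, one possible way to handle several text columns is to paste
them into a single string per row before tokenizing. This is only a sketch
and assumes that simple concatenation is acceptable for the task. The toy
data frame and the TEXT column name are made up for illustration, and
split_hyphens is the argument name that newer quanteda releases use in place
of remove_hyphens. Note that c(a, , b) with an empty argument is an error in
R, and that c(a, b) stacks the two columns into one vector twice as long
instead of combining them row by row.

```r
library(quanteda)

# Toy stand-in for the real data (hypothetical values, not the actual rows)
train <- data.frame(
  NAME = c("Branches need coverage", "Lines need coverage"),
  DESCRIPTION = c("Example description about branch coverage",
                  "Example description about line coverage"),
  stringsAsFactors = FALSE
)

# One combined string per row, so the number of documents stays the same
train$TEXT <- paste(train$NAME, train$DESCRIPTION, sep = " ")

# Same preprocessing steps as above, applied once to the combined text
train.tokens <- tokens(train$TEXT, what = "word",
                       remove_numbers = TRUE, remove_punct = TRUE,
                       remove_symbols = TRUE, split_hyphens = TRUE)
train.tokens <- tokens_tolower(train.tokens)
train.tokens <- tokens_wordstem(train.tokens, language = "english")
```

With this approach the preprocessing runs once, and each row of the data
still corresponds to exactly one tokenized document.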

(2) After we preprocess the data and create our bag-of-words model as shown
below, are we ready to train our model and perform prediction?

train.tokens.dfm <- dfm(train.tokens, tolower = FALSE)

train.tokens.matrix <- as.matrix(train.tokens.dfm)
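
As a rough sketch of what usually still sits between the document-feature
matrix and a model: the labels have to be attached to the training matrix,
and any test data has to be mapped onto the same feature columns. The toy
data, the prep() helper, and the choice of SEVERITY as the label are all
assumptions made for illustration; the actual target variable is not stated
here, and no particular model is implied.

```r
library(quanteda)

# Toy stand-ins (hypothetical values); SEVERITY as the label is an assumption
train <- data.frame(
  DESCRIPTION = c("unused variable should be removed",
                  "branch coverage is too low",
                  "unused import should be removed"),
  SEVERITY = c("MINOR", "MAJOR", "MINOR"),
  stringsAsFactors = FALSE
)
test <- data.frame(DESCRIPTION = "unused method with low coverage",
                   stringsAsFactors = FALSE)

# Hypothetical helper wrapping the preprocessing steps from this post
prep <- function(x) {
  toks <- tokens(x, what = "word", remove_numbers = TRUE,
                 remove_punct = TRUE, remove_symbols = TRUE)
  toks <- tokens_tolower(toks)
  tokens_wordstem(toks, language = "english")
}

# Training side: bag-of-words matrix plus the label column
train.tokens.dfm <- dfm(prep(train$DESCRIPTION), tolower = FALSE)
train.df <- cbind(Label = train$SEVERITY,
                  as.data.frame(as.matrix(train.tokens.dfm)))

# Test side: keep exactly the training features, in the same order;
# unseen words are dropped and missing ones are filled with zeros
test.tokens.dfm <- dfm_match(dfm(prep(test$DESCRIPTION), tolower = FALSE),
                             features = featnames(train.tokens.dfm))
test.df <- as.data.frame(as.matrix(test.tokens.dfm))
```

A data frame like train.df can then be passed to a modeling function of
choice, and test.df has matching columns for prediction.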

My data, as also mentioned in previous emails:

Rows: 1,819
Columns: 14
$ PLUGIN_RULE_KEY             <chr> "InsufficientBranchCoverage", "InsufficientLin~
$ PLUGIN_CONFIG_KEY           <chr> "", "", "", "", "", "", "", "", "", "", "S1120~
$ PLUGIN_NAME                 <chr> "common-java", "common-java", "common-java", "~
$ DESCRIPTION                 <chr> "An issue is created on a file as soon as the ~
$ SEVERITY                    <chr> "MAJOR", "MAJOR", "MAJOR", "MAJOR", "MAJOR", "~
$ NAME                        <chr> "Branches should have sufficient coverage by t~
$ DEF_REMEDIATION_FUNCTION    <chr> "LINEAR", "LINEAR", "LINEAR", "LINEAR_OFFSET",~
$ REMEDIATION_GAP_MULT        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ DEF_REMEDIATION_BASE_EFFORT <chr> "", "", "", "10min", "", "", "5min", "5min", "~
$ GAP_DESCRIPTION             <chr> "number of uncovered conditions", "number of l~
$ SYSTEM_TAGS                 <chr> "bad-practice", "bad-practice", "convention", ~
$ IS_TEMPLATE                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ DESCRIPTION_FORMAT          <chr> "HTML", "HTML", "HTML", "HTML", "HTML", "HTML"~
$ TYPE                        <chr> "CODE_SMELL", "CODE_SMELL", "CODE_SMELL", "COD~
