[R] Error with text analysis data

Thu Apr 14 02:38:00 CEST 2022

Hi David,

I just used the caret library and the farff file format.

sapply(d, function(x){ length(unique(x)) } )
            PLUGIN_RULE_KEY           PLUGIN_CONFIG_KEY                 PLUGIN_NAME
                       1817                        1211                          14
                DESCRIPTION                    SEVERITY                        NAME
                       1815                           5                        1813
   DEF_REMEDIATION_FUNCTION        REMEDIATION_GAP_MULT DEF_REMEDIATION_BASE_EFFORT
                          4                           1                          15
            GAP_DESCRIPTION                 SYSTEM_TAGS                 IS_TEMPLATE
                         15                         142                           2
         DESCRIPTION_FORMAT                        TYPE
                          2                           3
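
A minimal sketch of how counts like these can be used to flag columns that cannot contribute to a model: any column with a single distinct value is constant. The toy data frame and the names `d_toy`, `constant_cols` and `d_clean` are assumptions for illustration, not the SONAR_RULES data.

```r
# Toy stand-in for the real data (hypothetical values)
d_toy <- data.frame(
  TYPE = c("BUG", "CODE_SMELL", "BUG", "VULNERABILITY"),
  DESCRIPTION_FORMAT = rep("HTML", 4),
  REMEDIATION_GAP_MULT = rep(NA, 4),
  IS_TEMPLATE = c(0L, 0L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Number of distinct values per column (NA counts as one value)
n_unique <- sapply(d_toy, function(x) length(unique(x)))

# Columns with a single distinct value are constant
constant_cols <- names(n_unique)[n_unique < 2]

# Drop all constant columns at once
d_clean <- d_toy[, !(names(d_toy) %in% constant_cols), drop = FALSE]
names(d_clean)
```

An all-NA column counts as constant here because `unique()` treats NA as a single value.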

On Thu, Apr 14, 2022 at 2:30 AM David Winsemius <dwinsemius using comcast.net>
wrote:

>
> On 4/13/22 16:58, Neha gupta wrote:
>
>  summary(d)
>
>  PLUGIN_RULE_KEY    PLUGIN_CONFIG_KEY  PLUGIN_NAME        DESCRIPTION
>  Length:1819        Length:1819        Length:1819        Length:1819
>  Class :character   Class :character   Class :character   Class :character
>  Mode  :character   Mode  :character   Mode  :character   Mode  :character
>
>    SEVERITY             NAME           REMEDIATION_FUNCTION
>  Length:1819        Length:1819        Mode:logical
>  Class :character   Class :character   NA's:1819
>  Mode  :character   Mode  :character
>
> According to your code that column (and the two following it) should have
> been deleted with the lines:
>
> d$REMEDIATION_FUNCTION=NULL
> d$DEF_REMEDIATION_GAP_MULT=NULL
> d$REMEDIATION_BASE_EFFORT=NULL
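
One way to confirm that assignments like these took effect is to compare the requested names against the column names of the data frame. A minimal sketch on a toy data frame (the values are made up and `d_toy`, `to_drop` and `missing_names` are hypothetical; only the column names come from this thread):

```r
# Toy data: one of the names we try to drop does not exist
d_toy <- data.frame(
  REMEDIATION_FUNCTION = NA,
  REMEDIATION_GAP_MULT = NA,
  TYPE = c("BUG", "CODE_SMELL"),
  stringsAsFactors = FALSE
)

to_drop <- c("REMEDIATION_FUNCTION", "DEF_REMEDIATION_GAP_MULT")

# Names requested for removal that are not columns of the data
missing_names <- setdiff(to_drop, names(d_toy))
missing_names

# Remove only the columns that actually exist
d_toy[intersect(to_drop, names(d_toy))] <- NULL
names(d_toy)
```

Assigning NULL to a name that is not a column does nothing and raises no error, so a check like this makes a mismatch visible.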
>
> # And we're still only guessing that the caret package has been loaded since you have not offered the library calls that have set your workspace.
>
> Now we need:
> sapply(d, function(x){ length(unique(x)) } )
>
> # I generally use Hmisc::describe since it is a vast improvement over the default summary.data.frame
>
>  DEF_REMEDIATION_FUNCTION REMEDIATION_GAP_MULT DEF_REMEDIATION_GAP_MULT
>  Length:1819              Mode:logical         Length:1819
>  Class :character         NA's:1819            Class :character
>  Mode  :character                              Mode  :character
>
>
>
>  REMEDIATION_BASE_EFFORT
>
> So that column should also have been deleted since it is all NA.
>
>
> DEF_REMEDIATION_BASE_EFFORT GAP_DESCRIPTION
>  Mode:logical            Length:1819                 Length:1819
>  NA's:1819               Class :character            Class :character
>                          Mode  :character            Mode  :character
>
>
>
>  SYSTEM_TAGS         IS_TEMPLATE      DESCRIPTION_FORMAT     TYPE
>  Length:1819        Min.   :0.00000   Length:1819        Length:1819
>  Class :character   1st Qu.:0.00000   Class :character   Class :character
>  Mode  :character   Median :0.00000   Mode  :character   Mode  :character
>                     Mean   :0.02859
>                     3rd Qu.:0.00000
>                     Max.   :1.00000
>
>
> The IS_TEMPLATE column certainly does not look like "text".
>
> You have been posting in HTML. STOP DOING THAT. Read the Posting Guide
>
>
> --
> David.
>
>
> On Thu, Apr 14, 2022 at 1:33 AM David Winsemius <dwinsemius using comcast.net>
> wrote:
>
>>
>> On 4/13/22 13:07, Neha gupta wrote:
>> > Thank you Tim
>> >
>> > My purpose and aim is to train a model
>>
>>
>> It appears you are using the 'caret' package. You should have posted
>> code that loaded all the packages that you thought were essential to the
>> effort.
>>
>> > (based on the data I provided in my
>> > first email) and predict the output variable (TYPE variable which has three
>> > different values like Severity, bugs, code smell etc) The data as you can
>> > see also have a few text columns which is creating problems for me as I
>> > have never worked before with text data.
>> You offered what looks something like the output of `str(.)` on a
>> dataframe. It would also be a great help if you offered the output of
>> `summary(d)`
>> >
>> > I read a tutorial
>> What tutorial?
>> > that says stringsAsFactors should be FALSE, symbols etc
>> > should be removed, then data should be in tokens, then the text should be
>> > placed in a matrix/table form, and then a model should be built. I am not
>> > sure if these steps are required in my case. It's, I think, a word2vec problem
>> > though I did not use it before.
>>
>>
>> I'm not sure what a "word2vec problem" might be.
>>
>> --
>>
>> David.
>>
>> >
>> > Best regards
>> >
>> > On Wed, Apr 13, 2022 at 9:53 PM Ebert,Timothy Aaron <tebert using ufl.edu>
>> wrote:
>> >
>> >> Is this a different question from the original post? It would be better to
>> >> keep threads separate.
>> >>
>> >> Always pre-process the data. Clean the data of obvious mistakes. This can
>> >> be simple typographical errors or complicated like an author that wrote too
>> >> when they intended two or to. In old English texts spelling was not
>> >> standardized and the same word could have multiple spellings within one
>> >> book or chapter. Removing punctuation is probably a part of this, though a
>> >> program like Grammarly would not work very well if it removed punctuation.
>> >>
>> >>
>> >>
>> >> After that it depends on what you are trying to accomplish. Are you
>> >> interested in the number of times an author used the word “a” or “the”, and
>> >> is “The” different from “the”? Are you modeling word use frequency or
>> >> comparing vocabulary between texts?
>> >>
>> >>
>> >>
>> >> Too many choices.
>> >>
>> >>
>> >>
>> >> Tim
>> >>
>> >>
>> >>
>> >> *From:* Neha gupta <neha.bologna90 using gmail.com>
>> >> *Sent:* Wednesday, April 13, 2022 2:49 PM
>> >> *To:* Bill Dunlap <williamwdunlap using gmail.com>
>> >> *Cc:* Ebert,Timothy Aaron <tebert using ufl.edu>; r-help mailing list <
>> >> r-help using r-project.org>
>> >> *Subject:* Re: Error with text analysis data
>> >>
>> >>
>> >>
>> >> *[External Email]*
>> >>
>> >> Someone just told me that you need to pre-process the data before model
>> >> construction. For instance, convert the text to lower case, remove
>> >> punctuation, symbols etc and tokenize the text (give a number to each word).
>> >> Then create a bag-of-words model (not sure about it), and then create a
>> >> model.
>> >>
>> >>
>> >>
>> >> Is it necessary to perform all these steps?
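
The steps listed above (lower-casing, removing punctuation, tokenizing, building a bag-of-words table) can be sketched in base R. This is only an illustration under stated assumptions: the two example sentences and the names `docs`, `tokens`, `vocab` and `dtm` are hypothetical, and a real analysis would more likely use a dedicated text-mining package.

```r
# Two made-up documents standing in for a text column
docs <- c("Branches should have sufficient coverage.",
          "Lines should have sufficient coverage!")

# 1. Lower-case and replace punctuation with spaces
clean <- tolower(docs)
clean <- gsub("[[:punct:]]+", " ", clean)

# 2. Tokenize on whitespace
tokens <- strsplit(trimws(clean), "\\s+")

# 3. Vocabulary: every distinct token, sorted
vocab <- sort(unique(unlist(tokens)))

# 4. Bag of words: one row per document, one column per token,
#    each cell holding how often the token occurs in the document
dtm <- t(vapply(
  tokens,
  function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(dtm) <- vocab
dtm
```

The resulting numeric matrix is the kind of table a model can be trained on, in place of the raw character column.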
>> >>
>> >>
>> >>
>> >> Best regards
>> >>
>> >> On Wednesday, April 13, 2022, Bill Dunlap <williamwdunlap using gmail.com>
>> >> wrote:
>> >>
>> >>>   I would always suggest working until the model works, no errors and no
>> >>>   NA values
>> >>
>> >>
>> >>
>> >> We agree on that.  However, the error gives you no hint about which
>> >> variables are causing the problem.  If it did, then it could only tell
>> >> about the first variable with the problem.  I think you would get to your
>> >> working model faster if you got NA's for the constant columns and then
>> >> could drop them all at once (or otherwise deal with them).
>> >>
>> >>
>> >>
>> >> -Bill
>> >>
>> >>
>> >>
>> >> On Wed, Apr 13, 2022 at 9:40 AM Ebert,Timothy Aaron <tebert using ufl.edu>
>> >> wrote:
>> >>
>> >> I suspect that it is because you are looking at two types of error, both
>> >> telling you that the model was not appropriate. In the “error in contrasts”
>> >> there is nothing to contrast in the model. For a numerical constant the
>> >> program calculates the standard deviation and ends with a division by zero.
>> >> Division by zero is undefined, or NA.
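
A minimal sketch of the numeric case, assuming toy data shaped like Bill's example in this thread (the names `X` and `fit` are hypothetical). The constant column has zero standard deviation and carries no information beyond the intercept, so the model matrix has rank 1 rather than 2, and `lm()` reports the coefficient it cannot estimate as NA.

```r
# Constant numeric predictor, as in the sexCode example
d <- data.frame(y = 1:5, sexCode = rep(2L, 5))

# No variability in the predictor
sd(d$sexCode)

# The model matrix has two columns (intercept and sexCode) but rank 1,
# because the sexCode column is just 2 times the intercept column
X <- model.matrix(y ~ sexCode, data = d)
qr(X)$rank

# lm() fits the intercept and marks the redundant coefficient as NA
fit <- lm(y ~ sexCode, data = d)
coef(fit)
```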
>> >>
>> >>
>> >>
>> >> I would always suggest working until the model works, no errors and no NA
>> >> values. The reason is that I can get NA in several ways and I need to
>> >> understand why. If I just ignore the NA in my model I may be assuming the
>> >> wrong thing.
>> >>
>> >>
>> >>
>> >> Tim
>> >>
>> >>
>> >>
>> >> *From:* Bill Dunlap <williamwdunlap using gmail.com>
>> >> *Sent:* Wednesday, April 13, 2022 12:23 PM
>> >> *To:* Ebert,Timothy Aaron <tebert using ufl.edu>
>> >> *Cc:* Neha gupta <neha.bologna90 using gmail.com>; r-help mailing list <
>> >> r-help using r-project.org>
>> >> *Subject:* Re: [R] Error with text analysis data
>> >>
>> >>
>> >>
>> >> *[External Email]*
>> >>
>> >> Constant columns can be in the model when you do some subsetting or are
>> >> exploring a new dataset.  My objection is that constant columns of numbers
>> >> and logicals are fine but those of characters and factors are not.
>> >>
>> >>
>> >>
>> >> -Bill
>> >>
>> >>
>> >>
>> >> On Wed, Apr 13, 2022 at 9:15 AM Ebert,Timothy Aaron <tebert using ufl.edu>
>> >> wrote:
>> >>
>> >> What is the goal of having a constant in the model? To me that seems
>> >> pointless. Also there is no variability in sexCode regardless of whether
>> >> you call it integer or factor. So the model y ~ sexCode is just a strange
>> >> way to look at the variability in y and it would be better to do something
>> >> like summarize(y) or mean(y) if that was the goal.
>> >>
>> >> Tim
>> >>
>> >> -----Original Message-----
>> >> From: R-help <r-help-bounces using r-project.org> On Behalf Of Bill Dunlap
>> >> Sent: Wednesday, April 13, 2022 9:56 AM
>> >> To: Neha gupta <neha.bologna90 using gmail.com>
>> >> Cc: r-help mailing list <r-help using r-project.org>
>> >> Subject: Re: [R] Error with text analysis data
>> >>
>> >> [External Email]
>> >>
>> >> This sounds like what I think is a bug in stats::model.matrix.default(): a
>> >> numeric column with all identical entries is fine but a constant character
>> >> or factor column is not.
>> >>
>> >>> d <- data.frame(y=1:5, sex=rep("Female",5))
>> >>> d$sexFactor <- factor(d$sex, levels=c("Male","Female"))
>> >>> d$sexCode <- as.integer(d$sexFactor)
>> >>> d
>> >>    y    sex sexFactor sexCode
>> >> 1 1 Female    Female       2
>> >> 2 2 Female    Female       2
>> >> 3 3 Female    Female       2
>> >> 4 4 Female    Female       2
>> >> 5 5 Female    Female       2
>> >>> lm(y~sex, data=d)
>> >> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
>> >>    contrasts can be applied only to factors with 2 or more levels
>> >>> lm(y~sexFactor, data=d)
>> >> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
>> >>    contrasts can be applied only to factors with 2 or more levels
>> >>> lm(y~sexCode, data=d)
>> >> Call:
>> >> lm(formula = y ~ sexCode, data = d)
>> >>
>> >> Coefficients:
>> >> (Intercept)      sexCode
>> >>            3           NA
>> >>
>> >> Calling traceback() after the error would clarify this.
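
A minimal sketch, reusing the toy data from the example above, of why the factor with two declared levels still triggers the error: `lm()` drops unused levels when it builds the model frame, so only the levels actually present count. The names `declared`, `present` and `msg` are hypothetical.

```r
d <- data.frame(y = 1:5, sex = rep("Female", 5))
d$sexFactor <- factor(d$sex, levels = c("Male", "Female"))

# Levels declared on the factor
declared <- nlevels(d$sexFactor)

# Levels that actually occur in the data
present <- nlevels(droplevels(d$sexFactor))

# Capture the error message instead of stopping
msg <- tryCatch(
  {
    lm(y ~ sexFactor, data = d)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)

c(declared = declared, present = present)
msg
```

Checking `nlevels(droplevels(x))` for each factor or character column before fitting is one way to find the offending variables without waiting for the error.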
>> >>
>> >> -Bill
>> >>
>> >>
>> >> On Tue, Apr 12, 2022 at 3:12 PM Neha gupta <neha.bologna90 using gmail.com>
>> >> wrote:
>> >>
>> >>> Hello everyone, I have text data whose output variable has three subgroups.
>> >>> I am using the following code but getting the error message (see error
>> >>> after the code).
>> >>>
>> >>> d=read.csv("SONAR_RULES.csv", stringsAsFactors = FALSE)
>> >>> d$REMEDIATION_FUNCTION=NULL
>> >>> d$DEF_REMEDIATION_GAP_MULT=NULL
>> >>> d$REMEDIATION_BASE_EFFORT=NULL
>> >>>
>> >>> index <- createDataPartition(d$TYPE, p = .70,list = FALSE)
>> >>> tr <- d[index, ]
>> >>> ts <- d[-index, ]
>> >>>
>> >>> ctrl <- trainControl(method = "cv",number=3, index = index,
>> >>>                      classProbs = TRUE, summaryFunction = multiClassSummary)
>> >>>
>> >>> ran <- train(TYPE ~ ., data = tr,
>> >>>                      method = "rpart",
>> >>>                      ## Will create 48 parameter combinations
>> >>>                      tuneLength = 3,
>> >>>                      na.action= na.pass,
>> >>>                      metric = "Accuracy",
>> >>>                      preProc = c("center", "scale", "nzv"),
>> >>>                      trControl = ctrl)
>> >>> getTrainPerf(ran)
>> >>>
>> >>> It gives me this error:
>> >>>
>> >>> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
>> >>>   contrasts can be applied only to factors with 2 or more levels
>> >>>
>> >>> My data is as follows:
>> >>>
>> >>> Rows: 1,819
>> >>> Columns: 14
>> >>> $ PLUGIN_RULE_KEY             <chr> "InsufficientBranchCoverage", "InsufficientLin~
>> >>> $ PLUGIN_CONFIG_KEY           <chr> "", "", "", "", "", "", "", "", "", "", "S1120~
>> >>> $ PLUGIN_NAME                 <chr> "common-java", "common-java", "common-java", "~
>> >>> $ DESCRIPTION                 <chr> "An issue is created on a file as soon as the ~
>> >>> $ SEVERITY                    <chr> "MAJOR", "MAJOR", "MAJOR", "MAJOR", "MAJOR", "~
>> >>> $ NAME                        <chr> "Branches should have sufficient coverage by t~
>> >>> $ DEF_REMEDIATION_FUNCTION    <chr> "LINEAR", "LINEAR", "LINEAR", "LINEAR_OFFSET",~
>> >>> $ REMEDIATION_GAP_MULT        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
>> >>> $ DEF_REMEDIATION_BASE_EFFORT <chr> "", "", "", "10min", "", "", "5min", "5min", "~
>> >>> $ GAP_DESCRIPTION             <chr> "number of uncovered conditions", "number of l~
>> >>> $ SYSTEM_TAGS                 <chr> "bad-practice", "bad-practice", "convention", ~
>> >>> $ IS_TEMPLATE                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
>> >>> $ DESCRIPTION_FORMAT          <chr> "HTML", "HTML", "HTML", "HTML", "HTML", "HTML"~
>> >>> $ TYPE                        <chr> "CODE_SMELL", "CODE_SMELL", "CODE_SMELL", "COD~
>> >>>
>> >>>          [[alternative HTML version deleted]]
>> >>>
>> >>> ______________________________________________
>> >>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >>> PLEASE do read the posting guide
>> >>> http://www.R-project.org/posting-guide.html
>> >>> and provide commented, minimal, self-contained, reproducible code.
>> >>>
>>
>
