[Rd] error handling in strcapture

Michael Lawrence lawrence.michael at gene.com
Tue Oct 4 23:59:33 CEST 2016


Once again, nice catch. I've committed a check for this.

Michael

On Tue, Oct 4, 2016 at 2:37 PM, William Dunlap <wdunlap at tibco.com> wrote:
> It is also not catching the cases where the number of capture expressions
> does not match the number of entries in proto.  I think all of the following
> should give an error about the mismatch.
>
>> strcapture("(.)(.)", c("ab", "cde", "fgh", "ij", "lm"), proto=list(A="",B="",C=""))
>    A  B  C
> 1  a  b cd
> 2  d fg  f
> 3 ij  i  j
> 4  l  m ab
> Warning message:
> In matrix(as.character(unlist(str)), ncol = ntokens, byrow = TRUE) :
>   data length [15] is not a sub-multiple or multiple of the number of rows [4]
>> strcapture("(.)(.)(.)", c("abc", "def", "ghi", "jkl", "mno"), proto=list(A="",B=""))
>     A   B
> 1   a   b
> 2 def   d
> 3   f ghi
> 4   h   i
> 5   j   k
> 6 mno   m
> 7   o abc
> Warning message:
> In matrix(as.character(unlist(str)), ncol = ntokens, byrow = TRUE) :
>   data length [20] is not a sub-multiple or multiple of the number of rows [7]
>> strcapture("(.)(.)(.)", c("abc", "def"), proto=list(A=""))
>   A
> 1 a
> 2 c
> 3 d
> 4 f
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Tue, Oct 4, 2016 at 2:21 PM, Michael Lawrence <lawrence.michael at gene.com>
> wrote:
>>
>> Hi Bill,
>>
>> This is a bug in regexec() and I will commit a fix.
>>
>> Thanks for the report,
>> Michael
>>
>> On Tue, Oct 4, 2016 at 1:40 PM, William Dunlap <wdunlap at tibco.com> wrote:
>> > I noticed a problem in the strcapture from R-devel (2016-09-27 r71386),
>> > when the text contains a missing value and perl=TRUE.
>> >
>> > {
>> >       # NA in text input should map to row of NA's in output, without warning
>> >       r9p <- strcapture(perl = TRUE, "(.).* ([[:digit:]]+)",
>> >                         c("One 1", NA, "Fifty 50"),
>> >                         data.frame(Initial=factor(), Number=numeric()))
>> >       e9p <- structure(list(Initial = structure(c(2L, NA, 1L),
>> >                                                 .Label = c("F", "O"),
>> >                                                 class = "factor"),
>> >                            Number = c(1, NA, 50)),
>> >                       row.names = c(NA, -3L),
>> >                       class = "data.frame")
>> >       all.equal(e9p, r9p)
>> >   }
>> > #Error in if (any(ind)) { : missing value where TRUE/FALSE needed
>> >
>> >
>> > Bill Dunlap
>> > TIBCO Software
>> > wdunlap tibco.com
>> >
>> > On Wed, Sep 21, 2016 at 2:32 PM, Michael Lawrence
>> > <lawrence.michael at gene.com> wrote:
>> >>
>> >> The new behavior is that it yields NAs when the pattern does not match
>> >> (like strptime) and for empty captures in a matching pattern it yields
>> >> the empty string, which is consistent with regmatches().
>> >>
>> >> Michael
>> >>
>> >> On Wed, Sep 21, 2016 at 2:21 PM, William Dunlap <wdunlap at tibco.com>
>> >> wrote:
>> >> > If there are any matches then strcapture can see if the pattern has the
>> >> > same number of capture expressions as the prototype has columns and give
>> >> > an error if not.  That seems appropriate.
>> >> >
>> >> > If there are no matches, then there is no easy way to see if the
>> >> > prototype is compatible with the pattern, so should strcapture just
>> >> > assume the best and fill in the prototype with NA's?
>> >> >
>> >> > Should there be warnings?  This is kind of like strptime(), which
>> >> > silently gives NA's when the format does not match the text input.
>> >> >
>> >> >
>> >> > Bill Dunlap
>> >> > TIBCO Software
>> >> > wdunlap tibco.com
>> >> >
>> >> > On Wed, Sep 21, 2016 at 2:10 PM, Michael Lawrence
>> >> > <lawrence.michael at gene.com> wrote:
>> >> >>
>> >> >> Hi Bill,
>> >> >>
>> >> >> Thanks, another good suggestion. strcapture() now returns NAs for
>> >> >> non-matches. It's nice to have someone kicking the tires on that
>> >> >> function.
>> >> >>
>> >> >> Michael
>> >> >>
>> >> >> On Wed, Sep 21, 2016 at 12:11 PM, William Dunlap via R-devel
>> >> >> <r-devel at r-project.org> wrote:
>> >> >> > Michael, thanks for looking at my first issue with utils::strcapture.
>> >> >> >
>> >> >> > Another issue is how it deals with lines that don't match the pattern.
>> >> >> > Currently it gives an error
>> >> >> >
>> >> >> >> strcapture("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"), proto=list(Name="", Number=0))
>> >> >> > Error in strcapture("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"),  :
>> >> >> >   number of matches does not always match ncol(proto)
>> >> >> >
>> >> >> > First, isn't the 'number of matches' the number of parenthesized
>> >> >> > subpatterns in the regular expression?  I thought that if the entire
>> >> >> > pattern matches then the subpatterns without matches would be
>> >> >> > shown as matches at position 0 with length 0.  Hence either the
>> >> >> > pattern is compatible with the prototype or it isn't, it does not
>> >> >> > depend on the text input.  E.g.,
>> >> >> >
>> >> >> >> regexec("^(([[:alpha:]]+)|([[:digit:]]+))$", c("Twelve", "12", "Z280"))
>> >> >> > [[1]]
>> >> >> > [1] 1 1 1 0
>> >> >> > attr(,"match.length")
>> >> >> > [1] 6 6 6 0
>> >> >> > attr(,"useBytes")
>> >> >> > [1] TRUE
>> >> >> >
>> >> >> > [[2]]
>> >> >> > [1] 1 1 0 1
>> >> >> > attr(,"match.length")
>> >> >> > [1] 2 2 0 2
>> >> >> > attr(,"useBytes")
>> >> >> > [1] TRUE
>> >> >> >
>> >> >> > [[3]]
>> >> >> > [1] -1
>> >> >> > attr(,"match.length")
>> >> >> > [1] -1
>> >> >> > attr(,"useBytes")
>> >> >> > [1] TRUE
>> >> >> >
>> >> >> > Second, an error message like 'some lines were bad' is not very
>> >> >> > helpful.  Should it put NA's in all the columns of the current output
>> >> >> > row if the input line didn't match the pattern and perhaps warn the
>> >> >> > user that there were problems?  The user could then look for rows of
>> >> >> > NA's to see where the problems were.
>> >> >> >
>> >> >> > Bill Dunlap
>> >> >> > TIBCO Software
>> >> >> > wdunlap tibco.com
>> >> >> >

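The mismatch discussed in this thread (number of capture expressions versus number of entries in proto) can be sketched in a few lines of R. This is only an illustration under stated assumptions, not the check that was committed to R-devel: count_captures() and check_proto() are hypothetical helper names, and the approach relies on regexec() returning one position per capture expression for any element that matches, as in the regexec() output quoted above. When nothing matches, the count is unknown and the sketch returns NA rather than guessing.

```r
# Hypothetical helper (not part of utils): number of capture expressions in
# `pattern`, inferred from the first element of `x` that matches.
count_captures <- function(pattern, x, perl = FALSE) {
  m <- regexec(pattern, x, perl = perl)
  # A successful match has a non-NA, positive start for the whole pattern.
  matched <- vapply(m, function(pos) !is.na(pos[1L]) && pos[1L] > 0L,
                    logical(1))
  if (!any(matched)) {
    return(NA_integer_)  # no match anywhere: cannot tell from the data
  }
  # First entry is the whole match; the rest are the capture expressions.
  length(m[[which(matched)[1L]]]) - 1L
}

# Hypothetical helper: stop if the capture count disagrees with `proto`.
# length() gives the number of entries for a list or a data.frame.
check_proto <- function(pattern, x, proto, perl = FALSE) {
  n <- count_captures(pattern, x, perl = perl)
  if (!is.na(n) && n != length(proto)) {
    stop(sprintf("pattern has %d capture expression(s) but proto has %d entries",
                 n, length(proto)))
  }
  invisible(n)
}

# Usage: the first call passes silently, the second signals an error.
check_proto("(.)(.)", c("ab", "cde"), list(A = "", B = ""))
msg <- tryCatch(
  check_proto("(.)(.)", c("ab", "cde"), list(A = "", B = "", C = "")),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Returning NA when no element matches mirrors the question raised in the thread: with no matches there is no easy data-driven way to tell whether the prototype is compatible with the pattern.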

