[Rd] error handling in strcapture

William Dunlap wdunlap at tibco.com
Wed Sep 21 23:21:21 CEST 2016


If there are any matches then strcapture can see if the pattern has the
same number of capture expressions as the prototype has columns and give an
error if not.  That seems appropriate.
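
A rough sketch of that check, using only base R's regexec().  The helper
name check_capture_count is hypothetical (it is not part of utils), and this
is not meant to describe how strcapture itself is implemented.  It relies on
the fact that a matched element of the regexec() result has one entry for
the whole match plus one per parenthesized subpattern:

```r
# Hypothetical helper: compare the number of capture groups in 'pattern'
# with the number of columns in 'proto', using the first matching line.
check_capture_count <- function(pattern, x, proto) {
  m <- regexec(pattern, x)
  # regexec() gives a single -1 for lines that do not match at all
  matched <- vapply(m, function(mi) mi[[1L]] != -1L, logical(1))
  if (!any(matched)) {
    return(NA)  # nothing matched, so the group count cannot be inferred
  }
  n_groups <- length(m[[which(matched)[1L]]]) - 1L
  n_groups == length(proto)
}

lines <- c("One 1", "noSpaceInLine", "Three 3")
ok      <- check_capture_count("(.+) (.+)", lines, list(Name = "", Number = 0))
too_few <- check_capture_count("(.+) (.+)", lines, list(Name = ""))
unknown <- check_capture_count("(.+) (.+)", "noSpaceInLine", list(Name = ""))
```

Returning NA when nothing matches keeps "could not check" distinct from
"checked and incompatible".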

If there are no matches, then there is no easy way to see if the prototype
is compatible with the pattern, so should strcapture just assume the best
and fill in the prototype with NA's?
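
For illustration only, here is one way the "fill in with NA's" behavior
could look in base R.  capture_or_na is a hypothetical name, and coercing
each column to the class of the corresponding prototype element is an
assumption of this sketch, not a statement about what strcapture does
internally:

```r
# Hypothetical sketch: one row per input line, NA in every column for
# lines that do not match 'pattern'.
capture_or_na <- function(pattern, x, proto) {
  parts <- regmatches(x, regexec(pattern, x))  # character(0) for non-matches
  n <- length(proto)
  cols <- lapply(seq_len(n), function(j) {
    s <- vapply(parts, function(p) {
      if (length(p) == n + 1L) p[[j + 1L]] else NA_character_
    }, character(1))
    # assumption: coerce to the class of the prototype column
    methods::as(s, class(proto[[j]])[1L])
  })
  names(cols) <- names(proto)
  as.data.frame(cols, stringsAsFactors = FALSE)
}

res <- capture_or_na("(.+) (.+)",
                     c("One 1", "noSpaceInLine", "Three 3"),
                     proto = list(Name = "", Number = 0))
```

With this sketch a line whose group count differs from the prototype is
treated the same as a non-matching line, which may or may not be desirable.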

Should there be warnings?  This is kind of like strptime(), which silently
gives NA's when the format does not match the text input.
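
A small example of the strptime() behavior referred to above; the input
strings are made up for illustration:

```r
# strptime() returns NA, without a warning, for input that does not
# match the format.
x <- c("2016-09-21", "not a date", "2016-09-22")
parsed <- strptime(x, format = "%Y-%m-%d", tz = "UTC")

# The caller has to look for the NA's to find the problem lines.
bad <- which(is.na(parsed))
```

Here bad holds the indices of the inputs that failed to parse, which is the
same "look for the NA's" workflow a user of strcapture would follow.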


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Wed, Sep 21, 2016 at 2:10 PM, Michael Lawrence
<lawrence.michael at gene.com> wrote:

> Hi Bill,
>
> Thanks, another good suggestion. strcapture() now returns NAs for
> non-matches. It's nice to have someone kicking the tires on that
> function.
>
> Michael
>
> On Wed, Sep 21, 2016 at 12:11 PM, William Dunlap via R-devel
> <r-devel at r-project.org> wrote:
> > Michael, thanks for looking at my first issue with utils::strcapture.
> >
> > Another issue is how it deals with lines that don't match the pattern.
> > Currently it gives an error
> >
> >> strcapture("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"),
> > proto=list(Name="", Number=0))
> > Error in strcapture("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"),  :
> >   number of matches does not always match ncol(proto)
> >
> > First, isn't the 'number of matches' the number of parenthesized
> > subpatterns in the regular expression?  I thought that if the entire
> > pattern matches then the subpatterns without matches would be
> > shown as matches at position 0 with length 0.  Hence the pattern is
> > either compatible with the prototype or it is not; that does not depend
> > on the text input.  E.g.,
> >
> >> regexec("^(([[:alpha:]]+)|([[:digit:]]+))$", c("Twelve", "12", "Z280"))
> > [[1]]
> > [1] 1 1 1 0
> > attr(,"match.length")
> > [1] 6 6 6 0
> > attr(,"useBytes")
> > [1] TRUE
> >
> > [[2]]
> > [1] 1 1 0 1
> > attr(,"match.length")
> > [1] 2 2 0 2
> > attr(,"useBytes")
> > [1] TRUE
> >
> > [[3]]
> > [1] -1
> > attr(,"match.length")
> > [1] -1
> > attr(,"useBytes")
> > [1] TRUE
> >
> > Second, an error message like 'some lines were bad' is not very helpful.
> > Should it put NA's in all the columns of the current output row if the
> > input line didn't match the pattern and perhaps warn the user that there
> > were problems?  The user could then look for rows of NA's to see
> > where the problems were.
> >
> > Bill Dunlap
> > TIBCO Software
> > wdunlap tibco.com
