[R] Patterns on postal codes

Wed Jan 8 10:31:08 CET 2014

Or consider a different approach to the problem... figure out which regex 
patterns fit the data.

# test series ... I think your ANAAAN was supposed to be ANANAN
zipcode <- c("22942-0173", "32601", "N9Y2E6", "S7V 1J9", "0022942-0173", 
"32-601", "NN9Y2E6", "S7V  1J9")
# test series in data frame
zipdf <- data.frame( Zip=zipcode )
# default condition for category
zipdf$Category <- "Unknown"
# recognize US patterns ... test for "Unknown" is only there for
# consistency in this first search
zipdf[ with( zipdf, "Unknown"==Category & grepl( 
"^[[:digit:]]{5}(-[[:digit:]]{4})?$", Zip ) ), "Category" ] <- "US"
# recognize Canada patterns
zipdf[ with( zipdf, "Unknown"==Category & grepl( 
"^[[:alpha:]][[:digit:]][[:alpha:]] ?[[:digit:]][[:alpha:]][[:digit:]]$", 
Zip ) ), "Category" ] <- "CA"
# summarize categories
table(zipdf$Category)
# review un-recognized zips
zipdf[ "Unknown"==zipdf$Category, ]
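
One way to decide which regex to add next is to reduce the unrecognized 
values to shapes, using the substitution idea from the quoted message 
below, and tabulate them. This is only a sketch: zip_shape is a 
hypothetical helper name, and the input is simply the four values the 
test series above leaves as "Unknown".

```r
# Hypothetical helper: map each character to a shape symbol.
# Letters must be replaced before digits, because the digit symbol "N"
# is itself a letter and would otherwise be turned into "A".
zip_shape <- function(z) {
  s <- gsub("[[:space:]]", "_", z)
  s <- gsub("[[:alpha:]]", "A", s)
  gsub("[[:digit:]]", "N", s)
}

# the four values left as "Unknown" by the test series above
unknown <- c("0022942-0173", "32-601", "NN9Y2E6", "S7V  1J9")

# most frequent shapes first
sort(table(zip_shape(unknown)), decreasing = TRUE)
```

On real data the most frequent shapes at the top of that table are the 
ones worth writing a pattern for first.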

Note that regular expressions are documented in many places... there are 
whole books on them. The patterns above build in some flexibility (the 
optional "-NNNN" suffix, the optional space)... while you are still 
learning what patterns are in the data it can be easier to set up several 
simpler regex patterns that all map to the same category, though making 
multiple passes is slower, which may be an issue for large amounts of data.
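
A minimal sketch of that idea, assuming the simple patterns are kept in a 
small lookup table and applied one pass at a time. The table name rules 
and its column names are made up for illustration; the patterns are the 
ones above, split into their simplest forms.

```r
# Hypothetical lookup table: several simple patterns share one category.
rules <- data.frame(
  Category = c("US", "US", "CA", "CA"),
  Regex = c(
    "^[[:digit:]]{5}$",
    "^[[:digit:]]{5}-[[:digit:]]{4}$",
    "^[[:alpha:]][[:digit:]][[:alpha:]][[:digit:]][[:alpha:]][[:digit:]]$",
    "^[[:alpha:]][[:digit:]][[:alpha:]] [[:digit:]][[:alpha:]][[:digit:]]$"
  ),
  stringsAsFactors = FALSE
)

zipcode <- c("22942-0173", "32601", "N9Y2E6", "S7V 1J9", "0022942-0173",
             "32-601", "NN9Y2E6", "S7V  1J9")
zipdf <- data.frame(Zip = zipcode, Category = "Unknown",
                    stringsAsFactors = FALSE)

# one pass per rule; only rows still marked "Unknown" are reassigned
for (i in seq_len(nrow(rules))) {
  hit <- "Unknown" == zipdf$Category & grepl(rules$Regex[i], zipdf$Zip)
  zipdf$Category[hit] <- rules$Category[i]
}

table(zipdf$Category)
```

Adding a newly discovered pattern then only means adding a row to the 
table, at the cost of one more pass over the data.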

As an example, the US test above could be written as

zipdf[ with( zipdf, "Unknown"==Category & grepl( "^[[:digit:]]{5}$", Zip ) 
), "Category" ] <- "US"
zipdf[ with( zipdf, "Unknown"==Category & grepl( 
"^[[:digit:]]{5}-[[:digit:]]{4}$", Zip ) ), "Category" ] <- "US"

and get the same answer as the single test above, but at roughly twice 
the processing time since the data are scanned twice.
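
If you want to check this on your own data, something like the following 
sketch compares the two forms. The names big, one_pass and two_pass are 
made up here, and the sample is just the test series repeated; actual 
timings will depend on the machine and the data, so none are shown.

```r
zipcode <- c("22942-0173", "32601", "N9Y2E6", "S7V 1J9", "0022942-0173",
             "32-601", "NN9Y2E6", "S7V  1J9")
big <- rep(zipcode, 10000)  # hypothetical larger sample

# single combined pattern
one_pass <- function(z) grepl("^[[:digit:]]{5}(-[[:digit:]]{4})?$", z)

# two simpler patterns, each scanning the whole vector
two_pass <- function(z) {
  grepl("^[[:digit:]]{5}$", z) | grepl("^[[:digit:]]{5}-[[:digit:]]{4}$", z)
}

# both forms should flag exactly the same elements
same <- identical(one_pass(big), two_pass(big))
same

# compare elapsed times
system.time(one_pass(big))
system.time(two_pass(big))
```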

On Wed, 8 Jan 2014, Frede Aakmann Tøgersen wrote:

> Hi
>
> Something like this.
>
> ## 4 valid zips + 4 invalid zips
> zipcode <- c("22942-0173", "32601", "N9YZE6", "S7V 1J9", "0022942-0173", "32-601", "NN9YZE6", "S7V  1J9")
>
> tmp <- gsub("[[:space:]]", "_", zipcode)
> tmp <- gsub("[[:alpha:]]", "A", tmp)
> tmp <- gsub("[[:digit:]]", "N", tmp)
>
> tmp
> ## [1] "NNNNN-NNNN"   "NNNNN"        "ANAAAN"       "ANA_NAN"      "NNNNNNN-NNNN"
> ## [6] "NN-NNN"       "AANAAAN"      "ANA__NAN"
>
> patterns <- c("NNNNN-NNNN", "NNNNN", "ANAAAN", "ANA_NAN")
>
> zipcode[tmp %in% patterns]
> ## [1] "22942-0173" "32601"      "N9YZE6"     "S7V 1J9"
> zipcode[!tmp %in% patterns]
> ## [1] "0022942-0173" "32-601"       "NN9YZE6"      "S7V  1J9"
>
>
> Yours sincerely / Med venlig hilsen
>
>
> Frede Aakmann Tøgersen
> Specialist, M.Sc., Ph.D.
> Plant Performance & Modeling
>
> Technology & Service Solutions
> T +45 9730 5135
> M +45 2547 6050
> frtog at vestas.com
> http://www.vestas.com
>
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>> On Behalf Of Jeff Johnson
>> Sent: 8. januar 2014 00:11
>> To: r-help at r-project.org
>> Subject: [R] Patterns on postal codes
>>
>> Hi all,
>>
>> I'm pretty new to R and have a question. I have a postal_code field which
>> can have a variety of values such as:
>> For US postal codes: 22942-0173 or 32601
>> For Canada postal codes: N9YZE6 or S7V 1J9
>>
>> What I want to do is represent these as patterns, such as:
>> US: NNNNN-NNNN or NNNNN
>> Canada: ANAAAN or ANA NAN
>> where N = any number and A = any alpha character, space = space, etc (other
>> characters such as ' should be represented as '.
>>
>> Ultimately I want to count these to see how many have a pattern of
>> NNNNN-NNNN, ANA NAN, etc so that I can visualize the outliers.
>>
>> Does anyone know if there is a built-in function in R to do this?
>> Currently, the str() function on the postal_code field shows a factor with
>> 90,993 levels which isn't particularly helpful.
>>
>> Thanks in advance!
>>
>> --
>> Jeff
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-
>> guide.html
>> and provide commented, minimal, self-contained, reproducible code.

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k