[R] Help with Kmeans output and using broom to tidy etc..

Eric Berger <ericjberger using gmail.com>
Tue May 12 21:07:10 CEST 2020


Please use dput() to post your data, so that we can recreate geo1 and trnd1_tbl exactly.
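
In case it helps, here is a minimal sketch of what that looks like. The object name geo1_small is made up for illustration; it holds just three of the rows quoted below, and is an assumption rather than your real object.

```r
# Hypothetical three-row stand-in for geo1a (values copied from the
# rows quoted in this thread; the name geo1_small is made up)
geo1_small <- data.frame(
  state     = factor(c("ME", "ME", "ME")),
  city      = factor(c("FAIRFIELD", "JONESPORT", "CASWELL")),
  latitude  = c(44.64485, 44.57935, 46.97529),
  longitude = c(-69.65948, -67.56743, -67.83023)
)

# dput() prints R code that rebuilds the object, including column
# types and factor levels; paste that printed text into the email
dput(geo1_small)

# A reader rebuilds the object by assigning the pasted text to a name.
# Here the round trip is simulated with capture.output() and parse().
rebuilt <- eval(parse(text = capture.output(dput(geo1_small))))
str(rebuilt)
```

Printed tables lose the column types (factor versus character, for example), which is why the dput() text is easier for others to work with.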



On Tue, May 12, 2020 at 7:11 PM Poling, William <PolingW using aetna.com> wrote:

> Hello Eric, thank you so much for your consideration.
>
> Here are snippets of data that I hope will be helpful
>
> WHP
>
> geo1a <- geo1[, 2:5]  # drop the ID column, which is not useful for my purposes
>
> #This is for R-Help use
> geo1a <- geo1a %>% top_n(25)
>
>    state           city latitude longitude
> 1     ME      FAIRFIELD 44.64485 -69.65948
> 2     ME      JONESPORT 44.57935 -67.56743
> 3     ME        CASWELL 46.97529 -67.83023
> 4     ME      ELLSWORTH 44.52916 -68.38717
> 5     ME     VASSALBORO 44.45095 -69.60629
> 6     ME          UNION 44.20059 -69.26123
> 7     ME        PALERMO 44.45142 -69.41115
> 8     ME          ORONO 44.87426 -68.68327
> 9     ME    SANGERVILLE 45.10138 -69.33580
> 10    ME      ISLESBORO 44.29015 -68.90812
> 11    ME        TOPSHAM 43.93600 -69.96565
> 12    ME       FREEPORT 43.84089 -70.11160
> 13    ME      SKOWHEGAN 44.76687 -69.71644
> 14    ME    MILLINOCKET 45.65501 -68.70261
> 15    ME      ORRINGTON 44.72417 -68.74026
> 16    ME     ST. GEORGE 43.96726 -69.20827
> 17    ME FORT FAIRFIELD 46.80911 -67.88079
> 18    ME      MARS HILL 46.56580 -67.89006
> 19    ME       FREEPORT 43.85302 -70.03726
> 20    ME         EASTON 46.64143 -67.91203
> 21    ME     WATERVILLE 44.53621 -69.65913
> 22    ME      BRUNSWICK 43.87771 -69.96297
> 23    ME      BRUNSWICK 43.91719 -69.89905
> 24    ME      BUCKSPORT 44.60665 -68.81892
> 25    ME        FAYETTE 44.46380 -70.12047
>
>
> trnd1_tbla <- trnd1_tbl %>% top_n(25)
> print(trnd1_tbla)
> head(trnd1_tbla,n=25)
>
> # A tibble: 25 x 5
>    city      state Basecountsum Basecount2 prop_of_total
>    <fct>     <fct>        <dbl>      <dbl>         <dbl>
>  1 ATLANTA   GA            2352         12       0.00510
>  2 BRADENTON FL            2352          8       0.00340
>  3 BROOKLYN  NY            2352         30       0.0128
>  4 CHARLOTTE NC            2352          8       0.00340
>  5 CHICAGO   IL            2352         17       0.00723
>  6 COLUMBUS  OH            2352         11       0.00468
>  7 CUMMING   GA            2352          8       0.00340
>  8 DALLAS    TX            2352          8       0.00340
>  9 ERIE      PA            2352         12       0.00510
> 10 HOUSTON   TX            2352         12       0.00510
> # ... with 15 more rows
>
> WHP
>
> From: Eric Berger <ericjberger using gmail.com>
> Sent: Tuesday, May 12, 2020 8:39 AM
> To: Poling, William <PolingW using aetna.com>
> Cc: r-help using r-project.org
> Subject: [EXTERNAL] Re: [R] Help with Kmeans output and using broom to
> tidy etc..
>
> Can you create a reproducible example?
> Your question involves objects that are unknown to us. (geo1, trnd1_tbl)
>
> On Tue, May 12, 2020 at 2:41 PM Poling, William via R-help
> <r-help using r-project.org> wrote:
> # RStudio version 1.2.1335 (note to self: need version 1.2.5019)
> sessionInfo()
> # R version 4.0.0 Patched (2020-05-03 r78349)
> #Platform: x86_64-w64-mingw32/x64 (64-bit)
> #Running under: Windows 10 x64 (build 17763)
>
> Hello:
>
> I have data that I am trying to manipulate for Kmeans clustering.
>
> Original data looks like this
>
> str(geo1)
> # 'data.frame': 2352 obs. of  5 variables:
> # $ ID              : Factor w/ 2352 levels "101040199600",..: 590 908 976 509 1674 690 1336 86 726 1702 ...
> # $ state           : Factor w/ 41 levels "AL","AR","AZ",..: 32 10 25 11 9 32 13 31 12 12 ...
> # $ city            : Factor w/ 1337 levels "ABBOTTSTOWN",..: 932 156 230 698 965 1330 515 727 1127 1304 ...
> # $ latitude        : num  40.4 31.2 40.8 42.1 26.8 ...
> # $ longitude       : num  -79.9 -81.5 -74 -91.6 -82.1 ...
>
> I created a subset adding column prop_of_total
> str(trnd1_tbl)
> tibble [1,457 x 5] (S3: tbl_df/tbl/data.frame)
>  $ city         : Factor w/ 1337 levels "ABBOTTSTOWN",..: 1 2 3 4 5 6 7 8 9 10 ...
>  $ state        : Factor w/ 41 levels "AL","AR","AZ",..: 32 36 10 28 12 36 10 11 26 38 ...
>  $ Basecountsum : num [1:1457] 2352 2352 2352 2352 2352 ...
>  $ Basecount2   : num [1:1457] 1 1 1 1 1 2 1 1 2 1 ...
>  $ prop_of_total: num [1:1457] 0.000425 0.000425 0.000425 0.000425 0.000425 ...
>
>
> Then I spread it
>
> trnd2_tbl <- trnd1_tbl %>%
>     dplyr::select(city, state, prop_of_total) %>%
>     spread(key = city, value = prop_of_total, fill = 0)  # fill = 0 replaces the NAs
>
> str(trnd2_tbl)#tibble [41 x 1,338] (S3: tbl_df/tbl/data.frame)
>
> Then I run a Kmeans
>
> kmeans_obj1 <- trnd2_tbl  %>%
>   dplyr::select(- state) %>%
>   kmeans(centers = 20, nstart = 100)
>
> str(kmeans_obj1)
> List of 9
>  $ cluster     : int [1:41] 11 11 9 11 11 4 11 11 16 2 ...
>  $ centers     : num [1:20, 1:1337] 0 0 0 0 0 0 0 0 0 0 ...
>   ..- attr(*, "dimnames")=List of 2
>   .. ..$ : chr [1:20] "1" "2" "3" "4" ...
>   .. ..$ : chr [1:1337] "ABBOTTSTOWN" "ABILENE" "ACWORTH" "ADAMS" ...
>  $ totss       : num 0.00158
>  $ withinss    : num [1:20] 0 0 0 0 0 0 0 0 0 0 ...
>  $ tot.withinss: num 0.0000848
>  $ betweenss   : num 0.0015
>  $ size        : int [1:20] 1 1 1 1 1 1 1 1 1 1 ...
>  $ iter        : int 3
>  $ ifault      : int 0
>  - attr(*, "class")= chr "kmeans"
>
> Then I go and try to tidy:
>
> # Tidy, glance, augment
> # These just make it easier to use or view the objects in the kmeans list
>
> broom::tidy(kmeans_obj1) %>% glimpse()
>
> broom::glance(kmeans_obj1)
> ##A tibble: 1 x 4
> # totss tot.withinss betweenss  iter
> # <dbl>        <dbl>     <dbl> <int>
> #   1 0.00158    0.0000848   0.00150     3
>
> However, when I run this piece I get an error:
>
> broom::augment(kmeans_obj1, trnd2_tbl) %>%
>   dplyr::select(city, .cluster)
>
> # Error: Must subset columns with a valid subscript vector.
> # The subscript has the wrong type `data.frame<
> #   u: double
> #   x: double
> # >`.
> # i It must be numeric or character.
>
> Here is the back trace:
>
> rlang::last_error()
>
> # Backtrace:
> #   1. broom::augment(kmeans_obj1, trnd2_tbl)
> # 9. dplyr::select(., city, .cluster)
> # 11. tidyselect::vars_select(tbl_vars(.data), !!!enquos(...))
> # 12. tidyselect:::eval_select_impl(...)
> # 20. tidyselect:::vars_select_eval(...)
> # 21. tidyselect:::walk_data_tree(expr, data_mask, context_mask)
> # 22. tidyselect:::eval_c(expr, data_mask, context_mask)
> # 23. tidyselect:::reduce_sels(node, data_mask, context_mask, init = init)
> # 24. tidyselect:::walk_data_tree(new, data_mask, context_mask)
> # 25. tidyselect:::as_indices_sel_impl(...)
> # 26. tidyselect:::as_indices_impl(x, vars, strict = strict)
> # 27. vctrs::vec_as_subscript(x, logical = "error")
>
> I am not sure what I am supposed to fix.
>
> Maybe someone has had a similar error and can advise me?
>
> Thank you.
>
> WHP
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
> Proprietary
>
> NOTICE TO RECIPIENT OF INFORMATION:
> This e-mail may contain confidential or privileged information. If you
> think you have received this e-mail in error, please advise the sender by
> reply e-mail and then delete this e-mail immediately.
> This e-mail may also contain protected health information (PHI) with
> information about sensitive medical conditions, including, but not limited
> to, treatment for substance use disorders, behavioral health, HIV/AIDS, or
> pregnancy. This type of information may be protected by various federal
> and/or state laws which prohibit any further disclosure without the express
> written consent of the person to whom it pertains or as otherwise permitted
> by law. Any unauthorized further disclosure may be considered a violation
> of federal and/or state law. A general authorization for the release of
> medical or other information may NOT be sufficient consent for release of
> this type of information.
> Thank you. Aetna
>
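
A guess about the error quoted above, untested since I cannot run your code: after spread() the cities have become column names, so trnd2_tbl has one row per state and no `city` column left for select() to find. The `data.frame<u: double, x: double>` in the message looks like the `city` dataset from the boot package, which select() would pick up if boot happens to be attached. Here is a minimal base-R sketch with invented numbers; the names wide, km and assignments are hypothetical stand-ins, not your objects.

```r
# Made-up wide table: one row per state, one column per city
# (a stand-in for trnd2_tbl; the values are invented)
wide <- data.frame(
  state     = c("GA", "FL", "NY", "TX"),
  ATLANTA   = c(0.005, 0, 0, 0),
  BRADENTON = c(0, 0.003, 0, 0),
  BROOKLYN  = c(0, 0, 0.013, 0),
  DALLAS    = c(0, 0, 0, 0.003),
  HOUSTON   = c(0, 0, 0, 0.005)
)

# Cluster the rows (states) on the city columns only
set.seed(1)
km <- kmeans(wide[, -1], centers = 2, nstart = 10)

# One cluster label per row, so pair the labels with state
assignments <- data.frame(state = wide$state, cluster = km$cluster)
print(assignments)

# There is no city column to select in the wide table
print("city" %in% names(wide))
```

If that is what is happening, selecting state and .cluster after broom::augment(), instead of city and .cluster, would presumably avoid the error.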
