[R] Help with Binning Data

David Winsemius dwinsemius at comcast.net
Fri Sep 11 01:57:37 CEST 2015


On Sep 10, 2015, at 3:28 PM, Shouro Dasgupta wrote:

> Dear all,
> 
> I have 3-hourly temperature data from 1970-2010 for 122 cities in the US. I
> would like to bin this data by city-year-week. My idea is if the
> temperature for a particular city in a given week falls within a given
> range (-17.78 & -12.22), (-12.22 & -6.67), ... (37.78 & 43.33), then the
> corresponding bin would have a value of 1 and 0 otherwise.
> 
> The data looks like this. Basically, I need to generate a dummy variable
> for each temperature range. Any help will be greatly appreciated.

The urge to imitate other statistical package that rely on profusion of dummies should be resisted. R repression functions can handle factor variables and the `cut` function can deliver them along with appropriate use of `seq`:

  tmp2$Tcat <- cut( tmp2$avsft, breaks=seq (-17.78,  43.33, by= 5.55 ) )

> tmp2$Tcat
 [1] (-12.2,-6.68] (-17.8,-12.2] (-12.2,-6.68] (-6.68,-1.13]
 [5] (-1.13,4.42]  (4.42,9.97]   (-6.68,-1.13] (4.42,9.97]  
 [9] (9.97,15.5]   (-1.13,4.42] 
11 Levels: (-17.8,-12.2] (-12.2,-6.68] ... (37.7,43.3]


> tmp2[ , c("City", "Tcat")]
          City          Tcat
1        AKRON (-12.2,-6.68]
2       ALBANY (-17.8,-12.2]
3  ALBUQUERQUE (-12.2,-6.68]
4    ALLENTOWN (-6.68,-1.13]
5      ATLANTA  (-1.13,4.42]
6       AUSTIN   (4.42,9.97]
7    BALTIMORE (-6.68,-1.13]
8  BATON ROUGE   (4.42,9.97]
9     BERKELEY   (9.97,15.5]
10  BIRMINGHAM  (-1.13,4.42]

Must have been a cold snap in the southeast that New Years Day.


There.... isn't that much neater than have a messy bunch of dummies? If you really need to build them then look at `?model.frame`.

-- 
David. 
> 
> tmp2<- dput(head(tmp1,10))
>> structure(list(yearday = c(1970001L, 1970001L, 1970001L, 1970001L,
>> 1970001L, 1970001L, 1970001L, 1970001L, 1970001L, 1970001L),
>>    City = structure(1:10, .Label = c("AKRON", "ALBANY", "ALBUQUERQUE",
>>    "ALLENTOWN", "ATLANTA", "AUSTIN", "BALTIMORE", "BATON ROUGE",
>>    "BERKELEY", "BIRMINGHAM", "BOISE", "BOSTON", "BRIDGEPORT",
>>    "BUFFALO", "CAMBRIDGE", "CAMDEN", "CANTON", "CHARLOTTE",
>>    "CHATTANOOGA", "CHICAGO", "CINCINNATI", "CLEVELAND", "COLORADO
>> SPRINGS",
>>    "COLUMBUS", "CORPUS CHRISTI", "DALLAS", "DAYTON", "DENVER",
>>    "DES MOINES", "DETROIT", "DULUTH", "EL PASO", "ELIZABETH",
>>    "ERIE", "EVANSVILLE", "FALL RIVER", "FLINT", "FORT WAYNE",
>>    "FRESNO", "FT WORTH", "GARY", "GLENDALE", "GRAND RAPIDS",
>>    "HARTFORD", "HONOLULU", "HOUSTON", "INDIANAPOLIS", "JACKSONVILLE",
>>    "JERSEY CITY", "KANSAS CITY", "KANSAS ITY", "KNOXVILLE",
>>    "Lansing ", "LAS VEGAS", "LEXINGTON", "LINCOLN", "LITTLE ROCK",
>>    "LONG BEACH", "LOS ANGELES", "LOUISVILLE", "LOWELL", "LYNN",
>>    "MADISON", "MEMPHIS", "MIAMI", "MILWAUKEE", "MINNEAPOLIS",
>>    "MOBILE", "MONTGOMERY", "NASHVILLE", "NEW BEDFORD", "NEW HAVEN",
>>    "NEW ORLEANS", "NEW YORK CITY", "NEWARK", "NORFOLK", "OAKLAND",
>>    "OGDEN", "OKLAHOMA CITY", "OMAHA", "PASADENA", "PATERSON",
>>    "PEORIA", "PHILADELPHIA", "PHOENIX", "PITTSBURG", "PORTLAND",
>>    "PROVIDENCE", "PUEBLO", "READING", "RICHMOND", "ROCHESTER",
>>    "ROCKFORD", "SACRAMENTO", "SALT LAKE CITY", "SAN ANTONIO",
>>    "SAN CRUZ", "SAN DIEGO", "SAN FRANCISCO", "SAN JOSE", "SAVANNAH",
>>    "SCHENECTADY", "SCRANTON", "SEATTLE", "SHREVEPORT", "SOMERVILLE",
>>    "SOUTH BEND", "SPOKANE", "SPRINGFIELD", "ST LOUIS", "ST PAUL",
>>    "ST PETERSBURG", "SYRACUSE", "TACOMA", "TAMPA", "TOLEDO",
>>    "TRENTON", "TUCSON", "TULSA", "UTICA", "WASHINGTON", "WATERBURY",
>>    "WICHITA", "WILMINGTON", "WORCESTER", "YONKERS", "YOUNGSTOWN"
>>    ), class = "factor"), cell_number = c(17379L, 17027L, 19514L,
>>    17745L, 20256L, 21323L, 18104L, 21329L, 18779L, 20254L),
>>    longitude = c(-81.519005, -73.756232, -106.609991, -75.490183,
>>    -84.387982, -97.743061, -76.612189, -91.14032, -121.635963,
>>    -86.80249), latitude = c(41.081445, 42.652579, 35.110703,
>>    40.608431, 33.748995, 30.267153, 39.290385, 30.458283, 37.871744,
>>    33.520661), State = structure(c(29L, 28L, 27L, 32L, 10L,
>>    35L, 19L, 17L, 4L, 1L), .Label = c(" ALA", " ARIZ", " ARK",
>>    " CAL", " COLO", " CONN", " DC", " DEL", " FLA", " GA", " HAWAII",
>>    " ILL", " IND", " IOWA", " KANS", " KY", " LA", " MASS",
>>    " MD", " MICH", " MINN", " MO", " NC", " NEBR", " NEV", " NJ",
>>    " NM", " NY", " OHIO", " OKLA", " ORE", " PA", " RI", " TENN",
>>    " TEX", " UTAH", " VA", " WASH", " WIS", "CAL", "CONN", "IDAH",
>>    "KY", "MASS"), class = "factor"), avsft = c(-7.81, -16.06,
>>    -7.71999999999997, -1.88999999999999, 2.90000000000003, 5.12,
>>    -5.02999999999997, 9.33000000000004, 15.08, 2.89000000000004
>>    ), year = c(1970L, 1970L, 1970L, 1970L, 1970L, 1970L, 1970L,
>>    1970L, 1970L, 1970L), day = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
>>    1L, 1L, 1L), hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>>    0L), yearweek = c(197001L, 197001L, 197001L, 197001L, 197001L,
>>    197001L, 197001L, 197001L, 197001L, 197001L), week = c(1L,
>>    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("yearday",
>> "City", "cell_number", "longitude", "latitude", "State", "avsft",
>> "year", "day", "hour", "yearweek", "week"), row.names = c(NA,
>> 10L), class = "data.frame")
> 
> 
> Sincerely,
> 
> Shouro
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA



More information about the R-help mailing list