[R] getting data from a "vertical" table into a "2-dimensional" grid

Bert Gunter bgunter@4567 @end|ng |rom gm@||@com
Sat Oct 22 03:49:16 CEST 2022


Well, if you know the column subscripts you need, just forget about the names!

I would just write a (one-liner) function to do it for any data frame:

myfun <- function(dat)tapply(dat[,3], dat[,1:2], sum)
## dat[,1:2] is a list because it's a data frame and all data frames are lists

myfun(data_original) ## then gives the result

## you could even make the indices arguments of the myfun if they
change in different
## data sets

-- Bert

On Fri, Oct 21, 2022 at 5:14 PM Kelly Thompson <kt1572757 using gmail.com> wrote:
>
> Bert,
> Thanks! I'm pretty sure what you provided gets me to what I was
> looking for, and is much simpler. I really appreciate your help.
>
> A follow-up question:
> I adjusted the code to not use "hard-coded" column names.
>
> mat2 <- with(data_original, tapply( get(names(data_original)[3]),
> list( get(names(data_original)[1]), get(names(data_original)[2])), sum
> ))
>
> Is there any better way to write that?
>
> Thanks again!
> -----
>
> For clarity and to improve upon what I previously wrote, and so I can
> practice writing questions like this and asking for help, here's a
> recap of my question and "reproducible code", and the "better way" you
> provided:
>
> I have data presented in a 3-column data frame as shown below in
> "data_original".
>
> I want to aggregate the data in column 3, with the "by" argument using
> the first and second columns of "data_original".
>
> I want the results of the aggregation in a matrix, as shown below in "mat1".
>
> As my end "result", I want a matrix with one row for each unique value
> of column1 of data_original and one column for each unique value of
> column2 of data_original.
>
> What I show below seems like one way this can be done.
>
> My question: Are there easier or better ways to do this, especially in
> Base R, and also in R packages?
>
>
> #create data
> set.seed(1)
> data_original <- data.frame(year = rep(1990:1999, length  = 50),
> category = sample(1:5, size = 50, replace = TRUE),  sales =
> sample(0:99999, size = 50 , replace = TRUE) )
> dim(data_original)
>
> #remove rows where data_original[,1] == 1990 & data_original[,2] == 5,
> to ensure there is at least one NA in the desired matrix (this is an
> "edge" case I want the code to "deal with" correctly.)
> data_original <- data_original[ (data_original[,1] == 1990 &
> data_original[,2] == 5) == FALSE, ]
> dim(data_original)
>
> #aggregate data
> data_aggregate_col3_by_col1_and_col2 <- aggregate(x =
> data_original[3], by = list(data_original[,1], data_original[,2]), FUN
> = sum)
> colnames(data_aggregate_col3_by_col1_and_col2) <- colnames(data_original)
> dim(data_aggregate_col3_by_col1_and_col2)
>
> data_expanded <-
> expand.grid(unique(data_aggregate_col3_by_col1_and_col2[,1]),
> unique(data_aggregate_col3_by_col1_and_col2[,2]))
> colnames(data_expanded) <- colnames(data_aggregate_col3_by_col1_and_col2)[1:2]
> dim(data_expanded)
>
> data_expanded <- merge(data_expanded,
> data_aggregate_col3_by_col1_and_col2, all = TRUE)
> dim(data_expanded)
>
> mat1 <- matrix(data = data_expanded[,3], nrow =
> length(unique(data_expanded[,1])), ncol =
> length(unique(data_expanded[,2])) , byrow = TRUE, dimnames = list(
> unique(data_expanded[,1]), unique(data_expanded[,2]) ) )
>
> #this is an easier way, using with and tapply
> mat2 <- with(data_original, tapply( get(names(data_original)[3]),
> list( get(names(data_original)[1]), get(names(data_original)[2])), sum
> ))
> #check that mat1 and mat 2 are "nearly equal"
> all.equal(mat1, mat2)
>
>
>
> Gunter <bgunter.4567 using gmail.com> wrote:
> >
> > "As my end result, I want a matrix or data frame, with one row for each
> > year, and one column for each category."
> >
> > If I understand you correctly, no reshaping gymnastics are needed --
> > just use ?tapply:
> >
> > set.seed(1)
> > do <- data.frame(year = rep(1990:1999, length  = 50),
> > category = sample(1:5, size = 50, replace = TRUE),
> > sales = sample(0:99999, size = 50 , replace = TRUE) )
> >
> >
> > with(do, tapply(sales, list(year, category),sum))
> >  ## which gives the matrix:
> >
> >          1      2      3     4     5
> > 1990  13283     NA  55083 87522 64877
> > 1991     NA  80963     NA 30100 28277
> > 1992   9391 202916     NA 55090    NA
> > 1993  29696 167344     NA    NA 17625
> > 1994  98015  99521     NA 70536 52252
> > 1995 157003     NA  26875    NA 11366
> > 1996  32986  88683   6562 79475 95282
> > 1997  13601     NA 134757 12398    NA
> > 1998  30537  51117  31333 20204    NA
> > 1999  39240  87845  62479    NA 98804
> >
> > If this is not what you wanted, you may need to explain further or
> > await a response from someone more insightful than I.
> >
> > Cheers,
> > Bert
> >
> >
> > On Fri, Oct 21, 2022 at 3:34 PM Kelly Thompson <kt1572757 using gmail.com> wrote:
> > >
> > > As my end result, I want a matrix or data frame, with one row for each
> > > year, and one column for each category.
> > >
> > > On Fri, Oct 21, 2022 at 6:23 PM Kelly Thompson <kt1572757 using gmail.com> wrote:
> > > >
> > > > # I think this might be a better example.
> > > >
> > > > # I have data presented in a "vertical" dataframe as shown below in
> > > > data_original.
> > > > # I want this data in a matrix or "grid", as shown below.
> > > > # What I show below seems like one way this can be done.
> > > >
> > > > # My question: Are there easier or better ways to do this, especially
> > > > in Base R, and also in R packages?
> > > >
> > > > #create data
> > > > set.seed(1)
> > > > data_original <- data.frame(year = rep(1990:1999, length  = 50),
> > > > category = sample(1:5, size = 50, replace = TRUE),  sales =
> > > > sample(0:99999, size = 50 , replace = TRUE) )
> > > > dim(data_original)
> > > >
> > > > #remove rows where data_original$year == 1990 & data_original$category
> > > > == 5, to ensure there is at least one NA in the "grid"
> > > > data_original <- data_original[ (data_original$year == 1990 &
> > > > data_original$category == 5) == FALSE, ]
> > > > dim(data_original)
> > > >
> > > > #aggregate data
> > > > data_aggregate_sum_by_year_and_category <- aggregate(x =
> > > > data_original$sales, by = list(year = data_original$year, category =
> > > > data_original$category), FUN = sum)
> > > > colnames(data_aggregate_sum_by_year_and_category) <- c('year',
> > > > 'category', 'sum_of_sales')
> > > > dim(data_aggregate_sum_by_year_and_category)
> > > >
> > > > data_expanded <- expand.grid(year =
> > > > unique(data_aggregate_sum_by_year_and_category$year), category =
> > > > unique(data_aggregate_sum_by_year_and_category$category))
> > > > dim(data_expanded)
> > > > data_expanded <- merge(data_expanded,
> > > > data_aggregate_sum_by_year_and_category, all = TRUE)
> > > > dim(data_expanded)
> > > >
> > > > mat <- matrix(data = data_expanded$sum_of_sales, nrow =
> > > > length(unique(data_expanded$year)), ncol =
> > > > length(unique(data_expanded$category)) , byrow = TRUE, dimnames =
> > > > list( unique(data_expanded$year), unique(data_expanded$category) ) )
> > > >
> > > >
> > > > data_original
> > > > data_expanded
> > > > mat
> > > >
> > > > On Fri, Oct 21, 2022 at 5:03 PM Kelly Thompson <kt1572757 using gmail.com> wrote:
> > > > >
> > > > > ###
> > > > > #I have data presented in a "vertical" data frame as shown below in
> > > > > data_original.
> > > > > #I want this data in a matrix or "grid", as shown below.
> > > > > #What I show below seems like one way this can be done.
> > > > >
> > > > > #My question: Are there easier or better ways to do this, especially
> > > > > in Base R, and also in R packages?
> > > > >
> > > > > #reproducible example
> > > > >
> > > > > data_original <- data.frame(year = c('1990', '1999', '1990', '1989'),
> > > > > size = c('s', 'l', 'xl', 'xs'),  n = c(99, 33, 3, 4) )
> > > > >
> > > > > data_expanded <- expand.grid(unique(data_original$year),
> > > > > unique(data_original$size), stringsAsFactors = FALSE )
> > > > > colnames(data_expanded) <- c('year', 'size')
> > > > > data_expanded <- merge(data_expanded, data_original, all = TRUE)
> > > > >
> > > > > mat <- matrix(data = data_expanded $n, nrow =
> > > > > length(unique(data_expanded $year)), ncol =
> > > > > length(unique(data_expanded $size)) , byrow = TRUE, dimnames = list(
> > > > > unique(data_expanded$year), unique(data_expanded$size) ) )
> > > > >
> > > > > data_original
> > > > > data_expanded
> > > > > mat
> > >
> > > ______________________________________________
> > > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list