[R] adding additional information to histogram

Jim Lemon jim at bitwrit.com.au
Fri Jan 27 09:51:12 CET 2012


On 01/27/2012 03:12 AM, Raphael Bauduin wrote:
> Hi,
>
> I am a beginner with R, and I think the answer to my question will
> seem obvious, but after searching and trying without success I've
> decided to post to the list.
>
> I am working with data loaded from a CSV file with these fields:
>    order_id, item_value
> As an order can have multiple items, an order_id may be present
> multiple times in the CSV.
>
> I managed to compute the total value and the number of items for each order:
>
>    oli <- read.csv("/tmp/order_line_items_data.csv", header=TRUE)
>    orders_values <- tapply(oli[[2]], oli[[1]], sum)
>    items_per_order <- tapply(oli[[2]], oli[[1]], length)
>
> I then can display the histogram of the order values:
>
>    hist(orders_values, breaks=c(10*0:20,800), xlim=c(0,200), prob=TRUE)
>
> Now on this histogram, I would like to display the average number of
> items of the orders in each group (defined with the breaks).
> So for the bar of orders with value 0 to 10, I'd like to display the
> average number of items of these orders.
>
Hi Raph,
As this looks a tiny bit like homework, I'll only provide suggestions. 
You have the value and number of items for each order. What you need to 
do is to match them in groups. In order to do that, you want a factor 
that will show the group for each value-items pair. The "cut" function 
will give you such a factor, using the breaks above. You seem to 
understand the *apply functions, so you can use one of these to return 
the mean number of items for each value group. Alternatively, you could 
use the factor in the "by" function to get the mean number of items.
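
A rough sketch of that grouping step, not a definitive solution. The two
vectors are made-up stand-ins for the tapply() results in the original
post, and the names brks, value_group and mean_items are hypothetical.
include.lowest=TRUE is assumed so that the intervals match the defaults
of hist().

```r
# Made-up data (hypothetical): one total value and one item count per order.
orders_values   <- c(5, 8, 12, 15, 18, 25, 150, 300)
items_per_order <- c(1, 2, 1, 3, 2, 4, 6, 10)

# Same breaks as in the hist() call: 0, 10, ..., 200, then 800.
brks <- c(10 * 0:20, 800)

# Factor giving the value group of each order.
value_group <- cut(orders_values, breaks = brks, include.lowest = TRUE)

# Mean number of items in each value group (NA where a group is empty).
mean_items <- tapply(items_per_order, value_group, mean)

# The "by" alternative returns the same means in a different container.
mean_items_by <- by(items_per_order, value_group, mean)

print(mean_items)
```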

You should now have a factor that can be sent to "table" to get the 
number of orders in each value range, and a vector of the corresponding 
mean numbers of items in each value grouping. Why, you could even use the 
same trick to calculate the mean price of the orders in each value 
grouping...
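
For illustration only, the counting and the "same trick" for the mean
order value might look like the following. The data are invented, and
value_group, orders_per_group and mean_value are hypothetical names.

```r
# Made-up order totals (hypothetical), grouped with the breaks from the post.
orders_values <- c(5, 8, 12, 15, 18, 25, 150, 300)
value_group   <- cut(orders_values, breaks = c(10 * 0:20, 800),
                     include.lowest = TRUE)

# Number of orders in each value range.
orders_per_group <- table(value_group)

# Mean order value in each value range (NA where a range is empty).
mean_value <- tapply(orders_values, value_group, mean)

print(orders_per_group)
print(mean_value)
```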

I would use "barplot" to display all this information, as it is a bit 
easier to place the mean number of items on the bars ("barplot" returns 
the x positions of the bar midpoints, which can be passed to "text").
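
A minimal sketch of that labelling, under stated assumptions: the data
are invented, the names counts, mids and labs are hypothetical, and the
bars show plain order counts with equal widths rather than the densities
that hist(..., prob=TRUE) would draw.

```r
# Made-up order totals and item counts (hypothetical, for illustration only).
orders_values   <- c(5, 8, 12, 15, 18, 25, 150, 300)
items_per_order <- c(1, 2, 1, 3, 2, 4, 6, 10)

value_group <- cut(orders_values, breaks = c(10 * 0:20, 800),
                   include.lowest = TRUE)
counts     <- as.vector(table(value_group))
mean_items <- as.vector(tapply(items_per_order, value_group, mean))

# barplot() returns the x coordinates of the bar midpoints.
mids <- barplot(counts, names.arg = levels(value_group), las = 2,
                ylim = c(0, max(counts) * 1.2),
                ylab = "Number of orders")

# Write the mean number of items just above each non-empty bar.
labs <- ifelse(is.na(mean_items), "", sprintf("%.1f", mean_items))
text(x = mids, y = counts, labels = labs, pos = 3)
```

Empty value ranges get an empty label, so only bars that contain orders
are annotated.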

Jim


