5 ways to do some calculation by groups

In a previous post I showed  4+1 ways to do some calculation on a quantitative variable that is associated to a qualitative variable. However, there’s another very useful (and powerful) way to do that. Again, I’ll use the same dummy example to calculate the average prices (quantitative) by brands (qualitative) of different products:

# here's the data
item = toupper(letters[1:15])
set.seed(321)
brand = sample(c("Alpha", "Beta"), length(item), replace=TRUE)
price = round(10 * runif(length(item)), 2)
units = sample(1:3, length(item), replace=TRUE)
some_data = data.frame(item, brand, price, units)

You should get a table like this one

     item   brand   price   units
1       A    Beta    2.02       1
2       B    Beta    6.33       3
3       C   Alpha    4.04       3
4       D   Alpha    2.91       3
5       E   Alpha    6.40       2
6       F   Alpha    6.36       2
7       G   Alpha    9.90       3
8       H   Alpha    9.31       1
9       I   Alpha    4.86       3
10      J    Beta    5.75       2
11      K    Beta    7.51       3
12      L   Alpha    9.93       1
13      M    Beta    4.31       2
14      N   Alpha    1.24       3
15      O    Beta    5.94       3

Five ways to obtain the average price by brand:

  • data.table
  • boolean selection (shown in a previous post)
  • tapply (shown in a previous post)
  • ddply (shown in a previous post)
  • sql query (shown in a previous post)

Option 1: using the function data.table
This option is the reason why I’m updating this post, and it has to do with the package data.table which I think deserves to be seriously studied for anyone doing data manipulations in R (please check the vignette of the package, and install the package).

# load data.table
library(data.table)
# convert the data to a data.table
some_table = data.table(some_data)
# indicate mean of price by brand
some_table[, mean(price), by=brand]
     brand V1
[1,] Beta  5.310000
[2,] Alpha 6.105556

Option 2: using boolean selection

# using boolean selection
with(some_data, mean(price[brand=="Alpha"]))
with(some_data, mean(price[brand=="Beta"]))

Option 3: using tapply

# using tapply
with(some_data, tapply(price, brand, mean))

Option 4: using ddply

# using ddply
require(plyr)
ddply(some_data, .(brand), summarise, mean_price=mean(price))

Option 5: using an SQL query

# Using an SQL query
require(sqldf)
sqldf("SELECT brand, AVG(price) AS mean_price FROM some_data GROUP BY brand")
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s