How to Summarize a Data Frame by Groups in R

Sometimes, when you’re analyzing a data set and you want to get a complete picture of it, you want calculate the metrics on all the observations for each variable. Let’s say, for example, that you run a small zoo and want to inventory the cost of all your animals. To calculate this in a spreadsheet, you simply, sum the cost of all the rows in the “cost” column.

But, what do you do if you want to know the cost broken down by each category of animal? In a spreadsheet, you would subtotal the “cost” column by the column referencing the animal’s class. The result would be something like below–with the original data set on the left and the subtotaled data set on the right.

subtotal

If you’re working with databases, it’s even easier to achieve the result with a simple SQL query. If the table you’re working in is called “animals,” the query would look something like this:

SELECT Class, sum(Cost) FROM animals GROUP_BY Class;

And the result is pretty much the same. So, the question is, if you can do this in spreadsheets and databases, can you do it in R? You bet you can.

In the dplyr package, you can create subtotals by combining the group_by() function and the summarise() function. Let’s start with an example. Below is the first part of the mtcars data frame that is provided in the base R package.

mtcars1

Now, suppose we interested in purchasing a car. We’re interested in 3 things regarding the car we’re seeking to purchase: the fuel economy, the power, and the speed. However, we’re unsure how the number of cylinders relates to these variables. How can we view the averages by cylinder? Let’s look at some code.

### group mtcars by cylinders and return some averages
cars <- mtcars %>%
select(cyl, mpg, hp, qsec) %>%
group_by(cyl) %>%
summarise(mpg = mean(mpg), hp = mean(hp), qsec = mean(qsec))

Here, we’ve used piping with dplyr functions to crew a data set showing us the average mpg, hp, and qsec (seconds it takes to go 1/4 a mile) for each amount of cylinders. When we print the new “cars” data frame, here’s what we get:

mtcars2

This gives us a summary that can help us more easily make a decision. If we want to see this plotted on graphs, we can run this code…

### draw plots of cylinders vs other variables
par(mfrow = c(1,3), lwd = 4)
plot(cars$cyl,cars$mpg, main = "Cylinders Vs Miles Per Gallon", xlab = "Miles Per Gallon", ylab = "Number of Cylinders", type = "h", col = "Red")
plot(cars$cyl,cars$hp, main = "Cylinders vs Horsepower", xlab = "Horsepower", ylab = "Number of Cylinders", type = "h", col = "Blue")
plot(cars$cyl,cars$qsec, main = "Cylinders vs Quarter Mile Time", xlab = "Quarter Mile Time in Seconds", ylab = "Number of Cylinders", type = "h", col = "Green")

And here’s what we end up with…

mtcars3

It looks like increasing the number of cylinders decreases the fuel economy, but it increases the power and speed.

Also worth noting, the mpg and qsec change at roughly the same rate going from 6 to 8 cylinders than they do going from 4 to 6 cylinders. However, the horsepower jumps up more dramatically when going from 6 to 8 cylinders than going from 4 to 6 cylinders. So, if power is relatively more important to you, you may want to go ahead and go with 8 cylinders.

Create a Correlation Matrix in R

So, in my last post, I showed how to create two histograms from a certain data set and then how to plot the two variables to see if there is any relationship. Visually, it was easy to tell that there was a negative relationship between the weight of an automobile and the fuel economy of an automobile. But, is there a more objective way to understand the relationship? Is there a number we can assign to it?

Yes, it turns out there is. This number is called Pearson’s Correlation Coefficient or, in the vernacular, simply the “correlation.” Essentially, this number measures the percentage of fluctuation in one variable that can be explained by another variable. A correlation of 1 means the variables move in perfect unison, a correlation of -1 means the variables move in the complete opposite direction, and a correlation of 0 means there is no relationship at all between the two variables.

So, how to we retrieve the correlation between two variables in R? Let’s write some code…

motorcars <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv", stringsAsFactors = FALSE)

cor(motorcars$wt, motorcars$mpg)

plot(motorcars$wt, motorcars$mpg)

First, we import the same data set we used last time. When we view the data set (using colnames() or head()), we see that the column names for the variables we are trying to measure are “wt” and “mpg.” Now, all we need to do is subset these two variables with the dollar sign and place them within the cor() function.

mccor

When we run this code, we can see that the correlation is -0.87, which means that the weight and the mpg move in exactly opposite directions roughly 87% of the time. So, that’s it. You’ve run a correlation in R. If you plot the two variables using the plot() function, you can see that this relationship is fairly clear visually.

mtplot

But, wait? Could there be other things that are related to the fuel economy of the vehicle, besides weight? What else is in the data set? Let’s have a look. When we run the head() function on motorcars, we get the first 6 rows of every column in the data set.

headmc

What if we want to see how all of these variables are related to one another? Well, we could run a correlation on every single combination we can think of, but that would be tedious. Is there a way we can view all the correlations with a single line of code? Yes, there is.

mc_data <- motorcars[,2:length(motorcars)]
round(cor(mc_data),2)

First, we create a separate data frame that only includes the data from motorcars (subsets everything to the right of the vehicle model name). Then, we simply run a correlation on the new data frame, which we’ve called “mc_data.” To clean things up a bit, I’ve nested the cor() function within the round() function to round the result to two decimal places. When we enter this code, here’s what we get:

corrmat

We can see that there are several other variables that are related to mpg, such as cyl, disp, and hp. Now, we can plot the variables that are most correlated with miles per gallon using this code (refer to previous post for explanation).

par(mfrow = c(2,2))
plot(motorcars$wt, motorcars$mpg)
plot(motorcars$cyl, motorcars$mpg)
plot(motorcars$disp, motorcars$mpg)
plot(motorcars$hp, motorcars$mpg)

And, here’s what we get as a result…

mtplots

Create Histograms and Scatter Plots in R for Exploratory Data Analysis

No matter how sophisticated you get with your statistical analysis, you’ll usually start off exploring your data the same way. If you’re looking at a single variable, you’ll want to create a histogram to look at the distribution. If you’re trying to compare two variables to see if there is a relationship between them, you’ll want to create a scatter plot. R makes both of these really easy. Let’s look at some code…


motorcars <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv", stringsAsFactors = FALSE)

par(mfrow = c(1,3))

hist(motorcars$wt, main = "Weight", xlab = "Weight (in 1000s)")
hist(motorcars$mpg, main = "MPG", xlab = "Miles Per Gallon")

plot(motorcars$wt, motorcars$mpg, main = "Weight Vs MPG", xlab = "Weight (in 1000s)", ylab = "Miles Per Gallon")

First, we pull some data from the web and read it into a data frame called “motorcars.” This data frame compares 32 cars across 11 different variables. You can get an overview of the data frame by calling str(), summary(), or head() on the data frame (i.e. str(motorcars)).

In this example, we are wanting to look specifically at the weight of each car and the miles per gallon clocked for each car. We want to see how the distributions of these variables are spread out, and we also want to see if there is any relationship between them. Does a heavier car actually have an effect on the car’s fuel economy, or is that just an urban legend?

After we pull the data, we use the par() function to describe how we want the plots to be displayed. The mfrow argument creates a matrix of rows and columns that serves as the layout for the plots. In this case, we’ll want three plots laid side-by-side (1 row and 3 columns).

Next, we create the histograms, subsetting columns we are looking for by wedging the $ between the data frame’s name and the column’s name. Technically, this is all we have to do to create the histogram. The other arguments I’ve included are optional arguments to give histograms names and x-axis labels.

Finally, we’ll create the scatter plot with the plot() function. The plot() function takes two arguments: the first is what we want plotted on the x-axis, and the second is what we want plotted on the y-axis. In this case, we’ve chosen to use the weight on the x-axis and the miles per gallon on the y-axis.

After you run this code, here are the plots that are generated:

Effect of Vehicle Weight on Miles Per Gallon

So, what do you think?

What is the weight range of most motor vehicles? What MPG do most motor vehicles have? Is there any relationship between the two of them?