# How to Summarize a Data Frame by Groups in R

Sometimes, when you’re analyzing a data set and you want to get a complete picture of it, you want calculate the metrics on all the observations for each variable. Let’s say, for example, that you run a small zoo and want to inventory the cost of all your animals. To calculate this in a spreadsheet, you simply, sum the cost of all the rows in the “cost” column.

But, what do you do if you want to know the cost broken down by each category of animal? In a spreadsheet, you would subtotal the “cost” column by the column referencing the animal’s class. The result would be something like below–with the original data set on the left and the subtotaled data set on the right.

If you’re working with databases, it’s even easier to achieve the result with a simple SQL query. If the table you’re working in is called “animals,” the query would look something like this:

SELECT Class, sum(Cost) FROM animals GROUP_BY Class;

And the result is pretty much the same. So, the question is, if you can do this in spreadsheets and databases, can you do it in R? You bet you can.

In the dplyr package, you can create subtotals by combining the group_by() function and the summarise() function. Let’s start with an example. Below is the first part of the mtcars data frame that is provided in the base R package.

Now, suppose we interested in purchasing a car. We’re interested in 3 things regarding the car we’re seeking to purchase: the fuel economy, the power, and the speed. However, we’re unsure how the number of cylinders relates to these variables. How can we view the averages by cylinder? Let’s look at some code.

```### group mtcars by cylinders and return some averages
cars <- mtcars %>%
select(cyl, mpg, hp, qsec) %>%
group_by(cyl) %>%
summarise(mpg = mean(mpg), hp = mean(hp), qsec = mean(qsec))
```

Here, we’ve used piping with dplyr functions to crew a data set showing us the average mpg, hp, and qsec (seconds it takes to go 1/4 a mile) for each amount of cylinders. When we print the new “cars” data frame, here’s what we get:

This gives us a summary that can help us more easily make a decision. If we want to see this plotted on graphs, we can run this code…

```### draw plots of cylinders vs other variables
par(mfrow = c(1,3), lwd = 4)
plot(cars\$cyl,cars\$mpg, main = "Cylinders Vs Miles Per Gallon", xlab = "Miles Per Gallon", ylab = "Number of Cylinders", type = "h", col = "Red")
plot(cars\$cyl,cars\$hp, main = "Cylinders vs Horsepower", xlab = "Horsepower", ylab = "Number of Cylinders", type = "h", col = "Blue")
plot(cars\$cyl,cars\$qsec, main = "Cylinders vs Quarter Mile Time", xlab = "Quarter Mile Time in Seconds", ylab = "Number of Cylinders", type = "h", col = "Green")
```

And here’s what we end up with…

It looks like increasing the number of cylinders decreases the fuel economy, but it increases the power and speed.

Also worth noting, the mpg and qsec change at roughly the same rate going from 6 to 8 cylinders than they do going from 4 to 6 cylinders. However, the horsepower jumps up more dramatically when going from 6 to 8 cylinders than going from 4 to 6 cylinders. So, if power is relatively more important to you, you may want to go ahead and go with 8 cylinders.

# 5 Ways to Subset a Data Frame in R

Often, when you’re working with a large data set, you will only be interested in a small portion of it for your particular analysis. So, how do you sort through all the extraneous variables and observations and extract only those you need? Well, R has several ways of doing this in a process it calls “subsetting.”

The most basic way of subsetting a data frame in R is by using square brackets such that in:

example[x,y]

example is the data frame we want to subset, ‘x’ consists of the rows we want returned, and ‘y’ consists of the columns we want returned. Let’s pull some data from the web and see how this is done on a real data set.

```### import education expenditure data set and assign column names
education <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/robustbase/education.csv", stringsAsFactors = FALSE)
colnames(education) <- c("X","State","Region","Urban.Population","Per.Capita.Income","Minor.Population","Education.Expenditures")
View(education)
```

Here’s what the first part of our data set looks like after I’ve imported the data and appropriately named its columns.

Now, let’s suppose we oversee the Midwestern division of schools and that we are charged with calculating how much money was spent per child for each state in our region. We would need three variables: State, Minor.Population, and Education.Expenditures. However, we would only need the observations from the rows that correspond to Region 2. Here’s the basic way to retrieve that data in R:

```ed_exp1 <- education[c(10:21),c(2,6:7)]
```

To create the new data frame ‘ed_exp1,’ we subsetted the ‘education’ data frame by extracting rows 10-21, and columns 2, 6, and 7. Pretty simple, right?

Another way to subset the data frame with brackets is by omitting row and column references. Take a look at this code:

```ed_exp2 <- education[-c(1:9,22:50),-c(1,3:5)]
```

Here, instead of subsetting the rows and columns we wanted returned, we subsetted the rows and columns we did not want returned and then omitted them with the “-” sign. If we now call ed_exp1 and ed_exp2, we can see that both data frames return the same subset of the original education data frame.

Now, these basic ways of subsetting a data frame in R can become tedious with large data sets. You have to know the exact column and row references you want to extract. It’s pretty easy with 7 columns and 50 rows, but what if you have 70 columns and 5,000 rows? How do you find which columns and rows you need in that case? Here’s another way to subset a data frame in R…

```ed_exp3 <- education[which(education\$Region == 2),names(education) %in% c("State","Minor.Population","Education.Expenditures")]
```

Now, we have a few things going on here. First, we are using the same basic bracketing technique to subset the education data frame as we did with the first two examples. This time, however, we are extracting the rows we need by using the which() function. This function returns the indices where the Region column of the education data from is 2. That gives us the rows we need. We retrieve the columns of the subset by using the %in% operator on the names of the education data frame.

Now, you may look at this line of code and think that it’s too complicated. There’s got to be an easier way to do that. Well, you would be right. There is another basic function in R that allows us to subset a data frame without knowing the row and column references. The name? You guessed it: subset().

```ed_exp4 <- subset(education, Region == 2, select = c("State","Minor.Population","Education.Expenditures"))
```

The subset() function takes 3 arguments: the data frame you want subsetted, the rows corresponding to the condition by which you want it subsetted, and the columns you want returned. In our case, we take a subset of education where “Region” is equal to 2 and then we select the “State,” “Minor.Population,” and “Education.Expenditure” columns.

When we subset the education data frame with either of the two aforementioned methods, we get the same result as we did with the first two methods:

Now, there’s just one more method to share with you. This last method, once you’ve learned it well, will probably be the most useful for you in manipulating data. Let’s take a look at the code and then we’ll go over it…

```install.packages("dplyr")
library(dplyr)
ed_exp5 <- select(filter(education, Region == 2),c(State,Minor.Population:Education.Expenditures))
```

This last method is not part of the basic R environment. To use it, you’ve got to install and download the dplyr package. If you’re going to be working with data in R, though, this is a package you will definitely want. It is among the most downloaded packages in the R environment and, as you start using it, you’ll quickly see why.

So, once we’ve downloaded dplyr, we create a new data frame by using two different functions from this package:

• filter: the first argument is the data frame; the second argument is the condition by which we want it subsetted. The result is the entire data frame with only the rows we wanted.
• select: the first argument is the data frame; the second argument is the names of the columns we want selected from it. We don’t have to use the names() function, and we don’t even have to use quotation marks. We simply list the column names as objects.

In this example, we’ve wrapped the filter function in the selection function to return our data frame. In other words, we’ve first taken the rows where the Region is 2 as a subset. Then, we took the columns we wanted from only those rows. The result gives us a data frame consisting of the data we need for our 12 states of interest:

So, to recap, here are 5 ways we can subset a data frame in R:

1. Subset using brackets by extracting the rows and columns we want
2. Subset using brackets by omitting the rows and columns we don’t want
3. Subset using brackets in combination with the which() function and the %in% operator
4. Subset using the subset() function
5. Subset using the filter() and select() functions from the dplyr package

That’s it! Happy subsetting!

# Nesting Functions in R with the Piping Operator

One of the most useful (and most popular) applications in R are the functions available in the dplyr package. With functions like select, filter, arrange, and mutate, you can restructure a data set to get it looking just the way you want it. The problem is that doing so can take multiple steps. As a result, you either end up creating a bunch of extraneous objects to keep your activities organized, or you end up nesting your activities in one long convoluted line of nested functions. Is there a better way to create cleaner code with dplyr? Let’s have a look…

```### import education expenditure data set and assign column names
education <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/robustbase/education.csv", stringsAsFactors = FALSE)
colnames(education) <- c("X","State","Region","Urban.Population","Per.Capita.Income","Minor.Population","Education.Expenditures")
View(education)
```

First, we’ve taken a data set on education expenditures by state and given the columns appropriate names. For a more detailed explanation on ways to subset this data set, visit this post. Here’s a snapshot of what the first half of the data set looks like:

Now, let’s supposed we are tasked with answering a very specific question:

Which states in the Midwestern region of the United States have the highest and lowest education expenditures per minority resident?

Let’s use the dplyr functions to filter this information from the data set–one step at a time…

```### Filter for Region 2
ed_exp1 <- filter(education, Region == 2)

### Select the State, Minor Population, and Education Expenditures columns
ed_exp2 <- select(ed_exp1, c(State, Minor.Population, Education.Expenditures))

### Add a column for the Expenditures Per Child
ed_exp3 <- mutate(ed_exp2, Expenditures.Per.Child = Education.Expenditures / Minor.Population)

### Arrange the data set to sort by Expenditures.Per.Child
ed_exp4 <- arrange(ed_exp3, desc(Expenditures.Per.Child))
```

Building our data frame this way, we create four separate objects to reach our goal. With each activity, we assign a new object and then feed that object as the new data frame into the next activity. We first filter the original data set, creating ed_exp1. Then, we apply the select function on ed_exp1, creating ed_exp2, and so on until we end up with our final result in ed_exp4. And, sure enough, this method works:

We can now answer our question: Ohio spends the least amount per child and Minnesota spends the most.

That being said, ed_exp4 is not the only data frame we’ve created.In getting our result, we have created several intermediary objects. We have no use for ed_exp1, ed_exp2, or ed_exp3. The final result–what we’ve called ed_exp4–is the only revised data frame we care about. And yet, these other three data sets are taking up space in our working memory:

None of these subsets give us the complete information to answer our question. All we need is the final result–ed_exp4. So, is there a way to get to ed_exp4 without creating the first three objects. Yes, there is–but it’s a little tricky…

```### Create final result using a single nested function
ed_exp5 <- arrange(mutate(select(filter(education, Region == 2),c(State,Minor.Population, Education.Expenditures)), Expenditures.Per.Child = Education.Expenditures / Minor.Population),desc(Expenditures.Per.Child))
```

So, what is happening in this long, convoluted line of code? We are nesting each object as the data frame in the function that creates the next object. The innermost function, filter, creates the result that serves as the data frame for the select function, and then it builds all the way out to our last activity–arrange. As we see below, ed_exp5 gives us the same result as ed_exp4–and we only have to create one object.

The downside to using this method is rather obvious–it’s too complicated! Sure, we save space by not creating extraneous variables, but the trade off is that we have a long line of code that’s difficult to understand. The more activities we do to create our resulting data frame, the farther apart our arguments will get from the functions we are trying to apply to them. Sooner or later, mistakes will become inevitable.

But there is a fix even for this! Included as part of the dplyr package is the documentation for the “piping” operator. It essentially does the same thing as nesting functions does, but it’s a lot cleaner. Let’s have a look at the code…

```### Create final result using the piping operator
ed_exp6 <- education %>%
filter(Region == 2) %>%
select(c(State, Minor.Population, Education.Expenditures)) %>%
mutate(Expenditures.Per.Child = Education.Expenditures / Minor.Population) %>%
arrange(desc(Expenditures.Per.Child))
```

The piping operator, delineated by the “%>%” symbol, funnels each object preceding the operator as the first argument in subsequent functions. In other words…

education %>% filter(Region = 2)

is the same thing as…

filter(education, Region == 2)

You simply continue linking the chain, or “extending the pipe,” all the way down to your last action. In our case, the final action is to arrange the data set, so that’s where our pipe ends.

So, the moment of truth–does the piping operator give us the result we’re looking for?

Indeed, it does! But we’re only creating a single object and the code is much, much cleaner.

Pretty cool, huh?