Examine a Data Frame in R with 7 Basic Functions

When I first started learning R, it seemed way more complicated than what I was used to when looking at spreadsheets in Microsoft Excel. When I started working with data frames in R, it wasn't nearly as easy to know what I was looking at.

I’ve since come to see the light. While there is a bit of a learning curve to get a handle on it, viewing data in R is infinitely more flexible than doing so in Excel. In this post, I’ll cover the most basic R functions for examining a data set and explain why they’re important.

Understanding how to get a simple overview of the data set has become a huge time saver for me. If you aren’t familiar with these functions, you need to be. If you’re anything like me, you’ll use them first for every single data set you consider.

All of the functions I'm discussing here come with base R (they live in the base and utils packages), so there's no need to install any additional packages. Here are the functions; you can pull up the documentation for any of them from the console with the help operator (e.g. ?head):

  1. dim(): shows the dimensions of the data frame by row and column
  2. str(): shows the structure of the data frame
  3. summary(): provides summary statistics on the columns of the data frame
  4. colnames(): shows the name of each column in the data frame
  5. head(): shows the first 6 rows of the data frame
  6. tail(): shows the last 6 rows of the data frame
  7. View(): shows a spreadsheet-like display of the entire data frame

Now, let’s import a data set and see how each of these functions works. First, here’s the code:

### Import a data set on violent crime by state and assign it to the data frame "crime"
crime <- read.csv("http://vincentarelbundock.github.io/Rdatasets/csv/datasets/USArrests.csv", stringsAsFactors = FALSE)

### Call the functions on crime to examine the data frame
dim(crime)
str(crime)
summary(crime)
colnames(crime)

### The head() and tail() functions default to 6 rows, but we can adjust the number of rows using the "n = " argument
head(crime, n = 10)
tail(crime, n = 5)

### While the first 6 functions are printed to the console, the View() function opens a table in another window
View(crime)

Now, let’s take a look at the output, so we can see what happens when the code is run.

First, we’ll look at the dim(), str(), summary(), and colnames() functions:

[Screenshot: console output of dim(crime), str(crime), summary(crime), and colnames(crime)]

  • dim(): In the crime data set, we can see immediately that there are only 50 rows and 5 columns. This function is useful, because it tells us whether it would be okay to print the entire data frame to the console. With this data set, it’s probably okay. If, however, there were 5,000 rows and 50 columns, we’d definitely want to view the data frame in smaller chunks.
  • str(): The structure of the crime data set also tells us the number of rows (observations) and columns (variables), but it provides even more information. It tells us the column names, the class of each column (what kind of data is stored in it), and the first few observations of each variable.
  • summary(): The summary provides descriptive statistics, including the min, max, mean, median, and quartiles of each column. For example, we can see in the crime data set that the average murder arrest rate across all states is about 7.8 per 100,000 residents.
  • colnames(): This function prints a vector of the column names, which can be useful if you’re trying to reference a particular column. For the crime data set, we can see that the column of state names has no descriptive header (read.csv fills the blank header in as “X”). Knowing this, we may want to assign it a proper name before going forward in our analysis, as sketched below.
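
Here’s a minimal sketch of that renaming step (the name “State” is my own choice, not something that comes with the data), along with a quick check of the average murder rate reported by summary():

### Give the unnamed first column a descriptive header
colnames(crime)[1] <- "State"
colnames(crime)

### Verify the average murder rate shown by summary()
mean(crime$Murder) # should come out around 7.8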

Now, let’s take a look at the head() and tail() functions:

[Screenshot: console output of head(crime, n = 10) and tail(crime, n = 5)]

  • head(): This function defaults to printing the first 6 rows, but we’ve decided to call the first 10. In the crime data set, this gives us the data on states Alabama through Georgia.
  • tail(): The same as head(), except this function prints the end of the data frame. In this case, we’ve called the last 5 observations, so we can see the data on Virginia through Wyoming.

Finally, let’s take a look at the window that appears when we call the View() function:

[Screenshot: the spreadsheet-style viewer opened by View(crime)]

  • View(): This window provides vertical and horizontal (if there are enough columns to justify one) scroll bars for you to browse the entire data set. It looks much like an Excel spreadsheet, except that you can’t manipulate any of the data. (Note: make sure you use a capital “V” when calling this function; it’s case sensitive.)

That’s it! Getting comfortable with these functions should make it easier for you to work with data frames in a more logical and efficient manner. 

Happy viewing!

Nesting Functions in R with the Piping Operator

Some of the most useful (and most popular) tools in R are the functions available in the dplyr package. With functions like select, filter, arrange, and mutate, you can restructure a data set to get it looking just the way you want it. The problem is that doing so can take multiple steps. As a result, you either end up creating a bunch of extraneous objects to keep your activities organized, or you end up cramming everything into one long, convoluted line of nested functions. Is there a better way to create cleaner code with dplyr? Let’s have a look…

### import education expenditure data set and assign column names
education <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/robustbase/education.csv", stringsAsFactors = FALSE)
colnames(education) <- c("X","State","Region","Urban.Population","Per.Capita.Income","Minor.Population","Education.Expenditures")
View(education)

First, we’ve taken a data set on education expenditures by state and given the columns appropriate names. For a more detailed explanation on ways to subset this data set, visit this post. Here’s a snapshot of what the first half of the data set looks like:

[Screenshot: the first half of the education data set in the View() window]

Now, let’s suppose we are tasked with answering a very specific question:

Which states in the Midwestern region of the United States have the highest and lowest education expenditures per minor (under-18) resident?

Let’s use the dplyr functions to pull this information from the data set, one step at a time…

### Load the dplyr package (install it first with install.packages("dplyr") if needed)
library(dplyr)

### Filter for Region 2
ed_exp1 <- filter(education, Region == 2)

### Select the State, Minor Population, and Education Expenditures columns
ed_exp2 <- select(ed_exp1, c(State, Minor.Population, Education.Expenditures))

### Add a column for the Expenditures Per Child
ed_exp3 <- mutate(ed_exp2, Expenditures.Per.Child = Education.Expenditures / Minor.Population)

### Arrange the data set to sort by Expenditures.Per.Child
ed_exp4 <- arrange(ed_exp3, desc(Expenditures.Per.Child))

Building our data frame this way, we create four separate objects to reach our goal. With each activity, we assign a new object and then feed that object as the new data frame into the next activity. We first filter the original data set, creating ed_exp1. Then, we apply the select function on ed_exp1, creating ed_exp2, and so on until we end up with our final result in ed_exp4. And, sure enough, this method works:

[Screenshot: the final ed_exp4 data frame, sorted by Expenditures.Per.Child]

We can now answer our question: Ohio spends the least amount per child and Minnesota spends the most.

That being said, ed_exp4 is not the only data frame we’ve created. In getting our result, we created several intermediary objects. We have no use for ed_exp1, ed_exp2, or ed_exp3. The final result, ed_exp4, is the only revised data frame we care about. And yet, these other three data sets are taking up space in our workspace:

[Screenshot: the workspace listing showing ed_exp1 through ed_exp4]
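
If you do take the step-by-step route, you can at least clear out the leftovers when you’re done; here’s a quick cleanup using base R’s rm() function:

### Remove the intermediate objects we no longer need
rm(ed_exp1, ed_exp2, ed_exp3)

### Confirm what's left in the workspace
ls()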

None of these subsets gives us the complete information needed to answer our question. All we need is the final result, ed_exp4. So, is there a way to get to ed_exp4 without creating the first three objects? Yes, there is, but it’s a little tricky…

### Create final result using a single nested function
ed_exp5 <- arrange(mutate(select(filter(education, Region == 2), c(State, Minor.Population, Education.Expenditures)), Expenditures.Per.Child = Education.Expenditures / Minor.Population), desc(Expenditures.Per.Child))

So, what is happening in this long, convoluted line of code? We are nesting each function call inside the one that uses its result as the data frame. The innermost function, filter, creates the result that serves as the data frame for the select function, and so on, building all the way out to our last activity, arrange. As we see below, ed_exp5 gives us the same result as ed_exp4, and we only have to create one object.

[Screenshot: ed_exp5, matching the ed_exp4 result]

The downside to using this method is rather obvious: it’s too complicated! Sure, we save space by not creating extraneous variables, but the trade-off is a long line of code that’s difficult to understand. The more steps we chain together, the farther apart the arguments get from the functions we are trying to apply them to. Sooner or later, mistakes become inevitable.

But there is a fix even for this! Included with the dplyr package is the “piping” operator, %>% (it comes originally from the magrittr package). It does essentially the same thing as nesting functions, but it’s a lot cleaner. Let’s have a look at the code…

### Create final result using the piping operator
ed_exp6 <- education %>%
  filter(Region == 2) %>%
  select(c(State, Minor.Population, Education.Expenditures)) %>%
  mutate(Expenditures.Per.Child = Education.Expenditures / Minor.Population) %>%
  arrange(desc(Expenditures.Per.Child))

The piping operator, written as “%>%”, takes the result of the expression on its left and passes it as the first argument to the function on its right. In other words…

education %>% filter(Region == 2)

is the same thing as…

filter(education, Region == 2)
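
If you want to convince yourself of that equivalence, a quick sanity check (assuming dplyr is loaded and the education data frame is in your workspace) is:

### The piped and nested calls should produce identical results
identical(education %>% filter(Region == 2), filter(education, Region == 2)) # should return TRUE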

You simply continue linking the chain, or “extending the pipe,” all the way down to your last action. In our case, the final action is to arrange the data set, so that’s where our pipe ends.

So, the moment of truth: does the piping operator give us the result we’re looking for?

[Screenshot: ed_exp6, matching the earlier results]

Indeed, it does! And we’ve created only a single object, with code that is much, much cleaner.

Pretty cool, huh?

Use R to Combine Multiple Columns of Data into a Single Column Spread Out Across Rows

I work a lot with Pivot Tables in Microsoft Excel. A problem I often encounter is trying to analyze a spreadsheet in which data from a single variable has been spread out over many columns. In particular, this happens rather frequently with longitudinal data.

If you are trying to look at data spread out across multiple years, it can be difficult to use a Pivot Table when each year has a designated column. What you want instead is to have a single column with the header “Year” and all of the data placed in the adjacent column as a “Values” column. Let me show you what I mean…

Let’s take this data set showing the population of all countries between 1980 and 2010, taken from data.gov, and try to look at it within a pivot table.

Here’s what the raw data looks like (I added the “country” header for simplicity):

[Screenshot: the raw population data, with one column per year]

In Excel, you can easily make a simple graph using the data as it’s currently displayed. However, if you want to drop it into a pivot table for dynamic filtering, analysis, and visualization, it can get tricky. To include all the data in your Pivot Table, you have to drop each year individually into the “values” field of the pivot table. Then, if you want to exclude any of the years, you’ll have to manually remove them. This can be extremely time-consuming, and it prevents you from performing many of the calculations available in the pivot table (e.g. “year over year percentage change”).

[Screenshot: the Pivot Table field list with each year added individually to the Values area]

So, how can you get all of those columns into a single “Year” column so that you can analyze the data more efficiently in a pivot table? In R, there is a simple way to do this. Using the tidyr package, we can “gather” all of those columns into a single column under one variable. Let’s take a look at the code…

### install and load tidyr package
install.packages("tidyr")
library(tidyr)

### import population by country by year from data.gov
pop_wide <- read.csv("http://en.openei.org/doe-opendata/dataset/a7fea769-691d-4536-8ed3-471e993a2445/resource/86c50aa8-e40f-4859-b52e-29bb10166456/download/populationbycountry19802010millions.csv", stringsAsFactors = FALSE, check.names=FALSE)

### assign column name to the first column
colnames(pop_wide)[1] <- "country"

### combine the year columns into a single column with separate rows for each year; assign to new vector
pop_long <- gather(pop_wide, year, population, -country)

### download the new csv to your working directory
write.csv(pop_long, "population_data.csv", row.names = FALSE) # row.names = FALSE keeps R's row numbers out of the file

After you install and load the tidyr package, simply import the file and give the column of countries a name. Then, use the gather() function to pull all of those columns into a single column with adjacent values.

The first argument is the data frame you are working with (‘pop_wide’ in our case), the second argument is the name you want for the new variable column (‘year’ in our case), and the third argument is the name you want for the new values column (‘population’ in our case). The remaining arguments name the columns you want to combine. Alternatively, you can specify the columns you don’t want to combine by prefixing them with a ‘-‘ sign, which is what we did with the ‘country’ column.
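
For instance, here’s a sketch of an equivalent call that names the year columns directly rather than excluding ‘country’ (this assumes the year columns are labeled “1980” through “2010”, which is why we set check.names = FALSE on import):

### Equivalent: list the year columns explicitly instead of excluding "country"
pop_long2 <- gather(pop_wide, year, population, `1980`:`2010`)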

After we send the file back to our computer with write.csv, we can open the file up in Excel to see how it looks.

[Screenshot: the reshaped data in Excel, with single year and population columns]

Now, the spreadsheet is going to list every country’s population for 1980, then 1981, then 1982, and so on. In general, this is how you want your raw data to look. If, however, you want the high-level summary you started with, you can split the data back into multiple columns with tidyr’s spread() function.
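
Here’s a quick sketch of that reverse step; spread() takes the data frame, the key column, and the value column:

### (Optional) spread the long data back out to one column per year
pop_wide_again <- spread(pop_long, year, population)

But, for now, let’s see how this looks in a pivot table…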

[Screenshot: a Pivot Table built from the reshaped data, with year and population as single fields]

If you look at the pivot table “Values” field, you’ll see that it’s a lot less messy. Now, you can manipulate the data much more easily. Want to create a pivot chart showing the increase in population within North America from 1980 to 2010? You can now do this in just a few seconds…

[Chart: North America population growth over time, 1980–2010]