Create Histograms and Scatter Plots in R for Exploratory Data Analysis

No matter how sophisticated you get with your statistical analysis, you’ll usually start off exploring your data the same way. If you’re looking at a single variable, you’ll want to create a histogram to look at the distribution. If you’re trying to compare two variables to see if there is a relationship between them, you’ll want to create a scatter plot. R makes both of these really easy. Let’s look at some code…


motorcars <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv", stringsAsFactors = FALSE)

par(mfrow = c(1,3))

hist(motorcars$wt, main = "Weight", xlab = "Weight (in 1000s)")
hist(motorcars$mpg, main = "MPG", xlab = "Miles Per Gallon")

plot(motorcars$wt, motorcars$mpg, main = "Weight Vs MPG", xlab = "Weight (in 1000s)", ylab = "Miles Per Gallon")

First, we pull some data from the web and read it into a data frame called “motorcars.” This data frame compares 32 cars across 11 different variables. You can get an overview of the data frame by calling str(), summary(), or head() on the data frame (i.e. str(motorcars)).

In this example, we are wanting to look specifically at the weight of each car and the miles per gallon clocked for each car. We want to see how the distributions of these variables are spread out, and we also want to see if there is any relationship between them. Does a heavier car actually have an effect on the car’s fuel economy, or is that just an urban legend?

After we pull the data, we use the par() function to describe how we want the plots to be displayed. The mfrow argument creates a matrix of rows and columns that serves as the layout for the plots. In this case, we’ll want three plots laid side-by-side (1 row and 3 columns).

Next, we create the histograms, subsetting columns we are looking for by wedging the $ between the data frame’s name and the column’s name. Technically, this is all we have to do to create the histogram. The other arguments I’ve included are optional arguments to give histograms names and x-axis labels.

Finally, we’ll create the scatter plot with the plot() function. The plot() function takes two arguments: the first is what we want plotted on the x-axis, and the second is what we want plotted on the y-axis. In this case, we’ve chosen to use the weight on the x-axis and the miles per gallon on the y-axis.

After you run this code, here are the plots that are generated:

Effect of Vehicle Weight on Miles Per Gallon

So, what do you think?

What is the weight range of most motor vehicles? What MPG do most motor vehicles have? Is there any relationship between the two of them?

Create a Function in R to Calculate the Subtotal After Discounts and Taxes

One of the coolest things you can do in R is write custom functions to solve your own unique problems. I’m not sure I’m brave enough to try my hand at more complex functions with loops and conditionals and such but, for now, I thought I’d share something simple.

Suppose you have a list of transactions and you are trying to get the total amount from all of them. Of course, you can just call the sum() function on the individual transactions. But what if you get a discount on all the transactions? Then, what if there are taxes? Obviously, you can just do this with simple math. But, humor me. Here’s a function for it…

### create subtotal function
sub_tot <- function(costs,discount = 0,tax = 0) {
raw_sub <- sum(costs) - (sum(costs) * discount) + (sum(costs) * tax)
round(raw_sub, digits = 2)
}

The sub_tot function that I’ve created takes three arguments. The first is a vector of data that you want to total. The second is the percentage discount you get on all the transactions–expressed as a decimal. The third argument is the tax rate that is applied after the discounts are taken into consideration.

If you look at the function, all I do is subtract the discounted amount from the total, add the amount of taxes, and then round to two decimal places. Pretty simple, right? Now, let’s try it out on something…

### install and load readr package
install.packages("readr")
library(readr)

### import data set of sales transactions
sales <- read_csv("http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv")

### print sum of all purchases before any discounts or taxes
sum(sales$Price)

### print sum of all purchases after discounts and taxes
sub_tot(sales$Price,0.15,0.08)

### taking taxes into account, how much did we save with the discount?
sum(sales$Price) - sub_tot(sales$Price,0.15,0.08)

Let’s pull a random set of real estate transactions from the web and store it into a data frame called “sales.” If we view the data set, we’ll see that there is a column called “Price.”

Now, let’s suppose you are a real estate developer who is buying up all these properties. You get a 15% discount on all that you buy, and there is an 8% tax rate. How much do you spend in all?

When we use the sub_tot function, here’s what we get…

sub_tot

So, before the discounts and taxes, you spend ~$1.6 million. After the discounts and taxes, you spent ~$1.5 million–which mean that you save a total of ~$100k.

But, suppose you didn’t get the discount. How much would you spend without the discount but taking into consideration the taxes? And, how much after-tax money do you save by getting the 15% discount vs not getting it? Let’s write some more code…

### what if you don't get a discount? how much do you spend?
sub_tot(sales$Price,,0.08)

### how much do you save with the discount vs not having the discount?
sub_tot(sales$Price,,0.08) - sub_tot(sales$Price,0.15,0.08)

Since “0” is set as the default argument for discount and tax, we simply skip the discount argument when there is no discount. So, when we run the above code, we find that the subtotal without a discount is ~$1.76 million and the after-tax amount you save by getting the discount is ~$250k.

sub_tot2

Now, want to send some of that money my way???

 

Use R to Combine Multiple Columns of Data into a Single Column Spread Out Across Rows

I work a lot with Pivot Tables in Microsoft Excel. A problem I often encounter is trying to analyze a spreadsheet in which data from a single variable has been spread out over many columns. In particular, this happens rather frequently with longitudinal data.

If you are trying to look at data spread out across multiple years, it can be difficult to use a Pivot Table when each year has a designated column. What you want instead is to have a single column with the header “Year” and all of the data placed in the adjacent column as a “Values” column. Let me show you what I mean…

Let’s take this data set showing the population of all countries between 1980 and 2010, taken from data.gov, and try to look at it within a pivot table.

Here’s what the raw data looks like (I added the “country” header” for simplicity):

pop_data1

In Excel, you can easily make a simple graph using the data as it’s currently displayed. However, if you want to drop it into a pivot table for dynamic filtering, analysis, and visualization, it can get tricky. To include all the data in your Pivot Table, you have to drop each year individually into the “values” field of the pivot table. Then, if you want to exclude any of the years, you’ll have to manually remove them. This can be extremely time consuming, and it prevents you from performing many of the calculations available in the pivot table (i.e. “year over year percentage change”).

pop_data2

So, how can you get all of those columns into a single “Year” column so that you can analyze the data more efficiently in a pivot table? In R, there is a simple way to do this. Using the tidyr package, we can “gather” all of those columns into a single column under one variable. Let’s take a look at the code…

### install and load tidyr package
install.packages("tidyr")
library(tidyr)

### import population by country by year from data.gov
pop_wide <- read.csv("http://en.openei.org/doe-opendata/dataset/a7fea769-691d-4536-8ed3-471e993a2445/resource/86c50aa8-e40f-4859-b52e-29bb10166456/download/populationbycountry19802010millions.csv", stringsAsFactors = FALSE, check.names=FALSE)

### assign column name to the first column
colnames(pop_wide)[1] <- "country"

### combine the year columns into a single column with separate rows for each year; assign to new vector
pop_long <- gather(pop_wide,year,population,-country)

### download the new csv to your working directory
write.csv(pop_long,"population_data.csv")

After you import and load the tidyr package, simply import the file and give the column of countries a name. Then, use the gather() function to pull all of those columns into a single column with adjacent values.

The first argument is the data frame you are working in (‘pop_wide’ in our case), the second argument is what you want the variable column to be (‘year’ in our case), and the third argument is what you want the values column to be (‘population’ in our case).  The last argument consists of the columns you want to be combined. You can also select instead the columns you don’t want to combine using the ‘-‘ sign, which is what we did with the ‘country’ column.

After we send the file back to our computer with write.csv, we can open the file up in Excel to see how it looks.

pop_data3

Now, the spreadsheet is going to list every country’s population for 1980, then 1981, then 1982, and so on. In general, this is how you want your raw data to look. If, however, you want the high level summary you started with, you can split the data back into multiple columns with tidyr’s spread() function. But, for now, let’s see how this looks in a pivot table…

pop_data4

If you look at the pivot table “Values” field, you’ll see that it’s a lot less messy. Now, you can manipulate the data much more easily. Want to create a pivot chart showing the increase in population within North America from 1980 to 2010? You can now do this in just a few seconds…

North America Population Growth Over Time

Create a Cumulative Sum Column in R

One of the first things I learned in R was how to use basic statistics functions like sum(). However, what if you want a cumulative sum to measure how something is building over time–rather than just a total sum to measure the end result? It turns out there is a function for cumulative sum, called cumsum(). Here’s out to use it…

### create (or import) a data frame of totals
	product_sales <- data.frame("Day" = c(1:10),"Sales" = c(100,200,150,120,130,145,90,150,180,190))

### add cumulative column to the data frame
	product_sales[,"cum_sales"] <- cumsum(product_sales$Sales)

### print data frame
	product_sales

It’s that simple. Here’s how the output looks from the dummy data frame I created:
Output of cumsum in R

Create a 52 Week Vector of Dates in R

If you want to keep a weekly log of anything from your body weight to your company’s product sales, it might be easier to have a column of all the dates you’ll need over the next year as a template. That way, when you reach the time each week that you enter your data (say, Monday morning), you can enter the data beside a date that is already there in the adjacent column.

Here’s how to create a vector of weekly dates spanning out over an entire year.

### initialize vector as today's date; if you want to start with a different date, use as.Date("YYYY-MM-DD") in place of Sys.Date() in the code
week_ending = Sys.Date()

### write a for loop on a sequence of next 52 weeks
for(i in seq(from = 1, by = 7, length.out = 52)) {
week_ending[i] = (week_ending - 1) + i
}

### remove NAs from the vector that is created
week_ending = week_ending[!is.na(week_ending)]

### print week_ending to generate a vector of the next 52 weeks, beginning with today's date
week_ending

That’s it. You now have a vector with the length of 52, containing dates for every week over the next year beginning with today. See the code and R output below.

R Output for Vector of Weeks

Feel free to grab this code if it’s useful, and please fix it if there’s a better way!