Create, Interpret, and Use a Linear Regression Model in R

In my last post, we looked at how to create a correlation matrix in R. Specifically, we used data pulled from the web to see which variables were most highly correlated with an automobile’s fuel economy. Suppose, however, that we are trying to guess the fuel economy of a new car without actually having driven it. Is there a way for us to use the things we do know about the car to predict how many miles per gallon it will get?

When you’re trying to figure out how multiple variables interact to produce an effect in another variable, you’ll want to perform a regression analysis. There are many programs in which you can do this–SAS, SPSS, Stata, and even to a limited extent in Microsoft Excel. Heck, if you’ve got enough time to kill and brain power to exhaust, you can even do it with a pencil and paper.

Of course, given the nature of this blog, we’re going to perform a simple regression analysis using R. And it’s surprisingly fairly simple to generate the output. Let’s have a look at some code…

motorcars <- read.csv("", stringsAsFactors = FALSE)


The first thing we’ll do, as in many other situations, is read in some data and take a look at it. When we import the “motorcars” dataset and call the head() function on it, here’s what we get…


Now is the part where we use our subject matter expertise and basic human experience to evaluate which variables we think may influence the mpg of a particular vehicle. (Here’s the document explaining what the variable names mean). For argument’s sake, let’s say we determine the displacement (disp), the weight (wt), and the horsepower (hp) to be the only variables we think could really have an effect on the fuel economy of the car. So, those are the ones we decide to put into our model. Let’s look at the code on how to build the model…

mc_model <- lm(motorcars$mpg ~ motorcars$disp + motorcars$wt + motorcars$hp)


So, we actually create the model with the lm() function. Inside the function, we use the basic framework of “the independent variable (the one we’re trying to predict) is equal to variable1 + variable2 + variable3, and so on…” However, instead of the “=” sign, we use the “~” sign. Then, we assign the model to a named object–in this case “mc_model.”

Once, we’ve created the model and assigned a name to it, we can call the summary() function on it to get an overview of the results. Here’s what the output looks like for the model we’ve created…


Now, for simplicity’s sake, I’m just going to interpret a few components of the model. The first thing you’ll want to look at is the “p-value” in the bottom right corner of this summary. This number tells you whether or not the model is “statistically significant,” given your criteria. Essentially, it’s the probability that the outcome of the model is due to random chance. Generally, practitioners use 0.05 as a threshold–such that anything less than that is deemed acceptable. So, we can see that our model as a whole is statistically significant–or usable for making predictions.

Next, let’s look at the far right column of the table labeled “Coefficients,” with the header “Pr(>|t|).” This column contains the “p values” of each individual variable we are considering. Even if the model as a whole is statistically significant, there still may be some variables within it that are not. In this case, we can see that the “displacement” variable is not statistically significant by virtually any measure. So, we can decide to throw that out of our model. So, going forward, let’s look only at the other two variables: “weight” (wt) and “horsepower” (hp).

The last thing we’ll need to look at for our purposes is the “Estimate” column of the “Coefficients” table. Ignoring the “motorcars$disp” variable, we’ll look at the other three. The “Intercept” is what the model begins with before weight and horsepower are taken into consideration. Or, given the information in the model, it’s the miles per gallon the car will get if it has no weight and no horsepower.

For the “motorcars$wt” and “motorcars$hp” variables, you’ll multiply the estimate by each unit of weight and horsepower, respectively. Then, you’ll add the results together with the “Intercept” to conclude how many miles per gallon your car will get. Here’s the formula for our model:

miles per gallon = 37.105505 + (weight[in 1000s] * -3.800891) + (horsepower * -0.031157)

So, let’s say we have a car that weights 2000 pounds and has 100 horsepower. How many miles per gallon can we expect it to get? If we compute this in R, here’s what it will look like…


So, given our model, we can expect a car that weights 2000 pounds and has 100 horsepower to get about 26.4 miles per gallon.

Now, let’s suppose we have a huge list of cars on which we want to run this model. It would be tedious to have to type in those numbers over and over again. Is there anything we can do about it?

Of course, there is. We can create a function that has our dependent variables (weight and horsepower) as inputs. Then, all we have to do is put in the weight and horsepower of each car and the miles per gallon will return as a result. Here’s what the code for that looks like:

mc_fun <- function(weight,horsepower) {
37.105505 + (weight * -3.800891) + (horsepower * -0.031157)



Now, whatever combination of weight and horsepower we input into the mc_fun function, we’ll get an output of miles per gallon. To check the order in which you’ll input the variables, use the args() function.

In the code above, I’ve included four examples–increasing in weight and horsepower. I start with our original example of a 2000 pound car with 100 horsepower and go all the way up to a 5000 pound car with 350 horsepower. Based on our model, we can expect the fuel economy to decrease as the weight and horsepower increases. Does this actually happen? Let’s look at the output to find out…


Consistent with our original formula example, a car with a weight of 2000 pounds and 100 horsepower gets 26.4 miles per gallon. If, however, we’ve got our hearts set on buying that massive yet agile tank of an automobile weighing 5000 pounds and getting 350 horsepower, we’ll have to settle for a measly 7.2 miles per gallon.

If we’re set on buying the boat, we might want to live very close to a gas station…

Create a Function in R to Calculate the Subtotal After Discounts and Taxes

One of the coolest things you can do in R is write custom functions to solve your own unique problems. I’m not sure I’m brave enough to try my hand at more complex functions with loops and conditionals and such but, for now, I thought I’d share something simple.

Suppose you have a list of transactions and you are trying to get the total amount from all of them. Of course, you can just call the sum() function on the individual transactions. But what if you get a discount on all the transactions? Then, what if there are taxes? Obviously, you can just do this with simple math. But, humor me. Here’s a function for it…

### create subtotal function
sub_tot <- function(costs,discount = 0,tax = 0) {
raw_sub <- sum(costs) - (sum(costs) * discount) + (sum(costs) * tax)
round(raw_sub, digits = 2)

The sub_tot function that I’ve created takes three arguments. The first is a vector of data that you want to total. The second is the percentage discount you get on all the transactions–expressed as a decimal. The third argument is the tax rate that is applied after the discounts are taken into consideration.

If you look at the function, all I do is subtract the discounted amount from the total, add the amount of taxes, and then round to two decimal places. Pretty simple, right? Now, let’s try it out on something…

### install and load readr package

### import data set of sales transactions
sales <- read_csv("")

### print sum of all purchases before any discounts or taxes

### print sum of all purchases after discounts and taxes

### taking taxes into account, how much did we save with the discount?
sum(sales$Price) - sub_tot(sales$Price,0.15,0.08)

Let’s pull a random set of real estate transactions from the web and store it into a data frame called “sales.” If we view the data set, we’ll see that there is a column called “Price.”

Now, let’s suppose you are a real estate developer who is buying up all these properties. You get a 15% discount on all that you buy, and there is an 8% tax rate. How much do you spend in all?

When we use the sub_tot function, here’s what we get…


So, before the discounts and taxes, you spend ~$1.6 million. After the discounts and taxes, you spent ~$1.5 million–which mean that you save a total of ~$100k.

But, suppose you didn’t get the discount. How much would you spend without the discount but taking into consideration the taxes? And, how much after-tax money do you save by getting the 15% discount vs not getting it? Let’s write some more code…

### what if you don't get a discount? how much do you spend?

### how much do you save with the discount vs not having the discount?
sub_tot(sales$Price,,0.08) - sub_tot(sales$Price,0.15,0.08)

Since “0” is set as the default argument for discount and tax, we simply skip the discount argument when there is no discount. So, when we run the above code, we find that the subtotal without a discount is ~$1.76 million and the after-tax amount you save by getting the discount is ~$250k.


Now, want to send some of that money my way???