Motor Trend Regression Models

By Domtria Simba M | September 20, 2016

Introduction

Motor Trend, is a magazine about the automobile industry. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). They are particularly interested in the following two questions:

Is an automatic or manual transmission better for MPG.
Quantify the MPG difference between automatic and manual transmissions.

Loading required libraries

# Required libraries
library(ggplot2)
library(psych)
library(MASS)

Load Data and Preprocessing

data(mtcars)
mtcars$am <- factor(mtcars$am,labels=c('Automatic','Manual'))

Perform some basic exploratory data analyses

Let us see how response variable MPG correlate with all the explanatory variables.

pairs.panels(mtcars, main="Pair Graph of Motor Trend Car Road Tests")

The pairs plot show us correlation between different variables. This will be useful once we find out model.

plotbox(mtcars)

Inference

Next, my hypothesis question: Is there a difference between the average MPG for mtcars using manual transmission and those using automatic transmission.

Independent Two-Sample T-Test

\(H_0: \mu_{Manual} - \mu_{Automatic}\) = 0 (Means outcome is same between group)
\(H_A: \mu_{Manual} - \mu_{Automatic}\) != 0

MPG by Transmission type

(mpg <- t.test(mpg ~ am,mtcars))
paste("For Manual and Automatic p-value is",
      format(mpg$p.value, scientific=FALSE), ":reject NULL hypothesis", sep=" ")

## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231 
## 
## [1] "For Manual and Automatic p-value is 0.001373638 :reject NULL hypothesis"

< 5%: reject \(H_0\). > 5%: fail to reject \(H_0\).

MPG for Manual vs Automatic: Since p-value is low(lower than 5%), we reject \(H_0\). These data do indeed provide convincing evidence that there is difference between the average MPG for mtcars using manual transmission and those using automatic transmission. We can conclude that manual transmission has better MPG more than automatic transmission.

Regression Analysis

In linear regression we estimate two parameter: \(\beta_{0}\) and \(\beta_{1}\). So we lose 1 degree of freedom for each parameter as penalty. So k is the number of degrees of freedom. AIC: k = 2. BIC: k = \(\log_{n}\).

Adjusted \(R^2\) vs P-value:

p-value - when interested in finding out which predictor are statistically significant
adjusted \(R^2\) - when interested in a more reliable predictors.

I am going to use adjusted \(R^2\) to pick my final model.

Backward elimination method begins with the full model, drop one variable at a time and record adjusted \(R^2\) of each smaller model. Pick the model with the highest increase in adjusted \(R^2\).

Forward selection method starts with single predictor of response vs. each explanatory variable. Add the remaining variables one at a time to existing model and pick the model with highest adjusted \(R^2\).

I decided to use stepAIC function from the MASS package to perform this analysis because it is automated. However, I will include non-automated linear fit using AIC and BIC to see if we get same best/final models.

#All Results hidden
AutoFullModel <- stepAIC(lm(mpg ~ . ,data=mtcars), 
                         direction = 'both', k = log(32), trace = FALSE)
ManualFullModel <- lm(mpg ~ ., data=mtcars)
forwardModelAIC <- step(lm(mpg ~ 1, data = mtcars), 
      direction = "forward", scope = formula(ManualFullModel), k = 2, trace = 0)
forwardModelBIC <- step(lm(mpg ~ 1, data = mtcars), 
      direction = "forward", scope = formula(ManualFullModel), k = log(32), trace = 0)
backwardModelAIC <- step(ManualFullModel, 
      direction = "backward", k = 2, trace = 0)
backwardModelBIC <- step(ManualFullModel, direction = "backward", k = log(32), trace = 0)

Below are the Automated MASS package results:

AutoFullModel$anova

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
## Final Model:
## mpg ~ wt + qsec + am
## 
## 
##     Step Df   Deviance Resid. Df Resid. Dev      AIC
## 1                             21   147.4944 87.02084
## 2  - cyl  1 0.07987121        22   147.5743 83.57243
## 3   - vs  1 0.26852280        23   147.8428 80.16486
## 4 - carb  1 0.68546077        24   148.5283 76.84715
## 5 - gear  1 1.56497053        25   150.0933 73.71682
## 6 - drat  1 3.34455117        26   153.4378 70.95632
## 7 - disp  1 6.62865369        27   160.0665 68.84398
## 8   - hp  1 9.21946935        28   169.2859 67.17025

Below are adjusted \(R^2\) results non-automated linear models fitted using AIC and BIC:

formulafAIC <- toString(c(summary(forwardModelAIC)$call$formula))
msg<-paste("Forward AIC highest Adjusted $R^2$ value of",
  format(summary(forwardModelAIC)$adj.r.squared),"and that model is",formulafAIC,sep=" ")
formulafBIC <- toString(c(summary(forwardModelBIC)$call$formula))
msg1<-paste("Forward BIC highest Adjusted $R^2$ value of",
  format(summary(forwardModelBIC)$adj.r.squared),"and that model is",formulafBIC,sep=" ")
formulabAIC <- toString(c(summary(backwardModelAIC)$call$formula))
msg2<-paste("Backward AIC highest Adjusted $R^2$ value of",
  format(summary(backwardModelAIC)$adj.r.squared),"and that model is",formulabAIC,sep=" ")
formulabBIC <- toString(c(summary(backwardModelBIC)$call$formula))
msg3<-paste("Backward BIC highest Adjusted $R^2$ value of",
  format(summary(backwardModelBIC)$adj.r.squared),"and that model is",formulabBIC,sep=" ")
prnt.test(c(msg,msg1,msg2,msg3))

Forward AIC highest Adjusted \(R^2\) value of 0.8263446 and that model is mpg ~ wt + cyl + hp

Forward BIC highest Adjusted \(R^2\) value of 0.8185189 and that model is mpg ~ wt + cyl

Backward AIC highest Adjusted \(R^2\) value of 0.8335561 and that model is mpg ~ wt + qsec + am

Backward BIC highest Adjusted \(R^2\) value of 0.8335561 and that model is mpg ~ wt + qsec + am

As you can see the non-automated linear fit best model based on the adjusted \(R^2\), is: . It is the same as the automated linear fit using MASS package.

Now we go back to the pairs plot. We observe that wt and am have a negative correlation of -69. As a result we will have to include that interaction in our custom final model: .

Residual Analysis and Diagnostics

Diagnostic for Multiple Linear regression has the following items:

How is response variable linearly related to the explanatory variable.
Nearly normal residuals with mean of zero.
Constant variability of residual.
Independence of residuals.

customFinalModel<-lm(mpg ~ wt + qsec + am + wt:am, data=mtcars)
par(mfrow = c(2, 2))
plot(customFinalModel)

Residual vs Leverage: we observe no outliers. All cases are inside dashed line, Cook’s distance.
Normal Q-Q Plot: we observe that residual plot is nearly normal centered around 0.
Scale-Location: we observed that plots of vs. randomly scatter in a band with a constant width around 0 (no fan shape).
Residual vs Fitted: we observe a residual plot that has no pattern(complete random scatter).

R Code

#Function for plotting histogram and Printing function
plotbox <- function(box.data){
  ggplot(box.data,aes(x=factor(am),y=mpg, fill=am))+geom_boxplot()+labs(x="X (binned)")+
  theme(axis.text.x=element_text(angle=0, vjust=0.4,hjust=0.5,face="bold",
                                 color ="black", size =rel(2)),
        axis.title.x = element_text(size = rel(1.5),vjust= -.5),
        axis.text.y = element_text(face="bold", color="black", size =rel(2)),
        axis.title.y = element_text(size = rel(1.5), angle = 90,vjust=-.5),
        strip.text.x = element_text(size=rel(2),face="bold"),
        legend.position = "none", panel.grid.major = element_blank(),
        panel.background = element_rect(fill = "transparent"),
        axis.line = element_line(colour = "black", size=1, linetype = "solid"),
        plot.title = element_text(vjust=-8,lineheight=.8, face="bold",
                                  color="#CD0000", size=16))+
  scale_x_discrete("Transmission") + scale_y_continuous("Miles Per Gallon") + 
  ggtitle("Blox Plot of MPG by Transmission Type")
}
prnt.test <- function(x){cat(x, sep="\n\n")}