
R Tutorials | Regression Analysis with R | #rstats

R is an invaluable instrument for regression analysis. It lets us conduct all sorts of regression analyses with ease, explore possible causal links between factors and even predict future phenomena. In this instalment of our R tutorials, we shall see how quickly and easily we can analyse data using linear regression in R.


What is Regression?

Regression helps us to explain the variation in a dependent variable through the variations in independent variables. Thus, regression is basically an illustration of the relationship, possibly causal, between dependent and independent variables. Should the variation in the dependent variable be sufficiently explained by the independent variables, the regression model in R can be used for prediction.
 
Regression Analysis with R | Regression helps us to explain the variations in a dependent variable (Y) through variations in independent variables

 
The aim of regression analysis in R is to develop a function that predicts the dependent variable from the values of the independent variables. A simple regression model in R is a straight line passing through the data points and showing the trend of the relationship.
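In its simplest form this line is written Y = b0 + b1·X + ε, where Y is the dependent variable, X the independent variable, b0 the intercept, b1 the slope and ε an error term.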
 
Regression Analysis in R | A simple linear regression model
Before we continue with our R tutorials, let's go over the basic code for working with linear regression models in R.
 

Doing Regression Analysis with R

Regression Models with R

For our regression analysis in R, we will need several regression models, all of which are fitted with lm():
  • a simple linear regression model and multiple linear regression model
  • a quadratic (or parabolic) regression model
  • an exponential regression model
 

Simple Linear Regression Model and Multiple Linear Regression Model in R

The simple regression model differs from the multiple regression model in the number of independent variables. The simple linear regression model has one dependent and one independent variable:
reg<-lm(price~size)

The multiple linear regression model has one dependent and several independent variables:
reg<-lm(price~size+localisation+age)  
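As a minimal self-contained sketch (the variable names price, size, localisation and age mirror the tutorial's data file, but the numbers below are simulated purely for illustration):

set.seed(1)
size<-runif(100,20,150)                      # surface in square metres (simulated)
localisation<-sample(1:20,100,replace=TRUE)  # district number (hypothetical coding)
age<-runif(100,0,100)                        # age of the building in years (simulated)
price<-3000*size-500*age+rnorm(100,sd=10000) # simulated prices
reg<-lm(price~size+localisation+age)
summary(reg)                                 # coefficients, R-squared, p-values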
 

Parabolic Regression Model in R

The parabolic regression model or quadratic regression model has the following form:
reg2<-lm(price~size+I(size^2))

NB! The I() is necessary because inside a model formula the operators * and ^ have special meanings; they only act as arithmetic operators inside I().
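To see the difference, here is a quick illustration with toy vectors (x and y are made up for the example):

x<-1:10
y<-x^2+rnorm(10)
lm(y~x^2)      # inside a formula ^ is a crossing operator, so this is the same as lm(y~x)
lm(y~I(x^2))   # I() protects ^, so x really gets squared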
 

Exponential Regression Model in R

To analyse the exponential regression model, we need to do the following:

1) define the exponential regression model:
reg3<-lm(log(price)~log(size))

2) compute the parameters of the exponential regression model:
summary(reg3)

3) back-transform the intercept of the model (the value 1.13856 is taken from the summary output):
exp(1.13856)
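The reason for step 3 is that the model is fitted on the log scale: log(price) = a + b·log(size), which is equivalent to price = exp(a)·size^b, so the constant of the curve is exp(intercept). In code:

a<-coef(reg3)[1]   # intercept on the log scale
b<-coef(reg3)[2]   # exponent of size
exp(a)             # multiplicative constant: price is approximately exp(a)*size^b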
 

Case Study of Linear Regression Analysis with R: Apartment Price in Paris as a Function of its Size


 
1. After opening the data file, we attach it to R's search path and explore its structure:
getwd()
app<-read.table("C:/Users/Karol/Desktop/R/appart.csv",header=TRUE, sep=",")
attach(app)
str(app)

2. Next, we create a new variable price per square metre and add it to the datafile app:
psq<-price/size
cbind(app,psq)->app

3. Then, we plot the price as a function of the size and add the apartment numbers in red:
plot(size,price)
text(size,price,number,pos=3,cex=0.7,col="red")

Linear Regression in R

4. Our next action is to define the simple linear regression model and study its results; after that, we add the regression line to the graph:
reg1<-lm(price~size)
summary(reg1)
abline(reg1,col="darkblue")
Linear Regression Line in R

5. Having defined the simple linear regression model in R, we 1) define the parabolic (quadratic) regression model and analyse its output, and 2) add its curve to the graph:
reg2<-lm(price~I(size^2))
summary(reg2)
x<-0:300
y<-0.0199*x^2+213.6
lines(x,y,col="darkgreen")

Parabolic Regression Analysis with R
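Instead of hard-coding the coefficients 0.0199 and 213.6 (read off from summary(reg2)), the same curve can be drawn with predict(), a sketch using the lines() recipe shown later in this tutorial:

x<-0:300
y<-predict(reg2,list(size=x))
lines(x,y,col="darkgreen")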

6. After the parabolic regression model, we 1) define the exponential regression model and compute its parameters; 2) then we add its curve to the graph:
reg3<-lm(log(price)~log(size))
summary(reg3)
exp(1.13856)
x<-0:300
y<-3.12*x^1.097
lines(x,y,col="brown")

Exponential Regression Analysis with R

7. Having analysed the regression models in R (the simple linear, the parabolic/quadratic and the exponential model), we return to the linear model. We compute its standardised residuals and attach them to the data file app:
rstandard(reg1)->res
cbind(app,res)->app

NB!

> To compute the raw residuals of a regression model, we use the command residuals():
Example: resid_raw<-residuals(reg)

> To find the studentised residuals, we use the function rstudent():
Example: resid_stud<-rstudent(reg)

8. Next, we check whether there are outliers by putting the apartments in increasing order of their residuals:
app[order(res),c(1,6)]

9. We check for outliers graphically by plotting the standardised residuals with horizontal lines at -2 and 2 and vertical limits defined by ylim=c(-3,3):
plot(res~number, ylim=c(-3,3))
abline(h=c(-2,0,2),lty=c(2,1,2))
text(number,res,number,pos=2)

Outliers in Regression Analysis with R
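Under the normality hypothesis roughly 95% of the standardised residuals should fall between -2 and 2, which is why the points outside the dashed lines are treated as potential outliers.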

10. We define a subset of the sample without outliers (standardised residuals within ±2) and refit the linear model on this subset:
app2<-subset(app, subset=(abs(res)<2))
reg4<-lm(price~size, data=app2)
summary(reg4)

11. Finally, we install the package asbio and plot the confidence and prediction intervals (for the mean and for individual predictions) of the linear model with plotCI.reg():
utils:::menuInstallPkgs()
library(asbio)
plotCI.reg(size,price,conf=0.97)

plotCI.reg()


Now, let's move to the most important part of our R tutorials, the case study.

Case Study of Multiple Regression Analysis with R: Predictive Analysis of Sales

1. First of all, we go through the preliminaries in R: we open the file, attach it to R's search path and analyse its structure:
getwd()
setwd("C:/Users/Karol/Desktop/R")
sal<-read.table("salesdata.csv", header=TRUE, sep=",")
str(sal)
attach(sal)


2. Our next step would be to draw a pairwise scatterplot of the variables using the function plot() in R:
plot(sal)
Pairwise Scatterplot in R


3. Using the function cor() in R, we can study the correlation coefficients between the variables:
cor(sal)
cor()

NB!
In order to carry out a correlation test to see whether a given correlation is statistically significant, we use the function cor.test():
cor.test(price, size)
cor.test(q3, q4, method="spearman")


4. Now, let's define the multiple linear regression model and analyse it:
reg1<-lm(sales~tm+dw+price+rb+inv+pub+se+tpub)
summary(reg1)

 
5. To build a better multiple linear regression model, we follow the stepwise backward regression method. Along the way we also need to get rid of collinearity problems. The variance inflation factor helps us do that: at each step we eliminate the variable with the highest variance inflation factor, which eventually gives us the final model. To compute the variance inflation factors in R, we use the function vif(), which requires the library car:

utils:::menuInstallPkgs()
library(car)
vif(reg1)

vif()
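For reference, the variance inflation factor of a regressor is 1/(1-R²) of the auxiliary regression of that regressor on all the others; as a sketch for the variable se (using the tutorial's variable names):

aux<-lm(se~tm+dw+price+rb+inv+pub+tpub)  # auxiliary regression for se
1/(1-summary(aux)$r.squared)             # should match vif(reg1)["se"]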
 

6. The variable with the highest variance inflation factor is selling expenses, se (8.621217); hence we eliminate it and proceed with the next model.
 
reg2<-lm(sales~tm+dw+price+rb+inv+pub+tpub)
vif(reg2)
summary(reg2)

The new model does not manifest any multicollinearity.


7. Stepwise backward multiple regression

A. Manual Method
At each step we remove the variable with the smallest marginal contribution, and so on until we reach the final model in which all variables are relevant:

> We first eliminate tpub:
reg3<-lm(sales~tm+dw+price+rb+inv+pub)
summary(reg3)

> Next, we eliminate rb:
reg4<-lm(sales~tm+dw+price+inv+pub)
summary(reg4)

> Finally, we eliminate dw:
reg5<-lm(sales~tm+price+inv+pub)
summary(reg5)

> The model that we were looking for is the following:

Sales = 3302 + 5.19·tm - 13.17·price + 1.97·inv + 8.23·pub
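Once the final model is retained, predict() turns it into a forecasting tool; the regressor values below are made up purely for illustration:

predict(reg5, newdata=data.frame(tm=50, price=120, inv=300, pub=40))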
 
B. Automatic Method (library MASS required)
 
utils:::menuInstallPkgs()
library(MASS)
stepAIC(reg2, direction="backward")
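stepAIC() drops, at each step, the term whose removal lowers the AIC the most and prints the sequence of models it considers; note that AIC-based selection does not always retain exactly the same variables as the p-value-based manual method.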
 

8. Next, the command anova() can do a lot for us. Namely, it can:

1) display the variance decomposition of a regression model by showing the total sum of squares (the explained sum of squares + the residual sum of squares):
anova(reg5)
 
2) compare the simplified model to the complete model and test whether the more complicated model is better. The null hypothesis H0 is that the two models are equivalent:
anova(reg2,reg5)
 
In our case the models are equivalent. Therefore, we retain H0 and work with the simplified model.


9. The next action is to compute the standardised residuals and add them to the data file:
res<-rstandard(reg5)
cbind(sal,res)->sal

 
10. Then, we analyse the outliers graphically. Using the command fitted(), we compute the predictions for the observations of our regression model:
plot(res~semester, ylim=c(-3,3))
abline(h=c(-2,0,2), lty=c(2,1,2))
text(semester,res, semester, col="blue", pos=2, cex=0.6)
fitted(reg5)->pred
plot(pred,sales)
plotCI.reg(pred, sales, conf = 0.97)
text(pred, sales, semester, pos=2, cex=0.5)


11. After that, we verify the model hypotheses graphically by using the functions par() and plot():
# graphical verification of the model hypotheses
par(mfrow=c(2,2))  # creates a 2x2 window to hold the 4 graphs
plot(reg5)
dev.off()  # goes back to a single graphics window
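For an lm object, plot() produces four diagnostic graphs: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage.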

Other Useful Functions for Regression Analysis in R:

- lines() – draws the fitted curve of a non-linear model:
x<-0:300
y<-predict(reg2,list(size=x))
lines(x,y)

- confint() – computes the confidence intervals for the parameters of a regression model:
confint(reg, level=0.99)
 
- predict() – computes predictions for sample points:
predict(reg,list(size=c(35,67,93)))
predict(reg,list(size=35))
 
- update() – re-estimates a model on a subset of the data:
reg2<-update(reg,subset=(number!=4))  # takes out observation number 4 from the sample
 
- qqPlot() – verifies the normality hypothesis graphically (needs the library car):
utils:::menuInstallPkgs()
library(car)
qqPlot(reg, id.method="identify")
 
- influencePlot() – illustrates the influence of the sample points on the parameter estimates by means of a graph:
influencePlot(reg2,id.method="identify")
