Regression in Data Mining

Regression involves a predictor variable (whose values are known) and a response variable (whose values are to be predicted).

The two basic types of regression are:

1. Linear regression

  • It is the simplest form of regression. Linear regression attempts to model the relationship between two variables by fitting a linear equation to the observed data.
  • Linear regression attempts to find the mathematical relationship between variables.
  • If the fitted relationship is a straight line, the model is linear; if it is a curve, the model is non-linear.
  • With only one independent variable 'X', the relationship with the dependent variable 'Y' is given by a straight line:
    Y = α + β X
  • The model 'Y' is a linear function of 'X', where α is the intercept and β is the slope.
  • The value of 'Y' increases or decreases linearly as the value of 'X' changes.
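
The coefficients α and β are commonly estimated by least squares. The following is a minimal sketch of that idea in plain Python, not a reference implementation; the function name fit_line and the sample points are illustrative assumptions and do not come from this tutorial.

```python
def fit_line(xs, ys):
    """Return (alpha, beta) minimising the squared error of Y = alpha + beta * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Illustrative points (an assumption) that lie exactly on Y = 1 + 2 X.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
alpha, beta = fit_line(xs, ys)
print(alpha, beta)  # 1.0 2.0
```

Because the sample points lie exactly on a line, the fit recovers the intercept and slope without error; with noisy data the same formulas give the line with the smallest sum of squared residuals.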

2. Multiple regression model

  • Multiple linear regression is an extension of linear regression analysis.
  • It uses two or more independent variables to predict a single continuous dependent variable.
    Y = a0 + a1 X1 + a2 X2 + ... + ak Xk + e
    where,
    'Y' is the response variable.
    X1, X2, ..., Xk are the independent predictors.
    'e' is the random error.
    a0, a1, a2, ..., ak are the regression coefficients.
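
One common way to estimate the regression coefficients is to solve the least-squares normal equations. The sketch below does this in plain Python under stated assumptions: the helper names solve and fit_multiple are hypothetical, and the data is generated from an illustrative model with no error term so that the result is easy to check.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_multiple(rows, ys):
    """Return [a0, a1, ..., ak] for Y = a0 + a1 X1 + ... + ak Xk."""
    # Prepend a constant 1 to each row so that a0 acts as the intercept.
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    # Normal equations: (D^T D) a = D^T y.
    dtd = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    dty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(dtd, dty)


# Illustrative data (an assumption) generated from Y = 1 + 2 X1 + 3 X2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coeffs = fit_multiple(rows, ys)
print([round(c, 6) for c in coeffs])
```

For real data sets a numerical library is usually a better choice than hand-written elimination; the sketch only shows where the coefficients a0, a1, ..., ak come from.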

Naive Bayes Classification: Solved Example

Bike Damaged example: The following table lists the attributes color, type and origin of each bike, together with the class label "Damaged?", which can be Yes or No.

Bike No   Color   Type     Origin     Damaged?
10        Blue    Moped    Indian     Yes
20        Blue    Moped    Indian     No
30        Blue    Moped    Indian     Yes
40        Red     Moped    Indian     No
50        Red     Moped    Japanese   Yes
60        Red     Sports   Japanese   No
70        Red     Sports   Japanese   Yes
80        Red     Sports   Indian     No
90        Blue    Sports   Japanese   No
100       Blue    Moped    Japanese   Yes

Solution:
Required formula:
              P (c | x) = P (x | c) P (c) / P (x)

Where,
P (c | x) is the posterior probability of class c given predictor x.
P (c) is the prior probability of the class.
P (x | c) is the likelihood, i.e. the probability of the predictor given the class.
P (x) is the prior probability of the predictor.
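
As a quick check of the formula, the sketch below applies it to the single attribute color, using counts read off the table above: 5 of the 10 bikes are damaged, 5 of the 10 are blue, and 3 of the 5 damaged bikes are blue. The helper name posterior is hypothetical and is used only for illustration.

```python
from fractions import Fraction


def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(c | x) = P(x | c) * P(c) / P(x)."""
    return likelihood * prior / evidence


p_blue_given_yes = Fraction(3, 5)  # likelihood P(Blue | Yes)
p_yes = Fraction(5, 10)            # class prior P(Yes)
p_blue = Fraction(5, 10)           # predictor prior P(Blue)

p_yes_given_blue = posterior(p_blue_given_yes, p_yes, p_blue)
print(p_yes_given_blue)  # 3/5
```

The result agrees with a direct count: of the five blue bikes (10, 20, 30, 90 and 100), three are damaged.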

The task is to classify <Blue, Indian, Sports>, an unseen sample that does not appear in the data set.
The required probabilities are computed from the table as follows:
     
P (Yes) = 5/10
P (No) = 5/10

Color
P(Blue|Yes) = 3/5        P(Blue|No) = 2/5
P(Red|Yes) = 2/5         P(Red|No) = 3/5
Type
P(Sports|Yes) = 1/5      P(Sports|No) = 3/5
P(Moped|Yes) = 4/5       P(Moped|No) = 2/5
Origin
P(Indian|Yes) = 2/5      P(Indian|No) = 3/5
P(Japanese|Yes) = 3/5    P(Japanese|No) = 2/5
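
The priors and conditional probabilities above can be reproduced by counting rows of the table. The following is a minimal sketch, assuming the table is typed in as a list of tuples; the variable names (bikes, attributes, priors, conditionals) are illustrative choices, not part of the original example.

```python
from collections import Counter
from fractions import Fraction

# (color, type, origin, damaged) for bikes 10 to 100, copied from the table.
bikes = [
    ("Blue", "Moped", "Indian", "Yes"),
    ("Blue", "Moped", "Indian", "No"),
    ("Blue", "Moped", "Indian", "Yes"),
    ("Red", "Moped", "Indian", "No"),
    ("Red", "Moped", "Japanese", "Yes"),
    ("Red", "Sports", "Japanese", "No"),
    ("Red", "Sports", "Japanese", "Yes"),
    ("Red", "Sports", "Indian", "No"),
    ("Blue", "Sports", "Japanese", "No"),
    ("Blue", "Moped", "Japanese", "Yes"),
]
attributes = ("Color", "Type", "Origin")

# Class priors: fraction of bikes in each class.
class_counts = Counter(row[-1] for row in bikes)
priors = {c: Fraction(n, len(bikes)) for c, n in class_counts.items()}

# value_counts[(attribute, value, class)] = number of matching bikes.
value_counts = Counter(
    (attr, row[i], row[-1]) for row in bikes for i, attr in enumerate(attributes)
)
# Conditional probability: matching bikes divided by bikes in that class.
conditionals = {
    key: Fraction(n, class_counts[key[2]]) for key, n in value_counts.items()
}

for (attr, value, cls), p in sorted(conditionals.items()):
    print(f"{attr}: P({value}|{cls}) = {p}")
```

Fractions are used instead of floats so that the printed values match the hand-computed ones exactly.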

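Naive Bayes assumes that the attributes are independent of each other given the class, so P(X|class) is the product of the individual attribute probabilities. For the unseen example X = <Blue, Indian, Sports>: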

P(X|Yes) * P(Yes) = P(Blue|Yes) * P(Indian|Yes) * P(Sports|Yes) * P(Yes)
                  = 3/5 * 2/5 * 1/5 * 5/10 = 0.024

P(X|No) * P(No) = P(Blue|No) * P(Indian|No) * P(Sports|No) * P(No)
                = 2/5 * 3/5 * 3/5 * 5/10 = 0.072

P(X) is the same for both classes, so it is enough to compare the two products. Since 0.072 > 0.024, the example is classified as No.
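
The comparison can also be sketched in code. The snippet below hard-codes the probabilities from the worked example rather than recomputing them, and additionally normalises the two scores into posterior probabilities. The dictionary layout and variable names are assumptions made for illustration.

```python
from fractions import Fraction
from math import prod

# Probabilities copied from the worked example.
priors = {"Yes": Fraction(5, 10), "No": Fraction(5, 10)}
likelihoods = {
    "Yes": {"Blue": Fraction(3, 5), "Indian": Fraction(2, 5), "Sports": Fraction(1, 5)},
    "No": {"Blue": Fraction(2, 5), "Indian": Fraction(3, 5), "Sports": Fraction(3, 5)},
}
sample = ("Blue", "Indian", "Sports")

# Naive Bayes score: P(c) times the product of P(value | c) over the sample.
scores = {
    c: priors[c] * prod(likelihoods[c][v] for v in sample) for c in priors
}
prediction = max(scores, key=scores.get)

# Dividing each score by the sum of all scores (the model's estimate of P(X))
# turns the scores into posterior probabilities that add up to 1.
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}

print({c: float(s) for c, s in scores.items()})  # {'Yes': 0.024, 'No': 0.072}
print(prediction, posteriors[prediction])        # No 3/4
```

Under these numbers the model assigns the sample a posterior probability of 3/4 for No and 1/4 for Yes, which leads to the same decision as comparing 0.072 with 0.024.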