Regression Dataset Example
with:
• a = 2
• b = 1.5
A linear function has one independent variable and one
dependent variable. The independent variable is x and
the dependent variable is y.
• a is the constant term, or the y-intercept. It is the
value of the dependent variable when x = 0.
• b is the coefficient of the independent variable. It is
also known as the slope and gives the rate of change
of the dependent variable.
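As a quick check of these definitions, here is a minimal sketch that
evaluates y = a + bx with the values a = 2 and b = 1.5 given above.
The function name linear is only illustrative.

```python
def linear(x, a=2.0, b=1.5):
    """Return y = a + b*x for intercept a and slope b."""
    return a + b * x

xs = [0, 1, 2, 3]
ys = [linear(x) for x in xs]
print(ys)  # [2.0, 3.5, 5.0, 6.5]
```

At x = 0 the result equals a, and each step of 1 in x adds b to y.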
Why do we call it linear? Alright, let's visualize the data set
we got above!
After plotting all values of the shopping cost (the blue line),
you can see they all lie on one line; that's why we call
it linear. In the equation of a line (y = a + bx), a is
a constant that does not depend on x. Even if a = 0 (you do
not need to pay for the parking ticket), the Shopping Cost
line just shifts down and the points still lie on a line
(the orange line).
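The shift can be checked numerically. The sketch below assumes the
values a = 2 and b = 1.5 from earlier for the blue line and a = 0 for
the orange line; the variable names are made up for illustration.

```python
import numpy as np

x = np.arange(0, 6)             # 0, 1, ..., 5
with_parking = 2 + 1.5 * x      # a = 2 (assumed blue line)
without_parking = 0 + 1.5 * x   # a = 0 (assumed orange line)

# The step between neighbouring points is the slope b on both lines.
print(np.diff(with_parking))
print(np.diff(without_parking))

# The gap between the two lines is the constant a everywhere.
print(with_parking - without_parking)
```

Both lines have the same constant slope, so changing a only moves the
line up or down.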
Code explanation:
• dataset: the table that contains all values from our csv file
• X: the first column, which contains the Years Experience
array
• y: the last column, which contains the Salary array
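A minimal sketch of how these three names could be created with
pandas. The column names and the three rows are made up for
illustration, and an in-memory string stands in for the real csv file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the csv file (made-up values).
csv_text = """YearsExperience,Salary
1.0,40000
2.0,45000
3.0,50000
"""

dataset = pd.read_csv(io.StringIO(csv_text))
X = dataset.iloc[:, :-1].values  # every column except the last, kept 2-D
y = dataset.iloc[:, -1].values   # the last column, 1-D

print(X.shape, y.shape)  # (3, 1) (3,)
```

X is kept two-dimensional because scikit-learn estimators expect the
features as a table of shape (n_samples, n_features).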
# Splitting the dataset into the Training set and Test set
Code explanation:
• test_size=1/3: we split our dataset (30 observations)
into 2 parts (training set, test set), and the ratio
of the test set to the whole dataset is 1/3 (10
observations will be put into the test set). You can also
pass 1/2 or 0.5 to get 50%; they are the same. We should
not make the test set too big; if it is too big, we will
lack data to train on. Normally, we should pick around 5%
to 30%.
• train_size: if test_size is already set, the rest of the
data is automatically assigned to the training set.
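• random_state: this is the seed for the random number
generator. We can pass an instance of
the RandomState class as well. If we leave it as None (the
default), the RandomState instance used by np.random is used
instead, so the split changes from run to run. An integer
such as 0 is a fixed seed and gives the same split every time.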
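The parameters above can be tried on a toy dataset. This is only a
sketch: the 30 observations are generated with np.arange instead of
being read from the csv file, and random_state=0 is an assumed seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 30 made-up observations with a single feature.
X = np.arange(30).reshape(-1, 1)
y = np.arange(30)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)
print(len(X_train), len(X_test))  # 20 10

# The same integer seed reproduces exactly the same split.
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=1/3, random_state=0)
print(np.array_equal(X_test, X_test2))  # True
```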
We now have the training set and the test set, so we can
build the Regression Model:
# Fitting Simple Linear Regression to the Training set
Code explanation:
• regressor = LinearRegression(): our model, which
implements Linear Regression.
• regressor.fit: in this line, we pass X_train, which
contains the values of Years Experience, and y_train, which
contains the corresponding Salary values, to build the
model. This is the training process.
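To see what fit learns, here is a small sketch on made-up training
data that lies exactly on the line y = 2 + 1.5x. The data is an
assumption for illustration, not the salary dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data on the line y = 2 + 1.5x.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = 2 + 1.5 * X_train.ravel()

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# intercept_ plays the role of a, coef_[0] the role of b.
print(regressor.intercept_, regressor.coef_)
```

After training, intercept_ should be close to 2 and coef_[0] close to
1.5, matching a and b in y = a + bx.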
Comparing the two plots, we can see that the 2 blue lines have
the same direction. Our model is good to use now.
Alright! We already have the model, so now we can use it to
calculate (predict) the value of y for any value of X.
This is how we do it:
y_pred = regressor.predict([[5]])
Predict y_pred using a single value of X = 5
Bingo! The value of y_pred for X = 5 (5 Years
Experience) is 73545.90
You can offer your candidate a salary of
$73,545.90, which is the model's estimate for that level
of experience.
We can also pass an array of X instead of a single value
of X:
# Predicting the Test set results
y_pred = regressor.predict(X_test)
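The sketch below shows the input shape that predict expects. It uses
three made-up training points rather than the salary dataset, so the
numbers differ from the prediction above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data on the line y = 35000 + 5000x.
X_train = np.array([[1.0], [2.0], [3.0]])
y_train = np.array([40000.0, 45000.0, 50000.0])
regressor = LinearRegression().fit(X_train, y_train)

# A single value still needs shape (1, 1): one sample, one feature.
single = regressor.predict([[5]])

# Several values are passed as one row per sample.
several = regressor.predict(np.array([[4.0], [5.0]]))

print(single.shape, several.shape)  # (1,) (2,)
```

Passing a bare number or a flat list such as [5] raises an error in
current scikit-learn versions, because the input must be 2-D.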
Complete code:
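import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Importing the dataset
dataset = pd.read_csv('/home/student/Desktop/salary_data.csv')
X = dataset.iloc[:, :-1].values  # Years Experience, kept 2-D for scikit-learn
y = dataset.iloc[:, -1].values   # Salary

# Splitting the dataset into the Training set and Test set
# (random_state=0 is an assumed seed; the original value is not shown)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

# Scaling
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# Fitting Simple Linear Regression to the Training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Visualising the Training set (x axis converted back to years)
viz_train = plt
viz_train.scatter(sc_X.inverse_transform(X_train), y_train)
viz_train.plot(sc_X.inverse_transform(X_train), regressor.predict(X_train), color='blue')
viz_train.xlabel('Year of Experience')
viz_train.ylabel('Salary')
viz_train.show()

# Visualising the Test set against the same fitted line
viz_test = plt
viz_test.scatter(sc_X.inverse_transform(X_test), y_test)
viz_test.plot(sc_X.inverse_transform(X_train), regressor.predict(X_train), color='blue')
viz_test.xlabel('Year of Experience')
viz_test.ylabel('Salary')
viz_test.show()

# Predicting a single value: 5 years, scaled the same way as the training data
y_pred = regressor.predict(sc_X.transform([[5]]))
print(y_pred)

# Predicting the Test set results
y_pred = regressor.predict(X_test)
print(y_pred)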
Output:
[73545.90445964]