0% found this document useful (0 votes)
20 views

Regression Dataset Example

Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
20 views

Regression Dataset Example

Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

Simple Linear Regression with Python

In this post we will guide you the very first step to


approach Machine Learning using Simple Linear
Regression.
What is Linear?
First, let’s say that you are shopping at Walmart.
Whether you buy goods or not, you have to pay $2.00 for
parking ticket. Each apple price $1.5, and you have to
buy an (x) item of apple. Then we can populate a price
list as below:
It’s easy to predict (or calculate) the Price based on
Value and vice versa using the equation of y=2+1.5x for
this example or:

with:
• a=2
• b = 1.5
A linear function has one independent variable and one
dependent variable. The independent variable is x and
the dependent variable is y.
• a is the constant term or the y intercept. It is the
value of the dependent variable when x = 0.
• b is the coefficient of the independent variable. It is
also known as the slope and gives the rate of change
of the dependent variable.
Why we call it linear? Alright, let’s visualize the data set
we got above!

After plotting all value of the shopping cost (in blue line),
you can see, they all are in one line, that’s why we call
it linear. With the equation of linear (y=a+bx), the a is
an independent variable. Even if a=0 (you have no need
to pay for the parking ticket), the Shopping Cost line will
shift down and they are still in a line (orange line).

But in real life, things are not that simple!


Let’s take another example, in AB Company, there is a
salary distribution table based on Year of Experience as
per below:
“The scenario is you are a HR officer, you got a
candidate with 5 years of experience. Then what
is the best salary you should offer to him?”
Before deep dive into this problem, let’s plot the data set
into the plot first:

Please look at this chart carefully. Now we have a bad


news: all the observations are not in a line. It means we
cannot find out the equation to calculate the (y) value.
So what now? Don’t worry, we have a good news for you!
Look at the Scatter Plot again before scrolling down. Do
you see it?
All the points is not in a line BUT they are in a line-shape!
It’s linear!
Based on our observation, we can guess that the salary
range of 5 Years Experience should be in the red range.
Of course, we can offer to our candidate any number in
that red range. But how to pick the best number for him?
It’s time to use Machine Learning to predict the best
salary for our candidate.
In this section, we will use Python on Spyder IDE to find
the best salary for our candidate. Okay, let’s do it!
Linear Regression with Python
Before moving on, we summarize 2 basic steps of
Machine Learning as per below:
1.Training
2.Predict
Okay, we will use 4 libraries such as numpy and pandas to
work with data set, sklearn to implement machine learning
functions, and matplotlib to visualize our plots for viewing:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('salary_data.csv')
X = dataset.iloc[:, :-1].values #get a copy of dataset exclude last column
y = dataset.iloc[:, 1].values #get array of dataset in column 1st

Code explanation:
• dataset: the table contains all values in our csv file
• X: the first column which contains Years Experience
array
• y: the last column which contains Salary array

Next, we have to split our dataset (total 30 observations)


into 2 sets: training set which used for training and test
set which used for testing:

# Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3,
random_state=0)

Code explanation:
• test_size=1/3: we will split our dataset (30 observations)
into 2 parts (training set, test set) and the ratio
of test set compare to dataset is 1/3 (10
observations will be put into the test set. You can put
it 1/2 to get 50% or 0.5, they are the same. We should
not let the test set too big; if it’s too big, we will lack
of data to train. Normally, we should pick around 5%
to 30%.
• train_size: if we use the test_size already, the rest of
data will automatically be assigned to train_size.
• random_state: this is the seed for the random number
generator. We can put an instance of
the RandomState class as well. If we leave it blank or 0,
the RandomState instance used by np.random will be used
instead.
We already have the train set and test set, now we have
to build the Regression Model:
# Fitting Simple Linear Regression to the Training set

from sklearn.linear_model import LinearRegression


regressor = LinearRegression()
regressor.fit(X_train, y_train)

Code explanation:
• regressor = LinearRegression(): our training model which will
implement the Linear Regression.
• regressor.fit: in this line, we pass the X_train which
contains value of Year Experience and y_train which
contains values of particular Salary to form up the
model. This is the training process.

Let’s visualize our training model and testing model:

# Visualizing the Training set results


viz_train = plt
viz_train.scatter(X_train, y_train, color='red')
viz_train.plot(X_train, regressor.predict(X_train), color='blue')
viz_train.title('Salary VS Experience (Training set)')
viz_train.xlabel('Year of Experience')
viz_train.ylabel('Salary')
viz_train.show()

# Visualizing the Test set results


viz_test = plt
viz_test.scatter(X_test, y_test, color='red')
viz_test.plot(X_train, regressor.predict(X_train), color='blue')
viz_test.title('Salary VS Experience (Test set)')
viz_test.xlabel('Year of Experience')
viz_test.ylabel('Salary')
viz_test.show()
After running above code, you will see 2 plots in the
console window:

Compare two plots, we can see 2 blue lines are the same
direction. Our model is good to use now.
Alright! We already have the model, now we can use it to
calculate (predict) any values of X depends on y or any
values of y depends on X. This is how we do it:

# Predicting the result of 5 Years Experience

y_pred = regressor.predict(5)
Predict y_pred using single value of X=5
Bingo! The value of y_pred with X = 5 (5 Years
Experience) is 73545.90
You can offer to your candidate the salary of
$73,545.90 and this is the best salary for him!
We can also pass an array of X instead of single value
of X:
# Predicting the Test set results
y_pred = regressor.predict(X_test)

Predict y_pred using array of X_test


And we can predict X using y as well. Let’s try it yourself!
In conclusion, with Simple Linear Regression, we
have to do 5 steps as per below:
1.Importing the dataset.
2.Splitting dataset into training set and testing set (2
dimensions of X and y per each set). Normally, the
testing set should be 5% to 30% of dataset.
3.Visualize the training set and testing set to double
check (you can bypass this step if you want).
4.Initializing the regression model and fitting it using
training set (both X and y).
5.Let’s predict!!

Complete code:
import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

# Importing the dataset

#dataset = pd.read_csv('181105_missing-data.csv')

dataset = pd.read_csv('/home/student/Desktop/salary_data.csv')

X = dataset.iloc[:, :-1].values #get a copy of dataset exclude last column

y = dataset.iloc[:, 1].values #get array of dataset in column 1st

# Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

# Scaling
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()

X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# itting Simple Linear Regression to the Training set

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()

regressor.fit(X_train, y_train)

# Predicting the Test set results

y_pred = regressor.predict(X_test)

# Visualizing the Training set results

viz_train = plt

viz_train.scatter(X_train, y_train, color='red')

viz_train.plot(X_train, regressor.predict(X_train), color='blue')

viz_train.title('Salary VS Experience (Training set)')

viz_train.xlabel('Year of Experience')

viz_train.ylabel('Salary')

viz_train.show()

# Visualizing the Test set results

viz_test = plt

viz_test.scatter(X_test, y_test, color='red')

viz_test.plot(X_train, regressor.predict(X_train), color='blue')

viz_test.title('Salary VS Experience (Test set)')

viz_test.xlabel('Year of Experience')

viz_test.ylabel('Salary')

viz_test.show()

y_pred = regressor.predict([5])
y_pred

y_pred = regressor.predict(X_test)

y_pred
Output:

[73545.90445964]

You might also like