
Data Science Training Report 2023

The document is an industrial training report on data science completed at Teachnook from April 2023 to June 2023. It covers the student's learning of introduction to data science, python for data science, statistics for data science, and predictive modelling basics. The report defines data science and the data science process. It also provides examples of data science applications and an introduction to computer vision.

Uploaded by

Krishna Soni

INDUSTRIAL TRAINING REPORT
ON
“DATA SCIENCE”
Completed at
Teachnook
Duration
April 2023 to June 2023
3rd Year (5th Sem)
Submitted By:
Krishna Soni

Enrolment Roll No.: ECB 2021/10/19


Department of Artificial Intelligence & Data Science
Engineering College of Bikaner
Bikaner, Rajasthan

DECLARATION

I hereby declare that the Industrial Training Report on Data Science, completed at
Teachnook, is an authentic record of my own work, carried out as a requirement of the
Industrial Training that forms part of the V semester syllabus, during the period from April 2023 to
June 2023. It is submitted at the Department of Artificial Intelligence and Data Science,
Engineering College Bikaner, for the award of the degree of B.Tech. in Artificial
Intelligence and Data Science by Bikaner Technical University, Bikaner.

Krishna
21EEBAD019

Table of Contents
Certificate
Student Declaration

1. Introduction
1.1) Data science
1.2) Data Science Process

2. My Learning
2.1) Introduction to Data Science
2.2) Python for Data Science
2.3) Understanding the statistics for Data Science
2.4) Predictive Modelling and Basics of Machine Learning

3. Introduction to Data Science


3.1) Data Science
3.2) Example
3.3) Computer Vision
3.4) Application of Data Science
3.5) Reason for choosing Data Science

4. Python Introduction
4.1) History of Python
4.2) Features of Python
4.3) Python for data science
4.4) Why Python

5. Statistics
5.1) Descriptive Statistics
5.2) Types of Variable
5.3) Outliers
5.4) Range
5.5) Histogram
5.6) Inferential statistics
5.7) Hypothesis testing
5.8) T-Test
5.9) Z Score
5.10) Chi Squared Test

6. Predictive Modelling
6.1) Types
6.2) Stages of Predictive Modelling
6.3) Problem Definition
6.4) Hypothesis Generation
6.5) Data Extraction and Collection
6.6) Data Exploration and Transformation
6.6.1) Variable Identification
6.6.2) Univariate Analysis
6.6.3) Bivariate Analysis
6.6.4) Missing Value Treatment
6.7) Types of Outliers
6.7.1) Univariate
6.7.2) Bivariate

7. Model Building
7.1) Algorithms
7.2) Algorithm of Machine Learning
8. Methodology
9. Result
10. References

INTRODUCTION
OBJECTIVES
To explore, sort and analyse large volumes of data from various sources in order to draw
conclusions that optimize business processes and support decision making.
Examples include predictive maintenance of machines and, in the fields of marketing and
sales, sales forecasting based on weather.

1.1) DATA SCIENCE:


Data science is a multidisciplinary subject that uses mathematics, statistics, and computer science
to study and evaluate data. The key objective of data science is to extract valuable information for
use in strategic decision making, product development, trend analysis, and forecasting.
Data Science concepts and processes are mostly derived from data engineering, statistics,
programming, social engineering, data warehousing, machine learning, and natural language
processing. The key techniques in use are data mining, big data analysis, data extraction and data
retrieval.
Data science is the field of study that combines domain expertise, programming skills, and
knowledge of mathematics and statistics to extract meaningful insights from data. Data science
practitioners apply machine learning algorithms to numbers, text, images, video, audio, and more
to produce artificial intelligence (AI) systems to perform tasks that ordinarily require human
intelligence. In turn, these systems generate insights which analysts and business users can translate
into tangible business value.
1.2) DATA SCIENCE PROCESS:
1. The first step of this process is setting a research goal. The main purpose here is making sure
all the stakeholders understand the what, how, and why of the project.
2. The second phase is data retrieval. You want to have data available for analysis, so this step
includes finding suitable data and getting access to the data from the data owner. The result
is data in its raw form, which probably needs polishing and transformation before it becomes
usable.
3. Now that you have the raw data, it’s time to prepare it. This includes transforming the data
from a raw form into data that’s directly usable in your models. To achieve this, you’ll detect
and correct different kinds of errors in the data, combine data from different data sources,
and transform it. If you have successfully completed this step, you can progress to data
visualization and modelling.
4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of
the data. You’ll look for patterns, correlations, and deviations based on visual and descriptive
techniques. The insights you gain from this phase will enable you to start modelling.

5. The fifth step is model building (often referred to as “data modelling”). It is now that you
attempt to gain the insights or make the predictions stated in your project charter. Now is the
time to bring out the heavy guns, but remember that research has taught us that often (but not
always) a combination of simple models tends to outperform one complicated model. If you’ve
done this phase right, you’re almost done.
6. The last step of the data science process is presenting your results and automating the analysis,
if needed. One goal of a project is to change a process and/or make better decisions. You
may still need to convince the business that your findings will indeed change the business
process as expected. This is where you can shine in your influencer role. The importance of
this step is more apparent in projects on a strategic and tactical level. Certain projects require
you to perform the business process over and over again, so automating the project will save
time.

MY LEARNINGS
2.1) INTRODUCTION TO DATA SCIENCE
• Overview & Terminologies in Data Science
• Applications of Data Science
   o Anomaly detection (fraud, disease, etc.)
   o Automation and decision-making (credit worthiness, etc.)
   o Classifications (classifying emails as “important” or “junk”)
   o Forecasting (sales, revenue, etc.)
   o Pattern detection (weather patterns, financial market patterns, etc.)
   o Recognition (facial, voice, text, etc.)
   o Recommendations (based on learned preferences, recommendation engines can refer
     you to movies, restaurants and books you may like)

2.2) PYTHON FOR DATA SCIENCE


Introduction to Python, Understanding Operators, Variables and Data Types, Conditional
Statements, Looping Constructs, Functions, Data Structure, Lists, Dictionaries, Understanding
Standard Libraries in Python, reading a CSV File in Python, Data Frames and basic operations with
Data Frames, Indexing Data Frame.


2.3) UNDERSTANDING THE STATISTICS FOR DATA SCIENCE


Introduction to Statistics, Measures of Central Tendency, Understanding the spread of data,
Data Distribution, Introduction to Probability, Probabilities of Discrete and Continuous Variables,
Normal Distribution, Introduction to Inferential Statistics, Understanding the Confidence Interval
and margin of error, Hypothesis Testing, Various Tests, Correlation.
2.4) PREDICTIVE MODELLING AND BASICS OF MACHINE LEARNING
Introduction to Predictive Modelling, Types and Stages of Predictive Models, Hypothesis
Generation, Data Extraction and Exploration, Variable Identification, Univariate Analysis for
Continuous Variables and Categorical Variables, Bivariate Analysis, Treating Missing Values and
Outliers, Transforming the Variables, Basics of Model Building, Linear and Logistic Regression,
Decision Trees, K-means Algorithms in Python.
Summary of the Procedure for Analysing Data:
Data science generally has a five-stage life cycle that consists of:
• Capture: Data entry, signal reception, data extraction
• Maintain: Data cleansing, data staging, data processing
• Process: Data mining, clustering/classification, data modelling
• Analyse: Predictive analysis, regression
• Communicate: Data reporting, data visualization

Introduction to Data Science


3.1) Data Science
The field of bringing insights from data using scientific techniques is called data science.

3.2) Applications:

Amazon Go – No checkout lines

3.3) Computer Vision - Recognizing an image by computer involves processing large sets
of image data from multiple objects of the same category. For example, face recognition.


Spectrum of Business Analysis

[Figure: stages of business analysis plotted against the value added to the organization. Stages shown: Reporting (What happened?), Detective Analysis (Why did it happen?), Dashboards (What’s happening now?), Predictive Analysis (What is likely to happen?), Big Data (What can happen, given data is collected and used?)]

• Reporting / Management Information System
To track what is happening in the organization.

• Detective Analysis
Asking questions based on the data we are seeing, such as: why did something happen?

• Dashboard / Business Intelligence
Utopia of reporting. Every action of the business is reflected on screen.

• Predictive Modelling
Using past data to predict what will happen at a granular level.

• Big Data
The stage where the complexity of handling data goes beyond traditional systems.
This can be caused by the volume, variety or velocity of the data. Specific tools are needed to analyse data at such a scale.

3.4) Application of Data Science


• Recommendation System
Example: on Amazon, recommendations differ from user to user according to their past searches.

• Social Media
1. Recommendation Engine
2. Ad placement
3. Sentiment Analysis
• Deciding the right credit limit for credit card customers.
• Suggesting right products from e-commerce companies
1. Recommendation System
2. Past Data Searched
3. Discount Price Optimization
• How Google and other search engines know which results are most relevant to our search query:
1. Apply ML and Data Science
2. Fraud Detection
3. AD placement
3.5) Reason for choosing data science
Data Science has become a revolutionary technology that everyone seems to talk about. Although it is hailed as the ‘sexiest
job of the 21st century’, Data Science remains a buzzword, with very few people knowing about the technology in its
true sense.
While many people wish to become Data Scientists, it is essential to weigh the pros and cons of data science
to get a real picture. The main advantages and disadvantages are listed below.

Advantages: -
1. It’s in Demand
2. Abundance of Positions
3. A Highly Paid Career
4. Data Science is Versatile
Disadvantages: -
1. Mastering Data Science is nearly impossible
2. A large Amount of Domain Knowledge Required
3. Arbitrary Data May Yield Unexpected Results
4. The problem of Data Privacy


Python Introduction
PYTHON
Python is an interpreted, high-level, general-purpose programming language. It has efficient high-level data
structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and
dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application
development in many areas on most platforms.

4.1) HISTORY OF PYTHON


Python was developed by Guido van Rossum in the late eighties and early nineties at the National Research
Institute for Mathematics and Computer Science in the Netherlands. Python is derived from many other
languages, including ABC, Modula-3, C, C++, Algol-68, Smalltalk, and Unix shell and other scripting
languages. Python is copyrighted, and its source code is available under the Python Software Foundation
License, an open-source licence that is compatible with the GNU General Public License (GPL). Python is now
maintained by a team of core developers, and Guido van Rossum directed its progress for many years.
4.2) PYTHON FEATURES
Python's features include:
• Easy-to-learn:
Python has few keywords, a simple structure, and a clearly defined syntax. This allows the
student to pick up the language quickly.

• Easy-to-read:
Python code is clearly defined and easy on the eyes.

• Easy-to-maintain:
Python's source code is fairly easy to maintain.

• A broad standard library:
The bulk of Python's library is very portable and cross-platform compatible on UNIX,
Windows, and Macintosh.

• Interactive Mode:
Python has support for an interactive mode which allows interactive testing and debugging
of snippets of code.

• Portable:
Python can run on a wide variety of hardware platforms and has the same interface on all
platforms.

• Extendable:
You can add low-level modules to the Python interpreter. These modules
enable programmers to add to or customize their tools to be more efficient.

• Databases:
Python provides interfaces to all major commercial databases.

• GUI Programming:
Python supports GUI applications that can be created and ported to many system calls, libraries and
windowing systems, such as Windows MFC, Macintosh, and the X Window system of Unix.

• Scalable:
Python provides a better structure and support for large programs than shell scripting.

Apart from the features above, Python has a big list of other good features:

o It supports functional and structured programming methods as well as OOP.

o It can be used as a scripting language or can be compiled to byte-code for building large
applications.

o It provides very high-level dynamic data types and supports dynamic type checking.

o It supports automatic garbage collection.

o It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.

4.3) Python for Data science:


This section looks at Python and its wide-ranging applications in data science, along with a variety of
data science techniques that can be carried out using the Python programming language.
Data science is applied to gather multiple data sets, extract information from them, surface insights,
and interpret them to make effective business decisions. Being a data scientist requires learning some
of the most widely used programming languages, such as Java, C++, R, and Python. Among
these, Python is considered the preferred choice among data scientists throughout the globe.

4.4) Why Python?

1. Python is an open-source language.
2. Its syntax is as simple as English.
3. It has a very large and collaborative developer community.
4. It has extensive packages.
• UNDERSTANDING OPERATORS:
Operators are symbols that represent mathematical and logical operations.
• VARIABLES AND DATATYPES:
Variables are names bound to objects. Basic data types in Python are int (integer), float, bool (Boolean) and
str (string).
• CONDITIONAL STATEMENTS:
If-else statements (Single condition)
If- elif- else statements (Multiple Condition)
• LOOPING CONSTRUCTS:
For loop
• FUNCTIONS:
Functions are reusable pieces of code, created for solving a specific problem.
Two types: built-in functions and user-defined functions.
Once defined, a function can be reused as many times as needed.
• DATA STRUCTURES:

Two types of data structures:

LISTS: A list is an ordered data structure with elements separated by commas and enclosed within square
brackets [].

DICTIONARY: A dictionary is a data structure whose elements are separated by commas and
stored as key: value pairs, enclosed within curly braces {}. Since Python 3.7, a dictionary keeps its keys in insertion order.
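
The constructs listed above (conditional statements, a for loop, a user-defined function, a list and a dictionary) can be combined in a few lines. The following is only a minimal sketch; the names, marks and grade thresholds are made up for illustration and are not taken from the training material.

```python
def grade(mark):
    """Return a letter grade using if-elif-else (thresholds are illustrative)."""
    if mark >= 75:
        return "A"
    elif mark >= 50:
        return "B"
    else:
        return "C"

# A list keeps its elements in order and is written with square brackets.
marks = [82, 47, 65]

# A dictionary stores key: value pairs inside curly braces.
students = {"Asha": 82, "Ravi": 47, "Meena": 65}

# A for loop visits each key: value pair; the same function is reused on every pass.
grades = {}
for name, mark in students.items():
    grades[name] = grade(mark)

print(marks[0])   # first element of the list
print(grades)     # grade for each student
```

The function `grade` is defined once and called three times, which is the sense in which functions are reusable.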


Statistics:
5.1) Descriptive Statistics
• Mode
It is the value which occurs most frequently in the data series.
It is robust and is generally not affected much by the addition of a couple of new values.
Code
import pandas as pd
data = pd.read_csv("Mode.csv")        # read data from the CSV file
data.head()                           # show the first five rows
mode_data = data['Subject'].mode()    # mode of the Subject column
print(mode_data)
• Mean
It is the arithmetic average: the sum of all values divided by the number of values.
import pandas as pd
data = pd.read_csv("mean.csv")            # read data from the CSV file
data.head()                               # show the first five rows
mean_data = data['Overallmarks'].mean()   # mean of the Overallmarks column
print(mean_data)
• Median
It is the middle value of the data set once the values are sorted.
import pandas as pd
data = pd.read_csv("data.csv")                # read data from the CSV file
data.head()                                   # show the first five rows
median_data = data['Overallmarks'].median()   # median of the Overallmarks column
print(median_data)
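
To see how the three measures react to a bad value, the sketch below uses Python's standard `statistics` module on a small made-up list of marks (no CSV file is needed). One mark of 97 is mistyped as 9700, the same kind of typo used as an outlier example in this report.

```python
import statistics

marks = [97, 85, 85, 72, 60]           # illustrative marks
with_typo = [9700, 85, 85, 72, 60]     # 97 mistyped as 9700

# Compute mean, median and mode for both lists.
summary = {}
for label, values in [("clean", marks), ("with typo", with_typo)]:
    summary[label] = (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
    )
    print(label, summary[label])
```

The mean changes drastically because of the single typo, while the median and mode stay at 85. This is why the median and mode are described as robust.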
5.2) Types of variables
• Continuous – takes continuous numeric values. Eg- marks
• Categorical – takes discrete values. Eg- gender
• Ordinal – ordered categorical variable. Eg- teacher feedback
• Nominal – unordered categorical variable. Eg- gender
5.3) Outliers
Any value which falls far outside the typical range of the data is termed an outlier. Eg- 9700 instead of 97.
Reasons for Outliers
• Typos – made during collection. Eg- adding an extra zero by mistake.
• Measurement Error – outliers in the data due to a faulty measuring instrument or operator.
• Intentional Error – errors which are introduced intentionally. Eg- claiming a smaller amount of alcohol
consumed than actual.
• Legit Outlier – values which are not actually errors but are present in the data for legitimate reasons.
Eg- a CEO’s salary might actually be high compared to other employees.

5.4) Interquartile Range (IQR):
It is the difference between the third quartile (Q3) and the first quartile (Q1). It is robust to outliers.
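
A minimal sketch of the IQR, using the standard `statistics` module on made-up marks. It also applies the 1.5 × IQR rule that this report uses later for identifying outliers. The quartile method (`method="inclusive"`) is an assumption; other quartile conventions give slightly different values.

```python
import statistics

def iqr_bounds(values):
    """Return (Q1, Q3, IQR, lower fence, upper fence) for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1, q3, iqr, q1 - 1.5 * iqr, q3 + 1.5 * iqr

marks = [55, 60, 62, 65, 70, 72, 75, 80, 9700]   # 9700 is a typo for 97
q1, q3, iqr, low, high = iqr_bounds(marks)

# Values outside the fences are flagged as outliers.
outliers = [m for m in marks if m < low or m > high]

print("IQR:", iqr, "range:", max(marks) - min(marks))
print("outliers:", outliers)
```

The plain range (maximum minus minimum) is blown up by the typo, while the IQR is not, which illustrates why the IQR is called robust.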
5.5) Histograms:
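Histograms depict the underlying frequency of a set of discrete or continuous data that are measured on an
interval scale.
import pandas as pd
import matplotlib.pyplot as plt
# In a Jupyter notebook, run the magic %matplotlib inline first so that plots appear inline.
histogram = pd.read_csv("histogram.csv")       # read data from the CSV file
plt.hist(x='Overall Marks', data=histogram)    # histogram of the Overall Marks column
plt.show()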
5.6) Inferential Statistics
Inferential statistics allows us to make inferences about the population from the sample data.
5.7) Hypothesis Testing:
Hypothesis testing is a kind of statistical inference that involves asking a question, collecting data, and then
examining what the data tells us about how to proceed. The hypothesis to be tested is called the null hypothesis
and given the symbol H0. We test the null hypothesis against an alternative hypothesis, which is given the
symbol Ha.

5.8) T Tests:
Used when we have just a sample and the population statistics are not known.
The sample standard deviation is used to estimate the population standard deviation.
A t-test is more prone to errors, because we only have samples.
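
As an illustration, the one-sample t statistic can be computed by hand as t = (sample mean − hypothesised mean) / (s / √n), where s is the sample standard deviation. The sketch below uses made-up marks and a hypothesised population mean of 70. It only computes the statistic and the degrees of freedom; looking up the p-value is left out.

```python
import math
import statistics

def one_sample_t(sample, population_mean):
    """Return the t statistic and degrees of freedom for a one-sample t-test."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    sample_sd = statistics.stdev(sample)        # sample standard deviation (divides by n - 1)
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error, n - 1

marks = [72, 75, 78, 80, 85]                    # illustrative sample
t_stat, dof = one_sample_t(marks, population_mean=70)
print(round(t_stat, 3), dof)
```

A larger absolute t value means the sample mean lies further from the hypothesised mean relative to the sampling error.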

5.9) Z Score:
The z score (or standard score) is the distance, in number of standard deviations, of the observed value from the mean.

+Z – value is above mean.
-Z – value is below mean.
The distribution, once converted to z-scores, always has the same shape as the original distribution.
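
A short sketch of the conversion, z = (value − mean) / standard deviation, on made-up marks. Using the population standard deviation (`pstdev`) here is an assumption; with a sample one would typically use `stdev` instead.

```python
import statistics

def z_scores(values):
    """Convert each value to its distance from the mean in standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)    # population standard deviation (assumption)
    return [(v - mean) / sd for v in values]

marks = [50, 60, 70, 80, 90]
scores = z_scores(marks)
print(scores)
```

Values below the mean get negative scores, the mean itself gets 0, and values above the mean get positive scores.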


5.10) Chi Squared Test:


Used to test for a relationship between categorical variables.
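
A minimal sketch of the statistic for a contingency table: for every cell, the expected count is (row total × column total) / grand total, and the statistic is the sum of (observed − expected)² / expected. The table values are made up, and no continuity correction is applied.

```python
def chi_squared(table):
    """Chi-squared statistic and degrees of freedom for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Illustrative counts: rows are two groups, columns are two outcomes.
observed = [[30, 10],
            [20, 40]]
stat, dof = chi_squared(observed)
print(round(stat, 3), dof)
```

The statistic is then compared with the chi-squared distribution for the given degrees of freedom to decide whether the two variables appear related.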
Correlation:
Determines the relationship between two variables.
It is denoted by r. The value ranges from -1 to +1, where 0 means no linear relationship.
Syntax
import pandas as pd
data = pd.read_csv("data.csv")   # read data from the CSV file
data.corr()                      # pairwise correlation between the columns
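
To show what the coefficient measures, here is a sketch of Pearson's r computed by hand on made-up data (study hours against marks). The variable names and values are illustrative only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]
marks = [52, 58, 61, 70, 74]

r_positive = pearson_r(hours, marks)              # marks rise with hours: r close to +1
r_negative = pearson_r(hours, [10, 8, 6, 4, 2])   # perfectly decreasing: r equals -1
print(r_positive, r_negative)
```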


Predictive Modelling:
A data model helps organizations capture all the points of information necessary to perform operations and enact
policy based on the data they collect. This can be explained with an example of a sales transaction which is
broken down into related groups of data points, describing the customer, the seller, the item sold, and the payment
mechanism. For instance, if the sales transactions were recorded without the date on which they occurred, it
would be impossible to enforce certain return policies. Data modelling in data science is also performed to help
organizations ensure that they are collecting all the necessary items of information in the first place
Predictive modelling makes use of past data and its attributes to predict the future.
Eg-
Past: horror movies
Future: unwatched horror movies

Predicting stock price movement:

1. Analysing past stock prices.
2. Analysing similar stocks.
3. Predicting the required future stock price.
6.1) Types:
1. Supervised Learning:
Supervised learning is a type of algorithm that uses a known dataset (called the training dataset) to make
predictions. The training dataset includes input data and response values.
• Regression – the target has continuous possible values. Eg- marks
• Classification – the target has a discrete set of possible values. Eg- cancer prediction is either 0 or 1.
2. Unsupervised Learning:
Unsupervised learning is the training of a machine using information that is neither classified nor labelled. Here the
task of the machine is to group unsorted information according to similarities, patterns and differences
without any prior training on the data.
• Clustering: A clustering problem is where you want to discover the inherent groupings in the
data, such as grouping customers by purchasing behaviour.
• Association: An association rule learning problem is where you want to discover rules that
describe large portions of your data, such as people that buy X also tend to buy Y.

6.2) Stages of Predictive Modelling


1. Problem definition
2. Hypothesis Generation
3. Data Extraction/Collection
4. Data Exploration and Transformation
5. Predictive Modelling
6. Model Deployment/Implementation

6.3) Problem Definition:


Identify the right problem statement, ideally formulate the problem mathematically.
6.4) Hypothesis Generation:
List all possible variables which might influence the problem objective. These variables should be free from
personal bias and preferences.
The quality of the model is directly proportional to the quality of the hypotheses.

6.5) Data Extraction/Collection


Collect data from different sources and combine them for exploration and model building.
While looking at the data we might come across new hypotheses.

6.6) Data Exploration and Transformation


Data extraction is a process that involves retrieval of data from various sources for further data processing or
data storage.
Steps of Data Exploration
• Reading the data, eg- from a CSV file
• Variable identification
• Univariate Analysis
• Bivariate Analysis
• Missing value treatment
• Outlier treatment
• Variable Transformation

6.6.1) Variable Identification:

It is the process of identifying whether a variable is
1. Independent or dependent variable
2. Continuous or categorical variable
Why do we perform variable identification?
1. Techniques like supervised learning require identification of the dependent variable.
2. Different data processing techniques are used for categorical and continuous data.

Categorical variable – stored as object.
Continuous variable – stored as int or float.
6.6.2) Univariate Analysis:
1. Explore one variable at a time.
2. Summarize the variable.
3. Make sense out of that summary to discover insights, anomalies, etc.
6.6.3) Bivariate Analysis:
• When two variables are studied together for their empirical relationship.
• When you want to see whether the two variables are associated with each other.
• It helps in prediction and detecting anomalies.
6.6.4) Missing Value Treatment:
Reasons of missing value:
1. Non-response. Eg- when you collect data on people’s income and many choose not to answer.
2. Error in data collection. Eg- faulty data.
3. Error in data reading.
Types:
1. MCAR (Missing completely at random): Missing values have no relation to the variable in which the
missing values exist or to the other variables in the dataset.
2. MAR (Missing at random): Missing values have no relation to the variable in which the missing values exist, but are
related to other variables in the dataset.
3. MNAR (Missing not at random): Missing values have a relation to the variable in which the missing values
exist.
Identifying missing values
Syntax: -
1. describe()
2. isnull()
The output of isnull() will be True or False for each value.
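
The idea of flagging missing values, and of filling them with the mean (one of the imputation options for continuous variables), can be sketched without any library. The income figures are made up, and `None` stands in for a missing entry; in pandas the flags would come from `isnull()`.

```python
import statistics

incomes = [42000, None, 38000, 51000, None, 45000]   # None marks a missing value

# Identify: one True/False flag per value, similar in spirit to isnull().
is_missing = [value is None for value in incomes]

# Impute: replace each missing value with the mean of the observed values.
observed = [value for value in incomes if value is not None]
fill_value = statistics.mean(observed)
imputed = [fill_value if value is None else value for value in incomes]

print(is_missing)
print(imputed)
```

Mean imputation keeps every row but shrinks the spread of the variable, which is one reason the median or a model-based estimate is sometimes preferred.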
Different methods to deal with missing values
1. Imputation
Continuous – impute with the mean, median or a regression model.
Categorical – impute with the mode or a classification model.
2. Deletion
Row-wise or column-wise deletion. However, it leads to loss of data.

Outlier Treatment
Reasons of Outliers:
1. Data entry Errors
2. Measurement Errors
3. Processing Errors
4. Change in underlying population
6.7) Types of Outliers:
6.7.1) Univariate
Analysing only one variable for outlier.
Example: in a box plot of height and weight, only weight will be analysed for outliers.
6.7.2) Bivariate
Analysing both variables for outliers.
Eg- in a scatter plot of height and weight, both will be analysed.
Identifying Outlier
Graphical Method
• Box Plot

• Scatter Plot

Formula Method
Using the box plot rule, a value is an outlier if it is
< Q1 - 1.5 * IQR or > Q3 + 1.5 * IQR
where IQR = Q3 – Q1
Q3 = value of the 3rd quartile
Q1 = value of the 1st quartile

Treating Outliers
1. Deleting observations
2. Transforming and binning values
3. Imputing outliers like missing values
4. Treating them separately

Variable Transformation
It is the process by which –
1. We replace a variable with some function of that variable. Eg – replacing a variable x with its log.
2. We change the distribution or relationship of a variable with others.
Used to –
1. Change the scale of a variable
2. Transform nonlinear relationships into linear relationships
3. Create a symmetric distribution from a skewed distribution.
Common methods of Variable Transformation – Logarithm, Square root, Cube root, Binning, etc.
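
As a small illustration of the logarithm method, the sketch below measures skewness before and after a log transform on made-up, right-skewed income values. The skewness formula used (mean cubed deviation divided by the cubed population standard deviation) is one common definition and is an assumption here.

```python
import math
import statistics

def skewness(values):
    """Population skewness: mean cubed deviation divided by the cubed standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)

incomes = [20, 22, 25, 27, 30, 35, 40, 60, 120, 400]   # right-skewed, illustrative
log_incomes = [math.log(v) for v in incomes]           # variable replaced by its log

skew_before = skewness(incomes)
skew_after = skewness(log_incomes)
print(skew_before, skew_after)
```

The skewness after the transform is smaller, meaning the distribution has moved towards symmetry.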


Model Building
It is a process to create a mathematical model for estimating / predicting the future based on past data.
Example-
A retail bank wants to know the default behaviour of its credit card customers. They want to predict the probability of
default for each customer in the next three months.
• Probability of default would lie between 0 and 1.
• Assume every customer has a 10% default rate.
Probability of default for each customer in next 3 months = 0.1
The model moves the probability towards one of the extremes based on attributes of past information.
A customer with volatile income is more likely to default (closer to 1).
A customer with a healthy credit history over the last years has low chances of default (closer to 0).

Steps in Model Building


1. Algorithm Selection
2. Training Model
3. Prediction / Scoring

Algorithm Selection
Eg- Predict whether the customer will buy the product or not.

7.1) Algorithms
• Logistic Regression
• Decision Tree
• Random Forest

Training Model
It is the process of learning the relationship / correlation between independent and dependent variables.
We use the dependent variable of the train data set to predict/estimate.
Dataset
• Train
Past data (known dependent variable).
Used to train model.
• Test
Future data (unknown dependent variable)
Used to score.
Prediction / Scoring
It is the process of estimating/predicting the dependent variable of the test data set by applying model rules.
We apply what was learnt in training to the test data set for prediction/estimation.

7.2) Algorithm of Machine Learning


 Linear Regression
Linear regression is a statistical approach for modelling the relationship between a dependent variable and a given
set of independent variables.
It is assumed that the two variables are linearly related. Hence, we try to find a linear function that predicts the
response value (y) as accurately as possible as a function of the feature or independent variable (x).

The equation of the regression line is represented as:
h(x) = b0 + b1 * x
where b0 is the intercept and b1 is the slope.

The squared error or cost function, J, is represented as:
J(b0, b1) = (1 / 2n) * sum over i of (h(x_i) - y_i)^2
where n is the number of observations.

[Figure: plot of Y-Values (0 to 14) against x (0 to 9)]
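
A minimal sketch of fitting such a line by least squares for a single feature, using the closed-form estimates b1 = Sxy / Sxx and b0 = mean(y) − b1 · mean(x). The data points are made up for illustration, and the cost uses the common 1/(2n) scaling, which is an assumption about the convention.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    b1 = s_xy / s_xx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def cost(x, y, b0, b1):
    """Squared-error cost: sum of squared residuals divided by 2n."""
    n = len(x)
    return sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y)) / (2 * n)

x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]         # illustrative feature values
y = [1, 3, 2, 5, 7, 8, 8, 9, 10, 12]       # illustrative responses
b0, b1 = fit_line(x, y)
print(round(b0, 4), round(b1, 4), round(cost(x, y, b0, b1), 4))
```

Any other choice of intercept and slope gives a cost at least as large as the fitted one, which is what "least squares" means.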


 Logistic Regression

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary
dependent variable, although many more complex extensions exist.

C = -(y log(p) + (1 - y) log(1 - p)), where y is the actual label (0 or 1) and p is the predicted probability that y = 1.
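
The logistic (sigmoid) function and the cost for a single observation can be sketched as follows. The input value 2.0 is arbitrary and only serves to produce a probability; `p` denotes the predicted probability of class 1.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_true, p):
    """Cost for one observation: -(y * log(p) + (1 - y) * log(1 - p))."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

p = sigmoid(2.0)                 # predicted probability of class 1
loss_if_one = log_loss(1, p)     # small cost: the prediction agrees with the label
loss_if_zero = log_loss(0, p)    # large cost: the prediction disagrees with the label
print(p, loss_if_one, loss_if_zero)
```

The cost is low when the predicted probability is close to the actual label and grows quickly as the prediction moves towards the wrong class.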

 K-Means Clustering (Unsupervised learning)

K-means clustering is a type of unsupervised learning, which is used when you have unlabelled data (i.e., data
without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number
of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K
groups based on the features that are provided. Data points are clustered based on feature similarity.
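
The iterative assign-and-update loop described above can be sketched for one-dimensional data in plain Python. The spending values, the choice of K = 2 and the starting centroids are all made up for illustration; in practice a library implementation with better initialisation would normally be used.

```python
def k_means_1d(points, centroids, iterations=10):
    """Cluster 1-D points around the given starting centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

spend = [10, 12, 14, 50, 52, 55]                    # illustrative customer spend
centroids, clusters = k_means_1d(spend, centroids=[10, 55])
print(centroids)
print(clusters)
```

With these values the low spenders and the high spenders end up in separate groups, and the centroids settle at the mean of each group.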


METHODOLOGY
PREDICTING IF CUSTOMER BUYS TERM DEPOSIT
Problem Statement:

Your client is a retail banking institution. Term deposits are a major source of income for a bank.
A term deposit is a cash investment held at a financial institution. Your money is invested for an
agreed rate of interest over a fixed amount of time, or term. The bank has various outreach plans
to sell term deposits to their customers such as email marketing, advertisements, telephonic
marketing and digital marketing.
Telephonic marketing campaigns still remain one of the most effective ways to reach out to
people. However, they require huge investment as large call centers are hired to actually execute
these campaigns. Hence, it is crucial to identify the customers most likely to convert beforehand
so that they can be specifically targeted via call.
You are provided with the client data such as: age of the client, their job type, their marital
status, etc. Along with the client data, you are also provided with the information of the call
such as the duration of the call, day and month of the call, etc. Given this information, your
task is to predict if the client will subscribe to a term deposit.

Data Dictionary: [table not reproduced in the extracted text]

Prerequisites:
We have the following files:
• train.csv: This dataset will be used to train the model. This file contains all the client and
call details as well as the target variable “subscribed”.
• test.csv: The trained model will be used to predict whether a new set of clients will
subscribe to the term deposit or not for this dataset.
• TEST.csv file: [screenshot not reproduced in the extracted text]
• TRAIN.csv file: [screenshot not reproduced in the extracted text]

Problem Description
We are provided with the following files: train.csv and test.csv.
Use the train.csv dataset to train the model. This file contains all the client and call details as well as the target
variable “subscribed”. Then use the trained model to predict whether a new set of clients will subscribe to the term
deposit.


RESULTS
In this 6-week training I successfully learnt about DATA SCIENCE, and I am now able
to perform data analysis using Python. I also attempted various quizzes and assignments provided
for periodic evaluation during the 6 weeks and completed this training with an 82% score in the Final Test.

REFERENCES
1) WIKIPEDIA.COM
2) TEACHNOOK .REPORT
3) SCRIBB.NET

***
