Data Science Training Report 2023
DECLARATION
I hereby declare that the Industrial Training Report on Data Science completed at
Teachnook is an authentic record of my own work, carried out as a requirement of Industrial
Training as part of the V semester syllabus during the period from April 2023 to
June 2023 submitted at the Department of Artificial Intelligence and Data Science,
Engineering College Bikaner for the award of the degree of B.Tech. in Artificial
Intelligence and Data Science by Bikaner Technical University, Bikaner.
Krishna
21EEBAD019
Table of Contents
Certificate
Student Declaration
1. Introduction
1.1) Data science
1.2) Data Science Process
2. My Learning
2.1) Introduction to Data Science
2.2) Python for Data Science
2.3) Understanding the statistics for Data Science
2.4) Predictive Modelling and Basics of Machine Learning
4. Python Introduction
4.1) History of python
4.2) Features of Python
4.3) Python for data science
4.4) Why Python
5. Statistics
5.1) Descriptive Statistics
5.2) Types of Variable
5.3) Outliers
5.4) Range
5.5) Histogram
5.6) Inferential statistics
5.7) Hypothesis testing
5.8) T-Test
5.9) Z Score
5.10) Chi Squared Test
6. Predictive Modelling
6.1) Types
6.2) Stages of Predictive Modelling
6.3) Problem Definition
6.4) Problem Generation
6.5) Data Extraction and Collection
6.6) Data Exploration and Transformation
6.6.1) Variable Treatment
6.6.2) Univariate Analysis
6.6.3) Bivariate Analysis
6.6.4) Missing Value Treatment
6.7) Types of Outliers
6.7.1) Univariate
6.7.2) Bivariate
7. Model Building
7.1) Algorithm
7.2) Algorithm of Machine Learning
8. Methodology
9. Result
10. References
INTRODUCTION
OBJECTIVES
To explore, sort, and analyse large volumes of data from various sources in order to draw
conclusions that optimize business processes and support decision-making.
Examples include predictive maintenance of machines and, in marketing and sales,
sales forecasting based on weather.
5. Finally, we get to the sexiest part: model building (often referred to as “data modelling”).
It is now that you attempt to gain the insights or make the predictions
stated in your project charter. Now is the time to bring out the heavy guns, but remember
research has taught us that often (but not always) a combination of simple models tends to
outperform one complicated model. If you’ve done this phase right, you’re almost done.
6. The last step of the data science process is presenting your results and automating the analysis,
if needed. One goal of a project is to change a process and/or make better decisions. You
may still need to convince the business that your findings will indeed change the business
process as expected. This is where you can shine in your influencer role. The importance of
this step is more apparent in projects on a strategic and tactical level. Certain projects require
you to perform the business process over and over again, so automating the project will save
time.
MY LEARNINGS
2.1) INTRODUCTION TO DATA SCIENCE
• Overview & Terminologies in Data Science
Applications of Data Science
Anomaly detection (fraud, disease, etc.)
Automation and decision-making (credit worthiness, etc.)
Classifications (classifying emails as “important” or “junk”)
Forecasting (sales, revenue, etc.)
Pattern detection (weather patterns, financial market patterns, etc.)
Recognition (facial, voice, text, etc.)
Recommendations (based on learned preferences, recommendation engines can refer
you to movies, restaurants and books you may like)
3.2) Applications:
3.3) Computer Vision - Advances in image recognition by computers rely on processing large sets
of image data from multiple objects of the same category. For example, face recognition.
[Figure: stages of analytics and the question each one answers]
• Reporting: What happened?
• Detective Analysis: Why did it happen?
• Dashboards: What’s happening now?
• Predictive Analysis: What is likely to happen?
• Big Data
Detective Analysis
Asking questions based on the data we are seeing, such as why something happened.
Predictive Modelling
Big Data
The stage where the complexity of handling data goes beyond what traditional systems can manage.
This can be caused by the volume, variety, or velocity of the data. Specific tools are needed to analyse data at this scale.
• Social Media
1. Recommendation Engine
2. Ad placement
3. Sentiment Analysis
• Deciding the right credit limit for credit card customers.
• Suggesting the right products to customers of e-commerce companies
1. Recommendation System
2. Past Data Searched
3. Discount Price Optimization
• How do Google and other search engines know which results are most relevant to our search query?
1. Apply ML and Data Science
2. Fraud Detection
3. Ad placement
3.5) Reason for choosing data science
Data science has become a revolutionary technology that everyone seems to talk about. Hailed as the ‘sexiest
job of the 21st century’, data science is a buzzword, with very few people knowing about the technology in its
true sense.
While many people wish to become data scientists, it is essential to weigh the pros and cons of data science
and give a realistic picture. This section discusses these points and provides the
necessary insights about data science.
Advantages:
1. It’s in Demand
2. Abundance of Positions
3. A Highly Paid Career
4. Data Science is Versatile
Disadvantages:
1. Mastering Data Science is Nearly Impossible
2. A Large Amount of Domain Knowledge is Required
3. Arbitrary Data May Yield Unexpected Results
4. The problem of Data Privacy
Python Introduction
PYTHON
Python is an interpreted, high-level, general-purpose programming language. It has efficient high-level data
structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and
dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application
development in many areas on most platforms.
Easy-to-maintain:
Python's source code is fairly easy to maintain.
Interactive Mode:
Python has support for an interactive mode which allows interactive testing and debugging
of snippets of code.
Portable:
Python can run on a wide variety of hardware platforms and has the same interface on all
platforms.
Extendable:
You can add low-level modules to the Python interpreter. These modules
enable programmers to add to or customize their tools to be more efficient.
Databases:
Python provides interfaces to all major commercial databases.
GUI Programming:
Python supports GUI applications that can be created and ported to many system calls, libraries and
windows systems, such as Windows MFC, Macintosh, and the X Window system of Unix.
Scalable:
o Python provides a better structure and support for large programs than shell scripting.
Python has a big list of good features:
o It can be used as a scripting language or can be compiled to byte-code for building large
applications.
o It provides very high-level dynamic data types and supports dynamic type checking.
o It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.
LISTS: A list is an ordered data structure with elements separated by commas and enclosed within square
brackets [].
DICTIONARY: A dictionary is a data structure that stores elements as key: value pairs, separated by commas and
enclosed within curly braces {}. Since Python 3.7, a dictionary preserves the order in which keys were inserted.
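A minimal sketch of both structures is given here. The variable names and values are illustrative assumptions and are not taken from the training material.

```python
# A list keeps its elements in order and allows duplicates.
marks = [72, 85, 85, 91]
marks.append(64)            # add an element at the end
first_mark = marks[0]       # access by position (index starts at 0)

# A dictionary maps unique keys to values.
student = {"name": "Asha", "branch": "AI&DS"}
student["semester"] = 5     # add a new key: value pair
branch = student["branch"]  # access by key

print(marks, first_mark)
print(student, branch)
```

Lists are indexed by position, while dictionaries are indexed by key, which is why a dictionary suits record-like data.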
Statistics:
5.1) Descriptive Statistics
Mode
It is the value that occurs most frequently in the data series.
It is robust and is generally not affected much by the addition of a couple of new values.
Code:
import pandas as pd
data = pd.read_csv("Mode.csv")  # read data from the CSV file
print(data.head())  # print the first five rows
mode_data = data['Subject'].mode()  # take the mode of the Subject column
print(mode_data)
Mean
It is the arithmetic average of the values in the data series.
Code:
import pandas as pd
data = pd.read_csv("mean.csv")  # read data from the CSV file
print(data.head())  # print the first five rows
mean_data = data['Overallmarks'].mean()  # take the mean of the Overallmarks column
print(mean_data)
Median
The middle value of the data set when the values are sorted.
Code:
import pandas as pd
data = pd.read_csv("data.csv")  # read data from the CSV file
print(data.head())  # print the first five rows
median_data = data['Overallmarks'].median()  # take the median of the Overallmarks column
print(median_data)
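The three measures can also be tried without any CSV file. The sketch below builds a small in-memory table; the column names follow the snippets in this section, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical marks data standing in for the CSV files (assumed values).
data = pd.DataFrame({
    "Subject": ["Maths", "Physics", "Maths", "Chemistry", "Maths"],
    "Overallmarks": [70, 80, 90, 60, 100],
})

mode_subject = data["Subject"].mode()[0]      # most frequent subject
mean_marks = data["Overallmarks"].mean()      # arithmetic average
median_marks = data["Overallmarks"].median()  # middle value after sorting

print(mode_subject, mean_marks, median_marks)
```

Note that `mode()` returns a Series because several values can tie for most frequent; `[0]` picks the first of them.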
5.2) Types of variables
• Continuous – Takes continuous numeric values. E.g., marks
• Categorical – Takes discrete values. E.g., gender
• Ordinal – Ordered categorical variable. E.g., teacher feedback
• Nominal – Unordered categorical variable. E.g., gender
5.3) Outliers
Any value that falls far outside the usual range of the data is termed an outlier. E.g., 9700 instead of 97.
Reasons for Outliers
• Typos – Made during data collection. E.g., adding an extra zero by mistake.
• Measurement Error – Outliers in the data caused by a faulty measurement process.
• Intentional Error – Errors that are introduced intentionally. E.g., claiming a smaller amount of alcohol
consumed than the actual amount.
• Legitimate Outlier – Values that are not errors but appear in the data for legitimate reasons.
5.8) T-Test:
Used when we have only a sample and not the population statistics.
The sample standard deviation is used to estimate the population standard deviation.
The t-test is more prone to error because we only have samples.
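As an illustration, the sketch below computes a one-sample t statistic, t = (sample mean - hypothesised mean) / (s / sqrt(n)), where s is the sample standard deviation. The one-sample form, the function name `one_sample_t`, and the data are assumptions chosen for this example.

```python
import math
import statistics

def one_sample_t(sample, population_mean):
    """Return the t statistic of a sample against a hypothesised mean."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    sample_sd = statistics.stdev(sample)       # sample SD (divides by n - 1)
    standard_error = sample_sd / math.sqrt(n)  # estimated SD of the sample mean
    return (sample_mean - population_mean) / standard_error

marks = [78, 82, 85, 88, 92]  # hypothetical sample
t_stat = one_sample_t(marks, population_mean=80)
print(round(t_stat, 3))
```

The statistic is then compared with the t distribution for n - 1 degrees of freedom to decide whether the difference is significant.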
5.9) Z Score:
The z score, or standard score, is the distance of an observed value from the mean, measured in
number of standard deviations.
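A small sketch of this definition, z = (x - mean) / standard deviation, follows. It assumes the data are treated as the whole population, so the population standard deviation is used; the data values are hypothetical.

```python
import statistics

def z_score(value, values):
    """Number of standard deviations `value` lies from the mean of `values`."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return (value - mean) / sd

marks = [60, 70, 80, 90, 100]  # hypothetical data
z = z_score(100, marks)
print(round(z, 3))
```

A positive z score means the value is above the mean, and a negative one means it is below.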
Predictive Modelling:
A data model helps organizations capture all the points of information necessary to perform operations and enact
policy based on the data they collect. This can be explained with an example of a sales transaction which is
broken down into related groups of data points, describing the customer, the seller, the item sold, and the payment
mechanism. For instance, if the sales transactions were recorded without the date on which they occurred, it
would be impossible to enforce certain return policies. Data modelling in data science is also performed to help
organizations ensure that they are collecting all the necessary items of information in the first place.
In predictive modelling, we use past data and its attributes to predict the future.
E.g.:
Past: horror movies watched
Future: unwatched horror movies
5. Predictive Modelling
6. Model Development/Implementation
Reasons for Outliers:
1. Data entry Errors
2. Measurement Errors
3. Processing Errors
4. Change in underlying population
6.7) Types of Outliers:
6.7.1) Univariate
Analysing only one variable for outliers.
Example:
– In a box plot of height and weight,
weight will be analysed for outliers.
6.7.2) Bivariate
Analysing both variables for outliers.
E.g., in a scatter plot of height and weight, both will be analysed.
Identifying Outliers
Graphical Method
• Box Plot
• Scatter Plot
Formula Method
Using the box plot, a value x is an outlier if
x < Q1 - 1.5 * IQR or x > Q3 + 1.5 * IQR
where IQR = Q3 - Q1,
Q1 = value of the 1st quartile,
Q3 = value of the 3rd quartile.
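The formula method can be sketched as follows. Quartile conventions differ between tools; this example assumes the "inclusive" method of Python's `statistics.quantiles`, and the function name `iqr_outliers` and the data are hypothetical.

```python
import statistics

def iqr_outliers(values):
    """Return the values lying outside the box-plot fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

weights = [50, 52, 55, 58, 60, 62, 65, 68, 150]  # hypothetical weights in kg
outliers = iqr_outliers(weights)
print(outliers)
```

Whether a flagged value is an error or a legitimate outlier still has to be judged from the context of the data.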
Model Building
It is a process to create a mathematical model for estimating / predicting the future based on past data.
Example-
A retailer wants to know the default behaviour of its credit card customers. It wants to predict the probability of
default for each customer in the next three months.
• The probability of default lies between 0 and 1.
• Assume, as a starting point, that every customer has a 10% default rate.
Probability of default for each customer in the next 3 months = 0.1
The model moves this probability towards one of the extremes based on attributes from past information.
A customer with volatile income is more likely to default (closer to 1).
A customer with a healthy credit history over the last few years has a low chance of default (closer to 0).
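The idea of starting from one base rate and then moving it using an attribute can be sketched with simple counting. The records, the attribute name, and the helper `default_rate` are hypothetical; a real model would combine many attributes.

```python
# Hypothetical past records: (income_type, defaulted), with defaulted as 0 or 1.
past_customers = [
    ("stable", 0), ("stable", 0), ("stable", 0), ("stable", 0), ("stable", 0),
    ("stable", 0), ("volatile", 0), ("volatile", 0), ("volatile", 1), ("stable", 0),
]

def default_rate(records):
    """Share of customers in `records` who defaulted."""
    return sum(defaulted for _, defaulted in records) / len(records)

# Without attributes, every customer gets the same estimate.
base_rate = default_rate(past_customers)

# Conditioning on an attribute moves the estimate away from the base rate.
rate_by_income = {}
for income_type in ("stable", "volatile"):
    group = [r for r in past_customers if r[0] == income_type]
    rate_by_income[income_type] = default_rate(group)

print(base_rate, rate_by_income)
```

In this toy data the estimate for volatile-income customers rises above the base rate, while the estimate for stable-income customers falls below it.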
Algorithm Selection
Example-
7.1) Algorithms
• Logistic Regression
• Decision Tree
• Random Forest
Training Model
It is the process of learning the relationship / correlation between independent and dependent variables.
We use the dependent variable of the train data set to learn how to predict/estimate.
Dataset
• Train
Past data (known dependent variable).
Used to train the model.
• Test
Future data (unknown dependent variable).
Used to score.
Prediction / Scoring
It is the process of estimating/predicting the dependent variable of the test data set by applying the model rules.
We apply what was learned in training to the test data set for prediction/estimation.
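The train-then-score flow can be sketched with a deliberately simple model: a single cut-off on one feature. This is an illustrative stand-in, not one of the algorithms used in the training; the data and the helpers `fit_threshold` and `predict` are hypothetical.

```python
# Train: past data with a known dependent variable (hypothetical values).
train = [(25, 0), (32, 0), (41, 1), (47, 1), (29, 0), (52, 1)]  # (age, bought)
# Test: new data where the dependent variable is unknown.
test_ages = [30, 45, 38]

def fit_threshold(rows):
    """Learn the cut-off on the single feature that best separates the labels."""
    best_cut, best_correct = None, -1
    for cut, _ in rows:
        # Count training rows classified correctly by the rule "x >= cut".
        correct = sum((x >= cut) == bool(y) for x, y in rows)
        if correct > best_correct:
            best_cut, best_correct = cut, correct
    return best_cut

def predict(cut, xs):
    """Score new rows by applying the learned rule."""
    return [int(x >= cut) for x in xs]

cut = fit_threshold(train)             # training
predictions = predict(cut, test_ages)  # prediction / scoring
print(cut, predictions)
```

The model only ever sees the dependent variable of the train set; the test rows are scored purely from their independent variable.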
Logistic Regression
Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary
dependent variable, although many more complex extensions exist.
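The logistic function itself can be sketched in a few lines. It maps a weighted sum of the features to a probability between 0 and 1. The weights and features below are hypothetical and are not fitted to any data.

```python
import math

def logistic(z):
    """Map any real number z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Probability of the positive class for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)

# Hypothetical weights, chosen only for illustration.
p = predict_probability(features=[2.0, 1.0], weights=[0.5, -1.0], bias=0.0)
print(p)
```

Training a logistic regression model means finding the weights and bias that make these probabilities match the observed binary outcomes as closely as possible.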
K-Means Clustering
K-means clustering is a type of unsupervised learning, which is used when you have unlabelled data (i.e., data
without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number
of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K
groups based on the features that are provided. Data points are clustered based on feature similarity.
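The iterative assign-and-update loop can be sketched for one-dimensional data. The starting centroids are fixed by hand here so the run is repeatable; the function name `kmeans_1d` and the data are hypothetical.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Cluster 1-D points around K centroids by repeated assign/update steps."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # hypothetical unlabelled data
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])  # K = 2
print(centroids, clusters)
```

With real data the features usually have several dimensions, and the absolute difference is replaced by a distance such as the Euclidean distance.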
METHODOLOGY
PREDICTING IF CUSTOMER BUYS TERM DEPOSIT
Problem Statement:
Your client is a retail banking institution. Term deposits are a major source of income for a bank.
A term deposit is a cash investment held at a financial institution. Your money is invested for an
agreed rate of interest over a fixed amount of time, or term. The bank has various outreach plans
to sell term deposits to their customers such as email marketing, advertisements, telephonic
marketing and digital marketing.
Telephonic marketing campaigns still remain one of the most effective ways to reach out to
people. However, they require huge investment as large call centers are hired to actually execute
these campaigns. Hence, it is crucial to identify the customers most likely to convert beforehand
so that they can be specifically targeted via call.
You are provided with the client data such as: age of the client, their job type, their marital
status, etc. Along with the client data, you are also provided with the information of the call
such as the duration of the call, day and month of the call, etc. Given this information, your
task is to predict if the client will subscribe to a term deposit.
Data Dictionary:
Prerequisites:
We have the following files:
• train.csv: This dataset will be used to train the model. This file contains all the client and
call details as well as the target variable “subscribed”.
• test.csv: The trained model will be used to predict whether the new set of clients in this dataset will
subscribe to the term deposit or not.
• TEST.csv file:
• TRAIN.csv file:
Problem Description
We are provided with the following files: train.csv and test.csv.
Use the train.csv dataset to train the model. This file contains all the client and call details as well as the target
variable “subscribed”. Then use the trained model to predict whether a new set of clients will subscribe to the term
deposit.
RESULTS
During this six-week training, I successfully learnt about data science. I am now able
to perform data analysis using Python. I also attempted various quizzes and assignments provided
for periodic evaluation during the six weeks and completed the training with a score of 82% in the final test.
REFERENCES
1) WIKIPEDIA.COM
2) TEACHNOOK .REPORT
3) SCRIBB.NET