Data Science Lecture No 02
02 AI – 7th, SEN – 5th
12/15/2024 1
Data Science
Lecture Contents
Data Science
Understanding Data Science
Exploratory Data Analysis
Data Science
Data science is the application of computational and statistical techniques to
address or gain insight into some problem in the real world.
Data science = statistics +
data processing +
machine learning +
scientific inquiry +
visualization +
business analytics +
big data + …
CRISP-DM process
CRoss-Industry Standard Process for Data Mining (CRISP-DM)
Data Science Process Step
Understanding data science
Data requirements:
An organization can draw data from many sources. It is important to understand what type of data the organization needs
to collect, curate, and store.
In addition, the data must be categorized as numerical or categorical, and the format for storage and dissemination
must be decided.
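As a minimal sketch of the categorization step, the snippet below labels each column of a small table as numerical or categorical. The records and the helper name `column_kind` are hypothetical, invented for illustration; only the Python standard library is used.

```python
# Hypothetical records; in practice these would come from the organization's data sources.
records = [
    {"age": 34, "region": "north", "income": 52000.0},
    {"age": 27, "region": "south", "income": 48000.5},
    {"age": 45, "region": "north", "income": 61000.0},
]

def column_kind(values):
    """Return 'numerical' if every non-missing value is an int or float, else 'categorical'."""
    present = [v for v in values if v is not None]
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    return "numerical" if present and is_number else "categorical"

# Map each column name to its inferred kind.
kinds = {name: column_kind([row[name] for row in records]) for name in records[0]}
print(kinds)
```

A rule this simple treats numeric codes (for example, a postal code stored as an integer) as numerical, so real data usually needs a manual review of the result.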
Data collection:
Data collected from several sources must be stored in the correct format and transferred to the right information technology
personnel within a company. As mentioned previously, data can be collected from several objects on several events using
different types of sensors and storage tools.
Data processing:
Preprocessing means curating the dataset before the actual analysis. Common tasks include correctly
exporting the dataset, placing the data in the right tables, structuring it, and saving it in the correct format.
Data cleaning:
Preprocessed data is still not ready for detailed analysis. It must first be checked for incompleteness,
duplicates, errors, and missing values. These tasks are performed in the data cleaning stage,
which involves responsibilities such as matching the correct record, finding inaccuracies in the dataset, understanding
the overall data quality, removing duplicate items, and filling in the missing values.
However, how can we identify these anomalies in a dataset?
An example of data cleaning would be using outlier detection methods for quantitative data cleaning.
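The cleaning tasks above can be sketched on a toy dataset as follows. The rows are hypothetical, and the choices made here (filling with the median, flagging outliers with the 1.5 x IQR rule) are common conventions assumed for illustration, not the only valid ones.

```python
import statistics

# Hypothetical raw rows: (id, value); None marks a missing value.
rows = [(1, 10.0), (2, 12.0), (2, 12.0), (3, None), (4, 11.0), (5, 13.0), (6, 95.0)]

# 1. Remove exact duplicates while keeping the original order.
deduplicated = list(dict.fromkeys(rows))

# 2. Fill missing values with the median of the observed values.
observed = [v for _, v in deduplicated if v is not None]
median = statistics.median(observed)
filled = [(i, median if v is None else v) for i, v in deduplicated]

# 3. Flag outliers with the interquartile range (IQR) rule.
values = [v for _, v in filled]
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [(i, v) for i, v in filled if v < low or v > high]
print(outliers)
```

Whether a flagged value is an error or a genuine extreme observation is a judgment call that depends on the domain; the rule only points at candidates.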
EDA:
Exploratory data analysis is the stage where we actually start to understand the message contained in the data.
Communication:
This stage deals with disseminating the results to end stakeholders to use the result for business
intelligence. One of the most notable steps in this stage is data visualization.
Visualization deals with information relay techniques such as tables, charts, summary diagrams,
and bar charts to show the analyzed result.
Prior Knowledge
Gaining information on:
Data Preparation / Data exploration
Data Exploration
Data quality
Handling missing values
Data type conversion
Transformation
Outliers
Feature selection
Sampling
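Three of the preparation steps listed above (data type conversion, transformation, and sampling) can be sketched as follows. The column of heights is hypothetical, and min-max scaling is just one assumed choice of transformation.

```python
import random

# Hypothetical column read from a text file, so every value arrives as a string.
raw_heights = ["150", "160", "170", "180", "190"]

# Data type conversion: strings to floats.
heights = [float(h) for h in raw_heights]

# Transformation: min-max scaling to the range [0, 1].
lo, hi = min(heights), max(heights)
scaled = [(h - lo) / (hi - lo) for h in heights]

# Sampling: draw 3 values without replacement; the fixed seed makes the draw repeatable.
rng = random.Random(42)
sample = rng.sample(scaled, k=3)
print(scaled)
print(sample)
```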
Introduction to Exploratory Data Analysis (EDA)
Key aspects of EDA
Correlation Analysis
Checking the relationships between variables to understand how they might affect each other. This
includes computing correlation coefficients and creating correlation matrices.
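A correlation matrix can be sketched with the Pearson coefficient computed by hand. The three variables below are hypothetical, and the helper `pearson` is written here for illustration; in practice a data analysis library would normally provide this.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical variables, invented for illustration.
data = {
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],
    "hours_slept": [9, 8, 8, 6, 5],
}

# Correlation matrix as a nested dict: matrix[a][b] is the correlation of a with b.
names = list(data)
matrix = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}
for a in names:
    print(a, [round(matrix[a][b], 2) for b in names])
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship; a high coefficient does not by itself show that one variable causes the other.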
Summary Statistics
Calculating key statistics that provide insight into data trends and nuances
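As a minimal sketch, the usual summary statistics can be computed with Python's `statistics` module. The sales figures are hypothetical.

```python
import statistics

# Hypothetical sample of daily sales figures.
sales = [12, 15, 11, 19, 15, 22, 15, 18]

summary = {
    "count": len(sales),
    "mean": statistics.mean(sales),
    "median": statistics.median(sales),
    "mode": statistics.mode(sales),
    "stdev": statistics.stdev(sales),  # sample standard deviation
    "min": min(sales),
    "max": max(sales),
}
for name, value in summary.items():
    print(f"{name}: {value}")
```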
Testing Assumptions
Many statistical tests and models assume the data meet certain conditions (like normality).
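One rough way to probe a normality assumption is to look at skewness and excess kurtosis, both of which are close to 0 for normally distributed data. The sketch below is a heuristic under that assumption, not a formal test such as Shapiro-Wilk; the two samples and the helper name are hypothetical.

```python
import statistics

def skewness_and_excess_kurtosis(values):
    """Population skewness and excess kurtosis; both are near 0 for normal data."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = [2, 4, 4, 5, 5, 5, 6, 6, 8]
right_skewed = [1, 1, 1, 2, 2, 3, 4, 8, 20]

skew_sym, kurt_sym = skewness_and_excess_kurtosis(symmetric)
skew_right, kurt_right = skewness_and_excess_kurtosis(right_skewed)
print(f"symmetric:    skew={skew_sym:.2f}, excess kurtosis={kurt_sym:.2f}")
print(f"right-skewed: skew={skew_right:.2f}, excess kurtosis={kurt_right:.2f}")
```

A clearly non-zero skewness, as in the second sample, suggests that a method assuming normality may need a transformation of the data or a different technique. With samples this small, the numbers are only indicative.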
Why Is Exploratory Data Analysis Important?
Exploratory Data Analysis (EDA) is important for several reasons, especially in the context of data
science and statistical modeling. Here are some of the key reasons why EDA is a critical step in the
data analysis process:
Understanding Data Structures
Identifying Patterns and Relationships
Detecting Anomalies and Outliers
Testing Assumptions
Informing Feature Selection and Engineering
Optimizing Model Design
Facilitating Data Cleaning
Enhancing Communication
EDA Importance
Understanding Data Structures
o EDA helps in getting familiar with the dataset, understanding the number of features, the
type of data in each feature, and the distribution of data points. This understanding is
crucial for selecting appropriate analysis or prediction techniques.
Enhancing Communication
o Visual and statistical summaries from EDA can make it easier to communicate findings
and convince others of the validity of your conclusions, particularly when explaining data-
driven insights to stakeholders without technical backgrounds.
Traditional Vs Machine Learning Model
Data Science process
Thank You!