Data Science Lecture No 02

Uploaded by

abdul baqi
Lecture No. 02
AI 7th, SEN 5th

Course: Data Science


Instructor: Dr. Maryum Nisar

12/15/2024
Lecture Contents
- Data Science
- Understanding Data Science
- Exploratory Data Analysis

Data Science
- Data science is the application of computational and statistical techniques to address or gain insight into some problem in the real world.
- Data science = statistics + data processing + machine learning + scientific inquiry + visualization + business analytics + big data + …

CRISP process
CRoss-Industry Standard Process for Data Mining (CRISP-DM)

Data Science Process Steps
Understanding data science
Data requirements:
- An organization can draw data from various sources. It is important to understand what type of data the organization needs to collect, curate, and store.
- The data must also be categorized as numerical or categorical, and the format for storage and dissemination must be decided.

Data collection:
- Data collected from several sources must be stored in the correct format and transferred to the right information technology personnel within a company. Data can be collected from several objects on several events using different types of sensors and storage tools.

Data processing:
- Preprocessing means pre-curating the dataset before the actual analysis. Common tasks include correctly exporting the dataset, placing the data under the right tables, structuring it, and exporting it in the correct format.
Data cleaning:
- Preprocessed data is still not ready for detailed analysis. It must first be checked for incompleteness, duplicates, errors, and missing values. These checks are performed in the data cleaning stage, which involves matching the correct records, finding inaccuracies in the dataset, assessing the overall data quality, removing duplicate items, and filling in the missing values.
- How can we identify these anomalies in a dataset? One example is using outlier detection methods for quantitative data cleaning.
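Outlier detection for quantitative data can be sketched with the interquartile range (IQR) rule. This is one common method, not necessarily the one intended in the lecture; the readings and the multiplier k=1.5 are assumptions chosen for illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sensor readings; 95.0 is an injected anomaly.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 95.0, 10.2, 10.1]
print(iqr_outliers(readings))  # [95.0]
```

Whether a flagged value is an error or a genuine extreme still needs a human decision before it is removed.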

EDA:
- Exploratory data analysis is the stage where we actually start to understand the message contained in the data.

Modeling and algorithm:
- From a data science perspective, generalized models or mathematical formulas can represent relationships among different variables, such as correlation or causation.
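As a small illustration of a mathematical formula representing a relationship between two variables, the sketch below fits a straight line by ordinary least squares. The variable names and numbers are hypothetical, and a fitted line describes association only; it does not by itself show causation.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]
slope, intercept = fit_line(hours, scores)
print(slope, intercept)  # 2.0 50.0
```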
Data Product:
- A data product is generally based on a model developed during data analysis, for example, a recommendation model that inputs user purchase history and recommends a related item that the user is highly likely to buy.
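A recommendation model of this kind can be sketched very roughly by counting which items are bought together. The function name, the baskets, and the co-occurrence scoring are all assumptions for illustration; real recommenders are considerably more involved.

```python
from collections import Counter

def recommend(history, all_baskets, top_n=1):
    """Suggest items that most often co-occur with items the user already bought."""
    owned = set(history)
    scores = Counter()
    for basket in all_baskets:
        if owned & set(basket):          # basket shares an item with the user's history
            for item in basket:
                if item not in owned:
                    scores[item] += 1
    return [item for item, _ in scores.most_common(top_n)]

# Hypothetical purchase baskets from other customers.
baskets = [
    ["laptop", "mouse"],
    ["laptop", "mouse", "bag"],
    ["phone", "case"],
    ["laptop", "bag"],
    ["laptop", "mouse"],
]
print(recommend(["laptop"], baskets))  # ['mouse']
```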

Communication:
- This stage deals with disseminating the results to end stakeholders so they can use them for business intelligence. One of the most notable steps in this stage is data visualization.
- Visualization deals with information relay techniques such as tables, charts (for example, bar charts), and summary diagrams to show the analyzed result.
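To make the idea of a bar chart concrete without any plotting library, here is a minimal text-based sketch. The sales figures and the function name are hypothetical; in practice a plotting library would normally be used.

```python
def text_bar_chart(counts, width=20):
    """Render a dict of label -> value as horizontal bars scaled to `width` characters."""
    largest = max(counts.values())
    lines = []
    for label, value in counts.items():
        bar = "#" * round(width * value / largest)
        lines.append(f"{label:<10} {bar} {value}")
    return "\n".join(lines)

# Hypothetical analyzed result: sales per region.
sales = {"North": 40, "South": 20, "East": 10}
print(text_bar_chart(sales))
```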
Prior Knowledge
Gaining information on:
- Objective of the problem
- Subject area of the problem
- Data
Data Preparation / Data exploration
- Data exploration
- Data quality
- Handling missing values
- Data type conversion
- Transformation
- Outliers
- Feature selection
- Sampling
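Two of the preparation steps listed above, data type conversion and sampling, can be sketched as follows. The raw column, the helper name, and the random seed are assumptions made for illustration.

```python
import random

# Hypothetical raw column read from a file: everything arrives as text.
raw_ages = ["34", "27", "", "41", "n/a", "19"]

def to_int_or_none(text):
    """Data type conversion: parse text as int, or return None if it is not numeric."""
    try:
        return int(text)
    except ValueError:
        return None

ages = [to_int_or_none(a) for a in raw_ages]
known = [a for a in ages if a is not None]

# Sampling: draw a reproducible random subset of the usable values.
rng = random.Random(0)   # fixed seed so the sample can be repeated
sample = rng.sample(known, k=2)
print(ages, sample)
```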

Introduction to Exploratory Data Analysis (EDA)
- EDA is a crucial step in data science that allows for understanding data.
- It involves summarizing data, detecting anomalies, and testing assumptions.
- EDA helps make data-driven decisions before modeling.
Key aspects of EDA
Correlation Analysis
- Checking the relationships between variables to understand how they might affect each other. This includes computing correlation coefficients and creating correlation matrices.
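A correlation coefficient and a correlation matrix can be sketched as below, using the Pearson coefficient. The column names and values are hypothetical, and Pearson is only one of several correlation measures.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Pairwise Pearson coefficients for a dict of column name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical dataset with three numeric columns.
data = {
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [50, 58, 66, 74, 82],
    "discount": [5, 4, 3, 2, 1],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

Values close to 1 or -1 indicate a strong linear relationship; values near 0 indicate little linear relationship.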

Handling Missing Values
- Detecting and deciding how to address missing data points, whether by imputation or removal, depending on their impact and the amount of missing data.
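The two options, removal and imputation, can be sketched side by side. The rows and the choice of mean imputation are assumptions for illustration; the right strategy depends on how much data is missing and why.

```python
import statistics

# Hypothetical records; None marks a missing income.
rows = [
    {"id": 1, "income": 52000.0},
    {"id": 2, "income": None},
    {"id": 3, "income": 61000.0},
    {"id": 4, "income": 49000.0},
]

# Detect: what share of the column is missing?
missing_share = sum(r["income"] is None for r in rows) / len(rows)

# Option 1, removal: drop rows with a missing value.
dropped = [r for r in rows if r["income"] is not None]

# Option 2, imputation: fill missing values with the mean of the observed ones.
mean_income = statistics.mean(r["income"] for r in dropped)
imputed = [
    {**r, "income": mean_income if r["income"] is None else r["income"]}
    for r in rows
]
print(missing_share, len(dropped), imputed[1]["income"])  # 0.25 3 54000.0
```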

Summary Statistics
- Calculating key statistics, such as the mean, median, and standard deviation, that provide insight into data trends and nuances.
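A minimal sketch of computing summary statistics with Python's standard `statistics` module follows. The choice of statistics and the sample scores are assumptions for illustration.

```python
import statistics

def summarize(values):
    """Common summary statistics for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),   # sample standard deviation
        "min": min(values),
        "q1": q1,
        "q3": q3,
        "max": max(values),
    }

# Hypothetical quiz scores.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarize(scores)
print(summary)
```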

Testing Assumptions
- Many statistical tests and models assume the data meet certain conditions (like normality).
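One rough way to probe a normality assumption is to look at skewness, since a normal distribution is symmetric and has skewness 0. The sketch below is an informal check under that assumption, not a formal statistical test, and the datasets are hypothetical.

```python
import statistics

def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

symmetric = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [1, 1, 2, 2, 3, 4, 20]   # one large value stretches the right tail

print(round(skewness(symmetric), 3))     # close to 0
print(round(skewness(right_skewed), 3))  # clearly positive
```

A skewness far from 0 suggests the data are not normal; a value near 0 does not prove normality, so a histogram or a formal test is still advisable.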
Why Is Exploratory Data Analysis Important?
EDA is a critical step in the data analysis process, especially in data science and statistical modeling. Key reasons include:
- Understanding Data Structures
- Identifying Patterns and Relationships
- Detecting Anomalies and Outliers
- Testing Assumptions
- Informing Feature Selection and Engineering
- Optimizing Model Design
- Facilitating Data Cleaning
- Enhancing Communication
EDA Importance
Understanding Data Structures
o EDA helps in getting familiar with the dataset, understanding the number of features, the
type of data in each feature, and the distribution of data points. This understanding is
crucial for selecting appropriate analysis or prediction techniques.
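Getting familiar with a dataset's structure can be sketched as counting rows and features, listing the type seen in each feature, and tallying the values of a categorical feature. The records and the helper name are hypothetical.

```python
from collections import Counter

# Hypothetical dataset as a list of records.
records = [
    {"age": 34, "city": "Lahore", "income": 52000.0},
    {"age": 27, "city": "Karachi", "income": 48000.0},
    {"age": 41, "city": "Lahore", "income": 61000.0},
]

def describe_structure(rows):
    """Report row count, feature count, and the Python types seen in each feature."""
    features = list(rows[0])
    types = {f: Counter(type(r[f]).__name__ for r in rows) for f in features}
    return {"rows": len(rows), "features": len(features), "types": types}

info = describe_structure(records)
city_counts = Counter(r["city"] for r in records)   # distribution of a categorical feature
print(info)
print(city_counts)
```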

Identifying Patterns and Relationships
o Through visualizations and statistical summaries, EDA can reveal hidden patterns and
intrinsic relationships between variables. These insights can guide further analysis and
enable more effective feature engineering and model building.

Detecting Anomalies and Outliers
o EDA is essential for identifying errors or unusual data points that may adversely affect the
results of your analysis. Detecting these early can prevent costly mistakes in predictive
modeling and analysis.
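Errors in the data can often be caught early with simple validity rules. The sketch below assumes hypothetical fields and rules; real rules come from knowledge of the subject area.

```python
def find_invalid(rows, rules):
    """Return (row_index, field) pairs where a value violates its rule."""
    problems = []
    for i, row in enumerate(rows):
        for field, is_valid in rules.items():
            if not is_valid(row[field]):
                problems.append((i, field))
    return problems

# Hypothetical validity rules for two fields.
rules = {
    "age": lambda v: 0 <= v <= 120,
    "email": lambda v: "@" in v,
}
rows = [
    {"age": 29, "email": "a@example.com"},
    {"age": -3, "email": "b@example.com"},   # impossible age
    {"age": 45, "email": "not-an-email"},    # malformed email
]
print(find_invalid(rows, rules))  # [(1, 'age'), (2, 'email')]
```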
Testing Assumptions
o Many statistical models assume that data follow a certain distribution or that variables
are independent. EDA involves checking these assumptions.

Informing Feature Selection and Engineering
o Insights gained from EDA can inform which features are most relevant to include in a
model and how to transform them (scaling, encoding) to improve model performance.
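The two transformations named above, scaling and encoding, can be sketched as min-max scaling and one-hot encoding. These are common choices rather than the only ones, and the feature values are hypothetical.

```python
def min_max_scale(values):
    """Scale numbers linearly to [0, 1]. Assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Encode categories as 0/1 vectors, one position per sorted category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories

# Hypothetical features.
incomes = [30000, 45000, 60000]
cities = ["Lahore", "Karachi", "Lahore"]

scaled = min_max_scale(incomes)
encoded, categories = one_hot(cities)
print(scaled)      # [0.0, 0.5, 1.0]
print(categories)  # ['Karachi', 'Lahore']
print(encoded)     # [[0, 1], [1, 0], [0, 1]]
```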

Optimizing Model Design
o By understanding the data’s characteristics, analysts can choose appropriate modeling
techniques, decide on the complexity of the model, and better tune model parameters.
Facilitating Data Cleaning
o EDA helps in spotting missing values and errors in the data, which are critical to address
before further analysis to improve data quality and integrity.

Enhancing Communication
o Visual and statistical summaries from EDA can make it easier to communicate findings
and convince others of the validity of your conclusions, particularly when explaining
data-driven insights to stakeholders without technical backgrounds.
Traditional Vs Machine Learning Model

Data Science process

Thank You !