Unit – 1 (Data Science)
Data Science and Its Steps:
Data science is a field that uses maths, statistics, programming and
domain knowledge to extract insights and patterns from raw data.
Steps in the Data Science Process:
1. Problem Understanding
2. Data Collection
3. Data Preprocessing and Cleaning
4. Exploratory Data Analysis
5. Feature Selection
6. Model Building
7. Model Testing and Evaluation
8. Real-World Applications
Problem Understanding:
• The first step is to gain a complete understanding of the business
problem or use case.
• This involves identifying what we are trying to predict or solve,
understanding the factors that may influence the outcome, and
defining what a successful solution looks like.
• At this stage, we also decide on the accuracy or performance level
expected from the model and select the appropriate tools,
programming languages (like Python or R), and machine learning
techniques that will be used in the project.
Data Collection:
• In this step, we gather data that will be used to build the machine
learning model.
• The data must be relevant, accurate, and sufficient for solving the
defined problem.
• Common sources of data include (a short loading sketch follows this list):
a. APIs: Used to pull real-time or structured data from online
services like Twitter, financial platforms, weather services,
or Google Maps.
b. Databases: Data stored in systems such as MySQL,
PostgreSQL, Oracle, or NoSQL systems like MongoDB can be
accessed using SQL queries or database connectors.
c. Web Scraping: Data is extracted from websites using tools
like BeautifulSoup, Selenium, or Scrapy, often used to collect
information such as product listings, reviews, or news
articles.
d. Flat Files (CSV, Excel, JSON): Many datasets are available in
these formats from internal teams or public sources.
e. Public Datasets: Platforms like Kaggle, UCI Machine
Learning Repository, and government open data portals
provide high-quality, ready-to-use datasets.
f. Cloud and Data Warehouses: Data stored in cloud platforms
like AWS S3, Google BigQuery, or Azure Blob Storage can be
accessed via secure APIs or cloud connectors.
g. Sensors/IoT Devices: In industries like manufacturing or
healthcare, data may be streamed from devices in real time.
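As a rough illustration, the sketch below loads data from two of these sources in Python using pandas and requests; the file name and API URL are hypothetical placeholders, not real endpoints.

```python
import pandas as pd
import requests

# a. Flat file: load a local CSV into a DataFrame (hypothetical file name)
df = pd.read_csv("sales_data.csv")

# b. API: fetch JSON from a REST endpoint (hypothetical URL)
response = requests.get("https://api.example.com/v1/weather", timeout=10)
response.raise_for_status()             # stop early on HTTP errors
api_df = pd.DataFrame(response.json())  # JSON records -> tabular form

print(df.shape, api_df.shape)
```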
Data Preprocessing and Cleaning:
• Once the data is collected, it needs to be cleaned and prepared to
ensure it is usable for analysis and modeling.
• This step includes (a pandas cleaning sketch follows this list):
a. Handling Missing Values: Identify and treat missing values
by filling them using mean, median, mode, or interpolation
methods. In some cases, rows or columns with excessive
missing data may be removed.
b. Removing Duplicate Records: Detect and delete repeated entries, which may otherwise bias the analysis or lead to overfitting.
c. Outlier Detection and Treatment: Identify unusual values
using statistical techniques like the interquartile range (IQR),
Z-score, or visualization methods, and decide whether to
keep, modify, or remove them.
d. Correcting Inconsistent Data: Standardize formats across
the dataset, such as converting date formats or unifying
categories like "male", "Male", and "M".
e. Fixing Noise and Irrelevant Features: Drop columns that are
unrelated to the target variable or contain random data that
does not contribute to the prediction process.
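A minimal pandas sketch of these cleaning steps is given below; the input file and the "age" and "gender" columns are hypothetical examples.

```python
import pandas as pd

df = pd.read_csv("raw_data.csv")  # hypothetical input file

# a. Handle missing values: fill numeric gaps with the median
df["age"] = df["age"].fillna(df["age"].median())

# b. Remove duplicate records
df = df.drop_duplicates()

# c. Outlier treatment with the IQR rule: keep values within 1.5 * IQR
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# d. Correct inconsistent categories such as "male", "Male", "M"
df["gender"] = df["gender"].str.strip().str.lower().replace({"m": "male"})
```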
Exploratory Data Analysis:
• Exploratory Data Analysis helps in understanding the underlying
structure of the data and discovering patterns, relationships, or
anomalies.
• In this step:
a. We analyze summary statistics such as mean, median,
mode, variance, and correlation between variables.
b. Visual tools are used to support the analysis (see the sketch after this list), including:
• Histograms to show distribution of numeric values
• Box plots to detect outliers and compare distributions
• Scatter plots to visualize relationships between two variables
• Heatmaps to view correlations across multiple variables
• Bar charts and pie charts for categorical data insights
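The sketch below shows a few of these EDA steps with pandas and matplotlib, assuming a hypothetical cleaned dataset with "age" and "price" columns.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("clean_data.csv")   # hypothetical cleaned dataset

print(df.describe())                 # mean, std, quartiles per numeric column
print(df.corr(numeric_only=True))    # pairwise correlations (recent pandas)

df["price"].hist(bins=30)            # histogram: distribution of a numeric column
plt.title("Price distribution")
plt.show()

df.boxplot(column="price")           # box plot: spot outliers
plt.show()

df.plot.scatter(x="age", y="price")  # scatter plot: relationship of two variables
plt.show()
```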
Feature Selection:
• Feature selection is the process of identifying and retaining only
the most relevant variables (features) that have a strong impact on
the model’s predictions.
• This helps reduce model complexity, improve accuracy, and prevent overfitting (see the sketch below).
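One common technique (among many) is univariate scoring with scikit-learn's SelectKBest; the sketch below applies it to synthetic data standing in for a real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for a real cleaned dataset
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=42)

selector = SelectKBest(score_func=f_classif, k=10)  # keep the 10 best-scoring features
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (200, 10)
```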
Model Building:
• After feature selection, the dataset is divided into two parts: a
larger training set (usually 70–80%) used to train the model, and a
smaller testing set (20–30%) used for evaluation.
• Choose a suitable machine learning algorithm depending on the
problem type—classification, regression, clustering, etc.
• Train the model using the training data, allowing it to learn patterns and relationships from the input features (a minimal sketch follows).
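A minimal scikit-learn sketch of this step, using synthetic data and logistic regression as one possible classifier:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 80/20 train-test split, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)  # one suitable algorithm among many
model.fit(X_train, y_train)                # learn patterns from the training data
```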
Model Testing and Evaluation:
• Once the model is trained, we evaluate how well it performs using
the test data.
• The model's predictions are compared to the actual values to
measure accuracy and performance.
• Different types of evaluation metrics (computed in the sketch after this list) are:
a. Accuracy – Overall correctness of the model
b. Precision – Correct positive predictions out of total
predicted positives
c. Recall (Sensitivity) – Correct positive predictions out of actual positives
d. F1-Score – Harmonic mean of precision and recall
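Continuing the model-building sketch above, these metrics can be computed with scikit-learn:

```python
# Continues the model-building sketch (model, X_test, y_test defined there)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = model.predict(X_test)  # predictions on unseen test data
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
```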
Real-World Applications:
• After achieving a satisfactory evaluation score, the final model is
deployed in a real-world environment where it can be used to
make predictions or support decision-making.
• The model is maintained and updated over time using new data, and retrained when necessary to keep it accurate and relevant (a minimal persistence sketch follows).
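Full deployment also involves serving infrastructure (APIs, monitoring) beyond these notes, but a minimal first step is persisting the trained model, sketched here with joblib, continuing the example above:

```python
# Continues the sketches above: persist the trained model for reuse
import joblib

joblib.dump(model, "model.joblib")    # save the trained model to disk
loaded = joblib.load("model.joblib")  # reload it later in production code
print(loaded.predict(X_test[:5]))     # serve predictions from the loaded model
```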
Uses of Data Science:
1. Problem Solving – Example: Healthcare (Disease Diagnosis):
• Data science is used to analyze patient records, lab results,
and medical images to identify early signs of diseases.
• Machine learning models assist doctors in diagnosing
conditions like cancer or diabetes with higher accuracy.
• It helps healthcare providers predict outbreaks and allocate
resources effectively (e.g., during pandemics).
2. Automation – Example: Gmail (Spam Detection):
• Gmail uses data science to automatically filter out spam
emails by analyzing message content, sender behavior, and
user feedback.
• Machine learning models are trained on millions of emails to
recognize patterns typical of spam (e.g., suspicious links,
keywords).
• The system updates automatically as new types of spam emerge, reducing the need for manual intervention (a toy version is sketched below).
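A toy version of this idea, using a bag-of-words Naive Bayes classifier in scikit-learn; the four example emails are made up, and Gmail's actual system is far more sophisticated:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Four made-up emails; real systems train on millions
emails = ["win a free prize now", "claim your reward click here",
          "meeting rescheduled to 3pm", "lunch tomorrow at noon"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(emails)          # bag-of-words features
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vec.transform(["free prize click"])))  # likely [1] (spam)
```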
3. Decision Making – Example: Starbucks:
• Data science helps Starbucks decide where to open new
store locations by analyzing foot traffic, demographics, and
competitor presence.
• Sales and customer data are used to decide which products
to promote or discontinue.
• Customer segmentation allows Starbucks to offer
personalized rewards and seasonal promotions to different
types of buyers (e.g., morning commuters vs. casual
afternoon visitors).
Big Data and Its Characteristics:
Big Data refers to massive volumes of data that are too large, too fast, or too complex for traditional data processing tools to handle effectively.
These datasets are generated from a wide range of sources such as:
• Social media platforms
• IoT devices
• Online Transactions
• Sensors
Characteristics of Big Data:
Big data is characterized by the 5 V’s. They are:
• Volume
• Variety
• Velocity
• Veracity
• Value
Volume
Refers to the massive amount of data generated every second from
sources like social media, sensors, mobile devices, and business
transactions.
Ex: Facebook generates over 4 petabytes of data every day.
Variety
Big data comes in multiple formats: structured (tables, databases), semi-structured (XML, JSON), and unstructured (text, images, audio, video).
This diversity makes storage and analysis more challenging.
Velocity
Describes the speed at which data is generated, processed, and
analyzed.
Real-time or near real-time data processing is essential in areas like stock
trading or fraud detection.
Veracity
Relates to the accuracy and trustworthiness of the data.
Inconsistent, incomplete, or noisy data can lead to poor decisions, so
ensuring data quality is crucial.
Value
Represents the potential benefits that can be derived from analyzing big
data.
Extracting meaningful patterns, trends, or forecasts can create immense
business or societal value.
Uses of Big Data:
1. Advertisement – Example: Netflix
• Data science analyzes user viewing behavior (what, when, how
long) to recommend personalized content.
• Machine learning models predict which shows or movies a user is
most likely to watch next.
• It helps Netflix segment users into groups (e.g., action lovers,
drama fans) for targeted promotion and content placement.
2. Customer Experience – Example: Market Basket Analysis
• Market basket analysis finds associations between products
frequently bought together (like bread and butter).
• Retailers use this insight to place items near each other or offer
combo discounts.
• Enhances the user experience by showing relevant product suggestions during checkout (a tiny sketch follows).
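Dedicated libraries (e.g., Apriori implementations such as the one in mlxtend) are typically used for association mining; as a simplified stand-in, the sketch below counts item co-occurrences with plain pandas on made-up transactions:

```python
import pandas as pd

# Made-up transactions for illustration
transactions = [["bread", "butter", "milk"],
                ["bread", "butter"],
                ["milk", "eggs"],
                ["bread", "butter", "eggs"]]

# One-hot encode each basket, then count how often item pairs co-occur
basket = pd.DataFrame([{item: 1 for item in t} for t in transactions]).fillna(0)
co_occurrence = basket.T.dot(basket)         # item-by-item co-purchase counts
print(co_occurrence.loc["bread", "butter"])  # 3.0: bought together in 3 baskets
```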
3. Future Scope – Example: Weather Prediction
• Uses historical climate data and real-time satellite input to
forecast weather conditions.
• Machine learning helps in early detection of extreme events like
storms or droughts.
• Supports agriculture, disaster preparedness, and transportation
planning.
Big Data Ecosystem:
The Big Data Ecosystem refers to the collection of tools, technologies,
frameworks, and processes used to store, process, analyze, and manage
large volumes of data that traditional systems can't handle efficiently.
Key Components of the Big Data Ecosystem:
1. Data Sources
• Where data comes from.
• Examples: Social media, IoT devices, sensors, web servers, mobile apps.
2. Data Ingestion
• Tools that collect and bring data into the system.
• Examples (a Kafka producer sketch follows this list):
o Apache Kafka (real-time streaming)
o Apache Sqoop (importing data from databases)
o Flume (log data collection)
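As a small example of the ingestion step, here is a hedged sketch of a Kafka producer using the kafka-python library; it assumes a broker running at localhost:9092 and a hypothetical topic name:

```python
import json
from kafka import KafkaProducer  # kafka-python library

# Assumes a broker at localhost:9092 and a topic named "sensor-readings"
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

producer.send("sensor-readings", {"device": "d1", "temp": 22.5})
producer.flush()  # block until the message is delivered
```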
3. Data Storage
• Where data is stored for further processing.
• Examples:
o HDFS (Hadoop Distributed File System)
o Amazon S3
o NoSQL databases (MongoDB, Cassandra)
o Data Lakes
4. Data Processing
• Converting raw data into meaningful information.
• Examples (a PySpark sketch follows this list):
o Apache Hadoop MapReduce
o Apache Spark (faster, in-memory processing)
o Apache Flink (real-time stream processing)
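A minimal PySpark sketch of in-memory, distributed processing; the inline rows stand in for a genuinely large dataset:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# Inline stand-in for a large distributed dataset
df = spark.createDataFrame(
    [("electronics", 120.0), ("grocery", 15.5), ("electronics", 80.0)],
    ["category", "amount"])

# Aggregate raw records into a meaningful summary, in memory and in parallel
df.groupBy("category").agg(F.sum("amount").alias("total")).show()
spark.stop()
```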
5. Data Analysis
• Analyzing data to extract insights.
• Examples:
o R, Python (with libraries like Pandas, NumPy, Scikit-learn)
o SAS
o RapidMiner
6. Data Visualization
• Presenting data in charts, dashboards, and reports.
• Examples:
o Tableau
o Power BI
o Apache Superset
o D3.js
7. Data Management & Governance
• Ensuring data quality, security, and compliance.
• Examples:
o Apache Ranger (security)
o Apache Atlas (metadata management)
o Informatica