DSV Notes

Q1. Elaborate the different stages of the data science process in detail.

Ans:

1. Business Understanding – Defines the problem, goals, and success criteria in business terms.
Ensures the project is solving the right question.
2. Data Understanding – Collects relevant data from multiple sources and examines quality.
Identifies missing values, outliers, and basic patterns.
3. Data Preparation – Cleans and transforms raw data into usable form. Handles missing
values, duplicates, encoding, and feature creation.
4. Exploratory Data Analysis (EDA) – Uses visualization and statistics to study patterns. Helps
understand relationships and select important features.
5. Data Modeling – Applies machine learning/statistical algorithms to the prepared data. Trains
and fine-tunes models for prediction or classification.
6. Model Evaluation – Tests models using performance metrics (accuracy, precision, recall,
RMSE, etc.). Compares models and selects the best one.
7. Model Deployment – Implements the chosen model in real applications. Makes predictions
available to users through APIs, dashboards, or software.
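
The later stages can also be seen in code. Below is a minimal, hedged sketch (an assumed illustration using scikit-learn and its built-in iris dataset, not something prescribed by these notes) that walks through data preparation, modeling, and evaluation on a small labeled dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data understanding: a small, already-labeled dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Data preparation + data modeling combined in one pipeline
model = Pipeline([
    ("scale", StandardScaler()),                 # clean/transform: put features on one scale
    ("clf", LogisticRegression(max_iter=200)),   # train a classifier
])
model.fit(X_train, y_train)

# Model evaluation: compare predictions against the held-out labels
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))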

Q2. What is Supervised Machine Learning? Elaborate with a suitable example and enlist its applications.

Supervised Machine Learning (ML)

Supervised learning is a type of machine learning technique where the model is trained using a
labeled dataset.

 “Labeled” means the data already has input (X) and the correct output (Y).

 The algorithm learns the mapping function from input to output.

 Later, when new unseen data is given, the model can predict the output.

👉 In short: Supervised ML = Learn from past labeled data → predict future outcomes.
Example

Suppose we want to build a system that predicts whether an email is Spam or Not Spam.

 Input (X): Features of the email (words used, sender, subject line, etc.)

 Output (Y): Label = “Spam” or “Not Spam”

 We train the model with thousands of emails (already labeled as spam/not spam).

 After training, the model can classify new incoming emails correctly.

Another simple example: Predicting house prices based on size, location, and number of rooms.
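
A rough sketch of the house-price example is given below; the sizes, room counts, and prices are invented purely for illustration, and scikit-learn's LinearRegression is just one possible choice of algorithm:

from sklearn.linear_model import LinearRegression

# Labeled training data: inputs X = (size in sq. ft, rooms), outputs Y = price (invented numbers)
X = [[800, 2], [1000, 2], [1200, 3], [1500, 3], [1800, 4]]
Y = [60000, 75000, 90000, 110000, 135000]

model = LinearRegression()
model.fit(X, Y)                    # learn the mapping from inputs to the labeled outputs

print(model.predict([[1300, 3]]))  # predict the price of a new, unseen house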

Applications of Supervised Machine Learning

1. Email Filtering – Classifying spam vs. non-spam.

2. Medical Diagnosis – Predicting diseases based on symptoms or reports.

3. Fraud Detection – Identifying fraudulent transactions in banking.

4. Speech Recognition – Converting speech into text.

5. Weather Forecasting – Predicting rain, temperature, etc.

Q3. What is Unsupervised Machine Learning? Elaborate with a suitable example and enlist its applications.

Unsupervised Machine Learning

Definition:

 A type of machine learning where the algorithm is trained using unlabeled data (only input,
no output).

 The system tries to find hidden patterns, structure, or grouping in the data without prior
knowledge.

Example:

 Customer Segmentation: A retail store has customer purchase data but no labels.
The algorithm groups customers into clusters (e.g., frequent buyers, occasional buyers,
discount seekers).

 Market Basket Analysis: Finds products that are often bought together (e.g., bread →
butter).
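
A hedged sketch of the customer-segmentation example follows (the purchase figures are made up, and k-means is only one of several possible clustering algorithms):

from sklearn.cluster import KMeans

# Unlabeled data: (visits per month, average spend per visit) for six customers (invented values)
X = [[1, 20], [2, 25], [15, 200], [18, 220], [5, 60], [6, 55]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # no labels are supplied; the algorithm discovers the groups

print(labels)                    # cluster index assigned to each customer
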
Applications:

1. Customer segmentation in marketing

2. Market basket analysis in retail

3. Anomaly detection (fraud, network security)

4. Document or news categorization

5. Recommendation systems (movies, shopping sites)

6. Pattern and trend analysis in large datasets

7. Image compression and recognition

8. Genomic data analysis in bioinformatics

Q4. Define the terms: [4] i) Data Science ii) Big Data

i) Data Science

 An interdisciplinary field that combines statistics, mathematics, computer science, and domain knowledge.

 Focuses on extracting useful insights and predictions from structured and unstructured
data.

 Involves stages like data collection, cleaning, analysis, visualization, and machine learning
modeling.

 Example: Netflix using data science to recommend movies to users.

ii) Big Data

 Refers to large, complex, and fast-growing datasets beyond the ability of traditional tools to
manage.

 Characterized by the 5 V’s – Volume, Velocity, Variety, Veracity, and Value.

 Requires advanced tools like Hadoop, Spark, and NoSQL databases for processing.

 Example: Social media platforms analyzing billions of user posts and interactions daily.

Q5. Define Big Data. Elaborate how Big Data plays an important role in Data Science.

Big Data

Definition:
 Big Data refers to large, complex, and fast-growing datasets that cannot be stored,
processed, or analyzed effectively using traditional data management tools.

 It is generally described by the 5 V’s:

o Volume (huge amount of data)

o Velocity (speed of data generation)

o Variety (different types – text, images, video, etc.)

o Veracity (uncertainty/accuracy of data)

o Value (usefulness of data)

Role of Big Data in Data Science

1. Foundation for Data Science – Big Data provides the raw material (large datasets) on which
Data Science methods are applied.

2. Improved Decision Making – Data scientists use big data analytics to help businesses make
data-driven decisions (e.g., Amazon product recommendations).

3. Machine Learning & AI – Training accurate ML models requires massive labeled/unlabeled datasets provided by Big Data.

4. Pattern Discovery – Helps in finding hidden patterns, trends, and correlations that are not
visible in small datasets.

5. Real-Time Analysis – Enables real-time predictions (e.g., fraud detection in banking, traffic
monitoring).

Q6. How is Machine Learning related to Data Science? Explain in detail.

Relation between Machine Learning and Data Science

 Data Science is a field that focuses on collecting, processing, analyzing, and visualizing data
to extract useful insights.

 Machine Learning (ML) is a subset of AI and an important component of Data Science that
uses algorithms to learn from data and make predictions.

Relation:

1. ML is used in Data Science for predictive modeling and classification.

2. Data Science provides large datasets, ML extracts patterns and trends from them.

3. ML helps Data Science in automation and accurate decision-making.

Example: Netflix uses Data Science to collect user data and ML to recommend movies.

Q7. What is Exploratory Data Analysis in the Data Science process? Explain with a suitable example.

Exploratory Data Analysis (EDA)

Definition:

 Exploratory Data Analysis (EDA) is a critical step in the Data Science process where data is
examined, visualized, and summarized to understand its structure, patterns, and
relationships before applying any modeling techniques.

 It helps in detecting outliers, missing values, trends, and hidden patterns in the dataset.

Key Activities in EDA:

1. Data Cleaning – Handling missing values, duplicates, and errors.

2. Descriptive Statistics – Mean, median, mode, variance, correlation.

3. Visualization – Using graphs like histograms, scatter plots, box plots, heatmaps.

4. Pattern & Relationship Detection – Finding dependencies between variables.

Example:

 Suppose a company wants to analyze employee salaries vs. years of experience.

 By plotting a scatter plot, EDA shows a clear positive relationship (more experience →
higher salary).

 This insight helps decide which features to use in building a salary prediction model.
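
A small sketch of this EDA step, using pandas and matplotlib with invented salary figures (the column names and values are assumed for illustration):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical employee data (values are invented for illustration)
df = pd.DataFrame({
    "experience_years": [1, 2, 3, 5, 7, 10],
    "salary": [30000, 35000, 42000, 55000, 68000, 90000],
})

print(df.describe())   # descriptive statistics: mean, spread, quartiles
print(df.corr())       # correlation matrix; experience and salary correlate strongly here

df.plot.scatter(x="experience_years", y="salary")  # visual check of the positive trend
plt.show()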

Q8. Define the terms: i) Correlation ii) Variance

i) Correlation

 Definition: Correlation is a statistical measure that shows the strength and direction of the
relationship between two variables.

 Its value lies between –1 and +1:

o +1 → Perfect positive relation (both increase together)

o –1 → Perfect negative relation (one increases, other decreases)

o 0 → No relation

 Example: Height and weight usually show a positive correlation.

ii) Variance

 Definition: Variance is a measure of how much the data values deviate from the mean of
the dataset.
 It represents the spread or dispersion of data points.

 Formula: Variance (σ²) = Σ (xᵢ − μ)² / N, where μ is the mean of the data and N is the number of observations.

 Example: If exam marks of students vary a lot, the variance will be high; if marks are close to
average, variance will be low.
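
A short sketch computing both quantities with NumPy (the marks, heights, and weights are invented for illustration):

import numpy as np

marks = np.array([40, 45, 50, 55, 60])        # invented exam marks
heights = np.array([150, 155, 160, 165, 170])
weights = np.array([50, 55, 60, 65, 70])

print(np.var(marks))                          # variance: mean of squared deviations from the mean (50.0 here)
print(np.corrcoef(heights, weights)[0, 1])    # correlation coefficient; these values give +1.0 (perfect positive)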

Q9. Differentiate between Structured data and Unstructured data.

Point | Structured Data | Unstructured Data
Definition | Data organized in rows & columns, easily stored in databases. | Data not organized in predefined format, difficult to store/analyze.
Format | Tables, spreadsheets, relational databases. | Text, images, videos, audio, social media posts.
Storage | Stored in RDBMS (SQL databases). | Stored in data lakes, NoSQL, Hadoop systems.
Processing | Easy to search, query, and analyze using SQL. | Requires advanced tools like NLP, ML, AI for processing.
Example | Customer names, phone numbers, bank transactions. | Emails, photos, CCTV footage, YouTube videos.

Q10. Explain the different steps involved in the Data Science process in detail.

Steps in the Data Science Process:

Step 1: Define the Problem and Create a Project Charter

Clearly defining the research goals is the first step in the Data Science Process. A project
charter outlines the objectives, resources, deliverables, and timeline, ensuring that all
stakeholders are aligned.

Step 2: Retrieve Data

Data can be stored in databases, data warehouses, or data lakes within an organization.
Accessing this data often involves navigating company policies and requesting permissions.

Step 3: Data Cleansing, Integration, and Transformation

Data cleaning ensures that errors, inconsistencies, and outliers are removed. Data
integration combines datasets from different sources, while data transformation prepares the
data for modeling by reshaping variables or creating new features.
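
A hedged pandas sketch of Step 3 (the column names and values below are assumed for illustration, not taken from any real dataset):

import pandas as pd

# Toy dataset with a duplicate row, a missing value, and a categorical column (all invented)
df = pd.DataFrame({
    "age": [25, None, 40, 40],
    "salary": [30000, 45000, 60000, 60000],
    "city": ["Pune", "Mumbai", "Pune", "Pune"],
})

df = df.drop_duplicates()                        # cleansing: remove the repeated row
df["age"] = df["age"].fillna(df["age"].mean())   # cleansing: fill the missing age with the mean
df = pd.get_dummies(df, columns=["city"])        # transformation: encode the categorical feature
print(df)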

Step 4: Exploratory Data Analysis (EDA)

During EDA, various graphical techniques like scatter plots, histograms, and box plots are used
to visualize data and identify trends. This phase helps in selecting the right modeling techniques.

Step 5: Build Models

In this step, machine learning or deep learning models are built to make predictions or
classifications based on the data. The choice of algorithm depends on the complexity of the
problem and the type of data.

Step 6: Present Findings and Deploy Models

Once the analysis is complete, results are presented to stakeholders. Models are deployed into
production systems to automate decision-making or support ongoing analysis.

UNIT - 2
Q1. Explain normal distribution and its characteristics.

Normal Distribution

 The normal distribution is a continuous probability distribution that is symmetric and bell-
shaped.

 Most data values cluster around the mean, and probabilities decrease as values move away
from the mean.

 It is also called the Gaussian distribution.

Characteristics of Normal Distribution

1. Symmetry → The curve is perfectly symmetric about the mean.

2. Mean = Median = Mode → All three measures of central tendency are equal and lie at the
center.

3. Bell-Shaped Curve → High frequency around the mean, tails decrease gradually.
4. Empirical Rule (68–95–99.7 Rule) →

o 68% of data lies within 1 standard deviation of mean,

o 95% within 2σ,

o 99.7% within 3σ.

5. Total Area = 1 → The entire probability under the curve sums to 1.
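
The 68–95–99.7 rule can be checked numerically; a small sketch using scipy, assuming a standard normal distribution (mean 0, σ = 1):

from scipy.stats import norm

# Probability of lying within k standard deviations of the mean for a standard normal curve
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sigma: {p:.4f}")   # prints roughly 0.6827, 0.9545, 0.9973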

Q2. Write a note on conditional Probability.

Q3. Define Central Tendencies (Mean, Median and Mode) with examples.

Central Tendencies

Central tendency refers to statistical measures that identify the central or typical value of a dataset.
The three main measures are:

1. Mean (Arithmetic Average)

 It is the sum of all observations divided by the total number of observations.

 It represents the most common “average” used in statistics.


2. Median

 It is the middle value when the data is arranged in ascending or descending order.

 If the number of observations is even, it is the average of the two middle values.

3. Mode

 It is the value that occurs most frequently in a dataset.

 A distribution may be unimodal (one mode), bimodal (two modes), or multimodal.
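
A quick sketch with Python's built-in statistics module (the marks are invented for illustration):

import statistics

marks = [70, 75, 75, 80, 90]

print(statistics.mean(marks))    # (70 + 75 + 75 + 80 + 90) / 5 = 78
print(statistics.median(marks))  # middle value of the sorted list = 75
print(statistics.mode(marks))    # most frequent value = 75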

Q4. Define the terms related to data science. i) Covariance ii) Standard Deviation

Q5. Describe the terms Normalization and Standardization in detail.


Q6. Define Bayes' Theorem with an example.
Q7. What is Bayes' Theorem? How is it used to solve classification problems in machine learning? Illustrate with a suitable example.

Ans:

Bayes' Theorem states that the probability of a hypothesis A given observed evidence B is P(A|B) = P(B|A) × P(A) / P(B).

Use in Machine Learning (Classification)

 Bayes’ Theorem is the foundation of the Naïve Bayes Classifier.

 It is used to classify data points into categories based on probability.

 In text classification (e.g., spam detection), it calculates the probability that an email is spam
given the words it contains.
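
A hedged sketch of Naïve Bayes spam classification with scikit-learn; the four emails and their labels are invented purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win a free prize now",             # spam (invented)
    "lowest price offer win money",     # spam (invented)
    "meeting agenda for monday",        # not spam (invented)
    "please review the project report"  # not spam (invented)
]
labels = ["spam", "spam", "not spam", "not spam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)   # word counts of each email as features

clf = MultinomialNB()                  # applies Bayes' theorem with a word-independence assumption
clf.fit(X, labels)

new_email = vectorizer.transform(["win a free offer"])
print(clf.predict(new_email))          # expected to print ['spam']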

Q8. Define the terms with appropriate examples. i) Continuous Distribution ii) Normal Distribution

i) Continuous Distribution (3 marks)

 Definition: A continuous distribution represents data that can take any value within a given
range.

 The probability of any exact value is zero, but probabilities are assigned over intervals.

 Examples include height, weight, and time.

 Example: The probability distribution of students’ heights in a class follows a continuous distribution.
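
A small sketch of the height example using scipy, assuming (purely for illustration) that heights are roughly normal with mean 165 cm and standard deviation 7 cm:

from scipy.stats import norm

heights = norm(loc=165, scale=7)   # assumed continuous model of student heights

print(heights.cdf(170) - heights.cdf(160))  # probability of a height between 160 cm and 170 cm
print(heights.pdf(170))                     # a density value, not a probability; P(height == 170 exactly) is 0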

Q9. Explain Min-Max Normalization with Example.


Q10. What is Simpson’s Paradox? Illustrate with an appropriate example.

Simpson’s Paradox

Definition:

 Simpson’s Paradox is a phenomenon in probability and statistics where a trend that appears
in several different groups of data reverses or disappears when the groups are combined.

 It highlights the risk of drawing misleading conclusions if subgroup differences (hidden variables) are ignored.

Illustration with Example:

Two institutes (A and B) train students for an exam:

Group | Success in A | Success in B
Boys | 80% (80/100) | 70% (7/10)
Girls | 90% (9/10) | 80% (80/100)
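
The reversal is easiest to see when working with pass counts rather than percentages. The sketch below uses hypothetical counts (chosen so the flip actually occurs; they differ from the table above) and a pandas groupby to combine the groups:

import pandas as pd

# Hypothetical pass counts: A is better for Boys and for Girls, yet B is better overall
data = pd.DataFrame({
    "group":     ["Boys", "Boys", "Girls", "Girls"],
    "institute": ["A", "B", "A", "B"],
    "passed":    [81, 234, 192, 55],
    "appeared":  [87, 270, 263, 80],
})

data["rate"] = data["passed"] / data["appeared"]
print(data)                                  # A wins within each group

overall = data.groupby("institute")[["passed", "appeared"]].sum()
overall["rate"] = overall["passed"] / overall["appeared"]
print(overall)                               # combined: A is about 78.0%, B about 82.6% -- the trend reverses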
