DSV Notes

Q1. Elaborate the different stages of the data science process in detail.

Ans:

1. Business Understanding – Defines the problem, goals, and success criteria in business terms.
Ensures the project is solving the right question.
2. Data Understanding – Collects relevant data from multiple sources and examines quality.
Identifies missing values, outliers, and basic patterns.
3. Data Preparation – Cleans and transforms raw data into usable form. Handles missing
values, duplicates, encoding, and feature creation.
4. Exploratory Data Analysis (EDA) – Uses visualization and statistics to study patterns. Helps
understand relationships and select important features.
5. Data Modeling – Applies machine learning/statistical algorithms to the prepared data. Trains
and fine-tunes models for prediction or classification.
6. Model Evaluation – Tests models using performance metrics (accuracy, precision, recall,
RMSE, etc.). Compares models and selects the best one.
7. Model Deployment – Implements the chosen model in real applications. Makes predictions
available to users through APIs, dashboards, or software.
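
The later stages can also be seen in code. Below is a minimal, hedged sketch (an assumed illustration using scikit-learn and its built-in iris dataset, not something prescribed by these notes) that walks through data preparation, modeling, and evaluation on a small labeled dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data understanding: a small, already-labeled dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Data preparation + data modeling combined in one pipeline
model = Pipeline([
    ("scale", StandardScaler()),                 # clean/transform: put features on one scale
    ("clf", LogisticRegression(max_iter=200)),   # train a classifier
])
model.fit(X_train, y_train)

# Model evaluation: compare predictions against the held-out labels
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))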

Q2. What is Supervised Machine Learning? Elaborate with a suitable example and enlist its applications.

Supervised Machine Learning (ML)

Supervised learning is a type of machine learning technique where the model is trained using a
labeled dataset.

 “Labeled” means the data already has input (X) and the correct output (Y).

 The algorithm learns the mapping function from input to output.

 Later, when new unseen data is given, the model can predict the output.

👉 In short: Supervised ML = Learn from past labeled data → predict future outcomes.
Example

Suppose we want to build a system that predicts whether an email is Spam or Not Spam.

 Input (X): Features of the email (words used, sender, subject line, etc.)

 Output (Y): Label = “Spam” or “Not Spam”

 We train the model with thousands of emails (already labeled as spam/not spam).

 After training, the model can classify new incoming emails correctly.

Another simple example: Predicting house prices based on size, location, and number of rooms.
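
A rough sketch of the house-price example is given below; the sizes, room counts, and prices are invented purely for illustration, and scikit-learn's LinearRegression is just one possible choice of algorithm:

from sklearn.linear_model import LinearRegression

# Labeled training data: inputs X = (size in sq. ft, rooms), outputs Y = price (invented numbers)
X = [[800, 2], [1000, 2], [1200, 3], [1500, 3], [1800, 4]]
Y = [60000, 75000, 90000, 110000, 135000]

model = LinearRegression()
model.fit(X, Y)                    # learn the mapping from inputs to the labeled outputs

print(model.predict([[1300, 3]]))  # predict the price of a new, unseen house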

Applications of Supervised Machine Learning

1. Email Filtering – Classifying spam vs. non-spam.

2. Medical Diagnosis – Predicting diseases based on symptoms or reports.

3. Fraud Detection – Identifying fraudulent transactions in banking.

4. Speech Recognition – Converting speech into text.

5. Weather Forecasting – Predicting rain, temperature, etc.

Q3. What is Unsupervised Machine Learning? Elaborate with a suitable example and enlist its applications.

Unsupervised Machine Learning

Definition:

 A type of machine learning where the algorithm is trained using unlabeled data (only input,
no output).

 The system tries to find hidden patterns, structure, or grouping in the data without prior
knowledge.

Example:

 Customer Segmentation: A retail store has customer purchase data but no labels.
The algorithm groups customers into clusters (e.g., frequent buyers, occasional buyers,
discount seekers).

 Market Basket Analysis: Finds products that are often bought together (e.g., bread →
butter).
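
A hedged sketch of the customer-segmentation example follows (the purchase figures are made up, and k-means is only one of several possible clustering algorithms):

from sklearn.cluster import KMeans

# Unlabeled data: (visits per month, average spend per visit) for six customers (invented values)
X = [[1, 20], [2, 25], [15, 200], [18, 220], [5, 60], [6, 55]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # no labels are supplied; the algorithm discovers the groups

print(labels)                    # cluster index assigned to each customer
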
Applications:

1. Customer segmentation in marketing

2. Market basket analysis in retail

3. Anomaly detection (fraud, network security)

4. Document or news categorization

5. Recommendation systems (movies, shopping sites)

6. Pattern and trend analysis in large datasets

7. Image compression and recognition

8. Genomic data analysis in bioinformatics

Q4. Define the terms: [4] i) Data Science ii) Big Data

i) Data Science

 An interdisciplinary field that combines statistics, mathematics, computer science, and domain knowledge.

 Focuses on extracting useful insights and predictions from structured and unstructured
data.

 Involves stages like data collection, cleaning, analysis, visualization, and machine learning
modeling.

 Example: Netflix using data science to recommend movies to users.

ii) Big Data

 Refers to large, complex, and fast-growing datasets beyond the ability of traditional tools to
manage.

 Characterized by the 5 V’s – Volume, Velocity, Variety, Veracity, and Value.

 Requires advanced tools like Hadoop, Spark, and NoSQL databases for processing.

 Example: Social media platforms analyzing billions of user posts and interactions daily.

Q5. Define Big Data. Elaborate how Big Data plays an important role in Data Science.

Big Data

Definition:
 Big Data refers to large, complex, and fast-growing datasets that cannot be stored,
processed, or analyzed effectively using traditional data management tools.

 It is generally described by the 5 V’s:

o Volume (huge amount of data)

o Velocity (speed of data generation)

o Variety (different types – text, images, video, etc.)

o Veracity (uncertainty/accuracy of data)

o Value (usefulness of data)

Role of Big Data in Data Science

1. Foundation for Data Science – Big Data provides the raw material (large datasets) on which
Data Science methods are applied.

2. Improved Decision Making – Data scientists use big data analytics to help businesses make
data-driven decisions (e.g., Amazon product recommendations).

3. Machine Learning & AI – Training accurate ML models requires massive labeled/unlabeled datasets provided by Big Data.

4. Pattern Discovery – Helps in finding hidden patterns, trends, and correlations that are not
visible in small datasets.

5. Real-Time Analysis – Enables real-time predictions (e.g., fraud detection in banking, traffic
monitoring).

Q6. How is Machine Learning related to Data Science? Explain in detail.

Relation between Machine Learning and Data Science

 Data Science is a field that focuses on collecting, processing, analyzing, and visualizing data
to extract useful insights.

 Machine Learning (ML) is a subset of AI and an important component of Data Science that
uses algorithms to learn from data and make predictions.

Relation:

1. ML is used in Data Science for predictive modeling and classification.

2. Data Science provides large datasets, ML extracts patterns and trends from them.

3. ML helps Data Science in automation and accurate decision-making.

Example: Netflix uses Data Science to collect user data and ML to recommend movies.

Q7. What is Exploratory Data Analysis in the Data Science process? Explain with a suitable example.

Exploratory Data Analysis (EDA)

Definition:

 Exploratory Data Analysis (EDA) is a critical step in the Data Science process where data is
examined, visualized, and summarized to understand its structure, patterns, and
relationships before applying any modeling techniques.

 It helps in detecting outliers, missing values, trends, and hidden patterns in the dataset.

Key Activities in EDA:

1. Data Cleaning – Handling missing values, duplicates, and errors.

2. Descriptive Statistics – Mean, median, mode, variance, correlation.

3. Visualization – Using graphs like histograms, scatter plots, box plots, heatmaps.

4. Pattern & Relationship Detection – Finding dependencies between variables.

Example:

 Suppose a company wants to analyze employee salaries vs. years of experience.

 By plotting a scatter plot, EDA shows a clear positive relationship (more experience →
higher salary).

 This insight helps decide which features to use in building a salary prediction model.
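
A small sketch of this EDA step, using pandas and matplotlib with invented salary figures (the column names and values are assumed for illustration):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical employee data (values are invented for illustration)
df = pd.DataFrame({
    "experience_years": [1, 2, 3, 5, 7, 10],
    "salary": [30000, 35000, 42000, 55000, 68000, 90000],
})

print(df.describe())   # descriptive statistics: mean, spread, quartiles
print(df.corr())       # correlation matrix; experience and salary correlate strongly here

df.plot.scatter(x="experience_years", y="salary")  # visual check of the positive trend
plt.show()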

Q8. Define the terms: i) Correlation ii) Variance

i) Correlation

 Definition: Correlation is a statistical measure that shows the strength and direction of the
relationship between two variables.

 Its value lies between –1 and +1:

o +1 → Perfect positive relation (both increase together)

o –1 → Perfect negative relation (one increases, other decreases)

o 0 → No relation

 Example: Height and weight usually show a positive correlation.

ii) Variance

 Definition: Variance is a measure of how much the data values deviate from the mean of
the dataset.
 It represents the spread or dispersion of data points.

 Formula: Variance (σ²) = Σ (xᵢ − μ)² / N, where μ is the mean of the data and N is the number of observations.

 Example: If exam marks of students vary a lot, the variance will be high; if marks are close to
average, variance will be low.
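
A short sketch computing both quantities with NumPy (the marks, heights, and weights are invented for illustration):

import numpy as np

marks = np.array([40, 45, 50, 55, 60])        # invented exam marks
heights = np.array([150, 155, 160, 165, 170])
weights = np.array([50, 55, 60, 65, 70])

print(np.var(marks))                          # variance: mean of squared deviations from the mean (50.0 here)
print(np.corrcoef(heights, weights)[0, 1])    # correlation coefficient; these values give +1.0 (perfect positive)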

Q9. Differentiate between Structured data and Unstructured data.

Point | Structured Data | Unstructured Data
Definition | Data organized in rows & columns, easily stored in databases. | Data not organized in predefined format, difficult to store/analyze.
Format | Tables, spreadsheets, relational databases. | Text, images, videos, audio, social media posts.
Storage | Stored in RDBMS (SQL databases). | Stored in data lakes, NoSQL, Hadoop systems.
Processing | Easy to search, query, and analyze using SQL. | Requires advanced tools like NLP, ML, AI for processing.
Example | Customer names, phone numbers, bank transactions. | Emails, photos, CCTV footage, YouTube videos.

Q10. Explain the different steps involved in the Data Science process in detail.

Steps in the Data Science Process:

Step 1: Define the Problem and Create a Project Charter

Clearly defining the research goals is the first step in the Data Science Process. A project
charter outlines the objectives, resources, deliverables, and timeline, ensuring that all
stakeholders are aligned.

Step 2: Retrieve Data

Data can be stored in databases, data warehouses, or data lakes within an organization.
Accessing this data often involves navigating company policies and requesting permissions.

Step 3: Data Cleansing, Integration, and Transformation

Data cleaning ensures that errors, inconsistencies, and outliers are removed. Data
integration combines datasets from different sources, while data transformation prepares the
data for modeling by reshaping variables or creating new features.
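
A hedged pandas sketch of Step 3 (the column names and values below are assumed for illustration, not taken from any real dataset):

import pandas as pd

# Toy dataset with a duplicate row, a missing value, and a categorical column (all invented)
df = pd.DataFrame({
    "age": [25, None, 40, 40],
    "salary": [30000, 45000, 60000, 60000],
    "city": ["Pune", "Mumbai", "Pune", "Pune"],
})

df = df.drop_duplicates()                        # cleansing: remove the repeated row
df["age"] = df["age"].fillna(df["age"].mean())   # cleansing: fill the missing age with the mean
df = pd.get_dummies(df, columns=["city"])        # transformation: encode the categorical feature
print(df)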

Step 4: Exploratory Data Analysis (EDA)

During EDA, various graphical techniques like scatter plots, histograms, and box plots are used
to visualize data and identify trends. This phase helps in selecting the right modeling techniques.

Step 5: Build Models

In this step, machine learning or deep learning models are built to make predictions or
classifications based on the data. The choice of algorithm depends on the complexity of the
problem and the type of data.

Step 6: Present Findings and Deploy Models

Once the analysis is complete, results are presented to stakeholders. Models are deployed into
production systems to automate decision-making or support ongoing analysis.

UNIT - 2
Q1. Explain normal distribution and its characteristics.

Normal Distribution

 The normal distribution is a continuous probability distribution that is symmetric and bell-
shaped.

 Most data values cluster around the mean, and probabilities decrease as values move away
from the mean.

 It is also called the Gaussian distribution.

Characteristics of Normal Distribution

1. Symmetry → The curve is perfectly symmetric about the mean.

2. Mean = Median = Mode → All three measures of central tendency are equal and lie at the
center.

3. Bell-Shaped Curve → High frequency around the mean, tails decrease gradually.
4. Empirical Rule (68–95–99.7 Rule) →

o 68% of data lies within 1 standard deviation of mean,

o 95% within 2σ,

o 99.7% within 3σ.

5. Total Area = 1 → The entire probability under the curve sums to 1.
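
The 68–95–99.7 rule can be checked numerically; a small sketch using scipy, assuming a standard normal distribution (mean 0, σ = 1):

from scipy.stats import norm

# Probability of lying within k standard deviations of the mean for a standard normal curve
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sigma: {p:.4f}")   # prints roughly 0.6827, 0.9545, 0.9973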

Q2. Write a note on conditional Probability.

Q3. Define Central Tendencies (Mean, Median and Mode) with examples.

Central Tendencies

Central tendency refers to statistical measures that identify the central or typical value of a dataset.
The three main measures are:

1. Mean (Arithmetic Average)

 It is the sum of all observations divided by the total number of observations.

 It represents the most common “average” used in statistics.


2. Median

 It is the middle value when the data is arranged in ascending or descending order.

 If the number of observations is even, it is the average of the two middle values.

3. Mode

 It is the value that occurs most frequently in a dataset.

 A distribution may be unimodal (one mode), bimodal (two modes), or multimodal.
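
A quick sketch with Python's built-in statistics module (the marks are invented for illustration):

import statistics

marks = [70, 75, 75, 80, 90]

print(statistics.mean(marks))    # (70 + 75 + 75 + 80 + 90) / 5 = 78
print(statistics.median(marks))  # middle value of the sorted list = 75
print(statistics.mode(marks))    # most frequent value = 75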

Q4. Define the terms related to data science. i) Covariance ii) Standard Deviation

Q5. Describe the terms Normalization and Standardization in detail.


Q6. Define Bayes' Theorem with an example.
Q7. What is Bayes' Theorem? How is it used to solve classification problems in machine learning? Illustrate with a suitable example.

Ans:

Bayes' Theorem states that the probability of a hypothesis A given observed evidence B is P(A|B) = P(B|A) × P(A) / P(B).

Use in Machine Learning (Classification)

 Bayes’ Theorem is the foundation of the Naïve Bayes Classifier.

 It is used to classify data points into categories based on probability.

 In text classification (e.g., spam detection), it calculates the probability that an email is spam
given the words it contains.
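
A hedged sketch of Naïve Bayes spam classification with scikit-learn; the four emails and their labels are invented purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win a free prize now",             # spam (invented)
    "lowest price offer win money",     # spam (invented)
    "meeting agenda for monday",        # not spam (invented)
    "please review the project report"  # not spam (invented)
]
labels = ["spam", "spam", "not spam", "not spam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)   # word counts of each email as features

clf = MultinomialNB()                  # applies Bayes' theorem with a word-independence assumption
clf.fit(X, labels)

new_email = vectorizer.transform(["win a free offer"])
print(clf.predict(new_email))          # expected to print ['spam']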

Q8. Define the terms with appropriate examples. i) Continuous Distribution ii) Normal Distribution

i) Continuous Distribution (3 marks)

 Definition: A continuous distribution represents data that can take any value within a given
range.

 The probability of any exact value is zero, but probabilities are assigned over intervals.

 Examples include height, weight, and time.

 Example: The probability distribution of students’ heights in a class follows a continuous distribution.
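
A small sketch of the height example using scipy, assuming (purely for illustration) that heights are roughly normal with mean 165 cm and standard deviation 7 cm:

from scipy.stats import norm

heights = norm(loc=165, scale=7)   # assumed continuous model of student heights

print(heights.cdf(170) - heights.cdf(160))  # probability of a height between 160 cm and 170 cm
print(heights.pdf(170))                     # a density value, not a probability; P(height == 170 exactly) is 0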

Q9. Explain Min-Max Normalization with Example.


Q10. What is Simpson’s Paradox? Illustrate with an appropriate example.

Simpson’s Paradox

Definition:

 Simpson’s Paradox is a phenomenon in probability and statistics where a trend that appears
in several different groups of data reverses or disappears when the groups are combined.

 It highlights the risk of drawing misleading conclusions if subgroup differences (hidden variables) are ignored.

Illustration with Example:

Two institutes (A and B) train students for an exam:

Group | Success in A | Success in B
Boys | 80% (80/100) | 70% (7/10)
Girls | 90% (9/10) | 80% (80/100)
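
The reversal is easiest to see when working with pass counts rather than percentages. The sketch below uses hypothetical counts (chosen so the flip actually occurs; they differ from the table above) and a pandas groupby to combine the groups:

import pandas as pd

# Hypothetical pass counts: A is better for Boys and for Girls, yet B is better overall
data = pd.DataFrame({
    "group":     ["Boys", "Boys", "Girls", "Girls"],
    "institute": ["A", "B", "A", "B"],
    "passed":    [81, 234, 192, 55],
    "appeared":  [87, 270, 263, 80],
})

data["rate"] = data["passed"] / data["appeared"]
print(data)                                  # A wins within each group

overall = data.groupby("institute")[["passed", "appeared"]].sum()
overall["rate"] = overall["passed"] / overall["appeared"]
print(overall)                               # combined: A is about 78.0%, B about 82.6% -- the trend reverses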
