Unit – 1 (Data Science)
Data Science and Its Steps:
Data science is a field that uses maths, statistics, programming and
domain knowledge to extract insights and patterns from raw data.
Steps in the Data Science Process:
1. Problem Understanding
2. Data Collection
3. Data Preprocessing and Cleaning
4. Exploratory Data Analysis
5. Feature Selection
6. Model Building
7. Model Testing and Evaluation
8. Real-World Applications
Problem Understanding:
• The first step is to gain a complete understanding of the business
problem or use case.
• This involves identifying what we are trying to predict or solve,
understanding the factors that may influence the outcome, and
defining what a successful solution looks like.
• At this stage, we also decide on the accuracy or performance level
expected from the model and select the appropriate tools,
programming languages (like Python or R), and machine learning
techniques that will be used in the project.
Data Collection:
• In this step, we gather data that will be used to build the machine
learning model.
• The data must be relevant, accurate, and sufficient for solving the
defined problem.
• Common sources of data include (a short loading sketch follows this list):
a. APIs: Used to pull real-time or structured data from online
services like Twitter, financial platforms, weather services,
or Google Maps.
b. Databases: Data stored in systems such as MySQL,
PostgreSQL, Oracle, or NoSQL systems like MongoDB can be
accessed using SQL queries or database connectors.
c. Web Scraping: Data is extracted from websites using tools
like BeautifulSoup, Selenium, or Scrapy, often used to collect
information such as product listings, reviews, or news
articles.
d. Flat Files (CSV, Excel, JSON): Many datasets are available in
these formats from internal teams or public sources.
e. Public Datasets: Platforms like Kaggle, UCI Machine
Learning Repository, and government open data portals
provide high-quality, ready-to-use datasets.
f. Cloud and Data Warehouses: Data stored in cloud platforms
like AWS S3, Google BigQuery, or Azure Blob Storage can be
accessed via secure APIs or cloud connectors.
g. Sensors/IoT Devices: In industries like manufacturing or
healthcare, data may be streamed from devices in real time.
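As a rough illustration, the sketch below loads data from two of these sources in Python using pandas and requests; the file name and API URL are hypothetical placeholders, not real endpoints.

```python
import pandas as pd
import requests

# a. Flat file: load a local CSV into a DataFrame (hypothetical file name)
df = pd.read_csv("sales_data.csv")

# b. API: fetch JSON from a REST endpoint (hypothetical URL)
response = requests.get("https://api.example.com/v1/weather", timeout=10)
response.raise_for_status()             # stop early on HTTP errors
api_df = pd.DataFrame(response.json())  # JSON records -> tabular form

print(df.shape, api_df.shape)
```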
Data Preprocessing and Cleaning:
• Once the data is collected, it needs to be cleaned and prepared to
ensure it is usable for analysis and modeling.
• This step includes (a pandas cleaning sketch follows this list):
a. Handling Missing Values: Identify and treat missing values
by filling them using mean, median, mode, or interpolation
methods. In some cases, rows or columns with excessive
missing data may be removed.
b. Removing Duplicate Records: Detect and delete repeated entries, which may otherwise bias the analysis or lead to overfitting.
c. Outlier Detection and Treatment: Identify unusual values
using statistical techniques like the interquartile range (IQR),
Z-score, or visualization methods, and decide whether to
keep, modify, or remove them.
d. Correcting Inconsistent Data: Standardize formats across
the dataset, such as converting date formats or unifying
categories like "male", "Male", and "M".
e. Fixing Noise and Irrelevant Features: Drop columns that are
unrelated to the target variable or contain random data that
does not contribute to the prediction process.
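A minimal pandas sketch of these cleaning steps is given below; the input file and the "age" and "gender" columns are hypothetical examples.

```python
import pandas as pd

df = pd.read_csv("raw_data.csv")  # hypothetical input file

# a. Handle missing values: fill numeric gaps with the median
df["age"] = df["age"].fillna(df["age"].median())

# b. Remove duplicate records
df = df.drop_duplicates()

# c. Outlier treatment with the IQR rule: keep values within 1.5 * IQR
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# d. Correct inconsistent categories such as "male", "Male", "M"
df["gender"] = df["gender"].str.strip().str.lower().replace({"m": "male"})
```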
Exploratory Data Analysis:
• Exploratory Data Analysis helps in understanding the underlying
structure of the data and discovering patterns, relationships, or
anomalies.
• In this step:
a. We analyze summary statistics such as mean, median,
mode, variance, and correlation between variables.
b. Visual tools are used to support the analysis (see the sketch after this list), including:
• Histograms to show distribution of numeric values
• Box plots to detect outliers and compare distributions
• Scatter plots to visualize relationships between two variables
• Heatmaps to view correlations across multiple variables
• Bar charts and pie charts for categorical data insights
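The sketch below shows a few of these EDA steps with pandas and matplotlib, assuming a hypothetical cleaned dataset with "age" and "price" columns.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("clean_data.csv")   # hypothetical cleaned dataset

print(df.describe())                 # mean, std, quartiles per numeric column
print(df.corr(numeric_only=True))    # pairwise correlations (recent pandas)

df["price"].hist(bins=30)            # histogram: distribution of a numeric column
plt.title("Price distribution")
plt.show()

df.boxplot(column="price")           # box plot: spot outliers
plt.show()

df.plot.scatter(x="age", y="price")  # scatter plot: relationship of two variables
plt.show()
```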
Feature Selection:
• Feature selection is the process of identifying and retaining only
the most relevant variables (features) that have a strong impact on
the model’s predictions.
• This helps reduce model complexity, improve accuracy, and prevent overfitting (see the sketch below).
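One common technique (among many) is univariate scoring with scikit-learn's SelectKBest; the sketch below applies it to synthetic data standing in for a real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for a real cleaned dataset
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=42)

selector = SelectKBest(score_func=f_classif, k=10)  # keep the 10 best-scoring features
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (200, 10)
```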
Model Building:
• After feature selection, the dataset is divided into two parts: a
larger training set (usually 70–80%) used to train the model, and a
smaller testing set (20–30%) used for evaluation.
• Choose a suitable machine learning algorithm depending on the
problem type—classification, regression, clustering, etc.
• Train the model using the training data, allowing it to learn patterns and relationships from the input features (a minimal sketch follows).
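A minimal scikit-learn sketch of this step, using synthetic data and logistic regression as one possible classifier:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 80/20 train-test split, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)  # one suitable algorithm among many
model.fit(X_train, y_train)                # learn patterns from the training data
```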
Model Testing and Evaluation:
• Once the model is trained, we evaluate how well it performs using
the test data.
• The model's predictions are compared to the actual values to
measure accuracy and performance.
• Different types of evaluation metrics (computed in the sketch after this list) are:
a. Accuracy – Overall correctness of the model
b. Precision – Correct positive predictions out of total
predicted positives
c. Recall (Sensitivity) – Correct positive predictions out of actual positives
d. F1-Score – Harmonic mean of precision and recall
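Continuing the model-building sketch above, these metrics can be computed with scikit-learn:

```python
# Continues the model-building sketch (model, X_test, y_test defined there)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = model.predict(X_test)  # predictions on unseen test data
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
```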
Real-World Applications:
• After achieving a satisfactory evaluation score, the final model is
deployed in a real-world environment where it can be used to
make predictions or support decision-making.
• The model is maintained and updated over time using new data, and retrained when necessary to keep it accurate and relevant (a minimal persistence sketch follows).
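Full deployment also involves serving infrastructure (APIs, monitoring) beyond these notes, but a minimal first step is persisting the trained model, sketched here with joblib, continuing the example above:

```python
# Continues the sketches above: persist the trained model for reuse
import joblib

joblib.dump(model, "model.joblib")    # save the trained model to disk
loaded = joblib.load("model.joblib")  # reload it later in production code
print(loaded.predict(X_test[:5]))     # serve predictions from the loaded model
```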
Uses of Data Science:
1. Problem Solving – Example: Healthcare (Disease Diagnosis):
• Data science is used to analyze patient records, lab results,
and medical images to identify early signs of diseases.
• Machine learning models assist doctors in diagnosing
conditions like cancer or diabetes with higher accuracy.
• It helps healthcare providers predict outbreaks and allocate
resources effectively (e.g., during pandemics).
2. Automation – Example: Gmail (Spam Detection):
• Gmail uses data science to automatically filter out spam
emails by analyzing message content, sender behavior, and
user feedback.
• Machine learning models are trained on millions of emails to
recognize patterns typical of spam (e.g., suspicious links,
keywords).
• The system updates automatically as new types of spam emerge, reducing the need for manual intervention (a toy version is sketched below).
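A toy version of this idea, using a bag-of-words Naive Bayes classifier in scikit-learn; the four example emails are made up, and Gmail's actual system is far more sophisticated:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Four made-up emails; real systems train on millions
emails = ["win a free prize now", "claim your reward click here",
          "meeting rescheduled to 3pm", "lunch tomorrow at noon"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(emails)          # bag-of-words features
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vec.transform(["free prize click"])))  # likely [1] (spam)
```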
3. Decision Making – Example: Starbucks:
• Data science helps Starbucks decide where to open new
store locations by analyzing foot traffic, demographics, and
competitor presence.
• Sales and customer data are used to decide which products
to promote or discontinue.
• Customer segmentation allows Starbucks to offer
personalized rewards and seasonal promotions to different
types of buyers (e.g., morning commuters vs. casual
afternoon visitors).
Big Data and Its Characteristics:
Big Data refers to massive volumes of data that are too large, too fast, or too complex for traditional data processing tools to handle effectively.
These datasets are generated from a wide range of sources such as:
• Social media platforms
• IoT devices
• Online Transactions
• Sensors
Characteristics of Big Data:
Big data is characterized by the 5 V’s. They are:
• Volume
• Variety
• Velocity
• Veracity
• Value
Volume
Refers to the massive amount of data generated every second from
sources like social media, sensors, mobile devices, and business
transactions.
Ex: Facebook generates over 4 petabytes of data every day.
Variety
Big data comes in multiple formats: structured (tables, databases), semi-structured (XML, JSON), and unstructured (text, images, audio, video).
This diversity makes storage and analysis more challenging.
Velocity
Describes the speed at which data is generated, processed, and
analyzed.
Real-time or near real-time data processing is essential in areas like stock
trading or fraud detection.
Veracity
Relates to the accuracy and trustworthiness of the data.
Inconsistent, incomplete, or noisy data can lead to poor decisions, so
ensuring data quality is crucial.
Value
Represents the potential benefits that can be derived from analyzing big
data.
Extracting meaningful patterns, trends, or forecasts can create immense
business or societal value.
Uses of Big Data:
1. Advertisement – Example: Netflix
• Data science analyzes user viewing behavior (what, when, how
long) to recommend personalized content.
• Machine learning models predict which shows or movies a user is
most likely to watch next.
• It helps Netflix segment users into groups (e.g., action lovers,
drama fans) for targeted promotion and content placement.
2. Customer Experience – Example: Market Basket Analysis
• Market basket analysis finds associations between products
frequently bought together (like bread and butter).
• Retailers use this insight to place items near each other or offer
combo discounts.
• Enhances the user experience by showing relevant product suggestions during checkout (a tiny sketch follows).
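Dedicated libraries (e.g., Apriori implementations such as the one in mlxtend) are typically used for association mining; as a simplified stand-in, the sketch below counts item co-occurrences with plain pandas on made-up transactions:

```python
import pandas as pd

# Made-up transactions for illustration
transactions = [["bread", "butter", "milk"],
                ["bread", "butter"],
                ["milk", "eggs"],
                ["bread", "butter", "eggs"]]

# One-hot encode each basket, then count how often item pairs co-occur
basket = pd.DataFrame([{item: 1 for item in t} for t in transactions]).fillna(0)
co_occurrence = basket.T.dot(basket)         # item-by-item co-purchase counts
print(co_occurrence.loc["bread", "butter"])  # 3.0: bought together in 3 baskets
```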
3. Future Scope – Example: Weather Prediction
• Uses historical climate data and real-time satellite input to
forecast weather conditions.
• Machine learning helps in early detection of extreme events like
storms or droughts.
• Supports agriculture, disaster preparedness, and transportation
planning.
Big Data Ecosystem:
The Big Data Ecosystem refers to the collection of tools, technologies,
frameworks, and processes used to store, process, analyze, and manage
large volumes of data that traditional systems can't handle efficiently.
Key Components of the Big Data Ecosystem:
1. Data Sources
• Where data comes from.
• Examples: Social media, IoT devices, sensors, web servers, mobile apps.
2. Data Ingestion
• Tools that collect and bring data into the system.
• Examples (a Kafka producer sketch follows this list):
o Apache Kafka (real-time streaming)
o Apache Sqoop (importing data from databases)
o Flume (log data collection)
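As a small example of the ingestion step, here is a hedged sketch of a Kafka producer using the kafka-python library; it assumes a broker running at localhost:9092 and a hypothetical topic name:

```python
import json
from kafka import KafkaProducer  # kafka-python library

# Assumes a broker at localhost:9092 and a topic named "sensor-readings"
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

producer.send("sensor-readings", {"device": "d1", "temp": 22.5})
producer.flush()  # block until the message is delivered
```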
3. Data Storage
• Where data is stored for further processing.
• Examples:
o HDFS (Hadoop Distributed File System)
o Amazon S3
o NoSQL databases (MongoDB, Cassandra)
o Data Lakes
4. Data Processing
• Converting raw data into meaningful information.
• Examples (a PySpark sketch follows this list):
o Apache Hadoop MapReduce
o Apache Spark (faster, in-memory processing)
o Apache Flink (real-time stream processing)
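A minimal PySpark sketch of in-memory, distributed processing; the inline rows stand in for a genuinely large dataset:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# Inline stand-in for a large distributed dataset
df = spark.createDataFrame(
    [("electronics", 120.0), ("grocery", 15.5), ("electronics", 80.0)],
    ["category", "amount"])

# Aggregate raw records into a meaningful summary, in memory and in parallel
df.groupBy("category").agg(F.sum("amount").alias("total")).show()
spark.stop()
```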
5. Data Analysis
• Analyzing data to extract insights.
• Examples:
o R, Python (with libraries like Pandas, NumPy, Scikit-learn)
o SAS
o RapidMiner
6. Data Visualization
• Presenting data in charts, dashboards, and reports.
• Examples:
o Tableau
o Power BI
o Apache Superset
o D3.js
7. Data Management & Governance
• Ensuring data quality, security, and compliance.
• Examples:
o Apache Ranger (security)
o Apache Atlas (metadata management)
o Informatica