TYBScCS Sem VI Data Science
Definition of Data Science
1. Data science is the collection, maintenance, and study of data.
2. Data Science is an interdisciplinary field that uses scientific methods,
algorithms, and systems to extract knowledge and insights from
structured and unstructured data.
3. It combines aspects of statistics, computer science, data analysis,
and domain expertise to make sense of complex data sets and help
in decision-making processes.
D G RUPAREL COLLEGE
Key components
Phase 1: Business Understanding (gaining thorough knowledge of and insight
into the problem)
The first phase consists of defining the business problem, because a
well-defined problem statement sets a specific goal. The main goal is to
understand the business problem, its domain, and the kind of solution the
business seeks. It should answer the following questions:
1. What is the goal of the business?
2. What outcome does the business want from solving this problem?
Phase 2: Data Collection (Gathering data from various sources)
Once the business problem is understood and the problem statement is
defined, the next step is to collect the data. This is referred to as Data
Acquisition in Machine Learning. Data collection is an essential step in
data science because only relevant data can solve the business problem
correctly. Though there are many sources to collect the data from, it should be
made sure that data is collected from a reliable source to ensure that data
is correct.
Phase 3: Data Preparation (Transforming raw data into a usable form)
Data preparation is a crucial step in a Data Science project, as it cleans
the data and brings it into the shape required for further analysis and
modeling. This step may also be referred to as data cleaning. As part of
data preparation, we treat issues such as missing values and outliers, and
transform the data into the required format.
This step lets data scientists decide how the data should be treated
before model building.
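The treatment of missing values and outliers described above can be sketched as follows. This is a minimal illustration, not a prescribed method: the `raw_ages` column is made-up data, and both median imputation and the 1.5 x IQR rule are assumed choices among many possible ones.

```python
from statistics import median, quantiles

# Hypothetical raw column: None marks a missing value, 400 is a suspicious entry.
raw_ages = [23, 25, None, 31, 29, 400, 27, None, 30]

# 1. Missing values: fill each gap with the median of the observed values.
observed = [x for x in raw_ages if x is not None]
fill_value = median(observed)
filled = [fill_value if x is None else x for x in raw_ages]

# 2. Outliers: keep only values within 1.5 * IQR of the quartiles.
q1, _, q3 = quantiles(filled, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = [x for x in filled if low <= x <= high]

print("filled :", filled)
print("cleaned:", cleaned)
```

The median is used here instead of the mean because a single extreme value such as 400 would pull the mean upward and distort the filled-in values.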
Phase 4: Exploratory Data Analysis (Visualizing and understanding data
patterns)
In exploratory data analysis (EDA), the data is analyzed using summary
statistics and graphs to understand its key patterns. This is a relatively
simple step, but a highly effective one. The exploratory analysis also
establishes the relationships among different variables in the form of
correlations. Here a data scientist develops a stronger understanding of
which variables may prove useful for further analyses that eventually meet
the business objectives, and drops the irrelevant data accordingly.
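As a rough sketch of this phase, the snippet below computes summary statistics for each variable and its Pearson correlation with a target variable. The dataset, the column names (`ad_spend`, `store_id`, `sales`), and the 0.5 cut-off for dropping a variable are all assumptions made for illustration.

```python
from statistics import mean, stdev

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical dataset: two candidate variables and a target ("sales").
data = {
    "ad_spend": [10, 20, 30, 40, 50],
    "store_id": [3, 1, 5, 2, 4],
    "sales": [25, 41, 62, 79, 103],
}

# Summary statistics for every column.
for name, values in data.items():
    print(f"{name}: mean={mean(values):.1f} sd={stdev(values):.1f} "
          f"min={min(values)} max={max(values)}")

# Correlation of each candidate variable with the target.
correlations = {
    name: pearson(values, data["sales"])
    for name, values in data.items()
    if name != "sales"
}

# Keep variables whose absolute correlation exceeds an assumed threshold.
useful = [name for name, r in correlations.items() if abs(r) >= 0.5]
print("correlations:", correlations)
print("useful variables:", useful)
```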
Phase 5: Model Building (Applying statistical and machine learning
algorithms to extract insights)
Once the data is prepared and its hidden insights and patterns are
understood, the next step is to build the model. There are two types of
data modeling: descriptive analytics, which derives insights from
historical data, and predictive modeling, which makes predictions about
the future. Model building is considered the most interesting step in a
Data Science project, but a data scientist needs to spend enough time on
the prior steps to get the most accurate solution. In this step, feature
selection decides which features are relevant; the rest can be removed.
There are different model building techniques, depending on the type of
business problem and data. The business problem can be a classification,
regression, time series, clustering, or recommendation problem. Based on
this, a relevant algorithm is selected and applied to the data. The model
accuracy is then calculated during the testing stage to check whether the
model is acceptable and performs well.
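A minimal sketch of this phase for a regression problem is shown below. It fits a straight line by ordinary least squares on a training split and scores it on held-out test points. The data, the choice of which points to hold out, and the use of R-squared as the score are assumptions for illustration; a real project would normally use a library and a score suited to the problem type.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    my = mean(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - my) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

# Hypothetical data: one feature (xs) and one target (ys).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 4.2, 5.9, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 19.9]

# Hold out three points for testing; train on the rest.
test_idx = {2, 5, 8}
train_x = [x for i, x in enumerate(xs) if i not in test_idx]
train_y = [y for i, y in enumerate(ys) if i not in test_idx]
test_x = [x for i, x in enumerate(xs) if i in test_idx]
test_y = [y for i, y in enumerate(ys) if i in test_idx]

slope, intercept = fit_line(train_x, train_y)
predictions = [slope * x + intercept for x in test_x]
score = r_squared(test_y, predictions)
print(f"slope={slope:.3f} intercept={intercept:.3f} test R^2={score:.4f}")
```

Scoring on points the model never saw during fitting is what makes the check meaningful: a score computed on the training data alone would overstate how well the model generalizes.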
Phase 6: Model Deployment and Maintenance (Bringing the ML model online
after offline development and training)
Once the model is built, it is ready to be deployed in the real world.
Deployment can occur offline, on the web, on the cloud, or in an Android or
iOS app. Generally, there is some variation between the accuracy of the
model as built and the model as deployed. This is because the model is
built on a certain set of data and is deployed on different data. The Data
Science project is monitored and maintained so that it works in the long run. If
there is any performance degradation, then relevant changes can be made
as a part of the maintenance.
These steps are repeated until a model that gives good results for the
business problem is achieved.
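The monitoring idea can be sketched as a simple check that compares recent live accuracy with the accuracy measured when the model was built. The function name `needs_retraining`, the 0.05 tolerance, and the accuracy figures are hypothetical values chosen for illustration.

```python
def needs_retraining(baseline_accuracy, recent_accuracies, tolerance=0.05):
    """Return True when the average recent accuracy has fallen more than
    `tolerance` below the accuracy measured at build time."""
    recent_avg = sum(recent_accuracies) / len(recent_accuracies)
    return baseline_accuracy - recent_avg > tolerance

# Hypothetical numbers: accuracy at build time and per-week live accuracy.
baseline = 0.92
weekly_live_accuracy = [0.91, 0.90, 0.84, 0.82]

early = needs_retraining(baseline, weekly_live_accuracy[:2])
late = needs_retraining(baseline, weekly_live_accuracy[2:])
print("retrain after weeks 1-2?", early)
print("retrain after weeks 3-4?", late)
```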
Examples
1. Data Science helps in learning what customers would like to
purchase or eat, based on their previous order history. This lets
online food delivery companies understand the requirements of
their customers. With the help of Data Science, they can find out
which areas produce the most orders and on which days of the
week. Moreover, they can provide more offers to selected
customers on particular orders, based on their previous ordering
history. This kind of recommendation can be achieved by using
data about customers, including their age, income, browsing history,
and prior orders. In this way, food ordering companies can
increase their business by focusing on customers' requirements.
2. Data Science also helps in making future predictions. For example,
airlines can predict the prices for their flights from customers'
previous booking history. Airline companies can collect the data
of their past flight bookings to understand the patterns: at what
time of the year most reservations are made, and for which
destinations. Understanding this pattern, airline companies can
predict the prices of their flights accordingly and gain maximum
profit.
3. Data Science also helps in making recommendations. As an example,
Netflix can give recommendations based on users' previous browsing
history and the ratings they have given to videos. Based on
their choice of videos, new videos matching their interests can be
recommended to the users. This keeps users engaged with
such sites and lets the company earn more profit.
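The recommendation examples above can be illustrated with a small co-occurrence sketch: items ordered by customers with similar histories are suggested to a customer who has not tried them yet. The customer names, the order data, and the scoring rule (weighting by the number of shared items) are assumptions for illustration and do not describe how any particular company builds its recommendations.

```python
from collections import Counter

# Hypothetical order history: customer -> items ordered so far.
orders = {
    "asha": ["pizza", "pasta", "garlic bread"],
    "ravi": ["pizza", "garlic bread", "cola"],
    "meera": ["biryani", "raita"],
    "john": ["pizza", "cola"],
}

def recommend(customer, orders, top_n=2):
    """Suggest items ordered by customers whose history overlaps with
    `customer`, weighted by how many items they have in common."""
    own = set(orders[customer])
    scores = Counter()
    for other, items in orders.items():
        if other == customer:
            continue
        overlap = len(own & set(items))
        if overlap == 0:
            continue  # no shared taste, so ignore this customer
        for item in set(items) - own:
            scores[item] += overlap
    return [item for item, _ in scores.most_common(top_n)]

print(recommend("john", orders))
print(recommend("meera", orders))
```

A customer such as the hypothetical "meera", who shares no items with anyone else, gets no suggestions from this rule; in practice that is where other data mentioned above, such as age or browsing history, would come in.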
Data Science skills
The skills that a Data Scientist must have:
Statistics: Data scientists must have a good knowledge of statistical
techniques so that they can find the hidden pattern in data and
correlation between different features in data.
Machine Learning: Data scientists must know different algorithms for
building a model so that the machine can be trained.
Computer Science: A Data Scientist must be able to apply different
principles of Computer Science, including software engineering, database
systems, Artificial Intelligence, and numerical analysis.
Programming: A Data Scientist must know at least one programming
language to implement the right algorithms. They must be comfortable
writing code in languages such as Python, R, and SQL.
Analytical Thinking: A Data Scientist must think analytically to solve the
business problems.
Critical Thinking: A Data Scientist must have critical thinking ability to
analyze the facts before concluding.
Interpersonal Skills: A Data Scientist must have excellent communication
skills to interact with different audiences across the organization.
Business Intuition: A Data Scientist must be able to communicate with
clients to understand the problems.
Tools a Data Scientist Uses
There are a variety of tools that Data Scientists use in their day-to-day
work. These tools can be Programming Tools, Data Analysis Tools, or
Statistical Programming Tools.
Python: Python is a versatile programming language and the one most used
by Data Scientists. Its most important application is in the field of
Machine Learning. It has many libraries that make it well suited to
Data Science-related work.
R Programming: R is one of the essential statistical programming tools,
which is mainly used by Data Scientists to perform a detailed analysis of
large data to find insights.
SQL: SQL is also a valuable tool for a Data Scientist. It helps them
work with DBMSs and structured data. Data Engineers also use this tool.
Tableau: This is a top-rated data visualization tool among Data Scientists
because of its strong reporting capabilities. It makes it simple to
visualize the data and show the results to clients.
Hadoop: It is a powerful open-source tool that is widely used by Data
Scientists to handle and process large volumes of data.
SAS: SAS is an advanced analysis tool that many data analysts use. Its
powerful features for analyzing, extracting, and reporting make it a
popular tool. It also has a GUI that is easy for anyone to use, and Data
Scientists use it to convert data into business insights.
Scope of Data Science
Data Science has an extensive scope, spanning across various domains and
industries. Some of the critical areas include:
Data Engineering: Building infrastructure for data generation, processing,
and storage.
Data Analysis: Identifying patterns and trends from data.
Machine Learning and Predictive Analytics: Building models that can make
predictions or classifications.
Big Data Technologies: Handling and processing large volumes of data
(e.g., Hadoop, Spark).
Data Visualization: Presenting data insights through visual mediums.
Artificial Intelligence (AI) and Deep Learning: Advanced applications
involving automated decision-making and pattern recognition.
Applications and Domains of Data Science
Data Science is applied in numerous fields across various industries:
Healthcare: Predicting disease outbreaks, personalized medicine,
diagnostics, and drug discovery.
Finance: Credit scoring, fraud detection, risk assessment, algorithmic
trading.
Retail and E-Commerce: Customer behavior prediction, recommendation
systems, sales forecasting, inventory management.
Marketing: Customer segmentation, targeted advertising, sentiment
analysis.
Manufacturing: Predictive maintenance, process optimization, supply
chain management.
Social Media: Sentiment analysis, influencer identification, social network
analysis.
Government: Public policy analysis, urban planning, crime prediction.
Sports: Player performance analysis, injury prediction, strategy
optimization.
Comparison with Other Fields
a. Business Intelligence (BI):
Scope: BI focuses primarily on descriptive analytics, i.e., analysing
historical data to generate reports, dashboards, and key performance
indicators (KPIs).
Techniques: BI uses traditional data analysis, querying, and reporting
tools.
Goal: BI helps in providing insights based on past data to aid business
decisions.
Data Science vs. BI: Data Science goes beyond descriptive analytics to
include predictive and prescriptive analytics using advanced statistical and
machine learning models.
b. Artificial Intelligence (AI):
Scope: AI refers to creating systems that can perform tasks that normally
require human intelligence, such as reasoning, learning, and problem-
solving.
Techniques: AI encompasses areas like natural language processing (NLP),
computer vision, robotics, and expert systems.
Goal: AI focuses on automation, cognitive processes, and creating systems
that can "think" and adapt.
Data Science vs. AI: While Data Science is focused on extracting insights
from data, AI focuses on building systems that can learn from data and
make decisions. Data Science often uses AI methods (e.g., machine
learning) as part of its analytical process.
c. Machine Learning (ML):
Scope: ML is a subset of AI that involves building algorithms that can learn
from and make predictions based on data.
Techniques: Supervised learning, unsupervised learning, reinforcement
learning, deep learning.
Goal: ML aims to develop models that can improve performance over time
without being explicitly programmed.
Data Science vs. ML: Data Science is the broader field that includes data
preparation, visualization, and communication, while ML is specifically
concerned with creating models that can learn from data.
d. Data Warehousing/Data Mining (DW-DM):
Scope of Data Warehousing: Involves the storage, consolidation, and
management of large volumes of data from various sources in a central
repository (data warehouse).
Scope of Data Mining: Involves extracting patterns, trends, and
relationships from large datasets through techniques like clustering,
classification, association rule mining, etc.
Techniques: Data warehousing uses ETL (Extract, Transform, Load)
processes, while data mining applies statistical and machine learning
techniques.
Goal: Data warehousing ensures data availability and accessibility for
decision-making. Data mining extracts useful patterns and insights from
vast amounts of data.
Data Science vs. DW-DM: Data Science encompasses both the techniques
of data mining and more advanced analytics, such as predictive modeling
and machine learning, while Data Warehousing focuses more on data
storage, integration, and accessibility. Data Mining is a technique within
Data Science used to find patterns in data.
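Association rule mining, one of the data mining techniques named above, can be sketched with its two basic measures, support and confidence. The `transactions` data and the rule being tested (bread implies butter) are made up for illustration.

```python
# Hypothetical market-basket data: each set is one customer transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    return (count(antecedent | consequent, transactions)
            / count(antecedent, transactions))

rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(f"support(bread, butter)     = {rule_support:.2f}")
print(f"confidence(bread -> butter) = {rule_confidence:.2f}")
```

Here three of the five transactions contain both items, and three of the four transactions containing bread also contain butter, so the rule has support 0.60 and confidence 0.75.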
Data Science: A broad field that combines data collection, analysis,
machine learning, and AI to derive actionable insights.
Business Intelligence (BI): Primarily focused on descriptive analytics and
reporting.
Artificial Intelligence (AI): Focuses on creating systems that perform
human-like tasks and adapt over time.
Machine Learning (ML): A subset of AI that focuses on creating
algorithms that can learn and make predictions from data.
Data Warehousing/Data Mining (DW-DM): Data warehousing deals with
storing and managing data, while data mining involves discovering
patterns in data.
These fields are interconnected, with Data Science encompassing a wide
range of methodologies, including AI, ML, and data mining techniques, to
solve complex real-world problems across various industries.