Data Science Methodology
A methodology in Data Science is a way to organize your work so that it is done well and without wasted time.
The Data Science Methodology describes a repeatable routine for finding solutions to a specific
problem. It is a cyclic process with checkpoints that guide business
analysts and data scientists to act accordingly.
It consists of 10 stages; some articles group them into 5 major parts, each
containing several steps.
Every customer’s request starts with a problem, and the data scientist’s job is first to
understand it and then to approach it with statistical and machine learning
techniques.
1. Business Understanding:
This stage is crucial because it clarifies the customer’s goal. Before
any problem in the business domain can be solved, it must be understood properly.
Business understanding forms a concrete base, which makes later questions easier to
resolve: we should be clear about exactly what problem we
are going to solve.
2. Analytic Understanding:
Once the business problem has been clearly stated, the data scientist can define the
analytic approach to solving it. The approaches are of 4 types:
- Descriptive approach (current status and information provided)
- Diagnostic approach (statistical analysis of what is happening and why it is
happening)
- Predictive approach (forecasting trends or the probability of future events)
- Prescriptive approach (how the problem should actually be solved).
This step is essential because it helps identify what type of patterns will be needed to
address the question most effectively. If the issue is to determine the probability of
something, a predictive model might be used; if the question is to show
relationships, a descriptive approach may be required; and if the problem requires
counts, statistical analysis is the best way to solve it. For each type of approach,
we can use different algorithms.
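The mapping from question type to approach can be sketched in code. The helper below is hypothetical, a simple keyword heuristic for illustration only, not part of any real library:

```python
# Hypothetical helper: map a business question to one of the four
# analytic approaches described above, using simple keyword cues.

def choose_approach(question: str) -> str:
    """Return the analytic approach suited to a business question."""
    q = question.lower()
    if "why" in q:
        return "diagnostic"    # what is happening and why
    if "will" in q or "probability" in q:
        return "predictive"    # forecast trends or future events
    if "should" in q or "best action" in q:
        return "prescriptive"  # how the problem should be solved
    return "descriptive"       # current status and relationships

print(choose_approach("Why did sales drop last quarter?"))   # diagnostic
print(choose_approach("What is the probability of churn?"))  # predictive
```

In practice this choice is a judgment call made with the stakeholders, not a string match; the sketch only shows that each question type points at a different family of algorithms.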
Once we have found a way to solve our problem, we will need to discover the correct
data for our model.
3. Data Requirements:
This is the stage where we identify the necessary data content, formats, and sources for
initial data collection; this data then feeds the algorithm of the approach we
chose. During the data requirements process, one should find answers to
questions like ‘what’, ‘where’, ‘when’, ‘why’, ‘how’ & ‘who’.
4. Data Collection:
In this stage, data scientists identify the available data resources relevant to the problem domain.
The data collected may arrive in any format, so it should be validated against the
approach chosen and the output to be obtained. If
required, one can then gather more data or discard the irrelevant data.
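A minimal sketch of this validation step, with invented field names and records purely for illustration: keep rows that carry the required fields, discard the irrelevant or broken ones.

```python
# Validate collected records: keep only rows that have every required
# field present and non-empty; everything else is discarded as irrelevant.

raw_records = [
    {"customer_id": 1, "age": "34", "spend": "120.50"},
    {"customer_id": 2, "age": "", "spend": "80.00"},   # missing age
    {"note": "exported 2024"},                         # irrelevant row
]

required = {"customer_id", "age", "spend"}

def is_valid(record: dict) -> bool:
    return required <= record.keys() and all(record[k] != "" for k in required)

validated = [r for r in raw_records if is_valid(r)]
print(len(validated))  # 1 usable record out of 3 collected
```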
Once the data collection stage is complete, data scientists use descriptive statistics
and visualization techniques to understand the data better. They explore the
dataset to understand its content, determine whether additional data is necessary to fill any
gaps, and verify the quality of the data.
5. Data Understanding:
At this stage, data scientists try to understand more about the data collected earlier.
We have to check the type of each field and learn about the attributes and their
names. Data understanding answers the question “Is the data collected representative
of the problem to be solved?”. Descriptive statistics are calculated over the data
to assess its content and quality. This step may lead back
to the previous step for correction.
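As a concrete sketch of those descriptive statistics, using Python’s standard `statistics` module on an invented sample column (the values are illustrative only):

```python
import statistics

# Descriptive statistics over a sample column, as a quick check of
# content and quality; None marks a missing value to investigate.
ages = [34, 29, 41, 35, None, 29, 52]
present = [a for a in ages if a is not None]

print("missing:", ages.count(None))             # gaps in the column
print("mean:", round(statistics.mean(present), 1))
print("stdev:", round(statistics.stdev(present), 1))
print("mode:", statistics.mode(present))        # most frequent value
```

A high missing count or an implausible mean here is exactly the kind of finding that sends the process back to data collection for correction.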
6. Data Preparation:
Here data scientists prepare the data for modeling, which is one of the most crucial steps
because the data fed to the model has to be clean and without errors. In this stage, we make
sure the data are in the correct format for the machine learning algorithm we chose in
the analytic approach stage.
Once data are prepared for the chosen machine learning algorithm, we are ready for
modeling.
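A minimal sketch of that preparation, on invented rows: convert text fields into the numeric types a typical algorithm expects, and drop rows that cannot be repaired.

```python
# Data preparation: cast text fields to numeric types for modeling,
# discarding any row whose values cannot be parsed.

raw = [
    {"age": "34", "spend": "120.50"},
    {"age": "29", "spend": "n/a"},     # unparseable -> dropped
    {"age": "41", "spend": "310.00"},
]

prepared = []
for row in raw:
    try:
        prepared.append({"age": int(row["age"]), "spend": float(row["spend"])})
    except ValueError:
        continue  # this row is not in a format the model can consume

print(prepared)
```

Real preparation also covers deduplication, handling missing values, and feature engineering; the type-casting above is just the smallest representative step.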
7. Modelling:
Modelling reveals whether the data prepared for processing is appropriate or
requires more refinement, giving the data scientist the chance to see
whether the work is ready to go or needs review. Modeling focuses on developing models
that are either descriptive or predictive, and these models are based on the analytic
approach that was taken, whether statistical or machine learning.
Descriptive modeling is a mathematical process that describes real-world events
and the relationships between factors responsible for them, for example, a descriptive
model might examine things like: if a person did this, then they’re likely to prefer
that.
Predictive modeling is a process that uses data mining and probability to forecast
outcomes; for example, a predictive model might be used to determine whether an
email is spam or not. For predictive modeling, data scientists use a training set:
a set of historical data in which the outcomes are already known. This step can be
repeated several times until the model captures the question and can answer it.
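The spam example can be sketched end to end with a toy model. The word-counting approach below is an illustrative stand-in for a real machine learning algorithm, and the training messages are invented; what matters is the shape of the step: learn from labelled historical data, then classify new inputs.

```python
from collections import Counter

# Training set: historical messages whose outcomes are already known.
training_set = [
    ("win a free prize now", "spam"),
    ("free money offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on friday?", "ham"),
]

# "Train": count how often each word appears under each label.
counts = {"spam": Counter(), "ham": Counter()}
for text, label in training_set:
    counts[label].update(text.split())

def predict(text: str) -> str:
    """Classify a new message by which label's words it overlaps most."""
    score = {label: sum(c[w] for w in text.split())
             for label, c in counts.items()}
    return max(score, key=score.get)

print(predict("claim your free prize"))   # spam
print(predict("agenda for the meeting"))  # ham
```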
8. Evaluation:
Model evaluation is done during model development. Data scientists assess the
quality of the model and check whether it meets the
business requirements. Evaluation has a diagnostic measure phase (does the model work as
intended, and where are modifications required?) and a statistical significance testing
phase (ensuring the data are handled and interpreted properly).
Data scientists have to make the stakeholders familiar with the tool produced in
different scenarios, so once the model is evaluated and the data scientist is confident it
will work, it is deployed and put to the ultimate test.
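The diagnostic-measure phase can be sketched as comparing predictions against known outcomes on a held-out set. The labels below are invented; accuracy, precision, and recall are standard metrics, not specific to any one methodology.

```python
# Evaluate a classifier against known outcomes: accuracy answers
# "how often is it right", precision "how trustworthy is a spam verdict",
# recall "how much spam does it actually catch".

actual    = ["spam", "ham", "spam", "ham", "spam"]
predicted = ["spam", "ham", "ham",  "ham", "spam"]

tp = sum(a == p == "spam" for a, p in zip(actual, predicted))
fp = sum(a == "ham" and p == "spam" for a, p in zip(actual, predicted))
fn = sum(a == "spam" and p == "ham" for a, p in zip(actual, predicted))

accuracy  = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Which metric matters most depends on the business requirement from stage 1: a spam filter that deletes real mail needs high precision, one that must miss nothing needs high recall.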
9. Deployment:
Once the model has been effectively evaluated, it is made ready for deployment in the
business market. The deployment phase checks how well the model withstands the
external environment and whether it performs better than alternatives.
The deployment stage depends on the purpose of the model, and it may be rolled out to
a limited group of users or in a test environment. A real example is
a model destined for a healthcare system: it can first be deployed for some
low-risk patients and later for high-risk patients too.
10. Feedback:
Feedback is what helps refine the model and assess
its performance and impact.
Feedback mostly comes from customers: after the
deployment stage, they can say whether the model works for their purposes or not. Data scientists
take this feedback and decide whether to improve the model, because the
process from modeling to feedback is highly iterative.
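That iterative loop can be sketched as a simple rule: fold post-deployment feedback back in and refresh the model when its measured accuracy drops below a threshold. The threshold and the feedback records below are illustrative assumptions, not values from any real system.

```python
# Feedback loop sketch: decide whether the deployed model needs
# retraining, based on (predicted, actual) pairs reported after deployment.

THRESHOLD = 0.90  # illustrative minimum acceptable accuracy

def needs_retraining(feedback: list[tuple[str, str]]) -> bool:
    """Return True when observed accuracy falls below the threshold."""
    correct = sum(p == a for p, a in feedback)
    return correct / len(feedback) < THRESHOLD

feedback = [("spam", "spam"), ("ham", "spam"),
            ("ham", "ham"), ("spam", "spam")]
print(needs_retraining(feedback))  # True: 3/4 = 0.75 is below 0.90
```

When this check fires, the cycle returns to the modeling stage with the enlarged dataset, which is exactly what makes the methodology cyclic rather than linear.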