
Applied Machine Learning Process

The document describes a 5-step process for applied machine learning problems: 1. Define the problem by describing it formally and informally, listing assumptions, and considering how it could be solved manually. 2. Prepare the data by analyzing, selecting, preprocessing, and transforming it for machine learning algorithms. 3. Spot check algorithms by running many standard algorithms on the data to identify high-performing combinations. 4. Improve results by tuning algorithms, using ensembles, and further feature engineering. 5. Present results in a report or presentation describing why the problem was addressed and the solution found.

Uploaded by

prediatech
Copyright
© © All Rights Reserved

Applied Machine Learning Process


by Jason Brownlee on July 5, 2019 in Machine Learning Process


The Systematic Process For Working Through Predictive Modeling Problems That Delivers Above Average Results
Over time, as you work on applied machine learning problems, you develop a pattern or process for quickly getting to
good, robust results.

Once developed, you can use this process again and again on project after project. The more robust and
developed your process, the faster you can get to reliable results.

In this post, I want to share with you the skeleton of my process for working a machine learning problem.

You can use this as a starting point or template on your next project.

5-Step Systematic Process


I like to use a 5-step process:

1. Define the Problem
2. Prepare Data
3. Spot Check Algorithms
4. Improve Results
5. Present Results

There is a lot of flexibility in this process. For example, the “prepare data” step is typically broken down into analyze
data (summarize and graph) and prepare data (prepare samples for experiments). The “Spot Checks” step may
involve multiple formal experiments.

It’s a great big production line that I try to move through in a linear manner. The great thing about using automated tools
is that you can go back a few steps (say from “Improve Results” back to “Prepare Data”), insert a new transform
of the dataset and re-run the experiments in the intervening steps to see what interesting results come out and how
they compare to the experiments you executed before.
Production Line
Photo by East Capital, some rights reserved

The process I use has been adapted from the standard data mining process of knowledge discovery in databases
(or KDD). See the post What is Data Mining and KDD for more details.

1. Define the Problem


I like to use a three-step process to define the problem. I like to move quickly, and I use this mini process to see the
problem from a few different perspectives:

Step 1: What is the problem? Describe the problem informally and formally and list assumptions and similar
problems.
Step 2: Why does the problem need to be solved? List your motivation for solving the problem, the benefits
a solution provides and how the solution will be used.
Step 3: How would I solve the problem? Describe how the problem would be solved manually to flush out
domain knowledge.

You can learn more about this process in the post:

How to Define Your Machine Learning Problem

2. Prepare Data
I preface data preparation with a data analysis phase that involves summarizing the attributes and visualizing them
using scatter plots and histograms. I also like to describe in detail each attribute and the relationships between
attributes. This grunt work forces me to think about the data in the context of the problem before it is lost to the
algorithms.

The actual data preparation process has three steps:

Step 1: Data Selection: Consider what data is available, what data is missing and what data can be removed.
Step 2: Data Preprocessing: Organize your selected data by formatting, cleaning and sampling from it.
Step 3: Data Transformation: Transform preprocessed data ready for machine learning by engineering
features using scaling, attribute decomposition and attribute aggregation.
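As a rough illustration of Step 3, here is a minimal sketch of scaling, attribute decomposition and attribute aggregation using only the Python standard library. The records, the field names (`signup`, `spend_jan`, `spend_feb`) and the helper functions are all made up for this example; they are assumptions, not part of any real dataset or of my own tooling.

```python
from datetime import datetime
from statistics import mean, pstdev

# Hypothetical raw records: a signup timestamp and two monthly spend amounts.
rows = [
    {"signup": "2019-07-05 14:30", "spend_jan": 20.0, "spend_feb": 35.0},
    {"signup": "2018-12-21 09:40", "spend_jan": 0.0, "spend_feb": 10.0},
    {"signup": "2020-03-21 06:39", "spend_jan": 50.0, "spend_feb": 45.0},
]

def decompose_timestamp(text):
    """Attribute decomposition: split one complex attribute into simpler parts."""
    ts = datetime.strptime(text, "%Y-%m-%d %H:%M")
    return {"year": ts.year, "month": ts.month, "hour": ts.hour}

def standardize(values):
    """Scaling: shift to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

features = []
for row in rows:
    record = decompose_timestamp(row["signup"])
    # Attribute aggregation: combine several attributes into one summary value.
    record["spend_total"] = row["spend_jan"] + row["spend_feb"]
    features.append(record)

# Scale the aggregated attribute and attach it to each record.
scaled = standardize([f["spend_total"] for f in features])
for record, value in zip(features, scaled):
    record["spend_total_scaled"] = value

for record in features:
    print(record)
```

Each transform produces a new view of the same rows, which is what makes it easy to compare views in later experiments.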

You can learn more about this process for preparing data in the post:

How to Prepare Data For Machine Learning

3. Spot Check Algorithms


I use 10-fold cross-validation in my test harnesses by default. All experiments (algorithm and dataset combinations)
are repeated 10 times, and the mean and standard deviation of the accuracy are collected and reported. I also use
statistical significance tests to separate meaningful results from noise. Box plots are very useful for summarizing
the distribution of accuracy results for each algorithm and dataset pair.
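The harness described above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the toy two-class data, the nearest-centroid classifier and every function name are invented for illustration, the folds are shuffled but not stratified, and reading "repeated 10 times" as 10 reshuffled runs of 10-fold cross-validation is my interpretation for this example.

```python
import random
from statistics import mean, stdev

def make_toy_data(n=200, seed=7):
    """Hypothetical two-class data: two overlapping 2-D Gaussian blobs."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 2.0
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y

def fit_centroids(X, y):
    """Nearest-centroid 'training': store the mean point of each class."""
    centroids = {}
    for label in set(y):
        pts = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*pts)]
    return centroids

def predict_centroids(centroids, x):
    """Predict the class whose centroid is closest (squared Euclidean distance)."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def repeated_kfold_accuracy(X, y, fit, predict, k=10, repeats=10, seed=1):
    """Return one accuracy per fold, over `repeats` reshuffled k-fold splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # k disjoint test folds
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in idx if i not in held_out]
            model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
            scores.append(hits / len(test_idx))
    return scores

X, y = make_toy_data()
scores = repeated_kfold_accuracy(X, y, fit_centroids, predict_centroids)
print(f"accuracy: mean={mean(scores):.3f} std={stdev(scores):.3f} (n={len(scores)})")
```

The list of per-fold scores is also what you would feed into a box plot or a significance test.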

I spot check algorithms, which means loading up a bunch of standard machine learning algorithms into my test
harness and performing a formal experiment. I typically run 10-20 standard algorithms from all the major algorithm
families across all the transformed and scaled versions of the dataset I have prepared.

The goal of spot checking is to flush out the types of algorithms and dataset combinations that are good at picking
out the structure of the problem so that they can be studied in more detail with focused experiments.
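To make the idea concrete, here is a small spot-check sketch: a handful of simple algorithms evaluated on two views of the same dataset (raw and standardized) with 10-fold cross-validation. Everything here is an assumption made for illustration: the synthetic data, the three toy algorithms and all function names. For brevity the standardized view is computed on all rows up front; a careful harness would fit the scaling inside each training fold to avoid leaking information from the test fold.

```python
import random
from statistics import mean, pstdev

rng = random.Random(3)
# Hypothetical data: class 1 is shifted on feature 0; feature 1 is large-scale noise.
y = [i % 2 for i in range(120)]
X = [[rng.gauss(2.0 * t, 1.0), rng.gauss(0.0, 50.0)] for t in y]

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def standardize(rows):
    """A second view of the dataset: zero mean, unit variance per column."""
    stats = [(mean(col), pstdev(col)) for col in zip(*rows)]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]

# Each algorithm is a function fit(Xtr, ytr) that returns a predict(x) callable.
def majority(Xtr, ytr):
    top = max(set(ytr), key=ytr.count)  # baseline: always predict one class
    return lambda x: top

def nearest_centroid(Xtr, ytr):
    cents = {}
    for c in set(ytr):
        members = [x for x, t in zip(Xtr, ytr) if t == c]
        cents[c] = [mean(col) for col in zip(*members)]
    return lambda x: min(cents, key=lambda c: dist2(x, cents[c]))

def knn(k):
    def fit(Xtr, ytr):
        def predict(x):
            order = sorted(range(len(Xtr)), key=lambda i: dist2(x, Xtr[i]))
            votes = [ytr[i] for i in order[:k]]
            return max(set(votes), key=votes.count)
        return predict
    return fit

def cv_accuracy(fit, rows, labels, k=10, seed=1):
    """Mean k-fold accuracy; the fixed seed gives every experiment the same folds."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(k):
        test = idx[f::k]
        held = set(test)
        train = [i for i in idx if i not in held]
        predict = fit([rows[i] for i in train], [labels[i] for i in train])
        scores.append(sum(predict(rows[i]) == labels[i] for i in test) / len(test))
    return mean(scores)

views = {"raw": X, "standardized": standardize(X)}
algorithms = {"majority": majority, "centroid": nearest_centroid,
              "1-NN": knn(1), "5-NN": knn(5)}
results = {(v, a): cv_accuracy(fit, rows, y)
           for v, rows in views.items() for a, fit in algorithms.items()}

# Rank every (view, algorithm) pair, best first.
for (view, algo), acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{view:>12} {algo:<9} {acc:.3f}")
```

The ranked table is the output of the spot check: the pairs near the top are the candidates for more focused experiments.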

More focused experiments with well-performing families of algorithms may be performed in this step, but algorithm
tuning is left for the next step.

You can discover more about defining your test harness in the post:

How to Evaluate Machine Learning Algorithms

You can discover the importance of spot checking algorithms in the post:

Why you should be Spot-Checking Algorithms on your Machine Learning Problems

4. Improve Results
After spot checking, it’s time to squeeze out the best result from the rig. I do this by running an automated
sensitivity analysis on the parameters of the top performing algorithms. I also design and run experiments using
standard ensemble methods of the top performing algorithms. I put a lot of time into thinking about how to get more
out of the dataset or of the family of algorithms that have been shown to perform well.
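A sensitivity analysis on one parameter can be as simple as a sweep. The sketch below is illustrative only: the synthetic data, the k-nearest-neighbors classifier and the grid of values are assumptions for this example. Every parameter value is scored on the same folds, so the differences between settings are not caused by different splits.

```python
import random

rng = random.Random(11)
# Hypothetical two-class data: two overlapping 2-D Gaussian blobs.
y = [i % 2 for i in range(150)]
X = [[rng.gauss(1.5 * t, 1.0), rng.gauss(1.5 * t, 1.0)] for t in y]

def knn_predict(Xtr, ytr, x, k):
    """Classify x by majority vote among its k nearest training points."""
    order = sorted(range(len(Xtr)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(x, Xtr[i])))
    votes = [ytr[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

def cv_accuracy(k_neighbors, folds=5, seed=1):
    """Mean k-fold accuracy of k-NN for one parameter setting."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)  # same seed, so same folds for every setting
    fold_scores = []
    for f in range(folds):
        test = idx[f::folds]
        held = set(test)
        train = [i for i in idx if i not in held]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        hits = sum(knn_predict(Xtr, ytr, X[i], k_neighbors) == y[i] for i in test)
        fold_scores.append(hits / len(test))
    return sum(fold_scores) / folds

grid = [1, 3, 5, 9, 15, 25]  # odd values avoid tied votes with two classes
sweep = {k: cv_accuracy(k) for k in grid}
best_k = max(sweep, key=sweep.get)

for k in grid:
    print(f"k={k:<3} accuracy={sweep[k]:.3f}")
print("best k:", best_k)
```

Plotting accuracy against the parameter value shows how sensitive the algorithm is to that parameter, not just which value scored highest.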

Again, statistical significance of results is critical here. It is so easy to focus on the methods and play with algorithm
configurations. The results are only meaningful if they are significant, all configurations are thought out in advance,
and the experiments are executed in batch. I also like to maintain my own personal leaderboard of top results on a
problem.
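One way to check whether the gap between two configurations is more than noise is a paired permutation test on their per-fold scores. The sketch below is an assumption on my part, not a description of the specific test I run: the function name and the accuracy values are made up, and because cross-validation folds share training data the resulting p-value should be read as approximate.

```python
import random
from statistics import mean

def paired_permutation_pvalue(scores_a, scores_b, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation test on paired score differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_perm):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_perm + 1)

# Hypothetical per-fold accuracies for two configurations on the same 10 folds.
config_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81]
config_b = [0.78, 0.77, 0.80, 0.79, 0.80, 0.76, 0.79, 0.81, 0.78, 0.78]

p_value = paired_permutation_pvalue(config_a, config_b)
print(f"mean difference = {mean(config_a) - mean(config_b):.3f}, p = {p_value:.4f}")
```

Deciding the threshold before running the batch of experiments is part of having the configurations thought out in advance.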

In summary, the process of improving results involves:

Algorithm Tuning: where discovering the best models is treated like a search problem through model
parameter space.
Ensemble Methods: where the predictions made by multiple models are combined.
Extreme Feature Engineering: where the attribute decomposition and aggregation seen in data preparation is
pushed to the limits.
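As a small illustration of the ensemble idea, the sketch below combines the class predictions of three models by majority vote. The predictions, the labels and the function name are invented for this example; voting is only one of several standard ways to combine models.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine class predictions from several models, one test row at a time."""
    combined = []
    for row_votes in zip(*predictions_per_model):
        combined.append(Counter(row_votes).most_common(1)[0][0])
    return combined

def accuracy(predicted, actual):
    """Fraction of rows where the prediction matches the true label."""
    return sum(p == t for p, t in zip(predicted, actual)) / len(actual)

# Hypothetical predictions from three models on the same six test rows.
model_a = ["spam", "ham", "spam", "ham", "ham", "spam"]
model_b = ["spam", "spam", "spam", "ham", "spam", "ham"]
model_c = ["ham", "ham", "spam", "ham", "ham", "ham"]
truth = ["spam", "ham", "spam", "ham", "ham", "ham"]

ensemble = majority_vote([model_a, model_b, model_c])

for name, preds in [("model_a", model_a), ("model_b", model_b),
                    ("model_c", model_c), ("ensemble", ensemble)]:
    print(f"{name:<9} accuracy={accuracy(preds, truth):.2f}")
```

In this toy case each model makes its mistakes on different rows, so the vote corrects all of them; models that make the same mistakes would gain nothing from voting.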

You can discover more about this process in the post:


How to Improve Machine Learning Results

5. Present Results
The results of a complex machine learning problem are meaningless unless they are put to work. This typically
means a presentation to stakeholders. Even if it is a competition or a problem I am working on for myself, I still go
through the process of presenting the results. It’s a good practice and gives me clear learnings I can build upon
next time.

The template I use to present results is below and may take the form of a text document, formal report or
presentation slides.

Context (Why): Define the environment in which the problem exists and set up the motivation for the research
question.
Problem (Question): Concisely describe the problem as a question that you went out and answered.
Solution (Answer): Concisely describe the solution as an answer to the question you posed in the previous
section. Be specific.
Findings: Bulleted lists of discoveries you made along the way that interest the audience. They may be
discoveries in the data, methods that did or did not work or the model performance benefits you achieved
along your journey.
Limitations: Consider where the model does not work or questions that the model does not answer. Do not
shy away from these questions; a claim about where the model excels is more trusted if you can also define where it
does not excel.
Conclusions (Why+Question+Answer): Revisit the “why”, research question and the answer you discovered
in a tight little package that is easy to remember and repeat for yourself and others.

You can discover more about using the results of a machine learning project in the post:

How to Use Machine Learning Results

Summary
In this post, you have learned my general template for working through a machine learning problem.

I use this process almost without fail and I use it across platforms, from Weka, R and scikit-learn to newer
platforms I have been playing around with like pylearn2.

What is your process? Leave a comment and share.

Will you copy this process, and if so, what changes will you make to it?


About Jason Brownlee


Jason Brownlee, PhD is a machine learning specialist who teaches developers how to get results with modern
machine learning methods via hands-on tutorials.

115 Responses to Applied Machine Learning Process

REPLY 
sadegh May 7, 2015 at 4:31 pm #

thanx

REPLY 
Vipin July 29, 2020 at 8:56 pm #

Hi Jason! I graduated from college almost 2 years ago, and after eliminating a lot of options
for my career path, I want to learn machine learning and AI over the long term to work on some of my dream projects.

So I just have a question. Maybe I missed this somewhere in your articles, I don’t know. Maybe it sounds silly, but if you
can answer this, it will be helpful.

I understand your 5-step systematic process, but can you give me a little hint: after spot checking all the
compatible algorithms and their complementary processes in the Weka tool, how do you extract your model
from tools like Weka, and how do you implement it? For example, take the popular iris dataset.
REPLY 
Jason Brownlee July 30, 2020 at 6:20 am #

Yes, here is an example:


https://machinelearningmastery.com/save-machine-learning-model-make-predictions-weka/

REPLY 
Amaal September 30, 2020 at 3:42 pm #

Thanks a lot!

REPLY 
json July 15, 2015 at 5:10 am #

very helpful post

REPLY 
Paul June 30, 2019 at 11:03 am #

Thank you very much for a great article, has made me question some of the processes I have
implemented around this. Excellent level and language.

REPLY 
Jason Brownlee July 1, 2019 at 6:29 am #

Thanks Paul.

REPLY 
Raihan Masud September 22, 2015 at 5:52 am #

Well-thought-out process. How about adding visualization after/during data selection/pre-processing to view the
distribution of the data? You might be doing it implicitly as part of the Data Prep. step. Thanks for sharing your
process. Very useful.

REPLY 
Robert Chumley April 2, 2016 at 12:45 am #

I like to take an Agile approach to Machine Learning where we apply and look for the highest priority
outcomes first. The data can have a ton of value and looking at it from the stakeholders perspective and what they
want to accomplish from the data is important. Then, based on the list of outcomes the stakeholders are looking for,
work backward to find the individual value based on the results. I also add an additional step of formalization where
the results are put into code and placed into reusable modules. This way, we can always reuse the result for later
applications.

REPLY 
Utku December 21, 2018 at 9:40 am #
This article does not consider a full business-oriented view. It intends to give an idea how to break down and
analyze a problem for studies of algorithms like machine learning.

From the industry point of view, I fully agree with Jason. In business language, this flow is called “waterfall”
(process-based). The current trend in the industry is the agile approach.

REPLY 
Danilo Burbano January 24, 2019 at 9:51 pm #

Interesting, an agile approach can also be used for ML projects. Can you share some references, please?

REPLY 
ali July 1, 2016 at 9:37 pm #

Hi,
which NN has a better outcome for spam detection?
Can I try that in Weka or not?
I am looking for a dataset with both content and non-content attributes,
that is, content features plus non-content features such as the length of the email, the time and date
the email was sent, the IP address and the like.

REPLY 
Jason Brownlee July 2, 2016 at 6:19 am #

My advice would be to try a suite of different algorithms to see what works best on your problem.

REPLY 
Ali March 2, 2019 at 1:54 am #

Hello,
I am a beginner in deep learning and I have a grayscale MRI image with a maximum size of
256*256.
Which language is the simplest to use for segmentation with deep learning?
How can I use deep learning to segment MRI images (which architecture is the most
suitable?
how many neurons for each layer?
which type of activation function for each layer?)

REPLY 
Jason Brownlee March 2, 2019 at 9:34 am #

Perhaps look into CNNs.

Specifically, look into methods like R-CNN and YOLO.

I hope to have tutorials on these topics soon.

REPLY 
Murali December 21, 2016 at 11:15 pm #
I am very thankful for your guidance

REPLY 
Jason Brownlee December 22, 2016 at 6:35 am #

You’re welcome Murali.

REPLY 
Jay January 3, 2017 at 10:40 pm #

Hi Jason, we prepare configurations to apply to switches and routers. So we know all the parameters
and we know the intended outcome. Can we look at automating the process of preparing the configuration using ML?
Can you provide any pointers here, please?

REPLY 
Jason Brownlee January 4, 2017 at 8:54 am #

I’m not sure Jay, it almost sounds like a constraint optimization problem rather than a predictive
modeling problem.

Try this process to define your problem and let me know how you go:
https://machinelearningmastery.com/how-to-define-your-machine-learning-problem/

REPLY 
Dante Perez January 27, 2017 at 6:59 am #

@Jay, I work on wireless packet core engineering but I do understand what you’re intending to do.
Automating scripts won’t be applicable to ML since the parameters are all predefined values for switches/routers to
perform. But if you’re trying to predict why a switch is overloading while all metrics are green then, as Jason
mentioned, you can look at several attributes of the switch processors like CPU, port utilization, counters etc. and
other variables that contribute to the switch/router spiking, and then yes, you can use ML to create a predictive
model.

REPLY 
Preeti Agarwal January 10, 2017 at 5:34 pm #

Great I am going to try the above steps

REPLY 
Jason Brownlee January 11, 2017 at 9:26 am #

Let me know how you go!

REPLY 
Ted January 12, 2017 at 9:47 am #

Awesome!!
REPLY 
Jason Brownlee January 13, 2017 at 9:02 am #

I’m glad you found the process useful Ted!

REPLY 
Mixymol March 13, 2017 at 3:50 pm #

Sir, Thank you. I started.

REPLY 
Jason Brownlee March 14, 2017 at 8:13 am #

I’m glad to hear it!

REPLY 
Giri March 16, 2017 at 6:04 pm #

Jason,
What would be your recommendations (blog posts, books, courses etc) for learning/mastering feature engineering?
Seems to me that it is as important as selecting proper algorithm.

REPLY 
Jason Brownlee March 17, 2017 at 8:25 am #

Start here:
https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/

REPLY 
Giri March 29, 2017 at 3:06 am #

Thanks

REPLY 
Karunakaran April 6, 2017 at 6:39 pm #

Perfect outline

REPLY 
Jason Brownlee April 9, 2017 at 2:37 pm #

Thanks!

REPLY 
prasanna May 19, 2017 at 7:22 pm #
very helpful.

REPLY 
Jason Brownlee May 20, 2017 at 5:36 am #

Thanks, I’m really glad to hear that.

REPLY 
Winayak Wagle May 31, 2017 at 8:22 pm #

Very useful in getting an overview and proceeding to next steps.

REPLY 
Jason Brownlee June 2, 2017 at 12:45 pm #

I’m glad to hear that.

REPLY 
harouna June 10, 2017 at 12:13 am #

Thanks a lot! Very helpful!

I find your blogs very interesting

REPLY 
Jason Brownlee June 10, 2017 at 8:25 am #

Thanks, I’m glad to hear that.

REPLY 
Lautaro June 23, 2017 at 12:15 pm #

I love this type of post; it makes my head clear about this topic.

Thanks!

REPLY 
Jason Brownlee June 24, 2017 at 7:55 am #

I’m glad it helped!

REPLY 
ARUNESH GUPTA June 24, 2017 at 1:12 am #

Can you tell me how to start with machine learning from scratch? Which course should I take, or can you suggest
some sources?
REPLY 
Jason Brownlee June 24, 2017 at 8:03 am #

Start here:
https://machinelearningmastery.com/start-here/

REPLY 
Raj Kumar Thapa August 29, 2017 at 10:53 am #

First, thanks a lot. I am doing a master’s in CS, so after more than 14 years I am back in academia. I was working
with SQL databases. I want to use machine learning to predict something in the future. What (task model) should I follow
that could guide me to get results effectively and quickly?

REPLY 
Jason Brownlee August 29, 2017 at 5:12 pm #

The process in the above post is my best advice, more here:


https://machinelearningmastery.com/start-here/#process

Perhaps a good tool for you to use with this process would be Weka as no coding is required:
https://machinelearningmastery.com/start-here/#weka

REPLY 
Bill Ern September 7, 2017 at 2:14 pm #

Jason,

Great post on giving a newbie to machine learning a place to start and work through a problem. I will use your outline
and make modifications as needed. Hopefully after working some problems I can post any modifications made.

REPLY 
Jason Brownlee September 9, 2017 at 11:37 am #

Thanks.

REPLY 
Anu September 9, 2017 at 6:53 pm #

I went through multiple sites and courses; the way you explain is incomparable. I was about to give up, now
I will never give up

REPLY 
Jason Brownlee September 11, 2017 at 12:00 pm #

Thanks Anu, hang in there!

REPLY 
Shiloh November 9, 2017 at 10:24 am #
Great post. Very well thought out and informative. It’s almost like you might be a person who thinks logically 🙂

REPLY 
Jason Brownlee November 10, 2017 at 10:29 am #

Ha ha, thanks Shiloh.

REPLY 
Connie November 27, 2017 at 4:11 am #

Thank you very much for sharing & guidance Dr. Brownlee.

This is like fitting the last piece of puzzle to the complex puzzle for me.

REPLY 
Jason Brownlee November 27, 2017 at 5:52 am #

Thanks, I’m glad to hear that.

REPLY 
Omotayo Oshiga December 21, 2017 at 4:24 pm #

Great post Dr. Brownlee,

Please, does your Machine Learning Mastery With Python book follow this process and explain them in details?

REPLY 
Jason Brownlee December 22, 2017 at 5:30 am #

Yes.

REPLY 
SarahM January 6, 2018 at 4:13 am #

Thank you for this guiding process. Can you explain more the 3rd step in defining the problem? Because as
I know solving it depends on the exploration of the data, so how could I know how to solve the problem before and
answer the question: Step 3: How would I solve the problem? Describe how the problem would be solved manually
to flush domain knowledge

REPLY 
Jason Brownlee January 6, 2018 at 5:56 am #

It is a question to help developers think about the problem and how they might code a non-ML solution
to it.

Does that help?

REPLY 
Mohammad Ehtasham Billah January 31, 2018 at 9:13 pm #
Hi,
can you explain attribute decomposition and attribute aggregation?

REPLY 
Mohammad Ehtasham Billah January 31, 2018 at 9:14 pm #

Oh…I have found the required information…thanks

REPLY 
Jason Brownlee February 1, 2018 at 7:20 am #

See this post:


https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/

REPLY 
Mohammad Ehtasham Billah January 31, 2018 at 10:13 pm #

When presenting results, do we need to go through all 6 points that you mentioned for all of the
algorithms we used for solving the problem? Or just for the final algorithm that performed best on that specific problem?

REPLY 
Jason Brownlee February 1, 2018 at 7:22 am #

You can choose how to work through the process for your project.

REPLY 
Jesús Martínez February 28, 2018 at 12:59 am #

Really nice process. Thank you very much.

I’d like to know how much time do you devote to each phase? And how many times you go through all the five
phases, on average?

Thanks a lot for your time and attention.

REPLY 
Jason Brownlee February 28, 2018 at 6:05 am #

As much as I have.

Some projects are fast, just a few hours, some are days or weeks.

REPLY 
Sachidanand Tripathi April 2, 2018 at 4:02 pm #

Thanks Jason, it does give me some insight as to where to start plus the cross validation technique
illustrated is awesome, I am putting it at work.
REPLY 
Jason Brownlee April 3, 2018 at 6:27 am #

I’m glad to hear that.

REPLY 
phil April 24, 2018 at 3:32 am #

How do you build an attrition model for credit cards? Any ideas or reading material? What are the variables of
interest?

REPLY 
Jason Brownlee April 24, 2018 at 6:36 am #

Like a churn model?

Start by searching here:


http://scholar.google.com/

REPLY 
phil April 24, 2018 at 3:33 am #

any special algorithms to use?

REPLY 
Jason Brownlee April 24, 2018 at 6:37 am #

Yes, random forest and stochastic gradient boosting seem to do very well on lots of problems.

REPLY 
Emmanuel June 13, 2018 at 11:12 am #

Great work, and awesome information. Thanks Jason. Your page has been a guide for me.

REPLY 
Jason Brownlee June 13, 2018 at 3:05 pm #

I’m glad it helped.

REPLY 
Rick July 7, 2018 at 5:42 pm #


Great article. I think the five steps should be a circle when you find the best result

REPLY 
Jason Brownlee July 8, 2018 at 6:18 am #
As in, repeat the process?

REPLY 
ragav July 10, 2018 at 8:17 pm #

As a researcher who also understands the data practitioner’s perspective, I find that providing solutions and
methods like this is really awesome. I am also gaining confidence by reading your blog posts, Jason

REPLY 
Jason Brownlee July 11, 2018 at 5:55 am #

Thanks, I’m glad to hear that.

REPLY 
Marco September 17, 2018 at 10:03 pm #

Did you describe or apply your 5-step process for ML in detail in one of your books?

REPLY 
Jason Brownlee September 18, 2018 at 6:14 am #

I show how to use the process in each of my “Machine Learning Mastery with …” books, e.g. in R,
Python and Weka.

REPLY 
Bisoi November 21, 2018 at 9:16 pm #

my process is lockbox back-end operation service. can i get advise.

REPLY 
Jason Brownlee November 22, 2018 at 6:24 am #

I don’t follow, can you elaborate?

REPLY 
kooshi November 22, 2018 at 5:10 am #

Dear Jason,

I want to do a machine learning project, but I have a big problem: every subject that I choose, someone has already worked
on it, for example diabetes prediction and heart attack prediction.

And my main question is: how can I tell what a dataset needs to predict (the predicted attribute)?

For example: https://archive.ics.uci.edu/ml/datasets/Iris or https://archive.ics.uci.edu/ml/datasets/Primary+Tumor

What do they want from us?


REPLY 
Jason Brownlee November 22, 2018 at 6:26 am #

Perhaps this framework will help:


https://machinelearningmastery.com/how-to-define-your-machine-learning-problem/

REPLY 
Rabiu December 12, 2018 at 7:46 pm #

Hi, do you have any MATLAB code for prediction using deep learning?
Thanks for the help.

REPLY 
Jason Brownlee December 13, 2018 at 7:51 am #

This is a common question that I answer here:


https://machinelearningmastery.com/faq/single-faq/do-you-have-tutorials-in-octave-or-matlab

REPLY 
madusha January 8, 2019 at 7:27 pm #

thank you sir!

REPLY 
Jason Brownlee January 9, 2019 at 8:42 am #

I’m happy that it helped.

REPLY 
Ahmed January 18, 2019 at 3:25 am #

Thanks a lot, your articles are really helpful, I always go around learning more and more, and again come
here accidentally, and stay for days just reading your articles and learning from your long experiences .

REPLY 
Jason Brownlee January 18, 2019 at 5:47 am #

Thanks, I’m glad they help!

REPLY 
Nicko February 22, 2019 at 7:40 pm #

Thank you for a comprehensive post and related ones! I am quite new in ML and this framework is exactly
what I need now. I will share some ideas when I come up with them.

REPLY 
Jason Brownlee February 23, 2019 at 6:30 am #
Thanks.

REPLY 
Mohamed March 21, 2019 at 8:33 pm #

Thanks a lot for this useful information. I just have a small question: where can I learn the basics or theory?
When you say classifier, modelling, or supervised vs. unsupervised, it seems there are steps I am missing or a flow I
am not aware of. Where can I find this information?

REPLY 
Jason Brownlee March 22, 2019 at 8:26 am #

Good question, this may help:


https://machinelearningmastery.com/faq/single-faq/what-are-the-machine-learning-basics-i-need-to-know

REPLY 
Ngel Rojas April 3, 2019 at 1:05 pm #

great job.!

REPLY 
Jason Brownlee April 3, 2019 at 4:13 pm #

Thanks.

REPLY 
sandipan sarkar June 30, 2019 at 5:30 am #

I think jason as a beginner if I follow the website whole heartedly I will be the master in machine learning

REPLY 
Jason Brownlee June 30, 2019 at 9:44 am #

Thanks.

REPLY 
Ethan Day August 22, 2019 at 10:52 am #

What software do you use to code on? I apologize if it is an obvious answer, but I am a young student,
aspiring to become an engineer. I love solving problems and I wanted an early start on Machine Learning.

REPLY 
Jason Brownlee August 22, 2019 at 1:58 pm #

Good question, see this post:


https://machinelearningmastery.com/machine-learning-development-environment/
REPLY 
franklin September 18, 2019 at 4:33 am #

thanks.
in a situation like electricity consumption data. how is the data captured? is it through sensors or other means?
thank you

REPLY 
Jason Brownlee September 18, 2019 at 6:21 am #

I believe a smart meter.

REPLY 
Yoni Krichevsky December 12, 2019 at 9:19 pm #

Hi Jason,

Awesome website and resources! Have been on the developer to machine learning journey for the last half a year,
too bad didn’t find the website earlier.

Question about the Spot Check Algorithms step and the connection to Feature Engineering.

In preparing the data, and feature engineering step, we might have some doubts about features: whether a feature is
helpful, whether to leave the feature continuous or to bin it, whether to have 2 highly correlated features, or just one
of them, and which one etc. etc.

We might at this stage select an initial way forward (like keeping all features and pruning them later), with the goal of
checking these assumptions later. Then according to the process you described, we would do the Spot Checking.
This would leave us with just a few of the best algorithms.

However, it’s possible that when we later do some changes in the features (drop / add / change format etc.), an
algorithm that we dropped at the Spot Checking stage could have performed better than the alternatives we are left
with.

One way to do it is not to do the Spot Checking, and work with all algorithms. However, then the process is
computationally and time expensive.

How do you handle this issue?

Thank you,
Yoni

REPLY 
Yoni Krichevsky December 13, 2019 at 1:32 am #

Also, how do you deal with the issue of different algorithms needing different features / transformation to
perform the best? I sometimes find 2 different feature sets for different models which increases complexity.

REPLY 
Jason Brownlee December 13, 2019 at 6:04 am #

A “model” is the data prep + algorithm + config.


REPLY 
Jason Brownlee December 13, 2019 at 6:00 am #

Thanks!

Yes, the process can be iterative as you prepare different views of your data.

It can be made simpler by focusing on a subset of views/data prep methods, and a subset of models in order to
see what works generally well, then use that as a starting point for a more detailed exploration.

Does that help?

REPLY 
Yoni Krichevsky December 13, 2019 at 6:34 pm #

Hi Jason,
Not sure I fully follow. I have done numerous models, am usually very methodological about it, and reach
great results.
However,
1. It takes quite some calendar time to reach good results (a week or a few weeks): find a very good feature
engineering view, algorithms, settings per algorithm and ensembling.
2. The model is sometimes too complex due to different algorithms requiring different views.
3. The model takes a somewhat long time to train due to having a few different algorithms and views etc.
4. I often times find myself taking the code that I have, and changing something on a very initial step, which
requires to redo many steps in the middle. I do it not on a whim, but as part of methodological process and
decision.

I was wondering how I can streamline the process, make it simpler and faster (both runtime and overall time
it takes until I’m happy with the results).

The process you described above seems like something that can help me.

However, although you write in other places that the process is usually iterative, and you many times start
from the beginning, specifically in this post it seems that the process is one step after another one which
would indeed speed it up.

However, because of the few questions I raised, I’m not sure how one can completely prune out models in the
spot-checking step, unless they perform totally awfully. If they are close to other models, and if we are not sure
that we won’t change features, how can one remove algorithms?

In general, do you have more tips and tricks how to make the process faster / not to have to start over too
many times / work with less algorithms?

Thank you!

REPLY 
Jason Brownlee December 14, 2019 at 6:12 am #

Hahah, yes I experience the same steps – this is not uncommon.

I have gone down the road of automating large parts of the process many times, and it is always a waste
of time given the specifics that change with each new dataset. Applied predictive modeling is a hard
problem and will remain so. Like software engineering. We keep trying to automate away the work.

Yes, this process is a one-shot to get you a “good enough” result quickly, which is what most people
need. Rarely do we need a really great result – unless it’s kaggle. You lose a lot with the one-step
approach, as you point out.
Re automation, I might have some code around that can help, e.g.:
https://machinelearningmastery.com/spot-check-machine-learning-algorithms-in-python/

I have a wonderful codebase I have developed that I might open source one day. CSV in, summary of
data prep + model + config that gives good/best results as output. I basically built an ML SaaS for myself
for regression, classification, time series, imbalanced classification, etc. 🙂
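
On the question of pruning models at the spot-check step when their scores are close, one possible heuristic is to keep every model whose mean cross-validation score falls within a couple of standard errors of the best one, and drop only the clear losers. The sketch below is an illustration, not the approach used in the codebase mentioned above: the function name `shortlist`, the `n_std_errors` parameter, and the scores are all made up for the example, and fold scores are not truly independent, so treat the cutoff as a rough guide.

```python
from math import sqrt
from statistics import mean, stdev


def shortlist(cv_scores, n_std_errors=2.0):
    """Return the names of models whose mean score is within
    n_std_errors standard errors of the best mean, best first."""
    # Mean and standard error of the mean for each model's fold scores.
    summary = {
        name: (mean(scores), stdev(scores) / sqrt(len(scores)))
        for name, scores in cv_scores.items()
    }
    # The best model sets the cutoff that the others must reach.
    best_mean, best_se = max(summary.values(), key=lambda ms: ms[0])
    threshold = best_mean - n_std_errors * best_se
    kept = [name for name, (m, _) in summary.items() if m >= threshold]
    return sorted(kept, key=lambda name: summary[name][0], reverse=True)


# Made-up accuracy scores from 5-fold cross-validation, for illustration only.
example_scores = {
    "logistic_regression": [0.82, 0.79, 0.84, 0.80, 0.82],
    "random_forest": [0.84, 0.80, 0.85, 0.82, 0.83],
    "knn": [0.70, 0.68, 0.73, 0.69, 0.71],
}

print(shortlist(example_scores))
```

With these made-up numbers, `knn` is dropped while `random_forest` and `logistic_regression` both stay on the shortlist, so a later change to the features can still reorder the close contenders.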
REPLY 
Bhaskaran February 19, 2020 at 4:47 pm #

Can you give me the titles of the books that you have published and tell me how to procure them (Amazon or any other way)?
Thanks for all your knowledge sharing.

REPLY 
Jason Brownlee February 20, 2020 at 6:06 am #

You can see the whole catalog of books and bundles here:
https://machinelearningmastery.com/products/

I do not offer my books on Amazon, here’s why:
https://machinelearningmastery.com/faq/single-faq/why-arent-your-books-on-amazon

REPLY 
Lamone March 21, 2020 at 6:39 am #

Is there a section on your website where you write about combining applied machine learning with
applications? For example, exactly how do you build a model and then implement it in software? I’m focused on making smarter
software for the consumer. I just need some general guidance.

Thank You

REPLY 
Jason Brownlee March 21, 2020 at 8:28 am #

Yes, there are hundreds of examples.

Perhaps start with this process:
https://machinelearningmastery.com/start-here/#process

REPLY 
George Mills July 13, 2020 at 10:47 pm #

You really seem to know your stuff inside and out. I am rather new to this whole field, although I would like to
think my general programming knowledge is rather deep, been doing it since ’94. I am trying to take, for starters, the
World Factbook as a data source and determine the optimum point on Maslow’s hierarchy of needs which the world
can hope to achieve. Any thoughts on the nature of the problem and methods and techniques I can use would be so
greatly appreciated.

Thank you,
George
REPLY 
Jason Brownlee July 14, 2020 at 6:24 am #

Not sure that is a machine learning problem, perhaps start with this framework for thinking about
supervised learning:
https://machinelearningmastery.com/how-to-define-your-machine-learning-problem/

REPLY 
Michael Mora-Poveda October 24, 2020 at 4:00 am #

Hello Jason,

Thanks for this post, it is extremely useful. I saved your recommendations in my GitHub as a guide for the future!!!

Cheers,

Michael

REPLY 
Jason Brownlee October 24, 2020 at 7:09 am #

Thanks!

REPLY 
JC Chouinard February 24, 2021 at 10:47 am #

Well, I don’t have a process. But running SEO experiments is similar in some ways: repeating a test
multiple times, on different websites or different parts of a website, and discarding tests that have no statistical
significance. Summarising multiple experiments with box plots is something that I never thought of, but I will start
doing it. Thanks Jason.
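
For readers who want to try the box-plot summary mentioned here, the five numbers a box plot draws (minimum, lower quartile, median, upper quartile, maximum) can be computed with the Python standard library alone. This is a minimal sketch: the helper name `five_number_summary` and the experiment results are hypothetical, and the "inclusive" quartile method is just one of several conventions.

```python
from statistics import quantiles


def five_number_summary(scores):
    """Return the five values a box plot draws for one experiment."""
    # n=4 asks for quartiles; "inclusive" treats the data as the full population.
    q1, median, q3 = quantiles(scores, n=4, method="inclusive")
    return {"min": min(scores), "q1": q1, "median": median, "q3": q3, "max": max(scores)}


# Made-up results from repeated runs of two hypothetical experiments.
experiments = {
    "experiment_a": [0.78, 0.80, 0.81, 0.83, 0.85],
    "experiment_b": [0.74, 0.79, 0.80, 0.86, 0.90],
}

for name, scores in experiments.items():
    print(name, five_number_summary(scores))
```

Comparing the summaries side by side shows both the typical result and the spread of each experiment, which a single average would hide.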

REPLY 
hamidreza mazandarani April 29, 2021 at 11:17 pm #

Hello, and thanks.

I installed Python from python.org, but I can’t install component libraries such as numpy or comcv:

>>> pip install numpy
Traceback (most recent call last):
  File "", line 1, in <module>
NameError: name 'pip' is not defined

Thank you.

REPLY 
Jason Brownlee April 30, 2021 at 6:06 am #

The “pip” command is run from the command prompt, not the python interpreter.
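
To make the distinction concrete, here is a small sketch that can be run as a Python script: it checks which packages the current interpreter can import and prints the pip command to type at the command prompt (not at the `>>>` prompt). The helper names `missing_packages` and `install_command` are made up for this example, and the `python -m pip` form is used on the assumption that pip was installed alongside Python.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the package names that cannot be imported by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def install_command(names):
    """Build the pip command for the same interpreter that runs this script."""
    return [sys.executable, "-m", "pip", "install", *names]


needed = ["numpy"]  # import names to check; edit as required
missing = missing_packages(needed)
if missing:
    # Run the printed line at the command prompt, not at the >>> prompt.
    print("Run this at the command prompt:", " ".join(install_command(missing)))
else:
    print("All packages are already installed.")
```

Note that the name used in `import` is not always the name pip expects, so check the package's documentation if the printed command fails.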
© 2024 Guiding Tech Media. All Rights Reserved.

