Machine Learning (1) : Inteligência Artificial E Cibersegurança (Inacs)
• Delivers a roadmap to follow while planning and carrying out a data mining project
• Provides best practices for getting faster and better results from data mining
• Nonproprietary
CRISP-DM
Business Understanding
• Understanding the business problem
• The data scientist must carefully evaluate the end goal of the data mining project
• What is the true goal of the project, and what are the most important things to know about the business?
• Understand the project from a business perspective and convert it into data mining subtasks where modeling techniques can be applied
Data Understanding
• Data we have VS Data we need
• Initial collection of data
• Describe the data
• Explore the data
• Exploratory Data Analysis (EDA)
• Evaluate the data
• Quality
• Availability
• Granularity
• Frequency
• Result evaluation
• How effective is the model at meeting the business goals?
• Review Process
• Plan Monitoring
• How do we evaluate how well the solution responds to the “real world”?
• Will the algorithm be retrained?
• Final Report
• Describes the full CRISP-DM process and the decisions made
• Review Project
Exploratory Data Analysis
• A set of procedures for creating explanatory and graphical summaries of the data
• Allows analyzing the data as they are, without making any assumptions
• Types of Data
• Categorical Data
• Nominal
• Ordinal
• Numerical Data
• Discrete
• Continuous
Nominal
• Values characterize discrete units that have no inherent ordering
• Changing the order of units does not alter their meaning
• Example:
• Color
• Blue, White, Green, Red, Yellow, …
• Language
• Portuguese, English, Italian, French, …
Ordinal
• Values characterize discrete and ordered units
• Changing the order of units alters their meaning
• The distance between adjacent units is not necessarily the same
• Example:
• Level of Education
• Elementary, High School, Undergraduate, Graduate, …
• Level of Expertise
• Low, Medium, High, Expert, …
Discrete
• Example:
• Number of Students
• No decimals are allowed
Continuous
• Example:
• Height
• 1.70, 1.95, … (in meters)
• Weight
• 50, 70, 100, … (in kilograms)
Why is it important?
• Statistical methods are designed to work with certain types of data
• Many methods for analyzing continuous data differ from those for analyzing categorical data
• Knowing a dataset’s data types is very important for Data Understanding, Data Preparation and, ultimately, Modeling
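A minimal sketch of why the data type drives the choice of summary statistic, using only Python's standard library. The sample values and the `levels` ordering are invented for illustration and are not part of the course material.

```python
from statistics import mean, median_low, mode

# Hypothetical sample values, invented for illustration only.
colors = ["Blue", "Red", "Blue", "Green", "Blue"]          # nominal
education = ["High School", "Elementary", "Graduate",
             "Undergraduate", "High School"]                 # ordinal
heights_m = [1.70, 1.95, 1.62, 1.80, 1.75]                  # continuous

# Nominal: values can only be counted, so the mode is the natural summary.
most_common_color = mode(colors)

# Ordinal: order is meaningful, so a median can be taken over the ranks.
# The distances between levels are not, so a mean of ranks would mislead.
levels = ["Elementary", "High School", "Undergraduate", "Graduate"]
ranks = [levels.index(e) for e in education]
median_education = levels[median_low(ranks)]

# Continuous: arithmetic is meaningful, so the mean is well defined.
mean_height = mean(heights_m)

print(most_common_color, median_education, round(mean_height, 3))
```

Taking the mean of `colors` would be meaningless, while the mode of `heights_m` would usually be uninformative, since continuous values rarely repeat.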
Descriptive Statistics
• A single variable can assume multiple values
• We have a distribution of values
• If you roll two dice, which sums are most likely to be rolled?
• The sums near the middle: 7 is the most likely, followed by 6 and 8
• One well-known strategy for the board game “The Settlers of Catan” is to place your first settlements next to the 6 and 8 tiles…
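The two-dice distribution can be checked by enumerating all 36 equally likely outcomes. This is a small illustrative sketch; the variable name `sum_counts` is ours.

```python
from collections import Counter
from itertools import product

# Enumerate all 36 equally likely outcomes of two six-sided dice
# and count how many of them produce each sum.
sum_counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Print a small text histogram of the distribution.
for total in sorted(sum_counts):
    ways = sum_counts[total]
    print(f"{total:2d}: {ways} of 36  {'#' * ways}")
```

The histogram peaks at 7 (6 ways out of 36), with 6 and 8 next (5 ways each), and falls off symmetrically toward 2 and 12 (1 way each).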
Descriptive Statistics
• Distribution variability – the dispersion or spread of values
• Range – the maximum value minus the minimum value
• Standard deviation – dispersion relative to the mean
• Interquartile Range – the difference between the third quartile and the first quartile
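The three measures above can be sketched with Python's `statistics` module. The weights are hypothetical, and the quartile convention is an assumption: `method="inclusive"` is one of several common definitions, and other tools may return slightly different quartiles for the same data.

```python
from statistics import pstdev, quantiles

# Hypothetical weights in kilograms, invented for illustration.
weights_kg = [50, 55, 60, 65, 70, 75, 100]

# Range: maximum value minus minimum value.
value_range = max(weights_kg) - min(weights_kg)

# Standard deviation: dispersion relative to the mean.
# pstdev treats the data as the whole population; use stdev for a sample.
std_dev = pstdev(weights_kg)

# Interquartile range: third quartile minus first quartile.
q1, q2, q3 = quantiles(weights_kg, n=4, method="inclusive")
iqr = q3 - q1

print(f"range={value_range}, std_dev={std_dev:.2f}, IQR={iqr}")
```

Note how the single value of 100 stretches the range to 50 while the IQR, which only looks at the middle half of the data, stays at 15.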
Descriptive Statistics
• We have various alternatives for describing central tendency and variability
• The right choice depends on what we learn about our data while exploring it with graphs
• Outliers are unusually small or unusually large values
• Easy to spot with a box plot
• They are values that extend beyond the fences
• Outliers are defined as values more than 1.5 times the interquartile range below the first quartile or above the third quartile
• Extreme values are defined as values more than 3 times the interquartile range below the first quartile or above the third quartile
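The fence rule can be sketched as follows. The function name `classify_by_fences` and the sample data are hypothetical, and the sketch assumes the fences are measured from the first and third quartiles, computed with the `inclusive` quartile convention. It reports values between the 1.5 and 3 IQR fences separately from values beyond the 3 IQR fences.

```python
from statistics import quantiles

def classify_by_fences(values):
    """Split values into (outliers, extremes) using box-plot fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr

    # Beyond the outer fences: extreme values.
    extremes = [v for v in values if v < outer_low or v > outer_high]
    # Beyond the inner fences but within the outer ones: outliers.
    outliers = [v for v in values
                if (v < inner_low or v > inner_high)
                and outer_low <= v <= outer_high]
    return outliers, extremes

# Hypothetical weights in kilograms, invented for illustration.
weights_kg = [50, 52, 55, 58, 60, 62, 65, 68, 70, 72, 75, 100, 150]
outliers, extremes = classify_by_fences(weights_kg)
print("outliers:", outliers, "extremes:", extremes)
```

For this data the quartiles are 58 and 72, so the inner fences sit at 37 and 93 and the outer fences at 16 and 114: 100 is flagged as an outlier and 150 as an extreme value.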
Z-Score
• A good way of finding outliers (rule of thumb: |z| > 3)
• Assumes a normal distribution, but it can also be used with other distributions…
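A minimal sketch of the z-score rule of thumb, assuming the usual definition z = (x - mean) / standard deviation. The helper `z_scores` and the data are hypothetical. The population standard deviation is used here; using the sample standard deviation is an equally common choice.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the z-score of each value: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so none deviates from the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical weights in kilograms: 20 ordinary values and one suspect.
weights_kg = [68, 69, 70, 71, 72] * 4 + [150]

# Flag values whose z-score exceeds 3 in absolute value.
flagged = [v for v, z in zip(weights_kg, z_scores(weights_kg)) if abs(z) > 3]
print("flagged:", flagged)
```

One caveat: the outlier itself inflates both the mean and the standard deviation, so in very small samples a value may never reach |z| > 3 no matter how extreme it is.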
Robust Statistics
• If you encounter an extreme or outlier value, then
• Check if there is a data entry error
• Correct the error
• If the value is a legitimate extreme or outlier value (and because outliers can influence the mean and the standard deviation), then
• Consider the use of robust statistics, such as
• Interquartile Range
• Median
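To illustrate why the median and the interquartile range are called robust, the sketch below compares them with the mean and the standard deviation before and after adding a single extreme value. The data and the helper `iqr` are hypothetical, and the quartiles use the `inclusive` convention as an assumption.

```python
from statistics import mean, median, pstdev, quantiles

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical weights in kilograms, invented for illustration.
clean = [50, 55, 60, 65, 70, 75, 80]
with_outlier = clean + [500]  # one legitimate but extreme value

for label, data in (("clean", clean), ("with outlier", with_outlier)):
    print(f"{label:13s} mean={mean(data):8.3f} std={pstdev(data):8.3f} "
          f"median={median(data):6.2f} IQR={iqr(data):6.2f}")
```

With this data the mean jumps from 65 to 119.375, while the median only moves from 65 to 67.5 and the IQR from 15 to 17.5.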