Machine Learning (1) : Inteligência Artificial E Cibersegurança (Inacs)

Uploaded by

Hil_Silva
Machine Learning (1)

INTELIGÊNCIA ARTIFICIAL E CIBERSEGURANÇA (INACS)


NUNAL@ISEP.IPP.PT
OMS@ISEP.IPP.PT
CRISP - DM
• Cross Industry Standard Process for Data Mining

• Methodology or a process that:


• Helps provide a blueprint to conduct a data mining project
• Includes Machine Learning

• Delivers a roadmap to follow while planning and carrying out a data mining project
• Provides best practices for getting faster and better results from data mining

• Nonproprietary
Business Understanding
• Understanding the business problem
• The data scientist must carefully evaluate the end goal of the data mining project

• Characterize business objective


• Evaluate the problem
• Define data mining objectives
• Supply a project plan

• What is the true goal of the project, and what are the most important factors to know about the business?
• Understand the project from a business perspective and convert it into data mining subtasks to which modeling techniques can be applied
Data Understanding
• Data we have VS Data we need
• Initial collection of data
• Describe the data
• Explore the data
• Exploratory Data Analysis (EDA)
• Evaluate the data
• Quality
• Availability
• Granularity
• Frequency

• Create a hypothesis based on the data quality


Data Preparation
• Define the final dataset for the modeling phase
• Select the data
• Clean the data
• Which attributes are missing?
• Which records/values are invalid?
• Construct the data
• Create new records
• Describe new attributes
• Data integration
• Combine records from multiple data sources
Modeling
• Generate Test Design
• Shapes how models are created and evaluated
• Data Preparation steps should be done according to test design
• If we select a test set to evaluate our model, no process that influences the modeling phase can have explicit knowledge of that data
• Otherwise, any evaluation would be futile
• Even Data Understanding should be done carefully
• Knowing the probability distribution of the entire dataset can help us choose a more appropriate algorithm…
• But in the real world? Do we have full information regarding the exact
probability distribution of our data?

• More on that later…
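The test-design rule above can be sketched as follows. This is a minimal illustration that assumes a simple hold-out split and standardization as the preparation step; both choices, the sample values, and the names are assumptions made for this example. The scaling statistics are learned from the training portion only, so the test set stays unseen.

```python
import random
import statistics

# Hypothetical feature values; in practice these come from the prepared dataset.
values = [12.0, 15.5, 9.8, 22.1, 18.4, 30.2, 11.7, 14.9, 25.3, 16.6]

# 1. Split first, so nothing computed later can "see" the test set.
rng = random.Random(42)
shuffled = values[:]
rng.shuffle(shuffled)
split = int(0.8 * len(shuffled))
train, test = shuffled[:split], shuffled[split:]

# 2. Learn the preparation parameters from the training data only.
train_mean = statistics.mean(train)
train_std = statistics.stdev(train)


def standardize(x):
    """Scale a value with the statistics learned on the training set."""
    return (x - train_mean) / train_std


# 3. Apply the same transformation to both parts.
train_scaled = [standardize(x) for x in train]
test_scaled = [standardize(x) for x in test]
```

Computing the mean and standard deviation on the full dataset before splitting would leak information about the test set into the preparation step.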


Modeling
• Select the Model
• Which models are more appropriate to fulfill the business goals and to handle the data
we have?
• Neural networks? Other?
• Fine Tune the model
• Determine the best set of hyperparameters of the chosen model(s)

• Create the Model


• As easy as calling the “fit(X, y)” method on a scikit-learn estimator

• Test the Model


• Generate a validation set and evaluate the quality of the model
• Different from Evaluation
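A minimal sketch of the "Create the Model" and "Test the Model" steps with scikit-learn. The synthetic dataset, the choice of LogisticRegression, and the 80/20 split are assumptions made for illustration; the slides do not prescribe a specific model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the prepared dataset (an assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out a validation set that the model never sees during fitting.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)  # "Create the Model"

val_accuracy = model.score(X_val, y_val)  # "Test the Model"
print(f"Validation accuracy: {val_accuracy:.3f}")
```

The validation score only checks the technical quality of the model; whether it serves the business goals is judged in the Evaluation phase.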
Evaluation
• Choose appropriate evaluation metrics to serve the business goals
• Is accuracy enough?

• Result evaluation
• How effective is this model for the business goals?

• Review Process

• Decide/define next Steps
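To see why accuracy alone may not be enough, consider a hypothetical intrusion-detection dataset where attacks are rare. The labels and the always-benign "model" below are invented for this sketch.

```python
# Hypothetical labels: 1 = attack, 0 = benign traffic (990 benign, 10 attacks).
y_true = [0] * 990 + [1] * 10
# A "model" that always predicts the majority class.
y_pred = [0] * 1000

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
recall = tp / (tp + fn)  # share of real attacks that were detected

print(f"accuracy = {accuracy:.2%}, recall = {recall:.2%}")
```

The accuracy is 99% even though no attack is detected (recall is 0%), so a metric such as recall is likely closer to the business goal here.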


Deployment
• Deployment Plan
• How will the algorithm(s) be used?
• Incorporated in an existing software system? Create another one?

• Plan Monitoring
• How to evaluate how well the solution is responding to the “real world”?
• Will the algorithm be retrained?

• Final Report
• Describing full CRISP-DM process and decisions

• Review Project
Exploratory Data Analysis
• Set of procedures for creating explanatory and graphical summaries of the data

• Lets us analyze the data as they are, without making any assumptions

• A useful way to understand relationships among variables

• Helps identify problems such as data entry errors


Types of Data
• Determining the type of data contained in a variable is a crucial first step in data analysis
• Allows us to identify appropriate statistical procedures

• When identifying the type of data for a variable
• It is important to first determine whether the data is categorical or numerical

• Types of Data
• Categorical Data
• Nominal
• Ordinal
• Numerical Data
• Discrete
• Continuous
Nominal
• Values characterize discrete units that have no inherent ordering
• Changing the order of the units does not alter their meaning

• Example:
• Color
• Blue, White, Green, Red, Yellow, …
• Language
• Portuguese, English, Italian, French, …
Ordinal
• Values characterize discrete and ordered units
• Changing the order of the units alters their meaning
• The distance between units is not necessarily the same

• Example:
• Level of Education
• Elementary, High School, Undergraduate, Graduate, …
• Level of Expertise
• Low, Medium, High, Expert, …

• Any other ordered category


Discrete
• We speak of discrete data if its values are distinct and separate
• Discrete data can’t be measured but it can be counted

• Example:
• Number of Students
• No decimals are allowed

• Virtually any count…


Continuous
• Continuous Data can’t be counted but can be measured

• Example:
• Height
• 1.70, 1.95, … (in meters)
• Weight
• 50, 70, 100, … (in kilograms)
Why is it important?
• Statistical methods are designed to work with certain types of data

• Many methods for analyzing continuous data differ from those for analyzing categorical data

• Knowing a given dataset’s data types is very important for Data Understanding, Data Preparation and, ultimately, Modeling
Descriptive Statistics
• A single variable can assume multiple values
• We have a distribution of values

• Central tendency – the location of the distribution


• The center of the distribution does not accurately represent every value in the distribution; it represents the typical value
• Median – the value in the middle
• Mean – the arithmetic mean
• Mode – the most frequent value

• If you roll two dice, which sums are most likely to be rolled?
• The ones near the middle: 7 is the most likely, followed by 6 and 8
• One of the well-known strategies for “The Settlers of Catan” board game is to place your first settlements next to the tiles numbered 6 and 8 (a 7 produces no resources)
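The two-dice question can be checked by enumerating all 36 equally likely outcomes. This is a small sketch; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

# Enumerate all 36 equally likely outcomes of two six-sided dice.
sums = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Print a simple text histogram of the distribution of sums.
for total in sorted(sums):
    print(f"{total:2d}: {sums[total]}/36 {'#' * sums[total]}")

most_likely = max(sums, key=sums.get)
```

The counts peak at 7 (6 of 36 outcomes), with 6 and 8 next (5 of 36 each), which makes 7 the mode of this distribution.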
Descriptive Statistics
• Distribution variability – the dispersion or spread of values
• Range – the maximum value minus the minimum value
• Standard deviation – dispersion relative to the mean
• Interquartile Range – the difference between the third quartile and the first quartile
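The three measures of variability can be computed with Python's standard library. The sample values are invented for this sketch, and the quartiles use the default method of statistics.quantiles; other tools may interpolate quartiles slightly differently.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 10, 12]

value_range = max(data) - min(data)           # maximum minus minimum
std_dev = statistics.stdev(data)              # sample standard deviation
q1, q2, q3 = statistics.quantiles(data, n=4)  # the three quartile cut points
iqr = q3 - q1                                 # interquartile range

print(f"range={value_range}, std={std_dev:.2f}, IQR={iqr:.2f}")
```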
Descriptive Statistics
• We have various alternatives for describing central tendency and variability

• Which one to use?


• Depends on the data…
• We can’t calculate the mean of nominal data, but we can determine the mode
• If the set of values to be analyzed contains outliers (unusually large or unusually small values), then the median is likely to be more appropriate than the mean
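A short sketch of both points, using invented values: the mode works for nominal data, and a single outlier pulls the mean far more than the median.

```python
import statistics

# Nominal data: only the mode is meaningful.
languages = ["Portuguese", "English", "Portuguese", "French", "Portuguese"]
most_common_language = statistics.mode(languages)

# Numerical data with one outlier (hypothetical monthly salaries, in euros).
salaries = [1200, 1300, 1250, 1400, 1350, 25000]
mean_salary = statistics.mean(salaries)      # pulled up by the outlier
median_salary = statistics.median(salaries)  # stays near the typical value

print(most_common_language, mean_salary, median_salary)
```

Here the mean is 5250 while the median is 1325, so the median describes the typical salary better.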
Descriptive Statistics for Nominal Data
• Summarize the information through:
• Frequencies
• Proportions (also known as “relative frequency”)
• Percentages
• Display with a pie chart or bar chart
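Frequencies, proportions and percentages for a nominal variable can be sketched with collections.Counter. The color values are made up for the example.

```python
from collections import Counter

colors = ["Blue", "Red", "Blue", "Green", "Blue", "Red", "White", "Blue"]

frequencies = Counter(colors)        # absolute counts per category
total = sum(frequencies.values())
proportions = {color: count / total for color, count in frequencies.items()}

for color, count in frequencies.most_common():
    share = proportions[color]
    print(f"{color:<6} freq={count}  prop={share:.3f}  pct={share:.1%}")
```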
Descriptive Statistics for Ordinal Data
• Summarize the information (variability) through
• Frequencies
• Proportions (also known as “relative frequency”)
• Percentages
• Mode
• Percentiles
• Interquartile range
• Display with a pie chart or bar chart to illustrate the results
Descriptive Statistics for Continuous Data
• Summarize the information through
• Percentages
• Central tendency
• Median
• Mode
• Mean
• Percentiles
• Interquartile Range
• Display with
• Boxplot
• Histogram
Relationship between two variables
• Describing the relationship between two variables is more complicated

• We must consider the type of data in both variables


• One might be continuous and the other categorical
• Both variables might be categorical
• Both variables might be continuous
• These type combinations lead to different statistical and graphical summaries

• How do values of one variable change as values of another variable change?
• Do large values of one variable correspond to large values of another variable?
• Do low values of one variable correspond to low values of another variable?
Relationship between two variables
• Example (two continuous variables – scatter plot)
• Example (two categorical variables – side-by-side bar chart)
• Example (one continuous and one categorical – side-by-side box plot)
Robust Statistics
• When should I use the Interquartile Range instead of the Standard Deviation?
• When should I use a boxplot instead of a bar chart?
• When should I use the mean instead of the median?

• The answer depends on what we learn about our data while exploring it with graphs
• Outliers are unusually small or unusually large values
• Easy to spot with box plot
• They are values that extend beyond the fences
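• Outliers are defined as values more than one and a half times the interquartile range below the first quartile or above the third quartile
• Extreme values are defined as values more than three times the interquartile range below the first quartile or above the third quartile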
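The fence rule can be sketched as below. The helper name tukey_fences and the sample data are assumptions for this example, and the quartiles come from statistics.quantiles with its default method, so the exact fence values may differ slightly from other tools.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 11, 13]

low, high = tukey_fences(data, k=1.5)          # outlier fences
ext_low, ext_high = tukey_fences(data, k=3.0)  # extreme-value fences

outliers = [x for x in data if x < low or x > high]
extremes = [x for x in data if x < ext_low or x > ext_high]

print("outliers:", outliers, "extremes:", extremes)
```

In this sample the value 102 lies beyond both sets of fences, so it is flagged as an outlier and as an extreme value.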
Z-Score
• A good way of finding outliers (rule of thumb: |z| > 3)
• Assumes a normal distribution, but it can also be used with other distributions…
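A minimal z-score sketch. The sample data and the use of the population standard deviation are assumptions made for this example.

```python
import statistics


def z_scores(values):
    """Return (x - mean) / std for each value, using the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]


# Hypothetical response times in milliseconds, with one suspicious value.
data = [50, 52, 49, 51, 48, 50, 53, 47, 51, 50,
        49, 52, 50, 48, 51, 50, 49, 52, 51, 250]

# Flag values whose absolute z-score exceeds the rule-of-thumb threshold.
flagged = [x for x, z in zip(data, z_scores(data)) if abs(z) > 3]
print("flagged:", flagged)
```

Note that in very small samples the rule cannot fire: with the population standard deviation, the largest possible |z| in a sample of n values is the square root of n - 1, so at least 11 values are needed before any |z| can exceed 3.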
Robust Statistics
• If you encounter an extreme or outlier value, then
• Check if there is a data entry error
• Correct the error

• If the value is a legitimate extreme or outlier value (and because outliers can influence the mean and the standard deviation), then
• Consider the use of robust statistics, such as
• Interquartile Range
• Median

• Mean and Median are similar when there are no outliers


• Interquartile Range and standard deviation are similar when there are no outliers
Robust Statistics
• Median and Interquartile Range are proper ways to describe central tendency and variability in the presence of outliers and extreme values

• When analyzing two continuous variables
• Bivariate outlier – a value that is unusually large or small on both variables
• Easy to identify in a scatter plot
• Can have an adverse impact on the Pearson correlation coefficient
• When building a Correlation Matrix:
• Consider using Spearman Rank Order Correlation instead of Pearson Correlation
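The effect of an outlier on the two coefficients can be sketched in plain Python. The helper functions and data are written for this illustration only, and the rank helper assumes there are no tied values. With pandas, DataFrame.corr(method="spearman") offers the rank-based alternative directly.

```python
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def ranks(values):
    """Rank values from 1 to n (this sketch assumes there are no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# y grows steadily with x, except for the last point, which is an outlier.
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 95.0]

print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

Because Spearman works on ranks, the single extreme point does not change it: the relationship is still perfectly monotonic. Pearson, in contrast, drops well below 1.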
Correlation Matrix
In Practice
• https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
• Load data with pandas
• https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
• Data manipulation with pandas cheat sheet
• https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html
• Useful dataset information – data types, column count, number of nulls, memory usage
• https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html
• Summarize the central tendency, dispersion and shape of a dataset’s distribution
• https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html
• Plotting with pandas
• https://matplotlib.org/stable/gallery/
• Plotting with matplotlib (more variety)
• https://seaborn.pydata.org/examples/many_pairwise_correlations.html
• Correlation Matrix with seaborn
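A brief sketch of how the pandas calls listed above fit together. The inline CSV and its column names are hypothetical and stand in for a real file path passed to read_csv.

```python
import io

import pandas as pd

# A small inline CSV stands in for a real file (hypothetical data).
csv_text = """height_m,weight_kg,language
1.70,70,Portuguese
1.95,100,English
1.62,50,Portuguese
1.80,,Italian
"""

df = pd.read_csv(io.StringIO(csv_text))

df.info()                # data types, non-null counts, memory usage
summary = df.describe()  # count, mean, std, min, quartiles, max (numeric columns)

# Frequencies of a nominal column.
language_counts = df["language"].value_counts()

# Rank-based correlation matrix of the numeric columns.
correlations = df[["height_m", "weight_kg"]].corr(method="spearman")

print(summary)
print(language_counts)
print(correlations)
```

The empty weight in the last row is read as a missing value, which info() and describe() make visible through the lower non-null count for that column.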