
Machine Learning (1) : Inteligência Artificial E Cibersegurança (Inacs)

Uploaded by

Hil_Silva


Machine Learning (1)

INTELIGÊNCIA ARTIFICIAL E CIBERSEGURANÇA (INACS)

NUNAL@ISEP.IPP.PT
OMS@ISEP.IPP.PT
CRISP - DM
• Cross Industry Standard Process for Data Mining

• A methodology or process that:

• Provides a blueprint for conducting a data mining project
• Includes Machine Learning

• Delivers a roadmap to follow while planning and carrying out a data mining project
• Provides best practices for faster and better results from data mining

• Nonproprietary
CRISP - DM
Business Understanding
• Understanding the business problem
• The data scientist must carefully evaluate the end goal of the data mining project

• Characterize business objective

• Evaluate the problem
• Define data mining objectives
• Supply a project plan

• What is the true goal of the project, and what are the most important things to know about the business?
• Understand the project from a business perspective and convert it into data mining subtasks where modeling techniques can be applied
Data Understanding
• Data we have VS Data we need
• Initial collection of data
• Describe the data
• Explore the data
• Exploratory Data Analysis (EDA)
• Evaluate the data
• Quality
• Availability
• Granularity
• Frequency

• Create a hypothesis based on the data quality

Data Preparation
• Define the final dataset for the modeling phase
• Select the data
• Clean the data
• What are the missing attributes
• What records/values are invalid
• Construct the data
• Create new records
• Describe new attributes
• Data integration
• Combine records from multiple data sources
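The cleaning, integration and construction steps above can be sketched with pandas. This is a minimal illustration on invented toy data: the host names, column names and the choice of median imputation are assumptions, not something the slides prescribe.

```python
import pandas as pd

# Hypothetical toy data: two sources describing the same hosts.
logins = pd.DataFrame({
    "host": ["a", "b", "c", "d"],
    "failed_logins": [3, None, 250, -1],  # None = missing, -1 = invalid count
})
traffic = pd.DataFrame({
    "host": ["a", "b", "c", "d"],
    "bytes_out": [1200, 800, 90000, 500],
    "bytes_in": [600, 800, 3000, 1000],
})

# Clean: which attributes are missing, which values are invalid?
missing_per_column = logins.isna().sum()
valid = logins["failed_logins"] >= 0
logins["failed_logins"] = logins["failed_logins"].where(valid)  # invalid -> NaN

# Fill the gaps with the median (one option among several; an assumption here).
median_value = logins["failed_logins"].median()
logins["failed_logins"] = logins["failed_logins"].fillna(median_value)

# Integrate: combine records from both sources.
dataset = logins.merge(traffic, on="host", how="inner")

# Construct: derive a new attribute from existing ones.
dataset["out_in_ratio"] = dataset["bytes_out"] / dataset["bytes_in"]
print(dataset)
```

If a test set is part of the test design, an imputation value such as the median should be computed on the training portion only.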
Modeling
• Generate Test Design
• Shapes how models are created and evaluated
• Data Preparation steps should be done according to test design
• If we set aside a test set to evaluate our model, no step that influences the modeling phase may have explicit knowledge of that data
• Otherwise, any evaluation would be futile
• Even Data Understanding should be done carefully
• Knowing the probability distribution of the entire dataset can help us choose a more appropriate algorithm…
• But in the real world, do we have full information about the exact probability distribution of our data?

• More on that later…
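One way to respect the test design in practice is to split first and fit every preparation step on the training portion only. The sketch below assumes scikit-learn is available and uses synthetic data; the 25% test size and the choice of StandardScaler are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=2.0, size=(200, 3))
y = (X[:, 0] > 10.0).astype(int)

# 1. Set the test set aside before any data-dependent preparation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 2. Learn preparation parameters (here: mean and scale) from training data only.
scaler = StandardScaler().fit(X_train)

# 3. Apply the same, already fitted transformation to both parts.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Calling `fit` on the full dataset instead would let test-set statistics leak into the modeling phase, which is exactly what the test design is meant to prevent.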

Modeling
• Select the Model
• Which models are more appropriate to fulfill the business goals and to handle the data
we have?
• Neural networks? Other?
• Fine Tune the model
• Determine the best set of hyperparameters of the chosen model(s)

• Create the Model

• As easy as calling the “fit(X, y)” method on a scikit-learn estimator

• Test the Model

• Generate a validation set and evaluate the quality of the model
• Different from the Evaluation phase, which judges the model against the business goals
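The modeling steps above (select, fine-tune, create, test) might look like this with scikit-learn. The decision tree, the `max_depth` grid and the synthetic dataset are assumptions chosen for brevity, not recommendations from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data, for illustration only.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Select the model and the hyperparameter values to try.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    cv=5,  # 5-fold cross-validation plays the role of the validation set
)

# Create the model: fit(X, y), here once per candidate and fold.
search.fit(X_train, y_train)

best_depth = search.best_params_["max_depth"]
validation_score = search.best_score_  # mean cross-validated accuracy
print(best_depth, round(validation_score, 3))
```

`X_test` and `y_test` are deliberately left untouched here; they belong to the later Evaluation phase.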
Evaluation
• Choose appropriate evaluation metrics to serve the business goals
• Is accuracy enough?

• Result evaluation
• How effectively does this model serve the business goals?

• Review Process

• Decide/define next Steps
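The question "Is accuracy enough?" can be illustrated with a small arithmetic example. The numbers are invented: 10 attacks among 1000 events, and a hypothetical model that labels everything as benign.

```python
# Invented, heavily imbalanced ground truth: 1 = attack, 0 = benign.
y_true = [1] * 10 + [0] * 990
# A hypothetical "model" that always predicts benign.
y_pred = [0] * 1000

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # attacks caught
tn = sum(t == 0 and p == 0 for t, p in pairs)  # benign correctly passed
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarms
fn = sum(t == 1 and p == 0 for t, p in pairs)  # attacks missed

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

Accuracy looks excellent while no attack is detected, so a metric such as recall or precision may serve the business goal better in a case like this.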

Deployment
• Deployment Plan
• How will the algorithm(s) be used?
• Incorporated in an existing software system? Create another one?

• Plan Monitoring
• How will we evaluate how well the solution performs in the “real world”?
• Will the algorithm be retrained?

• Final Report
• Describing full CRISP-DM process and decisions

• Review Project
Exploratory Data Analysis
• A set of procedures for creating explanatory and graphical summaries of the data

• Lets us analyze the data as they are, without making any assumptions

• A useful way to understand relationships among variables

• Helps identify problems such as data entry errors

Types of Data
• Determining the type of data contained in a variable is a crucial first step in data analysis
• It lets us identify appropriate statistical procedures

• When identifying the type of data for a variable

• It is important to first determine whether the data is categorical or numerical

• Types of Data
• Categorical Data
• Nominal
• Ordinal
• Numerical Data
• Discrete
• Continuous
Nominal
• Values characterize discrete units that have no inherent ordering
• Changing the order of the units does not alter their meaning

• Example:
• Color
• Blue, White, Green, Red, Yellow, …
• Language
• Portuguese, English, Italian, French, …
Ordinal
• Values characterize discrete and ordered units
• Changing the order of the units alters their meaning
• The distance between units is not necessarily the same

• Example:
• Level of Education
• Elementary, High School, Undergraduate, Graduate, …
• Level of Expertise
• Low, Medium, High, Expert, …

• Any other ordered category

Discrete
• We speak of discrete data if its values are distinct and separate
• Discrete data can’t be measured but it can be counted

• Example:
• Number of Students
• No decimals are allowed

• Virtually any count…

Continuous
• Continuous Data can’t be counted but can be measured

• Example:
• Height
• 1.70, 1.95, … (in meters)
• Weight
• 50, 70, 100, … (in kilograms)
Why is it important?
• Statistical methods are designed to work with certain types of data

• Many methods for analyzing continuous data differ from those for analyzing categorical data

• Knowing the data types in a given dataset is very important for Data Understanding, Data Preparation and, ultimately, Modeling
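In pandas, the four data types can be made explicit so that later steps treat each column appropriately. This is a small sketch; the column names and values are invented, loosely following the examples on the previous slides.

```python
import pandas as pd

df = pd.DataFrame({
    "language": ["Portuguese", "English", "Italian", "English"],  # nominal
    "expertise": ["Low", "Expert", "Medium", "High"],             # ordinal
    "students": [25, 30, 18, 30],                                 # discrete
    "height_m": [1.70, 1.95, 1.62, 1.80],                         # continuous
})

# Nominal: categories without any order.
df["language"] = pd.Categorical(df["language"])

# Ordinal: categories with an explicit order.
levels = ["Low", "Medium", "High", "Expert"]
df["expertise"] = pd.Categorical(df["expertise"], categories=levels, ordered=True)

# Order-based operations only make sense for the ordinal column.
above_medium = df.loc[df["expertise"] > "Medium", "expertise"].tolist()
highest_expertise = df["expertise"].max()

print(df.dtypes)
print(above_medium, highest_expertise)
```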
Descriptive Statistics
• A single variable can assume multiple values
• We have a distribution of values

• Central tendency – the location of the distribution

• The center of distribution – does not accurately represent every value in the
distribution, represents the typical value
• Median – the value in the middle
• Mean – the arithmetic mean
• Mode – the most frequent value

• If you roll two dice, which sums are most likely to be rolled?
• The sums near the middle: 7 is the most likely, followed by 6 and 8
• One of the well-known strategies for “The Settlers of Catan” board game is to place your first settlements next to the 6 and 8 tiles (a 7 activates the robber and produces no resources)
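The two-dice question can be checked by enumerating all 36 equally likely outcomes. A short sketch using only the standard library:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 (die1, die2) outcomes produce each sum.
sums = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Convert counts to exact probabilities.
probabilities = {total: Fraction(count, 36) for total, count in sorted(sums.items())}

# The mode of the distribution: the sum with the highest probability.
most_likely = max(sums, key=sums.get)

for total, p in probabilities.items():
    print(total, p)
print("most likely sum:", most_likely)
```

The distribution is symmetric around 7, so mean, median and mode coincide there, with 6 and 8 tied for second place.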
Descriptive Statistics
• Variability of a distribution – the dispersion or spread of its values
• Range – the maximum value minus the minimum value
• Standard deviation – dispersion relative to the mean
• Interquartile Range – the difference between the third quartile and the first quartile
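The central tendency and variability measures defined above can be computed with Python's standard `statistics` module. The sample values are invented, and the quartile method ("inclusive", i.e. linear interpolation between data points) is one of several conventions.

```python
import statistics

values = [1, 2, 2, 3, 4, 5, 6, 9]  # invented sample

# Central tendency
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Variability
value_range = max(values) - min(values)
std = statistics.pstdev(values)  # population standard deviation
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(mean, median, mode, value_range, round(std, 3), iqr)
```

Other quartile conventions (for example `method="exclusive"`) can give slightly different values for the same data, which matters mostly for small samples.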
Descriptive Statistics
• We have various alternatives for describing central tendency and variability

• Which one to use?

• Depends on the data…
• We can’t calculate the mean of nominal data, but we can determine the mode
• If the set of values to be analyzed contains outliers (unusually large or unusually small values), then the median is likely to be more appropriate than the mean
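A tiny numeric example shows why the median is preferred when outliers are present. The values are invented; a single extreme value is appended to an otherwise tight sample.

```python
import statistics

clean = [48, 49, 50, 51, 52]
with_outlier = clean + [500]  # one unusually large value

mean_clean = statistics.mean(clean)
median_clean = statistics.median(clean)

mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print("without outlier:", mean_clean, median_clean)
print("with outlier:   ", mean_outlier, median_outlier)
```

One extreme value moves the mean from 50 to 125, while the median only moves from 50 to 50.5.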
Descriptive Statistics for Nominal Data
• Summarize the information through:
• Frequencies
• Proportions (also known as “relative frequency”)
• Percentages
• Display with a pie chart or bar chart
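Frequencies, proportions and percentages for a nominal variable can be obtained with pandas `value_counts`. The color values below are invented for illustration.

```python
import pandas as pd

colors = pd.Series(["Blue", "Red", "Blue", "Green", "Blue", "Red"])

frequencies = colors.value_counts()                # absolute counts
proportions = colors.value_counts(normalize=True)  # relative frequency
percentages = proportions * 100
mode = colors.mode()[0]                            # most frequent category

summary = pd.DataFrame({
    "frequency": frequencies,
    "proportion": proportions,
    "percentage": percentages,
})
print(summary)
print("mode:", mode)
```

A bar chart of the same counts could then be drawn with `frequencies.plot(kind="bar")`, assuming matplotlib is installed.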
Descriptive Statistics for Ordinal Data
• Summarize the information (variability) through
• Frequencies
• Proportions (also known as “relative frequency”)
• Percentages
• Mode
• Percentiles
• Interquartile range
• Display with a pie chart or bar chart to illustrate the results
Descriptive Statistics for Continuous Data
• Summarize the information through
• Percentages
• Central tendency
• Median
• Mode
• Mean
• Percentiles
• Interquartile Range
• Display with
• Boxplot
• Histogram
Relationship between two variables
• Describing the relationship between two variables is more complicated

• We must consider the type of data in both variables

• One might be continuous and the other categorical
• Both variables might be categorical
• Both variables might be continuous
• These type combinations lead to different statistical and graphical summaries

• How do values of one variable change as values of another variable change?

• Do large values of one variable correspond to large values of another variable?
• Do low values of one variable correspond to low values of another variable?
Relationship between two variables
• Example (two continuous variables – scatter plot)
Relationship between two variables
• Example (two categorical variables – side-by-side bar chart)
Relationship between two variables
• Example (one continuous and one categorical – side-by-side box plot)
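Besides the plots, each of the three type combinations has a natural numerical summary. The sketch below uses pandas on an invented network-traffic table; all column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "protocol": ["tcp", "tcp", "udp", "udp", "tcp", "udp"],
    "label": ["benign", "attack", "benign", "benign", "attack", "attack"],
    "duration": [1.0, 4.0, 2.0, 1.5, 5.0, 3.5],
    "bytes": [100, 400, 200, 150, 500, 350],
})

# Two categorical variables: contingency table of counts.
contingency = pd.crosstab(df["protocol"], df["label"])

# One continuous and one categorical: summarize the continuous one per group.
median_by_label = df.groupby("label")["duration"].median()

# Two continuous variables: correlation coefficient (Pearson by default).
pearson = df["duration"].corr(df["bytes"])

print(contingency)
print(median_by_label)
print(round(pearson, 3))
```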
Robust Statistics
• When should I use the Interquartile Range instead of the Standard Deviation?
• When should I use a boxplot instead of a bar chart?
• When should I use the mean instead of the median?

• The answer depends on what we learn about our data while exploring it with graphs
• Outliers are unusually small or unusually large values
• Easy to spot with a box plot
• They are values that extend beyond the fences
• Outliers are values more than 1.5 times the interquartile range below the first quartile or above the third quartile
• Extreme values are values more than 3 times the interquartile range beyond those quartiles
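The fence rule above can be written out directly. This sketch assumes NumPy and its default (linear interpolation) percentile method; the sample values are invented.

```python
import numpy as np

values = np.array([10, 12, 12, 13, 14, 15, 16, 18, 30, 100])  # invented sample

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Inner fences: 1.5 * IQR beyond the quartiles -> outliers.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
# Outer fences: 3 * IQR beyond the quartiles -> extreme values.
outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr

outliers = values[(values < inner_low) | (values > inner_high)]
extremes = values[(values < outer_low) | (values > outer_high)]

print("IQR:", iqr)
print("outliers:", outliers.tolist())
print("extreme values:", extremes.tolist())
```

Every extreme value is also an outlier; here 30 falls between the inner and outer fence, while 100 lies beyond both.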
Z-Score
• A good way of finding outliers (rule of thumb: |z| > 3)
• Assumes a normal distribution, but it can be used with other distributions…
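The z-score of a value is its distance from the mean measured in standard deviations, z = (x - mean) / std. A minimal sketch with NumPy follows; the data are invented, and the population standard deviation (NumPy's default) is an assumption.

```python
import numpy as np

# Invented sample: 20 values close to 50 and one unusually large value.
values = np.array([48, 49, 50, 51, 52] * 4 + [200])

# z-score: distance from the mean in units of standard deviation.
z_scores = (values - values.mean()) / values.std()

# Rule of thumb: flag values with |z| > 3.
outliers = values[np.abs(z_scores) > 3].tolist()

print(np.round(z_scores, 2))
print("outliers:", outliers)
```

Because the outlier itself inflates both the mean and the standard deviation, the rule can miss outliers in very small samples; the IQR fences are less affected by this.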
Robust Statistics
• If you encounter an extreme or outlier value, then
• Check if there is a data entry error
• Correct the error

• If the value is a legitimate extreme or outlier value (and because outliers can influence the mean and the standard deviation), then
• Consider the use of robust statistics, such as
• Interquartile Range
• Median

• Mean and Median are similar when there are no outliers

• Interquartile Range and standard deviation lead to similar conclusions about spread when there are no outliers
Robust Statistics
• Median and Interquartile Range are proper ways to describe central tendency and variability in
the presence of outliers and extreme values

• When analyzing two continuous variables

• A bivariate outlier is an unusually large or small value on both variables
• It is easy to identify in a scatter plot
• It can have an adverse impact on the Pearson correlation coefficient
• When building a Correlation Matrix:
• Consider using Spearman Rank Order Correlation instead of Pearson Correlation
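The effect of a single bivariate outlier on the two coefficients can be seen on a small invented example, using the `method` parameter of pandas `DataFrame.corr`. The relationship below is perfectly monotonic, but the last point is far from the line formed by the others.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "y": [2, 4, 6, 8, 10, 12, 14, 16, 18, 1000],  # last point is an outlier
})

# Correlation matrices with both methods.
pearson_matrix = df.corr(method="pearson")
spearman_matrix = df.corr(method="spearman")

pearson = pearson_matrix.loc["x", "y"]
spearman = spearman_matrix.loc["x", "y"]

print("Pearson: ", round(pearson, 3))
print("Spearman:", round(spearman, 3))
```

Spearman works on ranks, so it still reports a perfect monotonic relationship, while Pearson drops to roughly 0.54 because of one point.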
Correlation Matrix
In Practice
• https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
• Load data with pandas
• https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
• Data manipulation with pandas cheat sheet
• https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html
• Useful dataset information – data types, column count, number of nulls, memory usage
• https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html
• Summarize the central tendency, dispersion and shape of a dataset’s distribution
• https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html
• Plotting with pandas
• https://matplotlib.org/stable/gallery/
• Plotting with matplotlib (more variety)
• https://seaborn.pydata.org/examples/many_pairwise_correlations.html
• Correlation Matrix with seaborn
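A first look at a dataset with the pandas functions listed above might go as follows. To keep the sketch self-contained, the CSV content is an invented in-memory string rather than a real file; with a file on disk, a path would be passed to `read_csv` instead.

```python
import io

import pandas as pd

# Invented CSV content standing in for a real file.
csv_text = """height_m,weight_kg,language
1.70,50,Portuguese
1.95,100,English
1.80,70,Portuguese
"""

# Load data with pandas.
df = pd.read_csv(io.StringIO(csv_text))

# Data types, non-null counts and memory usage.
df.info()

# Central tendency and dispersion of the numeric columns.
summary = df.describe()
print(summary)

# Rank-based correlation matrix of the numeric columns.
corr_matrix = df[["height_m", "weight_kg"]].corr(method="spearman")
print(corr_matrix)
```

The resulting `corr_matrix` is the kind of table that the seaborn example linked above renders as a heatmap.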
