Lesson 5
Data Collection and Enhancement
Legal Disclaimers
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by
this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of
merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising
from course of performance, course of dealing, or usage in trade.
Copies of documents which have an order number and are referenced in this document may be obtained
by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others
Copyright © 2018 Intel Corporation. All rights reserved.
Learning Objectives
You will be able to:
▪ Describe data sources and types.
▪ Recognize situations where more data samples or features are needed.
▪ Explain data wrangling, augmentation, and feature engineering.
▪ Describe different data preprocessing methods.
▪ Explain ways to label data.
▪ Identify challenges when working with data.
Data Collection and Preprocessing
The machine learning workflow pairs each stage with a guiding question:
• Problem Statement: What problem are you trying to solve?
• Data Collection: What data do you need to solve it?
• Data Exploration & Preprocessing: How should you clean your data so your model can use it?
• Modeling: Can you build a model to solve your problem?
• Validation: Did you solve the problem?
• Decision Making & Deployment: How do you communicate results to stakeholders or put the model into production?
Data Collection
There are several things to consider when collecting data.
• Where does the data come from?
• What type of data is there?
• How much data and what attributes do I need?
Data Sources
Data is sourced from many different places, for example:
• Human generated
• Internet of Things (IoT) and machine generated
• Public website
• Legacy documents
• Multimedia
Human Generated Data: Social Media
About 90% of the world’s data was created within the last two years.
• Experts believe that about 70% of this data is coming from social media.
• Our current amount of data output is 2.5 quintillion bytes per day
(2.5 exabytes per day).
Human Generated Data: Social Media
There are various APIs to access this data.
• Facebook*: Graph API
• Twitter*: REST API or Streaming API
• Instagram*: Graph API or Platform API
• LinkedIn*: REST API
• Pinterest*: REST API
Human Generated Data: Media & Publications
Social media is not the only source of human generated data; every year
humans generate a large amount of media and publications.
• 2.2 million books published every year
• 2 million blog posts published every day
• 269 billion emails sent every day
• Though not publicly available, this data contains valuable information for
companies—e.g. powering services like Google Smart Reply*.
Internet of Things (IoT) Data
IoT data comes from web-enabled devices that collect, send, and act on
the data they acquire from their respective environments.
• By 2020 IoT will include a projected 200 billion smart-and-connected
devices.
• The data produced is expected to double every two
years to total 40 zettabytes (40 trillion gigabytes).
• Collected via sensors, cameras, and processors.
Internet of Things (IoT) Data: Consumer
Consumer IoT provides new pathways for user experience and interfaces.
• Connected cars, smart home devices,
and wearables.
Internet of Things Data: Industrial
Used to monitor and control industrial operations and tools.
• Surgery bots capable of ‘seeing’ during
surgery via cameras.
• Numerous sensors and cameras installed
on autonomous trucks.
• Jet engine sensors providing real-time feedback.
Internet of Things Data: How to Access the Data
By design, it is difficult and expensive to access data that organizations
maintain and control.
• There are some known open-source datasets.
(Ex: https://old.datahub.io/dataset/knoesis-linked-sensor-data)
• Design your own with development platforms such as Raspberry Pi* or
Netduino* P1.
Public Websites
Data that is publicly available on the web gives you access
to multiple genres of data as needed.
• Encyclopedia
(Ex: Wikipedia*)
• Stock data
(Ex: Quandl*)
• Entertainment
(Ex: IMDb*)
Public Websites: How to Access the Data
Webscraping:
• Use with caution: the website’s stance on crawlers and
webscraping is usually within the terms and conditions
section of their site.
• Make sure your crawler follows the rules defined in a
website’s robots.txt file, as in the sketch below.
(Ex: don’t exceed the request rate limit)
APIs:
• Often easier than webscraping.
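Below is a minimal sketch of the etiquette described above: check robots.txt before fetching and throttle requests. It assumes the requests library is installed; the base URL, page paths, user-agent string, and one-second delay are illustrative placeholders, not part of the original lesson.

    import time
    import urllib.robotparser

    import requests

    BASE_URL = "https://example.com"      # hypothetical site
    USER_AGENT = "my-research-bot"

    # Read the site's robots.txt before crawling anything.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(BASE_URL + "/robots.txt")
    robots.read()

    for url in [BASE_URL + "/page/" + str(i) for i in range(1, 4)]:
        if not robots.can_fetch(USER_AGENT, url):
            print("Skipping (disallowed by robots.txt):", url)
            continue
        response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
        print(url, response.status_code, len(response.text))
        time.sleep(1.0)  # stay well under any request rate limit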
Legacy Documents
Several industries have traditionally used paper forms to collect
data, creating huge potential opportunities to shift to digital.
• Two major industries shifting to more digital
approaches are insurance and medicine.
• Medical records have traditionally been
paper-and-pencil.
• The push for digital promises to lead to better
outcomes for patients.
Multimedia
Data is present in more media types than ever before.
• Companies collect text, images, audio,
and video.
• New database technologies have evolved
to store this data.
(MMDBs - multimedia databases)
• New ML techniques (e.g. Deep Learning)
have evolved in part due to the necessity
of analyzing this data.
Data Types
There are different data types.
• Numerical
• Discrete
• Categorical
• Ordinal
• Binary
• Date-time
• Text
• Image
• Audio
Count Data
Integer valued data that comes from counting.
• For example, number of cars in a parking lot.
• The programming term for count data is integer.
• In Python* it is an int.
Numerical Data: Continuous
Numerical value that can represent any quantity over a continuous range.
• Can take on decimal values.
(Ex: engine stroke = 3.40 in)
• Can be reduced to finer levels of precision.
(Ex: engine stroke = 3.3775 in)
• The programming term is floating-point.
• In Python* it is a float.
Categorical Data
Data that is restricted to a finite set of known categories.
• Also known as nominal data.
• Raw categorical data can come in the form
of different data types.
(Ex: text data (vehicle color: red)
or numerical data (number of doors: 4))
• The programming term is enumerated type.
Ordinal Data
Categorical data that is ordered.
• The distance between categories is not known.
(Ex: car price values of low, medium, and high
or customer reviews of poor, ok, and good)
• A common mistake is to simply convert ordinal data into integers.
(Ex: G, PG, and R to 1, 2, and 3)
• The inherent problem is that this assumes the distances between
the categories are known and equal; see the sketch below for an alternative.
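As an alternative to the integer shortcut above, this minimal sketch encodes ordinal data as an ordered categorical in pandas, which preserves the ordering without asserting equal spacing. It assumes pandas is installed; the movie-rating values are illustrative.

    import pandas as pd

    ratings = pd.Series(["G", "R", "PG", "G", "PG"])

    # Declare the ordering explicitly instead of mapping straight to 1, 2, 3.
    rating_type = pd.CategoricalDtype(categories=["G", "PG", "R"], ordered=True)
    ratings = ratings.astype(rating_type)

    print(ratings.cat.categories)  # ordering: G < PG < R
    print(ratings < "R")           # order-aware comparisons now work
    print(ratings.cat.codes)       # integer codes, only if a model truly needs them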
Binary Data
Two mutually exclusive categories.
• Binary data is very common, especially in supervised learning problems.
(Ex: true/false, heads/tails).
• The programming term for binary data is Boolean.
• In Python* it is a bool.
Date-time Data
Data that represents date and time information.
• Date, time of day, and fractional seconds based on a 24-hour
clock are combined.
• It is possible to convert to primitive types.
• Can be converted to categorical data (Ex: to only the date, month, year, or time)
or to an integer via a Unix* time stamp, as in the sketch below.
• Most languages have a built-in datatype for date-time data.
• In Python* it is datetime.
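A minimal sketch of the conversions listed above, assuming pandas is installed; the timestamps are made up for illustration.

    import pandas as pd

    times = pd.Series(pd.to_datetime([
        "2018-03-01 08:15:00",
        "2018-07-14 23:59:59",
        "2018-12-31 12:00:00",
    ]))

    month = times.dt.month_name()  # a categorical-style feature
    hour = times.dt.hour           # an integer feature
    # Unix time stamp in whole seconds.
    unix_seconds = (times - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")

    print(pd.DataFrame({"month": month, "hour": hour, "unix": unix_seconds}))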
Text Data
Alphanumeric strings, a sequence of characters.
• Typically human-readable.
• Can be encoded into computer-readable formats such as ASCII or Unicode.
• Often is unstructured data.
• The programming term for text data is string.
• In Python* it is a str.
Image Data
Still and video images are being generated at a rapid rate.
• They are stored in various ways.
(Ex: PNG and JPEG)
• Images can be very large and computationally intensive.
• They can be multi-dimensional, as in 3D body scans.
• Generated from many types of devices in a diverse set of fields.
(Ex: satellites, self-driving cars, surveillance cameras, mobile devices)
• Medical imaging is an increasingly important area.
Audio
Audio data is used in consumer devices like Alexa*, as well as
industrial settings like call centers.
• Stored in various compressed and uncompressed ways.
(Ex: WAV and MP4)
The Shape of Data
Identify the number of samples and features needed to solve the problem.
• The more features you have, the more samples are generally required.
• If your features don’t contain enough information, then adding more
samples won’t help; additional features are required.
• If you have too few samples, then adding additional features can
cause overfitting.
• Difficult problems usually require more features and additional samples.
How Many Samples?
There are several ways to determine the number of samples needed.
• A learning curve plots the performance score by the number of samples.
• Use this curve to determine if increasing the number of samples is
helping your model (see the sketch below).
• Statistical heuristic: 10x as many samples as degrees of freedom.
• Peruse similar studies to learn from other successes and failures.
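The sketch below plots a learning curve with scikit-learn, along the lines described above. It assumes scikit-learn, NumPy, and matplotlib are installed; the synthetic dataset and the logistic regression model are illustrative stand-ins.

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=np.linspace(0.1, 1.0, 8), cv=5,
    )

    plt.plot(sizes, train_scores.mean(axis=1), label="train")
    plt.plot(sizes, val_scores.mean(axis=1), label="validation")
    plt.xlabel("Number of training samples")
    plt.ylabel("Score")
    plt.legend()
    plt.show()
    # If the validation curve is still rising on the right, more samples may help;
    # if it has flattened, look for better features instead.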
Under-fitting vs Overfitting
A learning curve can also help diagnose whether a model is under-fit or overfit.
• Plot the performance score by the model
complexity.
• More features means more complexity.
• When train and test performance are similar
- but low - the model is under-fit.
• When test performance suffers, while train
performance improves, the model is overfit.
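A minimal sketch of scoring a model across increasing complexity, assuming scikit-learn is installed; using polynomial degree as the complexity knob is an illustrative choice, not something prescribed by the lesson.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import validation_curve
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    X, y = make_classification(n_samples=500, n_features=4, random_state=0)

    model = make_pipeline(PolynomialFeatures(), LogisticRegression(max_iter=2000))
    degrees = [1, 2, 3]
    train_scores, val_scores = validation_curve(
        model, X, y, param_name="polynomialfeatures__degree",
        param_range=degrees, cv=5,
    )

    for d, tr, va in zip(degrees, train_scores.mean(axis=1), val_scores.mean(axis=1)):
        print("degree", d, "train %.3f" % tr, "validation %.3f" % va)
    # Similar-but-low scores suggest under-fitting; a growing train/validation
    # gap suggests overfitting.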
Under-fitting vs Overfitting
There is a relationship between under-fit and over-fit
models and the number of samples and features.
• In general, the number of samples should be
greater than your number of features.
• If overfitting:
• adding more features usually hurts
• adding more samples usually helps
• If under-fitting:
• adding samples generally doesn’t help
• adding strong explanatory variables generally helps
How to Increase the Number of Features?
Increasing the number of features can help when the model is under-fitting.
• Revisit the data pipeline and add features that had previously been removed.
• Design new features - this is referred to as feature engineering.
• Polynomial features are a common method - this builds new features by
multiplying existing ones. (Ex: x1*x2 or x2*x2)
• Images/Audio: add transformations. (Ex: delta features - a measure of how
color changes from pixel-to-pixel - can be added to an existing image)
• Image augmentation for computer vision1.
• Create new images by shifting, flipping, or rotating existing images.
1Refer to ML 501 for more on this.
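A minimal sketch of the polynomial-feature idea above, assuming scikit-learn is installed; the two-column toy matrix is illustrative.

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.array([[1.0, 2.0],
                  [3.0, 4.0]])  # two original features

    poly = PolynomialFeatures(degree=2, include_bias=False)
    X_poly = poly.fit_transform(X)

    # New columns include the pairwise product and the squares of the originals
    # (scikit-learn names the inputs x0 and x1).
    print(poly.get_feature_names_out())
    print(X_poly)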
Reducing Features: Feature Selection
Select a subset of features in order to reduce model complexity
and remove redundant or weak features.
• Filter methods: filter features via a statistical method (p-value).
• Feature is considered on a univariate basis.
• Regularization methods: learn which features best contribute
to accuracy while the model is being created. 1
• Wrapper methods: select subsets of features as a search problem,
where different combinations are evaluated to capture interactions between
features.
1Refer to ML 501 for more on this.
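A minimal sketch of the three families above (filter, regularization, wrapper), assuming scikit-learn is installed; the synthetic dataset, the choice of k=4, and logistic regression as the estimator are illustrative.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE, SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=300, n_features=10,
                               n_informative=4, random_state=0)

    # Filter method: univariate statistical test, keep the k best features.
    filt = SelectKBest(score_func=f_classif, k=4).fit(X, y)
    print("filter keeps features:", filt.get_support(indices=True))

    # Regularization method: an L1 penalty drives weak coefficients to zero.
    l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
    print("nonzero coefficients:", int((l1_model.coef_ != 0).sum()))

    # Wrapper method: recursive feature elimination searches over feature subsets.
    wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
    print("wrapper keeps features:", wrapper.get_support(indices=True))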
Reducing Features: Dimensionality Reduction
Dimensionality reduction is an unsupervised learning technique used
to reduce the number of features. 1
• Reduce dimensions (compress data) without losing too much
information. (Ex: using the eigenvectors of the covariance matrix as
the reduced dimensional space)
1 Refer to ML 501 for more on this.
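A minimal PCA sketch of the idea above, assuming scikit-learn is installed; the synthetic data and the choice of 5 components are illustrative.

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA

    X, _ = make_classification(n_samples=500, n_features=20, random_state=0)

    # Project onto the leading eigenvectors of the covariance matrix.
    pca = PCA(n_components=5).fit(X)
    X_reduced = pca.transform(X)

    print(X_reduced.shape)                      # (500, 5)
    print(pca.explained_variance_ratio_.sum())  # fraction of variance retained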
Datasets for AI
There is no standard type of dataset used for AI.
• They can vary in shape and size.
• They can be made up of many different data types.
• They differ depending on the tasks.
Open Source Repositories
There are several repositories to get open source datasets.
• UCI Machine Learning Repository: Includes many popular datasets used
to analyze algorithms within the ML community.
• ImageNet: Large database of images.
• KDD Cup: An annual competition organized by the Association for
Computing Machinery. Yearly competition datasets are archived.
• Kaggle*: A platform for data science competitions. Competition datasets
and other datasets are available.
• Data.gov: Open data from the US government.
• And many more…
ImageNet
ImageNet is a large image database that is popular in the AI community.
• First presented at the 2009 Conference on Computer Vision and Pattern
Recognition.
• Currently has over 14 million images.
• Hosts an annual object detection and classification competition.
• Deep learning models have had breakthrough results that have led to the
current focus on deep learning for modern AI.
• Many of the most popular deep learning models have come from this
competition.
Example Image Datasets
Below are some popular image dataset examples.
• MNIST: Images of handwritten digits that have served as a popular
benchmark for many machine learning models.
• 70,000 28 x 28 pixel black and white images.
• Cifar-10: A widely used dataset for computer vision research.
• 60,000 32 x 32 pixel color images.
• 10 different classes for classification. 6,000 images for each class.
• ILSVRC: The annual ImageNet competition has several components each
with their own datasets.
• Object localization: 1.2 million training images of 1,000 object classes
• Object detection: 456,567 training images of 200 object classes
Example Natural Language Datasets
Below are some popular language datasets.
• Common Crawl: Crawls the web four times a year and makes its archives
and datasets free for the public.
• The archive consists of 145 TB of data from 1.81 billion webpages as of
2015.
• Stanford Question Answering Dataset: A dataset of over 100,000
question-answer pairs.
• Project Gutenberg: Offers over 56,000 free eBooks.
Example Datasets
There are many other datasets available for different tasks.
• YouTube*-8M Dataset: Roughly 7 million video URLs containing 450,000
hours of video.
• MovieLens* Dataset: A dataset of roughly 20 million ratings on 27,000
movies by 138,000 users.
• OpenStreetMap: A crowdsourcing project to create a free map of the world.
• Over 2 million users collect data using survey, GPS, aerial photographs,
and more.
• 1000 Genomes Project: Human genotype and variation data collected 2008-2015.
• 1,000 genomes with 84.4 million variants from 2,504 individuals.
Data Preprocessing
Data preprocessing is needed because the quality and structure of data is often
not immediately ready for analysis.
• Data obtained via webscraping or APIs is usually unstructured and must be
manipulated into numerical rows and columns for analysis.
• This often involves low-level manipulation of data objects, turning strings into
numeric data and vice versa.
• There could be missing data values.
• Different ML models require different data formatting.
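A minimal sketch of turning raw strings into numeric rows and columns, assuming pandas is installed; the tiny vehicle table is illustrative.

    import pandas as pd

    raw = pd.DataFrame({
        "color": ["red", "blue", "red"],
        "doors": ["4", "2", "4"],              # numbers that arrived as strings
        "price": ["21,500", "18,900", "n/a"],
    })

    clean = pd.DataFrame({
        "doors": pd.to_numeric(raw["doors"]),
        "price": pd.to_numeric(raw["price"].str.replace(",", ""), errors="coerce"),
    })
    # One-hot encode the categorical column into numeric indicator columns.
    clean = clean.join(pd.get_dummies(raw["color"], prefix="color"))

    print(clean)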
Missing Values
Often certain parts of the data are missing.
• Replace missing values with a reasonable default (see the sketch below).
• Features with a symmetric distribution: impute the missing
values with the mean of the feature.
• For skewed or categorical features, impute the median
or mode, respectively.
• If too many features are missing, it is best to
drop the observation.
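The imputation sketch referenced above, assuming pandas and NumPy are installed; the toy table and the choice of which columns count as symmetric, skewed, or categorical are illustrative.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "engine_size": [2.0, np.nan, 3.5, 2.4],   # roughly symmetric -> mean
        "price": [21500, 18900, np.nan, 690000],  # skewed -> median
        "color": ["red", None, "red", "blue"],    # categorical -> mode
    })

    df["engine_size"] = df["engine_size"].fillna(df["engine_size"].mean())
    df["price"] = df["price"].fillna(df["price"].median())
    df["color"] = df["color"].fillna(df["color"].mode()[0])

    # Drop rows that are still missing too many features (here: more than one).
    df = df.dropna(thresh=df.shape[1] - 1)
    print(df)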
Data Preprocessing
Some examples of ML models that require data to be
formatted differently include:
• Models requiring features to have a similar scale.
• Option 1: Scale each feature so that the
maximum is one and the minimum is zero.
• Option 2: Standardize each feature to have a
mean of zero and a standard deviation of one.
• Other models require transforming the dependent (target) variable so it is not
extremely skewed. (Ex: by using logarithms; see the sketch below)
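A minimal sketch of both scaling options and the log transform, assuming scikit-learn and NumPy are installed; the toy values are illustrative.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[1.0, 200.0],
                  [2.0, 300.0],
                  [3.0, 1000.0]])

    # Option 1: rescale each feature to the [0, 1] range.
    print(MinMaxScaler().fit_transform(X))

    # Option 2: standardize each feature to zero mean, unit standard deviation.
    print(StandardScaler().fit_transform(X))

    # Skewed target: a log transform is a common fix.
    y = np.array([10.0, 12.0, 5000.0])
    print(np.log1p(y))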
Labeling Data
We need labeled data for supervised learning tasks, but there is
usually more unlabeled than labeled data.
• Data can be hand-labeled by employees, but this can be
expensive and time consuming.
• Other options include Amazon Mechanical Turk* and
semi-supervised learning techniques.
Amazon Mechanical Turk*
Amazon Mechanical Turk* is an online marketplace for tasks.
• Organizations can post tasks on the website that “Turkers” can then choose to
perform.
• Major open source datasets have been compiled using Amazon Mechanical
Turk.
• The Microsoft Common Objects in Context (COCO) dataset contains images
with five short descriptions each, written by “Turkers”.
Semi-supervised
Semi-supervised learning involves using both labeled
and unlabeled data to solve tasks.
• Build a classifier on the labeled data, use it to label
some of the unlabelled data, and then proceed
iteratively until all data has been labeled.
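A minimal self-training sketch in the spirit of the bullet above, assuming scikit-learn and NumPy are installed; the synthetic data, the 50 initially labeled samples, and the 0.9 confidence threshold are illustrative choices, not part of the original lesson.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

    labeled = np.zeros(len(y), dtype=bool)
    labeled[:50] = True            # pretend only the first 50 samples have labels
    y_work = y.copy()
    y_work[~labeled] = -1          # hide the remaining labels

    while not labeled.all():
        # Fit on the currently labeled data, then pseudo-label confident predictions.
        model = LogisticRegression(max_iter=1000).fit(X[labeled], y_work[labeled])
        proba = model.predict_proba(X[~labeled])
        confident = proba.max(axis=1) >= 0.9
        if not confident.any():    # nothing confident left: label the rest anyway
            confident[:] = True
        idx = np.flatnonzero(~labeled)[confident]
        y_work[idx] = model.predict(X[idx])
        labeled[idx] = True

    print("all samples labeled:", bool(labeled.all()))

scikit-learn also ships a SelfTrainingClassifier (in sklearn.semi_supervised) that wraps this kind of loop.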
Challenges
There are many challenges with the collection, structure, and
usage of data that can lead to poor performance.
• All these can lead to overly optimistic and misleading results or
model failure.
• Some examples include: biases and outliers in the data,
inappropriate validation and testing, class imbalance in
classification problems.
Biases in Data
Algorithms will reflect the data they are trained on.
• When trained on biased data, they will reflect those biases.
(Ex: the meaning of the word “man” rather than “woman” may be
associated more with “power”, after being trained on a large
corpus of human-generated data)
• It’s important for the modeler to know how the data was collected
to address any biases that might be present.
• Often this cannot be corrected during the analysis stage,
though corrections exist for some special cases.
Data Leakage
Data leakage occurs any time the train/test split is violated.
• The model should not see test data until evaluation.
• Data leakage can lead to overestimates of how well
the model will generalize to unseen examples.
• There is no single solution to data leakage. The modeler
should be careful when developing validation strategies
and splitting the dataset.
• Data leakage is common when working with time
series data and during preprocessing.
• For example, models that make predictions about the
future based on the past should have train and test
sets split in time order; another common leak is using
test data to help preprocess the training data (see the sketch below).
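A minimal sketch of two guardrails against the leaks above, assuming scikit-learn is installed: fitting the scaler inside each cross-validation fold via a pipeline, and splitting time-ordered data so training never sees the future. The synthetic data is illustrative.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import TimeSeriesSplit, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=600, n_features=10, random_state=0)

    # The scaler is fit inside each fold, so held-out statistics never leak in.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, X, y, cv=5)
    print("pipeline CV scores:", scores.round(3))

    # For time series, split in time order so the model never trains on the future.
    ts_scores = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5))
    print("time-ordered CV scores:", ts_scores.round(3))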
Outliers
Outliers are data points far away from other data points.
• They can skew model results by suggesting
performance is much lower than it actually is.
• The modeler should detect outliers and try to
understand why they’re present during the EDA stage.
• There are many outlier detection methods.
• There are many ways to combat outliers.
• Often outliers must be removed from the dataset, or
adjusted to fit the pattern of the remaining data.
• Use models and/or metrics that are less sensitive to
outliers.
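A minimal outlier-detection sketch using the interquartile-range rule, assuming pandas is installed; the values and the conventional 1.5*IQR threshold are illustrative, and this is just one of the many detection methods mentioned above.

    import pandas as pd

    prices = pd.Series([21.5, 18.9, 23.1, 20.4, 22.0, 19.8, 250.0])  # one outlier

    q1, q3 = prices.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)

    print("outliers:")
    print(prices[mask])
    print("cleaned:")
    print(prices[~mask])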
Imbalanced Datasets
Imbalanced classes can lead to difficulties with training
and evaluating models.
• For example, if 99% of labels belong to one class, then
a model that always predicts the majority class will be
99% accurate by default.
• Some modeling techniques to address this include
down-sampling the larger class, or using scoring
metrics other than accuracy to assess models.
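A minimal sketch of the two techniques above (down-sampling the majority class and scoring with something other than plain accuracy), assuming scikit-learn and NumPy are installed; the synthetic 99/1 class split is illustrative.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.utils import resample

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 5))
    y = (rng.random(2000) < 0.01).astype(int)  # roughly 1% positive class
    X[y == 1] += 1.5                           # give the minority class some signal

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Down-sample the majority class in the training set to the minority size.
    maj_X, min_X = X_tr[y_tr == 0], X_tr[y_tr == 1]
    maj_down = resample(maj_X, replace=False, n_samples=len(min_X), random_state=0)
    X_bal = np.vstack([maj_down, min_X])
    y_bal = np.array([0] * len(maj_down) + [1] * len(min_X))

    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    pred = model.predict(X_te)
    print("always-majority accuracy:", round(float((y_te == 0).mean()), 3))
    print("balanced accuracy:", round(float(balanced_accuracy_score(y_te, pred)), 3))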
Learning Objectives Recap
In this lesson, we worked to:
▪ Describe data sources and types.
▪ Recognize situations where more data samples or features are needed.
▪ Explain data wrangling, augmentation, and feature engineering.
▪ Describe different data preprocessing methods.
▪ Explain ways to label data.
▪ Identify challenges when working with data.
Sources for images used in this presentation
https://www.pexels.com/photo/auto-auto-racing-automobile-automotive-355913/
https://www.pexels.com/photo/automobile-automotive-beautiful-car-210013/
https://www.pexels.com/photo/monochrome-photography-of-round-silver-coin-839351/


Editor's Notes

  • #5 Quick review of the workflow.
  • #8 Data can be collected from many sources. How the data is collected and what it looks like depends on its source.
  • #9 Use of social media has led to a large influx of data in recent years. http://www.iflscience.com/technology/how-much-data-does-the-world-generate-every-minute/ Other fun stuff: https://innovate.reduxio.com/the-worlds-data
  • #10 It can be difficult to get a sense of scale when talking about big data. Use this to help visualize the daily scale. http://www.iflscience.com/technology/how-much-data-does-the-world-generate-every-minute/ Other fun stuff: https://innovate.reduxio.com/the-worlds-data
  • #11 We mentioned social media. The major platforms have provided ways to access that data in a programmatic way. More information about APIs (ref here : https://sproutsocial.com/insights/what-is-an-api/) https://www.programmableweb.com/news/top-10-social-apis-facebook-twitter-and-google-plus/analysis/2015/02/17
  • #12 Highlight how humans generate a lot of data because they’re easily able to publish data on the web and that email, chat, etc are now common forms of communication
  • #13 Reiterate that IoT -> Internet of Things. Main points: IoT is a more recent source of data; its scale and impact are rapidly increasing. https://en.wikipedia.org/wiki/Internet_of_things#Enterprise
  • #14 Main point is that consumers data is collected via IoT devices and used to create new experiences, products, etc
  • #15 Main point is that industrial data is collected via IoT devices. This will be used in a wide range of industrial equipment impacting healthcare, transportation, logistics, etc
  • #17 Main point is that data on public websites are available. That doesn’t mean that you have the legal right to use it.
  • #18 Use an API when available. Scraping gives access to data that would otherwise be difficult to acquire.
  • #19 Main point is that a lot of historical data is sourced from legacy documents
  • #20 Main points: a large amount of multimedia data is being created; the scale and datatypes are non-traditional; this has led to new methods to store and analyze this type of data.
  • #22 These are just a few examples. Optional question: Can you think of any others?
  • #24 Use decimals and finer levels of precision to contrast with count data. Note: To motivate / explain datatypes, I will use features from the used car dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/autos/
  • #28 Dates and times are notoriously tricky to work with. Represent dates in an appropriate way (for example, as seasonal categorical data). Be careful with validation when working with time series (for example, avoid training on the future and testing on the past). Make sure to consider daylight saving status and time zones.
  • #29 Text often needs to be transformed into other datatypes for analysis. The details are beyond the scope of today, but some examples are word counts, TF-IDF, and word embeddings.
  • #30 Image datasets can be very large and computationally intensive.
  • #31 Both image and audio data are also often *streamed*. So a model doesn’t access a discrete file, but a flow of data generated real-time
  • #33 Main point is that you need to decide the amount of data and number of features to train your model. A common misconception is that more rows (volume) is always better. It depends on the difficulty of the problem, the algorithm being used, and whether you’re under- or over-fitting the model.
  • #34 Highlight the several ways to determine the number of samples.
  • #35 Idea here is that you want your cross-validation score to be near your training score.
  • #36 Underfitting means that our model is not performing well on the training data—the data it sees. Overfitting means that our model is not performing well on the validation data—data the model does not see.
  • #37 When our model is doing worse than we expect on the training data, one possible solution is to allow the model to look at more features.
  • #38 When our model is doing poorly on our validation data compared to training data we could be overfitting and reducing the number of features might help.
  • #48 We discussed the shape of data previously. Pre-processing takes data of all kinds and makes it regular
  • #49 Data rarely comes to you perfect. Here’s how to approach missing values.
  • #50 Many models perform better when features have a similar scale. Another problem comes when data doesn’t have the distribution models expect.
  • #52 Some data comes to us labeled. It’s common for data to be unlabeled, though.
  • #53 One option is to pay workers to label the data
  • #54 Another important option is to use both labelled and unlabeled data to improve performance on the task
  • #56 The main point is that proper data collection and preprocessing is one of the most important steps in creating an appropriate model.
  • #57 To combat biases the modeler needs to understand the data collection process. This generally can’t be overcome during the analysis phase. There are some solutions for special cases.
  • #58 This is particularly easy to do with time-series data.
  • #60 Common in medical applications.