
1. Data Exploration

In this stage of the project cycle, we try to interpret useful information from the data we have acquired. For this purpose, we need to explore the data and put it in a uniform format for better understanding. This stage deals with validating or verifying the collected data and analysing that:

• The data is according to the specifications decided.
• The data is free from errors.
• The data meets our needs.

This stage is divided into 2 sub-stages:

1) Data Cleaning
2) Data Visualization

Data Cleaning

Data cleaning helps in getting rid of commonly found errors and mistakes in a data set. The 3 most commonly found errors in data are:

1) Outliers: Data points existing out of the range.
2) Missing data: Data points missing at certain places.
3) Erroneous data: Incorrect data points.

Outliers

An outlier is a data point in a dataset that is distant from all other observations.

or

An outlier is something that behaves differently from the combination/collection of the data.
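To make the idea concrete, here is a minimal Python sketch (the marks list is hypothetical sample data) that flags outliers using the common 1.5 × IQR rule:

# Minimal sketch: flagging outliers with the 1.5 * IQR rule.
import numpy as np

marks = np.array([35, 40, 42, 45, 47, 50, 52, 55, 98])  # hypothetical data; 98 looks suspicious

q1, q3 = np.percentile(marks, [25, 75])    # first and third quartiles
iqr = q3 - q1                              # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = marks[(marks < lower) | (marks > upper)]
print("Outliers:", outliers)               # -> Outliers: [98]

The 1.5 × IQR rule is only one possible definition of "distant"; other thresholds or methods (for example, z-scores) are equally valid.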

Missing Data

What do these NaN values indicate? They are the missing values in the data set. We can handle them in two ways:

1. By eliminating the rows with missing values. (Generally not recommended, as it might reduce the data set to some extent, leaving less data to train on.)

2. By using an Imputer to find the best possible substitute to replace missing values.
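A minimal sketch of both options (the Marks column and its values are hypothetical), using pandas and scikit-learn's SimpleImputer:

# Minimal sketch: two ways of handling NaN (missing) values.
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"Marks": [78, None, 64, 90, None, 55]})  # hypothetical data

# Option 1: eliminate the rows that contain missing values.
dropped = df.dropna()          # note: shrinks the data set from 6 rows to 4

# Option 2: use an imputer to substitute missing values (here, the column mean).
imputer = SimpleImputer(strategy="mean")
df["Marks"] = imputer.fit_transform(df[["Marks"]]).ravel()
print(df)                      # NaNs replaced by the mean of the known marks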

Erroneous Data

Erroneous data is test data that falls outside of what is acceptable and should be rejected by the system. For example, in the class list below, the Class entry "57" for NITHILA M is erroneous:

Student Name          Class
RIYA GEORGE           XA
JOSHUA SAM            XA
APARNA BINU           XA
SIDHARDH V R          XA
NITHILA M             57
ATHULYA M S           XA
ANUJA MS              XB
KEERTHI KRISHNANATH   XB
Data Visualization

Why do we need to explore data through visualization?

1) We want to quickly get a sense of the trends, relationships, and patterns contained within the data.
2) It helps us define strategy for which model to use at a later stage.
3) Visual representation is easier to understand and communicate to others.

Please draw all the graphs and write the descriptions from the textbook as they are.
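As a minimal matplotlib sketch (the marks list is hypothetical), two quick plots are often enough to spot trends and outliers:

# Minimal sketch: quick visual exploration of a data set.
import matplotlib.pyplot as plt

marks = [35, 40, 42, 45, 47, 50, 52, 55, 98]   # hypothetical data

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(marks, bins=5)        # histogram shows the distribution of values
ax1.set_title("Histogram of marks")
ax2.boxplot(marks)             # box plot makes the outlier (98) stand out
ax2.set_title("Box plot of marks")
plt.show()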

Modelling

It's the fourth stage of the AI project cycle. In the previous stage, graphical representation made the data understandable for humans, as we could discover trends and patterns from it. But when it comes to machines accessing and analysing data, they need the data in its most basic form of numbers (which is binary: 0s and 1s), and when it comes to discovering patterns and trends in data, the machine goes in for mathematical representations of the same.

The ability to mathematically describe the relationship between parameters is the heart of every AI model. Generally, AI
models can be classified as follows:

Rule Based Approach

In this approach, the rules are defined by the developer. The machine follows the rules or instructions mentioned by the developer and performs its task accordingly. So, it's a static model, i.e. the machine, once trained, does not take into consideration any changes made in the original training dataset.

Thus, machine learning gets introduced as an extension to this: in that case, the machine adapts to changes in the data and rules and follows the updated path, while a rule-based model does only what it has been taught once.
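A minimal sketch of a rule-based model (the grade boundaries are hypothetical rules fixed by the developer): no matter how the underlying data changes, the model keeps applying the same rules.

# Minimal sketch: a static, rule-based grading model.
# The thresholds are rules written by the developer; they never adapt to new data.
def grade(marks: int) -> str:
    if marks >= 90:
        return "A"
    elif marks >= 75:
        return "B"
    elif marks >= 50:
        return "C"
    else:
        return "D"

print(grade(82))   # -> B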

Learning Based Approach

It's a type of AI modelling where the machine learns by itself. Under the learning-based approach, the AI model gets trained on the data fed to it and is then able to design a model which is adaptive to changes in the data. That is, if the model is trained with X type of data and the machine designs the algorithm around it, the model will modify itself according to the changes which occur in the data, so that all the exceptions are handled.

After training, the machine is fed with testing data. Now, the testing data might not contain images similar to the ones on which the model was trained. So, the model relies on the features on which it has been trained and predicts the output accordingly. In this way, the machine learns by itself by adapting to the new data which is flowing in. This is the machine learning approach, which introduces dynamicity into the model.

Generally, learning-based models can be classified as follows:
I. Supervised Learning

In a supervised learning model, the dataset which is fed to the machine is labelled. In other words, we can say that the dataset is known to the person who is training the machine; only then is he/she able to label the data. A label is some information which can be used as a tag for the data. For example, students get grades according to the marks they secure in examinations. These grades are labels which categorize the students according to their marks.

There are two main types of supervised learning models:

a) Classification
In this model, data is classified according to the labels. For example, in the grading system, students are classified on the basis of the grades they obtain with respect to their marks in the examination. This model works on a discrete dataset, which means the data need not be continuous.

b) Regression
This model works on continuous data. For example, if you wish to predict your next salary, you would put in the data of your previous salary, any increments, etc., and would train the model. Here, the data which has been fed to the machine is continuous.
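A minimal scikit-learn sketch of both kinds of supervised model (the tiny labelled data sets are hypothetical): a classifier that learns discrete grade labels from marks, and a regressor that predicts a continuous salary.

# Minimal sketch: supervised learning on labelled data.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: discrete labels (grades) learned from marks.
marks = [[35], [48], [62], [78], [91]]        # hypothetical inputs
grades = ["D", "C", "B", "B", "A"]            # labels supplied by the trainer
clf = DecisionTreeClassifier().fit(marks, grades)
print(clf.predict([[80]]))                    # -> ['B']

# Regression: a continuous output (salary) learned from years of experience.
years = [[1], [2], [3], [4], [5]]             # hypothetical inputs
salary = [30000, 35000, 41000, 46000, 52000]  # continuous labels
reg = LinearRegression().fit(years, salary)
print(reg.predict([[6]]))                     # -> [57300.]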
II. Unsupervised Learning

An unsupervised learning model works on an unlabelled dataset. This means that the data which is fed to the machine is random, and there is a possibility that the person who is training the model does not have any information regarding it. Unsupervised learning models are used to identify relationships, patterns and trends in the data which is fed into them. They help the user understand what the data is about and what its major features, as identified by the machine, are. For example, if you have random data of 1000 dog images and you wish to understand some pattern in it, you would feed this data into an unsupervised learning model and train the machine on it. After training, the machine would come up with the patterns which it was able to identify in the data. The machine might come up with patterns which are already known to the user, like colour, or it might even come up with something very unusual, like the size of the dogs.

There are two main types of unsupervised learning models:

a) Clustering

It refers to the unsupervised learning algorithm which can cluster unknown data according to the patterns or trends identified in it. The patterns observed might be ones already known to the developer, or the algorithm might even come up with some unique patterns of its own.
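A minimal k-means sketch (the 2-D points are hypothetical): the algorithm groups the unlabelled points into clusters entirely on its own.

# Minimal sketch: clustering unlabelled data with k-means.
from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [2, 3],     # one apparent group
          [9, 8], [10, 9], [9, 10]]   # another apparent group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)   # e.g. [0 0 0 1 1 1]: two clusters found without any labels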
b) Dimensionality Reduction

We humans are able to visualize only up to 3 dimensions, but according to a lot of theories and algorithms, there are various entities which exist beyond 3 dimensions. For example, in Natural Language Processing, words are considered to be N-dimensional entities, which means that we cannot visualize them, as they exist beyond our visualization ability. Hence, to make sense of them, we need to reduce their dimensions. Here, a dimensionality reduction algorithm is used.
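A minimal dimensionality reduction sketch using PCA (the 4-dimensional points are hypothetical): it projects the data down to 2 dimensions so it can be visualized.

# Minimal sketch: reducing 4-D data to 2-D with PCA.
from sklearn.decomposition import PCA

data = [[2.5, 2.4, 0.5, 1.1],   # hypothetical 4-dimensional points
        [0.5, 0.7, 1.9, 2.2],
        [2.2, 2.9, 0.4, 1.0],
        [1.9, 2.2, 0.6, 0.9],
        [0.3, 0.5, 2.1, 2.4]]

pca = PCA(n_components=2)        # keep the 2 most informative directions
reduced = pca.fit_transform(data)
print(reduced.shape)             # -> (5, 2): now plottable in 2-D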

III. Reinforcement Learning


It is a type of machine learning technique that enables an agent (model) to learn in an interactive environment by trial and error, using feedback from its own actions and experiences. Though both supervised and reinforcement learning use a mapping between input and output, unlike supervised learning, where the feedback provided to the agent (model) is the correct set of actions for performing a task, reinforcement learning uses rewards and punishments as signals for positive and negative behaviour. Reinforcement learning is all about making decisions sequentially.
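A minimal sketch of the trial-and-error idea (the 1-D corridor environment and the reward values are hypothetical): the agent tries actions, receives rewards or punishments, and gradually learns the best action for each state. This is tabular Q-learning, one simple reinforcement learning algorithm.

# Minimal sketch: tabular Q-learning in a tiny 1-D corridor.
# States 0..4; reaching state 4 earns a reward, every other step is punished.
import random

n_states, actions = 5, [-1, +1]            # the agent can move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2      # learning rate, discount, exploration

for episode in range(200):
    s = 0
    while s != 4:
        # Explore sometimes; otherwise take the best-known action.
        a = (random.choice(actions) if random.random() < epsilon
             else max(actions, key=lambda act: Q[(s, act)]))
        s_next = min(max(s + a, 0), n_states - 1)
        reward = 10 if s_next == 4 else -1           # reward / punishment signal
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s_next

# The learned policy: always move right, towards the rewarding state.
print([max(actions, key=lambda act: Q[(s, act)]) for s in range(4)])   # -> [1, 1, 1, 1]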

5. Evaluation
Evaluation is the process of understanding the reliability of an AI model, based on its outputs, by feeding the test data set into the model and comparing the outputs with the actual answers. That is, once a model has been made and trained, it needs to go through proper testing so that one can calculate the efficiency and performance of the model. Hence, the model is tested with the help of the testing data, which was separated out of the acquired data set at the Data Acquisition stage.

Accuracy
Accuracy is defined as the percentage of correct predictions out of all the observations.

Precision
Precision is defined as the percentage of true positive cases out of all the cases where the prediction is positive, i.e. true positives plus false positives.

Recall
Recall is defined as the fraction of positive cases that are correctly identified, i.e. true positives out of true positives plus false negatives.

F1 score
The F1 score is a number between 0 and 1 and is the harmonic mean of precision and recall.
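A minimal scikit-learn sketch computing all four metrics (the true and predicted labels are hypothetical):

# Minimal sketch: evaluating predictions against the actual answers.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual answers (hypothetical)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model's predictions (hypothetical)

print("Accuracy :", accuracy_score(y_true, y_pred))    # correct / all = 0.8
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP) = 0.8
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN) = 0.8
print("F1 score :", f1_score(y_true, y_pred))          # harmonic mean = 0.8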
