Data Exploration
In this stage of the project cycle, we try to interpret useful information from the data we have acquired. For this purpose, we need to explore the data and present it uniformly for a better understanding. This stage deals with validating and verifying the collected data, and involves two activities:
1) Data Cleaning
2) Data Visualization
Data Cleaning
Data cleaning helps in getting rid of commonly found errors and mistakes in a data set. Two of the most commonly found problems in data are outliers and missing values.
Outliers
An outlier is a data point in a dataset that is distant from all other observations.
Missing Data
Missing values can be handled in two ways:
1. By simply removing the rows or columns that contain missing values.
2. By using an imputer to find the best possible substitute to replace the missing values.
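A minimal sketch of both options, assuming pandas and scikit-learn are available; the column name and the values are invented purely for illustration:

import pandas as pd
from sklearn.impute import SimpleImputer

# Invented data with two missing values
df = pd.DataFrame({"marks": [78, None, 92, 85, None, 60]})

# Option 1: simply drop the rows that contain missing values
dropped = df.dropna()

# Option 2: use an imputer to replace missing values, here with the column mean
imputer = SimpleImputer(strategy="mean")
df["marks"] = imputer.fit_transform(df[["marks"]]).ravel()

print(dropped)
print(df)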
Data Visualization
Visualizing the data is important because:
1) We want to quickly get a sense of the trends, relationships and patterns contained within the data.
2) It helps us define a strategy for which model to use at a later stage.
3) A visual representation is easier to understand and to communicate to others.
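For instance, a quick histogram already shows how the values are spread out. A minimal sketch using matplotlib, with marks data invented purely for illustration:

import matplotlib.pyplot as plt

# Invented marks of ten students
marks = [35, 48, 52, 61, 67, 72, 75, 81, 88, 94]

# A histogram gives a quick sense of how the values are distributed
plt.hist(marks, bins=5)
plt.xlabel("Marks")
plt.ylabel("Number of students")
plt.title("Distribution of marks")
plt.show()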
Modelling
It’s the fourth stage of the AI project cycle. In the previous stage, graphical representation made the data understandable for humans, as we could discover trends and patterns from it. But when it comes to machines accessing and analysing data, they need the data in its most basic form of numbers (which is binary: 0s and 1s), and when it comes to discovering patterns and trends in data, the machine relies on mathematical representations of the same.
The ability to mathematically describe the relationship between parameters is the heart of every AI model. Generally, AI
models can be classified as follows:
Rule Based Approach
In this approach, the rules are defined by the developer. The machine follows the rules or instructions mentioned by the developer and performs its task accordingly. It is therefore a static model, i.e. once trained, the machine does not take into consideration any changes made in the original training dataset.
Machine learning was introduced as an extension to this: a learning based model adapts to changes in the data and rules and follows the updated path, while a rule-based model keeps doing only what it was taught once.
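A minimal sketch of a rule-based model in Python; the grade boundaries are assumptions for illustration, written once by the developer and never changing on their own:

# Rules fixed by the developer; the machine only follows them
def grade(marks):
    if marks >= 90:
        return "A"
    elif marks >= 75:
        return "B"
    elif marks >= 50:
        return "C"
    else:
        return "D"

print(grade(82))  # "B" -- the output never adapts to new data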
Learning Based Approach
It’s a type of AI modelling where the machine learns by itself. Under the learning based approach, the AI model gets trained on the data fed to it and is then able to design a model which is adaptive to changes in the data. That is, if the model is trained with X type of data and the machine designs the algorithm around it, the model modifies itself according to the changes which occur in the data, so that exceptions are handled.
After training, the machine is fed with testing data. The testing data might not contain examples similar to the ones on which the model was trained. So, the model relies on the features on which it has been trained and predicts the output accordingly. In this way, the machine learns by itself by adapting to the new data which flows in. This is the machine learning approach, which introduces dynamicity into the model.
Generally, learning based models can be classified as follows:
I. Supervised Learning
In a supervised learning model, the dataset fed to the machine is labelled. In other words, the dataset is known to the person who is training the machine; only then is he/she able to label the data. A label is some information which can be used as a tag for the data. For example, students get grades according to the marks they secure in examinations. These grades are labels which categorise the students according to their marks.
a) Classification
In this model, data is classified according to the labels. For example, in the grading system, students are classified on the basis of the grades they obtain with respect to their marks in the examination. This model works on a discrete dataset, which means the data need not be continuous.
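A minimal sketch of classification with scikit-learn, reusing the marks-to-grade idea; the training data and the choice of a decision tree are assumptions for illustration:

from sklearn.tree import DecisionTreeClassifier

# Labelled (marks -> grade) training data, invented for illustration
marks = [[35], [48], [62], [71], [83], [95]]
grades = ["D", "D", "C", "C", "B", "A"]   # discrete labels

model = DecisionTreeClassifier()
model.fit(marks, grades)

print(model.predict([[78]]))  # predicts a grade for unseen marks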
b) Regression
This model works on continuous data. For example, if you wish to predict your next salary, you would feed in the data of your previous salaries, any increments, etc., and train the model. Here, the data fed to the machine is continuous.
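A minimal sketch of regression with scikit-learn; the salary figures are invented purely for illustration:

from sklearn.linear_model import LinearRegression

# Continuous data: years of experience -> salary (invented figures)
years = [[1], [2], [3], [4], [5]]
salary = [300000, 340000, 385000, 430000, 470000]

model = LinearRegression()
model.fit(years, salary)

print(model.predict([[6]]))  # estimate of the next salary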
II. Unsupervised Learning
An unsupervised learning model works on an unlabelled dataset. This means that the data fed to the machine is random, and it is possible that the person training the model has no information about it. Unsupervised learning models are used to identify relationships, patterns and trends in the data fed into them. They help the user understand what the data is about and what the major features identified by the machine in it are.
For example, suppose you have random data of 1000 dog images and you wish to find some pattern in it. You would feed this data into an unsupervised learning model and train the machine on it. After training, the machine would come up with the patterns it was able to identify in the data. The machine might come up with patterns which are already known to the user, like colour, or it might even come up with something very unusual, like the size of the dogs. There are two main types of unsupervised learning models:
a) Clustering
It refers to the unsupervised learning algorithm which can cluster the unknown data according to the patterns or trends identified in it. The patterns observed might be ones already known to the developer, or the algorithm might come up with some unique patterns of its own.
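A minimal sketch of clustering with scikit-learn's KMeans; the two features (say, height and weight of dogs) and the number of clusters are assumptions for illustration:

from sklearn.cluster import KMeans

# Unlabelled data: [height, weight] of six dogs, invented for illustration
dogs = [[20, 5], [22, 6], [55, 25], [60, 30], [58, 27], [21, 5]]

model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(dogs)

print(labels)  # e.g. [0 0 1 1 1 0] -- small dogs and large dogs grouped apart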
b) Dimensionality Reduction
We humans are able to visualize only up to 3 dimensions, but according to many theories and algorithms there are various entities which exist beyond 3 dimensions. For example, in Natural Language Processing, words are considered to be N-dimensional entities, which means that we cannot visualize them as they exist beyond our visualization ability. Hence, to make sense of them, we need to reduce their dimensions. This is where a dimensionality reduction algorithm is used.
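A minimal sketch of dimensionality reduction with PCA from scikit-learn; the 5-dimensional points are random numbers, purely for illustration:

import numpy as np
from sklearn.decomposition import PCA

points = np.random.rand(100, 5)   # 100 samples in 5 dimensions

pca = PCA(n_components=2)         # reduce to 2 dimensions
reduced = pca.fit_transform(points)

print(reduced.shape)  # (100, 2) -- now easy to show on a 2-D plot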
Evaluation
Evaluation is the process of understanding the reliability of any AI model, based on its outputs, by feeding the testing dataset into the model and comparing its predictions with the actual answers. That is, once a model has been made and trained, it needs to go through proper testing so that one can calculate the efficiency and performance of the model. Hence, the model is tested with the help of the testing data, which was separated out of the acquired dataset at the Data Acquisition stage.
Accuracy
Accuracy is defined as the percentage of correct predictions out of all the observations.
Precision
Precision is defined as the percentage of true positive cases versus all the cases where the prediction is true.
Recall
Recall is defined as the fraction of positive cases that are correctly identified.
F1 score
The F1 score is a number between 0 and 1 and is the harmonic mean of precision and recall.
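A minimal sketch that computes all four metrics from the counts of a confusion matrix; the counts (TP, TN, FP, FN) are assumed values, purely for illustration:

# Assumed counts: true positives, true negatives, false positives, false negatives
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)                 # correct predictions / all observations
precision = TP / (TP + FP)                                 # true positives / all predicted positives
recall = TP / (TP + FN)                                    # true positives / all actual positives
f1_score = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(accuracy, precision, recall, f1_score)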